Common Time Series Metrics Using Darts in Python


Introduction

Time series forecasting has many applications across various industries. In my current role, we use it to forecast service case volume for existing customers to improve resource allocation and capacity planning.

When evaluating your models, there are many metrics to choose from, but which are commonly used in practice?

This article will show you four forecasting metrics used regularly to evaluate model performance in Python using a library called Darts. One of those measures is a great way to compare performance between a simple model and a more complex model.

Files

In this example, we will use data from the City of Chicago. You can find a link to my files on my GitHub repository: GitHub SolisAnalytics.

Forecasting Process

Before evaluating the metrics, let us quickly review the workflow using a primary type category from the dataset.

We will select one primary type category based on the total number of cases. The top three groupings are theft, battery, and criminal damage. We will choose battery for today’s purposes.
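A minimal sketch of that aggregation, assuming the raw Chicago crimes extract has been loaded into a pandas DataFrame with a "Primary Type" column (the file and column names are assumptions):

import pandas as pd

# File and column names are assumptions for illustration.
df = pd.read_csv("chicago_crimes.csv")

case_counts = (
    df.groupby("Primary Type")
    .size()
    .rename("number_of_cases")
    .sort_values(ascending=False)
    .reset_index()
)
print(case_counts.head(3))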

       primary_type  number_of_cases
              THEFT           588538
            BATTERY           480016
    CRIMINAL DAMAGE           283213

We must convert the data into a Darts TimeSeries object to perform the rest of the forecasting process. That can quickly be done with a small helper function that handles the conversion and takes a few extra arguments. We will work at a monthly frequency and fill in missing dates and values with something simple.
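A sketch of such a helper, assuming the battery rows have already been aggregated into a monthly DataFrame called battery_monthly with a month column and a number_of_cases column (both names are assumptions):

from darts import TimeSeries
from darts.utils.missing_values import fill_missing_values

def to_darts_series(frame, time_col, value_col, freq="MS", fill=0.0):
    # Build a TimeSeries at the given frequency, inserting any missing dates.
    series = TimeSeries.from_dataframe(
        frame,
        time_col=time_col,
        value_cols=value_col,
        fill_missing_dates=True,
        freq=freq,
    )
    # Fill missing values with a simple constant.
    return fill_missing_values(series, fill=fill)

battery_series = to_darts_series(battery_monthly, "month", "number_of_cases")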

Now let us plot the time series and determine if there are signals. We will look at seasonality using an ACF plot in darts.
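A sketch of both plots, reusing the battery_series object from the step above:

import matplotlib.pyplot as plt
from darts.utils.statistics import plot_acf

# Line plot of the monthly series.
battery_series.plot(label="BATTERY cases per month")
plt.show()

# ACF plot; m=12 marks the yearly lag on a monthly series.
plot_acf(battery_series, m=12, max_lag=24)
plt.show()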

The time series shows a downward trend that becomes clearer starting in 2020. There also seems to be yearly seasonality present. Let us take a look at the ACF plot.

The ACF plot confirms the presence of yearly seasonality with a spike at lag 12. Models with a seasonality component can use that to improve their predictions.

Below is a list of models we will evaluate using some standard forecasting metrics (a sketch of how they are instantiated in Darts follows the list):

  • Exponential Smoothing (Holt-Winters)

  • Auto ARIMA

  • Theta

  • Prophet

  • Baseline Models:

    • Naive Mean

    • Naive Drift

    • Naive Seasonal
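Here is how these models can be instantiated in Darts. Default parameters are assumed throughout except for the seasonal baseline; tune them for your own data.

from darts.models import (
    ExponentialSmoothing,
    AutoARIMA,
    Theta,
    Prophet,
    NaiveMean,
    NaiveDrift,
    NaiveSeasonal,
)

models = {
    "Exponential Smoothing": ExponentialSmoothing(),
    "Auto ARIMA": AutoARIMA(),
    "Theta": Theta(),
    "Prophet": Prophet(),
    "Naive Mean": NaiveMean(),
    "Naive Drift": NaiveDrift(),
    "Naive Seasonal": NaiveSeasonal(K=12),  # K=12 for yearly seasonality on monthly data
}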

Note: Backtesting is a process that takes a dataset, trains a model on the data available up to a point in time, then tests it on the data that follows. The process is repeated over successive windows to estimate how the model would have performed under conditions as close to real life as possible. It is the preferred way to evaluate model performance across many time windows. I will go over this in a future post.

Forecasting Metrics

We will evaluate model performance by testing against 2022 data. I will review four standard metrics frequently encountered when performing time-series forecasting.

MAPE: Takes the mean of the absolute differences between the actual and predicted values, divided by the actuals, expressed as a percentage. It is a scale-independent metric that is easy to interpret and can be used to compare models across time series with varying scales.

MAE: Takes the mean of the absolute differences between the actual values (y) and the predicted values (y hat). It is a scale-dependent metric that is easy to understand but does not heavily penalize outliers.

RMSE: Squares the residuals, averages them, and takes the square root, which keeps the result in the same units as the data while putting more weight on large errors. This makes it highly sensitive to outliers.

MASE: Takes the MAE of the forecast and divides it by the MAE of a naive benchmark forecast on the training data. Values over 1 mean the simple benchmark performs better; values under 1 mean the model beats it. It is a scale-free metric.
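To make those definitions concrete, here are plain NumPy versions of the four metrics. This is a sketch for intuition only; the Darts implementations are what we use below.

import numpy as np

def mape(y, y_hat):
    # Mean absolute percentage error, in percent (assumes no zero actuals).
    return np.mean(np.abs((y - y_hat) / y)) * 100

def mae(y, y_hat):
    # Mean absolute error, in the units of the data.
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    # Root mean squared error, in the units of the data.
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mase(y, y_hat, y_train, m=1):
    # MAE of the forecast scaled by the in-sample MAE of a naive
    # forecast that repeats the value from m steps earlier.
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return mae(y, y_hat) / naive_mae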

Now that we have gone over the standard forecasting metrics, we need a function that splits the dataset into training and testing data and then computes the metrics above. The code below gets the job done.
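A minimal sketch of such a function, using Darts’ built-in metric functions and the battery_series object from earlier; the 2022 split date and m=12 for MASE are assumptions based on the setup above:

import pandas as pd
from darts.metrics import mape, mae, rmse, mase

def evaluate_model(model, series, split_point=pd.Timestamp("2022-01-01")):
    # Hold out everything from the split point on, fit on the earlier data,
    # forecast the held-out horizon, and score it.
    train, test = series.split_before(split_point)
    model.fit(train)
    pred = model.predict(len(test))
    return {
        "MAPE": mape(test, pred),
        "MAE": mae(test, pred),
        "RMSE": rmse(test, pred),
        "MASE": mase(test, pred, insample=train, m=12),
    }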

We can use the code to return those forecasting metrics by model name and then compare the results. An easy way to accomplish this is by storing the results in a DataFrame.
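One way to collect the scores, reusing the models dictionary and the evaluate_model sketch from above:

results = pd.DataFrame(
    {name: evaluate_model(model, battery_series) for name, model in models.items()}
).T  # one row per model, one column per metric

print(results.sort_values("MASE"))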

The output below shows the performance of all the models with the four forecasting metrics.

The results show the Theta model performed the best across all evaluation metrics when tested on 2022 data. It had the lowest MAE at about 105 cases, the lowest RMSE at about 134, the lowest MAPE at about 3%, and a MASE of about 0.34, meaning roughly 66% less error than the naive benchmark (1 - 0.34).

Conclusion

This post reviewed four standard forecasting metrics using the Python library Darts: MAE, RMSE, MAPE, and MASE. Each of them has its place when evaluating forecasting models. I recommend including a few of them to determine how your models perform. A better suggestion is to perform backtesting with multiple models and metrics to get a more comprehensive view of how your models perform over multiple time windows, which I will detail in a future post.
