“Prediction is very difficult, especially if it’s about the future.”
― Niels Bohr, physicist
Holdout samples are a key component of building a “useful” forecasting model. Set aside data at least equal in length to your forecast horizon (the “holdout sample”). Build your models on the remaining data (the “modeling sample”). Then compare the candidate models’ forecast performance over the holdout sample.
At a minimum, a single holdout sample should be used.
But to get a better sense of a model’s future performance, consider using multiple holdout samples.
This guards against basing your model on a holdout sample that is unrepresentative of the overall characteristics of the time series.
One way to achieve this is to use “rolling” holdout samples.
A rolling analysis of a time series is generally used to test a model’s stability. That is, are a model’s parameters stable across time or do they change, especially in a systematic way?
This is important for a forecasting model. We don’t want a forecasting model whose parameters are changing during the forecast horizon in an unexpected (i.e. unmodeled) manner.
Suppose our forecast horizon is 6 months.
Under a single holdout sample, we would set aside the last 6 months of data as the holdout sample. Then, using the remaining data as the modeling sample, we would estimate the models, forecast over the single holdout sample, and compare the models’ performance.
This will help narrow down the pool of candidate models.
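As a minimal sketch of the single-holdout comparison, the Python below splits a made-up monthly series, fits two placeholder "models," and ranks them on the holdout sample. The function names, the data, and the use of mean absolute error as the yardstick are all illustrative assumptions, not part of any particular library or of the method described here.

```python
def split_holdout(series, horizon):
    """Split a series into (modeling sample, holdout sample of `horizon` points)."""
    if not 0 < horizon < len(series):
        raise ValueError("horizon must be positive and shorter than the series")
    return series[:-horizon], series[-horizon:]


def naive_forecast(history, horizon):
    """Placeholder model: repeat the last observed value."""
    return [history[-1]] * horizon


def mean_forecast(history, horizon):
    """Placeholder model: repeat the mean of the modeling sample."""
    return [sum(history) / len(history)] * horizon


def mean_abs_error(actual, forecast):
    """Average absolute forecast error over the holdout sample."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


# Made-up monthly data: 24 observations, 6-month forecast horizon.
series = [100 + 2 * t for t in range(24)]
modeling, holdout = split_holdout(series, horizon=6)

# Fit each hypothetical candidate on the modeling sample only,
# then score its forecasts against the holdout sample.
candidates = {"naive": naive_forecast, "mean": mean_forecast}
scores = {
    name: mean_abs_error(holdout, model(modeling, len(holdout)))
    for name, model in candidates.items()
}
best = min(scores, key=scores.get)
print(scores, "-> best:", best)
```

In practice the placeholder functions would be replaced by the real candidate models; the key point is that the holdout months never enter the estimation step.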
Rolling holdout samples
But under a rolling holdout approach, also called “time series cross-validation,” we would set aside a longer sample of data, say, the last 12 months. Then:
Step 1: Estimate a model on the data preceding the 12-month period and forecast over the first 6 months of that period (“roll 1”);
Step 2: Then add 1 month to the tail end of the estimation sample, recalibrate the model, and forecast over the subsequent 6 months (“roll 2”);
Step 3: Then add another month to the estimation sample, recalibrate, and forecast over the subsequent 6 months (“roll 3”);
Step 4: Repeat until there are no more 6-month periods (“rolls”) remaining in the 12-month period.
So, in this example, we would have recalibrated our model 7 times (each with a modeling sample that is one additional month longer than the previous). And we would have made 7 forecasts over the rolling holdout periods.
The last “roll,” it turns out, covers the same 6-month period we would have used in the single holdout sample case. So, we generate the stats for a standard single holdout sample in the course of the rolling holdout approach.
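The roll arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the 36-month series length, and the index convention (fit on `series[:train_end]`, forecast `series[train_end:test_end]`) are hypothetical choices, not a prescribed implementation.

```python
def rolling_holdout_splits(n_obs, horizon, holdout_span):
    """Yield (roll, train_end, test_end) index triples for rolling holdouts.

    The estimation sample is series[:train_end] and the forecast window is
    series[train_end:test_end]. The estimation sample grows by one
    observation per roll (an expanding window).
    """
    if not 0 < horizon <= holdout_span < n_obs:
        raise ValueError("need 0 < horizon <= holdout_span < n_obs")
    first_train_end = n_obs - holdout_span
    n_rolls = holdout_span - horizon + 1
    for roll in range(1, n_rolls + 1):
        train_end = first_train_end + roll - 1
        yield roll, train_end, train_end + horizon


# Assumed example: 36 months of data, 6-month horizon, last 12 months set aside.
splits = list(rolling_holdout_splits(n_obs=36, horizon=6, holdout_span=12))
for roll, train_end, test_end in splits:
    print(f"roll {roll}: fit on months 1-{train_end}, "
          f"forecast months {train_end + 1}-{test_end}")
```

With these inputs the generator produces 7 rolls, and the final roll forecasts months 31 to 36, which is the window a single 6-month holdout would have used.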
If we are examining multiple candidate models, this process can generate a lot of data. Below is an example of the rolling forecasts for one model.
Summary roll statistics
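For each roll forecast, we can calculate the mean absolute percentage error (MAPE), a measure of accuracy, and the mean percentage error (MPE), a measure of bias, and observe how they change across the rolling forecasts.
Are the MAPE and MPE constant? Do they fluctuate with no apparent trend? Or do they exhibit some systematic trend?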
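For reference, here is a small sketch of the two statistics for a single roll. The sign convention for MPE (actual minus forecast) is an assumption; some sources flip it, so check which one your tools use. The numbers are made up for illustration.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actuals must be non-zero)."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def mpe(actual, forecast):
    """Mean percentage error, in percent.

    Sign convention assumed here: positive means the forecasts were too low
    on average, negative means they were too high.
    """
    return 100 * sum((a - f) / a for a, f in zip(actual, forecast)) / len(actual)


# Made-up actuals and forecasts for one 6-month roll.
actual = [100, 110, 120, 130, 140, 150]
forecast = [90, 121, 120, 117, 154, 150]

roll_mape = mape(actual, forecast)
roll_mpe = mpe(actual, forecast)
print(f"MAPE = {roll_mape:.2f}%, MPE = {roll_mpe:.2f}%")
```

In this made-up roll the over- and under-forecasts cancel, so the MPE is essentially zero while the MAPE is not. That is why the two are read together: one tracks the size of the misses, the other their direction.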
Doing this for every candidate model we are testing generates charts like this, which can quickly show any areas of concern:
In this example, candidate models 18 and 15 may be worth further inspection since their MAPEs are much higher than the rest in a recent roll period (roll 6).
What else makes a model useful?
So, with respect to the guidelines for whittling down a pool of candidate models we listed in an earlier article, we can add the following from a rolling holdout analysis:
Stability – The model’s parameters should retain their statistical significance and not vary too much across the rolling periods, and the model’s residuals should remain “white noise” across the rolls.
Consistency of Performance – The model’s forecast accuracy and bias should not exhibit any strong trends, especially trends in the “wrong” direction (i.e. getting progressively worse) as the rolls approach the most recent time period.
Strong Rolling Holdout Sample Performance – Averaged across all the rolls, the model’s forecast accuracy should be high and its bias low. That is, the average MAPE should be low and the average MPE should be close to zero.
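The last two criteria lend themselves to a simple numerical check. The sketch below averages the per-roll MAPE and MPE for each model and fits a least-squares slope to the MAPE across rolls as a rough trend indicator. The model names and figures are invented, and using a linear slope as the trend measure is an assumption made for illustration, not a rule from the guidelines above.

```python
from statistics import mean


def trend_slope(values):
    """Least-squares slope of `values` against roll number 1..n."""
    rolls = range(1, len(values) + 1)
    x_bar, y_bar = mean(rolls), mean(values)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(rolls, values))
    den = sum((x - x_bar) ** 2 for x in rolls)
    return num / den


def summarize_rolls(mapes, mpes):
    """Average accuracy and bias across rolls, plus the MAPE trend per roll."""
    return {
        "avg_mape": mean(mapes),
        "avg_mpe": mean(mpes),
        "mape_trend": trend_slope(mapes),
    }


# Made-up per-roll statistics (in percent) for two hypothetical models, 7 rolls each.
roll_stats = {
    "model_a": {"mape": [4.0, 4.2, 3.9, 4.1, 4.0, 4.3, 4.1],
                "mpe": [0.5, -0.4, 0.2, -0.1, 0.3, -0.2, 0.1]},
    "model_b": {"mape": [3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0],
                "mpe": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]},
}

summary = {name: summarize_rolls(s["mape"], s["mpe"]) for name, s in roll_stats.items()}
for name, stats in summary.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

Here the hypothetical model_b starts out more accurate but its MAPE climbs by about half a point per roll and its bias drifts upward, which is the kind of "wrong direction" trend the consistency criterion is meant to catch.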
Benefits of Rolling
The primary benefit of a rolling analysis is that we get to see how a model forecasts over multiple time spans equal in length to our forecast horizon, instead of relying on its performance in just one holdout sample.
A rolling analysis also addresses the issue of a short holdout sample (e.g. short forecast horizon) possibly not being representative of the general character of the time series.
In addition, a rolling analysis can be used as a check for the “best” model chosen using a single holdout sample. That is, would you pick the same model using the rolling holdout approach? If not, why?
In sum, a model that is persistently better at holdout sample forecasting over a longer time frame is likely to be more robust.
So, let ‘em roll!