Time Series Cross-Validation

This package is a scikit-learn extension.

Motivation

Cross-validation may be one of the most critical concepts in machine learning. Although the well-known K-Fold, or its base component, the train-test split, serves well in i.i.d. cases, it can be problematic for time series, which exhibit temporal dependence. The following is a simple example where the train-test split fails.

Let us consider the following time series:

\(t\)    | 1 | 2 | 3 | 4 | 5
\(x_t\)  | 0 | 1 | 0 | 1 | 0
\(y_t\)  | 1 | 0 | 1 | 0 | 1

If we split the sample into the training set \(\{(x_t, y_t), 1 \le t \le 4\}\) and the test set \(\{(x_5, y_5)\}\), we may infer from the training set that

\[y_t = 1 - x_t,\]

which then gets confirmed on the test set.

This seemingly innocuous example relies on a questionable assumption: that \(\{(x_5, y_5)\}\) constitutes a legitimate test set. For that to hold, it should be independent of the training set; otherwise, data leakage occurs.

To see how detrimental it can be, let us suppose that the above time series is a subset of a longer time series:

\(t\)    | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
\(x_t\)  | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1
\(y_t\)  | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1

Although the rule \(y_t = 1 - x_t\) holds for the first 5 data points, the next 4 data points tell a different story. In fact, the above time series is generated by the ground truth

\[y_t = x_{t-1}.\]
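For concreteness, the following plain-NumPy check (no package functionality assumed) verifies both claims on the 9-point series:

```python
import numpy as np

x = np.array([0, 1, 0, 1, 0, 0, 0, 1, 1])
y = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1])

print((y[:5] == 1 - x[:5]).all())  # True: y_t = 1 - x_t holds for t = 1..5
print((y == 1 - x).all())          # False: the rule breaks on t = 6..9
print((y[1:] == x[:-1]).all())     # True: y_t = x_{t-1} for t = 2..9
```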

Now let us analyze why traditional cross-validation fails in this example. The ground truth indicates that \(y_5\) is determined by \(x_4\); in other words, the test set can be inferred from the training set. In addition, \(x_4\) and \(x_5\) may themselves be related in unstated ways.

This kind of collusion defeats the original purpose of the train-test split: if the witness has a conflict of interest with the suspect, how can the testimony be objective? Ignoring temporal dependence can lead to false conclusions.

To mitigate this conflict of interest, that is, the temporal dependence, one simple and effective solution is to introduce a gap between the training set and the test set. For instance, in the above example, we can use \(\{(x_6, y_6)\}\) as the test set instead of \(\{(x_5, y_5)\}\), which then serves as the gap and is used for neither training nor testing.

The gap is like a Chinese wall that blocks data leakage. The thicker the wall, the weaker the train-test dependence, given that temporal dependence decays with distance.

Of course, the gap does not automatically lead you to the ground truth. It nonetheless helps you avoid some pitfalls and thereby brings you closer to the truth.
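To make the gap idea concrete, here is a plain-NumPy sketch of the split described above, using 0-based indices so that \(t = 1\) corresponds to index 0:

```python
import numpy as np

x = np.array([0, 1, 0, 1, 0, 0, 0, 1, 1])
y = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1])

x_train, y_train = x[:4], y[:4]  # t = 1..4: training set
#                  x[4],  y[4]     t = 5: the gap, used by neither side
x_test, y_test = x[5:6], y[5:6]  # t = 6: test set

# The spurious rule y = 1 - x learned from the training set is now refuted
# by the test set instead of being falsely confirmed:
print((1 - x_test == y_test).all())  # False
```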

Components

This package provides tools that help you introduce gaps between the training and test sets, both in a single train-test split and within each fold of cross-validation. In particular, it implements the following 3 classes and 1 function:

GapLeavePOut
GapKFold
GapRollForward
gap_train_test_split

The three classes can all be passed, as the cv argument, to scikit-learn functions such as cross_validate, cross_val_score, and cross_val_predict, just like the native cross-validator classes.
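For example, a gapped K-Fold cross-validation might look like the following minimal sketch. It assumes GapKFold accepts gap_before and gap_after arguments for the number of samples to omit on each side of every test fold; treat the exact parameter names as assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from tscv import GapKFold

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# Omit 2 samples on each side of every test fold to weaken
# the dependence between the training and test sets.
cv = GapKFold(n_splits=5, gap_before=2, gap_after=2)
print(cross_val_score(LogisticRegression(), X, y, cv=cv))
```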

The one function complements the train_test_split function in scikit-learn.
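Below is a minimal sketch of a single gapped split, assuming a gap_train_test_split(X, y, gap_size=..., test_size=...) signature that mirrors scikit-learn's train_test_split; the keyword names are assumptions here.

```python
import numpy as np
from tscv import gap_train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out the last 20% for testing and leave 2 samples
# in between as the gap.
X_train, X_test, y_train, y_test = gap_train_test_split(
    X, y, gap_size=2, test_size=0.2
)
print(len(X_train), len(X_test))  # expected: 6 2
```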

These tools help you handle temporal dependence without any extra coding overhead.