* Author Note: As this article is academic in nature, I’ve inserted the initials of the researchers who wrote the papers I am referencing in this post. At the end of this post, you will find the full information about the papers and researchers referenced.
In this post we’ll look at data leakage. It is considered “one of the top ten data mining mistakes” ([NEM]) and can render a predictive model useless, so the importance of being aware of it cannot be overstated.
In the words of [KRP], whom I’ll quote often in this post: “Leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from.” A trivial example is a model that uses the very target it needs to predict as an input to the prediction.
Leakage happens unintentionally in the process of data collection and preparation. It leads the modeler, at the very least, to a suboptimal model and, in the worst case, to a totally useless one.
Leakage should be suspected when the performance of the model is significantly better than expected (based on past experience or public results). For example, a model that always accurately predicts tomorrow's stock prices is highly implausible.
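One rough way to act on this suspicion is to check whether any single feature separates the target almost perfectly on its own. The sketch below is an illustration, not a method taken from the referenced papers: the feature names, the toy data, and the 0.95 threshold are all assumptions.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def suspicious_features(columns, labels, threshold=0.95):
    """Return features whose single-feature AUC, in either direction, reaches the threshold."""
    flagged = {}
    for name, values in columns.items():
        a = auc(values, labels)
        strength = max(a, 1.0 - a)  # a strongly inverted feature is just as suspicious
        if strength >= threshold:
            flagged[name] = strength
    return flagged


# Hypothetical toy data: 1 = converted, 0 = did not convert.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "age": [25, 40, 31, 38, 29, 45],
    "days_since_signup": [0, 0, 0, 12, 3, 30],  # only non-zero after conversion
}
flagged = suspicious_features(columns, labels)
print(flagged)
```

A flagged feature is not proof of leakage, only a prompt to ask how and when that value is produced.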
Once detected, leakage might be non-trivial to eliminate, and sometimes, while trying to eliminate its source, one introduces new leakage. Indeed, [KP] describes how, in the task of recognizing a heavy spender, removing the “total purchase in jewelry” attribute caused other information to end up being predictive.
[KRP] distinguishes between two sources of leakage: in features and in training examples.
Leakage in features means that the model is exposed to information (in the form of a vector of features) that, in reality, it would not have access to when making a prediction. An example is using the account number of a candidate customer for the task of predicting whether she converts: having an account number means that the candidate has already converted.
Leakage in training examples is the presence in the training set of information (be it in the form of features or targets) that cannot legitimately be used to predict the targets of the instances in the dataset kept for evaluating the model. This happens, for example, if the two datasets overlap.
In industry, a significant source of leakage is the way databases are usually updated, namely without timestamps. As a result, features engineered from the data may come from a later time than that of the target.
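The account number example can be sketched in a few lines. The records, the field names, and the `AVAILABLE_AT_PREDICTION` whitelist below are hypothetical; the point is only that the mere presence of the field reproduces the target, and that one defensive habit is to list explicitly which features exist at prediction time.

```python
# Hypothetical candidate records: the account number exists only after conversion.
candidates = [
    {"visits": 3, "account_number": None, "converted": 0},
    {"visits": 7, "account_number": "A-1001", "converted": 1},
    {"visits": 1, "account_number": None, "converted": 0},
    {"visits": 5, "account_number": "A-1002", "converted": 1},
]


def presence_matches_target(rows, feature, target):
    """True if 'the feature has a value' coincides with the target on every row."""
    return all((row[feature] is not None) == bool(row[target]) for row in rows)


# Assumption for illustration: only these fields are known before the outcome.
AVAILABLE_AT_PREDICTION = {"visits"}


def legitimate_features(row):
    """Keep only the fields that would exist when the prediction is made."""
    return {k: v for k, v in row.items() if k in AVAILABLE_AT_PREDICTION}


leaks = presence_matches_target(candidates, "account_number", "converted")
print(leaks)
print([legitimate_features(row) for row in candidates])
```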
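A simple check for the overlap case is to look for evaluation rows whose feature values also appear in the training set. This is a minimal sketch with made-up rows and column names; it catches only exact duplicates, not near-duplicates or other forms of leakage between the two sets.

```python
def row_key(row, feature_names):
    """Hashable fingerprint of a row, built from the chosen feature columns."""
    return tuple(row[name] for name in feature_names)


def split_overlap(train, evaluation, feature_names):
    """Return the evaluation rows whose features also occur in the training set."""
    train_keys = {row_key(r, feature_names) for r in train}
    return [r for r in evaluation if row_key(r, feature_names) in train_keys]


# Hypothetical data: the second evaluation row duplicates a training row.
train = [
    {"visits": 3, "country": "DE", "converted": 0},
    {"visits": 7, "country": "US", "converted": 1},
]
evaluation = [
    {"visits": 2, "country": "FR", "converted": 0},
    {"visits": 7, "country": "US", "converted": 1},
]
overlap = split_overlap(train, evaluation, ["visits", "country"])
print(overlap)
```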
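If each record does carry the time at which it was written, features can be computed "as of" the moment the prediction would have been made. The sketch below assumes a hypothetical `recorded_at` field and a hypothetical cutoff date; without such a field this filtering is not possible, which is exactly the problem described above.

```python
from datetime import datetime

# Hypothetical purchase log with the time each row was recorded.
events = [
    {"customer": "c1", "amount": 20.0, "recorded_at": datetime(2018, 1, 5)},
    {"customer": "c1", "amount": 35.0, "recorded_at": datetime(2018, 2, 10)},
    {"customer": "c1", "amount": 500.0, "recorded_at": datetime(2018, 3, 20)},
]


def total_spend_as_of(events, customer, cutoff):
    """Sum only the purchases recorded strictly before the prediction time."""
    return sum(
        e["amount"]
        for e in events
        if e["customer"] == customer and e["recorded_at"] < cutoff
    )


cutoff = datetime(2018, 3, 1)  # the moment the prediction would be made
safe_total = total_spend_as_of(events, "c1", cutoff)
# Ignoring time pulls in the March purchase, which lies in the future of the cutoff.
leaky_total = sum(e["amount"] for e in events if e["customer"] == "c1")
print(safe_total, leaky_total)
```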
In [KRP], various examples of leakage in competitions (e.g., KDD-Cup and IJCNN) are described, the last being from 2011 (the year the paper was published). I’d like to add two more examples, both from 2018 Kaggle challenges:
To summarize, leakage is a serious problem in data science, and the many examples of its occurrence, even in highly professionally organized and supervised competitions, show that it is very hard (maybe impossible) to completely avoid.
[NEM] Handbook of Statistical Analysis and Data Mining Applications by R. Nisbet, J. Elder and G. Miner.
[KP] Ten Supplementary Analyses to Improve E-commerce Web Sites by R. Kohavi and R. Parekh.
[KRP] Leakage in Data Mining: Formulation, Detection, and Avoidance by S. Kaufman, S. Rosset and C. Perlich.