Top 10 ways your Machine Learning models may have leakage
Rayid Ghani, Joe Walsh, Joan Wang
If you’ve ever worked on a real-world machine learning problem, you’ve probably introduced (and hopefully discovered and fixed) leakage into your system at some point. Leakage is when your model has access to data at training/building time that it wouldn’t have at test/deployment/prediction time. The result is an overoptimistic model that performs much worse when deployed.
The most common forms of leakage are temporal: data from the future ends up in your model because it is available when you’re doing model selection, even though it won’t be available at prediction time. But there are many other ways leakage gets introduced. Here are the most common ones we’ve found working on different real-world problems over the last few years. Hopefully, people will find this useful, add to it, and more importantly, start creating the equivalent of “unit tests” that can detect them before these systems get deployed (see initial work by Joe Walsh and Joan Wang).
The Big (and obvious) One
1. Using a proxy for the outcome variable (label) as a feature. This one is often easy to detect because you get perfect performance, but it is harder to spot when the proxy is only an approximation of the label/outcome variable and the performance increase is more subtle.
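One rough way to screen for #1 is to check how well each feature predicts the label entirely on its own. The sketch below is a minimal illustration, not a complete test: the function names, the 0.95 threshold, and the `discharge_code` field are all hypothetical, and a subtle proxy can easily stay under any fixed threshold.

```python
def single_feature_auc(values, labels):
    """AUC obtained by using one feature, on its own, as the score.

    Uses the pairwise definition: the fraction of (positive, negative)
    pairs where the positive row has the higher value, with ties
    counted as half.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag_suspect_features(columns, labels, threshold=0.95):
    """Return features whose stand-alone AUC is suspiciously close to 0 or 1."""
    suspects = {}
    for name, values in columns.items():
        auc = single_feature_auc(values, labels)
        strength = max(auc, 1 - auc)  # direction of the relationship does not matter
        if strength >= threshold:
            suspects[name] = round(strength, 3)
    return suspects


labels = [0, 0, 0, 1, 1, 1, 0, 1]
columns = {
    "age": [34, 51, 29, 45, 38, 62, 47, 33],
    # Hypothetical field that is only filled in after the outcome is known.
    "discharge_code": [0, 0, 0, 9, 9, 9, 0, 9],
}

suspects = flag_suspect_features(columns, labels)
print(suspects)
```

A flagged feature is not proof of leakage, only a prompt to ask when and how that field gets populated.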
Doing any transformation or inference using the entire dataset
2. Using the entire data set for imputations. Always do imputation based on your training set only, for each training set. Including the test set allows information to leak into your models, especially in cases where the world changes in the future (when does it not?!).
3. Using the entire data set for discretizations or normalizations/scaling or many other data-based transformations. Same reason as #2. The range of a variable (age for example) can change in the future and knowing that will make your models do/look better than they actually are.
4. Using the entire data set for feature selection. Same reasons as #2 and #3. To play it safe, first split into train and test sets, and then do everything you need to do using only the training data.
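The pattern behind #2 through #4 can be sketched as "fit on train, apply to test". The example below is a minimal illustration using mean imputation and min-max scaling on a single hypothetical `age` column; the helper names are made up for this sketch. In practice, a library pipeline that is refit inside each training fold (for example, scikit-learn's `Pipeline`) serves the same purpose.

```python
from statistics import mean


def fit_preprocessor(train_values):
    """Learn imputation and scaling parameters from training data only."""
    observed = [v for v in train_values if v is not None]
    return {"fill": mean(observed), "lo": min(observed), "hi": max(observed)}


def transform(values, params):
    """Apply previously learned parameters; nothing is re-estimated here."""
    span = (params["hi"] - params["lo"]) or 1.0  # avoid dividing by zero
    filled = [params["fill"] if v is None else v for v in values]
    return [(v - params["lo"]) / span for v in filled]


train_age = [20, 30, None, 40]
test_age = [None, 80]  # the future contains an age never seen in training

# Correct: parameters come from the training set alone.
params = fit_preprocessor(train_age)
train_scaled = transform(train_age, params)
test_scaled = transform(test_age, params)

# Leaky: the test set's range and mean sneak into the parameters.
leaky_params = fit_preprocessor(train_age + test_age)

print(params)
print(leaky_params)
print(test_scaled)
```

Note that the scaled test value falls outside the 0 to 1 range. That is what the model would face in deployment, and hiding it by fitting on all the data only makes the evaluation look better than it is.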
Using information from the future (that will not be available at training or prediction time)
5. Using (proxies/transformation of) future outcomes as features: Similar to #1
6. Doing standard k-fold cross-validation when you have temporal data. If you have temporal data (that is non-stationary; again, when is it not!), k-fold cross-validation assigns rows to folds without regard to time, often after shuffling, so a training set will (probably) contain data from the future and a test set will (probably) contain data from the past.
7. Using data (as features) that happened before model training time but is not available until later. This is fairly common in cases where there is lag/delay in data collection or access. An event may happen today but not appear in the database until a week, a month, or a year later. It will be available in the data set you’re using to build and select ML models, but it will not be available at prediction time in deployment.
8. Using data (as rows) in the training set based on information from the future. Including rows in the training set because they match certain criteria in the future (such as everyone who got a social service in the next 3 months) leaks information to your model via a biased training set.
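Items #6 and #7 both come down to asking "what would we actually have known on this date?". The sketch below assumes each record carries two dates: when the event happened and when it became visible in the database. The field names `event_date` and `recorded_date` and both helper functions are hypothetical, chosen only for illustration.

```python
from datetime import date

events = [
    {"id": 1, "event_date": date(2020, 1, 10), "recorded_date": date(2020, 1, 12)},
    # Long reporting lag: happened in February, visible only in June.
    {"id": 2, "event_date": date(2020, 2, 20), "recorded_date": date(2020, 6, 1)},
    {"id": 3, "event_date": date(2020, 3, 5), "recorded_date": date(2020, 3, 6)},
    {"id": 4, "event_date": date(2020, 5, 15), "recorded_date": date(2020, 5, 16)},
]


def known_as_of(records, as_of):
    """Keep only records that were already in the database on `as_of`."""
    return [r for r in records if r["recorded_date"] <= as_of]


def temporal_split(records, cutoff):
    """Train on everything before the cutoff, test on everything after."""
    train = [r for r in records if r["event_date"] < cutoff]
    test = [r for r in records if r["event_date"] >= cutoff]
    return train, test


as_of = date(2020, 4, 1)

# Features for a prediction made on 2020-04-01 may use only these rows.
feature_rows = known_as_of(events, as_of)

# Evaluation split: no training row comes from after the cutoff.
train, test = temporal_split(events, as_of)

print([r["id"] for r in feature_rows])
print([r["id"] for r in train], [r["id"] for r in test])
```

Record 2 is the interesting case: it happened before the cutoff, so a naive filter on the event date would include it as a feature, yet it was not visible until two months afterward.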
Humans using knowledge from the future
9. Selecting certain models, features, and other design choices that are based on humans (ML developers, domain experts) knowing what happened in the future. This is a gray area: we do want to use all of our domain knowledge to build more effective systems, but sometimes that knowledge does not generalize into the future, which results in overfitted/over-optimistic models at training time and disappointment once they’re deployed.
10. That’s where you come in. What are your favorite leakage stories or examples?
I think you probably cover this kind of leakage in your post. In clinical machine learning, a common form of data leakage is multiple records from the same patient appearing in both the training and testing datasets. Say I need to identify a patient’s health risk from X-ray images. Some patients will have more than one X-ray image. When I split the data into train and test sets, all the X-ray images of a given patient should appear in only one of the two, to prevent data leakage.
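The patient-level split described in this comment can be sketched as follows. This is a minimal illustration with made-up image and patient identifiers; `split_by_group` is a hypothetical helper, and scikit-learn's `GroupKFold` or `GroupShuffleSplit` provide the same idea for real workloads.

```python
import random


def split_by_group(records, group_key, test_fraction=0.25, seed=0):
    """Split so that every record of a given group lands on one side only."""
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)  # shuffle patients, not individual images
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test


# Made-up data: some patients have more than one X-ray image.
xrays = [
    {"image": "img_01", "patient_id": "A"},
    {"image": "img_02", "patient_id": "A"},
    {"image": "img_03", "patient_id": "B"},
    {"image": "img_04", "patient_id": "C"},
    {"image": "img_05", "patient_id": "C"},
    {"image": "img_06", "patient_id": "C"},
    {"image": "img_07", "patient_id": "D"},
]

train, test = split_by_group(xrays, "patient_id")
train_patients = {r["patient_id"] for r in train}
test_patients = {r["patient_id"] for r in test}

print(sorted(train_patients), sorted(test_patients))
```

A check that the two patient sets are disjoint is a cheap candidate for the kind of leakage “unit test” the post asks for.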