data leakage

Data leakage occurs when a machine learning model is trained on information about the target (outcome) variable that will not be available when the model is used in production. It typically happens when a feature that encodes, or is derived from, the target is mistakenly included in the training dataset. For example, suppose you want to predict whether a website visitor will purchase a product from their behavioral and demographic details. If the training dataset accidentally includes a feature that reflects the purchase itself, the model will “know” something about the visitor’s future that will not be available when it makes predictions about a new visitor. The model will also appear to perform unusually well during training and evaluation, because it has been given information that correlates strongly with the target, and that apparent performance will not carry over to production.
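
The website-visitor example can be illustrated with a small sketch in Python. Everything in it is an assumption made for illustration: the data is synthetic, the feature names (pages_viewed, order_total), the helper functions, and the 0.99 cutoff are hypothetical, and the single-feature check is just one simple heuristic for spotting a leaked feature, not a standard or complete leakage test.

```python
import random


def make_visitors(n=1000, seed=0):
    """Generate synthetic visitors (hypothetical data, for illustration only)."""
    rng = random.Random(seed)
    pages_viewed, order_total, purchased = [], [], []
    for _ in range(n):
        pages = rng.randint(1, 20)
        # Legitimate signal: more pages viewed means a somewhat higher
        # chance of purchase (probability between 0.025 and 0.5).
        bought = 1 if rng.random() < pages / 40 else 0
        # Leaky feature: the order total only exists AFTER a purchase,
        # so it would not be known when predicting for a new visitor.
        total = round(rng.uniform(5, 200), 2) if bought else 0.0
        pages_viewed.append(pages)
        order_total.append(total)
        purchased.append(bought)
    return {"pages_viewed": pages_viewed, "order_total": order_total}, purchased


def best_threshold_accuracy(values, labels):
    """Best in-sample accuracy of the rule 'predict 1 if value > t' or its inverse."""
    n = len(labels)
    best = 0.0
    for t in set(values):
        correct = sum((v > t) == bool(y) for v, y in zip(values, labels))
        acc = correct / n
        # The inverted rule has accuracy 1 - acc.
        best = max(best, acc, 1.0 - acc)
    return best


features, purchased = make_visitors()

# Score each feature on its own against the target.
scores = {
    name: best_threshold_accuracy(values, purchased)
    for name, values in features.items()
}

# Assumed cutoff: a single feature that predicts the target almost
# perfectly is a reason to check where that feature comes from.
SUSPICIOUS_ACCURACY = 0.99
suspicious = [name for name, acc in scores.items() if acc >= SUSPICIOUS_ACCURACY]

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
print("possible leakage:", suspicious)
```

In this sketch the rule "order_total > 0" reproduces the target exactly, so that feature is flagged, while pages_viewed is only weakly predictive. A near-perfect score from one feature does not prove leakage, but it is a useful prompt to ask whether the feature would actually be known at prediction time.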