Senior | Feature Engineering
What is data leakage in feature engineering pipelines?
Updated May 16, 2026
Short answer
Data leakage occurs when features are built using information that would not be available at prediction time, such as test-set statistics, the target itself, or values from the future.
Deep explanation
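Leakage can enter at any step that learns from data: preprocessing (scaling or imputing with statistics computed over all rows), feature creation (aggregates that include test rows or later timestamps), and target encoding (a row's own label contributing to its encoded value). The model then scores well in validation because it has indirectly seen the answers, but it generalizes poorly once deployed, where that information does not exist. The remedy is to split the data first and to design the pipeline so that every fitted step learns from the training portion only and is merely applied to validation and test data.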
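As an illustration of the target-encoding case, here is a minimal sketch of out-of-fold encoding in plain Python. The function name `target_encode_out_of_fold`, the round-robin fold assignment, and the fallback to the out-of-fold mean are assumptions made for this example, not a reference implementation.

```python
from collections import defaultdict


def target_encode_out_of_fold(categories, targets, n_folds=3):
    """Encode each row with the mean target of its category,
    computed only from rows in the OTHER folds (hypothetical helper)."""
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        total, seen = 0.0, 0
        # Learn category means from rows outside the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                seen += 1
        fallback = total / seen if seen else 0.0
        # Apply them to rows inside the current fold.
        for i in range(n):
            if i % n_folds == fold:
                c = categories[i]
                encoded[i] = sums[c] / counts[c] if counts[c] else fallback
    return encoded


categories = ["x", "y", "y", "y"]
targets = [1, 0, 0, 0]
encoded = target_encode_out_of_fold(categories, targets, n_folds=2)
```

Category "x" appears only once, so a naive encoding would assign that row its own label. With the out-of-fold scheme the row falls back to a mean computed without it.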
Real-world example
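In stock price prediction, a feature computed from prices at or after the prediction time (for example, a moving average whose window includes the day being predicted) leaks the answer into the input. Backtests look excellent, but the feature cannot be computed in live trading.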
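One way to keep time-series features honest is to build them only from earlier observations and to split chronologically rather than at random. The sketch below assumes a plain list of daily prices; `lagged_features` and `split_chronologically` are hypothetical helper names.

```python
def lagged_features(prices, lag=1):
    """Pair each day's price (the target) with the price `lag` days earlier."""
    rows = []
    for t in range(lag, len(prices)):
        rows.append({
            "feature_prev_price": prices[t - lag],  # known before day t
            "target_price": prices[t],              # what we try to predict
        })
    return rows


def split_chronologically(rows, train_fraction=0.75):
    """Keep the earliest rows for training and the latest for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


prices = [10.0, 11.0, 12.0, 13.0]
rows = lagged_features(prices, lag=1)
train_rows, test_rows = split_chronologically(rows)
```

Every feature value here predates its target, and no test row is older than any training row.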
Common mistakes
- Fitting scalers or encoders on the full dataset before splitting, so that test-set statistics shape the training features.
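The mistake above can be made concrete with a small sketch. `MeanStdScaler` is a hypothetical stand-in for a library scaler, written with the standard library only; the data values are made up for illustration.

```python
import statistics


class MeanStdScaler:
    """Hypothetical minimal scaler: fit() learns mean and std, transform() applies them."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        # Fall back to 1.0 so constant columns do not divide by zero.
        self.std = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]


train = [1.0, 2.0, 3.0]
test = [10.0]

# Leaky: the test value shifts the statistics used on the training data.
leaky = MeanStdScaler().fit(train + test)

# Clean: fit on the training split only, then apply to both splits.
clean = MeanStdScaler().fit(train)
train_scaled = clean.transform(train)
test_scaled = clean.transform(test)
```

The leaky scaler's mean is pulled toward the test value, so the training features already carry information about data the model is later evaluated on.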
Follow-up questions
- How can leakage be prevented?
- Why is leakage dangerous?