Cross-validation is a technique for estimating how well a model will perform on data it has never seen. Instead of relying on a single train/test split, you rotate which slice is held out, train one model per rotation, and average the results. The most common form is k-fold cross-validation: split the data into k roughly equal slices, or folds (typically 5 or 10), train k separate models, each holding out a different fold, and report the mean score on the held-out folds.
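The rotation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names (`k_fold_indices`, `cross_validate`), the mean-predicting stand-in model, and the toy data are all assumptions made for the example.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k folds whose sizes differ by at most 1."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, score):
    """Train k models; each one is scored on the single fold it never saw."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        test = set(held_out)
        train = [i for i in range(len(xs)) if i not in test]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return scores

# Hypothetical stand-in model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def mean_abs_error(model, test_x, test_y):
    return sum(abs(y - model) for y in test_y) / len(test_y)

# Toy data, invented for the example.
xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]

fold_scores = cross_validate(xs, ys, k=5, fit=fit_mean, score=mean_abs_error)
mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

Every row lands in exactly one held-out fold, so each example contributes to the reported score exactly once. In practice you would swap `fit_mean` and `mean_abs_error` for your real training routine and metric.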
The reason cross-validation matters is that any single train/test split can be misleading. You might get lucky and pick an easy test set, or unlucky and pick a hard one. Cross-validation gives you a more reliable estimate of true performance plus a sense of variance — high variance across folds is a warning sign that the model is unstable or the dataset is too small.
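One way to act on the point about variance is to report the spread of the fold scores next to their mean. The sketch below assumes a hypothetical helper, `summarise_folds`, and an arbitrary `max_spread` threshold; the fold scores are made-up numbers used only to show the calculation.

```python
from statistics import mean, stdev

def summarise_folds(fold_scores, max_spread=0.05):
    """Return (mean, sample standard deviation, wide-spread flag).

    max_spread is an arbitrary illustrative threshold, not a standard value.
    """
    centre = mean(fold_scores)
    spread = stdev(fold_scores)
    return centre, spread, spread > max_spread

# Made-up fold accuracies, for illustration only.
stable = [0.81, 0.80, 0.82, 0.79, 0.81]
unstable = [0.95, 0.62, 0.88, 0.70, 0.91]

print(summarise_folds(stable))
print(summarise_folds(unstable))
```

The two lists have nearly the same mean, but only the second trips the flag, which is the situation a single train/test split would hide.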
In modern deep learning the picture is messier. Cross-validation works beautifully for tabular models with a few thousand rows and a fast training loop. It is impractical for a large language model that takes a month to train, where you simply cannot afford ten training runs. For LLMs the discipline is different: a single large held-out evaluation set, rigorous benchmark suites (MMLU, GSM8K, HumanEval), and red-team probes for safety and reliability.
Variants worth knowing:
- Stratified k-fold: preserves the class distribution in every fold; essential for imbalanced classification.
- Leave-one-out (LOOCV): k equals the dataset size, so each model is tested on a single row; it makes the most of very small datasets, though its estimate can be high-variance, and it is ruinously expensive for large ones.
- Time-series CV: folds respect chronological order so you never train on future data; mandatory for forecasting.
- Group k-fold: keeps related rows (same user, same patient) in the same fold, so no group appears in both training and test data and the score is not inflated by the model recognising individuals it has already seen.
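The variants above differ only in how rows are assigned to folds. The following sketch shows the invariant each one protects, using plain Python and invented toy data. The function names are hypothetical, and the logic is deliberately simplified (no shuffling, naive group balancing). Libraries such as scikit-learn ship more complete splitters (`StratifiedKFold`, `GroupKFold`, `TimeSeriesSplit`, `LeaveOneOut`).

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Deal each class's rows round-robin so every fold keeps the class mix."""
    by_class = defaultdict(list)
    for row, label in enumerate(labels):
        by_class[label].append(row)
    folds = [[] for _ in range(k)]
    for rows in by_class.values():
        for position, row in enumerate(rows):
            folds[position % k].append(row)
    return folds

def group_folds(groups, k):
    """Assign whole groups to folds so no group straddles train and test."""
    fold_of_group = {g: n % k for n, g in enumerate(sorted(set(groups)))}
    folds = [[] for _ in range(k)]
    for row, g in enumerate(groups):
        folds[fold_of_group[g]].append(row)
    return folds

def time_series_splits(n_rows, k):
    """Expanding window: train on everything before each test block."""
    block = n_rows // (k + 1)
    splits = []
    for i in range(1, k + 1):
        train = list(range(0, block * i))
        test = list(range(block * i, block * (i + 1)))
        splits.append((train, test))
    return splits

# Toy data: 12 rows in time order, 9 negatives and 3 positives, 4 groups of 3 rows.
labels = [0, 0, 0, 1] * 3
groups = ["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]

strat = stratified_folds(labels, k=3)
grouped = group_folds(groups, k=2)
chrono = time_series_splits(n_rows=12, k=3)
loocv = [[row] for row in range(len(labels))]  # leave-one-out: one row per fold

print([dict(Counter(labels[r] for r in fold)) for fold in strat])
print([sorted({groups[r] for r in fold}) for fold in grouped])
print(chrono)
```

Each stratified fold keeps the 3:1 class ratio of the full dataset, each group lands in exactly one fold, and every time-series training window ends before its test block begins.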
As a practical rule of thumb, if your dataset has fewer than roughly 100,000 examples, cross-validation is your friend. Above that, a single well-curated holdout is usually enough, and the compute saving matters. Either way, report the metric your business actually cares about: accuracy is a lazy default, and F1, AUC, or a business-weighted cost is usually more honest, especially when classes are imbalanced.
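To see why accuracy can mislead, consider a small, invented example with 5 positives in 100 rows. The metric functions below are simplified sketches written for this illustration, and the cost figures (`fn_cost`, `fp_cost`) are hypothetical placeholders for whatever a missed positive and a false alarm actually cost your business.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: treat F1 as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_cost(y_true, y_pred, fn_cost=50.0, fp_cost=1.0):
    """Total cost of errors; the default costs are hypothetical."""
    pairs = list(zip(y_true, y_pred))
    missed = sum(t == 1 and p == 0 for t, p in pairs)
    false_alarms = sum(t == 0 and p == 1 for t, p in pairs)
    return missed * fn_cost + false_alarms * fp_cost

# Invented labels: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Predictor A never predicts the positive class.
always_negative = [0] * 100
# Predictor B catches 4 of the 5 positives at the price of 10 false alarms.
cautious = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85

for name, y_pred in [("always_negative", always_negative), ("cautious", cautious)]:
    print(name, accuracy(y_true, y_pred), f1_score(y_true, y_pred), weighted_cost(y_true, y_pred))
```

The always-negative predictor wins on accuracy (95% against 89%) while finding no positives at all, so its F1 is zero and its weighted cost is higher. Ranking models by accuracy alone would pick the wrong one here.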