Labeling: Definition & Meaning | AI Glossary

Labeling (also called annotation) is the process of attaching the correct answer to each example in a dataset. A model trained with supervised learning is essentially memorising and generalising the labelling decisions humans made. Quality, consistency and coverage of labels determine whether the resulting model is brilliant or worse than guessing.

Labels look different depending on the task. For sentiment classification a label might be a single word ("positive"). For object detection it might be a bounding box around every car in an image. For LLM instruction tuning it might be a multi-paragraph ideal response. For RLHF it might be a ranking — "response A is better than response B" — without specifying the right answer at all.
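To make the differences concrete, here is a minimal sketch of how those four kinds of label might be represented in code. The class and field names (`BoundingBox`, `PreferencePair`, `x_min` and so on) and all the example values are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """One labelled object in an image (hypothetical schema, pixel coordinates)."""
    category: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


@dataclass
class PreferencePair:
    """An RLHF-style comparison: records which response won, not a 'right answer'."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b"


# Sentiment classification: the label is a single word.
sentiment_label = "positive"

# Object detection: one box per car in the image.
detection_label = [
    BoundingBox("car", 34, 50, 210, 180),
    BoundingBox("car", 300, 62, 480, 200),
]

# Instruction tuning: the label is a full ideal response to a prompt.
instruction_label = {
    "prompt": "Explain what a label is in supervised learning.",
    "ideal_response": "A label is the target output attached to a training example.",
}

# RLHF: the label is only a ranking between two candidate responses.
preference_label = PreferencePair(
    prompt="Summarise this article in one sentence.",
    response_a="A short, accurate summary.",
    response_b="A long summary that misses the main point.",
    preferred="a",
)
```

The point of the sketch is that "a label" is not one data type: the storage format, the annotation tool and the quality checks all change with the task.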

The labelling pipeline used in serious AI shops in 2026 has several stages:

Guidelines — a written spec describing exactly what counts as the correct answer, including edge cases and counterexamples.
Pilot — a small batch labelled by multiple annotators to measure inter-rater agreement; if humans cannot agree, the task is broken.
Production labelling — full-scale annotation, often by a vendor like Scale AI, Surge, Toloka or Mechanical Turk, sometimes by in-house experts.
Audit — a sampled review by senior annotators or domain experts catching drift and ambiguous cases.
Synthetic labelling — using a strong LLM (often GPT-5 or Claude Sonnet 4) to label data, then having humans verify a subset. Cheap, fast and increasingly the default.
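The inter-rater agreement measured in the pilot stage is often summarised with a chance-corrected statistic. The sketch below computes Cohen's kappa for two annotators in plain Python. Kappa is one common choice rather than the only one, and the function name and toy labels are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same examples in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of examples.")
    n = len(labels_a)

    # Observed agreement: fraction of examples where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each annotator labelled at random with
    # their own observed label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy pilot batch: eight examples, two annotators (made-up data).
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
annotator_2 = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")  # 6 of 8 agree, chance agreement is 0.5, so kappa = 0.50
```

A kappa near 1 means annotators agree far more often than chance would predict, while a value near 0 means the guidelines are not producing consistent decisions. Where to set the pass mark is a judgment call for each task.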

The cost of labelling can dominate a project budget. A team labelling hundreds of thousands of medical images at $5 each runs into seven figures fast: 200,000 images already costs $1 million. The economic case for synthetic data plus targeted human verification has only strengthened as models get better, and the active-learning pipelines that pick the most informative examples to send to humans have improved alongside them.
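The arithmetic and the selection idea can be sketched in a few lines. The example below assumes 200,000 images (the $5 price comes from the text; the image count is an illustrative figure) and uses prediction entropy as the measure of "most informative", which is one common active-learning heuristic among several. All function names and probabilities are hypothetical.

```python
import math


def labelling_cost(num_items, cost_per_item):
    """Total cost of human-labelling num_items at a flat per-item price."""
    return num_items * cost_per_item


def entropy(probs):
    """Shannon entropy of a predicted class distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_humans(predictions, budget):
    """Return the ids of the `budget` examples the model is least sure about."""
    ranked = sorted(predictions, key=lambda ex_id: entropy(predictions[ex_id]), reverse=True)
    return ranked[:budget]


# Labelling everything by hand: 200,000 images at $5 each.
full_cost = labelling_cost(200_000, 5)
print(f"Full human labelling: ${full_cost:,}")  # $1,000,000

# Made-up model confidences for four images (two classes each).
predictions = {
    "img_001": [0.98, 0.02],
    "img_002": [0.55, 0.45],
    "img_003": [0.70, 0.30],
    "img_004": [0.51, 0.49],
}

# Send only the two most uncertain images to human annotators.
to_review = select_for_humans(predictions, budget=2)
print(to_review)  # ['img_004', 'img_002']
```

In a real pipeline the probabilities would come from the current model or a synthetic labeller, and the budget would be set by how much human review the project can afford.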

For a US team adopting AI, the practical advice is: invest in the labelling guidelines as if they were product specifications. Every ambiguity in the guidelines becomes a confusion in the model. Spend a week getting the spec right and you will save months of debugging downstream.
