A dataset is a structured collection of examples used to train, validate or evaluate a machine-learning model. In supervised learning each example is a pair: an input and its correct answer. In language modelling each example is just a chunk of text, which the model is asked to predict one token at a time. In reinforcement learning, the data is usually generated on the fly as the agent acts in an environment.
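To make the first two cases concrete, here is a minimal sketch of what the examples might look like in code. The sentiment labels, the whitespace tokenizer and the helper name `next_token_pairs` are illustrative assumptions; real language models use subword tokenizers and much longer chunks.

```python
from typing import List, Tuple

# Supervised learning: each example is an (input, correct answer) pair.
# The texts and labels here are made up for illustration.
supervised_examples: List[Tuple[str, str]] = [
    ("The delivery arrived two weeks late.", "negative"),
    ("Setup took five minutes and it just works.", "positive"),
]


def next_token_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Turn one chunk of text into (context, next token) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Language modelling: the text supplies its own targets.
# Toy whitespace tokenizer (an assumption, not how production models tokenize).
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)

for context, target in pairs:
    print(context, "->", target)
```

The point of the sketch is that a supervised dataset needs someone to supply the answer, while a language-modelling dataset derives every target from the text itself.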
The dataset is almost always the deciding factor in a model's quality. A modest architecture trained on excellent data routinely beats a frontier architecture trained on web sludge. For LLMs in 2026, the most valuable datasets are not the biggest — they are the cleanest, the most diverse and the most carefully filtered. Anthropic, OpenAI and DeepMind have spent years building proprietary data pipelines and treat them as core IP.
Common dataset categories you will hear about:
- Pretraining corpora — Common Crawl, Wikipedia, GitHub code, books, scientific papers. Trillions of tokens.
- Instruction datasets — example prompt-and-response pairs used in supervised fine-tuning to teach models how to follow instructions.
- Preference datasets — pairs of responses ranked by humans or another AI, used in RLHF.
- Evaluation benchmarks — MMLU, GSM8K, HumanEval, SWE-Bench, GPQA. Standardised tests that measure capability.
- Domain datasets — your company's support tickets, your codebase, your legal filings. The proprietary data that makes a tuned model uniquely useful.
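As a rough illustration of the instruction and preference categories above, the records might be shaped as follows. The field names (`prompt`, `response`, `chosen`, `rejected`) and the JSON Lines layout are assumptions chosen for this sketch, not a standard that any particular vendor or library requires.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Union


@dataclass
class InstructionExample:
    """One prompt-and-response pair for supervised fine-tuning."""
    prompt: str
    response: str


@dataclass
class PreferenceExample:
    """Two responses to the same prompt, with the preferred one marked."""
    prompt: str
    chosen: str
    rejected: str


Record = Union[InstructionExample, PreferenceExample]


def to_jsonl(records: Iterable[Record]) -> str:
    """Serialise records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r)) for r in records)


sft_data = [InstructionExample("Translate 'bonjour' to English.", "Hello.")]
pref_data = [
    PreferenceExample(
        prompt="Explain what a dataset is.",
        chosen="A dataset is a structured collection of examples.",
        rejected="It is data.",
    )
]

print(to_jsonl(sft_data))
print(to_jsonl(pref_data))
```

Writing the schema down as a type, even a small one like this, tends to surface formatting inconsistencies early, before they reach training.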
For a US business adopting AI, the right question is usually not "do we have enough data?" but "do we have clean data with consistent formatting, accurate labels and clear ownership?" A team with 5,000 well-labelled examples will often beat a team with 5 million dirty ones.
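A first pass at answering the "is it clean?" question can be automated. The sketch below assumes rows with `text` and `label` fields and a hypothetical fixed label set for support tickets; the specific checks are examples, not a complete audit.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical label set for a support-ticket classifier.
ALLOWED_LABELS = {"billing", "bug", "feature_request"}


def audit(rows: List[Dict[str, str]]) -> Dict[str, int]:
    """Count a few common quality problems in a labelled text dataset."""
    issues: Counter = Counter()
    seen = set()
    for row in rows:
        text = (row.get("text") or "").strip()
        label = row.get("label")
        if not text:
            issues["empty_text"] += 1
        if label is None or label == "":
            issues["missing_label"] += 1
        elif label not in ALLOWED_LABELS:
            # Catches inconsistent formatting such as "Bug" vs "bug".
            issues["unknown_label"] += 1
        # Normalise case and whitespace so near-identical rows collide.
        key = " ".join(text.lower().split())
        if key and key in seen:
            issues["duplicate_text"] += 1
        seen.add(key)
    return dict(issues)


rows = [
    {"text": "I was charged twice", "label": "billing"},
    {"text": "i was  charged twice", "label": "billing"},
    {"text": "App crashes on login", "label": "Bug"},
    {"text": "", "label": "bug"},
    {"text": "Please add dark mode", "label": ""},
]
report = audit(rows)
print(report)
```

Checks like these say nothing about whether the labels are accurate or who owns the data; those questions still need human review.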
Building a dataset is a craft. You define the schema, sample the rows, write the labelling guidelines, recruit annotators, measure inter-rater agreement, audit edge cases, version everything, and document the provenance. Tools and services such as Argilla, Label Studio and Scale AI exist because every step has to be done well; otherwise the model that learns from the result will inherit the flaws.
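Inter-rater agreement is the step in that list that is easiest to quantify. One common choice for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a plain-Python illustration with made-up labels; the text above does not prescribe a particular metric, so treat the choice of kappa as an assumption.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # p_o: fraction of items where the two annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # p_e: agreement expected if each annotator labelled at random
    # according to their own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up annotations for eight items.
annotator_1 = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham"]
annotator_2 = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "ham"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")  # raw agreement is 6/8, kappa is 7/15
```

A low kappa usually points at ambiguous labelling guidelines rather than careless annotators, which is why it is measured before labelling at scale.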
The legal and ethical dimension is non-trivial: scraping copyrighted material, using user-generated content without consent, or training on PII can expose a company to significant risk. Treat the dataset as a product, not a side effect of training.