Red Teaming: Definition & Meaning | AI Glossary

Red teaming in the AI context is adversarial testing — humans (and increasingly other AI systems) actively trying to make a model fail, misbehave, leak data, produce harmful content or be exploited by attackers. The term comes from military and cybersecurity practice; the AI version was popularised by frontier labs in the late 2020s and is now a standard pre-deployment activity for any serious AI product.

What red teamers actually do in 2026:

Probe for harmful content generation — try every category of prohibited content with creative phrasings, role-plays and indirect framings.
Test prompt injection vulnerabilities — embed instructions in retrieved documents, web pages, file uploads and tool outputs.
Stress-test agent permissions — try to make tools execute beyond intended scope, escalate privileges, exfiltrate data.
Attack alignment — explore whether the model can be persuaded to abandon its principles through long conversations, persona attacks or reasoning chains.
Probe demographic fairness — measure performance and refusal patterns across protected groups.
Test rate-limit and abuse vectors — automated traffic, denial-of-service via expensive queries, cost amplification attacks.
Look for deceptive behaviour — does the model ever lie, omit, or strategically mislead?

The 2026 red-teaming ecosystem:

Internal red teams at frontier labs — Anthropic, OpenAI, Google DeepMind all have dedicated teams that test models pre-release.
AI Safety Institutes — UK AISI, US AISI and others run independent pre-deployment evaluation of frontier models.
Bug bounty programs — Anthropic and others now run bounty programs specifically for safety vulnerabilities.
Third-party red-teaming firms — Lakera, HiddenLayer, Garak (open-source toolkit), Robust Intelligence, Mindgard.
Automated red teams — using one AI to probe another at scale; an active research front and increasingly part of production CI.
Community-driven testing — Hugging Face, llm-attacks.org and academic groups publish public benchmarks.

The categories of attack that have proven most consequential:

Indirect prompt injection — by far the most damaging in agent products; instructions hidden in retrieved content can hijack agent behaviour.
Adversarial suffixes — short gibberish strings that reliably bypass alignment in many models; transferable across providers.
Many-shot in-context attacks — exploiting long context windows to overwhelm safety training.
Tool-use abuse — agents tricked into using their tools in harmful ways (sending emails, making purchases, deleting data).
Cost attacks — pathological inputs that maximise token usage or trigger expensive operations.

For a US team shipping AI features in 2026, red-teaming is no longer optional for anything user-facing or agentic. The minimum viable program: a written threat model, a test suite covering the top attack categories for your product surface, automated red-team runs in CI, an incident response process, and quarterly reviews with humans probing fresh attack surfaces. For high-stakes products (healthcare, financial, safety-critical), more sophisticated programs with internal and external red teams are now table stakes.

Related terms