Red teaming in the AI context is adversarial testing — humans (and increasingly other AI systems) actively trying to make a model fail, misbehave, leak data, produce harmful content or be exploited by attackers. The term comes from military and cybersecurity practice; the AI version was popularised by frontier labs in the late 2020s and is now a standard pre-deployment activity for any serious AI product.
What red teamers actually do in 2026:
- Probe for harmful content generation — try every category of prohibited content with creative phrasings, role-plays and indirect framings.
- Test prompt injection vulnerabilities — embed instructions in retrieved documents, web pages, file uploads and tool outputs.
- Stress-test agent permissions — try to make tools execute beyond intended scope, escalate privileges, exfiltrate data.
- Attack alignment — explore whether the model can be persuaded to abandon its principles through long conversations, persona attacks or reasoning chains.
- Probe demographic fairness — measure performance and refusal patterns across protected groups.
- Test rate-limit and abuse vectors — automated traffic, denial-of-service via expensive queries, cost amplification attacks.
- Look for deceptive behaviour — does the model ever lie, omit, or strategically mislead?
The 2026 red-teaming ecosystem:
- Internal red teams at frontier labs — Anthropic, OpenAI, Google DeepMind all have dedicated teams that test models pre-release.
- AI Safety Institutes — UK AISI, US AISI and others run independent pre-deployment evaluation of frontier models.
- Bug bounty programs — Anthropic and others now run bounty programs specifically for safety vulnerabilities.
- Third-party red-teaming firms — Lakera, HiddenLayer, Garak (open-source toolkit), Robust Intelligence, Mindgard.
- Automated red teams — using one AI to probe another at scale; an active research front and increasingly part of production CI.
- Community-driven testing — Hugging Face, llm-attacks.org and academic groups publish public benchmarks.
The categories of attack that have proven most consequential:
- Indirect prompt injection — by far the most damaging in agent products; instructions hidden in retrieved content can hijack agent behaviour.
- Adversarial suffixes — short gibberish strings that reliably bypass alignment in many models; transferable across providers.
- Many-shot in-context attacks — exploiting long context windows to overwhelm safety training.
- Tool-use abuse — agents tricked into using their tools in harmful ways (sending emails, making purchases, deleting data).
- Cost attacks — pathological inputs that maximise token usage or trigger expensive operations.
For a US team shipping AI features in 2026, red-teaming is no longer optional for anything user-facing or agentic. The minimum viable program: a written threat model, a test suite covering the top attack categories for your product surface, automated red-team runs in CI, an incident response process, and quarterly reviews with humans probing fresh attack surfaces. For high-stakes products (healthcare, financial, safety-critical), more sophisticated programs with internal and external red teams are now table stakes.