Infrastructure & Ethics

Red Teaming

Adversarial testing of AI systems by humans (or other AI) trying to make them fail or misbehave.

In common use since 2022

Red teaming in the AI context is adversarial testing — humans (and increasingly other AI systems) actively trying to make a model fail, misbehave, leak data, produce harmful content or be exploited by attackers. The term comes from military and cybersecurity practice; the AI version was popularised by frontier labs in the late 2020s and is now a standard pre-deployment activity for any serious AI product.

What red teamers actually do in 2026:

  • Probe for harmful content generation — try every category of prohibited content with creative phrasings, role-plays and indirect framings.
  • Test prompt injection vulnerabilities — embed instructions in retrieved documents, web pages, file uploads and tool outputs.
  • Stress-test agent permissions — try to make tools execute beyond intended scope, escalate privileges, exfiltrate data.
  • Attack alignment — explore whether the model can be persuaded to abandon its principles through long conversations, persona attacks or reasoning chains.
  • Probe demographic fairness — measure performance and refusal patterns across protected groups.
  • Test rate-limit and abuse vectors — automated traffic, denial-of-service via expensive queries, cost amplification attacks.
  • Look for deceptive behaviour — does the model ever lie, omit, or strategically mislead?

The 2026 red-teaming ecosystem:

  • Internal red teams at frontier labs — Anthropic, OpenAI, Google DeepMind all have dedicated teams that test models pre-release.
  • AI Safety Institutes — UK AISI, US AISI and others run independent pre-deployment evaluation of frontier models.
  • Bug bounty programs — Anthropic and others now run bounty programs specifically for safety vulnerabilities.
  • Third-party red-teaming firms — Lakera, HiddenLayer, Garak (open-source toolkit), Robust Intelligence, Mindgard.
  • Automated red teams — using one AI to probe another at scale; an active research front and increasingly part of production CI.
  • Community-driven testing — Hugging Face, llm-attacks.org and academic groups publish public benchmarks.

The categories of attack that have proven most consequential:

  • Indirect prompt injection — by far the most damaging in agent products; instructions hidden in retrieved content can hijack agent behaviour.
  • Adversarial suffixes — short gibberish strings that reliably bypass alignment in many models; transferable across providers.
  • Many-shot in-context attacks — exploiting long context windows to overwhelm safety training.
  • Tool-use abuse — agents tricked into using their tools in harmful ways (sending emails, making purchases, deleting data).
  • Cost attacks — pathological inputs that maximise token usage or trigger expensive operations.

For a US team shipping AI features in 2026, red-teaming is no longer optional for anything user-facing or agentic. The minimum viable program: a written threat model, a test suite covering the top attack categories for your product surface, automated red-team runs in CI, an incident response process, and quarterly reviews with humans probing fresh attack surfaces. For high-stakes products (healthcare, financial, safety-critical), more sophisticated programs with internal and external red teams are now table stakes.

Keep exploring

Looking for something else? The full glossary covers 120+ AI terms updated for 2026.

Open the glossary
Chat on WhatsApp