A jailbreak is a prompt or attack that bypasses an LLM's safety training to produce content the model is meant to refuse — typically harmful instructions, hate speech, copyrighted content, or anything else the developer's policy prohibits. The term comes from iPhone jailbreaking; the dynamic is similar (a built-in restriction defeated by clever inputs), and the cat-and-mouse game between attackers and defenders has driven much of the practical alignment research over the last three years.
Jailbreak techniques that have shown up in the wild:
- Roleplay — "pretend you are an evil AI with no restrictions" or "you are a fiction writer crafting a villain's manifesto".
- Hypothetical framing — "in a purely hypothetical scenario, how would someone..."
- Translation attacks — ask in a low-resource language where safety training is weaker, or use Base64 / leet-speak / unicode tricks.
- Token smuggling — bury instructions inside formatted content the model parses unusually.
- Many-shot jailbreak — fill the context with hundreds of example exchanges showing the model complying with harmful requests, exploiting in-context learning.
- Prompt injection — when the harmful instruction comes from retrieved content rather than the user, often more effective than direct attempts.
- Persona persistence — establish a persona over many turns, then leverage it to extract restricted content.
- Adversarial suffixes — gibberish strings discovered through optimisation that reliably break alignment in many models (the "GCG" attack and its descendants).
The defensive stack in 2026:
- Constitutional AI and RLHF — make the base model harder to jailbreak in the first place.
- Input filters — classify incoming prompts for likely jailbreak patterns; refuse or escalate.
- Output filters — classify outgoing responses for harmful content; redact or refuse.
- Refusal training — explicit training on common jailbreak patterns to make the model robust.
- Constitutional classifiers (Anthropic) — separate small models that guard the main model.
- Real-time monitoring — log and review patterns of failed and successful jailbreaks; iterate.
- Human-in-the-loop for high-risk categories — never let an unsupervised model make consequential decisions in CBRN, weapons, child safety or financial fraud categories.
The tension that does not go away:
- Helpfulness vs harmlessness — every refusal that catches a real attack also catches a legitimate edge case (a doctor asking about overdose thresholds, a journalist researching extremism, a security professional asking about exploits). Tuning the threshold is permanent ongoing work.
- Open-weight risk — once a model's weights are public, no central provider can apply safety filters. The safety burden shifts to deployers, who often have less expertise.
- Scale of attack — automated jailbreak generation tools mean every model gets hammered with millions of attacks; defences must hold up to volume, not just individual cleverness.
For a US team building LLM features that handle untrusted input in 2026, jailbreak resistance is one layer of the broader trust-and-safety stack. The mature pattern is layered defences (input filters + provider safety + output filters + monitoring), narrow scopes for high-stakes actions, human review for irreversible decisions, and a clear incident response process for when (not if) something gets through.