Skip to Content

JAILBREAKING IN THE CONTEXT OF LLMS

May 4, 2026
Azade Foutouhi

Large Language Models (LLMs) are increasingly deployed in high‑impact domains, from software engineering and healthcare to decision support and autonomous agents. To mitigate misuse, these models undergo alignment: a collection of techniques intended to ensure that model behavior conforms to human values, legal constraints, and safety policies.

LLM alignment aims to reduce the gap between what a model can generate and what it should generate. Alignment methods are commonly divided into outer alignment and inner alignment. Outer alignment focuses on aligning model outputs with explicit human preferences and policies, typically using techniques such as Reinforcement Learning from Human Feedback (RLHF) or supervised fine‑tuning. Inner alignment, by contrast, concerns whether the model’s internal objectives truly reflect those intended constraints.

Jailbreaking refers to deliberate attempts to bypass alignment constraints and elicit restricted behaviors.  As alignment mechanisms become increasingly sophisticated, jailbreaking techniques likewise evolve, developing more advanced strategies to circumvent these safeguards.

This tension between alignment and jailbreaking has evolved into a continuous arms race, exposing fundamental limitations in how safety is currently enforced in LLMs.

Early jailbreaks relied on manually crafted prompts or role‑playing scenarios. However, research after 2024 demonstrates that alignment defenses based primarily on refusal patterns and content filters are systematically exploitable, especially when adversaries manipulate context, intent framing, or interaction dynamics.  (Figure 1 illustrates an example of jailbreaking by prompt crafting)

Figure 1-prompts that jailbreak the target model [1]

Recent work highlights a shift from static prompt tricks to systematic and automated attack strategies including:

1. Multi‑turn and iterative jailbreaking

2. Composite and hybrid attacks

3. Distribution‑shift and calibration‑based attacks

4. Out‑of‑distribution (OOD) transformations

This blog is mainly based on the following references:

About the author

ScientificLeader | France
PhD in autonomous navigation from UNSW; led AI‑based IoT behavior prediction and drone navigation projects at Capgemini. Now Scientific Leader at SogetiLabs, driving healthcare AI research, proposing solutions, monitoring progress, and building internal and external partnerships.

Leave a Reply

Your email address will not be published. Required fields are marked *

Slide to submit