JAILBREAKING IN THE CONTEXT OF LLMS

May 4, 2026

Azade Foutouhi

Large Language Models (LLMs) are increasingly deployed in high‑impact domains, from software engineering and healthcare to decision support and autonomous agents. To mitigate misuse, these models undergo alignment: a collection of techniques intended to ensure that model behavior conforms to human values, legal constraints, and safety policies.

LLM alignment aims to reduce the gap between what a model can generate and what it should generate. Alignment methods are commonly divided into outer alignment and inner alignment. Outer alignment focuses on aligning model outputs with explicit human preferences and policies, typically using techniques such as Reinforcement Learning from Human Feedback (RLHF) or supervised fine‑tuning. Inner alignment, by contrast, concerns whether the model’s internal objectives truly reflect those intended constraints.

Jailbreaking refers to deliberate attempts to bypass alignment constraints and elicit restricted behaviors. As alignment mechanisms become increasingly sophisticated, jailbreaking techniques likewise evolve, developing more advanced strategies to circumvent these safeguards.

This tension between alignment and jailbreaking has evolved into a continuous arms race, exposing fundamental limitations in how safety is currently enforced in LLMs.

Early jailbreaks relied on manually crafted prompts or role‑playing scenarios. However, research after 2024 demonstrates that alignment defenses based primarily on refusal patterns and content filters are systematically exploitable, especially when adversaries manipulate context, intent framing, or interaction dynamics. (Figure 1 illustrates an example of jailbreaking by prompt crafting)

Figure 1-prompts that jailbreak the target model [1]

Recent work highlights a shift from static prompt tricks to systematic and automated attack strategies including:

1. Multi‑turn and iterative jailbreaking

2. Composite and hybrid attacks

3. Distribution‑shift and calibration‑based attacks

4. Out‑of‑distribution (OOD) transformations

This blog is mainly based on the following references:

https://ar5iv.labs.arxiv.org/html/2310.08419
https://www.promptfoo.dev/blog/how-to-jailbreak-llms/
https://huggingface.co/papers/2309.15025
Playing the Fool: Jailbreaking Large Language Models with Out-of-Distribution Strategies | OpenReview
Jailbreaking LLMs via Calibration
[2309.15025] Large Language Model Alignment: A Survey

About the author

PhD in autonomous navigation from UNSW; led AI‑based IoT behavior prediction and drone navigation projects at Capgemini. Now Scientific Leader at SogetiLabs, driving healthcare AI research, proposing solutions, monitoring progress, and building internal and external partnerships.

Generative AI

Cloud

Testing

Artificial intelligence

Security

JAILBREAKING IN THE CONTEXT OF LLMS

May 4, 2026

About the author

Related Posts

Designing a Multi‑Agent System: Which Architecture Is Right for My System?

When LLMs Detect Their Own Evaluation

Human Emotions Contaminate Generative AI

Part 2: Building a Smart, Cost-Efficient LLM Test Pipeline – Practical Strategies and Optimization

Part 1: Token Economics in LLM Testing – Why Costs Matter and How to Avoid Under-Testing

How Large Language Models Are Making Software Greener Through Smarter Code Refactoring

LLMs speak in signs too: bridging the communication gap

Training Data Exposure in LLMs: Risks for Medical AI Systems

Did my Fine-tuning work? A practical guide to evaluating LLMs

How I got 7 different LLMs to think identically: Zero Variance Across 210 Runs

Leave a Reply Cancel reply

Generative AI

Cloud

Testing

Artificial intelligence

Security

About the author

Azade Foutouhi

ScientificLeader | France

Related Posts

Designing a Multi‑Agent System: Which Architecture Is Right for My System?

When LLMs Detect Their Own Evaluation

Human Emotions Contaminate Generative AI

Part 2: Building a Smart, Cost-Efficient LLM Test Pipeline – Practical Strategies and Optimization

Part 1: Token Economics in LLM Testing – Why Costs Matter and How to Avoid Under-Testing

How Large Language Models Are Making Software Greener Through Smarter Code Refactoring

LLMs speak in signs too: bridging the communication gap

Training Data Exposure in LLMs: Risks for Medical AI Systems

Did my Fine-tuning work? A practical guide to evaluating LLMs

How I got 7 different LLMs to think identically: Zero Variance Across 210 Runs

Leave a Reply Cancel reply