This article builds on our earlier posts — Foundations of AI Red Teaming and How AI Red Teaming Works: Methods, Tools, and Real-World Testing — which explore the basics and practical applications of red teaming. In this third part, we examine the key challenges, limitations, and open questions facing the field today.
One of the fundamental limitations is that it’s virtually impossible to identify every potential failure mode in an AI system. These models are highly complex and can respond to prompts in an almost infinite number of ways. A system might behave correctly across a thousand interactions and then fail on the one-thousand-and-first, triggered by an unforeseen prompt or context. It is impossible to certify that a model is completely safe even after extensive red teaming. As a result, red teaming significantly reduces risk but doesn’t abolish it. The evolving nature of AI usage and the ingenuity of users, benign or malicious, mean developers must remain constantly vigilant. Updates, patches and new safeguards often follow even well-executed red teaming as new vulnerabilities emerge post-deployment.
Another core challenge is the ongoing race between offense and defense. As red teams discover new exploits, such as jailbreak prompts or prompt injection techniques, developers work quickly to patch those issues. But with every fix, new vulnerabilities can emerge, especially as models become more powerful or are fine-tuned for new tasks. Critics argue that especially dangerous AI capabilities might not be worth deploying at all, given that determined actors might always find a way around safeguards. However, supporters of red teaming maintain that it is better to engage in this race under controlled conditions than to leave systems open to unchecked exploitation. What this underscores is that red teaming is not a one-off procedure, but an ongoing commitment. Every new model release, fine-tuning update or application context may require renewed testing.
A third issue is the lack of standardization. Until recently, there has been little consensus on how red teaming should be conducted. Different organizations often employ different methods, with varied emphasis, some focusing on social harms like bias and toxicity, others on security threats like prompt injection. Without a common benchmark or agreed-upon methodology, comparing claims such as “Model A is safer than Model B” is difficult. Experts have increasingly called for standardized practices and shared frameworks for red teaming AI. While efforts led by agencies like NIST and J-AISI are underway to create unified guidelines, developers still face basic questions: how extensive should red teaming be? What types of tests should be included? How should results be documented or disclosed, especially if they reveal potentially exploitable flaws?
Another complication lies in defining the scope and nature of red teaming. Unlike traditional cybersecurity testing, where threats are well-defined (e.g., SQL injection, buffer overflow), AI red teaming must tackle with a much broader range of risks. Teams must decide what constitutes a meaningful failure: just dangerous outputs like inciting violence, or also factual inaccuracy, subtle bias or hallucinations? With general-purpose models, the threat landscape can be massive. If the scope is too narrow, key issues might be missed, too broad and testing becomes unwieldy. Moreover, how adversarial should testers be? Should they attempt realistic misuse scenarios or push the model in ways typical users never would? These choices shape the outcomes and relevance of red team efforts. As researchers at Carnegie Mellon University have pointed out, the field is still developing a clear definition of what “AI red teaming” entails and how it aligns with or diverges from classic red teaming practices in cybersecurity.
A further challenge is building red teams with the right mix of expertise and perspectives. Effective red teaming often requires input from across disciplines (AI engineers, security researchers, psychologists, ethicists and domain-specific experts). For example, identifying subtle forms of social bias requires cultural insight, while testing for misinformation might require subject-matter expertise in medicine, law or politics. Assembling such diverse teams is resource-intensive and even well-resourced companies can struggle to avoid blind spots. Homogeneous teams may overlook risks that would be obvious to others. This also raises ethical and governance questions: should red teaming be an internal-only process or should independent external groups, including civilians, be allowed to test and evaluate models? Transparency advocates argue that broader participation helps ensure fairness and accountability and that red teaming shouldn’t be the sole responsibility of the companies developing the technology.
Closely tied to this is the issue of disclosure. When red teams discover a serious vulnerability, say, a reliable method to get a model to output dangerous instructions, there’s often a debate about who should be informed and when. On one hand, responsible disclosure to the broader AI community can lead to improved safeguards and shared best practices. On the other hand, revealing too much too soon could allow malicious actors to exploit the vulnerability before it’s patched. AI companies often err on the side of caution, withholding specific prompt chains or attack methods until mitigations are in place. For instance, OpenAI initially withheld exact jailbreak examples but later shared them after fixes were implemented. Striking the right balance between transparency and risk management remains an open challenge and the field continues to search for norms and best practices in this area.
Finally, there’s the question of how red teaming impacts the usability of AI systems. Fixes applied after red team testing can sometimes lead to overly cautious models, ones that refuse safe queries or exhibit reduced performance on legitimate tasks. This has led to user frustration in some cases, where systems are seen as overly filtered or too restricted in their capabilities. Companies are exploring nuanced solutions, such as letting models explain refusals or creating tiered access for research and high-trust users, but these too involve trade-offs. What’s clear is that red teaming decisions are not just technical, they shape the user experience, the model’s behavior and public perceptions of fairness and freedom.
Despite these challenges, the consensus in the industry is that doing nothing is not an option. The answer is to do red teaming better, not to skip it. Organizations are investing in research to automate parts of red teaming, developing better evaluation benchmarks and training more people in this new discipline of AI adversarial testing. The hope is that over time, AI red teaming will become more standardized and effective, just as software security testing has improved over the years.