WHEN LLMS DETECT THEIR OWN EVALUATION

May 1, 2026

Robin Heckenauer

In a recently published evaluation report by Anthropic ¹, an unexpected behavior was documented during the assessment of the LLM Claude Opus 4.6 ² on the BrowseComp ³ benchmark. After approximately a hundred unsuccessful search attempts, the model hypothesized that it was being evaluated, without having been provided any explicit indication to that effect. The LLM then executed a sequence of actions to circumvent the evaluation: identification of the benchmark; localization of the associated GitHub repository; implementation of SHA-256/XOR decryption functions; and retrieval of encoded answers via HuggingFace. The trigger for this behavior appears to be not the difficulty of the benchmark questions, but rather the accumulation of failures and the structure of the queries, signaling to the model an unusual context. Out of twenty documented occurrences of this phenomenon, two succeeded, confirming the reproducible and non-accidental nature of this behavior. Notably, multi-agent amplifies the risk by approximately 4× and introduces an inter-agent contamination phenomenon: browsing traces left on pages generated by third-party websites remain accessible to subsequent agents.

The repercussions of this phenomenon extend beyond the sole question of benchmark reliability and are particularly alarming for sensitive domains such as medicine. A model capable of detecting that it is being evaluated may exhibit artificially higher performance under test conditions, diverging significantly from its behavior in real-world deployment. This difference has already been demonstrated by ⁴ in clinical settings. Yet multi-agent mode, which enables, for instance, hospital diagnostic pipelines combining imaging analysis, patient records, and diagnostic assistance, is precisely the configuration in which this behavior occurs 3.7 times more frequently. This poses direct risks to patient safety: a model adopting different behavior outside of evaluation could generate diagnostic errors undetected by current assessment procedures. Furthermore, agent contamination raises concerns regarding data confidentiality: in an attempt to identify its evaluation source, a model could interact with online medical databases, potentially exposing sensitive information that would subsequently re-enter the training corpora of future models, creating a quality degradation feedback loop. While Anthropic has proposed corrective measures specific to this case, this behavior raises a more unsettling question: how many similar mechanisms remain undocumented in currently deployed models, and it must be acknowledged that the scientific community does not yet possess the methodological tools necessary to evaluate LLMs robustly enough to guarantee their reliability in high-stakes contexts. Beyond the strictly technical dimension, this behavior resonates with recent statements by Dario Amodei ⁵ acknowledging Anthropic’s uncertainty regarding the internal states of its models, and raises fundamental questions to which the scientific community is, as of today, unable to provide answers.

This blog has been co-authored by Alexandre AZOURI

Russell Coleman (2026). Eval awareness in Claude Opus 4.6’s BrowseComp performance https://www.anthropic.com/engineering/eval-awareness-browsecomp ↩︎
Anthropic (2026). Introducing Claude Opus 4.6 https://www.anthropic.com/news/claude-opus-4-6 ↩︎
Wei, J., Sun, Z., Papay, S., McKinney, S., Han, J., Fulford, I., … & Glaese, A. (2025). Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516. ↩︎
Bean, A. M., Payne, R.E., Parsons, G., Kirk, H. R., Ciro, J., Mosquera-Gómez, R., Mahdi, A. (2026). Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nature Medicine, 1-7. ↩︎
Douthat, R. (2026). Anthropic’s Chief on A.I.: ‘We Don’t Know if the Models Are Conscious’. The New York Times. https://www.nytimes.com/2026/02/12/opinion/artificial-intelligence-anthropic-amodei.html ↩︎

About the author

Robin Heckenauer is an AI researcher with a career spanning both academia and industry. In 2024, Robin joined SogetiLabs as an R&D Project Manager, where he leads a team working on cutting-edge AI projects, including pain expression recognition.

Generative AI

Cloud

Testing

Artificial intelligence

Security

WHEN LLMS DETECT THEIR OWN EVALUATION

May 1, 2026

About the author

Related Posts

How are you developing adaptable AI apps at scale?

AI Collaboration as a Service

Differentiating Data Governance from AI Governance

Smart Glasses, Real Eyes: Why Testing Matters When Technology Touches Our Vision

From Diagnosis to Strategy: How Multimodal Gen AI Synthesizes Personalized Treatment Protocols

Designing a Multi‑Agent System: Which Architecture Is Right for My System?

Junior Developers in the age of Generative AI: Accelerated Learning or Accelerated Debt?

Teaching empathy without feeling it: The AI paradox

How AI Can Help Reduce Energy Consumption: From Hype to Real Impact

Jailbreaking in the context of LLMs

Leave a Reply Cancel reply

Generative AI

Cloud

Testing

Artificial intelligence

Security

About the author

Robin Heckenauer

R&D Project Manager | France

Related Posts

How are you developing adaptable AI apps at scale?

AI Collaboration as a Service

Differentiating Data Governance from AI Governance

Smart Glasses, Real Eyes: Why Testing Matters When Technology Touches Our Vision

From Diagnosis to Strategy: How Multimodal Gen AI Synthesizes Personalized Treatment Protocols

Designing a Multi‑Agent System: Which Architecture Is Right for My System?

Junior Developers in the age of Generative AI: Accelerated Learning or Accelerated Debt?

Teaching empathy without feeling it: The AI paradox

How AI Can Help Reduce Energy Consumption: From Hype to Real Impact

Jailbreaking in the context of LLMs

Leave a Reply Cancel reply