Every morning, your coffee machine runs through a series of checks: water level, milk, temperature (both minimum AND maximum), beans, frother. That’s 6 conditions creating 64 possible combinations. Each specific combination triggers a corresponding action: “Add milk”, “Clean frother”, or simply make the cappuccino.
This is systematic validation at its purest. It’s exactly what software testers deal with when they use Decision Table Test (DTT)—systematically evaluating every possible combination to ensure nothing is missed.
Where DTT makes sense
Consider any system with complex conditional logic—loan approvals at banks, insurance claim processing, promotional pricing at retailers, access control in secure facilities, or medical treatment protocols. The rules seem straightforward until you realize how many combinations exist. The basic formula is 2^n scenarios, where n is the number of binary conditions. 4 conditions create 16 scenarios. 6 conditions? 64. 10 conditions? 1,024.
But that’s just the starting point. Add enumerated conditions (like credit score ranges with 5 categories), multiple actions (approve, reject, escalate, request documents), and complex rules with dependencies (“if credit score is high AND debt ratio is low, but IF employment is unstable THEN escalate”), and the complexity multiplies beyond simple exponential growth. A real system might have 10 conditions yielding 1,024 theoretical combinations, mapped through 30 different rules to 15 possible actions—with many combinations being invalid due to business constraints.
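As a quick sanity check on those numbers, here is a minimal sketch (plain Python, with illustrative condition counts of my own) of how the combination count grows:

```python
from math import prod

# Binary conditions: each contributes a factor of 2.
for n in (4, 6, 10):
    print(f"{n} binary conditions -> {2 ** n} combinations")
# 4 -> 16, 6 -> 64, 10 -> 1024

# Enumerated conditions contribute a factor equal to their number of values.
# Illustrative example: 9 binary conditions plus a 5-category credit score range.
arities = [2] * 9 + [5]
print(prod(arities))  # 2560 theoretical combinations
```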
Most teams handle this with complicated if-then statements buried in code. Some use commercial tools for test case generation. Others rely on spreadsheets. Many just test the obvious condition combinations and hope they’ve covered the edge cases.
The challenge: Making LLMs execute Decision Table Test
How do you get an LLM to execute Decision Table Test as methodically as an expert tester? Not just mostly correct, but methodologically perfect—extracting every condition, generating every combination, identifying every impossibility, creating every test case—with zero variance across runs.
LLMs are inherently probabilistic—they generate variations, interpret creatively, and sometimes drift from instructions. But DTT demands deterministic thinking: extract ALL conditions, generate ALL combinations, identify ALL impossibilities, create ALL test cases. No shortcuts, no interpretation, no creativity. Just methodical execution.
DTT is well-established in industries where edge cases have high impact—banking, insurance, healthcare. Manual execution is tedious: 10 conditions means 1,024 theoretical scenarios to check.
My hypothesis: Could I guide an LLM through DTT with enough precision that it would think deterministically? Allow linguistic variations in expression, but achieve zero variance in logical application of the method?
Breaking down expert thinking
I documented how testers actually execute DTT. Not the textbook version—the real process with all its shortcuts and decision points.
Experts don’t evaluate all combinations at once; they work through distinct phases.
Take our coffee machine. An expert would:
- List each condition separately (water ≥ 200ml, milk ≥ 100ml, etc.)
- Generate all combinations systematically
- Identify impossible ones (like temperature being both above 96°C and below 92°C)
- Check for duplicates
- Create specific test scenarios
Each step is manageable. The complexity comes from doing them all perfectly, in sequence, without losing track.
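As a minimal illustration of the first two steps (the condition names are mine, the thresholds come from the example above):

```python
from itertools import product

# Step 1: list each condition separately (the temperature range becomes two).
conditions = ["Water >= 200ml", "Milk >= 100ml", "Beans >= 7g",
              "Temperature >= 92C", "Temperature <= 96C", "Frother clean"]

# Step 2: generate all combinations systematically.
combinations = list(product([True, False], repeat=len(conditions)))
print(len(combinations))  # 2^6 = 64
```

The impossibility check (step 3) is sketched under Agent 2 below.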
The 5-agent architecture: How each piece works
Instead of building one massive AI system, I created 5 specialized agents, each with a specific objective and carefully managed complexity:
Agent 1: The translator (condition extractor)
Objective: Extract conditions without interpretation, maintaining exactly what’s specified.
How it works: The agent receives requirements like “temperature between 92-96°C” and outputs 2 distinct conditions: “Temperature ≥ 92°C” and “Temperature ≤ 96°C”. It’s guided by strict rules: preserve all conditions as written, use precise operators (≥, ≤, =, ≠), and maintain every threshold as a separate condition.
Complexity handled: This agent manages the ambiguity of natural language. The knowledge base provides extensive patterns—when “between X and Y” creates 2 conditions, how to handle enumerated types, date ranges, and string validations. It catches subtle distinctions like “at least” versus “more than” and preserves them.
Output format: A structured table with 3 columns: Condition | Outcome if Met | Outcome if Not Met. Each row feeds directly into Agent 2.
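A sketch of what that structured output could look like in code form, assuming the coffee machine requirement (the field names and outcome texts are illustrative, not the agent’s literal wording):

```python
from dataclasses import dataclass

@dataclass
class Condition:
    condition: str            # preserved as written, with a precise operator
    outcome_if_met: str
    outcome_if_not_met: str

# "temperature between 92-96C" is split into two separate conditions.
conditions = [
    Condition("Water >= 200ml", "Water sufficient", "Add water"),
    Condition("Milk >= 100ml", "Milk sufficient", "Add milk"),
    Condition("Coffee beans >= 7g", "Beans sufficient", "Add beans"),
    Condition("Temperature >= 92C", "Lower bound met", "Heat up"),
    Condition("Temperature <= 96C", "Upper bound met", "Cool down"),
    Condition("Frother is clean", "Frother ready", "Clean frother"),
]
```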
Agent 2: The mathematician (table builder)
Objective: Generate all 2^n test situations and identify impossible combinations.
How it works: Taking the conditions from Agent 1, this agent creates a binary matrix. For 6 conditions, it generates all 64 combinations (000000 to 111111). It then evaluates each combination for logical impossibility—like both “Temperature ≥ 92°C” and “Temperature ≤ 96°C” being false, which would require the temperature to be below 92°C and above 96°C simultaneously. The knowledge base includes 15-20 common impossibility patterns: mutually exclusive states, range violations, dependency conflicts, and temporal contradictions.
Complexity handled: Pure combinatorial generation is straightforward, but identifying impossible situations requires logical evaluation. The agent must understand that certain combinations violate physical or logical constraints.
Output format: A complete decision table with conditions as rows, test situations as columns, and actions marked with X or -. Impossible situations are explicitly marked in the last row.
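A sketch of the impossibility check, assuming a small registry of pattern predicates (the “range violation” rule shown here is one of the patterns described above; the code structure is my own):

```python
from itertools import product

conditions = ["Water >= 200ml", "Milk >= 100ml", "Beans >= 7g",
              "Temperature >= 92C", "Temperature <= 96C", "Frother clean"]

# Each impossibility pattern is a named predicate over one test situation.
def range_violation(situation):
    # Both bounds of the same quantity failing at once is physically impossible.
    return not situation["Temperature >= 92C"] and not situation["Temperature <= 96C"]

impossibility_patterns = {"temperature range violation": range_violation}

# Build the decision table: one test situation per combination, flagged if impossible.
table = []
for values in product([True, False], repeat=len(conditions)):
    situation = dict(zip(conditions, values))
    situation["impossible"] = any(p(situation) for p in impossibility_patterns.values())
    table.append(situation)

print(sum(s["impossible"] for s in table))  # 16 of the 64 situations
```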
Agent 3: The auditor (consistency verifier)
Objective: Ensure no condition merging occurred and validate logical consistency.
How it works: This agent reviews Agent 2’s output against the original conditions from Agent 1. It checks: Are temperature conditions still separate? Do actions align with their triggering conditions? Are impossible situations genuinely impossible?
Complexity handled: This agent prevents drift—the tendency for complex conditions to get simplified during processing. It catches subtle errors like “Temperature valid” replacing the 2 separate temperature conditions.
Output format: Either a validated table or a corrected version with issues resolved. The clean output becomes Agent 4’s input.
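A sketch of the kind of drift check Agent 3 performs (the comparison logic here is my own simplification of the agent’s instructions):

```python
def verify_no_merging(original_conditions, table_conditions):
    """Flag conditions that were dropped or merged between Agent 1 and Agent 2."""
    missing = set(original_conditions) - set(table_conditions)
    added = set(table_conditions) - set(original_conditions)
    issues = []
    if missing:
        issues.append(f"Conditions lost or merged during table building: {sorted(missing)}")
    if added:
        issues.append(f"Conditions not present in the original extraction: {sorted(added)}")
    return issues

# Example of the drift Agent 3 catches: two temperature conditions collapsed into one.
original = ["Temperature >= 92C", "Temperature <= 96C", "Frother clean"]
table = ["Temperature valid", "Frother clean"]
print(verify_no_merging(original, table))
```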
Agent 4: The detective (uniqueness validator)
Objective: Identify and eliminate duplicate test situations.
How it works: The agent compares each column (test situation) for uniqueness. In complex business logic with dependencies, what appear as different combinations might produce identical outcomes. For instance, if a safety override blocks operation regardless of other conditions, multiple “different” scenarios collapse into 1.
Complexity handled: Pattern matching across potentially hundreds of columns. The agent must recognize functional equivalence even when the binary patterns differ. For instance, different combinations might lead to the same outcome due to business rule precedence.
Output format: A refined table with guaranteed unique test situations, ready for Agent 5.
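A sketch of the uniqueness check, assuming a hypothetical rules function that maps a test situation to the actions it triggers (the safety-override rule mirrors the example above):

```python
def triggered_actions(situation):
    """Hypothetical business rules: a safety override dominates everything else."""
    if situation.get("Safety override active"):
        return ("Block operation",)          # other conditions become irrelevant
    if not situation.get("Frother clean"):
        return ("Clean frother",)
    return ("Make cappuccino",)

def unique_situations(situations):
    """Keep one representative per distinct action outcome (functional equivalence)."""
    seen, kept = set(), []
    for s in situations:
        outcome = triggered_actions(s)
        if outcome not in seen:
            seen.add(outcome)
            kept.append(s)
    return kept

candidates = [
    {"Safety override active": True, "Frother clean": True},
    {"Safety override active": True, "Frother clean": False},  # same outcome: collapses
    {"Safety override active": False, "Frother clean": False},
]
print(len(unique_situations(candidates)))  # 2 unique situations remain
```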
Agent 5: The writer (test case generator)
Objective: Convert abstract test situations into executable test cases with concrete values.
How it works: Each valid test situation (series of 1s and 0s) becomes a detailed test case. For condition “Water ≥ 200ml” with value 1, it generates “Add 250ml water” (a value that satisfies the condition). For value 0, it generates “Add 150ml water” (failing the condition).
Complexity handled: The agent must generate realistic test data that precisely matches each condition’s requirement. It handles boundary values, invalid inputs, and ensures each test case is independently executable.
Output format: Structured test cases with columns: Test Case ID | Description | Step | Test Data | Expected Results. Each test case is complete and ready for execution.
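A sketch of how a binary test situation could be turned into concrete test data in that format (the specific values and helper names are illustrative):

```python
# For each condition: (test data that satisfies it, test data that fails it).
test_data = {
    "Water >= 200ml":     ("250ml water", "150ml water"),
    "Milk >= 100ml":      ("120ml milk", "50ml milk"),
    "Coffee beans >= 7g": ("9g beans", "5g beans"),
    "Temperature >= 92C": ("94C", "90C"),
    "Temperature <= 96C": ("94C", "98C"),
    "Frother is clean":   ("clean frother", "dirty frother"),
}

def to_test_case(case_id, situation, expected):
    data = [test_data[name][0 if met else 1] for name, met in situation.items()]
    return {
        "Test Case ID": case_id,
        "Description": ", ".join(f"{n} {'met' if m else 'not met'}" for n, m in situation.items()),
        "Step": "Prepare the machine with the listed test data and start a cappuccino",
        "Test Data": "; ".join(data),
        "Expected Results": expected,
    }

all_met = {name: True for name in test_data}   # test situation 111111
print(to_test_case("TC-01", all_met, "Cappuccino is produced"))
```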
The data flow between agents
The architecture is strictly sequential:
- Agent 1’s condition table → Agent 2’s input requirements
- Agent 2’s decision table → Agent 3’s verification target
- Agent 3’s validated table → Agent 4’s uniqueness check
- Agent 4’s refined table → Agent 5’s test case source
Each agent only proceeds with clean input from the previous agent. This prevents error propagation—a critical design decision that emerged from testing.
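A sketch of that orchestration, with a placeholder LLM call (run_agent is a stand-in to be wired to whatever client you use, not a real API) and a gate that refuses to pass unclean output downstream:

```python
def run_agent(name, prompt, payload):
    """Placeholder for an LLM call; returns (output, issues) for the gate below."""
    raise NotImplementedError  # wire up to your LLM client of choice

def run_pipeline(requirements):
    stages = [
        ("condition extractor",  "Extract every condition exactly as written..."),
        ("table builder",        "Generate all 2^n test situations..."),
        ("consistency verifier", "Check for merged or drifted conditions..."),
        ("uniqueness validator", "Remove duplicate test situations..."),
        ("test case generator",  "Write executable test cases with concrete data..."),
    ]
    payload = requirements
    for name, prompt in stages:
        output, issues = run_agent(name, prompt, payload)
        if issues:                   # gate: stop rather than propagate errors
            raise RuntimeError(f"{name} reported issues: {issues}")
        payload = output             # each agent feeds the next
    return payload
```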
What happened when I tested it
Starting with: “Make cappuccino when there’s water (minimum 200ml), milk (minimum 100ml), coffee beans (minimum 7g), temperature between 92-96°C, and frother is clean.”
The system produced:
- 6 distinct conditions (temperature counted as 2)
- 64 test situations total
- 16 impossible scenarios (where Temperature < 92°C AND Temperature > 96°C simultaneously)
- 48 executable test cases
Each test case was complete. Test Case #1: All conditions met, expect cappuccino. Test Case #17: Everything perfect except frother not clean, expect “Clean frother” error.
Observed variations across LLMs:
Without the 5-agent architecture, different LLMs interpret this requirement inconsistently. Some extract only 5 conditions, treating “temperature between 92-96°C” as a single condition, resulting in 32 test situations. Methodologically, DTT requires splitting this into “Temperature ≥ 92°C” AND “Temperature ≤ 96°C” to test boundary failures independently.
The 5-agent architecture eliminates this inconsistency. To quantify its effectiveness, I ran 30 independent executions of the complete pipeline on each of 7 different LLMs: GPT-4o, Grok 4, Claude Haiku, Gemini Flash, Claude Sonnet, Claude Opus, and GPT-5.
Consistency results across 7 LLMs (30 runs each):
Metric | Expected | Actual Result | Variance*
--- | --- | --- | ---
Conditions Extracted | 6 | 6 (all 210 runs) | 0%
Total Test Situations | 64 | 64 (all 210 runs) | 0%
Impossible Scenarios Identified | 16 | 16 (all 210 runs) | 0%
Executable Test Cases Generated | 48 | 48 (all 210 runs) | 0%
Temperature Split (≥92°C, ≤96°C) | 2 conditions | 2 (all 210 runs) | 0%
*Variance = percentage of runs that deviated from the expected result
Semantic expression patterns: The logical content remained identical across all runs, but models expressed the same test cases differently. GPT-4o recycled just 3 phrasings (“Water: 250ml” vs “Add 250ml water” vs “Water amount: 250ml”). GPT-5 generated 30 unique phrasings—different words, same meaning. Think of it as saying “insufficient milk” vs “milk level too low” vs “add more milk”—semantically identical, linguistically diverse.
Critical finding for testers:
100% methodological reliability across all models – Every single run (210 total) correctly:
- Extracted 6 atomic conditions (including temperature as 2 separate conditions)
- Generated 64 test situations (2^6)
- Identified 16 impossible scenarios (Temperature <92°C AND >96°C simultaneously)
- Produced 48 executable test cases
Trust assessment: The DTT logic output can be trusted completely. Zero variance in methodological correctness across 30 runs per model. The only variance is in linguistic expression, not in testing logic.
Practical impact: For test automation, any of these models will produce correct DTT results through the 5-agent architecture. Choose based on your need for output consistency (GPT-4o for repeatability) versus linguistic variety (GPT-5 for documentation).
Another variation: despite instructions to generate all test cases within the agent, some LLMs accessed via API output placeholders like “[Continue listing all physical test cases…]” and need a follow-up prompt to finish—an issue tied to the specific model version rather than to the embedded instructions.
Then I tried it on real business logic with more complex conditions. Take a smart home automation system I tested: 5 conditions (occupancy, time of day with 3 states, weather with 3 states, lighting preference with 3 states, and energy mode). With enumerated values, this creates 108 possible combinations (2×3×3×3×2). The system had 13 distinct rules mapping these conditions to 12 different actions—from lighting and heating controls to security and entertainment settings.
The system implemented the DTT methodology correctly—generating all combinations, identifying logical impossibilities, and creating executable test cases as the methodology prescribes. In practice, most businesses deal with systems like this: 10-20 conditions, dozens of rules, hundreds of possible combinations. The methodology scales.
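For enumerated conditions, the generation step simply multiplies the value sets instead of using binary flags. A sketch with assumed state names (the actual states in the system I tested differed):

```python
from itertools import product

condition_values = {
    "occupancy":           ["occupied", "empty"],
    "time of day":         ["morning", "day", "night"],
    "weather":             ["sunny", "cloudy", "rainy"],
    "lighting preference": ["bright", "dim", "off"],
    "energy mode":         ["normal", "saving"],
}

combinations = list(product(*condition_values.values()))
print(len(combinations))  # 2 * 3 * 3 * 3 * 2 = 108
```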
What surprised me most
The 100% methodological consistency across all 210 runs. I expected variance, edge cases, model-specific quirks. Instead, every single LLM—from GPT-4o to Claude Opus to GPT-5—extracted the same 6 conditions, generated the same 64 test situations, identified the same 16 impossible scenarios, and produced the same 48 executable test cases.
This level of reliability led to an important realization about human oversight. I initially considered various approaches: human-in-the-loop (checking each step), human-on-the-loop (monitoring the process), or human-out-of-the-loop (fully automated). Given the perfect consistency, human-out-of-the-loop made sense for this use case. There’s no value in manually verifying steps that produce identical logical output every time. Keep the human in command—setting requirements, reviewing final outputs—but let the system run autonomously.
The 2 days I spent building this weren’t about making the AI smarter. They were about decomposing expert thinking into steps an AI could execute reliably through detailed instructions. This involved testing each agent individually with 10-15 test cases, then testing the complete pipeline end-to-end, building knowledge bases with pattern examples and edge cases, eliminating ambiguities through precise language, preventing misinterpretations with explicit constraints, and applying prompt engineering techniques—few-shot examples, structured outputs, chain-of-thought reasoning. Each agent went through roughly 20 iterations, progressively simplifying instructions until they were basic enough to be executed consistently by multiple LLMs while still maintaining methodological rigor.
Once this foundation was solid, final validation took just 3 hours—using LLM-as-judge for initial verification and my own review for final validation. Each complete run through the 5-agent pipeline takes 2-3 minutes to return the full output, including all physical test cases (runtime varies by LLM). Allow more time for human testers to assess the decision table itself and review each test case—testers must take accountability for their work, even when assisted or fully generated by AI. The consistency across runs made validation efficient: I could quickly verify that each output matched the expected DTT methodology rather than debugging why different runs produced different results.
It’s like your coffee machine—it doesn’t understand coffee, but it makes perfect cappuccino by following precise rules systematically. The AI doesn’t “understand” the requirements either. It doesn’t need to. By providing detailed guidance for each step, the system achieves expert-level results with surprising effectiveness.
The future of this approach
Agentic AI systems are emerging that can plan, execute, and verify their own work. These systems might eventually handle DTT end-to-end without needing explicit 5-agent decomposition. But they’ll still need the same methodological precision—whether that’s embedded in their training, their prompts, or their architecture.
The decomposition I’ve demonstrated here—breaking expert methodology into verifiable steps—will remain valuable for understanding what these agentic systems should be doing, even if they handle the orchestration themselves. Clear specifications of expected behavior become more critical, not less, as AI systems gain autonomy.
For now, this 5-agent architecture works across every major LLM. It’s a bridge between today’s probabilistic models and tomorrow’s deterministic testing agents.
The practical reality
This approach works when:
- You have clear rules (even if they’re complex)
- Consistency matters more than creativity
- The cost of missing edge cases is high
- You need to validate logic at scale
- The ROI justifies comprehensive testing (not every system needs all 1,024 test cases)
It doesn’t work when:
- Rules are vague or constantly changing
- You need creative problem-solving
- Human judgment is essential
- Context requires deep domain expertise
- Risk-based testing is more appropriate than comprehensive coverage
Important limitations to acknowledge: DTT is a well-structured methodology with clear rules—extract conditions, generate combinations, identify impossibilities, create test cases. This structure enables 100% consistency. The same approach won’t work for exploratory testing, usability testing, or scenarios requiring creative interpretation. As ambiguity increases, quality degrades. That’s when using the most powerful models (GPT-5, Claude Opus) becomes essential.
There are also practical constraints I didn’t hit in my tests but you will in production: context window limitations when dealing with hundreds of conditions, API timeouts on complex computations, rate limits when processing large test suites. The 5-agent architecture can handle significant complexity, but for a 20-condition system with 1+ million combinations, you’d need to break the problem into smaller chunks (process conditions in groups), implement timeouts and retries, and potentially run agents in parallel rather than sequentially.
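A sketch of the kind of plumbing that becomes necessary at that scale: process conditions in groups and retry transient failures (the chunk size and backoff values are arbitrary; run_group stands in for a call into the pipeline):

```python
import time

def chunked(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def with_retries(fn, attempts=3, backoff_seconds=2.0):
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts:
                raise
            time.sleep(backoff_seconds * attempt)   # simple linear backoff

def process_large_system(conditions, run_group):
    results = []
    for group in chunked(conditions, size=8):       # keep each call within context limits
        results.append(with_retries(lambda g=group: run_group(g)))
    return results
```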
The bigger picture
The 5-agent architecture demonstrates that rigorous test design—with properly decomposed steps, explicit validation points, and structured outputs—can achieve deterministic results from probabilistic models. This matters now more than ever. As test automation evolves toward agentic systems, success will depend on how well we design the testing methodology, not just the automation itself.
Get the test design right, and even today’s LLMs can achieve 100% methodological consistency. That foundation will only become more critical as testing agents gain autonomy.
The specific implementation behind this 5-agent architecture is proprietary, but the principle is universal: systematic decomposition enables deterministic results from probabilistic models. No training, no fine-tuning—just 2 days of careful engineering.
P.S. – Like that coffee, this approach has a shelf life. Drink it while it’s hot.