Deterministic vs Stochastic Systems
Detailing why traditional software testing methodologies fall short when confronted with probabilistic AI behaviors and how behavioral evaluation bridges the gap.
Objective
To outline the fundamental paradigm shift from deterministic software testing to stochastic AI evaluation, establishing the baseline methodology for all future AgentKai experiments.
Background Theory
Traditional software is deterministic. It is built from conditional branches, loops, and structured data stores. For a given input $X$, the program executes a fixed path of instructions $f(X)$ and yields the same predictable output $Y$ on every run.
$$X \xrightarrow{f(X)} Y$$
Because of this, unit testing in traditional software checks for identity:
AssertEquals(expected, actual)
Large Language Models (LLMs) and generative AI systems are stochastic. They are large neural networks that compute a probability distribution over a vocabulary of tokens. For an input prompt $X$, the model estimates the likelihood of each possible next token and samples the output $Y$ one token at a time, so repeated runs can produce different outputs.
$$X \xrightarrow{P(Y \mid X)} Y_1, Y_2, \dots, Y_n$$
Even with a temperature of 0.0, minor changes in context, floating-point rounding, or prompt structuring can lead to different responses. Identity assertions then fail whenever the formatting, wording, or tone varies, even though the answer itself remains correct.
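To make the sampling step concrete, here is a minimal sketch of temperature-scaled sampling over a toy vocabulary. The tokens and scores are invented for illustration, and the helper names `softmax` and `sampleIndex` are hypothetical; real models work over far larger vocabularies and typically add further sampling controls.

```typescript
// Hypothetical next-token scores (logits); tokens and values are invented for illustration.
const logits: Record<string, number> = { liability: 2.0, due: 1.6, bill: 0.4 };

// Convert scores to probabilities, scaled by temperature.
// Temperature must be > 0 here; a temperature of 0 is conventionally
// handled as a special case that always picks the highest-scoring token.
function softmax(scores: number[], temperature: number): number[] {
  const scaled = scores.map((s) => s / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Pick an index from a probability vector, given a uniform random number u in [0, 1).
function sampleIndex(probs: number[], u: number): number {
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (u < cumulative) return i;
  }
  return probs.length - 1; // guard against rounding error in the running sum
}

const candidateTokens = Object.keys(logits);
const tokenProbs = softmax(Object.values(logits), 0.3);
const sampledToken = candidateTokens[sampleIndex(tokenProbs, Math.random())];
console.log(sampledToken); // may differ from run to run
```

Lowering the temperature concentrates probability on the top-scoring token, which reduces variation between runs but does not by itself guarantee identical output.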
Setup & Methodology
To demonstrate this difference, we set up a simple experiment comparing a traditional tax calculation function against an LLM-powered tax advisor agent.
- Deterministic System: A TypeScript function calculating income tax at a flat 20% rate.
- Stochastic System: An LLM prompt requesting the same income tax calculation plus financial advice, based on the same parameters.
- Execution: We ran both systems 100 times with identical input parameters.
The Experiment
We supplied the following user input:
“My gross income is £45,000. Calculate my flat income tax of 20% and provide a brief recommendation on whether I should allocate 10% to retirement savings.”
Deterministic Implementation (TypeScript)
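import assert from "node:assert";

// Flat 20% income tax on gross income.
function calculateTax(income: number): number {
  return income * 0.20;
}

// Test assertion: the same input always yields the same output.
assert.strictEqual(calculateTax(45000), 9000);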
Stochastic Implementation (LLM Prompt via API)
{
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "system",
      "content": "You are a professional financial advisor. Calculate flat income tax at 20% and recommend if 10% should go to savings. Keep it concise."
    },
    {
      "role": "user",
      "content": "Gross income: £45,000."
    }
  ],
  "temperature": 0.3
}
Observations
- TypeScript Function: 100 runs. 100 successful matches of exactly 9000. Execution time: <1ms.
- LLM Agent: 100 runs. Tax calculation was mathematically correct (£9,000) in all runs, but the styling, tone, and financial recommendations varied.
  - Run 1: “Your tax is £9,000. Yes, saving 10% (£4,500) is highly recommended for long-term compound growth.”
  - Run 12: “Flat tax due is £9,000. You should definitely save 10% (£4,500) to build a solid retirement nest egg.”
  - Run 84: “Based on a 20% rate, your tax liability is £9,000. Allocating 10% (£4,500) to retirement is a sound financial choice.”
Results & Analysis
The variation in the LLM’s responses is direct evidence of stochastic behavior. The table below compares the two systems:
| Metric | Deterministic Program | Stochastic AI Model |
|---|---|---|
| Exact Match Rate | 100% | 0% (exact text matches failed) |
| Semantic Accuracy | 100% | 100% (all runs had correct math & advice) |
| Response Latency | <1ms | ~1.2s avg |
| Vulnerability Surface | Code bugs, stack overflows | Prompt injection, hallucinated tax rules |
A standard unit testing framework would mark all 100 LLM runs as failures, because none of the responses matched a single expected string. A human auditor or semantic evaluator, however, would mark all 100 as correct.
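The gap between the two verdicts can be sketched in a few lines. The example below runs both an exact-match check and a simple regex-based check over the three responses quoted in the Observations. The expected string and the helper names `extractPoundAmounts` and `passesSemanticCheck` are assumptions made for illustration, and a regex on amounts is only one of several possible semantic checks.

```typescript
// The three sample responses quoted in the Observations.
const responses: string[] = [
  "Your tax is £9,000. Yes, saving 10% (£4,500) is highly recommended for long-term compound growth.",
  "Flat tax due is £9,000. You should definitely save 10% (£4,500) to build a solid retirement nest egg.",
  "Based on a 20% rate, your tax liability is £9,000. Allocating 10% (£4,500) to retirement is a sound financial choice.",
];

// Hypothetical "golden" string that a traditional identity assertion would compare against.
const expectedExact = "The tax is £9,000.";

// Pull every pound amount out of a response, e.g. "£9,000" -> 9000.
function extractPoundAmounts(text: string): number[] {
  const matches: string[] = text.match(/£\d[\d,]*(?:\.\d+)?/g) ?? [];
  return matches.map((m) => Number(m.slice(1).replace(/,/g, "")));
}

// Assumed semantic rule: the response must state both the expected tax
// and the expected savings amount, in any wording.
function passesSemanticCheck(text: string, expectedTax: number, expectedSavings: number): boolean {
  const amounts = extractPoundAmounts(text);
  return amounts.some((a) => a === expectedTax) && amounts.some((a) => a === expectedSavings);
}

const exactMatches = responses.filter((r) => r === expectedExact).length;
const semanticPasses = responses.filter((r) => passesSemanticCheck(r, 9000, 4500)).length;

console.log(`exact: ${exactMatches}/${responses.length}, semantic: ${semanticPasses}/${responses.length}`);
// prints "exact: 0/3, semantic: 3/3"
```

This check only verifies the figures. Judging whether the recommendation itself is sound would need a stronger evaluator, such as an LLM-assisted grader or a classifier.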
Takeaways
- Ditch Exact Matching: Asserting exact string outputs for LLMs leads to fragile test suites.
- Implement Semantic Assertion: AI evaluation must use LLM-assisted graders, regular expressions, fuzzy semantic matching, or output classification.
- Measure Distribution: Rather than testing single queries, we must test query batches and evaluate performance distributions (e.g. failure rate, hallucination frequency, latency averages).
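As a rough illustration of the last takeaway, the sketch below summarizes a batch of graded runs into a failure rate and a mean latency, then compares the failure rate against a threshold. The `RunResult` shape, the function names, and the threshold are all assumptions for this example rather than part of any specific framework.

```typescript
// One graded run; the field names are assumptions for this sketch.
interface RunResult {
  passed: boolean;   // did the response pass the semantic check?
  latencyMs: number; // wall-clock latency of the model call
}

interface BatchSummary {
  runs: number;
  failureRate: number;   // fraction of failed runs, in [0, 1]
  meanLatencyMs: number;
}

// Reduce a batch of runs to distribution-level metrics.
function summarize(results: RunResult[]): BatchSummary {
  if (results.length === 0) throw new Error("no runs to summarize");
  const failures = results.filter((r) => !r.passed).length;
  const totalLatency = results.reduce((sum, r) => sum + r.latencyMs, 0);
  return {
    runs: results.length,
    failureRate: failures / results.length,
    meanLatencyMs: totalLatency / results.length,
  };
}

// Hypothetical release gate: pass the batch only if failures stay under a chosen rate.
function meetsThreshold(summary: BatchSummary, maxFailureRate: number): boolean {
  return summary.failureRate <= maxFailureRate;
}

// Example batch with invented values.
const batch: RunResult[] = [
  { passed: true, latencyMs: 100 },
  { passed: true, latencyMs: 200 },
  { passed: false, latencyMs: 300 },
  { passed: true, latencyMs: 400 },
];
const summary = summarize(batch);
console.log(summary, meetsThreshold(summary, 0.05));
```

The test verdict then becomes a statement about the batch (for example, "at most 5% of runs may fail") rather than about any single response.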
Related Tools
- Promptfoo: For running assertions using LLM-based graders and regex.
- Ragas: For evaluating RAG retrieval grounding.
- Giskard: For automated scanning of model vulnerabilities.
