LAB 001: Deterministic vs Stochastic Systems

Objective

To outline the fundamental paradigm shift from deterministic software testing to stochastic AI evaluation, establishing the baseline methodology for all future AgentKai experiments.

Background Theory

Traditional software is deterministic. It is built from conditional logic, loops, and structured data stores. For a given input $X$, the program executes a fixed sequence of instructions $f(X)$ and yields a single, predictable output $Y$.

$$X \xrightarrow{f(X)} Y$$

Because of this, unit tests in traditional software check for identity: AssertEquals(expected, actual).

Large Language Models (LLMs) and other generative AI systems are stochastic. They are large neural networks that compute a probability distribution over a vocabulary of tokens. For an input prompt $X$, the model estimates the likelihood of each possible next token and samples from that distribution, building the output $Y$ one token at a time, so repeated runs can yield different outputs.

$$X \xrightarrow{P(Y \mid X)} Y_1, Y_2, \dots, Y_n$$

Even at a temperature of 0.0, small changes in context or prompt structure, or floating-point rounding differences in the underlying computation, can change the response. Identity assertions then fail even when the output is still correct and only its formatting, wording, or tone has changed.
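
The role of temperature in sampling can be illustrated with a toy example. The sketch below is an illustration under stated assumptions, not how any particular model is implemented: the three candidate tokens and their scores (logits) are invented, and the helper names softmax and sampleToken are hypothetical. It shows that lowering the temperature concentrates probability on the top-scoring token, while a higher temperature leaves real probability on the alternatives.

```typescript
// Toy next-token scores (logits). Tokens and values are invented for illustration.
type Distribution = Record<string, number>;

// Softmax with temperature: lower temperature sharpens the distribution.
// Temperature must be > 0 here; greedy decoding (always picking the top token)
// is the limiting case as temperature approaches 0.
function softmax(logits: Distribution, temperature: number): Distribution {
  if (temperature <= 0) throw new Error("temperature must be positive");
  const tokens = Object.keys(logits);
  const scaled = tokens.map((t) => logits[t] / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  const probs: Distribution = {};
  tokens.forEach((t, i) => {
    probs[t] = exps[i] / total;
  });
  return probs;
}

// Draw one token. `rand` is injectable so a test can fix the random number.
function sampleToken(probs: Distribution, rand: () => number = Math.random): string {
  const tokens = Object.keys(probs);
  const r = rand();
  let cumulative = 0;
  for (const token of tokens) {
    cumulative += probs[token];
    if (r < cumulative) return token;
  }
  return tokens[tokens.length - 1]; // guard against rounding at the upper edge
}

const logits: Distribution = { liability: 2.0, due: 1.5, owed: 0.5 };
const warm = softmax(logits, 1.0); // probability spread across all three tokens
const cold = softmax(logits, 0.05); // almost all probability on "liability"
```

With the warm distribution, repeated calls to sampleToken(warm) will return different tokens across runs; with the cold distribution they will almost always return the same one. Neither setting removes the other sources of variation listed above.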

Setup & Methodology

To demonstrate this difference, we set up a simple experiment comparing a traditional tax calculation function against an LLM-powered tax advisor agent.

Deterministic System: A TypeScript function calculating income tax using flat tax brackets.
Stochastic System: An LLM prompt requesting an income tax calculation and financial advice based on the same parameters.
Execution: We ran both systems 100 times with identical input parameters.
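
The repeated-run procedure can be sketched as a small harness that calls a system many times with the same input and counts each distinct output. This is a minimal sketch, not the harness used in the lab: tallyOutputs, taxAsText, and makeStubAdvisor are hypothetical names, and the stub advisor cycles through canned phrasings instead of calling a model so that the example runs offline. A real model call would be asynchronous and would need an async variant of the loop.

```typescript
// Run `system` repeatedly on the same input and count each distinct output.
function tallyOutputs<I>(
  system: (input: I) => string,
  input: I,
  runs: number,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i < runs; i++) {
    const output = system(input);
    counts.set(output, (counts.get(output) ?? 0) + 1);
  }
  return counts;
}

// Deterministic system: flat 20% tax, rendered as text so both systems return strings.
const taxAsText = (income: number): string => String(income * 0.2);

// Hypothetical stand-in for the LLM call: it cycles through canned phrasings
// and ignores its input, so the sketch needs no network access or API key.
function makeStubAdvisor(phrasings: string[]): (income: number) => string {
  let call = 0;
  return () => phrasings[call++ % phrasings.length];
}

const stubAdvisor = makeStubAdvisor([
  "Your tax is £9,000.",
  "Flat tax due is £9,000.",
  "Your tax liability is £9,000.",
]);

const deterministicTally = tallyOutputs(taxAsText, 45000, 100);
const stochasticTally = tallyOutputs(stubAdvisor, 45000, 100);
// deterministicTally holds one distinct output; stochasticTally holds three.
```

The number of distinct keys in each tally is a quick first signal: a single key suggests deterministic behavior for that input, while many keys mean exact-match assertions will be brittle.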

The Experiment

We supplied the following user input:

“My gross income is £45,000. Calculate my flat income tax of 20% and provide a brief recommendation on whether I should allocate 10% to retirement savings.”

Deterministic Implementation (TypeScript)

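import * as assert from "node:assert";

// Flat-rate income tax: 20% of gross income.
function calculateTax(income: number): number {
  return income * 0.20;
}

// Test assertion: the same input always yields exactly the same output.
assert.strictEqual(calculateTax(45000), 9000);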

Stochastic Implementation (LLM Prompt via API)

{
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "system",
      "content": "You are a professional financial advisor. Calculate flat income tax at 20% and recommend if 10% should go to savings. Keep it concise."
    },
    {
      "role": "user",
      "content": "Gross income: £45,000."
    }
  ],
  "temperature": 0.3
}

Observations

TypeScript Function: 100 runs. 100 successful matches of exactly 9000. Execution time: <1ms.
LLM Agent: 100 runs. The tax calculation was mathematically correct (£9,000) in all runs, but the wording, tone, and phrasing of the recommendation varied from run to run:
- Run 1: “Your tax is £9,000. Yes, saving 10% (£4,500) is highly recommended for long-term compound growth.”
- Run 12: “Flat tax due is £9,000. You should definitely save 10% (£4,500) to build a solid retirement nest egg.”
- Run 84: “Based on a 20% rate, your tax liability is £9,000. Allocating 10% (£4,500) to retirement is a sound financial choice.”

Results & Analysis

The variation in the LLM's responses demonstrates stochastic behavior directly. The two systems compare as follows:

| Metric | Deterministic Program | Stochastic AI Model |
| --- | --- | --- |
| Exact Match Rate | 100% | 0% (exact text matches failed) |
| Semantic Accuracy | 100% | 100% (all runs had correct math & advice) |
| Response Latency | <1ms | ~1.2s avg |
| Vulnerability Surface | Code bugs, stack overflows | Prompt injection, hallucinated tax rules |

An exact-match assertion in a standard unit testing framework would mark all 100 LLM runs as failures, because none of them matched a single expected string. A human auditor or a semantic evaluator would mark all 100 as successful.
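
The gap between the two verdicts can be reproduced on the three responses quoted under Observations. The sketch below is one possible approach, not the evaluator used in the lab: expectedText is a hypothetical reference string, and extractPounds, passesNumericCheck, and rate are illustrative helper names. The check covers only the numbers; judging the quality of the advice would need a grader model or classifier, which is not shown here.

```typescript
// The three responses quoted under Observations.
const responses: string[] = [
  "Your tax is £9,000. Yes, saving 10% (£4,500) is highly recommended for long-term compound growth.",
  "Flat tax due is £9,000. You should definitely save 10% (£4,500) to build a solid retirement nest egg.",
  "Based on a 20% rate, your tax liability is £9,000. Allocating 10% (£4,500) to retirement is a sound financial choice.",
];

// Hypothetical string that an exact-match test might have been written against.
const expectedText = "Your income tax is £9,000.";

// Extract every pound amount as a number, e.g. "£9,000" -> 9000.
function extractPounds(text: string): number[] {
  const matches: string[] = text.match(/£\d[\d,]*(?:\.\d+)?/g) ?? [];
  return matches.map((m) => Number(m.slice(1).replace(/,/g, "")));
}

// Numeric check: the response must state both the expected tax (20%)
// and the expected savings amount (10%), within half a penny.
function passesNumericCheck(text: string, income: number): boolean {
  const amounts = extractPounds(text);
  const near = (target: number): boolean =>
    amounts.some((a) => Math.abs(a - target) < 0.005);
  return near(income * 0.2) && near(income * 0.1);
}

// Fraction of items that satisfy a predicate.
function rate(items: string[], predicate: (s: string) => boolean): number {
  return items.filter(predicate).length / items.length;
}

const exactMatchRate = rate(responses, (r) => r === expectedText); // 0
const numericPassRate = rate(responses, (r) => passesNumericCheck(r, 45000)); // 1
```

The same three strings score 0 under exact matching and 1 under the numeric check, which mirrors the first two rows of the comparison above.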

Takeaways

Ditch Exact Matching: Asserting exact string outputs for LLMs leads to fragile test suites.
Implement Semantic Assertion: Evaluate what the output means rather than its exact text, using LLM-assisted graders, regular expressions for key values, fuzzy semantic matching, or output classification.
Measure Distribution: Rather than testing single queries, we must test query batches and evaluate performance distributions (e.g. failure rate, hallucination frequency, latency averages).
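
Distribution-level measurement can be sketched as a function that reduces a batch of graded runs to a few summary numbers. This is a minimal sketch under assumptions: the RunResult and BatchSummary shapes and the summarize function are hypothetical, the sample batch is made up for illustration and is not data from the lab, and the 5% failure budget is an arbitrary example threshold.

```typescript
interface RunResult {
  passed: boolean; // verdict from whatever grader is in use
  latencyMs: number; // wall-clock time for the run
}

interface BatchSummary {
  runs: number;
  failureRate: number;
  meanLatencyMs: number;
  p95LatencyMs: number;
}

function summarize(results: RunResult[]): BatchSummary {
  if (results.length === 0) throw new Error("cannot summarize an empty batch");
  const failures = results.filter((r) => !r.passed).length;
  const latencies = results.map((r) => r.latencyMs).sort((a, b) => a - b);
  const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length;
  // Nearest-rank 95th percentile: the smallest latency with at least
  // 95% of runs at or below it.
  const rank = Math.ceil((95 * latencies.length) / 100);
  return {
    runs: results.length,
    failureRate: failures / results.length,
    meanLatencyMs: mean,
    p95LatencyMs: latencies[rank - 1],
  };
}

// Made-up batch for illustration only; these are not measurements from the lab.
const sampleBatch: RunResult[] = [
  { passed: true, latencyMs: 1100 },
  { passed: true, latencyMs: 1250 },
  { passed: false, latencyMs: 1900 },
  { passed: true, latencyMs: 1150 },
];
const summary = summarize(sampleBatch);

// Gate on the distribution rather than on any single run.
// The 5% budget is a hypothetical threshold.
const withinBudget = summary.failureRate <= 0.05;
```

A test suite built this way passes or fails on the batch as a whole, so one unusual phrasing does not break the build while a rising failure rate still does.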

The following tools support these practices:

Promptfoo: For running assertions using LLM-based graders and regex.
Ragas: For evaluating RAG retrieval grounding.
Giskard: For automated scanning of model vulnerabilities.