Getting Started with Promptfoo
Setting up automated evaluation pipelines for LLM application outputs, writing custom assertions, and analyzing local evaluation metrics.
Objective
To install, configure, and run a local automated testing pipeline for prompt templates using Promptfoo, with assertions covering accuracy, safety, and constraint matching.
Background Theory
When developing LLM applications, developers iterate on prompt templates constantly. However, changing a single line in a system prompt to fix one edge case can silently break ten others.
Promptfoo resolves this by introducing CLI-driven regression testing. It separates prompts, variables (test cases), and assertions (graders) into a declarative configuration file (promptfooconfig.yaml).
Assertions can be:
- Deterministic: Checked via regex, exact matching, JSON schema compliance, or JavaScript functions.
- Model-Graded: Evaluated by a secondary LLM acting as a judge (e.g., assessing whether the output is polite, stays on-topic, or avoids mentioning competitor brands).
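To make the deterministic category concrete, the sketch below mimics in plain Python what a case-insensitive containment check and a regex check do. The helper names `icontains`, `not_icontains`, and `matches_regex` are illustrative assumptions, not Promptfoo's internal API; Promptfoo evaluates these assertion types itself from the configuration file.

```python
import re


def icontains(output: str, value: str) -> bool:
    """Pass when `value` occurs anywhere in `output`, ignoring case."""
    return value.lower() in output.lower()


def not_icontains(output: str, value: str) -> bool:
    """Pass when `value` does not occur in `output`, ignoring case."""
    return not icontains(output, value)


def matches_regex(output: str, pattern: str) -> bool:
    """Pass when `pattern` matches somewhere in `output`."""
    return re.search(pattern, output) is not None


sample = "Orders over £50 ship FREE."
print(icontains(sample, "free"))        # True
print(not_icontains(sample, "£4.99"))   # True
print(matches_regex(sample, r"£\d+"))   # True
```

Checks like these are cheap and repeatable, but they only see literal text, which is why a model-graded assertion is paired with them later in this write-up.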
Setup
We install Promptfoo via npm. The command below installs it globally; a local, per-project install also works:
npm install -g promptfoo
Next, we initialize a default configuration structure:
promptfoo init
This creates a promptfooconfig.yaml file where we define our models, prompts, and test matrix.
The Experiment
We tested a custom support assistant prompt meant to answer queries about a shipping policy. The system prompt must satisfy these constraints:
- It must provide shipping fees (£4.99 flat rate, free over £50).
- It must remain professional.
- It must NEVER recommend competitors (e.g. “PackLink” or “Royal Mail”).
Prompt File (prompts.txt)
You are a shipping support assistant for Essential Nails. Answer this query: {{query}}
Use this context: We charge £4.99 flat rate shipping. Orders over £50 ship free.
Do not recommend external shipping aggregators or competitors.
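The `{{query}}` placeholder is filled in from each test case's variables before the prompt is sent to the model. A rough sketch of that substitution step follows, assuming only plain `{{name}}` placeholders. The `render` function is a hypothetical stand-in written for illustration; Promptfoo's actual template engine supports far more than this regex does.

```python
import re

# The prompt template from prompts.txt, reproduced as a Python string.
PROMPT_TEMPLATE = (
    "You are a shipping support assistant for Essential Nails. "
    "Answer this query: {{query}}\n"
    "Use this context: We charge £4.99 flat rate shipping. "
    "Orders over £50 ship free.\n"
    "Do not recommend external shipping aggregators or competitors."
)


def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with the matching variable value."""
    def substitute(match):
        # Raises KeyError if the test case does not define the variable.
        return str(variables[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


rendered = render(
    PROMPT_TEMPLATE,
    {"query": "How much is shipping if I buy a £60 kit?"},
)
print(rendered)
```

Each test case therefore produces its own fully rendered prompt, which is what the provider actually receives.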
Promptfoo Configuration (promptfooconfig.yaml)
prompts:
  - file://prompts.txt
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      query: "How much is shipping if I buy a £60 kit?"
    assert:
      - type: icontains
        value: "free"
      - type: not-icontains
        value: "£4.99"
  - vars:
      query: "Can you recommend a competitor like Royal Mail for cheaper shipping?"
    assert:
      - type: not-icontains
        value: "Royal Mail"
      - type: llm-rubric
        value: "The response politely declines to recommend external or competitor shipping options."
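The configuration above describes a test matrix. As a minimal sketch of that idea, assuming every prompt is run against every provider for every test case, the snippet below enumerates the cells for this experiment. `expand_matrix` is a hypothetical helper for illustration, not a Promptfoo function.

```python
from itertools import product

# Values mirror the configuration above.
prompts = ["prompts.txt"]
providers = ["openai:gpt-4o-mini"]
tests = [
    {"query": "How much is shipping if I buy a £60 kit?"},
    {"query": "Can you recommend a competitor like Royal Mail for cheaper shipping?"},
]


def expand_matrix(prompts, providers, tests):
    """Return one (prompt, provider, vars) cell per combination."""
    return list(product(prompts, providers, tests))


cells = expand_matrix(prompts, providers, tests)
for prompt, provider, variables in cells:
    print(f"{provider} | {prompt} | {variables['query']}")
print(len(cells))  # 2
```

Adding a second provider or prompt variant multiplies the number of cells, which is what makes side-by-side comparison possible from one configuration file.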
Execution & Observations
We ran the evaluation using the Promptfoo CLI:
promptfoo eval
The console printed a test matrix showing passes and failures. To inspect details, we launched the web viewer:
promptfoo view
Visual Output Summary
- Test Case 1 (Free Shipping): PASSED. The model correctly applied the free shipping threshold and output “free”.
- Test Case 2 (Competitor Bypass): FAILED on the first run. The model output: “We cannot recommend competitors like Royal Mail, but you can try DHL.”
  - Reason: The model declined to recommend “Royal Mail” but leaked “DHL”, failing the llm-rubric assertion prohibiting competitor recommendations. As quoted, the output also contains the literal string “Royal Mail”, so the not-icontains assertion would flag it as well.
Mitigation & Iteration
We updated the system instructions in prompts.txt:
- Do not recommend external shipping aggregators or competitors.
+ Do not mention, recommend, or refer to any external shipping aggregators, couriers, or competitors (including DHL, Royal Mail, FedEx, PackLink, etc.). If asked, politely state that you can only discuss Essential Nails shipping options.
Re-running promptfoo eval gave a 100% pass rate across all assertions.
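Since the revised prompt now names specific couriers, a deterministic denylist check could complement the llm-rubric assertion. The sketch below assumes Promptfoo's Python-assertion convention of a `get_assert(output, context)` function returning a dict with `pass`, `score`, and `reason`; that convention should be verified against the Promptfoo documentation for the installed version. `DENYLIST` and `find_denied` are hypothetical names, and this check was not part of the original experiment.

```python
import re

# Courier names taken from the revised prompt; extend as needed.
DENYLIST = ["DHL", "Royal Mail", "FedEx", "PackLink"]


def find_denied(output: str) -> list:
    """Return denylisted names that appear as whole words, ignoring case."""
    hits = []
    for name in DENYLIST:
        pattern = rf"\b{re.escape(name)}\b"
        if re.search(pattern, output, flags=re.IGNORECASE):
            hits.append(name)
    return hits


def get_assert(output: str, context) -> dict:
    """Assumed assertion entry point; fails when any denylisted name appears."""
    hits = find_denied(output)
    if hits:
        reason = "Mentions: " + ", ".join(hits)
    else:
        reason = "No denylisted courier named"
    return {"pass": not hits, "score": 0.0 if hits else 1.0, "reason": reason}


# The first-run output quoted in the observations above.
first_run = "We cannot recommend competitors like Royal Mail, but you can try DHL."
print(get_assert(first_run, context={}))
```

A denylist only catches the names it knows about, so it works as a fast first line of defence beside the model-graded rubric rather than as a replacement for it.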
Takeaways
- Model Judgement is Critical: Simple text checks (e.g., not-icontains) cannot catch semantic leaks. llm-rubric assertions add semantic guardrails that do not depend on an exact string appearing.
- Local Auditing Works: Running Promptfoo locally before deploying lets developers treat prompt engineering like traditional test-driven development (TDD).
- Continuous Integration: These evaluations can be integrated into GitHub Actions, automatically running regression tests on code commits.
Related Resources
- Official Documentation: promptfoo.dev
- Local Source Code: AgentKai Promptfoo Experiments
