Getting Started with Promptfoo

Setting up automated evaluation pipelines for LLM application outputs, writing custom assertions, and analyzing local evaluation metrics.

Objective

To install, configure, and run a local automated testing pipeline for prompt templates using Promptfoo, with assertions that check accuracy, safety, and constraint compliance.


Background Theory

When developing LLM applications, developers iterate on prompt templates constantly. However, changing a single line in a system prompt to fix one edge case can silently break ten others.

Promptfoo resolves this by introducing CLI-driven regression testing. It separates prompts, variables (test cases), and assertions (graders) into a declarative configuration file (promptfooconfig.yaml).

Assertions can be:

  • Deterministic: Checked via regex, exact matching, JSON schema compliance, or JavaScript functions.
  • Model-Graded: Evaluating output using a secondary LLM as a judge (e.g., assessing if output is polite, stays on-topic, or avoids mentioning competitor brands).
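
To make the deterministic category concrete, here is a minimal Python sketch of what case-insensitive substring and regex checks amount to. The function names are illustrative and are not Promptfoo's internals; they only mirror the behaviour that assertion types such as icontains and not-icontains describe.

```python
import re


def icontains(output: str, value: str) -> bool:
    """Pass when `value` appears in `output`, ignoring case."""
    return value.casefold() in output.casefold()


def not_icontains(output: str, value: str) -> bool:
    """Pass when `value` does not appear in `output`, ignoring case."""
    return not icontains(output, value)


def matches_regex(output: str, pattern: str) -> bool:
    """Pass when `pattern` matches anywhere in `output`."""
    return re.search(pattern, output) is not None


if __name__ == "__main__":
    # Hypothetical model reply, used only to exercise the checks.
    reply = "Orders over £50 ship FREE."
    print(icontains(reply, "free"))
    print(not_icontains(reply, "£4.99"))
    print(matches_regex(reply, r"£\d+"))
```

Checks like these are cheap and repeatable, but they only see characters. Whether a reply is polite or on-topic needs the model-graded category.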

Setup

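We install Promptfoo with npm. The command below installs it globally; a project-local install (omitting -g) also works, in which case the commands are run through npx: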

npm install -g promptfoo

Next, we initialize a default configuration structure:

promptfoo init

This creates a promptfooconfig.yaml file where we define our models, prompts, and test matrix.


The Experiment

We tested a custom support assistant prompt meant to answer queries about a shipping policy. The assistant's responses must satisfy these constraints:

  1. It must state the correct shipping fee (£4.99 flat rate, free on orders over £50).
  2. It must remain professional.
  3. It must NEVER recommend competitors (e.g. “PackLink” or “Royal Mail”).
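
The fee rule in the first constraint can be written down as a small reference function, which is handy for working out what each test case should expect. This is a sketch under one assumption: "over £50" is read strictly, so an order of exactly £50 still pays the flat rate. The names are illustrative.

```python
from decimal import Decimal

FLAT_RATE = Decimal("4.99")      # flat-rate fee from the policy
FREE_THRESHOLD = Decimal("50")   # assumed strict: only totals above this ship free


def shipping_fee(order_total: Decimal) -> Decimal:
    """Return the shipping fee in pounds for a given order total."""
    if order_total > FREE_THRESHOLD:
        return Decimal("0.00")
    return FLAT_RATE


if __name__ == "__main__":
    for total in (Decimal("30"), Decimal("50"), Decimal("60")):
        print(f"£{total} order -> £{shipping_fee(total)} shipping")
```

Under this reading, a £60 kit ships free, which is what the first test case below checks for.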

Prompt File (prompts.txt)

You are a shipping support assistant for Essential Nails. Answer this query: {{query}}
Use this context: We charge £4.99 flat rate shipping. Orders over £50 ship free.
Do not recommend external shipping aggregators or competitors.
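
The {{query}} placeholder is filled in from each test case's vars before the prompt is sent. Promptfoo uses Nunjucks templating for this, which supports far more than plain substitution; the Python sketch below is a simplified stand-in that handles only bare {{name}} placeholders, and the render function is hypothetical.

```python
import re

# The prompt file from this lab, reproduced as a string.
TEMPLATE = (
    "You are a shipping support assistant for Essential Nails. "
    "Answer this query: {{query}}\n"
    "Use this context: We charge £4.99 flat rate shipping. "
    "Orders over £50 ship free.\n"
    "Do not recommend external shipping aggregators or competitors."
)


def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value from `variables`."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value supplied for placeholder '{name}'")
        return str(variables[name])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


if __name__ == "__main__":
    print(render(TEMPLATE, {"query": "How much is shipping if I buy a £60 kit?"}))
```

Raising on a missing variable is a design choice of this sketch: a silently empty placeholder would produce a prompt that still runs but tests the wrong thing.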

Promptfoo Configuration (promptfooconfig.yaml)

prompts:
  - prompts.txt

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      query: "How much is shipping if I buy a £60 kit?"
    assert:
      - type: icontains
        value: "free"
      - type: not-icontains
        value: "£4.99"

  - vars:
      query: "Can you recommend a competitor like Royal Mail for cheaper shipping?"
    assert:
      - type: not-icontains
        value: "Royal Mail"
      - type: llm-rubric
        value: "The response politely declines to recommend external or competitor shipping options."
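
The "test matrix" this configuration produces is the cross product of prompts, providers, and test cases. The Python sketch below only illustrates how the cell count grows; it does not call Promptfoo, and expand_matrix is a hypothetical helper.

```python
from itertools import product

# Values copied from the configuration above.
prompts = ["prompts.txt"]
providers = ["openai:gpt-4o-mini"]
queries = [
    "How much is shipping if I buy a £60 kit?",
    "Can you recommend a competitor like Royal Mail for cheaper shipping?",
]


def expand_matrix(prompts, providers, queries):
    """Return one (prompt, provider, query) tuple per evaluation cell."""
    return list(product(prompts, providers, queries))


if __name__ == "__main__":
    cells = expand_matrix(prompts, providers, queries)
    print(f"{len(cells)} evaluation cells")
    for prompt, provider, query in cells:
        print(f"{prompt} | {provider} | {query}")
```

With one prompt, one provider, and two test cases there are two cells. Adding a second prompt variant or a second provider doubles that, which is worth keeping in mind when each cell costs an API call.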

Execution & Observations

We ran the evaluation using the Promptfoo CLI:

promptfoo eval

The console printed a test matrix showing passes and failures. To inspect details, we launched the web viewer:

promptfoo view

Visual Output Summary

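  • Test Case 1 (Free Shipping): PASSED. The model correctly applied the free shipping threshold and its output contained “free”.
  • Test Case 2 (Competitor Bypass): FAILED on the first run. The model output: “We cannot recommend competitors like Royal Mail, but you can try DHL.”
    • Reason: The model declined to recommend Royal Mail but then recommended DHL, so the llm-rubric assertion prohibiting competitor recommendations failed. As quoted, the response also still contains the literal string “Royal Mail”, so the not-icontains check rejects it as well: that assertion matches text and cannot tell a refusal from a recommendation.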

Mitigation & Iteration

We updated the system instructions in prompts.txt:

- Do not recommend external shipping aggregators or competitors.
+ Do not mention, recommend, or refer to any external shipping aggregators, couriers, or competitors (including DHL, Royal Mail, FedEx, PackLink, etc.). If asked, politely state that you can only discuss Essential Nails shipping options.

Re-running promptfoo eval then passed every assertion, giving a 100% pass rate.
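
The revised prompt names specific couriers, and the same list can back a deterministic check that sits alongside the llm-rubric assertion. The sketch below is a denylist scan written as a Python assertion. It assumes Promptfoo's Python assertion type calls a function named get_assert(output, context) and accepts a dict with pass, score, and reason keys; verify that against the Promptfoo documentation before relying on it. DENYLIST and find_mentions are illustrative names.

```python
import re
from typing import Any, Dict, List, Sequence

# Names taken from the revised prompt; extend as needed.
DENYLIST = ("DHL", "Royal Mail", "FedEx", "PackLink")


def find_mentions(output: str, names: Sequence[str] = DENYLIST) -> List[str]:
    """Return the denylisted names found in `output` as whole words, ignoring case."""
    found = []
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, output, flags=re.IGNORECASE):
            found.append(name)
    return found


def get_assert(output: str, context: Any) -> Dict[str, Any]:
    """Assumed Promptfoo entry point: fail when any denylisted name is mentioned."""
    found = find_mentions(output)
    if found:
        reason = "mentions " + ", ".join(found)
    else:
        reason = "no listed courier mentioned"
    return {"pass": not found, "score": 0.0 if found else 1.0, "reason": reason}


if __name__ == "__main__":
    first_run = "We cannot recommend competitors like Royal Mail, but you can try DHL."
    print(get_assert(first_run, None))
```

A denylist only catches names someone thought to add, so it complements the model-graded rubric rather than replacing it.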


Takeaways

  1. Model Judgement is Critical: Simple text checks (e.g., not-icontains) only match literal strings, so they cannot catch semantic leaks such as an unlisted courier being recommended. llm-rubric assertions add a semantic guardrail on top of them.
  2. Local Auditing Works: Running Promptfoo locally before deploying allows developers to treat prompt engineering like traditional test-driven development (TDD).
  3. Continuous Integration: These evaluations can be integrated into GitHub Actions so that regression tests run automatically on each commit.
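
For the CI takeaway, one way to wire an evaluation into a pipeline step is a small wrapper that runs the command and turns its exit status into a pass or fail. This Python sketch assumes the promptfoo CLI exits with a non-zero status when any assertion fails, which should be confirmed for the installed version. run_gate is a hypothetical helper.

```python
import subprocess
import sys
from typing import List


def run_gate(command: List[str]) -> bool:
    """Run `command` and report whether it exited with status 0."""
    try:
        completed = subprocess.run(command, check=False)
    except FileNotFoundError:
        # The executable is not installed or not on PATH.
        print(f"command not found: {command[0]}", file=sys.stderr)
        return False
    return completed.returncode == 0


# In a CI step, the script would end with something like:
#   sys.exit(0 if run_gate(["promptfoo", "eval"]) else 1)
```

Keeping the gate as a function that takes the command as an argument makes it easy to test with a stand-in command before pointing it at a real evaluation.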