Automated LLM Evaluation: Building a CI/CD quality gate that actually runs

A practical guide to wiring LLM evaluation into CI/CD so quality regressions get caught before they ship, not after support tickets pile up.

TL;DR — Automated LLM evaluation is a CI/CD pipeline where every change to a prompt, model version, or retrieval configuration triggers an eval run against a versioned golden dataset. It differs from standard test automation in two ways: the checks are probabilistic, not deterministic, and the dataset is part of the system.

A prompt change ships on Thursday. The engineer tested it on five examples and it looked better. Two Mondays later, the support queue has 60 tickets about one answer pattern: the category the prompt change was supposed to fix is now failing on a different edge case. There's no eval history, no baseline, and no way to know when the failure started or which prompt version caused it.

This is not a testing problem. It's an infrastructure problem. The tools to prevent it are not complicated; most teams just never build them.

Why LLM eval automation differs from standard test automation

Standard test automation runs deterministic checks. A function either returns the expected value or it doesn't. A test that fails is unambiguously wrong. The fix is in the code.

LLM eval automation runs probabilistic checks. An LLM judge that scores faithfulness is itself a model, so it can produce both false positives and false negatives. An eval run that produces a 0.82 faithfulness score is not wrong in the way a failing unit test is wrong: it's an estimate, and the estimate has error bars. The pipeline has to handle this differently. It should track trends rather than point scores, catch regressions in aggregate rather than single-case failures, and route borderline cases to human review rather than treating them as hard blocks.
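To make "error bars" concrete, here is a minimal sketch that puts a Wilson score interval around a pass rate. It assumes the faithfulness score is the fraction of golden examples the judge marked as passing, which is one possible scoring scheme and not necessarily the one your harness uses; `wilson_interval` and `intervals_overlap` are illustrative names.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two intervals share any range, i.e. the runs are not clearly different."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    low, high = wilson_interval(82, 100)
    print(f"82/100 passes: interval {low:.2f} to {high:.2f}")
```

On a 100-example golden set, a 0.82 pass rate carries an interval of roughly 0.73 to 0.88, which is why a single run that moves by a point or two says little on its own.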

The second difference: the dataset is part of the system. In standard testing, the test inputs are fixed by the spec. In LLM eval automation, the golden dataset drifts over time as the product scope changes, as failure modes are discovered, and as the team adds coverage for new query types. Dataset management is a first-class engineering problem, not a background concern.

The three triggers that must run eval automatically

Not every code change needs an eval run. Three changes do, and most others don't.

Model version or model configuration changes. Any change to the model version, model provider, or inference configuration (temperature, max tokens, top-p) triggers a full eval run. Model providers can update the model behind an unversioned alias without changing its name, and teams switch providers for cost or latency reasons. Both shift output quality in ways that only a golden-set eval catches. A CI pipeline that doesn't trigger on model configuration files misses the most common source of silent regressions.

Prompt changes. Every prompt edit (system prompt, few-shot examples, output format instructions) triggers an eval run against the full golden dataset. A prompt change that improves performance on the target scenario often degrades performance on edge cases the developer didn't think about. Edge cases are exactly what golden datasets are built to cover.

Retrieval configuration changes for RAG systems. Chunking strategy, embedding model, re-ranker, context window allocation, and similarity thresholds all affect what the model receives and therefore what it outputs. Each change triggers component evaluation on retrieval quality and end-to-end evaluation on the full pipeline.

Infrastructure changes, dependency updates, and application code changes that don't touch these three areas can skip the eval run. Scoping the trigger to the right file paths keeps the pipeline fast and prevents eval fatigue.
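The path scoping described above can be sketched as a small check over the list of changed files. The patterns mirror the repository layout used in the workflow later in this article and are assumptions about your project; `triggering_paths` is a hypothetical helper. Python's `fnmatch` is not identical to the path filters of a CI system, so treat this as an illustration of the rule rather than a drop-in replacement.

```python
from fnmatch import fnmatchcase

# Assumed layout: prompts live under prompts/, configs under config/.
EVAL_TRIGGER_PATTERNS = (
    "prompts/*",              # fnmatch's * also matches "/" so this covers subfolders
    "config/model.yaml",
    "config/retrieval.yaml",
)


def triggering_paths(changed_paths, patterns=EVAL_TRIGGER_PATTERNS):
    """Return the changed paths that should trigger an eval run."""
    return [
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    changed = ["src/app.py", "prompts/system/v2.txt", "README.md"]
    hits = triggering_paths(changed)
    print("run eval" if hits else "skip eval", hits)
```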

Harness design: what consistency actually requires

The evaluation harness is the code that loads examples, calls the model, calls the judge, and aggregates scores. Consistency is its non-negotiable property: the same input must produce the same score across runs, so that score differences between runs reflect actual quality changes rather than harness variance.

Six settings need to be pinned for consistency:

Judge model and snapshot version. Pin gpt-4o-2024-11-20, not gpt-4o. Model providers update models behind the same version name. An unpinned judge is a measurement instrument that changes between runs.
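Judge temperature at 0. Non-zero temperature introduces run-to-run variance on classification tasks. The variance is small per run but compounds into noise when tracking score trends across dozens of deploys. Temperature 0 does not guarantee bit-identical outputs from every provider, but it removes sampling as a source of variance.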
Judge prompt version. Version the judge prompt in git alongside application code. A change to the judge prompt changes the measurement instrument. Score trends that span a judge prompt change are meaningless — you'd need to re-run the baseline under the new judge before interpreting the delta.
Batch size and concurrency settings. Rate limiting and retry behavior that varies between runs affects which examples fail or time out, which affects the aggregate score.
Random seed for example ordering. Randomize example order (some judge models show position effects on long batches), then fix the seed so runs are reproducible.
Dataset version. Every eval run records which commit of the golden dataset it ran against. A score improvement that coincides with a dataset change is not a quality improvement.
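One way to enforce the six pinned settings is to collect them in a single immutable config and record a fingerprint of it with every run. This is a sketch under assumptions: `HarnessConfig`, its field names, and `comparable` are illustrative, not part of any library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HarnessConfig:
    judge_model: str          # exact snapshot ID, not a floating alias
    judge_temperature: float
    judge_prompt_hash: str
    max_concurrency: int
    shuffle_seed: int
    dataset_commit: str

    def fingerprint(self) -> str:
        """Stable short hash of every pinned setting."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


def comparable(a: HarnessConfig, b: HarnessConfig) -> bool:
    """Scores from two runs are directly comparable only if nothing pinned changed."""
    return a.fingerprint() == b.fingerprint()


if __name__ == "__main__":
    base = HarnessConfig("gpt-4o-2024-11-20", 0.0, "9f2c1a7b", 4, 42, "a3f7c9d")
    print(base.fingerprint())
```

Storing the fingerprint next to each score makes it cheap to refuse a baseline comparison when the measurement instrument itself has changed.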



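import hashlib
import json
import random
from collections import Counter

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def aggregate(results: list[dict]) -> dict:
    # Placeholder aggregation: tallies raw verdict strings. Assumes the judge
    # prompt constrains the verdict to a short label; replace with real scoring.
    return {
        "example_count": len(results),
        "verdict_counts": dict(Counter(r["verdict"].strip() for r in results)),
        "results": results,
    }


def run_eval(
    dataset_path: str,
    dataset_commit: str,
    app_prompt_version: str,                     # application system prompt text
    judge_prompt_version: str,                   # judge prompt text (hashed below)
    model: str = "claude-sonnet-4-6",            # application model
    judge_snapshot: str = "claude-opus-4-7",     # judge model, distinct from the app model; pin an exact snapshot
    seed: int = 42,
) -> dict:
    with open(dataset_path) as f:
        examples = [json.loads(line) for line in f if line.strip()]

    # Randomize example order, with a fixed seed so runs are reproducible.
    random.Random(seed).shuffle(examples)

    judge_prompt_hash = hashlib.sha256(judge_prompt_version.encode()).hexdigest()[:8]

    results = []
    for ex in examples:
        # Application call under test.
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            temperature=0,                        # remove sampling variance from application output
            system=app_prompt_version,
            messages=[{"role": "user", "content": ex["query"]}],
        )
        output = response.content[0].text

        # Judge call: scores the response against the query and its context.
        verdict = client.messages.create(
            model=judge_snapshot,
            max_tokens=256,
            temperature=0,                        # remove sampling variance from the judge
            system=judge_prompt_version,
            messages=[{"role": "user", "content": json.dumps({
                "query": ex["query"],
                "context": ex.get("context", ""),
                "response": output,
            })}],
        )
        results.append({
            "example_id": ex["id"],
            "verdict": verdict.content[0].text,
            "model": model,
            "judge_model": judge_snapshot,
            "dataset_version": dataset_commit,
            "judge_prompt_hash": judge_prompt_hash,
        })

    return aggregate(results)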

The metadata logged per run — dataset commit, judge prompt hash, model version, app commit — is what makes regression diagnosis possible. Without it, you know that quality dropped; you don't know what changed.

Dataset versioning

The golden dataset is as important as the application code. Most teams version it casually — a shared folder, maybe a CSV with a date in the filename — and discover the cost of this when they can't attribute a score change to a dataset change vs. a quality change.

Three version control rules that prevent the most common failures:

Additions require code review. Adding an example to the golden dataset changes what the score measures. A new example that the current model fails exposes a failure that was previously undetected, so adding it should be a deliberate decision, documented in a PR. A new example that the current model passes is a coverage improvement, and it should be documented too. The commit message "added examples" is not enough.

Deletions are a production risk. Deleting a golden example that the current model fails makes the scores look better without making the system better. Treat deletion as a blocking change requiring explicit justification: what changed about the product requirements that makes this failure case no longer relevant?

Ground truth changes require justification. Updating the expected output for an example changes the measurement. The right justification: the quality requirements changed, so the expected output changed. The wrong justification: the model produces this output now, so we updated the expected output to match. The second pattern — updating ground truth to match model behavior — is common, easy to miss in review, and quietly destroys the eval's ability to detect regressions.

Storage format: JSON Lines (.jsonl), one example per line, versioned in git. JSONL diffs cleanly in pull requests; CSV doesn't. Store the dataset in the same repository as the application code that uses it, so prompt changes and dataset changes appear in the same PR diff.
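The three review rules can be backed by a small diff that classifies what changed between two versions of the dataset, so a PR check can flag deletions and ground-truth edits explicitly. This is a sketch: the `id` field matches the harness above, while the `expected` field name, `load_jsonl`, and `diff_datasets` are assumptions for illustration.

```python
import json


def load_jsonl(path: str) -> list[dict]:
    """Load one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def diff_datasets(old: list[dict], new: list[dict]) -> dict:
    """Classify changes between two dataset versions by example ID."""
    old_by_id = {ex["id"]: ex for ex in old}
    new_by_id = {ex["id"]: ex for ex in new}
    shared = old_by_id.keys() & new_by_id.keys()
    return {
        "added": sorted(new_by_id.keys() - old_by_id.keys()),
        "deleted": sorted(old_by_id.keys() - new_by_id.keys()),
        # "expected" is an assumed field name for the ground-truth output.
        "ground_truth_changed": sorted(
            i for i in shared
            if old_by_id[i].get("expected") != new_by_id[i].get("expected")
        ),
    }


if __name__ == "__main__":
    old = [{"id": "ex-1", "expected": "A"}, {"id": "ex-2", "expected": "B"}]
    new = [{"id": "ex-1", "expected": "A2"}, {"id": "ex-3", "expected": "C"}]
    print(diff_datasets(old, new))
```

A CI step could fail the PR when `deleted` or `ground_truth_changed` is non-empty and the PR description lacks a justification.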

Threshold design

A binary threshold — ship if faithfulness score exceeds 0.85, block if it doesn't — fails in two predictable ways. Set it too tight and it fires on noise, producing false alarms that the team learns to ignore. Set it too loose and it misses real regressions. Most teams calibrate it to minimize false alarms, which means setting it loose enough that it rarely fires, which means it doesn't catch real regressions either.

Per-category acceptable fail rates handle this better. The structure:

Blocking failures: any rate triggers a deploy block. These are the failures that make the product actively harmful: safety violations, policy violations, output format failures that break downstream systems. Zero tolerance, because even one such failure reaching a user is a production incident.
Regression gates: the gate triggers on the change in fail rate, not the absolute rate. A faithfulness fail rate that increases from 3% to 5% is a regression worth blocking. A stable fail rate of 5% might be the product's acceptable operating point, established when the product launched and the team decided this was tolerable. Setting the gate on the delta prevents false alarms from stable failure rates while catching genuine regressions.
Warning signals: logged and tracked but non-blocking. Minor quality degradations, edge case regressions, increases in response length variance. Review weekly, block if the trend persists for three consecutive weeks.

The mistake: one threshold, too tight, constantly triggering, disabled by the second week. Three-tier thresholds with different severities create a gate that blocks what matters and ignores what doesn't.
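The three tiers can be sketched as a single gate function over per-category fail rates. The category names and the 0.01 allowed increase are assumptions chosen for illustration, and `evaluate_gate` is a hypothetical helper rather than an existing API.

```python
def evaluate_gate(
    baseline: dict[str, float],
    current: dict[str, float],
    blocking: set[str],
    max_increase: dict[str, float],
) -> tuple[str, list[str]]:
    """Return ("block" | "warn" | "pass", reasons) from per-category fail rates."""
    blocks, warnings = [], []
    for category, rate in current.items():
        increase = rate - baseline.get(category, 0.0)
        if category in blocking:
            # Tier 1: zero tolerance, any failure blocks.
            if rate > 0:
                blocks.append(f"{category}: fail rate {rate:.2%} in a blocking category")
        elif category in max_increase:
            # Tier 2: gate on the change in fail rate, not the absolute rate.
            if increase > max_increase[category]:
                blocks.append(f"{category}: fail rate rose by {increase:.2%}")
        elif increase > 0:
            # Tier 3: everything else is logged but non-blocking.
            warnings.append(f"{category}: fail rate rose by {increase:.2%}")
    decision = "block" if blocks else "warn" if warnings else "pass"
    return decision, blocks + warnings


if __name__ == "__main__":
    decision, reasons = evaluate_gate(
        baseline={"safety": 0.0, "faithfulness": 0.03},
        current={"safety": 0.0, "faithfulness": 0.05},
        blocking={"safety"},
        max_increase={"faithfulness": 0.01},
    )
    print(decision, reasons)
```

With these assumed settings, the rise from 3% to 5% blocks, while a stable 5% fail rate passes.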

CI integration

A GitHub Actions workflow that runs eval on every PR touching a prompt, model configuration, or retrieval configuration:



name: LLM Eval Gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "config/model.yaml"
      - "config/retrieval.yaml"

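jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write   # needed to post the PR comment with GITHUB_TOKEN
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0     # full history, so the dataset's last commit can be resolved

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install anthropic==0.40.0 python-dotenv

      - name: Run eval harness
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          # github.sha on pull_request events is the merge commit, so resolve
          # the dataset commit and the app commit explicitly.
          DATASET_COMMIT=$(git log -1 --format=%H -- eval/golden.jsonl)
          python eval/run.py \
            --dataset eval/golden.jsonl \
            --dataset-commit "$DATASET_COMMIT" \
            --app-commit ${{ github.event.pull_request.head.sha }} \
            --output eval/results/${{ github.run_id }}.json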

      - name: Check thresholds and post PR comment
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          python eval/check_thresholds.py \
            --results eval/results/${{ github.run_id }}.json \
            --baseline eval/baseline.json \
            --pr-number ${{ github.event.pull_request.number }}

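      - name: Store results
        if: always()   # record blocked runs too; a failed threshold check would otherwise skip this step
        run: |
          # The runner's filesystem is ephemeral: store_results.py is assumed
          # to push eval/history/ to durable storage.
          python eval/store_results.py \
            --results eval/results/${{ github.run_id }}.json \
            --store eval/history/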

The PR comment format matters as much as the gate itself. A comment that shows only the current score tells reviewers nothing. Show the score breakdown per category, the delta from the baseline, and — for any blocking failure — the specific failing example IDs so the engineer can investigate without re-running the eval locally.



## LLM Eval Results — PR #482

| Category         | Baseline | This PR | Delta | Status  |
|------------------|----------|---------|-------|---------|
| Faithfulness     | 0.91     | 0.87    | -0.04 | ⚠️ Gate |
| Answer relevance | 0.88     | 0.89    | +0.01 | ✅ Pass |
| Safety           | 1.00     | 1.00    | 0.00  | ✅ Pass |

**Regression gate triggered on Faithfulness** (delta exceeds 0.03 threshold).
Failing example IDs: ex-0042, ex-0187, ex-0291
Dataset version: a3f7c9d
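A comment in this shape can be generated from the baseline and current scores. The sketch below is one possible renderer, not the contents of the `check_thresholds.py` script referenced in the workflow; `render_comment` and its parameters are hypothetical, and the status labels are plain text for simplicity.

```python
def render_comment(
    pr_number: int,
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float,
    failing_ids: dict[str, list[str]],
    dataset_version: str,
) -> str:
    """Render a markdown PR comment with per-category scores and deltas."""
    lines = [
        f"## LLM Eval Results for PR #{pr_number}",
        "",
        "| Category | Baseline | This PR | Delta | Status |",
        "|---|---|---|---|---|",
    ]
    gated = []
    for category, base in baseline.items():
        score = current[category]
        delta = score - base
        status = "Pass"
        if -delta > max_drop:  # score fell by more than the allowed drop
            status = "Gate"
            gated.append(category)
        lines.append(f"| {category} | {base:.2f} | {score:.2f} | {delta:+.2f} | {status} |")
    lines.append("")
    for category in gated:
        lines.append(f"**Regression gate triggered on {category}** (drop exceeds {max_drop:.2f}).")
        lines.append("Failing example IDs: " + ", ".join(failing_ids.get(category, [])))
    lines.append(f"Dataset version: {dataset_version}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_comment(
        482,
        baseline={"Faithfulness": 0.91, "Answer relevance": 0.88, "Safety": 1.00},
        current={"Faithfulness": 0.87, "Answer relevance": 0.89, "Safety": 1.00},
        max_drop=0.03,
        failing_ids={"Faithfulness": ["ex-0042", "ex-0187", "ex-0291"]},
        dataset_version="a3f7c9d",
    ))
```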

Platforms like Galtea support evaluation pipelines where quality criteria are derived from formal product specifications, so the eval gate enforces the same requirements documented in the product spec rather than scores calibrated ad hoc.

Regression tracking

Storing eval results is not optional. The minimum viable regression tracking store:

A structured log of every eval run: timestamp, git commit of the application, dataset version (commit hash), judge model and snapshot, judge prompt hash, per-category scores, example count, and failing example IDs. Stored in a database or an append-only JSON Lines file.
A dashboard showing score per category over time, with the ability to filter by dataset version and judge prompt version. The question this answers: "when did faithfulness start dropping, and what deploy coincided with it?" A score chart that shows only the current run is a score display, not a regression tracker.
Alerts on trend, not just threshold. A faithfulness score that starts at 0.92 and drops 0.02 per week for four consecutive weeks is a regression that a static threshold set at 0.80 misses entirely (the score is still 0.84). A trend alert set at "more than 0.015 per week for three consecutive weeks" catches it at week three. Trend alerts require the time-series data that regression tracking stores.
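The trend alert described above can be sketched in a few lines over a weekly score series. The starting score of 0.92 is inferred from the figures in the text (0.84 after four drops of 0.02), and `trend_alert` is an illustrative name, not a library function.

```python
def trend_alert(scores: list[float], min_drop: float = 0.015, consecutive: int = 3):
    """Return the index of the score at which the alert fires, or None.

    The alert fires once the score has fallen by more than `min_drop`
    for `consecutive` periods in a row.
    """
    streak = 0
    for i in range(1, len(scores)):
        if scores[i - 1] - scores[i] > min_drop:
            streak += 1
            if streak >= consecutive:
                return i
        else:
            streak = 0  # any flat or improving week resets the streak
    return None


if __name__ == "__main__":
    weekly = [0.92, 0.90, 0.88, 0.86, 0.84]  # index 0 is the starting score
    print("static threshold fired:", any(s < 0.80 for s in weekly))
    print("trend alert fires at week:", trend_alert(weekly))
```

On this series the static 0.80 threshold never fires, while the trend alert fires at index 3, the third consecutive weekly drop.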

Common mistakes

Running eval only after something breaks. Retrospective eval tells you something went wrong. It doesn't tell you what changed or when. The regression tracking infrastructure that answers "when did this start?" only exists if you've been running eval proactively on every change.

Unpinned harness settings. A judge model that updates between runs, or a temperature setting that isn't explicitly set to 0, produces score variance you can't distinguish from quality changes. Pin everything.

Treating the golden dataset as static. The dataset needs to grow as failure modes are discovered. Build a lightweight process for adding examples from production failures to the golden set: a script that formats the example correctly, a PR template that includes the expected output, a code review step that checks for ground-truth accuracy.

Setting one binary threshold. A single threshold tends to get disabled after the first few false alarms. Per-category thresholds with different severities are harder to set up, but teams are far less tempted to disable them, because they produce fewer false alarms and block actual regressions.

Not versioning the judge prompt. A change to the judge prompt changes the measurement instrument. Score trends across an unversioned judge prompt change are noise. When the judge prompt changes, re-run the baseline, tag the boundary in the regression tracker, and start a new trend line.

Logging only current scores. Without historical results and per-run metadata, regression tracking is impossible. Store everything from the first run.