Offline vs. online LLM evaluation: what each catches, what each misses

Offline LLM eval catches regressions you introduce; online eval catches silent model updates and drift. What each misses, and how to run both.

TL;DR — Offline and online LLM evaluation answer two different questions, and each one is blind to the failures the other catches. Offline evaluation runs pre-deploy on a fixed dataset and catches the regressions you introduce yourself, through a code or config change. Online evaluation runs in production on real traffic and catches everything that happens to you instead: silent model updates, input drift, upstream dependency changes. Skip online eval and run offline-only, and your production quality will degrade in ways you cannot see until a user reports it — and by then, the incident already has a support ticket number.

A team runs a thorough offline evaluation before shipping a RAG update. Faithfulness holds at 0.89, comfortably above the 0.80 threshold, and the deploy ships. Six weeks later, support tickets spike. The model behind the provider's API had been silently swapped underneath them — the offline eval never touched the live model, production monitoring didn't exist, and the team found out the same way their customers did.

That gap is the subject of this article.

What the distinction actually means

Offline evaluation runs before deployment, on a fixed dataset, against one specific version of the system. You control the queries, the expected outputs, the model version, the judge — everything. It's evaluation as a gate: does this version clear the bar before it ships?

Online evaluation runs in production, on real user traffic, on the live system. You control exactly one variable: the sampling rate. It's evaluation as a monitor: is the system still clearing the bar now that it shipped?

The failure modes each one catches don't overlap. Offline eval catches what you broke. Online eval catches what broke without anyone touching anything.

What offline evaluation catches, and what it misses

Offline evaluation, covered in depth in the LLM evaluation guide, catches the regressions you introduce yourself. Change a prompt, bump a model version, tweak retrieval config — the offline run compares the new version against baseline and flags what got worse. For that job, it works well: the golden dataset is a stable reference, scoring is reproducible, and the result is a clean pass or fail.

What it structurally cannot catch falls into three buckets.

Changes that happen to you, not by you. Providers update the model behind a version name with no version bump and no advance notice — GPT-4 variants shifted behavior repeatedly through 2024 with no corresponding change in the string you were pinning against. Your prompts are untouched, your config is untouched, and you haven't run a new eval because nothing in your repo changed. Offline evals run against a pinned model version by design. They have no mechanism for seeing a live behavior shift, because the thing that shifted isn't in the eval at all.

Input distribution drift. Your golden dataset is a snapshot of the query distribution at build time — usually internal testing or early user sessions. Real users show up with different vocabulary, broader scope, edge cases your testers never thought to hit, and offline eval has no signal for any of it. The golden-set score still reads 0.87. Production failure rate is climbing because 40% of production queries fall outside what the dataset ever covered.

Behavioral drift under load or sequence. Some failure modes only show up under conditions a golden dataset can't reproduce: concurrent requests, the tail end of a long conversation, a specific session pattern. Offline eval runs single queries against a fixed set. It never sees the load or the sequence, so it never sees the failure.

What online evaluation catches, and its constraints

Online evaluation catches exactly what offline structurally can't: provider-side model updates, input distribution drift, upstream dependency changes, production-specific behavioral patterns. Its constraints are just as real.

You can't evaluate everything. Human review at production scale stops being feasible somewhere around a few hundred requests a day, so LLM-as-a-judge is the only method that scales. That means online eval quality is bounded by how well the judge is calibrated — a poorly calibrated judge hands you confident-looking numbers that don't track actual quality at all.

Latency budget. Score every production request inline and you've added hundreds of milliseconds and doubled inference cost, on every request, forever. The workable version samples a fraction of traffic and scores it asynchronously, off the critical path.
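
A minimal sketch of that shape, assuming an in-process `asyncio` queue and a stubbed judge. The names `judge_score`, `maybe_enqueue`, `run_demo`, and the record fields are hypothetical, chosen for illustration; a real deployment would more likely hand sampled records to an external queue.

```python
import asyncio
import random


async def judge_score(record: dict) -> float:
    """Hypothetical stand-in for an LLM-as-a-judge call."""
    await asyncio.sleep(0)  # placeholder for network latency
    return 1.0


async def scoring_worker(queue: asyncio.Queue, results: list) -> None:
    """Drain sampled records and score them off the request path."""
    while True:
        record = await queue.get()
        try:
            score = await judge_score(record)
            results.append((record["request_id"], score))
        finally:
            queue.task_done()


def maybe_enqueue(queue: asyncio.Queue, record: dict,
                  sampling_rate: float, rng: random.Random) -> bool:
    """Sample a fraction of traffic; never block the user-facing path."""
    if rng.random() >= sampling_rate:
        return False
    try:
        queue.put_nowait(record)
    except asyncio.QueueFull:
        return False  # drop the sample instead of delaying the response
    return True


async def run_demo(n_requests: int = 1000, sampling_rate: float = 0.05,
                   seed: int = 0) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1000)
    results: list = []
    worker = asyncio.create_task(scoring_worker(queue, results))
    rng = random.Random(seed)
    for i in range(n_requests):
        # The response would be returned to the user here, before any scoring.
        record = {"request_id": i, "query": "example query",
                  "response": "example response"}
        maybe_enqueue(queue, record, sampling_rate, rng)
    await queue.join()  # wait until every sampled record is scored
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return results
```

The design choice worth noting is the `put_nowait` call: when the queue is full, the sketch drops the sample so that evaluation can never add latency to a user request.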

Ground truth is usually gone. Offline eval scores against a known-correct answer. Online eval scores without one. Faithfulness and relevance don't need ground truth to score — correctness does, unless the task happens to have a deterministic check.

The sampling rate that actually works is 5 to 10% of traffic for most volumes. Below roughly 500 scored examples a day, either raise the rate or batch scoring across a longer window. A 0.5% sample on a 1,000-request-a-day system gets you five scored examples — nowhere near enough to detect a five-point quality drop with any confidence.
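
The arithmetic behind that claim can be sketched with a normal-approximation confidence interval for a pass rate. This is an illustrative simplification, not a prescribed method: the function names are hypothetical, the 0.85 pass rate is an assumed example value, and the approximation itself is unreliable at very small sample sizes, which is part of the point.

```python
import math


def scored_per_day(daily_requests: int, sampling_rate: float) -> int:
    """Number of requests that get scored each day at a given sampling rate."""
    return round(daily_requests * sampling_rate)


def pass_rate_margin(n_scored: int, pass_rate: float, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval around a measured pass rate."""
    if n_scored <= 0:
        raise ValueError("n_scored must be positive")
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_scored)


if __name__ == "__main__":
    for rate in (0.005, 0.05, 0.10):
        n = scored_per_day(1_000, rate)
        margin = pass_rate_margin(n, pass_rate=0.85)  # 0.85 is an assumed example
        print(f"rate={rate:.1%} scored/day={n} margin=+/-{margin:.3f}")
```

At 500 scored examples and a 0.85 pass rate the margin is roughly ±0.03, tight enough to see a five-point drop. At five examples it is roughly ±0.3, wider than the drop you are trying to detect.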

The three failure modes only online eval catches

Silent model updates. The most common production incident in LLM systems has no corresponding code change: OpenAI, Anthropic, and Google have all updated model behavior behind a fixed version string without a changelog entry that maps to it. Your offline evals still pass, because they pin that same version string. The live system is quietly running a different model than the one you evaluated. Online monitoring catches the shift on real traffic. Offline eval was never going to.

Query distribution shift. The golden dataset was built from internal testing. Real users show up with different vocabulary, broader scope, expectations your testers never had. The golden set still reports 0.87 faithfulness while the production fail rate sits at 0.21, because 40% of production queries never touched the distribution the dataset covers. Online eval on real traffic is the only signal that catches this. The offline number was never wrong; it was just answering a question nobody was asking anymore.

Upstream dependency changes. Retrieval config, database content, external APIs, a knowledge base update: all of these move output quality without touching the model or the prompt. Update the knowledge base with new policies and the correct answer changes, but the golden dataset still reflects the old ones. Online eval on post-update traffic catches the mismatch immediately. Offline eval, checked against an unchanged dataset, has no way to know anything moved.

Embedding-based drift detection

Embedding-based drift detection complements score monitoring instead of replacing it. Track the statistical distribution of production query embeddings over time. When it shifts — new clusters forming, existing ones moving — the system is receiving queries meaningfully different from what the golden dataset covers.

The value is entirely in the timing. Score degradation is a lagging indicator: by the time faithfulness drops 0.05 on the online monitor, the distribution shift behind it has been accumulating for days, sometimes weeks. Embedding drift flags the shift while it's still happening, not after the score has already moved.

The implementation is straightforward. Embed each production query, or a sampled fraction, with the same embedding model your retrieval system already uses. Compare a recent window of embeddings against a baseline window with a distribution-distance measure. MMD (maximum mean discrepancy), a kernel two-sample statistic, and KL divergence, usually estimated over clustered or binned embeddings, are the two people actually reach for. Alert when the distance moves past the historical baseline by a set threshold.
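
As a rough illustration of the MMD option, here is a pure-Python sketch with an RBF kernel and a permutation test. Everything in it is an assumption made for the example: the function names are hypothetical, the kernel bandwidth `gamma` is arbitrary (the median pairwise distance is a common heuristic for choosing it), and real embeddings would call for a vectorized implementation.

```python
import math
import random
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, gamma: float) -> float:
    """Gaussian (RBF) kernel between two embedding vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector], gamma: float) -> float:
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(baseline: Sequence[Vector], current: Sequence[Vector],
                gamma: float = 0.1) -> float:
    """Biased estimate of squared MMD between baseline and current samples."""
    return (mean_kernel(baseline, baseline, gamma)
            + mean_kernel(current, current, gamma)
            - 2.0 * mean_kernel(baseline, current, gamma))


def permutation_p_value(baseline: Sequence[Vector], current: Sequence[Vector],
                        gamma: float = 0.1, n_permutations: int = 100,
                        seed: int = 0) -> float:
    """Share of random re-splits whose MMD is at least the observed one."""
    rng = random.Random(seed)
    observed = mmd_squared(baseline, current, gamma)
    pooled = list(baseline) + list(current)
    n = len(baseline)
    at_least = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mmd_squared(pooled[:n], pooled[n:], gamma) >= observed:
            at_least += 1
    # Add-one correction keeps the p-value strictly above zero.
    return (at_least + 1) / (n_permutations + 1)
```

A small p-value says the recent window is unlikely to come from the same distribution as the baseline window. The alert threshold, and how long each window should be, are choices to calibrate against your own historical traffic.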

What it won't tell you is whether the new queries are harder or easier to handle. Drift is a flag for a human to look at, not a verdict on quality. Pull a sample, read them, and then decide: update the golden dataset, change the system, or add coverage to the pre-deploy eval set.

Canary evaluation and shadow scoring

Two techniques sit between offline and online when you're about to ship a change.

Canary evaluation routes a slice of production traffic, typically 5 to 20%, to the new model version or prompt before full rollout. Both versions get scored on their own traffic, and the canary has to match or beat the current version before the split widens. A canary is not a staging environment. It's production, for a real subset of real users, and incident response has to cover canary failures the same as any other. Degrading quality for 10% of users is still degrading quality for those users.

Shadow scoring runs the new version alongside the current one on every production request, scores both, and serves only the current version's response. Zero user exposure to the candidate, real comparison data. The cost is real too: shadow scoring roughly doubles inference spend and adds latency to a path users never see but you still pay for.

The choice comes down to exposure tolerance. Use shadow scoring when any user exposure to the candidate is unacceptable — safety-sensitive work, regulated domains, anything you're not confident about yet. Use canary evaluation when a small failure rate on a production subset is a fair trade for faster, more realistic data at a fraction of the infrastructure cost.
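
For the canary path, one possible sketch is a stable hash-based split plus a promotion check. The names `in_canary` and `canary_can_widen`, the salt, and the `min_samples` default are hypothetical, and comparing raw means is a simplification: a stricter gate would also test whether the difference is statistically meaningful.

```python
import hashlib
from typing import Sequence


def in_canary(user_id: str, canary_fraction: float, salt: str = "canary-v1") -> bool:
    """Assign a user to the canary deterministically, so they always see one version."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")     # integer in [0, 2**64)
    return bucket < int(canary_fraction * 2**64)   # compare in integer space


def canary_can_widen(current_scores: Sequence[float],
                     canary_scores: Sequence[float],
                     min_samples: int = 200,
                     tolerance: float = 0.0) -> bool:
    """Widen the split only if the canary matches or beats the current version."""
    if len(canary_scores) < min_samples or not current_scores:
        return False  # not enough evidence yet; keep the split where it is
    current_mean = sum(current_scores) / len(current_scores)
    canary_mean = sum(canary_scores) / len(canary_scores)
    return canary_mean >= current_mean - tolerance
```

Hashing the user ID instead of rolling a die per request keeps each user on a single version for the whole canary, which makes per-version scores easier to interpret.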

Production sampling, practical design

A production sampling system comes down to five components.

Request logging is the foundation everything else sits on. Log every request, or a consistently sampled fraction, with everything needed to evaluate it later: query, retrieved context for RAG, model response, model version, prompt hash, timestamp. Skip the model version field and you've made it structurally impossible to attribute a score change to a specific model update. This is the single most common logging gap teams ship with.

Sampling strategy decides which requests get scored. Random sampling covers most systems fine. Stratified sampling, which guarantees coverage across query types and user segments, does better, but only if you define the strata upfront. Importance sampling, which oversamples requests that look likely to fail based on confidence scores or prior patterns, is the right call when you're hunting rare but severe failure modes specifically.

Async scoring keeps judge inference off the critical path entirely. Sampled requests route to a queue and get scored in seconds to minutes, not milliseconds. The queue mechanics themselves are covered in the automated LLM evaluation guide.

Score aggregation is what turns raw scores into a signal anyone can act on. Aggregate daily or weekly, never request by request, and always segment by query type and user segment. In a system with five query types, one type carrying about 10% of traffic can collapse from 0.82 to 0.61 while the aggregate barely moves, from 0.85 to 0.83: a 0.21 drop weighted at 10% shifts the overall number by only about 0.02. Category-specific regressions routinely hide inside a stable-looking aggregate.

Alert routing needs two trigger types, not one: absolute threshold violations (fail rate exceeds X% in category Y) and trend signals (fail rate climbing Z% a week for three weeks running). Trend alerts are what catch the slow-moving regression that never breaches an absolute threshold at all. Ship only absolute thresholds and gradual degradation sails through every time.
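
The aggregation and alerting components could look something like the sketch below. The record fields, function names, and the choice to measure a trend as an absolute rise in fail rate per week are all assumptions made for illustration, not a description of any particular platform.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ScoredRequest:
    query_type: str     # segment used for the breakdown
    model_version: str  # logged so a drop can be attributed later
    week: int           # index of the aggregation window
    passed: bool        # judge verdict for this request


def fail_rates(records: Iterable[ScoredRequest]) -> dict[tuple[str, int], float]:
    """Fail rate per (query_type, week), never per individual request."""
    totals: dict[tuple[str, int], int] = defaultdict(int)
    fails: dict[tuple[str, int], int] = defaultdict(int)
    for record in records:
        key = (record.query_type, record.week)
        totals[key] += 1
        if not record.passed:
            fails[key] += 1
    return {key: fails[key] / totals[key] for key in totals}


def absolute_alerts(rates: dict[tuple[str, int], float], week: int,
                    max_fail_rate: float) -> list[str]:
    """Query types whose fail rate exceeds the threshold in the given week."""
    return sorted(qt for (qt, w), rate in rates.items()
                  if w == week and rate > max_fail_rate)


def trend_alerts(rates: dict[tuple[str, int], float], week: int,
                 min_weekly_rise: float, weeks_running: int = 3) -> list[str]:
    """Query types whose fail rate rose every week for `weeks_running` weeks."""
    alerts = []
    for query_type in sorted({qt for qt, _ in rates}):
        series = [rates.get((query_type, w))
                  for w in range(week - weeks_running, week + 1)]
        if any(value is None for value in series):
            continue  # not enough history for this segment
        rises = [later - earlier for earlier, later in zip(series, series[1:])]
        if all(rise >= min_weekly_rise for rise in rises):
            alerts.append(query_type)
    return alerts
```

A segment whose fail rate climbs from 0.10 to 0.19 over three weeks never trips a 0.25 absolute threshold, but it does trip a trend rule set at two points per week. That is the case the second trigger type exists for.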

When each is sufficient

Offline evaluation alone is fine for internal tooling with a small, stable user base and a team close enough to notice quality problems through direct use. It stops being fine for any production system past a few hundred daily active users, any system where real users behave differently than internal testers did, or anything running on a model API where the provider controls the underlying model.

Online evaluation alone is never enough, full stop. It catches regressions after users have already lived through them — by the time online monitoring flags a meaningful drop, the regression has been running for however long it took to accumulate statistical signal, often days. Pre-deploy offline eval is the only mechanism that catches a regression before a single user sees it.

The right architecture runs both: offline as the deploy gate, online as the continuous monitor. Offline answers "did this version clear the bar before it shipped?" Online answers "is the live system still clearing it now?" When the two answers diverge, it almost always means something external hit production after the last deploy.

Platforms like Galtea evaluate against one shared quality specification, so the pre-deploy gate and the production monitor are checking the same criteria instead of two thresholds calibrated separately by two different teams.

Common mistakes

Skipping online monitoring after building offline eval. Thorough pre-deploy testing buys confidence. Confidence is exactly what lets a production system drift for weeks before anyone checks. The better the offline eval, the worse the surprise when it turns out not to be enough.

Sampling rates set too low. A 0.5% sample on 1,000 requests a day is five scored examples — not enough to detect a real quality drop with any confidence. Raise the rate, or batch scoring across a longer window.

Treating online and offline scores as equivalent. They come from different query distributions, have different ground truth availability, and often run through judges calibrated for different purposes. The same number, 0.85 faithfulness, means something different depending on which one produced it.

Not logging model version per request. The online monitor flags a drop starting on a specific date, and the first question is always what changed. Without model version in the logs, a provider-side change is invisible in your own data — you'll never find it there.

Alert fatigue from thresholds set too tight. An alert that fires constantly gets disabled, permanently, usually within a month. Set the threshold to the quality drop that actually requires action, not to whatever produces a quiet dashboard during normal operation.

Watching aggregate scores without segment breakdowns. Category-specific regressions vanish inside a stable aggregate every time. Monitor per query type, not just the overall number, or you won't see the failure until it's big enough to move the average.