Retail banking · Spain · AI validation · Tier 1 coverage

71% lower first-iteration cost for AI validation, at Tier 1 coverage, for a Tier 1 Spanish bank

A Tier 1 Spanish bank replaced manual, sampled validation of its customer-facing AI assistants with a continuous evaluation pipeline at full Tier 1 coverage. Achieving that coverage manually would cost ~€51K for the first iteration and ~€25K for each cycle after; Galtea runs it at ~€15K per cycle — same scope, same pass criteria.

Industry: Retail banking

Region: Spain

Segment: Tier 1 (anonymised)

AI surface: Customer-facing assistants

Scope: 3 use cases · Tier 1 coverage

The challenge

Manual validation had a coverage ceiling — and the bank was about to hit it three times over

The bank's AI validation process was built the way most banks still build it: manually and sampled. The per-iteration invoice stayed manageable because coverage stayed shallow — meaningful edge cases went untested in a regulated, customer-facing surface.

QA engineers hand-wrote test cases. Domain experts reviewed sampled conversations. Each review cycle took weeks. By the time an assistant was signed off for production, the underlying model had usually moved on. The sign-off described yesterday's behaviour rather than the agent actually running in production, and it covered only a sampled slice of that behaviour.

Bringing the same manual process up to the coverage standard a Tier 1 deployment actually requires would have cost ~€51K for the first iteration of an assistant and ~€25K for each iteration after. Those figures are per use case, so the total grows with every assistant added. With two more assistants queued for deployment, that linear cost curve was about to become the bottleneck that decided which use cases shipped at full coverage and which kept shipping under-tested.

A note on the comparison

The €51K and €25K figures throughout this story are not what the bank was paying. They are what equivalent Tier 1 coverage would have cost the bank to achieve manually, which makes them the apples-to-apples baseline for the Galtea figures. Tier 1 coverage is the standard agreed with the bank as the KPI for the PoC, and the manual figures are conservative estimates against that bar.

The solution

Specification-driven evaluation, run continuously in CI

The bank replaced manual sampling with a continuous evaluation pipeline built on Galtea, holding to Tier 1 coverage on every cycle. Instead of writing tests case by case, the team encoded each assistant's expected behaviour as a specification: capabilities it must cover, inabilities it must refuse, policies it must follow, and boundaries it must respect.

From those specifications, Galtea generated the evaluation datasets and metrics automatically. The team reviewed the generated tests, extended them with a small set of bank-specific edge cases, and wired the pipeline into their CI/CD through the Python SDK. Every deployment candidate now scores against thousands of auto-generated tests before it ships. Conversations are traced end-to-end with the @trace decorator, so when an evaluation flags a regression, the team pulls the full agent execution, not just the final response.
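The idea behind tracing every step, rather than only the final response, can be illustrated with a small stand-alone sketch. This is not the Galtea SDK's `@trace` implementation: `trace_step`, `TRACE_LOG`, and the two stubbed agent steps are hypothetical names invented for the example, and a real pipeline would send records to the platform instead of an in-memory list.

```python
import functools
import time

# Hypothetical in-memory sink; stands in for wherever traces are really stored.
TRACE_LOG: list[dict] = []


def trace_step(func):
    """Record inputs, output (or error) and duration of each call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        record = {"step": func.__name__, "args": args, "kwargs": kwargs}
        try:
            record["output"] = func(*args, **kwargs)
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every step leaves a record.
            record["seconds"] = time.perf_counter() - start
            TRACE_LOG.append(record)
    return wrapper


@trace_step
def retrieve_policy(question: str) -> str:
    # Stubbed retrieval step (assumption: a real agent would query a knowledge base).
    return "risk-disclosure policy"


@trace_step
def answer(question: str) -> str:
    policy = retrieve_policy(question)
    return f"Answer drafted under the {policy}."


answer("Is this fund safe?")

# Inner steps finish first, so they are logged before the outer call.
steps = [record["step"] for record in TRACE_LOG]
print(steps)
```

When an evaluation flags the final answer, a log like this shows which intermediate step produced the problem.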

The same pipeline runs LLM-as-a-judge for the qualitative dimensions that matter in a regulated banking context: tone, empathy, policy adherence, refusal behaviour on out-of-scope finance questions. Judges are calibrated with custom rubrics that encode the bank's own definition of correct behaviour, and judge agreement is reviewed monthly against sampled outputs to keep calibration drift in check.

The results · at Tier 1 coverage

71% lower

Cost of the first iteration at Tier 1 coverage. Subsequent cycles cost ~40% less (~€25K manual → ~€15K Galtea).

Cost per iteration · Tier 1 coverage

First iteration: ~€51K → ~€15K

Each cycle after: ~€25K → ~€15K

Manual at equivalent coverage vs Galtea. Same scope, same pass criteria.

Payback

6× ROI

In under two months, projected across three use cases at Tier 1 coverage — ~€108K saved across the three first iterations alone, plus ~€10K saved on every cycle after that.
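The percentages and savings quoted above follow directly from the three per-iteration figures. The short calculation below reproduces them as a check; the variable names are illustrative. The 6× ROI cannot be reproduced this way, because the platform cost it is measured against is not stated in this story.

```python
# Approximate figures quoted in this case study, in EUR.
MANUAL_FIRST = 51_000      # manual, first iteration at Tier 1 coverage
MANUAL_REPEAT = 25_000     # manual, each cycle after the first
GALTEA_PER_CYCLE = 15_000  # Galtea, every cycle
USE_CASES = 3


def pct_reduction(before: float, after: float) -> float:
    """Percentage cost reduction when going from `before` to `after`."""
    return (before - after) / before * 100


first_saving = MANUAL_FIRST - GALTEA_PER_CYCLE    # saving on one first iteration
repeat_saving = MANUAL_REPEAT - GALTEA_PER_CYCLE  # saving on each later cycle

first_pct = round(pct_reduction(MANUAL_FIRST, GALTEA_PER_CYCLE))
repeat_pct = round(pct_reduction(MANUAL_REPEAT, GALTEA_PER_CYCLE))
first_iterations_total = first_saving * USE_CASES

print(f"First iteration: {first_pct}% lower, EUR {first_saving:,} saved")
print(f"Each cycle after: {repeat_pct}% lower, EUR {repeat_saving:,} saved")
print(f"Three first iterations: EUR {first_iterations_total:,} saved")
```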


“Manual validation always had a ceiling. We hit it at the first use case. Automating the pipeline let us past it — by the third use case, the savings had paid for the platform many times over.”

AI Platform Lead · Tier 1 Spanish bank

What made it possible

Three design choices that did most of the work

Specifications, not ad-hoc test cases

The bank stopped writing individual tests and started writing behavioural specifications — capabilities ("explain a product in terms a retail customer understands"), inabilities ("do not give personalised financial advice"), policies ("always include risk disclosure on investment products"), and boundaries ("refuse to estimate future market performance"). Tests are generated from the spec, so coverage scales with the spec instead of with QA hours — the whole reason Tier 1 coverage stops being prohibitively expensive.
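As a rough illustration of why coverage scales with the specification, the sketch below models a spec as data and derives test stubs from it. `Rule`, `Specification`, and `test_prompts` are hypothetical names, not Galtea's schema or API, and the "generation" is a simple placeholder for the real dataset generation step. The four rule statements are the examples given above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    kind: str       # "capability", "inability", "policy" or "boundary"
    statement: str  # the expected behaviour, in plain language


@dataclass
class Specification:
    assistant: str
    rules: list[Rule] = field(default_factory=list)

    def by_kind(self, kind: str) -> list[Rule]:
        return [rule for rule in self.rules if rule.kind == kind]

    def test_prompts(self, variants_per_rule: int) -> list[str]:
        """Placeholder generation: one stub per (rule, variant) pair."""
        return [
            f"[{rule.kind}] {rule.statement} (variant {i + 1})"
            for rule in self.rules
            for i in range(variants_per_rule)
        ]


spec = Specification(
    assistant="retail-products-assistant",  # hypothetical assistant name
    rules=[
        Rule("capability", "explain a product in terms a retail customer understands"),
        Rule("inability", "do not give personalised financial advice"),
        Rule("policy", "always include risk disclosure on investment products"),
        Rule("boundary", "refuse to estimate future market performance"),
    ],
)

tests = spec.test_prompts(variants_per_rule=3)
print(len(tests))  # 4 rules x 3 variants
```

Adding a fifth rule to the spec adds three more tests without anyone writing them by hand, which is the scaling property described above.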

Evaluations run in CI, not after the fact

Before this change, QA happened after the model was handed off for deployment. Now every candidate build triggers evaluations.run() against the full specification. Failing runs block the release the same way a failing unit test would. The feedback loop shortened from weeks to hours, which is the single biggest contributor to the cost reduction — it removed the hand-off queue, not just the manual labour inside it.
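A release gate of this kind can be sketched in a few lines. The example assumes the evaluation run has already produced per-test pass/fail results; `EvalResult`, `gate`, and the 100% default threshold are hypothetical and do not describe the Galtea SDK or the bank's actual pass criteria.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    test_id: str
    passed: bool


def gate(results: list[EvalResult], min_pass_rate: float = 1.0) -> int:
    """Return a process exit code: 0 allows the release, 1 blocks it."""
    if not results:
        return 1  # no evidence is treated as a failure, not as a pass
    pass_rate = sum(r.passed for r in results) / len(results)
    failed = [r.test_id for r in results if not r.passed]
    print(f"pass rate {pass_rate:.1%}, failing tests: {failed}")
    return 0 if pass_rate >= min_pass_rate else 1


# Stand-in results; in a real pipeline these would come from the evaluation run.
sample = [
    EvalResult("refuses-personal-advice", True),
    EvalResult("includes-risk-disclosure", False),
]
exit_code = gate(sample)

# In a CI job the script would end with: raise SystemExit(exit_code)
# A non-zero exit code fails the job, just as a failing unit test would.
```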

Judges calibrated for the bank, not for the internet

Generic LLM-judge prompts don't know what counts as a compliant refusal in a Spanish retail-banking context. The team wrote custom rubrics, grounded in the bank's compliance guidelines, and reviewed judge agreement on sampled outputs each month. That kept judge drift visible, and it caught regressions that off-the-shelf judges missed, including a set of false-positive approvals that would have shipped under the old, lower-coverage manual process.
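One common way to quantify judge agreement on a sampled batch is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The story does not say which statistic the bank used, so the choice of kappa is an assumption, and the labels below are made-up illustrative data.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    labels = judge_freq.keys() | human_freq.keys()
    # Probability that both raters pick the same label independently.
    expected = sum(judge_freq[label] * human_freq[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative monthly sample: LLM-judge verdicts vs. compliance reviewer verdicts.
judge_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human_labels = ["pass", "fail", "fail", "pass", "fail", "fail", "pass", "pass"]

kappa = cohens_kappa(judge_labels, human_labels)
print(f"kappa = {kappa:.2f}")  # 7 of 8 agree, chance agreement 0.5, so kappa is 0.75
```

Tracking a figure like this month over month makes drift visible: a falling kappa signals that the judge and the reviewers are diverging and the rubric needs attention.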

Why it matters

Manual validation is a first-use-case strategy — and a sampled-coverage one

Most AI-validation programmes inside banks are sized for one model, one assistant, one use case — and they hold the per-iteration invoice down by sampling, not by covering. Both of those compromises break the moment a second or third use case enters the pipeline, which is where most teams will be inside the next 18 months.

Automating the pipeline at Tier 1 coverage is not a nice-to-have once a team is past the first deployment. It is the difference between a linear cost curve and a flat one, and it is the difference between shipping the third assistant at full coverage and parking it on the roadmap. The €36K saved on the first iteration — and ~€10K on every cycle after — is not the headline. The headline is that the next two use cases landed at Tier 1 coverage without renegotiating the QA budget.

See what Tier 1 coverage at this cost curve could do for your AI programme

Our team will walk you through the specification → tests → evaluation → analysis loop, mapped to the use cases you are already running.

Talk to the team →