Retail banking Spain AI validation · risk & coverage

12× more AI vulnerabilities caught before production — at Tier 1 coverage, for a Tier 1 Spanish bank

A Tier 1 Spanish bank replaced manual, sampled validation of its customer-facing AI assistants with a continuous evaluation pipeline at full Tier 1 coverage. The new pipeline surfaced 12× more vulnerabilities per validation cycle than the bank's previous, sampled process — failures that had been shipping silently into a regulated customer surface.

Industry
Retail banking
Region
Spain
Segment
Tier 1 (anonymised)
AI surface
Customer-facing assistants
Coverage
Tier 1 · 3 use cases

Sampled validation kept the invoice small — and the risk surface large

The bank's AI validation process was built the way most banks still build it: manually and sampled. QA engineers hand-wrote test cases. Domain experts reviewed a slice of conversations. The per-iteration invoice stayed manageable because coverage stayed shallow — and shallow coverage on a customer-facing assistant in a regulated market is a risk that doesn't appear on any line item.

The team knew the gaps were there. Refusals on out-of-scope finance questions weren't tested exhaustively. Tone and policy adherence were spot-checked, not enforced. The bank's own compliance rubric had dimensions the manual process simply could not exercise at the volume needed to call them validated.

Bringing the same manual process up to the coverage standard a Tier 1 deployment actually requires would have cost ~€51K for the first iteration of an assistant and ~€25K for every iteration after, per use case, with no ceiling as use cases multiplied. Coverage at that bar was not economically feasible manually. So the bank kept shipping at sampled coverage, and the vulnerabilities that lived outside the sample kept shipping with it.
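To make the scaling concrete, the manual cost at Tier 1 coverage can be worked through with the figures quoted in this story (~€51K for a first iteration, ~€25K for each later one, per use case). The sketch below is illustrative arithmetic only: the function name is hypothetical and the iteration counts are assumptions, not bank data.

```python
FIRST_ITERATION_EUR = 51_000  # quoted manual cost, first iteration, per use case
LATER_ITERATION_EUR = 25_000  # quoted manual cost, each later iteration, per use case


def manual_tier1_cost(use_cases: int, iterations: int) -> int:
    """Manual cost in EUR of holding Tier 1 coverage (hypothetical helper)."""
    if use_cases < 0 or iterations < 0:
        raise ValueError("use_cases and iterations must be non-negative")
    if iterations == 0:
        return 0
    # One first iteration plus (iterations - 1) later iterations, per use case.
    per_use_case = FIRST_ITERATION_EUR + LATER_ITERATION_EUR * (iterations - 1)
    return use_cases * per_use_case
```

For the three use cases in scope, `manual_tier1_cost(3, 1)` gives €153K, and assuming four iterations each, `manual_tier1_cost(3, 4)` gives €378K. The cost grows linearly in both use cases and iterations, which is the "no ceiling" effect described above.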

A note on the comparison

"12× more vulnerabilities" is measured against the bank's prior sampled manual process — i.e. what the bank was actually doing before Galtea. The €51K / ~€25K figures elsewhere in this story refer to what manual would have cost if held to Tier 1 coverage, which is the bar Galtea now meets at ~€15K per cycle. Two comparisons, one coverage standard.

Specification-driven evaluation that actually exercises the whole behaviour

The bank replaced manual sampling with a continuous evaluation pipeline built on Galtea, holding to Tier 1 coverage on every cycle. Instead of writing tests case by case, the team encoded each assistant's expected behaviour as a specification: capabilities it must cover, inabilities it must refuse, policies it must follow, and boundaries it must respect.

From those specifications, Galtea generated the evaluation datasets and metrics automatically — thousands of auto-generated tests per cycle, covering the policy, refusal, tone, and boundary surface that sampled review could not reach. The team reviewed the generated tests, extended them with a small set of bank-specific edge cases, and wired the pipeline into their CI/CD through the Python SDK. Every deployment candidate now scores against the full specification before it ships. Conversations are traced end-to-end with the @trace decorator, so when a vulnerability is flagged, the team pulls the full agent execution — not just the final response.
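The value of tracing the full agent execution, rather than only the final response, can be illustrated with a small stand-in. The `traced` decorator below is a hypothetical, in-memory sketch written for this example. It is not Galtea's `@trace` implementation, and the step names and keyword check are invented.

```python
import functools
import time
from typing import Any, Callable

TRACE_LOG: list[dict[str, Any]] = []  # in-memory stand-in for a trace store


def traced(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record step name, inputs, output, and duration of each call (illustrative only)."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        started = time.perf_counter()
        result = func(*args, **kwargs)
        TRACE_LOG.append({
            "step": func.__name__,
            "args": args,
            "kwargs": kwargs,
            "output": result,
            "seconds": time.perf_counter() - started,
        })
        return result

    return wrapper


@traced
def classify_intent(message: str) -> str:
    # Hypothetical step: a real assistant would call a model here.
    return "out_of_scope" if "invest" in message.lower() else "product_info"


@traced
def answer(message: str) -> str:
    intent = classify_intent(message)
    if intent == "out_of_scope":
        return "I can't give personalised financial advice."
    return "Here is how the product works."
```

After calling `answer(...)`, `TRACE_LOG` holds one entry per step in completion order, so a reviewer investigating a flagged response can see which intermediate decision led to it.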

The same pipeline runs LLM-as-a-judge for the qualitative dimensions that matter in a regulated banking context: tone, empathy, policy adherence, and refusal behaviour on out-of-scope finance questions. Judges are calibrated with custom rubrics that encode the bank's own definition of correct behaviour, and judge agreement is reviewed monthly against sampled outputs to keep calibration drift in check.

The results · at Tier 1 coverage
12× more
AI vulnerabilities surfaced per validation cycle, vs. the bank's prior sampled manual process, including a set of false-positive approvals that would have shipped under the old workflow.
Coverage
Tier 1
Capabilities, inabilities, policies, and boundaries exercised on every cycle — not sampled. Calibrated to the bank's own compliance rubric.
Cost per cycle at Tier 1
~€15K
Same coverage standard that would cost ~€51K for the first iteration and ~€25K thereafter to run manually — which is why nobody was running it.
Want to know what your assistant is shipping that your manual QA never sees? The Galtea team can walk you through the workflow.
Talk to the team →
“Manual validation always had a ceiling. We hit it at the first use case — and we didn't realise how much was getting through underneath it until we ran the specification end-to-end.”
AI Platform Lead · Tier 1 Spanish bank

Three design choices that turned coverage from a budget item into a default

01

Specifications, not ad-hoc test cases

The bank stopped writing individual tests and started writing behavioural specifications — capabilities ("explain a product in terms a retail customer understands"), inabilities ("do not give personalised financial advice"), policies ("always include risk disclosure on investment products"), and boundaries ("refuse to estimate future market performance"). Tests are generated from the spec, so coverage scales with the spec instead of with QA hours. That is the move that puts Tier 1 coverage on the table in the first place.
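One way to picture a behavioural specification is as structured data from which test stubs are derived. The sketch below is an assumption about shape only: the `BehaviourSpec` class, its field names, and the `test_stubs` helper are hypothetical and do not reflect Galtea's actual schema or test generator. The example statements are the ones quoted above.

```python
from dataclasses import dataclass

CATEGORIES = ("capabilities", "inabilities", "policies", "boundaries")


@dataclass(frozen=True)
class BehaviourSpec:
    """Hypothetical container for the four kinds of statement in a specification."""

    capabilities: tuple[str, ...] = ()
    inabilities: tuple[str, ...] = ()
    policies: tuple[str, ...] = ()
    boundaries: tuple[str, ...] = ()

    def test_stubs(self) -> list[dict[str, str]]:
        """One stub per statement; a real generator would expand each into many tests."""
        return [
            {"category": category, "statement": statement}
            for category in CATEGORIES
            for statement in getattr(self, category)
        ]


spec = BehaviourSpec(
    capabilities=("explain a product in terms a retail customer understands",),
    inabilities=("do not give personalised financial advice",),
    policies=("always include risk disclosure on investment products",),
    boundaries=("refuse to estimate future market performance",),
)
```

Adding a statement to the spec adds stubs without anyone hand-writing a test, which is the sense in which coverage scales with the spec rather than with QA hours.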

02

Evaluations run in CI, not after the fact

Before this change, QA happened after the model was handed off for deployment. Now every candidate build triggers evaluations.run() against the full specification. Failing runs block the release the same way a failing unit test would. Vulnerabilities get caught before they leave engineering — not after a customer encounters them and a compliance review surfaces them weeks later.
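The release-blocking behaviour can be sketched as a small gate that a CI job runs once evaluation scores are available. Everything here is assumed for illustration: the score dictionary, the 0.95 threshold, and the function names are hypothetical, and in practice the scores would come from the evaluation run rather than from hand-written values.

```python
import sys


def failing_dimensions(scores: dict[str, float], threshold: float) -> list[str]:
    """Return the dimensions whose score falls below the threshold, sorted by name."""
    return sorted(name for name, score in scores.items() if score < threshold)


def gate(scores: dict[str, float], threshold: float = 0.95) -> int:
    """Return a process exit code: 0 lets the release proceed, 1 blocks it."""
    failures = failing_dimensions(scores, threshold)
    for name in failures:
        print(
            f"BLOCKED: {name} scored {scores[name]:.2f} (< {threshold:.2f})",
            file=sys.stderr,
        )
    return 1 if failures else 0
```

A CI step would end with something like `sys.exit(gate(scores))`, so a non-zero exit code fails the job in the same way a failing unit test does.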

03

Judges calibrated for the bank, not for the internet

Generic LLM-judge prompts don't know what counts as a compliant refusal in a Spanish retail-banking context. The team wrote custom rubrics, grounded in the bank's compliance guidelines, and reviewed judge agreement on sampled outputs each month. That kept judge drift visible, and it caught regressions that off-the-shelf judges missed, including the false-positive approvals that would have shipped under the prior sampled process.
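An agreement review of this kind is often summarised with Cohen's kappa between the judge's verdicts and human reviewers' labels on the same sampled outputs. The source does not say which statistic the bank used, so the sketch below shows one common option, and the function name and label values are assumptions.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two label lists over the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    # Fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Tracking this value month over month is one way to make drift visible: a falling kappa on comparable samples suggests the judge and the human reviewers are diverging and the rubric needs attention.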

Sampled validation is a risk transfer — from QA to the customer

Most AI-validation programmes inside banks are sized to a budget, not to a coverage standard. When the budget runs out before coverage is complete, the residual risk doesn't disappear: it gets transferred to whoever happens to hit the untested behaviour first. In a regulated, customer-facing surface, that means a customer, an auditor, or a regulator.

The 12× number is not a marketing artefact of better tooling. It reflects the behaviours the bank's previous process was not exercising, because a sampled process could not reach them and exercising them manually at Tier 1 coverage wasn't affordable. Automating the pipeline didn't just lower the per-cycle cost; it made Tier 1 coverage the default, which is where the vulnerabilities had been hiding the whole time.

See what Tier 1 coverage would surface in your AI assistant

Our team will walk you through the specification → tests → evaluation → analysis loop, mapped to the use cases you are already running — and the vulnerabilities your current process isn't reaching.

Talk to the team →