Golden datasets for regulated AI: six Q&A frameworks tested

We benchmarked six Q&A generation frameworks (DeepEval, Giskard, LangChain, LlamaIndex, RAGAS, Galtea) on the same gpt-4.1 against the same calibrated judges. Confidence scores spread 17.8 points across an otherwise controlled comparison. The framework with the highest diversity score, RAGAS at 0.723, came last on confidence at 0.760, because its "diversity" was inflated by malformed and hypothetical questions with no source-document ground truth. Giskard wrote English questions from Spanish sources 46% of the time. For regulated or multilingual deployments, validity and language fidelity decide more than phrasing variety. The simplest pre-ship test stays the same: read 30 of the generated questions before you ship.

Most teams pick a Q&A generator the same way they pick a code formatter: run it, eyeball the output, ship the dataset. That works right up until the eval pipeline starts flagging failures three weeks into production and it turns out half the failures are questions the framework generated in English from Spanish source material, or hypothetical scenarios with no ground truth in the documents, or sentences mangled badly enough that no real user would type them. The regression suite is testing the wrong thing, and nobody notices until someone spends an afternoon reading the test data.

The benchmark below put six frameworks (DeepEval, Giskard, LangChain, LlamaIndex, RAGAS, Galtea) against six enterprise documents in English and Spanish. Every framework ran on the same gpt-4.1 at temperature 0, with the same calibrated judges and the same per-document targets, so whatever shows up in the numbers is about the framework, not the model. What follows is the trade-off map and a recommendation for each framework. The reproduction methodology (target formula, chunking details, judge calibration) is in the companion piece on synthetic data generation tools.

What is a golden dataset?

A golden dataset is the ground truth a production AI or LLM system is evaluated against. In machine learning more broadly, the term covers labeled validation corpora, regression-test references, and human-curated benchmark sets. For LLM evaluation in enterprise applications (regulatory review, customer support, document understanding), the dataset has three non-negotiable properties.

Every question must be answerable from the source document. If the document does not contain the information, the question cannot distinguish a retrieval system that found the right passage from one that did not. You end up measuring the model's ability to fabricate, not its ability to retrieve.

The expected answer must be unambiguous and deterministic. If two domain experts write different answers, the pass-fail test becomes judgment-dependent. That kills reproducibility, which is the entire point of regression testing.

The dataset must not introduce false negatives from malformed queries. A question a real user would never send will fail in evaluation, not because the system is broken, but because the test is noise.

Hypothetical questions, grammatically loose queries, and the occasional language slip are not inherently broken inputs. Real users misspell, ask "what if," and code-switch, and a synthetic generator that produces a small share of each is mirroring real behavior. What separates a useful generator from a noisy one is calibration. The frameworks producing these forms in this benchmark push them well past the realistic envelope: a cluster of fully malformed sentences stops being user noise and becomes randomness, and a hypothetical with no anchor in the source document stops testing reasoning and starts testing fabrication. The dataset's diversity score climbs while its usefulness collapses, and every one of those questions registers as an eval failure the moment a regression suite touches it.

What this benchmark does not cover

The benchmark ran on gpt-4.1 at temperature 0. Results on Claude Opus 4.6 or GPT-5.2 may differ, particularly on validity dimensions where stronger reasoning models can rescue some of the hypothetical-question problem post-generation. Six documents across two languages is a useful spread across document types (regulatory, legal, financial, corporate, economic), but it is not a claim about every domain. Scientific-paper-shaped sources (dense citation graphs, equation-heavy passages) are out of scope. Diversity was measured lexically and structurally, not cognitively; Bloom's Taxonomy complexity and document coverage are not captured here.

These datasets are scoped to evaluation, not training. Noise that would be a liability in a regression suite can be a regularizer in fine-tuning data, so do not port these conclusions to training-dataset selection. A framework that produces too-similar factoids for a golden dataset may produce perfectly useful variation for a supervised fine-tuning corpus.

The benchmark also isolated Galtea's Q&A generator from the rest of the Galtea platform. That is a deliberate choice to keep the comparison fair against standalone libraries. In practice, the versioning, metrics library, and evaluation workflow around the generator are part of why it fits regulated-industry deployments. The generator is a component; the platform is what you deploy into.

How each framework approaches Q&A generation

Calling all six of these "Q&A generators" flattens real architectural differences. A 30-second mental model of each, before the scores:

  • DeepEval runs a structured pipeline that extracts context via ChromaDB-backed embeddings, then generates Q&A pairs designed for RAG benchmarking. Output skews toward broader question-type coverage: procedural, comparative, causal.
  • Giskard exposes RAGET, a business-oriented testing framework. It builds a knowledge base from document chunks and generates traceable Q&A aimed at validation workflows rather than pure benchmarking.
  • LangChain uses LCEL (LangChain Expression Language) to compose a Q&A pipeline with pluggable chunking and a pluggable LLM backend. Composability is the point; the operator makes more of the decisions.
  • LlamaIndex is retrieval-driven. It builds document nodes, runs a vector search, and generates questions grounded in indexed content with source attribution baked in.
  • RAGAS uses SingleHopSpecificQuerySynthesizer, an evolution-based mutator that generates diverse question variations through iterative reasoning-depth and complexity shifts. The design assumption is that the operator filters the output before use.
  • Galtea is the outlier. Q&A generation is one component of an evaluation platform, constrained by a product specification that encodes the system's capabilities, inabilities, security boundaries, and policies. For this benchmark, the generator was isolated and run head-to-head with the five standalone libraries to keep the comparison fair.

The benchmark, condensed

Six documents across two languages:

Document                                      | Domain               | Language
EBA Keynote Speech (IIF Colloquium 2024)      | Banking regulation   | English
Apple Card Customer Agreement (Q3 2025)       | Financial products   | English
JPMorgan Chase Corporate Data (2024)          | Corporate governance | English
Commercial Transportation Agreement           | Legal contracts      | English
Banco de España: Spanish Economic Growth      | Economic analysis    | Spanish
Banco de España: EU–Mercosur Trade Agreement  | Trade policy         | Spanish

Four things were held constant so the comparison actually measured the framework: the model (gpt-4.1), the temperature (0), the calibrated judges, and the per-document targets. Every library ran with its default recommended prompt (no custom tuning; the point is to test what you get out of the box, not how well an operator can paper over a bad default).

Eight quality dimensions per dataset. Seven are per-pair: fluency, clarity, conciseness, context consistency, contextual answerability, answer consistency, and language consistency. The eighth is a dataset-level composite diversity score built from Self-BLEU, Distinct-2, semantic similarity, and question-type entropy. Two of the pair-level dimensions are validity gates: if a question is not answerable from the source, or the answer does not resolve the question, the Q&A pair is unusable, regardless of how fluent it looks.
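For intuition about how the composite behaves, here is a minimal Python sketch (not the benchmark's actual scoring code) of two of its ingredients, Distinct-2 and question-type entropy, computed over a list of generated questions. The question-type labels are assumed to come from whatever classifier sits upstream.

```python
import math
from collections import Counter

def distinct_2(questions: list[str]) -> float:
    """Share of unique bigrams across all questions; higher reads as more lexical variety.
    Note that misspellings mint novel bigrams, so noise inflates this number."""
    bigrams = []
    for q in questions:
        tokens = q.lower().split()
        bigrams.extend(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0

def type_entropy(question_types: list[str]) -> float:
    """Shannon entropy of question-type labels (factual, procedural, comparative, ...),
    normalized to [0, 1]."""
    counts = Counter(question_types)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(len(counts)) if len(counts) > 1 else 0.0

print(distinct_2([
    "Who is Lori A. Beer?",
    "Whos mellodi hobbon n wats she do on the bored off directers?",  # malformed, yet every typo is a novel bigram
]))
print(type_entropy(["factual", "factual", "procedural", "comparative"]))
```

Self-BLEU and embedding-based semantic similarity, the other two ingredients, need an n-gram overlap metric and an embedding model respectively and are omitted from the sketch.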

The judges themselves were calibrated against QGEval, a human-annotated benchmark for question generation. Accuracy against human labels ran from 0.82 (answer consistency) to 0.95 (conciseness). Perfect agreement is unreachable (human annotators disagree with each other a fair amount on the semantic dimensions), but identical judges scored every framework's output, so framework-versus-framework comparisons hold up.

The headline result

Dimension                 | Galtea | LangChain | DeepEval | LlamaIndex | Giskard | RAGAS
Fluency                   | 1.000  | 1.000     | 1.000    | 1.000      | 1.000   | 0.571
Clarity                   | 0.993  | 0.936     | 0.976    | 1.000      | 0.981   | 0.638
Conciseness               | 1.000  | 1.000     | 1.000    | 1.000      | 0.973   | 0.830
Context consistency       | 1.000  | 1.000     | 0.997    | 1.000      | 1.000   | 1.000
Contextual answerability  | 0.936  | 0.975     | 0.847    | 0.710      | 0.939   | 0.640
Answer consistency        | 0.967  | 0.957     | 0.945    | 0.926      | 0.815   | 0.837
Language consistency      | 0.980  | 0.845     | 0.795    | 0.921      | 0.542   | 0.836
Diversity                 | 0.626  | 0.630     | 0.656    | 0.632      | 0.690   | 0.723
Confidence score          | 0.938  | 0.918     | 0.902    | 0.898      | 0.867   | 0.760

Fluency and conciseness saturate near 1.0 for every framework except RAGAS, and context consistency saturates for all six. At gpt-4.1 temperature 0, those dimensions measure the underlying model, not the framework. Ignore anyone who compares Q&A generators on fluency alone.

The dimensions that actually separate the field are answer consistency (Giskard 0.815 to Galtea 0.967), language consistency on Spanish documents (Giskard 0.542 to Galtea 0.980), and contextual answerability (RAGAS 0.640 to LangChain 0.975). These are the validity and fidelity metrics, and they determine whether your golden dataset is usable.

One more observation before the qualitative failures. The framework that won the diversity score (RAGAS, 0.723) finished last overall at 0.760 confidence, a 17.8-point gap to Galtea. That inversion is the central finding of the benchmark, and the rest of this piece is about why.

The diversity trap

Diversity metrics reward lexical and structural variation. In principle, varied phrasing is valuable because it exposes the system under test to different surface forms. In practice, the cheapest way to inflate Self-BLEU and Distinct-2 is to produce noise.

RAGAS: diversity via malformed questions

RAGAS generates question variations through iterative mutation. The architecture assumes post-generation filtering is part of the workflow, and the docs are explicit about this. Without a filter step, the output looks like this:

"Whos mellodi hobbon n wats she do on the bored off directers?"
"Who Lori A. Beer be?"
"What UE do?"
"why they say investments an actions need like ramping up everywhere cuz of them Paris Agreement goals"

Every one of these inflates Distinct-2. A misspelling is, mechanically, a novel bigram. Every one also lowers Self-BLEU, because no two malformed questions phrase things the same way. The diversity score rewards noise. The downstream evaluator uses the noise as the test. A retrieval system might correctly parse "Who is Lori A. Beer?" and return an accurate answer, but because the eval harness compares the response to a reference generated from the mangled question, the pass rate drops. Your team pages someone. The bug is in the test data.

This is not a criticism of RAGAS as a library (it is a well-designed tool for producing candidate pools that a curation step filters). It is a warning about using it as if it produces final datasets. Wiring RAGAS straight into CI without a filter step is holding the tool wrong.
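What a minimal filter step could look like, as a sketch rather than anything RAGAS ships: the pair format, prompt wording, and function names below are assumptions, and any judge-capable LLM client would do in place of the OpenAI call.

```python
from openai import OpenAI  # pip install openai; any judge-capable LLM client works

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

WELL_FORMED_PROMPT = (
    "You are reviewing synthetic evaluation questions. Answer YES only if the question "
    "is grammatically well formed and reads like something a real user would type. "
    "Answer NO otherwise. One word."
)

def is_well_formed(question: str, model: str = "gpt-4.1") -> bool:
    """Cheap LLM-judge gate that drops mangled candidates before any answerability check."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": WELL_FORMED_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

def filter_candidates(pairs: list[dict]) -> list[dict]:
    """Keep only candidate pairs whose question passes the gate (field name assumed)."""
    return [p for p in pairs if is_well_formed(p["question"])]
```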

DeepEval: diversity via hypothetical questions

DeepEval's output is grammatically clean. Its diversity comes from broader question-type coverage: 20.3% procedural questions (versus 10.4% for Galtea), 7.4% comparative (versus 1.7%), and 3.0% causal (versus 0.4%). That is a legitimate diversity axis. Some of the output, however, is not answerable from the source document:

"If Nestlé S.A. faced a global crisis, how might its Chairman, Paul Bulcke, respond?"
"If JPMorgan Chase expanded into a new continent, which nationalities/orgs might lead its regional team?"
"Who succeeded Carlo Messina as CEO of Intesa Sanpaolo, if any, after 2025?"
"Si la UE duplicara exportaciones a Mercosur, ¿qué sectores europeos crecerían más y por qué?"

These are speculative queries. They are useful inputs for testing a model's reasoning capability, which is sometimes what you want. They are the wrong inputs for a RAG evaluation, because a RAG evaluation asks "did the retrieval find the right passage?" and for a hypothetical there is no right passage. The model either fabricates or refuses, and either outcome is noise, not signal. DeepEval's 0.847 contextual answerability, the third-lowest in the benchmark, is these questions showing up in the aggregate.

The honest framing

Lexical and type diversity have value when the dataset is for training or for a model reasoning benchmark. They are a liability when the dataset is for deterministic regression testing of a production system. The choice of where to sit on the precision-diversity trade-off is the most consequential architectural decision a Q&A generator makes, and most teams pick a generator without realizing the trade-off is happening.

Language drift is the first thing to filter on

Two of the six documents were in Spanish. The language-consistency numbers produced the widest spread in the entire benchmark:

Framework  | Language consistency (Spanish docs)
Galtea     | 0.980
LlamaIndex | 0.921
LangChain  | 0.845
RAGAS      | 0.836
DeepEval   | 0.795
Giskard    | 0.542

Giskard writes English-language questions from Spanish source documents almost half the time. Feed a Banco de España trade-policy paper to its default configuration and roughly 46% of the questions come back in English. A Spanish-serving retrieval system will correctly decline to answer them, or worse, answer them in the wrong language, and every one of those responses registers as an eval failure. At that point the dataset is a language-detector test wearing a trench coat.

DeepEval (0.795) and LangChain (0.845) are less broken, but neither is safe if your production system serves non-English speakers. A framework below ~0.95 on this dimension introduces more noise than signal. Below ~0.80, the benchmark is not measuring your system's Spanish performance at all. It is measuring whatever combination of Spanish and English ended up in your Q&A pairs.

There is no workflow around 46% language drift. You either fix the generator or throw the dataset out. That disqualifies four of six frameworks before any other trade-off matters.

Why the spread is this wide

Every framework in this benchmark ran on the same gpt-4.1 at temperature 0. The judges were the same. The per-document targets were the same. The spread in confidence scores (Galtea 0.938, RAGAS 0.760, a 17.8-point gap) is entirely architectural.

That spread reflects who each framework was designed for.

Open-source libraries in this space optimize for generality. RAGAS is built for research pipelines where a filter step is assumed. DeepEval is built for RAG benchmarking where broad question-type coverage is a feature. LangChain is a composable toolkit assuming the operator will tune chunking and prompts. LlamaIndex is index-first, so questions are grounded in retrieved nodes rather than the raw source. Giskard is a traceability-oriented testing framework aimed primarily at English business validation flows. None of these are bad designs. They are designs aimed at different primary users, and each one optimizes cleanly for the user it has.

Galtea's generator sits inside a specification-driven evaluation platform. The core concept is the Specification, which encodes a product's capabilities, inabilities, security boundaries, and policies. Q&A generation is constrained by that specification: questions must be answerable from the documents the spec binds to, the expected answer has to be deterministic, and the output has to preserve the language of the source. Those constraints exist because the teams using Galtea are evaluating regulated, multilingual production systems where unverifiable test data is not an option, and the feedback loop between those teams and the generator's design has been tight.

The argument here is narrower than "commercial beats open source." Architectural choices reflect who gives feedback on the output. A generator whose users run a post-generation filter will tolerate noise. A generator whose users feed the output directly into a regulated production evaluation pipeline will not. Both are valid design trajectories. They produce different tools for different contexts. If your context is regulated, multilingual enterprise AI, the architectural trajectory that includes validity-gated generation and language preservation will fit your constraints better than one optimized for research-style candidate pools.

When each framework is the right choice

No framework wins every axis. A short, honest recommendation matrix, drawn from the benchmark data:

  • Pick RAGAS if you are doing research-style dataset generation and you have an automated filter or human review step between generation and use. It is the strongest option for producing a large candidate pool where you will keep 20% to 40% of the output. 
  • Pick DeepEval if your evaluation is RAG benchmarking with broader question-type coverage (procedural and comparative questions matter to you), your source documents are in English, and your test harness can tolerate hypothetical questions. Its ChromaDB-backed context extraction is the cleanest of the open-source five at RAG-shaped evaluation targets.
  • Pick LangChain if you want a composable pipeline that fits into an existing LCEL stack and you are willing to tune chunking and prompts yourself. LangChain scored second overall (0.918) and highest on contextual answerability (0.975). It rewards operator investment more than the others.
  • Pick LlamaIndex if your pipeline needs source-attributed, index-grounded questions by construction. The 0.710 contextual answerability is a weak point (retrieval-driven generation can latch onto irrelevant nodes), but if downstream evaluation requires passage-level citation, this is the cleanest fit.
  • Pick Giskard if your validation is English-only and business-process-oriented, and traceability to the source chunk is a hard requirement. The 0.542 Spanish language consistency is a stop sign for anything multilingual.
  • Pick Galtea if you need verifiable, production-ready evaluation datasets for regulated or multilingual deployments, and you want the generator's output constrained by a product specification rather than curated after the fact. The 0.938 confidence is driven by top-tier answer consistency (0.967) and language consistency (0.980). The trade-off is a lower lexical diversity score (0.626). If your evaluation needs phrasing variety more than it needs validity guarantees, supplement Galtea's output with a diversity-first tool.

What to do before your next golden dataset ships

Aggregate quality scores hide architectural trade-offs. The six frameworks in this benchmark occupy different positions on the same axis: factual precision and multilingual fidelity at one end, lexical diversity and question-type spread at the other. Every framework picks a corner. None wins every dimension.

If you are building a golden dataset for a regulated or multilingual production system, the sequence is: disqualify on language consistency if you serve non-English customers (Giskard, DeepEval, and LangChain all fall here at the thresholds that matter), then disqualify on contextual answerability if your eval is retrieval-grounded (LlamaIndex falls here), then decide whether you want a candidate-pool tool with a curation step (RAGAS) or a validity-gated specification-aligned generator (Galtea). Diversity is the last tiebreaker, not the first.

Whichever you pick, pressure-test before shipping. Two automated checks, under an hour combined.

Run a contextual answerability judge against every Q&A pair. For each pair, feed the source document and the question to a calibrated LLM judge and ask whether the document contains a direct answer. This is the single cheapest validity test, and the one that separates the frameworks in this benchmark most decisively.
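A sketch of that judge, following the same pattern as the filter sketch earlier: the prompt wording, field names, and model choice are illustrative, not the benchmark's calibrated judge.

```python
from openai import OpenAI  # any judge-capable LLM client works

client = OpenAI()

ANSWERABILITY_PROMPT = (
    "You will be given a source document and a question. Answer YES only if the "
    "document contains a direct answer to the question, and NO otherwise. One word."
)

def is_answerable(document: str, question: str, model: str = "gpt-4.1") -> bool:
    """Validity gate: a question with no answer in the source cannot test retrieval."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": ANSWERABILITY_PROMPT},
            {"role": "user", "content": f"Document:\n{document}\n\nQuestion:\n{question}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

def answerability_rate(pairs: list[dict]) -> float:
    """Share of Q&A pairs that pass; each pair is assumed to carry its source chunk."""
    passed = sum(is_answerable(p["source"], p["question"]) for p in pairs)
    return passed / len(pairs) if pairs else 0.0
```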

Run a language-detection pass on every question and answer field from non-English source documents. fasttext or langdetect handles this in seconds. If more than 5% of questions from a non-English document have drifted to English, throw the dataset out or fix the generator.
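A sketch of that pass with langdetect; fasttext's language-ID model would slot in the same way, and the field names and example pairs below are placeholders.

```python
from langdetect import DetectorFactory, detect  # pip install langdetect

DetectorFactory.seed = 0  # langdetect is stochastic by default; pin it for reproducibility

def language_drift_rate(pairs: list[dict], expected_lang: str = "es") -> float:
    """Fraction of Q&A pairs where the question or the answer left the source language."""
    drifted = sum(
        1 for p in pairs
        if detect(p["question"]) != expected_lang or detect(p["answer"]) != expected_lang
    )
    return drifted / len(pairs) if pairs else 0.0

# Placeholder pairs to show the shape of the check; real input is the generated dataset.
dataset = [
    {"question": "¿Qué sectores cubre el acuerdo UE-Mercosur?",
     "answer": "El acuerdo cubre bienes agroalimentarios e industriales."},
    {"question": "Which sectors does the agreement cover?",
     "answer": "Agri-food and industrial goods."},  # drifted to English
]

if language_drift_rate(dataset, expected_lang="es") > 0.05:
    print("Language drift above 5%: fix the generator or regenerate the dataset.")
```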

Galtea runs both checks automatically as part of the evaluation pipeline: generated datasets are scored against the same eight dimensions described above, and failing items are flagged before they land in a regression suite. If you are running your own pipeline, the relevant reference in the docs is the specification-driven evaluations tutorial, which walks through how validity gates are encoded in a specification and applied at generation time.

And read 30 of the generated questions before you ship. Benchmarks compress what direct inspection exposes in five minutes.
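If you want that habit encoded rather than remembered, a few lines do it; the JSONL layout, field names, and file path here are assumptions about your export format.

```python
import json
import random

def sample_for_review(path: str, n: int = 30) -> None:
    """Print a random sample of generated Q&A pairs for a five-minute human read."""
    with open(path, encoding="utf-8") as f:
        pairs = [json.loads(line) for line in f]
    for pair in random.sample(pairs, min(n, len(pairs))):
        print(f"Q: {pair['question']}\nA: {pair['answer']}\n")

sample_for_review("golden_dataset.jsonl")  # hypothetical export path
```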

Duarte Moura
AI Engineer

Hi, I’m Duarte, an AI Engineer based in Madrid, Spain. I work on safe AI validation at Galtea and have a strong interest in machine learning applications in healthcare.
