Golden datasets for regulated AI: six Q&A frameworks tested
We benchmarked six Q&A generation frameworks (DeepEval, Giskard, LangChain, LlamaIndex, RAGAS, Galtea) with the same model (gpt-4.1) and the same calibrated judges. Confidence scores spread across 17.8 points in an otherwise controlled comparison. The framework with the highest diversity score, RAGAS at 0.723, came last on confidence at 0.760: its "diversity" was inflated by malformed and hypothetical questions with no ground truth in the source documents. Giskard wrote English questions from Spanish sources 46% of the time. For regulated or multilingual deployments, validity and language fidelity matter more than phrasing variety. The simplest pre-ship test stays the same: read 30 of the generated questions before you ship.
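The "read 30 questions" check is easy to make reproducible. A minimal sketch, assuming the generated dataset is a JSONL file of Q&A records (the file path and field layout here are illustrative, not from the benchmark):

```python
import json
import random


def sample_for_review(path: str, n: int = 30, seed: int = 0) -> list[dict]:
    """Draw a fixed-size random sample of generated Q&A records for manual review.

    Seeding the RNG keeps the review set stable across runs, so two reviewers
    read the same 30 questions.
    """
    with open(path, encoding="utf-8") as f:
        items = [json.loads(line) for line in f if line.strip()]
    rng = random.Random(seed)
    return rng.sample(items, min(n, len(items)))
```

Printing the sampled questions next to their source passages is usually enough to catch the failure modes above (malformed questions, missing ground truth, wrong-language output) before shipping.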
Duarte Moura · April 30, 2026 · 23 minutes