The core craft of measuring LLM performance: metric design, test data generation, LLM-as-a-judge methods, evaluation frameworks, and how to translate model behavior into business confidence.
Most teams building LLM evaluation pipelines spend a lot of time on the judge itself: which model to use, how to write the rubric, and which dimensions to score. Almost none of that effort goes into checking whether the judge is actually right.
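One minimal sketch of what that check can look like: run the judge over a small set of outputs that humans have already labeled and measure how often it agrees. Everything below is hypothetical, not a prescribed pipeline: the `judge` stub stands in for a real LLM call, and the labeled set stands in for your own annotation pass.

```python
from collections import Counter

# Hypothetical human-labeled validation set: (model_output, human_verdict).
labeled = [
    ("The capital of France is Paris.", "pass"),
    ("The capital of France is Lyon.", "fail"),
    ("Paris has been France's capital since 508 AD.", "pass"),
]

def judge(output: str) -> str:
    """Stand-in for an LLM judge; returns 'pass' or 'fail'.
    Replace with a real API call plus your rubric prompt."""
    return "pass" if "Paris" in output and "Lyon" not in output else "fail"

# Agreement rate: how often the judge matches the human verdict.
verdicts = [(judge(out), gold) for out, gold in labeled]
agreement = sum(j == g for j, g in verdicts) / len(verdicts)

# Break disagreements down by direction: a judge that passes bad
# outputs is usually costlier than one that fails good ones.
errors = Counter(f"judge={j},human={g}" for j, g in verdicts if j != g)

print(f"agreement: {agreement:.0%}")
print("disagreements:", dict(errors))
```

Raw agreement is the bluntest version of this; with more labels, a chance-corrected statistic like Cohen's kappa gives a fairer picture.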
Three approaches to test-case generation for GenAI systems: red teaming with curated attack databases, gold-standard generation for RAG and tool calling, and synthetic-user simulation for multi-turn conversations. With concrete examples for each.
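To make the second approach concrete, here is one way gold-standard generation for RAG can work, as a sketch under stated assumptions: a generator model writes a question whose answer is fully contained in a source chunk, and the resulting (question, answer, chunk) triple becomes a ground-truth test case. The `call_llm` stub and `make_gold_case` helper are hypothetical names, not part of any particular framework.

```python
import json

# Prompt asking the generator model for a question the passage fully
# answers, plus the answer, as JSON. Doubled braces escape .format().
GEN_PROMPT = """Given the passage below, write one question that the passage
fully answers, and the answer. Respond as JSON:
{{"question": "...", "answer": "..."}}

Passage:
{chunk}"""

def call_llm(prompt: str) -> str:
    # Stand-in for a real model client; returns a canned response here
    # so the sketch runs end to end.
    return json.dumps({"question": "What is the capital of France?",
                       "answer": "Paris"})

def make_gold_case(chunk: str) -> dict:
    qa = json.loads(call_llm(GEN_PROMPT.format(chunk=chunk)))
    # Keep the source chunk so the eval can score both retrieval
    # (did the right chunk come back?) and generation (does the
    # produced answer match the gold answer?).
    return {"question": qa["question"],
            "expected_answer": qa["answer"],
            "source_chunk": chunk}

print(make_gold_case("Paris is the capital of France."))
```

Because the answer is generated from the chunk rather than from the model's parametric knowledge, the test case stays valid even as the underlying model changes.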
Benchmarks for Glider, FlowJudge, Phi-3.5-mini, Selene, GPT-4o, and Claude 3.5 Sonnet as judges across general rubrics and red-team safety tasks. Where small fine-tuned models hold up, where they don't, and what the latency and memory tradeoffs look like.
Most enterprise GenAI projects stall when they try to scale past the MVP. The reason is evaluation strategy. Here's the four-stage view of where teams get stuck, and what it takes to move past Stage 3.