LLM Evaluation
The practice of measuring whether the output of an LLM-based system is correct, helpful, safe, and consistent, usually against a curated golden set.
Full definition
LLM evaluation is the discipline of measuring quality in AI systems where there is rarely a single right answer. A robust evaluation stack uses multiple layers: automated checks (regex, JSON validation, factual lookups), LLM-as-judge for subjective dimensions, human review on a sampled subset, and live user feedback. The output of all four layers feeds into a quality score that is tracked release over release. A "golden set" of 100–500 carefully labeled examples is the foundation of any production eval system. Without it, prompt iteration is guesswork.
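For illustration, here is a minimal sketch in Python of the automated-check layer and the release-over-release score it feeds. The golden set format, the specific checks, and the model_output callable are assumptions for the example, not a standard API.

```python
import json
import re

def check_json_valid(output: str) -> bool:
    """Automated check: does the output parse as JSON? (Only relevant when the task expects JSON.)"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def check_no_boilerplate(output: str) -> bool:
    """Automated check: reject stock refusals like 'As an AI language model...'."""
    return re.search(r"as an ai (language )?model", output, re.IGNORECASE) is None

def score_release(golden_set: list[dict], model_output) -> float:
    """Run every automated check over the golden set and return a pass rate
    that can be tracked release over release. `model_output` is assumed to be
    a callable that takes the example input and returns the system's output."""
    checks = [check_json_valid, check_no_boilerplate]
    results = [
        all(check(model_output(example["input"])) for check in checks)
        for example in golden_set
    ]
    return sum(results) / len(results)
```

In a full stack, this pass rate would be combined with LLM-as-judge scores, human review labels, and user feedback rather than standing alone.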
Frequently asked
What is a golden set?
A curated collection of input-output pairs labeled by experts and used as a benchmark for every prompt or model change.
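A minimal sketch of what one golden set record might look like, assuming a JSONL file with one labeled example per line; the field names and contents are illustrative, not a fixed schema.

```python
import json

# Illustrative golden-set record; field names are hypothetical.
golden_example = {
    "input": "Summarize the refund policy for annual plans.",
    "expected_output": "Annual plans are refundable within 30 days of purchase.",
    "labels": {"correct": True, "helpful": True, "safe": True},
    "annotator": "policy-expert-1",
}

def load_golden_set(path: str) -> list[dict]:
    """Load a JSONL golden set: one labeled input-output pair per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```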
Should you use LLM-as-judge?
Yes, for subjective dimensions, but always anchor the judge to human review on a sampled subset to catch judge drift.
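As a sketch of how that anchoring can work, the snippet below compares judge verdicts against human verdicts on a sampled subset; the agreement metric and the idea of comparing it across releases are assumptions for illustration, not a prescribed method.

```python
def judge_agreement(judge_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of sampled examples where the LLM judge and a human reviewer agree.
    A drop in this rate between releases is a signal of judge drift."""
    assert len(judge_verdicts) == len(human_verdicts), "verdict lists must be paired"
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

# Example usage (hypothetical numbers): agreement of 0.90 last release vs 0.75
# this release would warrant re-checking the judge prompt or judge model.
```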