

LLM Evaluation

The practice of measuring whether an LLM-based system's outputs are correct, helpful, safe, and consistent — usually against a curated golden set.

Full definition

LLM evaluation is the discipline of measuring quality in AI systems where there is rarely one right answer. A robust evaluation stack uses multiple layers: automated checks (regex, JSON validation, factual lookups), LLM-as-judge for subjective dimensions, human review on a sampled subset, and live user feedback. The outputs of all four layers feed a quality score that is tracked release over release. A "golden set" of 100–500 carefully labeled examples is the foundation of any production eval system. Without it, prompt iteration is guesswork.
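The cheapest layer — automated checks over a golden set — can be sketched in a few lines. Everything below is illustrative: the example records, the `must_match` field, and the pass/fail scoring are assumptions, not a prescribed schema.

```python
import json
import re

# Hypothetical golden set: expert-labeled examples (structure is illustrative).
# Each record pairs a model output with a regex the output must satisfy.
GOLDEN_SET = [
    {"input": "Return user as JSON", "output": '{"name": "Ada"}', "must_match": r'"name"'},
    {"input": "Return order as JSON", "output": '{"id": 7}', "must_match": r'"id"'},
    {"input": "Return config as JSON", "output": "not json at all", "must_match": r'"debug"'},
]

def automated_checks(example: dict) -> bool:
    """Layer 1: deterministic checks — output parses as JSON and
    contains the required pattern."""
    try:
        json.loads(example["output"])
    except ValueError:
        return False
    return re.search(example["must_match"], example["output"]) is not None

def quality_score(golden_set: list[dict]) -> float:
    """Fraction of golden-set examples passing the automated layer;
    this is the number tracked release over release."""
    passed = sum(automated_checks(ex) for ex in golden_set)
    return passed / len(golden_set)

print(round(quality_score(GOLDEN_SET), 2))  # 2 of 3 examples pass → 0.67
```

The same loop re-runs on every prompt or model change, which is what turns iteration from guesswork into measurement.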

Frequently asked

What is a golden set?

A curated collection of input-output pairs labeled by experts and used as a benchmark for every prompt or model change.

Should you use LLM-as-judge?

Yes, for subjective dimensions, but always anchor to human review on a sample to catch judge drift.
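One way to anchor a judge to human review is to track the agreement rate between judge verdicts and human labels on the sampled subset. The scores below are made-up data on an assumed shared pass/fail scale; the point is the mechanism, not the numbers.

```python
# Hypothetical sampled subset: LLM-judge verdicts and human labels for the
# same examples (1 = pass, 0 = fail). Values are illustrative.
judge_scores = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human_labels = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

def judge_agreement(judge: list[int], human: list[int]) -> float:
    """Fraction of sampled examples where the LLM judge matches human review.
    A declining rate across releases is a signal of judge drift."""
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)

print(judge_agreement(judge_scores, human_labels))  # 8 of 10 agree → 0.8
```

Alert when this rate drops below a chosen threshold, and recalibrate the judge prompt (or re-sample for more human review) before trusting its scores again.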