
Stella Liu
Head of AI Applied Science

Amy Chen
Cofounder, AI Evals & Analytics
LLM-as-a-judge is widely used as a low-cost proxy for human or business ground truth, but uncalibrated judge scores can be statistically misleading, even reversing model rankings. This creates real production risk. Eddie Landesberg, an AI Evals researcher, introduces a calibration method to better align LLM-as-a-judge with human judgment and real-world decisions.
This deck is from a guest Lightning Lesson by Eddie Landesberg.
Check out the Lightning Lesson recording here.
Also check out Eddie's post "Your AI Metrics Are Lying to You".