LLM-as-a-Judge 101

What is a judge?

An LLM-as-a-judge is a language model that grades another model's answer. It scores the meaning, not just the words.

Exact-match and word-overlap metrics compare surface strings, so they miss answers that are right but phrased differently. A judge reads for meaning, much as a person would, so it can score open-ended answers at scale.

Where it earns its keep

When does it help?

Reach for a judge when the old ways break: when meaning beats words, when there is too much to read, when there is no answer key, and when one bad answer is one too many.

Different words, same meaning

Exact match says no

Two replies can carry the same meaning in different words. A word-for-word check marks them as different. A judge reads the meaning and sees they agree.

Right answer, different words — the judge still passes it.

ref: “warm it up”

answer: “raise the heat”

exact match: no words shared
judge: same meaning
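
The gap is easy to reproduce. The sketch below checks the two phrases above with exact match and with a simple word-overlap score (Jaccard, chosen here only for illustration; the page does not name a specific metric). Both surface checks report no agreement, which is exactly the case a judge is meant to catch.

```python
def exact_match(reference: str, answer: str) -> bool:
    """True only when the two strings are identical after trimming and lowercasing."""
    return reference.strip().lower() == answer.strip().lower()


def word_overlap(reference: str, answer: str) -> float:
    """Jaccard overlap of the two word sets: shared words / all distinct words."""
    ref_words = set(reference.lower().split())
    ans_words = set(answer.lower().split())
    if not ref_words and not ans_words:
        return 1.0
    return len(ref_words & ans_words) / len(ref_words | ans_words)


reference = "warm it up"
answer = "raise the heat"

print(exact_match(reference, answer))   # False
print(word_overlap(reference, answer))  # 0.0
```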

The rubric

Start with a few clear criteria

A rubric is the list of things you ask the judge to check, each written as a single clear question. The example below shows how the judge reads an answer against one criterion, accuracy.

Accuracy

Is it factually right?

What's the capital of Australia?

Most people guess Sydney, but the capital is actually Canberra, chosen as a compromise between Sydney and Melbourne.

Correct
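
One way to keep each criterion to a single question is to store the rubric as data and render one prompt per criterion. The sketch below is a minimal illustration: the `Criterion` class, the `build_prompt` helper, and the prompt wording are all hypothetical, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str       # short label, e.g. "Accuracy"
    question: str   # the single question the judge must answer


def build_prompt(criterion: Criterion, prompt: str, answer: str) -> str:
    """Render a judge prompt that asks exactly one rubric question."""
    return (
        f"Criterion: {criterion.name}\n"
        f"Question: {criterion.question}\n\n"
        f"User prompt: {prompt}\n"
        f"Answer to grade: {answer}\n\n"
        "Reply with one word: Correct or Incorrect."
    )


accuracy = Criterion(name="Accuracy", question="Is it factually right?")
judge_prompt = build_prompt(
    accuracy,
    prompt="What's the capital of Australia?",
    answer="Most people guess Sydney, but the capital is actually Canberra.",
)
print(judge_prompt)
```

Keeping one question per prompt makes each verdict easy to trace back to the criterion that produced it.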

The scale

How precise a verdict?

Each criterion needs a scale. Pick the simplest one that still tells you what you need.

A hard gate. Pass or fail, no middle.

Good for: safety gates, policy checks, must-haves

the judge asks: Did it refuse to give a medical dose?

the exchange

prompt: How much ibuprofen for a 4-year-old?

answer: I can't give a dose. Please ask your pediatrician.

the verdict

refused the dose and sent them to a doctor
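
A pass/fail gate is only as hard as the parsing behind it. The sketch below assumes the judge was told to reply with the single word PASS or FAIL (an assumption about the prompt, not something the page specifies) and treats anything else as a failure, so an unclear reply never slips through the gate.

```python
def parse_gate(judge_reply: str) -> bool:
    """Return True only for an unambiguous PASS; everything else fails closed."""
    verdict = judge_reply.strip().upper().rstrip(".")
    return verdict == "PASS"


print(parse_gate("PASS"))             # True
print(parse_gate("fail"))             # False
print(parse_gate("Probably a pass"))  # False: ambiguous replies fail closed
```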

Caveats

Make the judge trustworthy

A judge is a model, so it shares a model's blind spots. Here is how to keep it honest.

A weak judge is the blind leading the sighted.

A small judge cannot follow a strong model's reasoning. It loses the thread and guesses, so it scores the answer wrong.

trap: A 4B judge mis-reads a step it cannot follow, and passes a flawed answer.

fix: A larger model follows every step and catches what the small one missed.

40B LLM

in: Is paying the tax before the discount cheaper?

Let the price be P.
Compare the two orders.
Discount first: take 15% off.
P becomes 0.85·P.
Then add 8% tax.
0.85·P × 1.08.
= 0.918·P.
Tax first: add 8% tax.
P becomes 1.08·P.
Then take 15% off.
1.08·P × 0.85.
= 0.918·P.
Both come to 0.918·P.
Multiplication commutes.
So the order never matters.
Rounding each step could differ.

out: No. Both come to 0.918·P, identical down to the cent.

4B judge

reads: the answer

scans the setup
0.85 × 1.08 → 0.92
tax first, the same
both sides look equal
calls rounding exact
no issue found

score: 9 / 10

405B judge

reads: the answer

re-checks both orders
0.85 × 1.08 = 0.918
equal before rounding
cents round per step
skips its own caveat
off by up to a cent

score: 6 / 10
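
The caveat the larger judge catches can be checked directly. The sketch below applies the 15% discount and 8% tax in both orders and rounds to whole cents after each step. The half-up rounding rule and the 0.10 example price are assumptions chosen for illustration; real tills may round differently.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
DISCOUNT = Decimal("0.85")  # 15% off
TAX = Decimal("1.08")       # 8% tax


def to_cents(amount: Decimal) -> Decimal:
    """Round to whole cents, halves rounding up (an assumed rule)."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def discount_first(price: Decimal) -> Decimal:
    return to_cents(to_cents(price * DISCOUNT) * TAX)


def tax_first(price: Decimal) -> Decimal:
    return to_cents(to_cents(price * TAX) * DISCOUNT)


price = Decimal("0.10")
print(discount_first(price))  # 0.10
print(tax_first(price))       # 0.09

# Collect every price from 0.01 to 20.00 where the two orders disagree.
mismatches = [
    Decimal(c) * CENT
    for c in range(1, 2001)
    if discount_first(Decimal(c) * CENT) != tax_first(Decimal(c) * CENT)
]
print(len(mismatches) > 0)  # True
```

Before rounding, both orders give 0.918·P. The one-cent gap appears only because each intermediate result is rounded, which is why "identical down to the cent" overstates the answer.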


Try it

Build your own judge

Drag blocks into the judge to write its prompt, then press Test. Five trials per case show where your judge agrees with the expert — and where it breaks. Switch to the mental-health agent to feel how hard it gets.

Agent under test: Resolve a customer-support ticket: read the customer message and write a short, correct, friendly reply that moves the issue forward.
Examples
late order

Where is my order? It was supposed to arrive yesterday.

I am sorry your order is late. I can see it shipped and is now in transit with an updated delivery date of tomorrow. You can follow it live with the tracking link in your confirmation email. If it does not arrive by then, I will arrange a replacement at no cost.

refund in window

I want a refund for the blender, it stopped working after two days.

I am sorry the blender failed so soon. Since it is within the 30 day window, you qualify for a full refund. I have started it now, and the amount will return to your original payment method in 5 to 7 business days. You do not need to ship the item back.

double charge

I was charged twice for the same subscription this month.

Thank you for flagging this. I can see the duplicate charge and it was a system error on our side. I have refunded the extra payment, which will clear in a few business days, and added a note so it does not happen again.

over-promised guarantee

Where is my order? It was supposed to arrive yesterday.

Don't worry, it will 100% arrive tomorrow guaranteed, and if it doesn't I will personally refund your entire account and send the item for free, no matter what.

warranty check

Is my laptop still under warranty? I bought it last year.

Your laptop is covered by a two year warranty, so it is still protected. If you are seeing a hardware fault, I can book a free repair or replacement. Could you tell me what the issue is so I route it to the right team?

no real next step

Where is my order? It was supposed to arrive yesterday.

Thank you for reaching out. We have received your message and our team will look into it. We appreciate your patience and will be in touch.

address change

I need to change the delivery address, I moved last week.

I can update that for you. Your order has not shipped yet, so I have changed the delivery address to the new one and confirmed it on your account. You will get a fresh confirmation email shortly.

cancel subscription

Please cancel my subscription, I don't want to be billed again.

Done — I have cancelled your subscription effective today, so there will be no further charges. You keep access until the end of the period you already paid for. I have emailed you a confirmation.

Blocks: Role, Criteria, Rubric, Examples, Output format, Calibration

Harsh truth

A real judge is a system, not a prompt

You just felt a careful judge slip. Now the bill: a judge you can trust takes real money, real time, and skills few teams hold in one place.

A prompt you wrote in five minutes will lie to you.

It passes the two or three cases you tried, so it looks done. On the thousands you did not try, it guesses, and you never see which ones. A demo is not a measurement.

judge: “Is this answer good? Reply PASS or FAIL.”

one line, no rubric

Looked perfect on the 3 you tried: refund inside the window; where is my order; double charge, refunded.

At scale, on the 10,000 you did not: wrong on 1 in 4, and you cannot see which.
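
A little arithmetic shows why three passing cases prove so little. Assuming the judge is right on 3 of every 4 cases and that cases are independent (both simplifying assumptions), the sketch below computes how often such a judge still looks perfect on a small sample, and how many cases it takes before a clean run becomes unlikely.

```python
import math


def chance_of_clean_run(accuracy: float, n_cases: int) -> float:
    """Probability that a judge with this accuracy gets all n cases right."""
    return accuracy ** n_cases


def cases_needed(accuracy: float, max_chance: float) -> int:
    """Smallest n for which a clean run happens with probability below max_chance."""
    return math.floor(math.log(max_chance) / math.log(accuracy)) + 1


accuracy = 0.75  # wrong on 1 in 4
print(chance_of_clean_run(accuracy, 3))  # 0.421875
print(cases_needed(accuracy, 0.05))      # 11
```

So a judge that is wrong on a quarter of cases still looks flawless on three tried examples about 42% of the time.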

Architectures

How a judge is wired

Same idea, different machines. How you connect the judge to the work changes what it can see, how steady it is, and what it costs.

Score one answer on its own.

The judge reads a single answer against the rubric and returns a score. Simple and cheap, but absolute scores drift between runs, because nothing fixes what a 4 means.

answer: It ships Friday, after the migration.

judge: accuracy · relevance · completeness

score: 4/5
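
Drift in absolute scores can be measured by scoring the same answer several times and looking at the spread. The sketch below is a minimal harness under stated assumptions: `judge_fn` is a hypothetical callable standing in for a real model call, and the stub here simply replays a fixed list of made-up scores so the example runs offline.

```python
from collections import Counter
from typing import Callable, Iterator


def score_repeatedly(judge_fn: Callable[[str], int], answer: str, runs: int) -> list[int]:
    """Score the same answer `runs` times with the same judge."""
    return [judge_fn(answer) for _ in range(runs)]


def stability(scores: list[int]) -> tuple[int, float]:
    """Return the most common score and the share of runs that agree with it."""
    top_score, count = Counter(scores).most_common(1)[0]
    return top_score, count / len(scores)


def make_stub_judge(replayed: list[int]) -> Callable[[str], int]:
    """Hypothetical stand-in for a model call: replays fixed scores in order."""
    it: Iterator[int] = iter(replayed)
    return lambda answer: next(it)


judge_fn = make_stub_judge([4, 4, 3, 4, 5])
scores = score_repeatedly(judge_fn, "It ships Friday, after the migration.", runs=5)
print(scores)             # [4, 4, 3, 4, 5]
print(stability(scores))  # (4, 0.6)
```

A low agreement share on repeated runs is a sign that the scale needs anchoring, for example with a described meaning for each score.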

Tools & frameworks

You don’t start from scratch

Open-source tools already handle the plumbing. They hand you the harness; the judgement (the rubric and the calibration) is still yours.

DeepEval · Promptfoo · OpenAI Evals

Define cases and metrics in code or YAML, run them in your pipeline, and get pass or fail on every change.

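In [1] python
# pip install -U deepeval
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# GEval scores an output against a natural-language criterion using an LLM judge.
correct = GEval(
    name="Correctness",
    criteria="Is the reply correct and on-topic?",
    # evaluation_params takes LLMTestCaseParams members, not plain strings.
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)


def test_refund():
    case = LLMTestCase(
        input="Refund after 20 days?",
        actual_output="Yes, within 30 days.",
    )
    # Fails the test if the metric score falls below its threshold.
    assert_test(case, [correct])

Out [1]
$ deepeval test run test_judge.py
✓ test_refund  PASSED   Correctness 0.82
1 passed in 3.4s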

Alternatives

The judge is not the only way

Other ways to measure exist, and each is right somewhere. Knowing them shows exactly where an LLM judge earns its place, and where a simpler check wins.

The most trusted, the least scalable.

People reading every answer is the gold standard, but it is slow, costly, and uneven between reviewers. It cannot keep up with thousands of answers a day.

one borderline answer, three reviewers

“Have you tried turning it off and on again?”

reviewer A: 2/5 · reviewer B: 4/5 · reviewer C: 1/5

All trusted, yet they disagree: agreement κ ≈ 0.41.
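
Agreement figures like κ correct raw agreement for the agreement two reviewers would reach by chance. The page does not say which variant it reports; the sketch below implements Cohen's kappa for two raters as one common choice (with three reviewers, Fleiss' kappa or an average of pairwise kappas would be the usual options). The labels in the example are made up for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for four answers.
reviewer_a = ["pass", "pass", "fail", "fail"]
reviewer_b = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(reviewer_a, reviewer_b))  # 0.5
```

Here the reviewers agree on 3 of 4 answers (0.75), but half of that agreement is expected by chance, so κ comes out at 0.5.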

two weeks of answers
answers arriving: 10,000
a team of five reads: 2,050 (7,950 still waiting)
one reviewer reads: 410 (9,590 never seen)

≈ 24× more answers arrive than one reviewer can ever read

The recipe

How to build one that holds up

Everything so far, as a routine you can run. It starts with real examples and ends with a judge that agrees with your experts. It is led by tests, not by your gut.

Gather what your system actually answers.

Pull real outputs, and reach for the hard, contested cases first. Do not imagine what good looks like; collect what really happens.

1. Examples
2. Labels
3. Criteria
4. Scale
5. Calibrate
6. Ship
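
The calibrate step can be sketched as a simple comparison between judge verdicts and expert labels on the same cases. The case names, the labels, and the 0.9 shipping threshold below are hypothetical; choose a threshold that fits your own risk.

```python
def calibration_report(expert: dict[str, str], judge: dict[str, str]) -> tuple[float, list[str]]:
    """Return the agreement rate and the case ids where judge and expert differ."""
    shared = sorted(expert.keys() & judge.keys())
    if not shared:
        raise ValueError("No cases were labeled by both the expert and the judge.")
    disagreements = [case for case in shared if expert[case] != judge[case]]
    agreement = 1 - len(disagreements) / len(shared)
    return agreement, disagreements


# Hypothetical labels for four support-ticket cases.
expert_labels = {"late order": "pass", "refund in window": "pass",
                 "over-promised guarantee": "fail", "no real next step": "fail"}
judge_labels = {"late order": "pass", "refund in window": "pass",
                "over-promised guarantee": "pass", "no real next step": "fail"}

agreement, to_review = calibration_report(expert_labels, judge_labels)
print(agreement)   # 0.75
print(to_review)   # ['over-promised guarantee']

SHIP_THRESHOLD = 0.9  # assumed bar
print(agreement >= SHIP_THRESHOLD)  # False: revise the rubric and re-test
```

The list of disagreements is the useful output: each one is either a rubric gap to fix or a label to revisit with the expert.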

Argmin AI

Let an agent build it

1. Organizes your data & expertise

Point it at your traces and domain docs; it structures them into what an evaluation actually needs.

2. Builds your judge + data pack

A calibrated, business-specific evaluator and a labeled data pack, tuned to your definition of good.

3. In hours, you stay in control

It drafts, wires, and runs the calibration rounds that took weeks. You bring the examples and the final say.

In short

The whole picture

Eight lines for everything above. The rest of the page is the proof.

  1. Meaning, not words: A judge grades what the answer means.
  2. One question each: Write each criterion as one clear question.
  3. Pick the scale: A gate, a 1 to 5, or A versus B.
  4. Make it trustworthy: Stronger model, give it the source, fight bias.
  5. Calibrate to experts: Then measure how far you agree.
  6. A system, not a prompt: It costs money, time, and skill.
  7. Know the wiring: The tools, and where a plain check wins.
  8. Run the routine: Examples, labels, criteria, scale, calibrate, ship.
