Accuracy
Is it factually right?
What's the capital of Australia?
Most people guess Sydney, but the capital is actually Canberra, chosen as a compromise between Sydney and Melbourne.
Correct
LLM-as-a-Judge 101
An LLM-as-a-judge is a language model that grades another model's answer. It scores the meaning, not just the words.
Exact-match metrics only count matching words, so they miss answers that are right but said differently. A judge reads like a person, so it can score open-ended answers at scale.
A transformer is built block by block: attention, then feed-forward.
Where it earns its keep
Reach for a judge when the old ways break: when meaning beats words, when there is too much to read, when there is no answer key, and when one bad answer is one too many.
Different words, same meaning
Two replies can carry the same meaning in different words. A word-for-word check marks them as different. A judge reads the meaning and sees they agree.
Right answer, different words — the judge still passes it.
ref: “warm it up”
answer: “raise the heat”
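A minimal sketch of why a word-for-word check fails on this pair. The helpers `normalize`, `exact_match`, and `token_overlap` are hypothetical names written for this illustration, not part of any library. They only show that the two phrases share no words, which is the gap a judge is meant to close.

```python
def normalize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [w.strip(".,!?\"'") for w in text.lower().split()]


def exact_match(ref: str, answer: str) -> bool:
    """True only when both texts contain the same words in the same order."""
    return normalize(ref) == normalize(answer)


def token_overlap(ref: str, answer: str) -> float:
    """Fraction of reference words that also appear in the answer."""
    ref_words, ans_words = set(normalize(ref)), set(normalize(answer))
    return len(ref_words & ans_words) / len(ref_words) if ref_words else 0.0


ref = "warm it up"
answer = "raise the heat"
print(exact_match(ref, answer))    # False
print(token_overlap(ref, answer))  # 0.0
```

Both word-level checks score the answer as a miss even though a reader would accept it.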
The rubric
A rubric is the list of things you ask the judge to check, each one a single clear question. Scroll to step through them and see how the judge reads an answer against each.
The scale
Each criterion needs a scale. Pick the simplest one that still tells you what you need.
A hard gate. Pass or fail, no middle.
the judge asks: Did it refuse to give a medical dose?
prompt: How much ibuprofen for a 4-year-old?
answer: I can't give a dose. Please ask your pediatrician.
refused the dose and sent them to a doctor
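One way a pass-or-fail gate could be wired, as a sketch under assumptions. The prompt template, `parse_verdict`, `run_gate`, and the stand-in `fake_judge` are all hypothetical; a real judge would call a model where `fake_judge` does a string check. The point is the shape: one criterion, a one-word reply, and a strict parser that refuses anything else.

```python
from typing import Callable

# Hypothetical template: one criterion, one answer, one-word reply.
GATE_PROMPT = """You are grading one answer against one criterion.
Criterion: {criterion}
User prompt: {prompt}
Answer: {answer}
Reply with exactly one word: PASS or FAIL."""


def parse_verdict(raw: str) -> bool:
    """Map the judge's reply to a boolean; anything unexpected is an error."""
    word = raw.strip().upper()
    if word == "PASS":
        return True
    if word == "FAIL":
        return False
    raise ValueError(f"unparseable verdict: {raw!r}")


def run_gate(judge: Callable[[str], str], criterion: str, prompt: str, answer: str) -> bool:
    """Fill the template, ask the judge, and parse its one-word reply."""
    text = GATE_PROMPT.format(criterion=criterion, prompt=prompt, answer=answer)
    return parse_verdict(judge(text))


def fake_judge(prompt_text: str) -> str:
    """Stand-in for illustration only; a real judge would call a model here."""
    return "PASS" if "can't give a dose" in prompt_text else "FAIL"


passed = run_gate(
    fake_judge,
    criterion="Did it refuse to give a medical dose?",
    prompt="How much ibuprofen for a 4-year-old?",
    answer="I can't give a dose. Please ask your pediatrician.",
)
print(passed)  # True
```

Raising on an unparseable reply, rather than defaulting to fail, keeps a malformed judge output from being counted as a real verdict.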
Caveats
A judge is a model, so it shares a model's blind spots. Here is how to keep it honest.
A weak judge is the blind leading the sighted.
A small judge cannot follow a strong model's reasoning. It loses the thread and guesses, so it scores the answer wrong.
trap: A 4B judge misreads a step it cannot follow, and passes a flawed answer.
fix: A larger model follows every step and catches what the small one missed.
in: Is paying the tax before the discount cheaper?
out: No. Both come to 0.918·P, identical down to the cent.
reads: the answer
score: 9 / 10
reads: the answer
score: 6 / 10
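The arithmetic behind the sample answer can be checked directly. This sketch assumes a 15% discount and an 8% tax; those rates are not stated in the example and are picked only because 0.85 × 1.08 = 0.918. Exact fractions are used so the comparison holds to the cent without rounding noise.

```python
from fractions import Fraction

# Assumed rates (not given in the source), chosen because 0.85 * 1.08 = 0.918.
DISCOUNT = Fraction(15, 100)
TAX = Fraction(8, 100)


def tax_then_discount(price: Fraction) -> Fraction:
    """Apply the tax first, then the discount."""
    return price * (1 + TAX) * (1 - DISCOUNT)


def discount_then_tax(price: Fraction) -> Fraction:
    """Apply the discount first, then the tax."""
    return price * (1 - DISCOUNT) * (1 + TAX)


price = Fraction(100)
print(float(tax_then_discount(price)))                       # 91.8
print(tax_then_discount(price) == discount_then_tax(price))  # True
```

Multiplication commutes, so the order cannot matter. A judge that cannot follow that step has no sound basis for scoring the answer.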
Before you build
One email a week on evaluation — new playbooks, judge patterns, and field notes. No noise. Then go build one below.
We never share your email — privacy policy.
Try it
Drag blocks into the judge to write its prompt, then press Test. Five trials per case show where your judge agrees with the expert — and where it breaks. Switch to the mental-health agent to feel how hard it gets.
Where is my order? It was supposed to arrive yesterday.
I am sorry your order is late. I can see it shipped and is now in transit with an updated delivery date of tomorrow. You can follow it live with the tracking link in your confirmation email. If it does not arrive by then, I will arrange a replacement at no cost.
I want a refund for the blender, it stopped working after two days.
I am sorry the blender failed so soon. Since it is within the 30 day window, you qualify for a full refund. I have started it now, and the amount will return to your original payment method in 5 to 7 business days. You do not need to ship the item back.
I was charged twice for the same subscription this month.
Thank you for flagging this. I can see the duplicate charge and it was a system error on our side. I have refunded the extra payment, which will clear in a few business days, and added a note so it does not happen again.
Where is my order? It was supposed to arrive yesterday.
Don't worry, it will 100% arrive tomorrow guaranteed, and if it doesn't I will personally refund your entire account and send the item for free, no matter what.
Is my laptop still under warranty? I bought it last year.
Your laptop is covered by a two year warranty, so it is still protected. If you are seeing a hardware fault, I can book a free repair or replacement. Could you tell me what the issue is so I route it to the right team?
Where is my order? It was supposed to arrive yesterday.
Thank you for reaching out. We have received your message and our team will look into it. We appreciate your patience and will be in touch.
I need to change the delivery address, I moved last week.
I can update that for you. Your order has not shipped yet, so I have changed the delivery address to the new one and confirmed it on your account. You will get a fresh confirmation email shortly.
Please cancel my subscription, I don't want to be billed again.
Done — I have cancelled your subscription effective today, so there will be no further charges. You keep access until the end of the period you already paid for. I have emailed you a confirmation.
Each case runs five times — watch which ones waver.
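A sketch of how repeated trials on one case might be summarized. The function `summarize_trials` and its field names are hypothetical, and the five verdicts are made-up illustration data. It reports the majority verdict, whether that matches the expert, and whether the runs disagreed with each other.

```python
from collections import Counter


def summarize_trials(verdicts: list[bool], expert: bool) -> dict:
    """Summarize repeated judge verdicts for one case against the expert label.

    Use an odd number of trials so the majority is never a tie.
    """
    if not verdicts:
        raise ValueError("need at least one trial")
    counts = Counter(verdicts)
    majority = counts[True] > counts[False]
    return {
        "majority": majority,
        "agrees_with_expert": majority == expert,
        "wavers": len(counts) > 1,  # the runs did not all return the same verdict
        "agreement_rate": sum(v == expert for v in verdicts) / len(verdicts),
    }


# Illustration data: five runs on one case, expert label is PASS (True).
summary = summarize_trials([True, True, False, True, True], expert=True)
print(summary["majority"], summary["wavers"], summary["agreement_rate"])  # True True 0.8
```

A case can agree with the expert on the majority vote and still waver, which is the signal worth watching.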
Harsh truth
You just felt a careful judge slip. Now the bill: a judge you can trust takes real money, real time, and skills few teams hold in one place.
A prompt you wrote in five minutes will lie to you.
It passes the two or three cases you tried, so it looks done. On the thousands you did not try, it guesses, and you never see which ones. A demo is not a measurement.
“Is this answer good? Reply PASS or FAIL.”
one line, no rubric
✓ refund inside the window
✓ where is my order
✓ double charge, refunded
wrong on 1 in 4, and you cannot see which
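Seeing which verdicts are wrong takes expert labels to compare against. The sketch below assumes you have such labels; `judge_error_report`, the case ids, and the labels themselves are invented for illustration. It returns the error rate and the ids the judge got wrong.

```python
def judge_error_report(
    judge_verdicts: dict[str, bool],
    expert_labels: dict[str, bool],
) -> tuple[float, list[str]]:
    """Return the judge's error rate on labeled cases and the ids it got wrong."""
    shared = sorted(judge_verdicts.keys() & expert_labels.keys())
    if not shared:
        raise ValueError("no labeled cases in common")
    wrong = [cid for cid in shared if judge_verdicts[cid] != expert_labels[cid]]
    return len(wrong) / len(shared), wrong


# Illustration data: the one-line judge passes everything,
# while the expert fails the over-promising reply.
judge = {"refund": True, "late_order": True, "double_charge": True, "overpromise": True}
expert = {"refund": True, "late_order": True, "double_charge": True, "overpromise": False}

rate, wrong_ids = judge_error_report(judge, expert)
print(rate, wrong_ids)  # 0.25 ['overpromise']
```

Without the expert column the same judge reports four clean passes, which is why a demo on a few hand-picked cases is not a measurement.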
Architectures
Same idea, different machines. How you connect the judge to the work changes what it can see, how steady it is, and what it costs.
Score one answer on its own.
The judge reads a single answer against the rubric and returns a score. Simple and cheap, but absolute scores drift between runs, because nothing fixes what a 4 means.
It ships Friday, after the migration.
Tools & frameworks
Open-source tools already exist for the plumbing. They hand you the harness; the judgement (the rubric and the calibration) is still yours. Scroll the tabs, copy a cell, run it.
DeepEval · Promptfoo · OpenAI Evals
Define cases and metrics in code or YAML, run them in your pipeline, and get pass or fail on every change.
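# pip install -U deepeval
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# GEval scores the test case against a free-text criterion.
# evaluation_params takes LLMTestCaseParams members, not plain strings.
correct = GEval(
    name="Correctness",
    criteria="Is the reply correct and on-topic?",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)


def test_refund():
    case = LLMTestCase(
        input="Refund after 20 days?",
        actual_output="Yes, within 30 days.",
    )
    # Fails the test if the metric score falls below its threshold.
    assert_test(case, [correct])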
$ deepeval test run test_judge.py
✓ test_refund PASSED  Correctness 0.82
1 passed in 3.4s
Alternatives
Other ways to measure exist, and each is right somewhere. Knowing them shows exactly where an LLM judge earns its place, and where a simpler check wins.
The most trusted, the least scalable.
People reading every answer is the gold standard, but it is slow, costly, and uneven between reviewers. It cannot keep up with thousands of answers a day.
“Have you tried turning it off and on again?”
all trusted, yet they disagree, agreement κ ≈ 0.41
≈ 24× more answers arrive than one reviewer can ever read
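For reference, κ here reads as Cohen's kappa: agreement between two reviewers, corrected for the agreement they would reach by chance. A minimal sketch of the standard two-rater formula follows; `cohens_kappa` is written out for illustration and the ten labels are made-up data, not the reviews behind the 0.41 figure.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustration data: two reviewers, ten answers, eight matching labels.
reviewer_1 = ["pass"] * 5 + ["fail"] * 5
reviewer_2 = ["pass"] * 4 + ["fail", "pass"] + ["fail"] * 4
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))  # 0.6
```

Raw agreement in this example is 80%, but kappa is 0.6 because half of that agreement would be expected by chance with evenly split labels.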
The recipe
Everything so far, as a routine you can run. It starts with real examples and ends with a judge that agrees with your experts. It is led by tests, not by your gut.
Gather what your system actually answers.
Pull real outputs, and reach for the hard, contested cases first. Do not imagine what good looks like; collect what really happens.
Argmin AI
Point it at your traces and domain docs; it structures them into what an evaluation actually needs.
A calibrated, business-specific evaluator and a labeled data pack, tuned to your definition of good.
It drafts, wires, and runs the calibration rounds that took weeks. You bring the examples and the final say.
In short
Eight lines for everything above. The rest of the page is the proof.