5th Grade Summary

An LLM judge can say an answer sounds good.

Kam also needs to know if the answer used the right route, team, sport, source, data freshness, and denominator.

Those checks should be deterministic.

Kam asks the judge only after the facts are checked.

Raw LLM judging is not enough for Kam.

That does not mean LLM judges are useless. It means they should sit in the right place in the ladder. They are helpful for usefulness, clarity, and nuanced answer quality. They are not the best first tool for checking whether the answer used the right team, mixed source families, skipped the required hot read, or failed to expose the historical denominator.

What deterministic means

A deterministic grader checks facts that should not depend on taste.

Examples:

  • Did the route match the workload?
  • Did the answer resolve the correct sport?
  • Did it use the right team or player?
  • Did it separate sportsbook odds from prediction-market context?
  • Did it load the required hot read?
  • Did it show as_of freshness?
  • Did it include the games behind an aggregate trend?
  • Did it avoid an agentic workflow when normal chat should answer quickly?

These are not vibes.

They are product contracts.
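A minimal sketch of a few of these checks as plain functions over a recorded trace. The `Trace` fields and the failure-code strings are hypothetical names chosen for illustration; Kam's real trace schema is not shown in this post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trace:
    # Hypothetical trace fields; the real schema is an assumption.
    expected_workload: str
    actual_workload: str
    expected_sport: str
    resolved_sport: Optional[str]
    as_of: Optional[str]  # freshness timestamp shown in the answer, if any


def grade_contract(trace: Trace) -> List[str]:
    """Return failure codes; an empty list means every check passed."""
    failures: List[str] = []
    if trace.actual_workload != trace.expected_workload:
        failures.append("wrong_route")
    if trace.resolved_sport is None:
        failures.append("unresolved_sport")
    elif trace.resolved_sport != trace.expected_sport:
        failures.append("wrong_sport")
    if not trace.as_of:
        failures.append("missing_as_of")
    return failures
```

Each check compares two recorded values or tests for presence, so the same trace always produces the same failure codes.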

What should be deterministic

| Check | Deterministic rule | Why an LLM judge is too late |
| --- | --- | --- |
| Route | Expected workload equals actual workload | A fluent wrong route can still sound helpful |
| Entity | Team, player, game, or market matches expected entity | The judge may miss sports-specific identity drift |
| Source | Sportsbook and prediction-market evidence are separated | Mixed sources can create false confidence |
| Freshness | Required timestamp or stale warning is present | Staleness is a contract, not a writing preference |
| Denominator | Aggregate trend names sample, games, and dates | Trend claims need auditability |
| Tool policy | Disallowed tools or workflows were not used | Automation should follow bounded rules |
| Answer shape | Required caveat and next check are present | Users need visible uncertainty before action |

Takeaway: The judge should evaluate the answer after Kam proves the answer was eligible to be trusted.
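The source and denominator rules can be sketched the same way. This is an illustration under stated assumptions: the claim and trend dictionaries, the source-family labels, and the failure codes are hypothetical. Treating "mixed" as one claim citing both families, and requiring the sample size to equal the number of listed games, are interpretations and not rules confirmed by the post.

```python
from typing import Dict, List

# Hypothetical source-family labels.
SPORTSBOOK = "sportsbook"
PREDICTION_MARKET = "prediction_market"


def check_source_separation(claims: List[Dict]) -> List[str]:
    """Fail any single claim that cites both source families at once."""
    failures = []
    for index, claim in enumerate(claims):
        families = set(claim.get("source_families", []))
        if {SPORTSBOOK, PREDICTION_MARKET} <= families:
            failures.append(f"mixed_sources:claim_{index}")
    return failures


def check_denominator(trend: Dict) -> List[str]:
    """An aggregate trend must name its sample size, games, and dates."""
    failures = []
    games = trend.get("games") or []
    if not trend.get("sample_size"):
        failures.append("missing_sample_size")
    if not games:
        failures.append("missing_games")
    elif trend.get("sample_size") and trend["sample_size"] != len(games):
        # Assumed rule: the stated sample must equal the listed games.
        failures.append("sample_size_mismatch")
    if any(not game.get("date") for game in games):
        failures.append("missing_game_date")
    return failures
```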

The ladder

Kam judge ladder

A good eval path fails as close to the product contract as possible.

  1. Scope: Contract grader

    Checks route, entity, workload, required fields, and forbidden paths.

  2. Evidence: Source grader

    Checks sportsbook, prediction-market, hot-read, freshness, and denominator evidence.

  3. Answer: LLM judge

    Checks usefulness, clarity, calibration, and whether the answer would help a human decide next steps.

  4. Answer: Human approval

    Approves the label, fixture, rubric, or release decision when judgment matters.

Do not spend judge tokens on answers that already failed deterministic rules.
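One way to enforce that ordering is a short-circuiting runner: deterministic graders run first, and the judge is only called when none of them fails. The sketch below assumes a trace is a plain dictionary and that each grader returns a failure code or `None`. The function names, field names, and status strings are hypothetical.

```python
from typing import Callable, Dict, List, Optional

# A deterministic grader returns a failure code, or None when the check passes.
Grader = Callable[[Dict], Optional[str]]


def run_ladder(trace: Dict, graders: List[Grader], judge: Callable[[Dict], float]) -> Dict:
    for grader in graders:
        code = grader(trace)
        if code is not None:
            # Stop at the first contract failure; the judge is never called.
            return {"status": "failed", "failure_code": code, "judge_score": None}
    # Only eligible answers reach the judge; a human still approves the result.
    return {
        "status": "awaiting_human_approval",
        "failure_code": None,
        "judge_score": judge(trace),
    }


def route_grader(trace: Dict) -> Optional[str]:
    same = trace.get("actual_workload") == trace.get("expected_workload")
    return None if same else "wrong_route"


def freshness_grader(trace: Dict) -> Optional[str]:
    return None if trace.get("as_of") else "missing_as_of"
```

Passing the judge in as a callable keeps the runner testable with a stub, which also makes it easy to confirm that failed traces consume no judge calls.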

Why this matters in sports

Sports questions are compact.

The user may ask:

Why did this move?

That question is only answerable if the system knows what "this" means. It needs the selected event, market, book, current line, prior line, timestamp, saved read, and source family. If the assistant guesses, the answer can become persuasive fiction.

The deterministic grader should fail the answer before style review.
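As a rough sketch, the check for "Why did this move?" can be a required-field test over the resolved context. The field names below are hypothetical and simply mirror the list in the paragraph above. The test uses `is None` rather than truthiness because a line of 0 is a legitimate value.

```python
from typing import Dict, List

# Field names are hypothetical; they mirror the context listed above.
REQUIRED_CONTEXT = (
    "event", "market", "book", "current_line",
    "prior_line", "timestamp", "saved_read", "source_family",
)


def missing_context(context: Dict) -> List[str]:
    """Return required fields that are absent or None, in a fixed order."""
    return [name for name in REQUIRED_CONTEXT if context.get(name) is None]


def grade_line_move_question(context: Dict) -> Dict:
    missing = missing_context(context)
    if missing:
        # The answer would have to guess what "this" means, so fail early.
        return {"passed": False, "failure_code": "unresolved_context", "missing": missing}
    return {"passed": True, "failure_code": None, "missing": []}
```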

Before and after

Before:

LLM answer -> LLM judge -> score

After:

trace -> route grader -> source grader -> denominator grader -> judge -> human approval

The after version is slower to design but faster to operate. It gives engineers a precise failure code instead of a vague low score.
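To illustrate why failure codes are easier to act on than a single score, here is a small sketch that tallies codes across graded traces. The result dictionaries and the code names are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def summarize_failures(results: Iterable[Dict]) -> List[Tuple[str, int]]:
    """Count failure codes across graded traces, most frequent first."""
    counts = Counter(r["failure_code"] for r in results if r.get("failure_code"))
    return counts.most_common()
```

A tally like this points at a specific grader and contract to fix, which a low average judge score cannot do.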

Where each method is strongest

  • Route/entity checks: Deterministic
  • Freshness/source checks: Deterministic
  • Usefulness review: Judge helps
  • Final promotion: Human

Takeaway: The right mix is not anti-judge. It is judge-after-contract.

What the LLM judge should do

The judge still matters.

It should answer questions like:

  • Did the answer explain uncertainty in a useful way?
  • Did it help the user decide whether to wait, pass, save, or investigate?
  • Did it avoid overconfident betting language?
  • Did it connect the source evidence to the user question?
  • Was the answer concise enough for the product surface?

Best use of KamJudge

  • Usefulness: Would this answer help a human make a better next check?
  • Calibration: Does confidence match source quality, freshness, and missing data?
  • Communication: Is the answer clear without hiding caveats or overloading the user?

Takeaway: KamJudge is most valuable when deterministic evidence has already narrowed the question.
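A hedged sketch of how those three questions could be handed to a judge together with evidence that has already passed the deterministic graders. The rubric text comes from this section. The function name, the prompt wording, and the evidence keys are hypothetical and do not describe KamJudge's actual prompt.

```python
from typing import Dict

# Rubric questions are quoted from the three dimensions in this section.
RUBRIC = {
    "usefulness": "Would this answer help a human make a better next check?",
    "calibration": "Does confidence match source quality, freshness, and missing data?",
    "communication": "Is the answer clear without hiding caveats or overloading the user?",
}


def build_judge_prompt(question: str, answer: str, verified_evidence: Dict[str, str]) -> str:
    """Assemble a judge prompt that treats checked facts as settled."""
    evidence = "\n".join(f"- {key}: {value}" for key, value in sorted(verified_evidence.items()))
    rubric = "\n".join(f"- {name}: {text}" for name, text in RUBRIC.items())
    return (
        "The facts below already passed deterministic checks. Do not re-verify them.\n"
        f"Verified evidence:\n{evidence}\n\n"
        f"User question: {question}\n"
        f"Answer under review: {answer}\n\n"
        f"Score each dimension from 1 to 5:\n{rubric}\n"
    )
```

Listing the verified evidence up front narrows the judge's job to the three judgment calls and keeps it away from bookkeeping.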

The lesson

The better Kam framework makes graders boring on purpose.

Every deterministic contract that can be checked without a judge should be checked without a judge. That lowers cost, improves diagnosis, and keeps LLM review focused on judgment rather than bookkeeping.

The next action is to expand grader coverage around the highest-risk labels: wrong route, unresolved sport, mixed sources, stale hot reads, missing denominator, and unsafe workflow escalation.
