Key takeaways
Most benchmarks grade a model from its final text. That is fast and cheap, but it flattens the reasoning, tool use and edge cases that actually decide whether an answer is correct.
A good evaluation has the opposite property: it is hard to fake. It rewards correctness that survives scrutiny, uses credentialed graders against a domain rubric, and is built to resist gaming, including submissions where an undisclosed model did the work.
- Grade reasoning and process, not just the final string.
- Use credentialed experts and a failure-mode-specific rubric.
- Treat disagreement between graders as signal, not noise.
- Build defenses against undisclosed-AI submissions from the start.
Why text-only grading flatters models
A confident, well-formatted wrong answer often scores better than a correct but terse one. Text-only grading optimizes for the appearance of correctness, which is exactly what modern models are best at producing.
On hard domains, the difference between right and plausible-but-wrong is the whole game, and it usually lives in the reasoning rather than the conclusion. If your grader never inspects the reasoning, your score measures fluency, not capability.
How to design an expert-graded evaluation
Robust evaluation starts from the failure modes specific to a domain and builds a rubric that targets them. Credentialed graders apply that rubric, and the hardest items are reviewed by more than one expert so that disagreement itself becomes a measurable signal.
The goal is a score that rises only when capability genuinely rises, not when a model gets better at looking right.
- Enumerate domain-specific failure modes before writing tasks.
- Write a rubric that scores those failure modes explicitly.
- Recruit graders with verifiable credentials in the domain.
- Double-grade hard items and track inter-rater agreement.
- Refresh items regularly so the set cannot be memorized.
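As one way to make "track inter-rater agreement" concrete, the sketch below computes Cohen's kappa for two graders who labelled the same items and lists the items they disagreed on. Cohen's kappa is a standard agreement statistic, but its use here is an assumption: the article does not name a metric. The function names and the "pass"/"fail" labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(grades_a, grades_b):
    """Cohen's kappa for two graders who labelled the same items."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two equal-length, non-empty grade lists")
    n = len(grades_a)
    # Observed agreement: share of items where both graders gave the same label.
    observed = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    # Chance agreement, estimated from each grader's own label frequencies.
    freq_a, freq_b = Counter(grades_a), Counter(grades_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used one identical label throughout
    return (observed - expected) / (1 - expected)


def disputed_items(item_ids, grades_a, grades_b):
    """Return the ids of items the two graders labelled differently."""
    return [i for i, a, b in zip(item_ids, grades_a, grades_b) if a != b]


# Illustrative data only, not results from a real evaluation.
ids = ["q1", "q2", "q3", "q4"]
grader_a = ["pass", "pass", "fail", "fail"]
grader_b = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(grader_a, grader_b))
print(disputed_items(ids, grader_a, grader_b))
```

A low kappa on a subset of items is worth reading as a pointer to ambiguous rubric wording or genuinely hard cases, rather than averaging the disagreement away.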
How to catch undisclosed-AI submissions
When the task is to measure human judgment, an undisclosed model in the loop poisons the result. Defenses combine behavioral signals, such as timing and telemetry, with content heuristics that flag text likely to be model-generated.
Detection is never perfect, but raising the cost and risk of using undisclosed AI is usually enough to keep a benchmark honest. The most robust deterrent is process transparency: when every step is recorded, shortcuts are easier to spot.
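A minimal sketch of a timing-and-telemetry check might look like the following. Everything specific in it is an assumption made for illustration: the `Submission` fields, the `paste_events` telemetry counter, and both thresholds are hypothetical and would need calibrating against real grader behavior. The flags mark a submission for human review; they do not prove that a model was used.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    grader_id: str
    seconds_spent: float
    word_count: int
    paste_events: int  # hypothetical telemetry: number of paste actions recorded


# Assumed thresholds, chosen only for illustration.
MAX_WORDS_PER_MINUTE = 80.0
MAX_PASTE_EVENTS = 2


def review_flags(sub):
    """Return reasons a submission should get a closer human look."""
    flags = []
    minutes = sub.seconds_spent / 60.0
    # Text produced faster than a person could plausibly type it.
    if minutes <= 0 or sub.word_count / minutes > MAX_WORDS_PER_MINUTE:
        flags.append("implausibly_fast")
    # Many paste actions suggest the text was composed elsewhere.
    if sub.paste_events > MAX_PASTE_EVENTS:
        flags.append("heavy_pasting")
    return flags


print(review_flags(Submission("g1", seconds_spent=60, word_count=400, paste_events=0)))
print(review_flags(Submission("g2", seconds_spent=600, word_count=300, paste_events=1)))
```

Returning named reasons instead of a single boolean keeps a record of why a submission was flagged, which fits the process-transparency point above.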
Common evaluation pitfalls to avoid
Most broken evaluations fail in predictable ways. Avoiding them is half the work.
- Test-set leakage into training or prompts (the score becomes memorization).
- Single-grader subjectivity with no agreement measurement.
- Rubrics that reward length or confidence instead of correctness.
- Static benchmarks that vendors quietly optimize against over time.
- No record of how a grade was reached, so disputes cannot be resolved.
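One of these pitfalls, a rubric that rewards length, can be checked for directly. The sketch below computes the Pearson correlation between answer word count and awarded score. The function names are hypothetical, and a high correlation is only a prompt to inspect the rubric, since longer answers can also be legitimately better.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def length_score_correlation(answers, scores):
    """Correlate answer length (in whitespace-separated words) with score."""
    lengths = [len(answer.split()) for answer in answers]
    return pearson(lengths, scores)


# Illustrative data only.
answers = ["42", "The answer is 42", "After careful thought, the answer is clearly 42"]
scores = [1, 2, 3]
print(length_score_correlation(answers, scores))
```

Running the same check only on answers already known to be wrong is a stricter variant: if long wrong answers still outscore short wrong ones, the rubric is rewarding length rather than correctness.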
