June 2026 · 7 min read

The case for verifiable human judgment

Why signed provenance, trust scores and live agreement turn trust from a promise into something a reviewer can check

PathWize

Key takeaways

Frontier models are increasingly trained and evaluated on the judgment of human experts, yet that judgment is one of the least verifiable inputs in the entire pipeline. Most labs simply take their vendor's word that the right people did the work, in the right way, with no undisclosed shortcuts.

This is a solvable problem. Three mechanisms turn data quality from a claim into something a reviewer can independently check: a tamper-evident, per-task audit trail; trust scores for both experts and individual data batches; and inter-rater agreement computed live rather than after the fact.

  • Provenance is a per-task, signed record of who did what, when, and with how much model assistance.
  • Trust scores make the health of an expert and a batch visible while work is in progress.
  • Live inter-rater agreement flags divergence before a batch ships, not after.
  • A reproducible lineage bundle lets an external auditor replay and verify the data.

What is data provenance in AI training?

Data provenance is the documented history of a dataset: where each item came from, who produced or labeled it, under what instructions, and what was done to it before it reached a training run. For human-generated data, provenance answers the questions a careful reviewer would ask but usually cannot get answered: who made this judgment, what did they see, and how do I know it is real?

A model weight is reproducible; a human judgment is not. Once a rating, ranking or rationale is recorded, there is normally no trace of the person, the context or the process behind it. When the only artifact is the final label, quality control collapses into spot checks and good faith. That is fine for low-stakes labeling and unacceptable when the data shapes a model used in medicine, law, finance or security.

Organizations regularly using generative AI
Early 2023: 33%
Early 2024: 65%
Source: McKinsey, The state of AI, 2024 · More AI-assisted judgment in the pipeline raises the need to verify it

How to make human-labeled data verifiable

Verifiability comes from capturing the process while the work happens, not reconstructing it afterwards. If every task event (assignment, draft, edit, rationale and submission) is written to a hash-chained, signed log, the record becomes tamper-evident. Each entry commits to the hash of the one before it, so altering a single row breaks the chain, and a reviewer can confirm that what they are looking at is what actually happened.
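To make the idea concrete, here is a minimal sketch of a hash-chained, signed event log in Python. It is an illustration under stated assumptions, not a description of any particular product: the field names (`seq`, `prev_hash`, `event`, `ai_assisted`) and the functions `append_event` and `verify_log` are hypothetical, and an HMAC with a shared key stands in for a signature. A real deployment would more likely use an asymmetric signature scheme so that a reviewer can verify entries without holding the signing key.

```python
import hashlib
import hmac
import json
from typing import Optional

GENESIS = "0" * 64  # placeholder prev_hash for the first entry (assumption)


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()


def _sign(key: bytes, entry_hash: str) -> str:
    # HMAC is a stand-in for a real signature in this sketch.
    return hmac.new(key, entry_hash.encode(), hashlib.sha256).hexdigest()


def append_event(log: list, key: bytes, event: dict) -> dict:
    """Append one task event, linking it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"seq": len(log), "prev_hash": prev_hash, "event": event}
    entry_hash = _digest(body)
    entry = {**body, "hash": entry_hash, "sig": _sign(key, entry_hash)}
    log.append(entry)
    return entry


def verify_log(log: list, key: bytes) -> Optional[int]:
    """Return the index of the first bad entry, or None if the chain is intact."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        body = {
            "seq": entry["seq"],
            "prev_hash": entry["prev_hash"],
            "event": entry["event"],
        }
        if (
            entry["seq"] != i
            or entry["prev_hash"] != prev_hash
            or _digest(body) != entry["hash"]
            or not hmac.compare_digest(_sign(key, entry["hash"]), entry["sig"])
        ):
            return i
        prev_hash = entry["hash"]
    return None


key = b"demo-key"  # illustrative only; never hard-code real keys
log = []
append_event(log, key, {"task": "t-1", "type": "assignment", "expert": "e-17"})
append_event(log, key, {"task": "t-1", "type": "draft", "ai_assisted": True})
append_event(log, key, {"task": "t-1", "type": "submission", "label": "B"})

print(verify_log(log, key))            # None: chain is intact
log[1]["event"]["ai_assisted"] = False  # quietly rewrite history
print(verify_log(log, key))            # 1: first entry that no longer checks out
```

Recording AI assistance as an ordinary event field means the disclosure is covered by the same chain as everything else: changing it after the fact is detected in the same way as changing a label.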

On top of that record, compute trust scores for each expert and each data batch and surface them live. A batch stops being a black box that passes or fails at the end; its health is visible as it is produced. Overlapping a fraction of assignments across multiple experts lets you measure agreement in flight and flag divergence as it appears.
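As a rough illustration of measuring agreement in flight, the sketch below keeps a running Cohen's kappa for two experts who label the same overlapping items. The class name `LiveKappa`, the threshold of 0.6 and the minimum of 20 items are assumptions chosen for the example, not recommended values. With more than two raters per item, a statistic such as Fleiss' kappa or Krippendorff's alpha would be the more usual choice.

```python
from collections import Counter


class LiveKappa:
    """Running Cohen's kappa for two raters over doubly-labeled items."""

    def __init__(self):
        self.n = 0
        self.agree = 0
        self.counts_a = Counter()
        self.counts_b = Counter()

    def update(self, label_a, label_b):
        """Record one overlapping item as soon as both labels arrive."""
        self.n += 1
        self.agree += int(label_a == label_b)
        self.counts_a[label_a] += 1
        self.counts_b[label_b] += 1

    def kappa(self):
        """Return kappa so far, or None when it is undefined."""
        if self.n == 0:
            return None
        p_o = self.agree / self.n  # observed agreement
        # Agreement expected by chance, from each rater's label frequencies.
        p_e = sum(
            self.counts_a[c] * self.counts_b[c] for c in self.counts_a
        ) / self.n**2
        if p_e == 1.0:
            return None  # both raters used a single identical label throughout
        return (p_o - p_e) / (1 - p_e)

    def diverging(self, threshold=0.6, min_items=20):
        """Flag the pair once enough items are in and kappa is below threshold."""
        k = self.kappa()
        return self.n >= min_items and k is not None and k < threshold
```

Because the counters are updated per item, the check costs almost nothing to run after every submission, which is what allows a flag to be raised while the batch is still being produced.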

  • Sign and hash-chain every task event so the trail is tamper-evident.
  • Record model assistance explicitly: what the AI drafted vs. what the human decided.
  • Score experts and batches continuously, not just at delivery.
  • Overlap assignments to compute live inter-rater agreement.
  • Export a per-batch lineage bundle that an outsider can replay.
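The article does not define how a trust score is calculated, so the following is only one plausible sketch of continuous scoring. It assumes that each expert is periodically checked against items with known answers, treats the score as the posterior mean of their pass rate under a uniform Beta(1, 1) prior, and scores a batch as the task-weighted mean of its contributors. The names `ExpertTrust` and `batch_trust` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExpertTrust:
    """Running trust estimate from pass/fail checks (assumed definition)."""

    passes: float = 1.0  # pseudo-count from the assumed Beta(1, 1) prior
    fails: float = 1.0

    def record(self, passed: bool) -> None:
        """Update the estimate after one check against a known answer."""
        if passed:
            self.passes += 1
        else:
            self.fails += 1

    @property
    def score(self) -> float:
        # Posterior mean of the pass rate; starts at 0.5 with no evidence.
        return self.passes / (self.passes + self.fails)


def batch_trust(task_counts: dict, experts: dict) -> float:
    """Task-weighted mean of the contributing experts' scores for one batch.

    task_counts maps expert id -> number of tasks in the batch;
    experts maps expert id -> ExpertTrust.
    """
    total = sum(task_counts.values())
    if total == 0:
        raise ValueError("batch has no tasks")
    return sum(experts[e].score * n for e, n in task_counts.items()) / total
```

Both numbers can be recomputed after every check, so a batch whose score starts to slide is visible well before delivery. The prior keeps a new expert with two lucky passes from outranking one with a long, solid record.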

What to ask a data vendor about provenance

If you buy human data, the fastest way to gauge a vendor's seriousness is to ask how they would prove the integrity of a delivered batch to a hostile auditor. Vague answers about quality processes are a red flag; concrete answers about per-task records are not.

Use these questions as a starting point in vendor due diligence.

  • Can you produce a per-task audit trail for any item in this batch?
  • Is that trail tamper-evident, and how would I detect a changed record?
  • How do you record and disclose AI assistance on each task?
  • Do you measure inter-rater agreement live, and what happens when it drops?
  • Can you export a provenance bundle that maps to EU AI Act Annex IV fields?

Withstanding adversarial review

The real test of any provenance system is whether it survives someone trying to break it. A reproducible lineage bundle (who did what, when, with which model assistance, at what level of agreement), signed end to end, should withstand a hostile reviewer, not just a friendly one.

That is the bar worth holding yourself to, because it is increasingly the bar that regulated buyers will be held to as well. Provenance is not a certificate you show once; it is a trail any auditor can replay against the delivered data.
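What "replaying" a bundle against delivered data might look like can be sketched in a few lines. The bundle layout here is an assumption made for illustration: one entry per item holding a SHA-256 hash of the delivered record plus its provenance fields, and a `bundle_hash` over the whole manifest. Signing is left out of the sketch; in practice the `bundle_hash` is the value a vendor would sign.

```python
import hashlib
import json


def digest(obj: dict) -> str:
    # Canonical JSON so the same content always hashes to the same value.
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()


def build_bundle(batch_id: str, items: dict, trail: dict) -> dict:
    """items: item id -> delivered record; trail: item id -> provenance fields."""
    entries = {
        item_id: {**trail[item_id], "item_hash": digest(items[item_id])}
        for item_id in sorted(items)
    }
    core = {"batch_id": batch_id, "entries": entries}
    return {**core, "bundle_hash": digest(core)}


def replay(bundle: dict, delivered: dict) -> list:
    """Return the problems found; an empty list means the data matches."""
    problems = []
    core = {"batch_id": bundle["batch_id"], "entries": bundle["entries"]}
    if digest(core) != bundle["bundle_hash"]:
        problems.append("bundle_hash mismatch")
    for item_id, entry in bundle["entries"].items():
        if item_id not in delivered:
            problems.append(f"missing item: {item_id}")
        elif digest(delivered[item_id]) != entry["item_hash"]:
            problems.append(f"altered item: {item_id}")
    for item_id in delivered:
        if item_id not in bundle["entries"]:
            problems.append(f"unlisted item: {item_id}")
    return problems
```

The auditor needs only the bundle and the delivered files to run this check, with no access to the vendor's systems. That independence is the property that makes the review adversarial rather than cooperative.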

Frequently asked questions

What is data provenance in machine learning?

Data provenance is the documented, verifiable history of a dataset, recording where each item came from, who produced or labeled it, under what instructions, and what processing it underwent before training. For human-labeled data it captures who made each judgment and how.

Why is human-labeled AI data hard to trust?

Because the usual artifact is only the final label, with no record of who produced it, what they saw, how long they took, or whether they used an undisclosed tool. Without that record, quality control reduces to spot checks and good faith.

How do you verify the quality of training data?

Capture the process while it happens: a signed, tamper-evident per-task audit trail, trust scores computed for experts and batches, live inter-rater agreement on overlapping assignments, and an exportable lineage bundle an outside auditor can replay.

What is a data lineage bundle?

A per-batch package that records who produced the data, when, with what model assistance, and at what level of inter-rater agreement. It is signed end to end so it can be independently replayed and checked, and its fields can be mapped to regulatory documentation requirements.

Dr. Helena Vogt

Head of Research

Helena leads research at Pathwize on evaluation, provenance and the methods that make expert judgment verifiable.