Inter-rater agreement: a live quality signal for data labeling

Frequently asked questions

Q: What is inter-rater agreement?

A measure of how consistently independent raters assign the same label to the same item. It indicates whether a labeling task is well defined and whether raters interpret it the same way.

Q: How do you measure inter-annotator agreement?

With simple percent agreement or chance-corrected statistics such as Cohen's kappa or Krippendorff's alpha, chosen based on the number of raters and the label type.

Q: What is a good inter-rater agreement score?

There is no universal threshold. It depends on the task and the measure, and harder, more subjective tasks naturally show lower agreement. What matters more than a single threshold is tracking agreement by task type and acting when it drops.

Q: Why measure inter-rater agreement during labeling, not after?

Because an end-of-run number arrives too late to fix anything and hides where the problem is. Live agreement, segmented by task type, lets you correct ambiguous instructions before errors spread through the batch.

Inter-rater agreement as a live signal

Inter-rater agreement is one of the oldest quality signals in human labeling. Usually it is computed once, at the end, after the data has already been delivered.

It should be a live signal instead. Computed in flight from overlapping assignments, agreement becomes an early-warning system rather than a post-mortem.

Key takeaways

Inter-rater agreement measures how consistently independent raters assign the same label to the same item.
Computed after delivery, it tells you what already went wrong.
Computed live, it lets you fix instructions before errors propagate.
Disagreement between qualified experts is information, not just noise.

What is inter-rater agreement?

Inter-rater agreement (also called inter-annotator agreement or inter-rater reliability) measures how consistently independent people assign the same label to the same item. It is the standard way to ask whether a labeling task is well defined and whether raters understand it the same way.

Common measures include simple percent agreement and chance-corrected statistics such as Cohen's kappa or Krippendorff's alpha. The right measure depends on the number of raters and the type of labels, but the underlying question is always the same: do qualified people, working independently, reach the same answer?
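To make the difference between the two kinds of measure concrete, here is a minimal sketch of percent agreement and Cohen's kappa for exactly two raters and nominal labels. The function names and the sample labels are illustrative assumptions, not part of any particular library, and returning 1.0 for the degenerate case where kappa is undefined is a convention chosen for this sketch.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    p_o = percent_agreement(a, b)  # observed agreement
    n = len(a)
    freq_a, freq_b = Counter(a), Counter(b)
    # Expected chance agreement: both raters pick the same label independently,
    # each following their own observed label frequencies.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 1.0 is an assumption made for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two raters on the same eight items.
rater_1 = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"]
rater_2 = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "spam"]

print(percent_agreement(rater_1, rater_2))  # 0.75
print(cohens_kappa(rater_1, rater_2))       # 0.5
```

The raters agree on 6 of 8 items, but with two evenly used labels half of that agreement is expected by chance, which is why kappa comes out lower than the raw percentage. With more than two raters or ordinal labels, a measure such as Krippendorff's alpha would be the usual choice instead.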

Interpreting Cohen's kappa (Landis & Koch)

Poor · below 0.00
Slight · 0.00-0.20
Fair · 0.21-0.40
Moderate · 0.41-0.60
Substantial · 0.61-0.80
Almost perfect · 0.81-1.00

Source: Landis & Koch, Biometrics, 1977
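If these bands are used in a dashboard or report, a small lookup keeps the labels consistent. The sketch below is one possible mapping; the function name is hypothetical, and it follows the original Landis & Koch table in treating values below zero as "poor" and placing exactly zero in the "slight" band.

```python
def landis_koch_band(kappa):
    """Map a Cohen's kappa value to its Landis & Koch (1977) descriptive band."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa < 0.0:
        return "poor"
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, name in bands:
        if kappa <= upper:
            return name

print(landis_koch_band(0.5))   # moderate
print(landis_koch_band(0.85))  # almost perfect
```

The bands are descriptive conventions rather than pass/fail thresholds, so a subjective task may reasonably sit in a lower band than a clear-cut one.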

Why measuring agreement at the end is too late

An agreement number computed after delivery tells you that something went wrong, but not in time to do anything about it. The batch is already in someone's training run, and the only remaining options are expensive: rework, discard, or ship known-flawed data.

Worse, a single end-of-run number hides where the problem is. Low agreement is usually concentrated in specific task types or ambiguous instructions, and an aggregate score smooths that detail away.
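A small example shows how an aggregate can hide a failing segment. The task types, labels, and counts below are made up for illustration, and plain percent agreement is used to keep the arithmetic visible.

```python
from collections import defaultdict

# Hypothetical overlapped items: (task_type, label_from_rater_1, label_from_rater_2).
overlap = (
    [("sentiment", "pos", "pos")] * 8    # clear-cut task: raters always agree
    + [("sarcasm", "yes", "no")] * 3     # ambiguous task: mostly disagree
    + [("sarcasm", "yes", "yes")]
)

def agreement_by_segment(records):
    """Percent agreement per task type."""
    hits, totals = defaultdict(int), defaultdict(int)
    for task_type, a, b in records:
        totals[task_type] += 1
        hits[task_type] += (a == b)
    return {t: hits[t] / totals[t] for t in totals}

overall = sum(a == b for _, a, b in overlap) / len(overlap)
by_segment = agreement_by_segment(overlap)

print(overall)     # 0.75
print(by_segment)  # {'sentiment': 1.0, 'sarcasm': 0.25}
```

The overall figure of 0.75 looks acceptable, yet every disagreement sits in one task type, which is exactly the detail a single end-of-run number smooths away.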

How to compute agreement in flight

By overlapping a fraction of assignments across multiple experts and computing agreement continuously, you get a running read on a batch's health while it is still being produced. When agreement on a task type drops, you can pause, inspect, and act before the error spreads through the rest of the batch.

Low agreement is usually a sign of ambiguous instructions rather than bad experts, so catching it early often means a quick guideline fix, not a costly redo.
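The overlap step can be sketched as a simple assignment plan in which a random fraction of items goes to more than one expert. Everything here is an assumption for illustration: the function name, the 10% default overlap rate, and the choice of two experts per overlapped item.

```python
import random

def assign_with_overlap(item_ids, experts, overlap_rate=0.1,
                        raters_per_overlap=2, seed=0):
    """Assign each item to one expert, or to several for a sampled fraction."""
    if raters_per_overlap > len(experts):
        raise ValueError("not enough experts for the requested overlap")
    rng = random.Random(seed)  # seeded so the plan is reproducible
    plan = {}
    for item_id in item_ids:
        overlapped = rng.random() < overlap_rate
        k = raters_per_overlap if overlapped else 1
        # sample() picks k distinct experts, so no one rates an item twice.
        plan[item_id] = rng.sample(experts, k)
    return plan

experts = ["expert_a", "expert_b", "expert_c"]
plan = assign_with_overlap([f"item-{i}" for i in range(100)], experts,
                           overlap_rate=0.2)
overlapped_items = [i for i, who in plan.items() if len(who) > 1]
print(len(overlapped_items), "of", len(plan), "items have overlapping raters")
```

Sampling the overlap at random, rather than overlapping only the first items, keeps the agreement estimate representative of the whole batch as it is produced.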

Overlap a sample of items across multiple independent experts.
Compute agreement continuously, segmented by task type.
Alert when agreement on a segment drops below a threshold.
Treat low agreement first as an instruction problem to investigate.
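The steps above can be sketched as a small monitor that keeps a rolling window of overlapped items per task type and raises an alert when a segment falls below a threshold. The class name, the 0.7 threshold, the window size, and the minimum sample size are all assumptions for this sketch, and it uses raw percent agreement where a chance-corrected statistic could be substituted.

```python
from collections import defaultdict, deque

class AgreementMonitor:
    """Rolling percent agreement per task type, with threshold alerts."""

    def __init__(self, threshold=0.7, window=50, min_items=10):
        self.threshold = threshold
        self.min_items = min_items  # avoid alerting on a handful of items
        self._recent = defaultdict(lambda: deque(maxlen=window))

    def record(self, task_type, label_a, label_b):
        """Add one overlapped item; return an alert message or None."""
        recent = self._recent[task_type]
        recent.append(label_a == label_b)
        if len(recent) < self.min_items:
            return None
        rate = sum(recent) / len(recent)
        if rate < self.threshold:
            return (f"{task_type}: agreement {rate:.2f} below "
                    f"{self.threshold:.2f}; review instructions")
        return None

# Hypothetical stream of overlapped items arriving during a batch.
stream = [("sentiment", "pos", "pos")] * 5 + [
    ("sarcasm", "yes", "no"), ("sarcasm", "yes", "yes"),
    ("sarcasm", "no", "yes"), ("sarcasm", "yes", "no"),
    ("sarcasm", "no", "yes"),
]

monitor = AgreementMonitor(threshold=0.7, window=20, min_items=5)
alerts = []
for task_type, a, b in stream:
    alert = monitor.record(task_type, a, b)
    if alert:
        alerts.append(alert)

print(alerts)  # ['sarcasm: agreement 0.20 below 0.70; review instructions']
```

The alert wording points at the instructions first, in line with treating low agreement as a guideline problem to investigate before questioning the experts.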

Turning disagreement into a signal

When two qualified experts disagree, that disagreement is information. Surfacing it immediately, routing it to review, and feeding the resolution back into the guidelines makes the whole batch converge toward quality as it is built, instead of being graded after the fact.
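One way to picture this loop is a review queue that collects every overlapped item on which experts disagree and records both the final label and the guideline change that came out of the review. The data shapes, names, and the example guideline note are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Disagreement:
    item_id: str
    task_type: str
    labels: Dict[str, str]                 # expert id -> label given
    resolution: Optional[str] = None       # final label after review
    guideline_note: Optional[str] = None   # clarification fed back to guidelines

def find_disagreements(overlapped):
    """Return a review queue of items whose experts did not all agree.

    overlapped maps item_id -> (task_type, {expert_id: label}).
    """
    queue = []
    for item_id, (task_type, labels) in overlapped.items():
        if len(set(labels.values())) > 1:
            queue.append(Disagreement(item_id, task_type, dict(labels)))
    return queue

def resolve(case, final_label, guideline_note):
    """Record the reviewer's decision and the guideline clarification."""
    case.resolution = final_label
    case.guideline_note = guideline_note
    return case

# Hypothetical overlapped items.
overlapped = {
    "item-1": ("sentiment", {"expert_a": "pos", "expert_b": "pos"}),
    "item-2": ("sarcasm", {"expert_a": "yes", "expert_b": "no"}),
}

review_queue = find_disagreements(overlapped)
resolve(review_queue[0], "no", "Hyperbole alone does not count as sarcasm.")
print(review_queue[0])
```

Keeping the guideline note next to the resolved case makes it easy to see which instruction changes came from which disagreements.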

Used this way, inter-rater agreement stops being a report you read at the end and becomes a control you steer with throughout.
