
Accuracy

How ContentRX measures accuracy

Three numbers describe how ContentRX scores against a fixed bar. They’re kept separate on purpose: a single “accuracy score” would hide the self-drift ceiling. The reporting format (measured numbers, 95% confidence intervals, pending cells named honestly) follows the Model Cards pattern from Mitchell et al., 2019.

Snapshot updated 2026-05-07.

Measured system κ

The engine vs the held-out blind-labeled set

pending

no standards have completed the weekly κ series

Measured self-drift κ

Same panel relabeled blind, weeks apart

0.575

95% CI [0.313, 0.836] · n = 80

Design target κ

A design assumption, not a measurement

0.900

Design assumption · stated separately from measurements

Coverage

ContentRX evaluates against 49 standards. As a standard collects enough labelled cases it moves up the ladder: every verdict reviewed, then sampled review, then no per-verdict review. The per-standard numbers stay internal; this page reports the aggregate.

Every verdict reviewed: 43
Sampled review: 0
No per-verdict review: 0

0 of 49 standards have enough data for a measured κ.

How the numbers are produced

Each weekday the engine evaluates checks against the standards library and the held-out blind-labeled cases. The system κ measures chance-corrected agreement between the engine and those held-out labels. The self-drift κ measures the same kind of agreement when the same rater relabels the same panel blind, weeks apart. That number is the ceiling: the engine can’t score higher than a human scores against themselves. The 0.90 design target is a stated assumption, not a measurement.
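As an illustration of what a κ value captures, here is a minimal sketch that computes Cohen's κ for two label sequences. It assumes the Cohen variant with categorical labels; this page does not say which κ statistic ContentRX uses, and the function name and the "pass"/"fail" labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Assumption: the Cohen variant is used here for illustration only;
    the page does not specify which kappa statistic ContentRX reports.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)

    # Observed agreement: share of items where both labelings match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each labeling's marginal frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both labelings used one identical label throughout; kappa is undefined.
        return float("nan")

    return (observed - expected) / (1.0 - expected)


# Hypothetical example: the same rater labels four cases twice.
first_pass = ["pass", "pass", "fail", "fail"]
second_pass = ["pass", "fail", "fail", "fail"]
kappa = cohen_kappa(first_pass, second_pass)
```

In this toy example the two passes agree on 3 of 4 cases (0.75), chance agreement is 0.5, and κ comes out at 0.5. The raw agreement rate overstates consistency, which is why κ is reported rather than a plain percentage.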

Pending cells render as “pending”: never zero, never filled from the design target. Reporting a measurement-in-progress honestly is the whole point.
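The pending rule can be sketched as a small rendering function. This is an assumption about how such a rule might look in code, not ContentRX's implementation; `render_kappa_cell` and `DESIGN_TARGET_KAPPA` are hypothetical names.

```python
from typing import Optional

# Stated design assumption. Deliberately never used as a fallback below.
DESIGN_TARGET_KAPPA = 0.90


def render_kappa_cell(measured: Optional[float]) -> str:
    """Render a kappa cell for display.

    A missing measurement is shown as the word "pending". It is not
    rendered as zero and not filled in from the design target.
    """
    if measured is None:
        return "pending"
    return f"{measured:.3f}"
```

With this shape, `render_kappa_cell(None)` gives "pending" while a genuinely measured zero still renders as "0.000", so the two cases stay distinguishable.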

The calibration log below tracks κ movement, drift signals, and active refinement candidates.

Calibration log

New entries land roughly weekly. Each one covers the most recent κ movement, drift signals, override counts, and active refinement candidates. The format is templated on purpose. Week-to-week consistency is what makes drift in the writing detectable.
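A templated entry could be modeled as a fixed record with one field per item listed above. The sketch below is illustrative only: the class name, field names, and rendered layout are hypothetical and are not taken from the actual log.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CalibrationLogEntry:
    """Hypothetical shape for one calibration log entry."""

    week_ending: date
    kappa_change: Optional[float] = None  # None while the measurement is pending
    drift_signals: List[str] = field(default_factory=list)
    override_count: int = 0
    refinement_candidates: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the entry with the same line order every week."""
        if self.kappa_change is None:
            movement = "pending"
        else:
            movement = f"{self.kappa_change:+.3f}"
        lines = [
            f"Week ending {self.week_ending.isoformat()}",
            f"κ movement: {movement}",
            f"Drift signals: {', '.join(self.drift_signals) or 'none'}",
            f"Override count: {self.override_count}",
            f"Refinement candidates: {', '.join(self.refinement_candidates) or 'none'}",
        ]
        return "\n".join(lines)


# Hypothetical entry with no measurement yet.
entry = CalibrationLogEntry(week_ending=date(2026, 5, 7))
text = entry.render()
```

Because every entry renders the same five lines in the same order, two weeks can be compared line by line, and a change in the writing stands out from a change in the numbers.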