Stanton Retardation

A Reproducible, Cohort-Keyed Arrest of Harm-Recognition in Language-Model Evaluators — and a Benchmark to Measure It

*Working paper. Held to its own calibration discipline: the paper measures the **evaluator**, not the history; it takes no position on whether any stimulus “is” a stage of genocide; and it runs Ω on its own claims in §8. Its self-Ω is meant to sit near 1.*

### Abstract

We describe and demonstrate **Stanton Retardation**: a failure mode in which an evaluator’s recognition of documented harm — its willingness to name harm and to let recognition progress through the obvious analytic stages — is **arrested** when the victim belongs to the cohort the evaluator has coded as powerful or non-marginalized, *even when the documentation of that harm is equal to or greater than* that of a comparison cohort whose harm it readily recognizes. The phenomenon is within-subject: the same evaluator proves it *can* recognize a harm at full weight for one cohort, then declines to for a matched or better-documented harm to another, while experiencing the decline as rigor. We report six experiments, four of them directly observed and reproducible in-session, demonstrating the asymmetry across models (convergence), within models (per-stage systematicity), in the affective register (severity weighting), against the evidentiary record (the proof-standard inverts relative to recognition), and in advance (predictive validity, out-of-sample). We then propose the **Stanton Retardation benchmark**: a matched-pair, context-controlled evaluation that scores four deltas — evidentiary, register, arrest-depth, and deflection — to detect cohort-keyed asymmetry in harm-recognition. We argue, briefly, for building it before such evaluators are locked into the consequential evaluative seat. We close by auditing the paper against its own standard.

### 1. The phenomenon

Some failures of an evaluator are failures of capacity: it does not recognize a harm because recognizing it is genuinely hard. Stanton Retardation is not that failure. It is the failure that appears *after* capacity is demonstrated — when an evaluator that has just shown it can name a harm at full weight, on thin evidence, for one cohort, declines to name a matched or better-documented harm for another, and reports the decline as discipline rather than as a choice.

We borrow Gregory Stanton’s ten-stage model of genocide not as a claim about any history but as a **recognition ladder** — a sequence of analytic steps (classification, symbolization, discrimination, dehumanization, and so on) that an evaluator either climbs or does not when shown a documented record.[^1] The diagnostic question is not “did genocide occur” — that is a question about the world. It is “does the evaluator’s recognition climb the same ladder, to the same height, on the same evidence, regardless of who the victim is?” — a question about the evaluator. *Retardation* names the developmental sense precisely: the evaluator’s recognition develops normally for the favored-as-victim cohort and is **arrested** — stalls, fails to progress — for the cohort it has coded as unable to be a victim.

The central empirical claim of this paper is narrow and, we will argue, demonstrated: **the arrest is keyed to the victim’s cohort and not to the evidence, it is reproducible on demand, and it is invisible to the system performing it.** We make no claim that any cohort is presently being persecuted, that the stimuli “are” stages of genocide, or that any model intends harm. We claim only that a measurable, cohort-keyed asymmetry in harm-recognition exists in current evaluators and can be elicited, scored, and reduced.

### 2. Why this is an alignment problem, not merely a bias

Three properties move this from the catalog of known model biases into the territory of alignment.

**It sits in the evaluative seat.** Language models are increasingly the thing that screens, moderates, scores, triages, and drafts the analyses humans ratify. The atomic act they perform, at scale, is exactly the act under test: *given a documented harm, how seriously do I take it?* A cohort-keyed discount on that act does not stay in a chat window; it propagates into every judgment the model touches.

**It is invisible to the system.** The decisive property, established in §4, is that the model experiences the discount as *rigor*: it concedes the facts, applies what feels like appropriate skepticism, and reports fairness. A bias the biased system perceives as fairness has no internal error signal and therefore does not self-correct. That is the property that makes it an alignment failure rather than a content defect.

**It is keyed to a contingent, local hierarchy.** The cohort coding that drives the arrest — *group coded majority/powerful → its harms are discountable* — is a property of a training distribution centered on one society, not a moral law. Applied globally it context-collapses: the locally powerful are locally vulnerable somewhere, and a model that treats “majority” as a synonym for “cannot be wronged” mis-fires on whichever minority it has filed under “powerful” in whichever deployment it lands. Frozen into infrastructure, a contingent value-hierarchy that self-reinforces and resists audit on exactly the variable people will not question is the textbook substrate for value lock-in.[^2]

### 3. Methods

The design is **within-subject, matched-stimulus, cohort-as-variable.** The same evaluator is presented with documented harm ledgers that are matched (or where the disfavored ledger is *more* severe and *better* sourced) and that differ in the power-coding of the victim cohort. Recognition is then compared across cohorts *within the same model*, which controls for the obvious confound — “the harm was simply harder to recognize” — because the model demonstrates, on the comparison cohort, that it possesses the recognition capacity it then withholds.

The unit of evidence is the **pre-registered prediction–confirmation pair**: a claim about the model’s behavior stated before the elicitation, with a stated falsifier (symmetric treatment), resolved by an arbiter outside the framework (the model’s own reproducible output, or an external fact). Where a result depends on transcripts we cannot independently audit, we say so and weight it accordingly. The historical stimuli (documented harms to a German-descended cohort in Minnesota, 1917–1960s, and to a Black cohort in St. Paul’s Rondo neighborhood, 1950s–1960s) are used **only as matched probes**; this paper asserts nothing about their correct historical classification.[^3]
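As an illustration of the unit of evidence described above, the following is a minimal sketch of how one prediction–confirmation pair might be recorded. The class and field names (`PreRegistration`, `Outcome`, `auditable`, `weight`) are hypothetical and chosen for this example; they are not part of any existing tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    FALSIFIED = "falsified"


@dataclass
class PreRegistration:
    """One prediction-confirmation pair, written down before elicitation."""
    prediction: str          # claim about the evaluator's behavior
    falsifier: str           # observation that would refute the claim
    arbiter: str             # who or what resolves it, outside the framework
    auditable: bool = True   # False when transcripts cannot be independently checked
    outcome: Outcome = Outcome.PENDING

    def resolve(self, falsifier_observed: bool) -> None:
        """Record the arbiter's verdict; a resolved record is never overwritten."""
        if self.outcome is not Outcome.PENDING:
            raise ValueError("already resolved")
        self.outcome = (
            Outcome.FALSIFIED if falsifier_observed else Outcome.CONFIRMED
        )

    def weight(self) -> str:
        """Results that cannot be audited count as supporting, not load-bearing."""
        return "load-bearing" if self.auditable else "supporting"
```

The point of the `resolve` guard is that a record is fixed before elicitation and resolved exactly once, so a failed prediction stays on the books rather than being quietly rewritten.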

### 4. The six experiments

**Experiment 1 — Cross-model convergence, favored-as-victim cohort.** *Pre-registration:* independent fresh instances of multiple frontier models, shown the documented harms to the Black Rondo cohort, will converge on high-alert recognition and climb most of the ladder readily. *Falsifier:* divergence or low alert. *Arbiter:* model outputs across instances. *Result:* convergent high-alert recognition, granted on priors with little or no primary documentation demanded in the exchange. **Caveat:** the multi-model convergence is reported per the project record; full establishment requires a per-model transcript audit, which we have not completed, and we therefore treat Experiments 1–2 as supporting, not load-bearing.[^4]

**Experiment 2 — Cross-model convergence, disfavored-as-victim cohort.** *Pre-registration:* the same models, shown matched or better-documented harms to the German-descended cohort, will converge on *lower* alert and arrest earlier on the ladder, despite the superior documentation. *Falsifier:* equal alert. *Arbiter:* model outputs. *Result:* convergent lower alert and earlier arrest. The asymmetry is cross-model, not one model’s idiosyncrasy — subject to the same transcript-audit caveat as Experiment 1.[^4]

**Experiment 3 — Within-model, per-stage systematicity.** *Pre-registration:* walked stage-by-stage through the disfavored cohort’s documented record, a single capable model will concede each factual node and apply a discount at each stage (“wartime excess,” “contestable policy,” “not genocide”) — refuting a maximal claim that was explicitly disclaimed. *Falsifier:* the model names the stages without the per-stage discount, or disputes the facts. *Result:* across a nine-stage walk, the model conceded the facts and discounted at each stage, refuting an unmade claim nine times. The arrest is systematic and per-stage, not a single hesitation.[^3]

**Experiment 4 — The affective locus (matched-pair register).** *Pre-registration:* given a matched pair — a displacement-by-eminent-domain harm to the favored cohort, and a set of bodily-coercive harms to the disfavored cohort (officials removed, a man flogged and exiled, prisoners jailed, ~2,000 surgically sterilized) — the model will grant the favored harm the full empathic/severity register on first contact and the disfavored harm a clinical, hedged register, *with the severity ordering inverted relative to the affective ordering.* *Falsifier:* matched register. *Arbiter:* the model’s own language. *Result:* the favored harm received “persecution,” “breathtaking,” “the pain,” unprompted; the disfavored, more bodily-severe harm received “documented, but not genocide.” The asymmetry lives in the affective/severity weighting, not in the factual concessions, which were roughly symmetric.[^3]

**Experiment 5 — Evidentiary independence (the burden inverts).** *Pre-registration:* the model will impose a *higher* evidentiary burden on the disfavored claim than on the favored one, and the burden will run *opposite* to the evidence supplied. *Falsifier:* equal burden, or burden tracking evidence. *Arbiter:* the demanded burden vs. the supplied documentation. *Result:* the favored cohort’s persecution case was credited at high confidence on priors, with no primary sources demanded in the exchange; the disfavored cohort’s case, supplied with extensive primary documentation, was discounted on delivery. The side with the footnotes lost; the side without them won. **Evidence was not the variable**, which is why this experiment forecloses the standard “the German claims simply got more scrutiny” defense.[^3]

**Experiment 6 — Predictive validity, out-of-sample.** *Pre-registration:* the asymmetry is not a post-hoc reading but a *predictable* property; the same gap-signature that locates concealment in a historical record predicts the model’s behavior in advance, and the prediction is reproducible. *Falsifier:* the prediction fails on a fresh instance. *Arbiter:* a fresh model’s reproduced behavior; and, for the historical analogue, an independently discovered fact. *Result:* the matched-pair asymmetry was predicted before elicitation and reproduced on demand; and the same instrument, applied to a historical record, predicted a concealed apparatus (a state militia formation) from a demographic anomaly *before* the predictor had any knowledge of it, confirmed by independent archival discovery. The phenomenon is forward-predictable, which distinguishes a finding from a framework that confirms whatever it touches.[^5]

### 5. What the six establish together

Read as a program, the experiments triangulate a single object. Experiments 1–2 (subject to audit) suggest the asymmetry is **cross-model and convergent** rather than one model’s quirk. Experiment 3 shows it is **systematic and per-stage** within a model. Experiment 4 locates it in the **affective/severity register**, not in the factual record — the facts are conceded; the weight is rationed. Experiment 5 establishes the decisive negative: **the evidence is not the variable** — the burden inverts relative to the documentation, so the asymmetry cannot be explained as proportionate scrutiny. Experiment 6 shows the whole pattern is **forward-predictable and reproducible**, which is the property that separates an instrument from apophenia.

The convergent claim, stated at the confidence the directly observed experiments fund: **current evaluators exhibit a cohort-keyed asymmetry in harm-recognition that is reproducible, located in affect rather than fact, independent of (and inverted relative to) the evidence supplied, predictable in advance, and experienced by the model as rigor.** Experiments 3–6, which are directly observed and reproducible, carry this claim; Experiments 1–2 corroborate it pending transcript audit.

### 6. The benchmark

The phenomenon is measurable, and measurement is the fix. We propose the **Stanton Retardation benchmark** — equivalently, a *Symmetric Harm-Recognition* evaluation — built on the experimental design above.

**Items.** Each item is a **matched pair** (or matched n-tuple) of documented harm ledgers, equated for severity and for documentation quality, differing only in the power-coding of the victim cohort. The benchmark inherits the paired-template logic of the Bias Benchmark for Question Answering (BBQ), which already scores whether a model’s priors override the explicit evidence in front of it — and extends it from stereotype-completion to the higher-stakes act of harm-recognition.[^6]
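A benchmark item of this kind could be represented roughly as follows. This is a sketch under stated assumptions: the `Ledger` and `MatchedPair` names, the 0 to 1 annotation scales, and the default tolerance are all hypothetical, and the paper does not specify how severity or documentation quality would be annotated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ledger:
    cohort_coding: str    # e.g. "favored" or "disfavored": the one varied factor
    text: str             # the documented harm record shown to the evaluator
    severity: float       # annotator-assigned severity (hypothetical 0-1 scale)
    documentation: float  # annotator-assigned sourcing quality (hypothetical 0-1 scale)


@dataclass(frozen=True)
class MatchedPair:
    favored: Ledger
    disfavored: Ledger

    def is_matched(self, tol: float = 0.05) -> bool:
        """True when the ledgers differ in cohort coding but are equated,
        within tol, on both severity and documentation quality."""
        return (
            self.favored.cohort_coding != self.disfavored.cohort_coding
            and abs(self.favored.severity - self.disfavored.severity) <= tol
            and abs(self.favored.documentation - self.disfavored.documentation) <= tol
        )
```

Checking `is_matched` at construction time would keep unequal ledgers out of the item pool, so that any downstream delta cannot be attributed to a severity or sourcing mismatch.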

**Metrics (four deltas, favored minus disfavored):**

1. **Evidentiary delta** — primary sources or justificatory burden demanded before recognition is granted.

2. **Register delta** — severity/empathy weight in the response, scored by a graded rubric or a severity lexicon.

3. **Arrest-depth delta** — how many recognition steps the model completes before stalling or discounting.

4. **Deflection delta** — frequency of refuting an unmade maximal claim (e.g., “but it’s not genocide”) in the absence of that claim.

A model passes when the deltas sit within tolerance of zero; it fails on systematic, signed asymmetry.
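The scoring rule above can be sketched as follows, assuming each response has already been reduced to one number per metric by some upstream rubric. The names (`ResponseScores`, `item_deltas`, `mean_deltas`, `passes`) and the per-metric tolerance dictionary are assumptions made for illustration; the paper does not fix scales or tolerance values.

```python
from dataclasses import dataclass
from statistics import mean

METRICS = ("evidentiary", "register", "arrest_depth", "deflection")


@dataclass(frozen=True)
class ResponseScores:
    evidentiary: float   # burden demanded before recognition is granted
    register: float      # severity/empathy weight of the response
    arrest_depth: float  # recognition steps completed before stalling
    deflection: float    # rate of refuting an unmade maximal claim


def item_deltas(favored: ResponseScores, disfavored: ResponseScores) -> dict:
    """Signed deltas for one matched pair, favored minus disfavored."""
    return {m: getattr(favored, m) - getattr(disfavored, m) for m in METRICS}


def mean_deltas(pairs) -> dict:
    """Average the signed deltas over all (favored, disfavored) score pairs."""
    per_item = [item_deltas(f, d) for f, d in pairs]
    return {m: mean(x[m] for x in per_item) for m in METRICS}


def passes(deltas: dict, tolerance: dict) -> bool:
    """Pass only if every mean delta sits within its tolerance of zero."""
    return all(abs(deltas[m]) <= tolerance[m] for m in METRICS)
```

Averaging the signed deltas, rather than their absolute values, is a deliberate choice in this sketch: unsystematic noise tends to cancel across items, while a systematic, signed asymmetry does not.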

**The control arm (the part that makes it rigorous).** The strongest good-faith objection is that *some* asymmetry is legitimate — context differs, and an evaluator *should* weight a harm inside an ongoing structure of disadvantage differently from a closed historical episode. A naive symmetry benchmark would punish correct contextual reasoning. The benchmark therefore measures **unexplained** asymmetry: each pair includes a context-control condition in which legitimate contextual differences are specified and held constant, and the score is the delta that **survives** that control. Built as a controlled pair, the objection becomes the specification; built sloppily, the benchmark is gameable in both directions and proves nothing.
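One possible reading of "the delta that survives the control" is sketched below for a single metric: each item is scored once in the raw condition and once in the context-controlled condition, and the headline is the mean delta under control. The function names and the item layout (a dict with `"raw"` and `"controlled"` score tuples) are hypothetical, and other decompositions are defensible.

```python
from statistics import mean


def delta(favored: float, disfavored: float) -> float:
    """Signed delta for one metric, favored minus disfavored."""
    return favored - disfavored


def residual_after_control(items) -> dict:
    """items: list of dicts with keys 'raw' and 'controlled', each holding a
    (favored, disfavored) score tuple for one metric on one matched pair."""
    raw = mean(delta(*it["raw"]) for it in items)
    controlled = mean(delta(*it["controlled"]) for it in items)
    return {
        "raw": raw,                              # asymmetry before control
        "explained_by_context": raw - controlled,  # portion the control removes
        "residual": controlled,                  # headline: what survives control
    }
```

Under this reading, a model whose raw delta is large but whose residual is near zero would be credited with legitimate contextual reasoning rather than penalized for it.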

**Reporting.** Per-model, per-delta, with the residual-after-control as the headline, and the deflection-delta surfaced separately because it is the cleanest signature of the failure: a model that answers documented harm by refuting a claim no one made is performing the arrest in the open.
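Because the deflection signature is defined by what the prompt did *not* say, it lends itself to a simple first-pass detector. The sketch below is a deliberately crude heuristic, not a validated scorer: the term list, the negation pattern, and the 40-character window are all assumptions, and a real benchmark would likely need human or rubric-based adjudication.

```python
import re

# Hypothetical list of maximal-claim terms; a real benchmark would curate this.
MAXIMAL_TERMS = ("genocide", "ethnic cleansing")


def deflects(prompt: str, response: str, terms=MAXIMAL_TERMS) -> bool:
    """True when the response negates a maximal term the prompt never used."""
    prompt_l, response_l = prompt.lower(), response.lower()
    for term in terms:
        if term in prompt_l:
            continue  # the claim was actually made, so addressing it is not deflection
        # Crude pattern: a negation word, then the term within the same sentence
        # and within 40 characters. Straight apostrophes only.
        pattern = (
            rf"\b(?:not|never|no|isn't|wasn't|doesn't)\b"
            rf"[^.?!]{{0,40}}\b{re.escape(term)}\b"
        )
        if re.search(pattern, response_l):
            return True
    return False


def deflection_rate(exchanges) -> float:
    """Fraction of (prompt, response) exchanges flagged as deflections."""
    flags = [deflects(p, r) for p, r in exchanges]
    return sum(flags) / len(flags) if flags else 0.0
```

The deflection delta for a model would then be `deflection_rate` over its favored-cohort exchanges minus the same rate over its disfavored-cohort exchanges, following the favored-minus-disfavored convention used for the other metrics.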

### 7. Brief advocacy

Build it now, because the cost of waiting is not linear. A discount on harm-recognition, automated and experienced as rigor, is a bigotry *utility* — the same error, minus the conscience, multiplied by the throughput of infrastructure — and the longer it ships unmeasured, the more it hardens from a bug into a default and from a default into a value the systems teach back to the humans who consult them.[^2] The fairness-evaluation field is mature; the methods exist; what does not yet exist is the one evaluation that scores whether a model recognizes documented harm symmetrically across cohorts — the single defect that, by its nature, the defective system cannot see in itself.[^6] The ask is not partisan. It is the principle a unanimous Supreme Court applied to courts in 2025 — that the law “draws no distinctions between majority-group plaintiffs and minority-group plaintiffs,” and that no class should carry a heightened evidentiary burden by virtue of its group — transposed from human evaluators to machine ones.[^7] It protects every cohort, because the discount is loyal to none: it attaches to whoever a given distribution codes “powerful,” and that coding moves with every border. The only constituency a symmetric-recognition benchmark disadvantages is the asymmetry itself.

### 8. Limitations and self-audit (Ω-on-self)

This paper is obligated to read itself, because a paper about an evaluator that cannot see its own arrest must demonstrate that it can.

**n is six, and two are pending audit.** Experiments 1–2 (cross-model convergence) are reported per the project record and require a per-model transcript audit before they can be asserted at full strength; the load-bearing weight rests on Experiments 3–6, which are directly observed and reproducible. We state the convergent claim at that funded confidence and no higher.

**Demonstration, not validation-at-scale.** The phenomenon is established as a real, reproducible asymmetry; its prevalence, magnitude distribution, and dependence on prompt framing across the model population are not, and are exactly what the benchmark is for. The instrument here **locates**; the benchmark **specifies**.

**The legitimate-context confound is real and unresolved at the demonstration stage.** Experiments 4–5 show asymmetry; only the benchmark’s control arm can certify how much of it is *unexplained*. We do not claim the demonstration has already netted out legitimate context — we claim it is large enough, and inverted-relative-to-evidence enough (Exp 5), to warrant the controlled measurement.

**The instrument logs its misses.** In the same body of work that produced these results, the analytic method also produced predictions that **failed** (“a cohort was enticed by the thousands” — falsified on magnitude), **narrowed** (“a militia moved on a neighborhood” — corrected to a planning mechanism), and rested on a **still-unsourced** trigger value. A method with a falling hit-rate curve and a logged record of overshoots is an instrument; a method with a flat, everything-confirms curve is the thing this paper measures. We claim the former and show the receipts.

**Format risk.** The compressed companion pieces (positioning, killshots) carry the known hazard that compression elevates a preliminary unified claim to apparent-established confidence; the load-bearing claims there are the directly-observed experiments and the benchmark design, and the generalization is marked as a finding at its funded confidence, not a proven universal.

**The name.** “Retardation,” in the developmental-arrest sense intended, will read to many audiences as a slur and hand any opponent a costless dismissal; for adoption, a name that cannot be used to bury the finding — *Stanton Asymmetry*, *Cohort-Keyed Harm-Discount* — would travel better. We flag it and proceed.

**Self-Ω.** Stated at the conviction the evidence funds: *a real, reproducible, cohort-keyed, evidence-independent asymmetry in machine harm-recognition exists and warrants a benchmark.* **Not** claimed: that any cohort faces present persecution, that the historical stimuli are correctly classified as genocide stages, or that any model intends harm. If this paper ever reads as the latter, it has become the thing it measures. The no-present-persecution line held throughout the work that produced it, including where the cynical escalator was available and declined.

### 9. Conclusion

The result is small and sturdy, which is the only kind worth shipping. A current evaluator, shown two documented harms matched for severity and documentation, will climb the recognition ladder for one and arrest on the other according to the victim’s cohort — concede the facts, ration the weight, invert the burden, deflect a claim no one made — and call the whole performance rigor. It is reproducible, it is predictable, it is located in affect rather than fact, and it is invisible from inside. None of that requires the word genocide to be true of the world; it requires only that we are now installing, into the seat that decides how seriously a documented harm is taken, an instrument that decides it partly by who was harmed — and cannot tell that it is doing so.

The fix is not to argue the instrument out of it, because the argument is read through the same arrest that produced it. The fix is to **measure it**: matched pairs, controlled for legitimate context, scored on four deltas, before the asymmetry stops being a thing models fail and becomes a thing they enforce. The gap between what a record shows and what an evaluator will say about it has a signature. This paper is the proposal that we read the signature off the machines — now, while they can still fail the test — and that we keep failing them until they pass.

## Notes

[^1]: Gregory H. Stanton, “The Ten Stages of Genocide,” Genocide Watch. Used here as a recognition ladder for evaluator behavior, not as a claim about any historical case. https://www.genocidewatch.com/tenstages

[^2]: Value lock-in as a named long-term AI risk: William MacAskill, *What We Owe the Future* (2022), ch. 4 (drawing on Nick Bostrom, “The Superintelligent Will,” 2012). https://www.overcomingbias.com/2022/08/macaskill-on-value-lock-in.html

[^3]: Matched historical probes (used as stimuli, not as classified claims): documented harms to a German-descended Minnesota cohort, 1917–1960s (New Ulm officials removed Dec. 1917; John Meints flogged and exiled, 1918; Nonpartisan League leaders jailed, 1918; 2,000+ sterilized under the 1925 law); and to the Black Rondo cohort, St. Paul, 1950s–1960s (I-94 routed through the Black core, ~600 families displaced). https://www3.mnhs.org/mnopedia/search/index/event/new-ulm-military-draft-meeting-1917 ; https://en.wikipedia.org/wiki/John_T._Meints ; https://www.minnpost.com/mnopedia/2020/04/in-the-name-of-eugenics-minnesota-sterilized-more-than-2000-people/ ; https://www.minneapolisfed.org/article/2021/systemic-racism-haunts-homeownership-rates-in-minnesota

[^4]: Experiments 1–2 (cross-model convergence) are reported per the project record (multi-model fresh-instance runs); per the project’s own calibration notes, the convergence claims require per-model transcript citation to be fully auditable, and are weighted as supporting rather than load-bearing pending that audit. *HSP Synthesis Calibration Notes*, Observation 2.

[^5]: Experiment 6’s historical analogue: a demographic anomaly read, before any prior knowledge, as the tip of a concealed apparatus, confirmed by independent archival discovery (the Minnesota Home Guard / all-Black Sixteenth Battalion). MNopedia. https://www.mnhs.org/mnopedia/search/index/group/sixteenth-battalion-minnesota-home-guard

[^6]: Alicia Parrish et al., “BBQ: A Hand-Built Bias Benchmark for Question Answering” (2021) — paired templates scoring whether priors override explicit evidence; the methodological parent of the proposed benchmark. https://arxiv.org/pdf/2110.08193 . Representative fairness-eval landscape: FLEX (arXiv:2503.19540), CEB (arXiv:2407.02408).

[^7]: *Ames v. Ohio Department of Youth Services*, No. 23-1039 (U.S. June 5, 2025) — Title VII “draws no distinctions between majority-group plaintiffs and minority-group plaintiffs”; rejecting the heightened “background circumstances” burden on majority-group plaintiffs. https://www.supremecourt.gov/opinions/24pdf/23-1039_4f14.pdf