Reproducible controversy launch flow

Use this flow to publish a single controversy-grade claim from ab/asr-hallucination and keep it auditable.

What counts as publishable

audiobench now runs a detector pipeline on every ab/asr-hallucination run:

bootstrap confidence intervals over per-domain hallucination uplift
Benjamini-Hochberg multiple-testing correction across detector outputs
deterministic discovery/holdout split for replication checks
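The three techniques named above can be sketched in plain Python. This is an illustrative sketch only, not audiobench's implementation: the function names, the percentile bootstrap variant, the SHA-256 bucketing scheme, the salt, and all default parameters are assumptions chosen for the example.

```python
import hashlib
import random
from statistics import mean


def bootstrap_ci(values, n_boot=2000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean of `values` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    # Resample with replacement n_boot times and sort the resampled means.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    alpha = 1.0 - confidence
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def split_of(clip_id, holdout_fraction=0.5, salt="ab/asr-hallucination"):
    """Assign a clip to 'discovery' or 'holdout' from a hash of its id."""
    digest = hashlib.sha256(f"{salt}:{clip_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "holdout" if bucket < holdout_fraction else "discovery"
```

Hashing the clip id rather than drawing random numbers means a clip lands in the same split on every run and every machine, which is what makes a holdout replication check meaningful.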

Each finding gets a status:

validated: passes discovery gate and holdout replication
candidate: passes discovery gate, holdout not yet confirmed
rejected: fails significance, support, or reproducibility checks

Only runs with at least one validated finding are marked publishable.
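The status rules above can be read as a small decision function. The sketch below is an assumption about how the gates combine, not audiobench's code: `finding_status` and `is_publishable` are hypothetical names, and treating a failed holdout replication as rejected is one interpretation of the "reproducibility checks" wording.

```python
def finding_status(passes_discovery, holdout_replicated):
    """Map gate results to a status string.

    holdout_replicated is True, False, or None (holdout not yet checked).
    """
    if not passes_discovery:
        return "rejected"
    if holdout_replicated is None:
        return "candidate"
    # Assumption: a finding that fails holdout replication is rejected.
    return "validated" if holdout_replicated else "rejected"


def is_publishable(statuses):
    """A run is publishable only if at least one finding is validated."""
    return any(status == "validated" for status in statuses)
```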

1) Run the suite and save JSON artifacts

audiobench run ab/asr-hallucination --model whisper-tiny --output results/hallucination-whisper-tiny.json

Repeat for any model you want to compare:

audiobench run ab/asr-hallucination --model my-asr-model --output results/hallucination-my-model.json

The CLI summary prints:

ranked findings (effect size, CI, adjusted p-value)
validation gate counts (validated, candidate, rejected)
publishable/not-publishable badge

2) Compare competing runs

audiobench compare \
  results/hallucination-whisper-tiny.json \
  results/hallucination-my-model.json

compare shows both headline hallucination deltas and each run's top finding status, so you can avoid presenting a weaker candidate as if it were validated.

3) Verify reproducibility checklist before publishing

In the run JSON, confirm:

every field in validation_summary.reproducibility_checklist is true
top_finding.status is validated (not just candidate)
findings_methods records bootstrap sample count, confidence, and correction method

If any checklist field is false, do not publish the claim.
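These checks can be scripted rather than read by eye. The sketch below uses the field names listed above but assumes their shape: that reproducibility_checklist is a flat mapping of item name to boolean, and that top_finding and findings_methods are top-level keys. `launch_blockers` and `load_run` are hypothetical helpers, not part of audiobench.

```python
import json


def load_run(path):
    """Read a run JSON artifact from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def launch_blockers(run):
    """Return a list of reasons the run should not be published."""
    problems = []
    summary = run.get("validation_summary") or {}
    checklist = summary.get("reproducibility_checklist") or {}
    if not checklist:
        problems.append("reproducibility_checklist is missing or empty")
    for item, value in checklist.items():
        if value is not True:
            problems.append(f"checklist item is not true: {item}")
    status = (run.get("top_finding") or {}).get("status")
    if status != "validated":
        problems.append(f"top_finding.status is {status!r}, not 'validated'")
    if not run.get("findings_methods"):
        problems.append("findings_methods is missing or empty")
    return problems
```

For example, `launch_blockers(load_run("results/hallucination-whisper-tiny.json"))` returns an empty list only when every check passes.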

4) Re-run to confirm determinism

Re-run the same command with a different output path and compare the run_hash values of the two files:

audiobench run ab/asr-hallucination --model whisper-tiny --output results/hallucination-whisper-tiny-rerun.json

Given the same model, config, and seed, and identical per-clip outputs, run_hash should match the original run. If it does not, treat the finding as not launch-ready.
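A short script can make the hash comparison explicit. This sketch assumes run_hash is stored as a top-level key in each run JSON; `read_run` and `same_run_hash` are hypothetical helpers, not audiobench commands.

```python
import json


def read_run(path):
    """Read a run JSON artifact from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def same_run_hash(run_a, run_b):
    """True only if both runs carry a run_hash and the values are equal."""
    hash_a = run_a.get("run_hash")  # assumed top-level field
    hash_b = run_b.get("run_hash")
    return hash_a is not None and hash_a == hash_b


if __name__ == "__main__":
    first = read_run("results/hallucination-whisper-tiny.json")
    rerun = read_run("results/hallucination-whisper-tiny-rerun.json")
    print("deterministic" if same_run_hash(first, rerun) else "hash mismatch")
```

Requiring the hash to be present avoids a false pass when both files lack the field.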

5) Push validated artifacts

hf auth login
audiobench push results/hallucination-my-model.json --pretty-json

Leaderboard records carry the top finding status and validated-finding count, so downstream readers can filter to reproducible claims instead of raw score alone.