Reproducible controversy launch flow
Use this flow when you want to publish a single controversy-grade claim from ab/asr-hallucination and keep it auditable.
What counts as publishable
audiobench now runs a detector pipeline on every ab/asr-hallucination run:
- bootstrap confidence intervals over per-domain hallucination uplift
- Benjamini-Hochberg multiple-testing correction across detector outputs
- deterministic discovery/holdout split for replication checks
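The three pipeline stages above can be illustrated with a minimal, standard-library sketch. This is not audiobench's implementation: the function names, the hash-based split, the percentile bootstrap over a mean, and all defaults (holdout fraction, bootstrap sample count, seed) are assumptions chosen for illustration only.

```python
import hashlib
import random


def split_bucket(clip_id, holdout_fraction=0.5):
    """Assign a clip to 'discovery' or 'holdout' from a hash of its ID.

    Hypothetical scheme: the same clip ID always lands in the same bucket,
    so the split is reproducible across runs and machines.
    """
    digest = hashlib.sha256(clip_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return "holdout" if u < holdout_fraction else "discovery"


def bootstrap_ci(values, n_boot=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    alpha = 1.0 - confidence
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this sketch, `values` would be the per-clip uplift values for one domain, and `p_values` would hold one raw p-value per detector output. A finding would then be compared against a significance threshold using its adjusted p-value rather than its raw one.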
Each finding gets a status:
- validated: passes discovery gate and holdout replication
- candidate: passes discovery gate, holdout not yet confirmed
- rejected: fails significance, support, or reproducibility checks
Only runs with at least one validated finding are marked publishable.
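The status rules and the publishable gate can be sketched as a small decision function. The `Finding` class and its fields are hypothetical, not audiobench types. Treating a failed holdout replication as rejected (rather than candidate) is an assumption based on the "reproducibility checks" wording above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    # Hypothetical record; field names are illustrative only.
    name: str
    passes_discovery: bool                  # cleared the discovery gate
    replicates_on_holdout: Optional[bool]   # None = holdout not yet confirmed


def finding_status(finding):
    """Map a finding to 'validated', 'candidate', or 'rejected'."""
    if not finding.passes_discovery:
        return "rejected"
    if finding.replicates_on_holdout is None:
        return "candidate"
    # Assumption: a holdout check that ran and failed counts as rejected.
    return "validated" if finding.replicates_on_holdout else "rejected"


def is_publishable(findings):
    """A run is publishable only if at least one finding is validated."""
    return any(finding_status(f) == "validated" for f in findings)
```

Under these assumptions, a run whose findings are all candidate or rejected stays non-publishable, however large its headline effect.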
1) Run the suite and save JSON artifacts
audiobench run ab/asr-hallucination --model whisper-tiny --output results/hallucination-whisper-tiny.json
Repeat for any model you want to compare:
audiobench run ab/asr-hallucination --model my-asr-model --output results/hallucination-my-model.json
The CLI summary prints:
- ranked findings (effect size, CI, adjusted p-value)
- validation gate counts (validated, candidate, rejected)
- publishable/not-publishable badge
2) Compare competing runs
compare shows both headline hallucination deltas and each run's top finding status, so you can avoid presenting a weaker candidate as if it were validated.
3) Verify reproducibility checklist before publishing
In the run JSON, confirm:
- validation_summary.reproducibility_checklist is all true
- top_finding.status is validated (not just candidate)
- findings_methods records bootstrap sample count, confidence, and correction method
If any checklist field is false, do not publish the claim.
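These checks can be scripted against the saved artifact. The sketch below assumes the run JSON is a single object with `validation_summary`, `top_finding`, and `findings_methods` at the top level, and that `reproducibility_checklist` maps item names to booleans. That layout is inferred from the field paths above and is not a documented schema. The helper name `prepublish_problems` is hypothetical.

```python
import json


def prepublish_problems(run):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Assumed layout: checklist is a mapping of item name -> bool.
    summary = run.get("validation_summary") or {}
    checklist = summary.get("reproducibility_checklist") or {}
    if not checklist:
        problems.append("reproducibility_checklist is missing or empty")
    for name, ok in checklist.items():
        if ok is not True:
            problems.append(f"checklist item is not true: {name}")

    status = (run.get("top_finding") or {}).get("status")
    if status != "validated":
        problems.append(f"top_finding.status is {status!r}, expected 'validated'")

    if not run.get("findings_methods"):
        problems.append("findings_methods is missing or empty")
    return problems


def check_file(path):
    """Load a run artifact from disk and report its problems."""
    with open(path, encoding="utf-8") as fh:
        return prepublish_problems(json.load(fh))
```

Calling `check_file("results/hallucination-whisper-tiny.json")` returns an empty list only when all three conditions hold. Any non-empty result means the claim should not be published.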
4) Re-run to confirm determinism
Re-run the same command and compare hashes:
audiobench run ab/asr-hallucination --model whisper-tiny --output results/hallucination-whisper-tiny-rerun.json
For the same model/config/seed and identical per-clip outputs, run_hash should match across the two files. If the hashes differ, treat the finding as not launch-ready.
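A minimal way to compare the two artifacts, assuming `run_hash` is stored as a top-level key in each run JSON. The key's location is an assumption, and the helper names are hypothetical.

```python
import json


def load_run_hash(path):
    """Read the run_hash value from a saved run artifact."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["run_hash"]  # assumed top-level key


def hashes_match(path_a, path_b):
    """True when both artifacts report the same run_hash."""
    return load_run_hash(path_a) == load_run_hash(path_b)
```

For the two commands in this guide, the call would be `hashes_match("results/hallucination-whisper-tiny.json", "results/hallucination-whisper-tiny-rerun.json")`.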
5) Push validated artifacts
Leaderboard records carry the top finding status and validated-finding count, so downstream readers can filter to reproducible claims instead of raw score alone.