Suites
A suite is a benchmark task: a manifest of clips or mixtures, a set of conditions or perturbations, and a per-suite metric. Every suite is identified by a stable suite_id (e.g. ab/sound-id) and ships with a fixed revision. Both are recorded in every run JSON.
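Because both identifiers are recorded with every run, a script can check that two runs belong to the same suite and revision before comparing them. The following is a minimal sketch, assuming the run JSON exposes top-level `suite_id` and `revision` keys; the key names, the revision values, and the `same_suite` helper are illustrative assumptions, not a documented schema or API.

```python
import json


def same_suite(run_a: dict, run_b: dict) -> bool:
    """Return True when two run records share a suite and a revision."""
    keys = ("suite_id", "revision")  # assumed key names, not a confirmed schema
    return all(
        run_a.get(k) is not None and run_a.get(k) == run_b.get(k) for k in keys
    )


# Hypothetical run records; real run artifacts carry many more fields.
run_a = json.loads('{"suite_id": "ab/sound-id", "revision": "r1"}')
run_b = json.loads('{"suite_id": "ab/sound-id", "revision": "r2"}')

print(same_suite(run_a, run_a))  # True
print(same_suite(run_a, run_b))  # False: same suite, different revision
```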
| Suite | Task | Default model | Conditions |
|---|---|---|---|
| ab/asr-robust | Speech recognition under perturbations | whisper-tiny | clean, noise-cafe-10db, noise-pink-5db, bandlimited-8k, reverb-medium |
| ab/asr-hallucination | Non-speech hallucination detection with validation gates | whisper-tiny | silence, music, noise |
| ab/sound-id | Sound-event identification on labeled mixtures | heuristic-v0 | solo, pair, triple, quad (mixture sizes) |
List the suites the current build knows about:
```bash
audiobench list
audiobench info ab/sound-id
audiobench info ab/asr-robust
audiobench info ab/asr-hallucination
```
Why three suites
The three suites stress different things:
- ab/asr-robust is about graceful degradation under perturbation. Same content, harder channel. WER under noise tells you whether a model has actually learned a robust acoustic representation or has just overfit to clean studio audio.
- ab/asr-hallucination is about prior-violation detection. The suite runs non-speech clips, ranks statistically tested findings, and marks whether each finding replicated on holdout.
- ab/sound-id is about multi-label identification under polyphony. As mixture size grows from solo to quad, the model has to disentangle more concurrent sources. The headline metric (components understood: X / Y) makes the polyphony cost legible at a glance.
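To make the two metrics named above concrete, here is a minimal sketch. `word_error_rate` uses the standard definition of WER (word-level edit distance divided by the number of reference words). `components_understood` is only one plausible reading of the "X / Y" headline: it counts correctly identified labels over all labeled components, which is an assumption and may differ from how the suite actually scores. Both function names and the sample data are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1] / len(ref)


def components_understood(truth, predicted):
    """Return (X, Y): correctly identified components and total components."""
    hits = sum(len(set(t) & set(p)) for t, p in zip(truth, predicted))
    total = sum(len(set(t)) for t in truth)
    return hits, total


print(word_error_rate("turn the lights off", "turn the light off"))  # 0.25

# Hypothetical solo, pair, and triple mixtures with made-up labels.
truth = [{"dog"}, {"dog", "siren"}, {"rain", "siren", "speech"}]
predicted = [{"dog"}, {"dog"}, {"rain", "speech", "music"}]
x, y = components_understood(truth, predicted)
print(f"components understood: {x} / {y}")  # components understood: 4 / 6
```

Under this reading, a spurious label (such as `music` in the last mixture) does not reduce X; a stricter variant could penalize false positives as well.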
All suites share the same run-artifact schema, the same compare command, and the same run_hash semantics. Adding a fourth suite (say ab/diarization) is mostly a matter of authoring a new module and registering it in the runner.