Suites

A suite is a benchmark task: a manifest of clips or mixtures, a set of conditions or perturbations, and a per-suite metric. Every suite is identified by a stable suite_id (e.g. ab/sound-id) and ships with a fixed revision. Both are recorded in every run JSON.
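As a rough illustration of the pieces named above, a suite can be pictured as a small immutable record. This is a sketch only: the `SuiteSpec` class, its field names, the `run_header` helper, and the `"r1"` revision string are all hypothetical and are not taken from the audiobench codebase.

```python
from dataclasses import dataclass
import json


@dataclass(frozen=True)
class SuiteSpec:
    """Hypothetical model of a suite; field names are assumptions."""
    suite_id: str      # stable identifier, e.g. "ab/sound-id"
    revision: str      # fixed revision shipped with the suite (format assumed)
    manifest: tuple    # clip or mixture identifiers
    conditions: tuple  # conditions or perturbations
    metric: str        # name of the per-suite metric


def run_header(spec: SuiteSpec) -> dict:
    """Return the two identifying fields that every run JSON records."""
    return {"suite_id": spec.suite_id, "revision": spec.revision}


spec = SuiteSpec(
    suite_id="ab/sound-id",
    revision="r1",  # placeholder value, not a real revision
    manifest=("mix-0001", "mix-0002"),  # illustrative entries
    conditions=("solo", "pair", "triple", "quad"),
    metric="components_understood",
)
print(json.dumps(run_header(spec)))
```

Freezing the record mirrors the idea that a suite's identity and revision do not change once shipped, so two runs with the same pair are comparable.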

| Suite | Task | Default model | Conditions |
|---|---|---|---|
| ab/asr-robust | Speech recognition under perturbations | whisper-tiny | clean, noise-cafe-10db, noise-pink-5db, bandlimited-8k, reverb-medium |
| ab/asr-hallucination | Non-speech hallucination detection with validation gates | whisper-tiny | silence, music, noise |
| ab/sound-id | Sound-event identification on labeled mixtures | heuristic-v0 | solo, pair, triple, quad (mixture sizes) |

List the suites the current build knows about, then inspect each one:

audiobench list
audiobench info ab/sound-id
audiobench info ab/asr-robust
audiobench info ab/asr-hallucination

Why three suites

The three suites stress different things:

  • ab/asr-robust is about graceful degradation under perturbation. Same content, harder channel. WER under noise tells you whether a model has actually learned a robust acoustic representation or has just overfit to clean studio audio.
  • ab/asr-hallucination is about prior-violation detection. The suite runs non-speech clips (silence, music, noise) through the model, ranks the statistically tested findings, and marks whether each finding replicated on the holdout set.
  • ab/sound-id is about multi-label identification under polyphony. As mixture size grows from solo to quad, the model has to disentangle more concurrent sources. The headline metric (components understood: X / Y) makes the polyphony cost legible at a glance.
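To make two of these metrics concrete, here is a minimal sketch. `word_error_rate` is the standard word-level edit distance divided by the reference length. `components_understood` is an assumed reading of the X / Y headline (correctly identified labels over total ground-truth labels); the suite's actual definition may differ. Neither function is taken from audiobench.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def components_understood(truth, predicted):
    """Assumed definition: X = labels correctly identified, Y = labels present.

    truth and predicted are parallel lists of label sets, one per mixture.
    """
    x = sum(len(t & p) for t, p in zip(truth, predicted))
    y = sum(len(t) for t in truth)
    return x, y


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))

# A solo clip identified correctly, and a pair with one of two sources found.
print(components_understood(
    [{"dog"}, {"dog", "siren"}],
    [{"dog"}, {"siren", "rain"}],
))
```

Comparing the WER on the clean condition against each perturbed condition gives the degradation picture that ab/asr-robust is after.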

All three suites share the same run-artifact schema, the same compare command, and the same run_hash semantics. Adding a fourth suite (say ab/diarization) is mostly a matter of authoring a new module and registering it in the runner.
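The "author a module, then register it" step could look roughly like the registry pattern below. This is a sketch under assumptions: `SUITES`, `register_suite`, and the returned dictionary layout are hypothetical names, and the diarization conditions and metric are illustrative placeholders, not a planned suite definition.

```python
from typing import Callable, Dict

# Hypothetical registry mapping suite_id to a factory that builds the suite.
SUITES: Dict[str, Callable[[], dict]] = {}


def register_suite(suite_id: str):
    """Decorator that adds a suite factory to the registry under suite_id."""
    def decorator(factory: Callable[[], dict]) -> Callable[[], dict]:
        if suite_id in SUITES:
            raise ValueError(f"duplicate suite_id: {suite_id}")
        SUITES[suite_id] = factory
        return factory
    return decorator


@register_suite("ab/diarization")
def diarization_suite() -> dict:
    # All values below are placeholders for illustration.
    return {
        "suite_id": "ab/diarization",
        "revision": "r0",
        "conditions": ["two-speaker", "overlap"],
        "metric": "der",
    }


print(sorted(SUITES))
```

Rejecting duplicate identifiers at registration time keeps suite_id stable and unambiguous, which is what makes runs comparable across builds.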