# audiobench
A reproducible CLI benchmark for audio ML models.
A single clean-set metric hides failure modes. audiobench reports performance across realistic perturbations and mixtures — so you find out where a model actually breaks, not just how it scores on the easy slice.
## What's in the MVP
- **ab/asr-robust**: Speech recognition under noise, bandlimiting, and reverb. Reports per-condition WER plus a weighted mean. Default model: Whisper.
- **ab/asr-hallucination**: Non-speech hallucination benchmark with ranked findings, bootstrap CIs, and holdout validation status.
- **ab/sound-id**: Sound-event identification on labeled mixtures. Reports recall, precision, F1, and false-positive rate per mixture size. Default model: a bundled CPU heuristic.
- **Model adapters**: Bundled heuristics, LAION-CLAP zero-shot, and Qwen2-Audio-7B-Instruct (local GPU or remote endpoint).
- **Reproducibility built in**: Manifest, mixture, probe, and prompt seeds are pinned. Every run writes a JSON artifact with a `run_hash`.
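To make the reproducibility idea concrete, here is a minimal sketch of hashing the pinned fields of a run artifact. The field names and the 16-character SHA-256 prefix are illustrative assumptions, not audiobench's actual artifact schema:

```python
import hashlib
import json

def run_hash(artifact: dict) -> str:
    """Hash the reproducibility-relevant fields of a run artifact.

    Canonical JSON (sorted keys, fixed separators) keeps the hash
    stable across platforms and dict insertion orders.
    """
    pinned_keys = ("suite", "model", "manifest_seed", "mixture_seed",
                   "probe_seed", "prompt_seed")
    pinned = {k: artifact[k] for k in pinned_keys}
    blob = json.dumps(pinned, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]

artifact = {
    "suite": "ab/sound-id",
    "model": "heuristic-v0",
    "manifest_seed": 1, "mixture_seed": 2,
    "probe_seed": 3, "prompt_seed": 4,
    "results": {"recall": 0.81},  # results are excluded from the hash
}
h = run_hash(artifact)
```

Hashing only the pinned inputs (never the results) means two runs with the same `run_hash` were configured identically, so any metric difference is attributable to the model.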
## In one command
One command gets you a full ab/sound-id run on the bundled demo pack, with no downloads and no GPU. From there:
```shell
audiobench run ab/sound-id --profile demo-fast --model heuristic-v0 --output results/demo-heuristic.json
audiobench run ab/sound-id --profile demo-fast --model heuristic-weak --output results/demo-weak.json
audiobench compare results/demo-heuristic.json results/demo-weak.json
```
The compare command dispatches on the suite id baked into each run JSON, so the same call works for ab/asr-robust (lower-WER-wins) and ab/sound-id (higher-recall-wins, lower-FPR-wins).
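The dispatch can be sketched as a per-suite table of metric directions. This is an illustrative sketch, assuming each run JSON carries a suite id and a flat metrics dict; the metric names and schema here are not audiobench's actual format:

```python
# Per-suite metric directions: True means higher is better.
SUITE_DIRECTIONS = {
    "ab/asr-robust": {"wer": False},
    "ab/sound-id": {"recall": True, "fpr": False},
}

def compare_runs(a: dict, b: dict) -> dict:
    """Return, per metric, which run wins under the suite's direction."""
    suite = a["suite"]
    assert b["suite"] == suite, "runs must come from the same suite"
    winners = {}
    for metric, higher_wins in SUITE_DIRECTIONS[suite].items():
        diff = a["metrics"][metric] - b["metrics"][metric]
        better_a = diff > 0 if higher_wins else diff < 0
        winners[metric] = "a" if better_a else ("b" if diff else "tie")
    return winners
```

Keeping the direction table keyed by suite id is what lets one `compare` entry point serve every benchmark without per-suite flags.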
## Benchmark your own model
If your goal is to evaluate your model, start from this flow:
- Implement the adapter protocol (`answer(...)` for `ab/sound-id`, `transcribe(...)` for `ab/asr-robust`).
- Register it in-repo, or expose it as a Python entry point.
- Run with your adapter id.
```shell
audiobench list-models
audiobench run ab/sound-id --model my-sound-model
audiobench run ab/asr-robust --model my-asr-model
```
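As a rough sketch of the shape of an ab/sound-id adapter: the class below is hypothetical, and the `answer(...)` argument and return types are assumptions. The authoritative protocol is documented in Bring your own model.

```python
class MySoundModel:
    """Hypothetical ab/sound-id adapter.

    Assumed contract: answer() receives a waveform and a list of
    candidate labels, and returns the labels it detects in the mix.
    """

    id = "my-sound-model"  # the adapter id passed to --model

    def answer(self, audio: list[float], candidate_labels: list[str]) -> list[str]:
        # A real adapter would run model inference on `audio`;
        # this stub applies a trivial placeholder check per label.
        return [lbl for lbl in candidate_labels if self._present(audio, lbl)]

    def _present(self, audio: list[float], label: str) -> bool:
        return False  # placeholder: detect nothing
```

A real implementation replaces `_present` with inference; the stub only shows where the benchmark's inputs arrive and what it expects back.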
The complete adapter and plugin setup lives in Bring your own model.
## Where to go next
- New here? Start with the quickstart.
- Running on a real dataset? See packs and bring-your-own-data.
- Trying Qwen2-Audio? The qwen2-audio guide has a Modal recipe and a free Colab fallback for laptops without a GPU.
- Adding a model? Models overview covers the adapter protocol.
- Publishing scores? Hugging Face leaderboard integration shows the Space + `audiobench push` flow.