audiobench¶
A reproducible CLI benchmark for audio ML models.
A single clean-set metric hides failure modes. audiobench reports performance across realistic perturbations and mixtures — so you find out where a model actually breaks, not just how it scores on the easy slice.
What's in the MVP¶
-
ab/asr-robust
Speech recognition under noise, bandlimiting, and reverb. Per-condition WER plus a weighted mean. Default model: Whisper.
-
ab/asr-hallucination
Non-speech hallucination benchmark with ranked findings, bootstrap CIs, and holdout validation status.
-
ab/sound-id
Sound-event identification on labeled mixtures. Reports recall, precision, F1, and false-positive rate per mixture size. Default model: a bundled CPU heuristic.
-
Signal suites
Reference-aware fidelity, psychoacoustic masking, and stereo phase checks for any AudioProcessor adapter (codec, DSP chain, plug-in, neural enhancement).
-
Temporal task suites
Frame-level event detection (IoU-matched F1) and speaker diarization (NIST DER with Hungarian alignment and a 0.25 s collar).
-
Model adapters
Bundled heuristics, LAION-CLAP zero-shot, and Qwen2-Audio-7B-Instruct (local GPU or remote endpoint).
-
Reproducibility built in
Manifest, mixture, probe, and prompt seeds are pinned. Every run writes a JSON artifact with a run_hash.
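One way a hash like this could be derived is to serialize the pinned seeds and run settings in a canonical form and digest the result. The following is a minimal sketch under that assumption: the `compute_run_hash` helper, the choice of SHA-256, and every field name in the example configs are hypothetical and may not match what audiobench actually does.

```python
import hashlib
import json


def compute_run_hash(config: dict) -> str:
    """Return a stable hex digest for a run configuration (hypothetical helper)."""
    # Sorted keys and fixed separators make the serialization independent
    # of dict insertion order, so equal configs always hash identically.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical field names, for illustration only.
config_a = {
    "suite": "ab/sound-id",
    "model": "heuristic-v0",
    "seeds": {"manifest": 0, "mixture": 1, "probe": 2, "prompt": 3},
}
# Same content, different key order: should produce the same hash.
config_b = {
    "seeds": {"prompt": 3, "probe": 2, "mixture": 1, "manifest": 0},
    "model": "heuristic-v0",
    "suite": "ab/sound-id",
}
# One seed changed: should produce a different hash.
config_c = {**config_a, "seeds": {**config_a["seeds"], "mixture": 99}}

print(compute_run_hash(config_a) == compute_run_hash(config_b))
print(compute_run_hash(config_a) == compute_run_hash(config_c))
```

The point of the canonical form is that two runs with identical settings compare equal by hash alone, while any change to a seed shows up as a different hash.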
In one command¶
That gets you a full ab/sound-id run on the bundled demo pack, no downloads, no GPU. From there:
audiobench run ab/sound-id --profile demo-fast --model heuristic-v0 --output results/demo-heuristic.json
audiobench run ab/sound-id --profile demo-fast --model heuristic-weak --output results/demo-weak.json
audiobench compare results/demo-heuristic.json results/demo-weak.json
The compare command dispatches on the suite id baked into each run JSON, so the same call works for ab/asr-robust (lower-WER-wins) and ab/sound-id (higher-recall-wins, lower-FPR-wins).
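The dispatch described above can be pictured as a lookup from suite id to metric directions. This is only an illustrative sketch, not the actual compare implementation: the `suite` and `metrics` keys, the metric names, and the `compare_runs` function are assumptions about the run JSON, and the numbers are made up.

```python
# Metric directions per suite, following the text above.
# The metric key names ("wer", "recall", "fpr") are assumptions.
METRIC_DIRECTIONS = {
    "ab/asr-robust": {"wer": "lower"},
    "ab/sound-id": {"recall": "higher", "fpr": "lower"},
}


def compare_runs(run_a: dict, run_b: dict) -> dict:
    """Return, per metric, which run wins: "a", "b", or "tie"."""
    suite = run_a["suite"]
    if suite != run_b["suite"]:
        raise ValueError("runs come from different suites")
    if suite not in METRIC_DIRECTIONS:
        raise ValueError(f"no comparison rule for suite {suite!r}")

    winners = {}
    for metric, direction in METRIC_DIRECTIONS[suite].items():
        a = run_a["metrics"][metric]
        b = run_b["metrics"][metric]
        if a == b:
            winners[metric] = "tie"
        elif (a < b) == (direction == "lower"):
            # a is smaller and lower wins, or a is larger and higher wins.
            winners[metric] = "a"
        else:
            winners[metric] = "b"
    return winners


# Made-up values standing in for two run JSON files.
strong = {"suite": "ab/sound-id", "metrics": {"recall": 0.9, "fpr": 0.05}}
weak = {"suite": "ab/sound-id", "metrics": {"recall": 0.6, "fpr": 0.2}}
print(compare_runs(strong, weak))
```

Keeping the direction table next to the suite id means the caller never has to say which way a metric points; the run files carry enough information on their own.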
Benchmark your own model¶
To evaluate your own model, follow this flow:
- Implement the adapter protocol (answer(...) for ab/sound-id, transcribe(...) for ab/asr-robust).
- Register it in-repo, or expose it as a Python entry point.
- Run with your adapter id.
audiobench list-models
audiobench run ab/sound-id --model my-sound-model
audiobench run ab/asr-robust --model my-asr-model
The complete adapter and plugin setup lives in Bring your own model.
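As a rough picture of what an adapter might look like, here is a sketch using `typing.Protocol`. Only the method names `answer` and `transcribe` come from this page; the parameter lists, return types, the `MODEL_REGISTRY` dict, and the `register` decorator are all hypothetical stand-ins, so check Bring your own model for the real signatures and registration mechanism.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SoundIdAdapter(Protocol):
    # Hypothetical signature: the real arguments are not shown on this page.
    def answer(self, audio_path: str, question: str) -> str: ...


@runtime_checkable
class AsrAdapter(Protocol):
    # Hypothetical signature: the real arguments are not shown on this page.
    def transcribe(self, audio_path: str) -> str: ...


# Hypothetical in-process registry keyed by adapter id.
MODEL_REGISTRY: dict[str, object] = {}


def register(model_id: str):
    """Class decorator that stores an adapter instance under model_id."""
    def decorator(cls):
        MODEL_REGISTRY[model_id] = cls()
        return cls
    return decorator


@register("my-sound-model")
class AlwaysNoModel:
    """Toy adapter that answers "no" to every question."""

    def answer(self, audio_path: str, question: str) -> str:
        return "no"


adapter = MODEL_REGISTRY["my-sound-model"]
print(isinstance(adapter, SoundIdAdapter))  # structural check on method names
print(adapter.answer("clip.wav", "Is there a dog barking?"))
```

A structural protocol means the adapter class does not need to inherit from anything in the benchmark package; it only has to provide a method with the expected name.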
Where to go next¶
- New here? Start with the quickstart.
- Running on a real dataset? See packs and bring-your-own-data.
- Trying Qwen2-Audio? The qwen2-audio guide has a Modal recipe and a free Colab fallback for laptops without a GPU.
- Adding a model? Models overview covers the adapter protocol.
- Publishing scores? Hugging Face leaderboard integration shows the Space + audiobench push flow.
Telemetry and privacy¶
Default: audiobench does not phone home. Model adapters you choose (Whisper, Qwen, etc.) may download weights from Hugging Face — that is separate from audiobench telemetry.
First run (interactive terminal): you may see a one-time prompt asking
whether to share anonymous usage stats (command name, suite, adapter,
duration, success/failure). Default is no. Nothing is sent unless you
accept or set AUDIOBENCH_TELEMETRY=1.
What we never collect: audio, transcripts, run JSON, file paths, IP addresses, or credentials. See the full schema in Telemetry reference.
Opt out anytime:
Docs site: cookieless GoatCounter page views (aggregate country + referrer only). Honours Do Not Track.
PyPI: download counts by Python/OS; country breakdown via public BigQuery when we run the snapshot job. No install referrer from PyPI.
Public dashboard: Analytics merges PyPI, GitHub, mentions (HN / Reddit / Bluesky), and opt-in CLI aggregates.