
audiobench

A reproducible CLI benchmark for audio ML models.

A single clean-set metric hides failure modes. audiobench reports performance across realistic perturbations and mixtures — so you find out where a model actually breaks, not just how it scores on the easy slice.



What's in the MVP

  • ab/asr-robust


    Speech recognition under noise, bandlimiting, and reverb. Per-condition WER plus a weighted mean. Default model: Whisper.

    Suite reference

  • ab/asr-hallucination


    Non-speech hallucination benchmark with ranked findings, bootstrap CIs, and holdout validation status.

    Suite reference

  • ab/sound-id


    Sound-event identification on labeled mixtures. Reports recall, precision, F1, and false-positive rate per mixture size. Default model: a bundled CPU heuristic.

    Suite reference

  • Signal suites


    Reference-aware fidelity, psychoacoustic masking, and stereo phase checks for any AudioProcessor adapter (codec, DSP chain, plug-in, neural enhancement).

    Fidelity · Psychoacoustics · Phase

  • Temporal task suites


    Frame-level event detection (IoU-matched F1) and speaker diarization (NIST DER with Hungarian alignment and a 0.25 s collar).

    SED · Diarization

  • Model adapters


    Bundled heuristics, LAION-CLAP zero-shot, and Qwen2-Audio-7B-Instruct (local GPU or remote endpoint).

    Models

  • Reproducibility built in


    Manifest, mixture, probe, and prompt seeds are pinned. Every run writes a JSON artifact with a run_hash.

    Reproducibility guarantees
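
One way to picture how a run_hash can follow from pinned seeds is to hash a canonical serialization of the run configuration. The sketch below is illustrative only: the field names (suite, model, seeds) and the choice of SHA-256 over sorted-key JSON are assumptions, not audiobench's actual scheme.

```python
import hashlib
import json


def run_hash(config: dict) -> str:
    """Hash a run configuration deterministically (hypothetical scheme)."""
    # Sorted keys and fixed separators make the serialization independent
    # of dict insertion order, so equal configs give equal hashes.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical config layout; the real run artifact may look different.
config_a = {
    "suite": "ab/sound-id",
    "model": "heuristic-v0",
    "seeds": {"manifest": 0, "mixture": 1, "probe": 2, "prompt": 3},
}
# Same content as config_a, written in a different key order.
config_b = {
    "seeds": {"prompt": 3, "probe": 2, "mixture": 1, "manifest": 0},
    "model": "heuristic-v0",
    "suite": "ab/sound-id",
}
# Same as config_a except for one seed.
config_c = {**config_a, "seeds": {**config_a["seeds"], "mixture": 99}}

print(run_hash(config_a) == run_hash(config_b))  # True: same content
print(run_hash(config_a) == run_hash(config_c))  # False: one seed changed
```

Under this scheme, two runs with identical pinned inputs share a hash, and changing any seed produces a different one.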


In one command

pip install audiobench
audiobench run ab/sound-id --model heuristic-v0

That gets you a full ab/sound-id run on the bundled demo pack, with no downloads and no GPU. From there, run two adapters on the same profile and compare the results:

audiobench run ab/sound-id --profile demo-fast --model heuristic-v0   --output results/demo-heuristic.json
audiobench run ab/sound-id --profile demo-fast --model heuristic-weak --output results/demo-weak.json
audiobench compare results/demo-heuristic.json results/demo-weak.json

The compare command dispatches on the suite id stored in each run JSON, so the same call works for ab/asr-robust (lower WER wins) and ab/sound-id (higher recall wins, lower FPR wins).
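
As a rough illustration of dispatching on the suite id, the sketch below looks up a per-suite table of metric directions and picks a winner per metric. The JSON keys (suite, metrics), the metric names, and the numbers are assumptions made for the example; the real run artifact and compare logic may differ.

```python
import json
from pathlib import Path

# Hypothetical per-suite metric directions: True means higher is better.
METRIC_DIRECTIONS = {
    "ab/asr-robust": {"wer": False},
    "ab/sound-id": {"recall": True, "fpr": False},
}


def load_run(path: str) -> dict:
    """Read a run artifact from disk (assumed to be a single JSON object)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def compare_runs(run_a: dict, run_b: dict) -> dict:
    """Return the winner ("a", "b", or "tie") for each metric of the shared suite."""
    if run_a["suite"] != run_b["suite"]:
        raise ValueError("runs come from different suites")
    winners = {}
    for metric, higher_is_better in METRIC_DIRECTIONS[run_a["suite"]].items():
        a, b = run_a["metrics"][metric], run_b["metrics"][metric]
        if a == b:
            winners[metric] = "tie"
        elif (a > b) == higher_is_better:
            winners[metric] = "a"
        else:
            winners[metric] = "b"
    return winners


# Made-up numbers, for illustration only.
strong = {"suite": "ab/sound-id", "metrics": {"recall": 0.9, "fpr": 0.1}}
weak = {"suite": "ab/sound-id", "metrics": {"recall": 0.6, "fpr": 0.3}}
print(compare_runs(strong, weak))  # {'recall': 'a', 'fpr': 'a'}
```

Keeping the direction table separate from the comparison loop means a new suite only needs a new table entry.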


Benchmark your own model

To evaluate your own model, follow this flow:

  1. Implement the adapter protocol (answer(...) for ab/sound-id, transcribe(...) for ab/asr-robust).
  2. Register it in-repo, or expose it as a Python entry point.
  3. Run with your adapter id.

audiobench list-models
audiobench run ab/sound-id --model my-sound-model
audiobench run ab/asr-robust --model my-asr-model

The complete adapter and plugin setup lives in Bring your own model.
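
To give a feel for step 1, here is a minimal sketch of what adapter classes could look like. Only the method names answer and transcribe come from this page; the parameters, return types, Protocol classes, and the ADAPTERS registry are hypothetical, so check Bring your own model for the real signatures.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SoundIdAdapter(Protocol):
    # Hypothetical signature; this page only names the method answer(...).
    def answer(self, audio_path: str, question: str) -> str: ...


@runtime_checkable
class AsrAdapter(Protocol):
    # Hypothetical signature; this page only names the method transcribe(...).
    def transcribe(self, audio_path: str) -> str: ...


class MySoundModel:
    """Toy adapter that always gives the same answer; swap in real inference."""

    def answer(self, audio_path: str, question: str) -> str:
        return "no"


class MyAsrModel:
    """Toy adapter that returns an empty transcript; swap in real inference."""

    def transcribe(self, audio_path: str) -> str:
        return ""


# Hypothetical in-memory registry keyed by the id passed to --model.
ADAPTERS = {"my-sound-model": MySoundModel, "my-asr-model": MyAsrModel}

adapter = ADAPTERS["my-sound-model"]()
# runtime_checkable protocols check that the required method is present.
print(isinstance(adapter, SoundIdAdapter))  # True
print(isinstance(adapter, AsrAdapter))      # False
```

Structural typing keeps the adapter free of any base-class import: a class qualifies simply by defining the right method.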


Telemetry and privacy

Default: audiobench does not phone home. Model adapters you choose (Whisper, Qwen, etc.) may download weights from Hugging Face — that is separate from audiobench telemetry.

First run (interactive terminal): you may see a one-time prompt asking whether to share anonymous usage stats (command name, suite, adapter, duration, success/failure). Default is no. Nothing is sent unless you accept or set AUDIOBENCH_TELEMETRY=1.

What we never collect: audio, transcripts, run JSON, file paths, IP addresses, or credentials. See the full schema in Telemetry reference.

Opt out anytime:

export AUDIOBENCH_TELEMETRY=0
# or delete ~/.config/audiobench/consent.json
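
The opt-in rules above can be read as a small decision procedure. The sketch below is one interpretation, not audiobench's code: it assumes the environment variable takes precedence over the consent file, and the "accepted" key inside consent.json is a made-up field name.

```python
import json
import os
from pathlib import Path

CONSENT_PATH = Path.home() / ".config" / "audiobench" / "consent.json"


def telemetry_enabled(env=None, consent_path=CONSENT_PATH) -> bool:
    """Decide whether usage stats may be sent (hypothetical resolution order)."""
    env = os.environ if env is None else env
    flag = env.get("AUDIOBENCH_TELEMETRY")
    if flag == "1":  # explicit opt-in
        return True
    if flag == "0":  # explicit opt-out overrides any stored consent
        return False
    try:
        consent = json.loads(Path(consent_path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing or unreadable consent file: fall back to the default, no.
        return False
    # "accepted" is an assumed key; the real consent schema is not shown here.
    return isinstance(consent, dict) and consent.get("accepted") is True


print(telemetry_enabled(env={"AUDIOBENCH_TELEMETRY": "0"}))  # False
```

With no variable set and no consent file, the function returns False, matching the stated default.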

Docs site: cookieless GoatCounter page views (aggregate country + referrer only). Honours Do Not Track.

PyPI: download counts by Python version and OS; country breakdown from public BigQuery data when we run the snapshot job. PyPI provides no install referrer.

Public dashboard: Analytics merges PyPI, GitHub, mentions (HN / Reddit / Bluesky), and opt-in CLI aggregates.

Analytics setup guide · Take the 3-question survey