Quickstart

Install

audiobench installs from source as an editable Python package and requires Python 3.10 or later.

python -m venv .venv
source .venv/bin/activate
pip install -e .

Confirm the CLI is on your PATH:

audiobench --help

macOS + Python 3.13: ModuleNotFoundError: No module named 'audiobench'

If audiobench --help raises ModuleNotFoundError: No module named 'audiobench' immediately after pip install -e ., this is a known macOS + pip + Python 3.13 site.py interaction (Python issue #127012 / pip issue #13153).

pip-installed files inherit a com.apple.provenance xattr that carries the UF_HIDDEN flag, and Python 3.13's site.py skips .pth files with that flag, so the editable-install pointer never lands on sys.path. Clear the flag on the venv's site-packages:

chflags -R nohidden .venv/lib/python3.13/site-packages
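
If you want to confirm the flag is the culprit, Python's stat module exposes UF_HIDDEN on macOS; a quick count over the venv (a rough check, assuming the layout above):

# Count files under site-packages that carry UF_HIDDEN (macOS only).
import os
import stat

site = ".venv/lib/python3.13/site-packages"
hidden = sum(
    bool(os.lstat(os.path.join(root, name)).st_flags & stat.UF_HIDDEN)
    for root, _, names in os.walk(site)
    for name in names
)
print(hidden)  # should be 0 after chflags -R nohidden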

macOS + iCloud Drive: hung Python imports

If your project sits under ~/Documents (or any iCloud Drive folder), macOS may evict .venv files under disk pressure (the files gain the compressed and dataless file flags). Fresh Python processes then hang in read() on evicted .pyc files while iCloud fetches them back. Move the project somewhere outside iCloud Drive:

mkdir -p ~/code
cp -R ~/Documents/audiobench ~/code/
cd ~/code/audiobench
rm -rf .venv && python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Verify with find .venv -type f -flags +dataless | wc -l (should print 0).

First run: ab/sound-id on the demo pack

The demo pack runs end-to-end with no downloads and no GPU. Good first sanity check:

audiobench run ab/sound-id --model heuristic-v0

For each mixture, the model is asked once per candidate label using the bundled prompt set (canonical wording: "Do you hear a {label}?"). The benchmark scores how many components of the mixture were correctly identified; the sketch after the metrics below makes the loop concrete.

You'll see four conditions:

  • solo — N=1 (sanity)
  • pair — N=2
  • triple — N=3
  • quad — N=4

Each (pack, condition) row reports:

  • recall — of the sounds actually in the mixture, what fraction did the model correctly say "yes" to? (1.0 = caught every component; lower = missed some.)
  • precision — of the times the model said "yes", what fraction were actually present? (1.0 = no false alarms; lower = it claims to hear things that aren't there.)
  • F1 — a single combined score blending recall and precision; useful when you want one number.
  • FPR (false-positive rate) — for sounds that are NOT in the mixture (distractors), how often does the model still say "yes"? (0.0 = never hallucinates; higher = it answers "yes" too eagerly.)

Headline number: components understood: X / Y — across every mixture, X is how many ground-truth components the model identified out of Y total. This is the number you'd quote in a tweet.
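
To make the scoring concrete, here is a rough per-mixture sketch; the labels and model_says_yes are illustrative stand-ins, not the audiobench API:

# One mixture: two real components plus two distractor labels.
truth = {"dog-bark", "siren"}
candidates = ["dog-bark", "siren", "rain", "applause"]

def model_says_yes(prompt: str) -> bool:
    return "dog-bark" in prompt   # stand-in for a real adapter call

# One yes/no question per candidate label, canonical prompt wording.
said_yes = {
    label for label in candidates
    if model_says_yes(f"Do you hear a {label}?")
}

tp = len(said_yes & truth)                      # components caught
recall = tp / len(truth)
precision = tp / len(said_yes) if said_yes else 0.0
f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
fpr = len(said_yes - truth) / len(set(candidates) - truth)

# Summing tp and len(truth) over every mixture gives the headline
# "components understood: X / Y".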

First run: ab/asr-robust

audiobench run ab/asr-robust --model whisper-tiny

Conditions: clean, noise-cafe-10db, noise-pink-5db, bandlimited-8k, reverb-medium. Reports per-condition WER and a weighted mean.
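
WER is the standard word-level edit distance over the reference length (audiobench's exact text normalization is not shown here); a minimal sketch:

# Word error rate: word-level Levenshtein distance / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    dist = list(range(len(hyp) + 1))          # one rolling DP row
    for i, r in enumerate(ref, start=1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dist[j] = dist[j], min(
                prev + (r != h),              # substitution / match
                dist[j] + 1,                  # deletion
                dist[j - 1] + 1,              # insertion
            )
    return dist[-1] / max(len(ref), 1)

print(wer("turn the lamp off", "turn lamp off"))  # 0.25
# The weighted mean is then sum(n_c * wer_c) / sum(n_c) over conditions c.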

First run: ab/asr-hallucination

audiobench run ab/asr-hallucination --model whisper-tiny

Conditions: silence, music, noise. The summary includes ranked findings with bootstrap CIs, multiple-testing-corrected q-values, and validation status (validated, candidate, rejected).
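
The exact statistical recipe is internal to audiobench, but both ingredients are standard; a rough sketch with made-up data, using a percentile bootstrap for the CI and Benjamini-Hochberg for the q-values:

import random

# Per-clip hallucination indicators for one condition (made up).
flags = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# 95% bootstrap CI on the hallucination rate (percentile method).
rates = sorted(
    sum(random.choices(flags, k=len(flags))) / len(flags)
    for _ in range(10_000)
)
lo, hi = rates[250], rates[9_750]
print(f"rate={sum(flags) / len(flags):.2f}  95% CI [{lo:.2f}, {hi:.2f}]")

# Benjamini-Hochberg q-values for the findings' p-values.
pvals = [0.001, 0.04, 0.20]
m = len(pvals)
order = sorted(range(m), key=lambda i: pvals[i])
qvals, running = [0.0] * m, 1.0
for rank, i in reversed(list(enumerate(order, start=1))):
    running = min(running, pvals[i] * m / rank)
    qvals[i] = running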

Compare two models

audiobench run ab/sound-id --model heuristic-v0    --output results/sound-id-heuristic.json
audiobench run ab/sound-id --model heuristic-weak  --output results/sound-id-weak.json
audiobench compare results/sound-id-heuristic.json results/sound-id-weak.json

compare dispatches on the suite id recorded in each run JSON, so the same command works for ab/asr-robust (lower WER wins), ab/asr-hallucination (lower hallucination rate wins, with finding statuses surfaced), and ab/sound-id (higher recall and lower FPR win).
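
In code terms the dispatch is roughly the following; the run-JSON field name here is an assumption, purely illustrative:

import json

def compare(path_a: str, path_b: str) -> None:
    run_a = json.load(open(path_a))
    run_b = json.load(open(path_b))
    suite = run_a["suite"]          # hypothetical key name
    assert suite == run_b["suite"], "compare runs from the same suite"
    if suite == "ab/asr-robust":
        pass                        # lower WER wins
    elif suite == "ab/asr-hallucination":
        pass                        # lower hallucination rate wins
    elif suite == "ab/sound-id":
        pass                        # higher recall, lower FPR wins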

For a live-presentation-friendly profile (~30 mixtures, finishes in under 90 s on a laptop):

audiobench run ab/sound-id --profile demo-fast --model heuristic-v0   --output results/demo-heuristic.json
audiobench run ab/sound-id --profile demo-fast --model heuristic-weak --output results/demo-weak.json
audiobench compare results/demo-heuristic.json results/demo-weak.json

Publish a run to the leaderboard

Log in once to Hugging Face:

hf auth login

Then upload any run JSON:

audiobench push results/sound-id-heuristic.json --pretty-json

If you do not pass --repo, audiobench push automatically uses <your-username>/audiobench-leaderboard-submissions.

Benchmark your own model in 3 steps

  1. Implement an adapter (answer(...) for sound-id, transcribe(...) for ASR); a rough sketch follows the commands below.
  2. Register it in audiobench, or ship it as a plugin via entry points.
  3. Run it with your adapter id.

# Discover available adapter ids and ASR patterns.
audiobench list-models

# Your own sound-id adapter.
audiobench run ab/sound-id --model my-sound-model

# Your own ASR adapter.
audiobench run ab/asr-robust --model my-asr-model
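
As a rough idea of step 1, a sound-id adapter only has to answer yes/no per prompt. The shape below is illustrative only; the real base class, signatures, and registration hook are documented in Bring your own model:

# Illustrative adapter shape, not the actual audiobench interface.
class MySoundModel:
    model_id = "my-sound-model"

    def answer(self, audio, prompt: str) -> bool:
        # Run your model on `audio` and decide whether the label
        # queried in `prompt` is audible.
        return False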

See Bring your own model for a minimal adapter stub and plugin setup.

What's next