Quickstart

Install

audiobench installs from source as an editable Python package and requires Python 3.10 or later.

python -m venv .venv
source .venv/bin/activate
pip install -e .

Confirm the CLI is on your PATH:

audiobench --help

macOS + Python 3.13: ModuleNotFoundError: No module named 'audiobench'

If audiobench --help raises ModuleNotFoundError: No module named 'audiobench' immediately after pip install -e ., this is a known macOS + pip + Python 3.13 site.py interaction (Python issue #127012 / pip issue #13153).

pip-installed files inherit a com.apple.provenance xattr that carries the UF_HIDDEN flag, and Python 3.13's site.py skips .pth files with that flag, so the editable-install pointer never lands on sys.path. Clear the flag on the venv's site-packages:

chflags -R nohidden .venv/lib/python3.13/site-packages
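
If you want to confirm the flag is the culprit, Python's stat module exposes UF_HIDDEN on macOS; a quick count over the venv (a rough check, assuming the layout above):

# Count files under site-packages that carry UF_HIDDEN (macOS only).
import os
import stat

site = ".venv/lib/python3.13/site-packages"
hidden = sum(
    bool(os.lstat(os.path.join(root, name)).st_flags & stat.UF_HIDDEN)
    for root, _, names in os.walk(site)
    for name in names
)
print(hidden)  # should be 0 after chflags -R nohidden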

macOS + iCloud Drive: hung Python imports

If your project sits under ~/Documents (or any iCloud Drive folder), macOS may evict .venv files under disk pressure (the files gain the compressed and dataless file flags). Fresh Python processes then hang in read() on evicted .pyc files while iCloud fetches them back. Move the project somewhere outside iCloud Drive:

mkdir -p ~/code
cp -R ~/Documents/audiobench ~/code/
cd ~/code/audiobench
rm -rf .venv && python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Verify with find .venv -type f -flags +dataless | wc -l (should print 0).

First run: ab/sound-id on the demo pack

The demo pack runs end-to-end with no downloads and no GPU. Good first sanity check:

audiobench run ab/sound-id --model heuristic-v0

For each mixture, the model is asked once per candidate label using the bundled prompt set (canonical wording: "Do you hear a {label}?"). The benchmark scores how many components of the mixture were correctly identified; the sketch after the metrics below makes the loop concrete.

You'll see four conditions:

  • solo — N=1 (sanity)
  • pair — N=2
  • triple — N=3
  • quad — N=4

Each (pack, condition) row reports:

  • recall — of the sounds actually in the mixture, what fraction did the model correctly say "yes" to? (1.0 = caught every component; lower = missed some.)
  • precision — of the times the model said "yes", what fraction were actually present? (1.0 = no false alarms; lower = it claims to hear things that aren't there.)
  • F1 — a single combined score blending recall and precision; useful when you want one number.
  • FPR (false-positive rate) — for sounds that are NOT in the mixture (distractors), how often does the model still say "yes"? (0.0 = never hallucinates; higher = it answers "yes" too eagerly.)

Headline number: components understood: X / Y — across every mixture, X is how many ground-truth components the model identified out of Y total. This is the number you'd quote in a tweet.
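
To make the scoring concrete, here is a rough per-mixture sketch; the labels and model_says_yes are illustrative stand-ins, not the audiobench API:

# One mixture: two real components plus two distractor labels.
truth = {"dog-bark", "siren"}
candidates = ["dog-bark", "siren", "rain", "applause"]

def model_says_yes(prompt: str) -> bool:
    return "dog-bark" in prompt   # stand-in for a real adapter call

# One yes/no question per candidate label, canonical prompt wording.
said_yes = {
    label for label in candidates
    if model_says_yes(f"Do you hear a {label}?")
}

tp = len(said_yes & truth)                      # components caught
recall = tp / len(truth)
precision = tp / len(said_yes) if said_yes else 0.0
f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
fpr = len(said_yes - truth) / len(set(candidates) - truth)

# Summing tp and len(truth) over every mixture gives the headline
# "components understood: X / Y".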

First run: ab/asr-robust

audiobench run ab/asr-robust --model whisper-tiny

Conditions: clean, noise-cafe-10db, noise-pink-5db, bandlimited-8k, reverb-medium. Reports per-condition WER and a weighted mean.
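
WER is the standard word-level edit distance over the reference length (audiobench's exact text normalization is not shown here); a minimal sketch:

# Word error rate: word-level Levenshtein distance / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    dist = list(range(len(hyp) + 1))          # one rolling DP row
    for i, r in enumerate(ref, start=1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dist[j] = dist[j], min(
                prev + (r != h),              # substitution / match
                dist[j] + 1,                  # deletion
                dist[j - 1] + 1,              # insertion
            )
    return dist[-1] / max(len(ref), 1)

print(wer("turn the lamp off", "turn lamp off"))  # 0.25
# The weighted mean is then sum(n_c * wer_c) / sum(n_c) over conditions c.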

First run: ab/asr-hallucination

audiobench run ab/asr-hallucination --model whisper-tiny

Conditions: silence, music, noise. The summary includes ranked findings with bootstrap CIs, multiple-testing-corrected q-values, and validation status (validated, candidate, rejected).
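
The exact statistical recipe is internal to audiobench, but both ingredients are standard; a rough sketch with made-up data, using a percentile bootstrap for the CI and Benjamini-Hochberg for the q-values:

import random

# Per-clip hallucination indicators for one condition (made up).
flags = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# 95% bootstrap CI on the hallucination rate (percentile method).
rates = sorted(
    sum(random.choices(flags, k=len(flags))) / len(flags)
    for _ in range(10_000)
)
lo, hi = rates[250], rates[9_750]
print(f"rate={sum(flags) / len(flags):.2f}  95% CI [{lo:.2f}, {hi:.2f}]")

# Benjamini-Hochberg q-values for the findings' p-values.
pvals = [0.001, 0.04, 0.20]
m = len(pvals)
order = sorted(range(m), key=lambda i: pvals[i])
qvals, running = [0.0] * m, 1.0
for rank, i in reversed(list(enumerate(order, start=1))):
    running = min(running, pvals[i] * m / rank)
    qvals[i] = running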

Compare two models

audiobench run ab/sound-id --model heuristic-v0    --output results/sound-id-heuristic.json
audiobench run ab/sound-id --model heuristic-weak  --output results/sound-id-weak.json
audiobench compare results/sound-id-heuristic.json results/sound-id-weak.json

compare dispatches on the suite id recorded in each run JSON, so the same command works for ab/asr-robust (lower WER wins), ab/asr-hallucination (lower hallucination rate wins, with finding statuses surfaced), and ab/sound-id (higher recall and lower FPR win).
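
In code terms the dispatch is roughly the following; the run-JSON field name here is an assumption, purely illustrative:

import json

def compare(path_a: str, path_b: str) -> None:
    run_a = json.load(open(path_a))
    run_b = json.load(open(path_b))
    suite = run_a["suite"]          # hypothetical key name
    assert suite == run_b["suite"], "compare runs from the same suite"
    if suite == "ab/asr-robust":
        pass                        # lower WER wins
    elif suite == "ab/asr-hallucination":
        pass                        # lower hallucination rate wins
    elif suite == "ab/sound-id":
        pass                        # higher recall, lower FPR wins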

For a live-presentation-friendly profile (~30 mixtures, finishes in under 90 s on a laptop):

audiobench run ab/sound-id --profile demo-fast --model heuristic-v0   --output results/demo-heuristic.json
audiobench run ab/sound-id --profile demo-fast --model heuristic-weak --output results/demo-weak.json
audiobench compare results/demo-heuristic.json results/demo-weak.json

Publish a run to the leaderboard

Log in once to Hugging Face:

hf auth login

Then upload any run JSON:

audiobench push results/sound-id-heuristic.json --pretty-json

If you do not pass --repo, audiobench push automatically uses <your-username>/audiobench-leaderboard-submissions.

Benchmark your own model in 3 steps

  1. Implement an adapter (answer(...) for sound-id, transcribe(...) for ASR); a rough sketch follows the commands below.
  2. Register it in audiobench, or ship it as a plugin via entry points.
  3. Run it with your adapter id.

# Discover available adapter ids and ASR patterns.
audiobench list-models

# Your own sound-id adapter.
audiobench run ab/sound-id --model my-sound-model

# Your own ASR adapter.
audiobench run ab/asr-robust --model my-asr-model
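
As a rough idea of step 1, a sound-id adapter only has to answer yes/no per prompt. The shape below is illustrative only; the real base class, signatures, and registration hook are documented in Bring your own model:

# Illustrative adapter shape, not the actual audiobench interface.
class MySoundModel:
    model_id = "my-sound-model"

    def answer(self, audio, prompt: str) -> bool:
        # Run your model on `audio` and decide whether the label
        # queried in `prompt` is audible.
        return False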

See Bring your own model for a minimal adapter stub and plugin setup.

What's next