`ab/sound-id`

Sound-event identification on mixtures of labeled clips. For each mixture, the model is asked a yes/no question once per candidate label (the ground-truth labels plus distractor labels), using the bundled prompt set.

audiobench run ab/sound-id --model heuristic-v0

How it scores

For each (pack, condition) row:

recall: of the sounds actually in the mixture, what fraction did the model correctly answer "yes" to?
precision: of the times the model answered "yes", what fraction were sounds actually present?
F1: the harmonic mean of precision and recall, 2 × precision × recall / (precision + recall).
FPR (false-positive rate): of the distractor sounds that are NOT in the mixture, what fraction did the model still answer "yes" to?

The headline is components understood: X / Y. Across every mixture, Y is the total number of ground-truth components and X is how many of them the model answered "yes" to, so X / Y is recall pooled over the whole run. That's the number meant for a tweet.
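A minimal sketch of the per-row metrics, assuming each (pack, condition) row pools raw counts across its mixtures before dividing (micro-averaging), that every candidate label is either a ground-truth component or a distractor, and that an undefined ratio is reported as 0.0. `Counts` and `scores` are hypothetical names for illustration, not part of audiobench. The worked example reuses mixture 12 from the per-mixture inspect view.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Counts:
    """Raw per-label outcomes pooled over the mixtures in one row (hypothetical)."""
    tp: int = 0  # ground-truth component, model said "yes"
    fn: int = 0  # ground-truth component, model said "no"
    fp: int = 0  # distractor, model said "yes"
    tn: int = 0  # distractor, model said "no"

    def add(self, truth: set[str], distractors: set[str], said_yes: set[str]) -> None:
        self.tp += len(truth & said_yes)
        self.fn += len(truth - said_yes)
        self.fp += len(distractors & said_yes)
        self.tn += len(distractors - said_yes)


def _ratio(num: float, den: float) -> float:
    return num / den if den else 0.0  # assumption: report 0.0 instead of dividing by zero


def scores(c: Counts) -> dict[str, float]:
    recall = _ratio(c.tp, c.tp + c.fn)
    precision = _ratio(c.tp, c.tp + c.fp)
    f1 = _ratio(2 * precision * recall, precision + recall)  # harmonic mean
    fpr = _ratio(c.fp, c.fp + c.tn)
    return {"recall": recall, "precision": precision, "f1": f1, "fpr": fpr}


# Mixture 12 from the inspect example: 3 components, 2 distractors.
row = Counts()
row.add(
    truth={"siren", "glass_breaking", "dog_bark"},
    distractors={"chainsaw", "car_horn"},
    said_yes={"siren", "dog_bark", "chainsaw"},
)
print(scores(row))
print(f"components understood: {row.tp} / {row.tp + row.fn}")  # components understood: 2 / 3
```

Feeding every mixture of a run into a single `Counts` gives the headline directly: `tp` is X and `tp + fn` is Y.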

Conditions

Conditions are mixture sizes, i.e. how many labeled clips (N) are mixed together:

Condition	N (sources per mixture)	Notes
`solo`	1	sanity check
`pair`	2
`triple`	3
`quad`	4	hardest, polyphony stress test

Run a subset:

audiobench run ab/sound-id --model heuristic-v0 --conditions solo,pair

Packs

Each `ab/sound-id` run targets one or more packs. A pack defines a label set and the source dataset(s) its clips come from.

Pack	Source	Labels	License / scope
`demo`	Procedural (bundled, no download)	`siren`, `alarm`, `dog_bark`, `engine`, `glass_breaking`, `baby_cry`, `coughing`, `water`, `vacuum`, `speech`	bundled
`core`	FSD50K (single-positive PP filter)	~80 high-confidence classes from the AudioSet ontology	CC-BY 4.0 / CC0 (user-supplied)
`home`	DESED synthetic subset	`alarm_bell`, `cat`, `dishes`, `frying`, `blender`, `water`, `speech`, `vacuum`, `dog`, `electric_shaver`	open (user-supplied)
`cabin`	FSD50K + UrbanSound8K	`engine`, `traffic`, `baby_cry`, `music`, `speech`, `car_horn`, `siren`, `drilling`	non-commercial research
`security`	UrbanSound8K	`gun_shot`, `siren`, `car_horn`, `dog_bark`, `jackhammer`	non-commercial research
`health`	ESC-50 medical subset	`coughing`, `sneezing`, `breathing`, `snoring`, `crying_baby`	non-clinical scope

The `demo` pack runs with no downloads and powers the headline demo. All other packs require user-supplied data under `~/.cache/audiobench/sound_id/<source>/`.

audiobench list-packs
audiobench info ab/sound-id
audiobench info ab/sound-id --pack home

Bringing your own data

~/.cache/audiobench/sound_id/
  fsd50k/
    FSD50K.dev_audio/...
  urbansound8k/
    audio/fold1/...
  desed/
    synthetic21_train/soundscapes/...
  esc50/
    audio/...

`audiobench info ab/sound-id --pack <name>` prints the expected layout. If files are missing, the suite skips that pack with a clear message instead of failing.
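The skip-rather-than-fail behavior can be pictured as a pre-flight check. In this sketch the pack-to-source mapping comes from the Packs table and the subdirectories from the layout above; treating those subdirectories as the exact paths the suite checks is an assumption, and `missing_paths` / `runnable_packs` are hypothetical helpers.

```python
from __future__ import annotations

from pathlib import Path

# Subdirectory expected under each source root (taken from the layout above).
EXPECTED = {
    "fsd50k": "FSD50K.dev_audio",
    "urbansound8k": "audio",
    "desed": "synthetic21_train/soundscapes",
    "esc50": "audio",
}
# Which sources each pack draws from (taken from the Packs table).
PACK_SOURCES = {
    "demo": [],  # bundled, no download
    "core": ["fsd50k"],
    "home": ["desed"],
    "cabin": ["fsd50k", "urbansound8k"],
    "security": ["urbansound8k"],
    "health": ["esc50"],
}
DEFAULT_ROOT = Path.home() / ".cache" / "audiobench" / "sound_id"


def missing_paths(pack: str, root: Path = DEFAULT_ROOT) -> list[Path]:
    """Return the expected directories for `pack` that do not exist under `root`."""
    paths = [root / src / EXPECTED[src] for src in PACK_SOURCES[pack]]
    return [p for p in paths if not p.is_dir()]


def runnable_packs(packs: list[str], root: Path = DEFAULT_ROOT) -> list[str]:
    """Keep packs whose data is present; print a message for the rest instead of raising."""
    ok = []
    for pack in packs:
        missing = missing_paths(pack, root)
        if missing:
            print(f"skipping pack {pack!r}: missing {', '.join(map(str, missing))}")
        else:
            ok.append(pack)
    return ok
```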

Mixtures

Mixtures can be defined in three layers, and the layers are additive.

Default: canned, seeded mixture set per pack

No authoring required. The default mixture set is fully determined by the pack and the seed:

audiobench run ab/sound-id --pack demo --model heuristic-v0
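One way a canned set can be deterministic in the pack and seed is to derive a private RNG from both and draw distinct label combinations for each condition. This is a sketch under assumptions: `default_mixtures`, `per_condition`, and the sampling scheme are hypothetical, not audiobench's documented generator.

```python
from __future__ import annotations

import math
import random

CONDITIONS = {"solo": 1, "pair": 2, "triple": 3, "quad": 4}
DEMO_LABELS = ["siren", "alarm", "dog_bark", "engine", "glass_breaking",
               "baby_cry", "coughing", "water", "vacuum", "speech"]


def default_mixtures(pack: str, labels: list[str], seed: int,
                     per_condition: int = 5,
                     conditions: dict[str, int] | None = None) -> dict[str, list[tuple[str, ...]]]:
    """Draw `per_condition` distinct label sets per condition, reproducibly."""
    conditions = conditions or CONDITIONS
    rng = random.Random(f"{pack}:{seed}")  # str seeds are hashed with SHA-512, not hash()
    pool = sorted(labels)  # sort so the caller's label order cannot change the result
    mixtures = {}
    for name, n in conditions.items():
        target = min(per_condition, math.comb(len(pool), n))  # never ask for more than exist
        seen: set[tuple[str, ...]] = set()
        while len(seen) < target:
            seen.add(tuple(sorted(rng.sample(pool, n))))
        mixtures[name] = sorted(seen)  # sorted output, independent of set iteration order
    return mixtures
```

Seeding from the string `"<pack>:<seed>"` makes the set depend on both inputs and not on `PYTHONHASHSEED`; it is only guaranteed stable for a given Python version.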

Inline `--mix`: one mixture per flag

Each `--mix` value is a `+`-separated list of labels. The flag is repeatable, with one mixture per occurrence.

audiobench run ab/sound-id --mix "siren+glass_breaking+baby_cry" --model heuristic-v0
audiobench run ab/sound-id --mix "engine+baby_cry" --mix "engine+baby_cry+music" --model heuristic-v0
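A sketch of how a `--mix` value might be parsed and validated. Rejecting empty and duplicate labels, and checking against the pack's label set, are assumptions about the validation rules; `parse_mix` and `parse_mix_flags` are hypothetical helpers.

```python
from __future__ import annotations


def parse_mix(spec: str, allowed: set[str] | None = None) -> list[str]:
    """Split a "+"-separated spec such as "siren+glass_breaking" into labels."""
    labels = [part.strip() for part in spec.split("+")]
    if any(not lab for lab in labels):
        raise ValueError(f"empty label in mix spec {spec!r}")
    if len(set(labels)) != len(labels):
        raise ValueError(f"duplicate label in mix spec {spec!r}")
    if allowed is not None:
        unknown = [lab for lab in labels if lab not in allowed]
        if unknown:
            raise ValueError(f"unknown label(s) {unknown} in mix spec {spec!r}")
    return labels


def parse_mix_flags(flags: list[str], allowed: set[str] | None = None) -> list[list[str]]:
    """One mixture per repeated --mix flag, in the order given."""
    return [parse_mix(f, allowed) for f in flags]
```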

Recipe file (YAML or JSON)

For repeatable scenarios, a recipe file sets per-source dB levels and, optionally, pins specific source files:

scenarios/factory_floor.yaml

mixtures:
  - name: factory_alarm
    labels: [siren, glass_breaking]
    snr_db: 0

  - name: cabin_baby_over_engine
    label_levels:
      engine: 0
      baby_cry: -3
      vacuum: -6

audiobench run ab/sound-id --recipes scenarios/factory_floor.yaml --model heuristic-v0
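Per-source dB levels translate to linear amplitude gains via gain = 10^(dB/20), so -6 dB is roughly half amplitude. A minimal sketch, assuming each `label_levels` value is relative to a shared 0 dB reference RMS (the exact semantics of `snr_db` and `label_levels` are not specified on this page) and that sources are trimmed to the shortest one; `mix_at_levels` is a hypothetical helper.

```python
from __future__ import annotations

import math


def rms(x: list[float]) -> float:
    return math.sqrt(sum(v * v for v in x) / len(x))


def db_to_gain(db: float) -> float:
    return 10 ** (db / 20)  # dB to amplitude ratio


def mix_at_levels(sources: dict[str, list[float]], levels_db: dict[str, float]) -> list[float]:
    """Scale each source to unit RMS, apply its dB level, and sum sample by sample."""
    n = min(len(x) for x in sources.values())  # assumption: trim to the shortest source
    out = [0.0] * n
    for label, x in sources.items():
        r = rms(x[:n])
        if r == 0.0:
            raise ValueError(f"source {label!r} is silent; cannot set its level")
        g = db_to_gain(levels_db.get(label, 0.0)) / r  # unlisted labels sit at 0 dB
        for i in range(n):
            out[i] += g * x[i]
    return out
```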

When `--mix` or `--recipes` is used, results land under a custom condition. The run hash includes the canonicalized mixture spec, so any custom run is bit-reproducible.
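Canonicalizing before hashing is what makes formatting differences (key order, `0` versus `0.0`) irrelevant to the hash. The sketch below normalizes numbers, serializes to sorted-key compact JSON, and hashes with SHA-256; these choices and the helper names are assumptions, not audiobench's actual scheme. It covers only the mixture spec, not any other inputs the run hash may include, and it keeps list order because mixture order may matter for indexing.

```python
from __future__ import annotations

import hashlib
import json


def _normalize(obj):
    """Recursively make equivalent specs compare equal (hypothetical rules)."""
    if isinstance(obj, dict):
        return {str(k): _normalize(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [_normalize(v) for v in obj]
    if isinstance(obj, bool):
        return obj  # bool is a subclass of int; keep it as-is
    if isinstance(obj, int):
        return float(obj)  # treat 0 and 0.0 dB as the same level
    return obj


def canonical_spec(mixtures: list[dict]) -> str:
    return json.dumps(_normalize(mixtures), sort_keys=True,
                      separators=(",", ":"), ensure_ascii=False)


def spec_hash(mixtures: list[dict]) -> str:
    return hashlib.sha256(canonical_spec(mixtures).encode("utf-8")).hexdigest()
```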

Mixture preview

Render a mixture WAV without running probes — useful for demo prep, debugging levels, and authoring recipes:

audiobench mix preview --labels siren,glass_breaking,baby_cry --output preview.wav
audiobench mix preview --recipes scenarios/factory_floor.yaml --name cabin_baby_over_engine --output cabin.wav
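Rendering a preview ultimately means writing mixed float samples to a WAV file. A minimal sketch with Python's standard `wave` module; the mono layout, the 16 kHz default sample rate, and normalizing only when the mix would clip are assumptions, not documented audiobench behavior, and `write_preview_wav` is a hypothetical name.

```python
from __future__ import annotations

import array
import sys
import wave


def write_preview_wav(path: str, samples: list[float], sample_rate: int = 16000) -> None:
    """Write mono float samples as 16-bit PCM, scaling down only if they would clip."""
    peak = max((abs(s) for s in samples), default=0.0)
    scale = 1.0 / peak if peak > 1.0 else 1.0  # preserves relative levels otherwise
    pcm = array.array("h", (int(round(s * scale * 32767)) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV PCM data is little-endian
    with wave.open(str(path), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes = 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm.tobytes())
```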

Per-mixture forensic view

audiobench inspect results/sound-id-heuristic.json --mixture 12

mixture 12 (pack=demo, condition=triple)
  ground truth: siren, glass_breaking, dog_bark
  source clips:
    siren           demo://siren@0
    glass_breaking  demo://glass_breaking@0
    dog_bark        demo://dog_bark@0

  model: heuristic-v0
  prompts: version=yesno-v1, parser=v1, ensemble=off (single prompt), source=bundled
  yes responses:
    siren           ✓
    dog_bark        ✓
    glass_breaking  ✗  FALSE NEGATIVE
    chainsaw        ✗  FALSE POSITIVE (distractor)
    car_horn        ✗  (distractor, correct)

  recall    : 2/3 = 0.67
  precision : 2/3 = 0.67
  components understood: 2 of 3

When the run was made with `--prompt-ensemble N`, `inspect` also prints a per-paraphrase breakdown showing each rendered prompt and the model's individual yes/no answer to it.
