
ab/sed-urban

Sound event detection (SED): on a known soundscape, can the model produce correctly-labeled events at the right timestamps?

audiobench run ab/sed-urban --model oracle-sed

What it measures

Each clip is a 10-second procedurally rendered soundscape. A small set of labeled events (siren, dog bark, alarm, engine, glass breaking) is placed at fixed timestamps over a pink-noise bed. The adapter under test returns a list of detected events, each of the form {label, start_s, end_s}, and the suite scores those hypotheses against the ground truth in two complementary ways:

| Metric | Definition | What it captures |
| --- | --- | --- |
| event_f1_iou50 | Micro-averaged F1 over per-label IoU-matched events at IoU ≥ 0.5 | Are individual onsets/offsets correct? Catches over- and under-segmentation. |
| segment_f1_1s | Micro F1 over 1-second presence/absence grids per label | Are events approximately in the right place? Lenient on boundaries. |

The two together give a fair picture: event-F1 is the canonical SED number (strict), segment-F1 is the resilient one (lenient). A model that nails labels but jitters boundaries scores poorly on event-F1 and well on segment-F1; a model that gets time right but mislabels everything fails both.

Metric code lives in src/audiobench/temporal_metrics.py.

Event-F1 matching

For each clip we greedily match reference and hypothesis events by IoU, within each label, in decreasing-IoU order. Pairs with IoU < threshold are not matched. Unmatched reference events become false negatives; unmatched hypothesis events become false positives. Micro precision / recall / F1 are then computed across the entire clip set.
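The matching procedure described above can be sketched roughly as follows. This is an illustrative reading of the description, not the code in src/audiobench/temporal_metrics.py: the function names `iou`, `match_clip`, and `micro_prf` are hypothetical, and returning 0.0 when a denominator is zero is an assumption (the bundled scorer may handle that case differently).

```python
def iou(a, b):
    """Temporal IoU of two events given as dicts with start_s / end_s."""
    inter = min(a["end_s"], b["end_s"]) - max(a["start_s"], b["start_s"])
    if inter <= 0:
        return 0.0
    # For overlapping intervals the union is the span they jointly cover.
    union = max(a["end_s"], b["end_s"]) - min(a["start_s"], b["start_s"])
    return inter / union


def match_clip(refs, hyps, threshold=0.5):
    """Return (tp, fp, fn) for one clip using greedy per-label IoU matching."""
    # Candidate pairs must share a label and reach the IoU threshold.
    pairs = []
    for i, ref in enumerate(refs):
        for j, hyp in enumerate(hyps):
            if ref["label"] == hyp["label"]:
                score = iou(ref, hyp)
                if score >= threshold:
                    pairs.append((score, i, j))
    # Greedy assignment: best-overlapping pairs claim their events first.
    pairs.sort(key=lambda p: p[0], reverse=True)
    used_refs, used_hyps = set(), set()
    for _, i, j in pairs:
        if i not in used_refs and j not in used_hyps:
            used_refs.add(i)
            used_hyps.add(j)
    tp = len(used_refs)
    return tp, len(hyps) - tp, len(refs) - tp


def micro_prf(clips, threshold=0.5):
    """Pool counts over (refs, hyps) pairs, then compute precision/recall/F1."""
    tp = fp = fn = 0
    for refs, hyps in clips:
        t, p, n = match_clip(refs, hyps, threshold)
        tp, fp, fn = tp + t, fp + p, fn + n
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumption for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumption for 0/0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# urban-001 reference, with one bark detected 0.4 s late.
refs = [
    {"label": "siren", "start_s": 1.0, "end_s": 4.0},
    {"label": "dog_bark", "start_s": 5.5, "end_s": 6.2},
    {"label": "dog_bark", "start_s": 7.0, "end_s": 7.5},
]
hyps = [
    {"label": "siren", "start_s": 1.0, "end_s": 4.0},
    {"label": "dog_bark", "start_s": 5.9, "end_s": 6.6},
    {"label": "dog_bark", "start_s": 7.0, "end_s": 7.5},
]
print(match_clip(refs, hyps))  # (2, 1, 1)
```

The late bark overlaps its reference with IoU of about 0.27, so it counts once as a false positive and its reference counts once as a false negative.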

Segment-F1

The clip is sliced into 1-second segments. For each (label, segment) cell we mark presence (1) or absence (0) in both reference and hypothesis grids. F1 is computed on the flattened cells. Empty-on-empty clips (no reference, no hypothesis) score 1.0 — the model correctly produced nothing.
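A minimal per-clip sketch of the grid comparison follows. It assumes a cell is marked present whenever any part of an event falls inside the segment, which the text above does not spell out, and it scores a single clip only; how the bundled scorer pools cells across clips is not shown here. The names `presence_cells` and `segment_f1` are hypothetical.

```python
import math


def presence_cells(events, clip_s=10.0, segment_s=1.0):
    """Set of (label, segment_index) cells touched by at least one event."""
    n_segments = math.ceil(clip_s / segment_s)
    cells = set()
    for ev in events:
        first = int(ev["start_s"] // segment_s)
        # An event ending exactly on a boundary does not spill into the next cell.
        last = math.ceil(ev["end_s"] / segment_s) - 1
        for k in range(max(first, 0), min(last, n_segments - 1) + 1):
            cells.add((ev["label"], k))
    return cells


def segment_f1(refs, hyps, clip_s=10.0, segment_s=1.0):
    """F1 over presence cells for one clip; empty-on-empty scores 1.0."""
    ref_cells = presence_cells(refs, clip_s, segment_s)
    hyp_cells = presence_cells(hyps, clip_s, segment_s)
    if not ref_cells and not hyp_cells:
        return 1.0
    tp = len(ref_cells & hyp_cells)
    fp = len(hyp_cells - ref_cells)
    fn = len(ref_cells - hyp_cells)
    return 2 * tp / (2 * tp + fp + fn)


refs = [
    {"label": "siren", "start_s": 1.0, "end_s": 4.0},
    {"label": "dog_bark", "start_s": 5.5, "end_s": 6.2},
    {"label": "dog_bark", "start_s": 7.0, "end_s": 7.5},
]
# Same events, but the siren boundaries drift by a few hundred milliseconds.
hyps = [
    {"label": "siren", "start_s": 1.3, "end_s": 4.4},
    {"label": "dog_bark", "start_s": 5.5, "end_s": 6.2},
    {"label": "dog_bark", "start_s": 7.0, "end_s": 7.5},
]
print(round(segment_f1(refs, hyps), 3))  # 0.923
```

Cells absent from both grids are true negatives and do not enter F1, so only the sets of present cells need to be compared. In the example the drifting siren adds a single extra cell (segment 4), which is why the score stays high.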

Stimuli

| Clip | Events placed |
| --- | --- |
| urban-001 | siren 1.0–4.0s · dog_bark 5.5–6.2s · dog_bark 7.0–7.5s |
| urban-002 | engine 0.0–6.0s · glass_breaking 3.5–3.9s · alarm 7.5–9.5s |
| urban-003 | three short alarm bursts at 0.5, 3.0, 5.5s |
| urban-004 | siren 2.0–8.0s · dog_bark 4.0–4.7s |
| urban-005 | background-only (model must produce nothing) |

Each labeled signal is a deterministic procedural rendering (frequency-swept tone for sirens, two-pulse envelope for barks, gated 1.2 kHz tone for alarms, low rumble for engines, exponentially-decaying noise burst for glass). The audio is not realistic — it is separable on purpose, so non-trivial adapters can pick up real signal without needing the suite to bundle a real SED dataset.
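To give a feel for what "deterministic procedural rendering" means, here is a rough sketch of one such signal, a gated 1.2 kHz tone. Only the 1.2 kHz frequency comes from the description above; the function name `render_gated_tone`, the 16 kHz sample rate, the 4 Hz gate rate, and the amplitude are all assumptions made for illustration, and the suite's own renderer may differ.

```python
import math


def render_gated_tone(duration_s, sample_rate=16000, freq_hz=1200.0,
                      gate_hz=4, amplitude=0.5):
    """Render a sine tone that is switched on and off by a square gate.

    The output depends only on the arguments (no randomness), so repeated
    calls yield identical samples.
    """
    # Number of samples in each on or off half of the gate cycle.
    half_period = sample_rate // (2 * gate_hz)
    n_samples = int(round(duration_s * sample_rate))
    samples = []
    for n in range(n_samples):
        gate_open = (n // half_period) % 2 == 0
        value = amplitude * math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
        samples.append(value if gate_open else 0.0)
    return samples


burst = render_gated_tone(1.0)
print(len(burst))  # 16000
```

A signal like this concentrates its energy in a narrow band with a regular on/off pattern, which is the kind of separability the paragraph above refers to.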

Adapter contract

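from typing import Protocol

import numpy as np

class SEDAdapter(Protocol):
    name: str  # adapter name

    def detect(self, audio: np.ndarray, sample_rate: int) -> list[dict]:
        """Return the detected events for one clip as a list of dicts."""
        ...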

Each returned event must include "label", "start_s", "end_s". "confidence" is accepted but ignored by the bundled scorer.
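When developing an adapter, it can help to check the shape of its output before handing it to the scorer. The helper below is a hypothetical sketch, not part of audiobench: `validate_events` and its messages are invented for illustration, and the check that end_s does not precede start_s is an assumption rather than a documented requirement.

```python
REQUIRED_KEYS = ("label", "start_s", "end_s")


def validate_events(events):
    """Return a list of problems; an empty list means the output looks well-formed."""
    problems = []
    for i, ev in enumerate(events):
        missing = [key for key in REQUIRED_KEYS if key not in ev]
        if missing:
            problems.append(f"event {i}: missing {', '.join(missing)}")
            continue
        # Assumption: a well-formed event never ends before it starts.
        if ev["end_s"] < ev["start_s"]:
            problems.append(f"event {i}: end_s precedes start_s")
    return problems


good = [{"label": "siren", "start_s": 1.0, "end_s": 4.0, "confidence": 0.9}]
bad = [{"label": "alarm", "start_s": 2.0}]
print(validate_events(good))  # []
print(validate_events(bad))   # ['event 0: missing end_s']
```

Extra keys such as "confidence" pass through untouched, in line with the contract above.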

Bundled adapters:

  • oracle-sed: perfect detector that reads the ground truth (sanity check that the wiring works end-to-end; should hit event-F1 = 1.0).
  • oracle-sed-jittered: oracle with ±0.4 s alternating boundary jitter (canonical regression demo for the IoU threshold).
  • null-sed: returns nothing (worst case for recall, perfect on urban-005).

Real adapters ignore the oracle hint and rely on the audio. Register your own under audiobench.sed_models or in models/sed_registry.py.
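To see why a ±0.4 s jitter makes a useful regression demo for the IoU threshold, consider the sketch below. It assumes one possible reading of "alternating": each event is shifted as a whole by +0.4 s or −0.4 s, with the sign flipping from one event to the next. The bundled adapter may apply its jitter differently, and the helper names here are hypothetical.

```python
def iou(a, b):
    """Temporal IoU of two (start_s, end_s) intervals."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    if inter <= 0:
        return 0.0
    return inter / (max(a[1], b[1]) - min(a[0], b[0]))


def jitter(events, shift_s=0.4):
    """Shift whole events by +shift_s / -shift_s, alternating per event (assumed)."""
    shifted = []
    for i, (label, start, end) in enumerate(events):
        delta = shift_s if i % 2 == 0 else -shift_s
        shifted.append((label, start + delta, end + delta))
    return shifted


# urban-001 ground truth as (label, start_s, end_s).
truth = [("siren", 1.0, 4.0), ("dog_bark", 5.5, 6.2), ("dog_bark", 7.0, 7.5)]
survives = []
for (label, s, e), (_, js, je) in zip(truth, jitter(truth)):
    score = iou((s, e), (js, je))
    survives.append(score >= 0.5)
    print(f"{label}: IoU {score:.2f}")
print(survives)  # [True, False, False]
```

Under this reading, an event of duration D shifted by 0.4 s has IoU (D − 0.4)/(D + 0.4), which reaches 0.5 only when D ≥ 1.2 s. Long events such as the siren still match, while the short barks fall below the threshold even though every label is correct.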

Headline and gate keys

{
  "event_f1_iou50": 1.0,
  "event_precision_iou50": 1.0,
  "event_recall_iou50": 1.0,
  "segment_f1_1s": 1.0,
  "clip_count": 5,
  "iou_threshold": 0.5,
  "segment_s": 1.0
}

Gate file keys (gate.yaml → sed_urban:):

  • min_event_f1 — floor on event_f1_iou50.
  • min_segment_f1 — floor on segment_f1_1s.
  • min_event_recall — floor on event_recall_iou50 (useful when you care more about not missing events than about clean precision).

CLI shortcuts: --min-event-f1, --min-segment-f1.
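Conceptually, each gate key is a floor on one headline metric. The sketch below illustrates that relationship only: `check_gate` and its failure messages are hypothetical and not the audiobench implementation, and the `gate` argument stands for the mapping found under sed_urban: in gate.yaml.

```python
# Gate key -> headline metric it constrains, as listed above.
GATE_TO_METRIC = {
    "min_event_f1": "event_f1_iou50",
    "min_segment_f1": "segment_f1_1s",
    "min_event_recall": "event_recall_iou50",
}


def check_gate(results, gate):
    """Return one message per metric that falls below its floor."""
    failures = []
    for gate_key, floor in gate.items():
        metric = GATE_TO_METRIC[gate_key]
        value = results[metric]
        if value < floor:
            failures.append(f"{metric}={value:.3f} is below {gate_key}={floor:.3f}")
    return failures


results = {
    "event_f1_iou50": 0.55,
    "event_recall_iou50": 0.5,
    "segment_f1_1s": 0.8,
}
gate = {"min_event_f1": 0.6, "min_segment_f1": 0.7}
print(check_gate(results, gate))
# ['event_f1_iou50=0.550 is below min_event_f1=0.600']
```

A metric with no gate key is simply not checked, so a gate that names only min_event_recall constrains recall and nothing else.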

Useful flags

audiobench run ab/sed-urban --model oracle-sed-jittered
audiobench compare results/sed-oracle.json results/sed-mymodel.json
audiobench gate results/sed-mymodel.json --min-event-f1 0.6 --min-segment-f1 0.7

Scope and caveats

The bundled stimuli are procedural — they validate the scoring pipeline and the adapter contract. They are intentionally not a stand-in for real SED benchmarks (DESED, AudioSet-strong). The architecture supports plugging in additional fixtures via a future pack mechanism, matching how ab/sound-id handles bring-your-own datasets.