ab/diarization-cw¶
Speaker diarization on procedurally generated multi-speaker conversations. Scored with NIST-style Diarization Error Rate (DER) and a Hungarian alignment of hypothesis speakers to references.
What it measures¶
Each clip is a synthetic conversation: 1–3 "speakers", each rendered as a
distinct formant-like timbre, alternating in a fixed turn schedule. The
adapter under test returns a list of turns — {speaker_id, start_s, end_s} —
with anonymous speaker labels (spk-1, model-A, anything). The suite
aligns those anonymous labels to the reference speakers, then reports the
canonical DER decomposition.
Diarization Error Rate¶
DER is computed at 50 ms frame granularity:
$$ \text{DER} = \frac{T_\text{miss} + T_\text{FA} + T_\text{confusion}}{T_\text{ref speech}} $$
For each eligible frame (frames outside the collar around reference boundaries):
Here |R| and |H| are the numbers of reference and hypothesis speakers active in the frame.

- Miss: frames where the reference has more active speakers than the
  hypothesis does. `miss = max(0, |R| - |H|)` per frame.
- False alarm: frames where the hypothesis has more active speakers than
  the reference. `fa = max(0, |H| - |R|)` per frame.
- Confusion: frames where the right number of speakers is detected
  but with the wrong identities. `confusion = min(|R|, |H|) - aligned_hits`.

`aligned_hits` is the number of reference speakers whose Hungarian-assigned
hypothesis speaker is also active in this frame.
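The per-frame rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in temporal_metrics.diarization_error_rate: the names `frame_errors` and `der_from_frames` are hypothetical, the input is assumed to contain only eligible (outside-collar) frames, and the denominator is assumed to count reference speaker-frames (the sum of |R| over frames).

```python
def frame_errors(ref_active, hyp_active, mapping):
    """Return (miss, fa, confusion) for one frame.

    ref_active, hyp_active: sets of speaker labels active in the frame.
    mapping: dict from reference label to its assigned hypothesis label.
    """
    n_ref, n_hyp = len(ref_active), len(hyp_active)
    # Reference speakers whose assigned hypothesis speaker is also active.
    aligned_hits = sum(1 for r in ref_active if mapping.get(r) in hyp_active)
    miss = max(0, n_ref - n_hyp)
    fa = max(0, n_hyp - n_ref)
    confusion = min(n_ref, n_hyp) - aligned_hits
    return miss, fa, confusion


def der_from_frames(frames, mapping):
    """Accumulate the three components over (ref_active, hyp_active) pairs."""
    miss = fa = confusion = ref_speech = 0
    for ref_active, hyp_active in frames:
        m, f, c = frame_errors(ref_active, hyp_active, mapping)
        miss, fa, confusion = miss + m, fa + f, confusion + c
        ref_speech += len(ref_active)  # assumption: speaker-frames, not frames
    if ref_speech == 0:
        # Assumption: no reference speech scores 0.0 instead of raising.
        return {"der": 0.0, "miss_rate": 0.0,
                "false_alarm_rate": 0.0, "confusion_rate": 0.0}
    return {
        "der": (miss + fa + confusion) / ref_speech,
        "miss_rate": miss / ref_speech,
        "false_alarm_rate": fa / ref_speech,
        "confusion_rate": confusion / ref_speech,
    }


# Toy input: two correct frames, one confusion, one miss, one false alarm.
frames = [
    ({"A"}, {"x"}),
    ({"A"}, {"x"}),
    ({"B"}, {"x"}),   # "x" is assigned to A, so this frame is confusion
    ({"B"}, set()),   # miss
    (set(), {"x"}),   # false alarm
]
scores = der_from_frames(frames, {"A": "x"})
```

With four reference speaker-frames and one error of each kind, the toy input yields a DER of 0.75, split evenly across the three components.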
Collar¶
A 0.25 s symmetric window is excluded around every reference turn boundary before scoring. This is the standard NIST convention and is what lets the metric tolerate the boundary jitter that all real diarizers exhibit.
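One way to build the eligibility mask implied by the collar is sketched below. It assumes the window extends 0.25 s on each side of a boundary and that a frame is tested by its centre; the text above does not spell out either detail, and `eligible_frames` is a hypothetical helper name.

```python
def eligible_frames(ref_turns, duration_s, frame_s=0.05, collar_s=0.25):
    """Return one boolean per frame: True if the frame is scored."""
    n_frames = round(duration_s / frame_s)
    # Every reference turn contributes two boundaries: its start and its end.
    boundaries = [t for turn in ref_turns
                  for t in (turn["start_s"], turn["end_s"])]
    mask = []
    for i in range(n_frames):
        centre = (i + 0.5) * frame_s
        # Assumption: a frame is dropped when its centre lies within
        # collar_s of any reference boundary.
        mask.append(all(abs(centre - b) > collar_s for b in boundaries))
    return mask


# cw-004 from the stimuli table: one speaker, 0.5 s to 5.5 s in a 6.0 s clip.
mask = eligible_frames(
    [{"speaker_id": "A", "start_s": 0.5, "end_s": 5.5}], duration_s=6.0
)
```

Under these assumptions each boundary removes ten 50 ms frames, so 100 of the 120 frames in this clip remain eligible.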
Hungarian alignment¶
Hypothesis speakers are mapped to reference speakers by minimum-cost
assignment, where the cost between (r, h) is the negative number of
frames where both are co-active. The implementation falls back to a greedy
nearest-cost mapping when the cost matrix is larger than 6×6; for the bundled
fixtures (≤ 3 speakers per side) the exact Hungarian algorithm is used.
Metric code: temporal_metrics.diarization_error_rate.
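For intuition, the assignment step can be reproduced without any dependencies. The sketch below is not the Hungarian algorithm and not the suite's implementation: it swaps in a brute-force search over permutations, which finds the same minimum-cost assignment and is practical only for the small speaker counts in the bundled fixtures. `align_speakers` is a hypothetical name.

```python
from itertools import permutations


def align_speakers(frames):
    """Map reference labels to hypothesis labels by minimum total cost.

    frames: list of (ref_active, hyp_active) pairs of label sets.
    """
    refs = sorted({r for ref, _ in frames for r in ref})
    hyps = sorted({h for _, hyp in frames for h in hyp})
    # cost(r, h) = negative number of frames where both are co-active.
    cost = {
        (r, h): -sum(1 for ref, hyp in frames if r in ref and h in hyp)
        for r in refs for h in hyps
    }
    # Enumerate one-to-one pairings; the smaller side is fully assigned.
    if len(refs) <= len(hyps):
        candidates = (list(zip(refs, p)) for p in permutations(hyps, len(refs)))
    else:
        candidates = (list(zip(p, hyps)) for p in permutations(refs, len(hyps)))
    best = min(candidates, key=lambda pairs: sum(cost[pair] for pair in pairs))
    return dict(best)


# Anonymous hypothesis labels that happen to be "swapped" relative to A/B.
frames = (
    [({"A"}, {"spk-2"})] * 3
    + [({"B"}, {"spk-1"})] * 2
    + [({"B"}, {"spk-2"})]
)
mapping = align_speakers(frames)
```

Here A shares three frames with spk-2 and B shares two with spk-1, so the search pairs A with spk-2 and B with spk-1 despite the single frame where B coincides with spk-2.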
Stimuli¶
| Clip | Duration | Speakers | Turn structure (start–end, s) |
|---|---|---|---|
| `cw-001` | 10.0 s | 2 | A 0.5–3.0 · B 3.2–5.5 · A 6.0–8.0 · B 8.2–9.5 |
| `cw-002` | 12.0 s | 3 | A 0.0–2.5 · B 2.8–5.0 · C 5.5–8.0 · A 8.5–11.5 |
| `cw-003` | 8.0 s | 2 | A 0.5–4.5 · B 4.6–7.5 |
| `cw-004` | 6.0 s | 1 | A 0.5–5.5 (single-speaker baseline) |
| `cw-005` | 10.0 s | 2 | A 0.5–4.0 · B 3.5–7.0 (overlap) · A 7.2–9.5 |
Each speaker timbre is a deterministic two-formant excitation plus a tiny
amount of noise. Like ab/sed-urban, the audio is not meant to be realistic
speech — it's a scoring-pipeline harness for diarization adapters.
Adapter contract¶
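from typing import Protocol
import numpy as np

class DiarizationAdapter(Protocol):
    name: str
    # Returns turns as {speaker_id, start_s, end_s} dicts, in any order.
    def diarize(self, audio: np.ndarray, sample_rate: int) -> list[dict]:
        ...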
Speaker ids are anonymous; the suite aligns them. Returned turns can be in any order; the scorer quantizes them onto the 50 ms grid.
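The quantization step might look roughly like the following. This is a sketch under assumptions: turn boundaries are rounded to the nearest frame edge, turns are clipped to the clip duration, and `quantize_turns` is a hypothetical name; the suite's scorer may round differently.

```python
def quantize_turns(turns, duration_s, frame_s=0.05):
    """Convert turns into one set of active speaker labels per frame."""
    n_frames = round(duration_s / frame_s)
    frames = [set() for _ in range(n_frames)]
    for turn in turns:  # input order does not matter
        # Assumption: boundaries snap to the nearest frame edge.
        first = max(0, round(turn["start_s"] / frame_s))
        last = min(n_frames, round(turn["end_s"] / frame_s))
        for i in range(first, last):
            frames[i].add(turn["speaker_id"])
    return frames


# Turns deliberately given out of order, with arbitrary anonymous labels.
hyp_frames = quantize_turns(
    [
        {"speaker_id": "spk-2", "start_s": 3.2, "end_s": 5.5},
        {"speaker_id": "spk-1", "start_s": 0.5, "end_s": 3.0},
    ],
    duration_s=10.0,
)
```

Representing each frame as a set keeps overlapping turns intact: two speakers active in the same frame simply yield a two-element set.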
Bundled adapters:
- `oracle-diarization`: perfect (reads the ground truth via `set_oracle_hint`). Sanity check; DER = 0.
- `merged-diarization`: collapses every speaker into one. The classic confusion regression: miss = 0, FA = 0, confusion >> 0.
- `single-speaker`: emits one big turn covering the whole clip. The classic false-alarm + confusion regression: silences between turns become FA, and frames where a reference speaker other than the aligned one is talking become confusion.
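To show how little is needed to satisfy the adapter contract, here is a hypothetical baseline in the spirit of `single-speaker`. It is not the bundled implementation; the class name, the `name` value, and the `spk-1` label are invented for illustration, and `audio` is assumed to be a mono, one-dimensional sequence of samples.

```python
class SingleSpeakerSketch:
    """Hypothetical adapter: one turn spanning the whole clip."""

    name = "single-speaker-sketch"  # illustrative, not a registered model

    def diarize(self, audio, sample_rate):
        # Assumption: audio is 1-D, so len(audio) is the sample count.
        duration_s = len(audio) / sample_rate
        return [{"speaker_id": "spk-1", "start_s": 0.0, "end_s": duration_s}]


# One second of silence at 16 kHz; a plain list stands in for np.ndarray.
turns = SingleSpeakerSketch().diarize([0.0] * 16000, sample_rate=16000)
```

Because speaker labels are anonymous, any string would do in place of `spk-1`.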
Real adapters ignore the oracle hint. Register your own under
audiobench.diarization_models or in
models/diarization_registry.py.
Headline and gate keys¶
{
"der": 0.0,
"miss_rate": 0.0,
"false_alarm_rate": 0.0,
"confusion_rate": 0.0,
"mean_speaker_count_error": 0.0,
"clip_count": 5,
"frame_s": 0.05,
"collar_s": 0.25
}
- `der`: total DER, weighted by per-clip speech frames so longer clips carry more weight.
- The three component rates always sum to `der`.
- `mean_speaker_count_error = mean(|n_ref_speakers - n_hyp_speakers|)` across clips. A separate signal from DER: diarizers can hit reasonable DER while consistently mis-counting speakers.
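Weighting by per-clip speech frames is equivalent to pooling the error counts across clips before dividing. The sketch below illustrates that aggregation; the per-clip field names (`miss`, `fa`, `confusion`, `ref_speech`, `n_ref`, `n_hyp`) and the function name `aggregate` are assumptions made for the example, not the suite's internal schema.

```python
def aggregate(clips):
    """Pool per-clip counts into the headline keys."""
    ref_speech = sum(c["ref_speech"] for c in clips)
    miss = sum(c["miss"] for c in clips)
    fa = sum(c["fa"] for c in clips)
    confusion = sum(c["confusion"] for c in clips)
    # Speaker-count error is a plain (unweighted) mean over clips.
    count_error = sum(abs(c["n_ref"] - c["n_hyp"]) for c in clips) / len(clips)
    return {
        "der": (miss + fa + confusion) / ref_speech,
        "miss_rate": miss / ref_speech,
        "false_alarm_rate": fa / ref_speech,
        "confusion_rate": confusion / ref_speech,
        "mean_speaker_count_error": count_error,
        "clip_count": len(clips),
    }


clips = [
    {"miss": 0, "fa": 0, "confusion": 10, "ref_speech": 100, "n_ref": 2, "n_hyp": 1},
    {"miss": 5, "fa": 5, "confusion": 0, "ref_speech": 300, "n_ref": 3, "n_hyp": 3},
]
summary = aggregate(clips)
```

The pooled DER here is 20 / 400 = 0.05, whereas an unweighted mean of the two per-clip DERs (0.10 and about 0.033) would be about 0.067; the clip with more reference speech dominates.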
Gate file keys (gate.yaml → diarization_cw:):
- `max_der`: ceiling on DER (0–1).
- `max_speaker_count_error`: ceiling on `mean_speaker_count_error`.
- `max_miss_rate`, `max_false_alarm_rate`: finer-grained component ceilings for when you want to forbid one failure mode specifically.
CLI shortcuts: --max-der, --max-speaker-count-error.
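Conceptually, gating compares each ceiling with the matching headline value. The following sketch shows that logic only; it is not the `audiobench gate` implementation. `check_gate` is a hypothetical name, the pairing of `max_miss_rate` and `max_false_alarm_rate` with result keys is inferred from their names, and a ceiling is assumed to be violated only when the value strictly exceeds it.

```python
# Gate key -> headline result key (partly inferred from the key names).
RESULT_KEY = {
    "max_der": "der",
    "max_speaker_count_error": "mean_speaker_count_error",
    "max_miss_rate": "miss_rate",
    "max_false_alarm_rate": "false_alarm_rate",
}


def check_gate(results, gate):
    """Return the gate keys whose ceilings are exceeded."""
    violations = []
    for gate_key, ceiling in gate.items():
        value = results[RESULT_KEY[gate_key]]
        if value > ceiling:  # assumption: equality still passes
            violations.append(gate_key)
    return violations


results = {
    "der": 0.25,
    "miss_rate": 0.05,
    "false_alarm_rate": 0.1,
    "confusion_rate": 0.1,
    "mean_speaker_count_error": 0.4,
}
gate = {"max_der": 0.2, "max_speaker_count_error": 1, "max_false_alarm_rate": 0.1}
failed = check_gate(results, gate)
```

In this example only `max_der` fails: the false-alarm rate sits exactly on its ceiling and the speaker-count error is well under its limit.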
Useful flags¶
audiobench run ab/diarization-cw --model merged-diarization # confusion regression
audiobench run ab/diarization-cw --model single-speaker # FA regression
audiobench compare results/diar-oracle.json results/diar-mymodel.json
audiobench gate results/diar-mymodel.json --max-der 0.2 --max-speaker-count-error 1
Scope and caveats¶
- Audio is procedural, not real speech. Use a public benchmark (AMI, VoxConverse, DIHARD) for production claims; this suite validates that your adapter wiring is correct and that your DER scorer behaves as expected.
- Overlap is supported (`cw-005`) but is rare in the bundled set. Stress testing on heavy overlap requires more fixtures (planned).
- The Hungarian implementation is exact for ≤ 6×6 matrices and greedy above that. Adapters that emit hundreds of micro-speakers will fall onto the greedy path, so keep speaker counts realistic.