ab/fidelity-roundtrip

Signal-level fidelity benchmark for any process that turns audio into audio: codecs, DSP chains, neural enhancement, format converters, plug-ins.

audiobench run ab/fidelity-roundtrip --model passthrough

What it measures

The suite tests one question: does the audio that comes out match the audio that went in? It is deliberately reference-aware. Each stimulus is rendered twice — once as the reference and once through the processor under test — and the two are compared in four complementary ways:

| Metric | What it captures | Direction |
| --- | --- | --- |
| si_sdr_db | Scale-invariant signal-to-distortion ratio. Robust against benign gain changes; punishes added noise, ringing, and dropouts. | higher is better |
| mr_stft_log_l1 | Multi-resolution STFT log-magnitude L1 distance across 256 / 1024 / 4096-point FFTs. Catches spectral coloration that SI-SDR misses (e.g. low-pass, EQ tilt). | lower is better |
| true_peak_dbtp | 4× oversampled inter-sample peak in dBTP. Anything > 0 dBTP will clip in a downstream DAC. | lower is better |
| loudness_delta_lu | ITU-R BS.1770-style K-weighted LUFS delta between reference and output. Detects mastering-level drift. | absolute value lower is better |

Definitions are in src/audiobench/signal_metrics.py.
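As a rough illustration of why si_sdr_db ignores gain changes but punishes added noise, here is a minimal sketch of the textbook SI-SDR formula in NumPy. It is not copied from src/audiobench/signal_metrics.py; the function name, the eps guard, and the mean removal are assumptions, and the suite's own definition may differ (for example in how it caps very high scores).

```python
import numpy as np

def si_sdr_db(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-12) -> float:
    """Textbook scale-invariant SDR in dB (assumed form, not the suite's code)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to find the best-fitting gain.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return float(10.0 * np.log10(ratio))

# One second of a 440 Hz tone at 16 kHz as the reference.
t = np.arange(16_000) / 16_000.0
ref = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# A pure gain change leaves almost no residual after projection.
gain_only = si_sdr_db(ref, 0.25 * ref)

# Added noise stays in the residual and lowers the score.
rng = np.random.default_rng(0)
with_noise = si_sdr_db(ref, ref + 0.005 * rng.standard_normal(ref.size))
```

In this sketch the gain-only case scores far higher than the noisy case, which is the behavior the table describes.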

Stimuli (procedural, no downloads)

| Stimulus | Description | What it stresses |
| --- | --- | --- |
| sine-440 | Pure 440 Hz tone at -6 dBFS | Pitch / frequency-domain accuracy |
| sine-sweep | Logarithmic 50 Hz → 7.5 kHz sweep | Wideband response, group delay |
| white-noise | Seeded Gaussian noise | Broadband spectral fidelity |
| impulse-train | Sparse impulses every 200 ms | Transient response, pre/post-ringing |
| low-level-sine | 1 kHz at -40 dBFS | Quantization-noise floor |
| high-headroom-sine | 880 Hz at -0.26 dBFS | Inter-sample peak / clipping behavior |

Stimuli are deterministic and embedded in the wheel. The manifest hash that feeds run_hash includes the stimulus list, so once any stimulus changes, comparisons against runs made on the old set are refused.
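To make "procedural and deterministic" concrete, the following sketch generates a few of the listed stimuli with NumPy. The frequencies and dBFS levels come from the table; the one-second duration, the sweep and noise levels, the seed, and all function names are assumptions made for illustration, not the suite's actual generator.

```python
import numpy as np

SAMPLE_RATE = 16_000  # the suite's stated scope is mono, 16 kHz

def dbfs_to_amplitude(level_dbfs: float) -> float:
    """Convert a peak level in dBFS to linear amplitude (0 dBFS == 1.0)."""
    return 10.0 ** (level_dbfs / 20.0)

def sine(freq_hz: float, level_dbfs: float, duration_s: float = 1.0) -> np.ndarray:
    t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    return dbfs_to_amplitude(level_dbfs) * np.sin(2 * np.pi * freq_hz * t)

def log_sweep(f_start_hz: float = 50.0, f_end_hz: float = 7500.0,
              duration_s: float = 1.0, level_dbfs: float = -6.0) -> np.ndarray:
    """Exponential (logarithmic) sine sweep; the level is an assumed value."""
    t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    k = np.log(f_end_hz / f_start_hz)
    phase = 2 * np.pi * f_start_hz * duration_s / k * (np.exp(t * k / duration_s) - 1.0)
    return dbfs_to_amplitude(level_dbfs) * np.sin(phase)

def seeded_noise(seed: int = 0, duration_s: float = 1.0,
                 level_dbfs: float = -20.0) -> np.ndarray:
    """Gaussian noise from a fixed seed, so repeated calls give identical samples."""
    rng = np.random.default_rng(seed)
    return dbfs_to_amplitude(level_dbfs) * rng.standard_normal(int(duration_s * SAMPLE_RATE))

sine_440 = sine(440.0, -6.0)
low_level_sine = sine(1000.0, -40.0)
high_headroom_sine = sine(880.0, -0.26)
sweep = log_sweep()
noise_a, noise_b = seeded_noise(0), seeded_noise(0)
```

Because every generator is a pure function of its arguments and seed, two runs produce bit-identical arrays, which is the property a stimulus-list hash relies on.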

Conditions

Each stimulus is run under every condition below. A condition pre-perturbs the stimulus before the processor sees it, and the same reference is used for scoring, so a processor must both survive the perturbation and not add its own distortion.

| Condition | What it does |
| --- | --- |
| identity | Untouched reference; the baseline. |
| bandlimit-8k | Butterworth low-pass to 4 kHz (the Nyquist bandwidth of a narrowband input sampled at 8 kHz). A flat passthrough scores well; a "smart" model that hallucinates HF content scores poorly. |
| gain-+3db | Pre-applies +3 dB and clips. A clean processor preserves the clip; a hot processor stacks more clipping on top. |
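The gain-+3db condition can be pictured with a few lines of NumPy. This is a sketch under the assumption that "clips" means a hard clip at digital full scale (±1.0); the function name and the one-second test tones are hypothetical.

```python
import numpy as np

def gain_plus_3db_clipped(audio: np.ndarray, ceiling: float = 1.0) -> np.ndarray:
    """Apply +3 dB of gain, then hard-clip at +/- ceiling (assumed full scale)."""
    gain = 10.0 ** (3.0 / 20.0)  # +3 dB is roughly a 1.41x amplitude gain
    return np.clip(audio * gain, -ceiling, ceiling)

t = np.arange(16_000) / 16_000.0
quiet = 10.0 ** (-6.0 / 20.0) * np.sin(2 * np.pi * 440.0 * t)   # sine-440 level
hot = 10.0 ** (-0.26 / 20.0) * np.sin(2 * np.pi * 880.0 * t)    # high-headroom-sine level

quiet_out = gain_plus_3db_clipped(quiet)  # peaks near -3 dBFS, no clipping
hot_out = gain_plus_3db_clipped(hot)      # would peak above full scale, so it clips

# Fraction of samples sitting on the clip ceiling in the hot case.
clipped_fraction = float(np.mean(np.abs(hot_out) >= 1.0))
```

The -6 dBFS tone survives the boost untouched, while the -0.26 dBFS tone is flattened at the ceiling, so only the latter cell exercises how a processor handles already-clipped input.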

Adapter contract

The model is an AudioProcessor with one method:

def process(audio: np.ndarray, sample_rate: int) -> tuple[np.ndarray, int]:
    ...

Bundled adapters:

  • passthrough — identity. The fidelity upper bound; should pass every gate.
  • passthrough-quantize8 — 8-bit quantizer. Demonstrates an SI-SDR / MR-STFT regression while keeping loudness and true peak mostly intact.
  • polarity-flip-right — flips the right channel's polarity. On its own this barely moves the fidelity headline, because this suite scores mono signals; ab/phase-coherence is the suite that catches it.

Register your own under audiobench.signal_models or in models/signal_registry.py.
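A custom adapter only needs to satisfy the process signature shown above. The class below is a hypothetical example (the name HalfGainProcessor is invented here) that attenuates by about 6 dB and returns the sample rate unchanged. Writing process as an instance method with self is an assumption about how AudioProcessor is structured, and the registration step is deliberately left out because its exact API is not described in this page.

```python
from __future__ import annotations

import numpy as np

class HalfGainProcessor:
    """Hypothetical adapter: halves the amplitude, keeps the sample rate."""

    def process(self, audio: np.ndarray, sample_rate: int) -> tuple[np.ndarray, int]:
        # Return a new array plus the (unchanged) output sample rate.
        return 0.5 * np.asarray(audio, dtype=np.float64), sample_rate

# Quick check on a short constant signal.
processed, out_rate = HalfGainProcessor().process(np.ones(4), 16_000)
```

A processor like this should leave si_sdr_db essentially untouched, since that metric is scale-invariant, while shifting loudness_delta_lu by roughly 6 LU.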

Headline and gate keys

The run JSON's headline block exposes:

{
  "weighted_si_sdr_db": 80.31,
  "max_true_peak_dbtp": -0.04,
  "mean_loudness_delta_lu": 0.0,
  "weighted_mr_stft_log_l1": 0.0
}

The weighted SI-SDR is the equal-weighted mean across (stimulus, condition) cells. max_true_peak_dbtp is the worst inter-sample peak across all output clips, so a single clipping cell will surface here.
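The aggregation described above can be sketched in plain Python. The per-cell numbers below are made up purely to show the arithmetic, and the cell dictionary layout and function name are assumptions, not the run JSON's actual schema.

```python
# Made-up per-cell results keyed by (stimulus, condition); illustrative only.
cells = {
    ("sine-440", "identity"): {"si_sdr_db": 60.0, "true_peak_dbtp": -6.0},
    ("sine-440", "gain-+3db"): {"si_sdr_db": 40.0, "true_peak_dbtp": -3.0},
    ("high-headroom-sine", "gain-+3db"): {"si_sdr_db": 20.0, "true_peak_dbtp": 0.5},
}

def aggregate(cell_results: dict) -> dict:
    """Equal-weighted mean of SI-SDR and worst-case (maximum) true peak."""
    si_sdr = [c["si_sdr_db"] for c in cell_results.values()]
    peaks = [c["true_peak_dbtp"] for c in cell_results.values()]
    return {
        "weighted_si_sdr_db": sum(si_sdr) / len(si_sdr),
        "max_true_peak_dbtp": max(peaks),
    }

summary = aggregate(cells)
```

Here the mean hides the weak cell (the average is still 40 dB), but the maximum does not: the single cell at +0.5 dBTP becomes the headline true peak.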

Gate file keys (gate.yaml → fidelity_roundtrip:):

  • min_weighted_si_sdr_db — floor on weighted SI-SDR.
  • max_true_peak_dbtp — ceiling on worst-case inter-sample peak.
  • max_mean_loudness_delta_lu — ceiling on abs(mean_loudness_delta_lu).

CLI shortcuts: --min-si-sdr, --max-true-peak.
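The gate semantics can be expressed as a small check over the headline block. This is a sketch, not audiobench's gate evaluator: the function name is hypothetical, the threshold values are invented, and treating a value exactly on the threshold as passing is an assumption.

```python
def check_gates(headline: dict, gates: dict) -> list:
    """Return the names of gate keys that the headline violates."""
    failures = []
    floor = gates.get("min_weighted_si_sdr_db")
    if floor is not None and headline["weighted_si_sdr_db"] < floor:
        failures.append("min_weighted_si_sdr_db")
    peak_ceiling = gates.get("max_true_peak_dbtp")
    if peak_ceiling is not None and headline["max_true_peak_dbtp"] > peak_ceiling:
        failures.append("max_true_peak_dbtp")
    loudness_ceiling = gates.get("max_mean_loudness_delta_lu")
    # The loudness gate applies to the absolute value, so drift in either direction fails.
    if loudness_ceiling is not None and abs(headline["mean_loudness_delta_lu"]) > loudness_ceiling:
        failures.append("max_mean_loudness_delta_lu")
    return failures

# Headline values from the example block above.
headline_ok = {
    "weighted_si_sdr_db": 80.31,
    "max_true_peak_dbtp": -0.04,
    "mean_loudness_delta_lu": 0.0,
    "weighted_mr_stft_log_l1": 0.0,
}
# Hypothetical thresholds, chosen only for illustration.
gates = {
    "min_weighted_si_sdr_db": 30.0,
    "max_true_peak_dbtp": 0.0,
    "max_mean_loudness_delta_lu": 0.5,
}

passing = check_gates(headline_ok, gates)
drifted = check_gates(dict(headline_ok, mean_loudness_delta_lu=-1.2), gates)
```

With these invented thresholds the example headline passes every gate, while a run that is 1.2 LU quieter fails only the loudness gate.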

Useful flags

audiobench run ab/fidelity-roundtrip --model passthrough-quantize8
audiobench run ab/fidelity-roundtrip --model passthrough --conditions identity,bandlimit-8k
audiobench compare results/fidelity-clean.json results/fidelity-codec.json

Scope

Mono, 16 kHz. The metrics compare each output directly against its reference (apples-to-apples), so this suite supports transparency-style claims, not perceptual-quality claims. Perceptual MOS estimates are out of scope; docs/index.md lists future plans (codec-perceptual is in design).