
Speaker turn segmentation: the pragmatic stack

Diarisation is a hard ML problem. Speaker turn segmentation, the cheap version, is solved enough to use everywhere.

True diarisation (“tag each segment of audio with which speaker is talking”) is still an active research area. Speaker turn segmentation (“find the moments where the speaker changes”) is much easier, and it is enough for most downstream applications.

For an audio archive, “show me the turns” is a more common request than “tell me who’s speaking.” This doc is about how to build the easier version well.

The two problems, separated

Problem	Hardness	Tools
Speaker turn segmentation	Tractable on signal alone	Energy + spectral envelope changes
Speaker identification	Hard, needs embeddings	Models like pyannote, NeMo, Resemblyzer
Speaker diarisation	Hard, combines both problems	End-to-end ML systems

Most people who ask for “diarisation” actually want turn segmentation. Confirm before you reach for the heavy tools.

The signal-only approach

A speaker turn shows up in the signal as a discontinuity. The simplest features that capture it:

Energy envelope discontinuity: speakers have different baseline loudness. A turn often manifests as a step change in short-term RMS.
Spectral envelope change: different speakers have different formant patterns. Even without identifying who, you can detect that a change happened.
Pause structure: turns are usually bracketed by short pauses (50–300 ms). The pause-then-different-spectrum pattern is highly predictive.
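The energy and pause features above can be sketched with NumPy alone. This is a minimal illustration, not SignalLab's implementation: the function names, the 25 ms / 10 ms framing, and the `silence_ratio` threshold (a frame counts as quiet when its RMS falls below 10% of the loudest frame) are all assumptions made for the example.

```python
import numpy as np

def short_term_rms(samples, sr, frame_sec=0.025, hop_sec=0.010):
    """Return one RMS value per frame (hypothetical helper)."""
    frame = int(frame_sec * sr)
    hop = int(hop_sec * sr)
    n_frames = max(0, 1 + (len(samples) - frame) // hop)
    rms = np.empty(n_frames)
    for i in range(n_frames):
        chunk = samples[i * hop : i * hop + frame]
        rms[i] = np.sqrt(np.mean(chunk ** 2))
    return rms

def find_pauses(rms, hop_sec=0.010, silence_ratio=0.1, min_sec=0.05, max_sec=0.3):
    """Return (start_sec, end_sec) for quiet runs lasting min_sec..max_sec."""
    if rms.size == 0:
        return []
    # Assumed rule: "quiet" means below a fixed fraction of the loudest frame.
    quiet = rms < silence_ratio * rms.max()
    pauses, start = [], None
    # Append a non-quiet sentinel so a trailing quiet run is closed too.
    for i, is_quiet in enumerate(np.append(quiet, False)):
        if is_quiet and start is None:
            start = i
        elif not is_quiet and start is not None:
            duration = (i - start) * hop_sec
            if min_sec <= duration <= max_sec:
                pauses.append((start * hop_sec, i * hop_sec))
            start = None
    return pauses

# Synthetic check: a loud "speaker", a 200 ms pause, then a quieter "speaker".
sr = 16000
t = np.arange(sr) / sr
loud = 0.5 * np.sin(2 * np.pi * 220 * t)
soft = 0.2 * np.sin(2 * np.pi * 220 * t)
audio = np.concatenate([loud, np.zeros(int(0.2 * sr)), soft])

rms = short_term_rms(audio, sr)
pauses = find_pauses(rms)
```

On this synthetic input the RMS track shows the step change in loudness across the pause, and `find_pauses` reports a single pause starting near the 1-second mark. Real recordings have a noise floor, so a fixed ratio of the peak is only a starting point.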

A simple pipeline:

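# Assumes librosa for MFCCs and SciPy for peak picking; any equivalent library works.
import numpy as np
import librosa
from scipy.signal import find_peaks

def segment_turns(samples, sr, min_turn_sec=2.0):
    hop = int(0.01 * sr)
    hop_sec = hop / sr  # actual hop duration after rounding to whole samples
    # 1. Compute MFCCs in 25 ms frames, 10 ms hop; transpose to (n_frames, n_mfcc)
    mfccs = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=13,
                                 n_fft=int(0.025 * sr), hop_length=hop).T
    # 2. Compute cosine distance between consecutive MFCC frames
    a, b = mfccs[:-1], mfccs[1:]
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    dists = 1.0 - np.sum(a * b, axis=1) / np.maximum(norms, 1e-10)
    # 3. Smooth (5-frame moving average) and threshold at the 95th percentile
    smoothed = np.convolve(dists, np.ones(5) / 5, mode="same")
    threshold = np.percentile(smoothed, 95)
    # 4. Find peaks above threshold separated by at least min_turn_sec
    min_gap = max(1, int(round(min_turn_sec / hop_sec)))
    peaks, _ = find_peaks(smoothed, height=threshold, distance=min_gap)
    return [float(p * hop_sec) for p in peaks]  # convert frame indices to seconds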

This is not a state-of-the-art diarisation system. It’s a short function that catches roughly 80% of turns on clean two-speaker material.
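`segment_turns` returns change points in seconds, while a timeline view usually wants start/end pairs. A small sketch of that conversion follows; `boundaries_to_segments` and the example timestamps are hypothetical and only illustrate the shape of the data.

```python
def boundaries_to_segments(boundaries, total_sec):
    """Turn change points (seconds) into consecutive (start, end) segments."""
    # Drop points outside the recording, then bracket with its start and end.
    inner = [b for b in sorted(boundaries) if 0.0 < b < total_sec]
    edges = [0.0] + inner + [total_sec]
    return list(zip(edges[:-1], edges[1:]))

# Made-up change points for a 60-second recording.
segments = boundaries_to_segments([12.4, 30.1, 47.8], total_sec=60.0)
for start, end in segments:
    print(f"{start:6.1f}s to {end:6.1f}s")
```

Each segment is an unlabelled turn: the output says where speakers change, not who is speaking.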

Why it works

Speakers differ enough in their spectral envelopes (driven by formant positions, vocal tract length, pitch) that MFCC vectors land in distinct regions of feature space. When the speaker changes mid-recording, the MFCC vector “jumps” across that space, and the cosine distance between consecutive frames spikes.
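The “jump” can be illustrated with synthetic feature vectors. In the sketch below, two made-up 4-dimensional centroids stand in for the MFCC regions of two speakers, with small Gaussian noise added per frame; the numbers are invented for illustration and are not measured from real speech.

```python
import numpy as np

def consecutive_cosine_distance(frames):
    """Cosine distance between each frame and the next; frames is (n_frames, n_dims)."""
    a, b = frames[:-1], frames[1:]
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - np.sum(a * b, axis=1) / np.maximum(norms, 1e-10)

rng = np.random.default_rng(0)
speaker_a = np.array([1.0, 0.2, -0.5, 0.3])    # hypothetical centroid
speaker_b = np.array([-0.4, 0.9, 0.6, -0.2])   # hypothetical centroid

# 50 frames of speaker A followed by 50 frames of speaker B.
frames = np.vstack([
    speaker_a + 0.05 * rng.standard_normal((50, 4)),
    speaker_b + 0.05 * rng.standard_normal((50, 4)),
])

dists = consecutive_cosine_distance(frames)
change_index = int(np.argmax(dists))  # index of the A-to-B transition
```

Within each speaker the distances stay close to zero, and the single large value sits at the boundary between frame 49 and frame 50. Real MFCC frames also move between phonemes of the same speaker, which is why the pipeline smooths and thresholds rather than trusting any single spike.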

The simple smoothing and percentile thresholding handle the bulk of the noise. You will miss turns where speakers have similar voices, where one speaker is much quieter, or where overlap is heavy — but for a useful timeline on most podcast and interview material, this works.
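To see how many turns are being missed on your own material, compare detected change points against a hand-labelled reference. The sketch below is one simple way to do that, assuming a detection counts as a hit when it lies within `tolerance_sec` of a reference turn. The function name, the 0.5 s tolerance, and the greedy one-to-one matching are illustrative choices, not a standard metric definition.

```python
def turn_recall(detected, reference, tolerance_sec=0.5):
    """Fraction of reference turns that have a detected turn within tolerance."""
    if not reference:
        return 1.0
    remaining = sorted(detected)
    hits = 0
    for ref in sorted(reference):
        match = next((d for d in remaining if abs(d - ref) <= tolerance_sec), None)
        if match is not None:
            hits += 1
            remaining.remove(match)  # each detection may match only one reference turn
    return hits / len(reference)

# Made-up timestamps in seconds.
reference = [5.0, 12.0, 20.5, 31.0, 44.0]
detected = [5.2, 11.8, 31.3, 50.0]
recall = turn_recall(detected, reference)  # 3 of the 5 reference turns are matched
```

Recall only measures missed turns. Counting detections left unmatched (here, the one at 50.0 s) gives the false-alarm side of the picture.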

When to escalate to real diarisation

You need the heavier ML stack when:

More than 2–3 speakers, especially with overlap.
Cross-channel mixing: a single mono file with multiple distant mics.
Speaker identification is required (not just “a turn happened” but “speaker A is back”).
Noisy real-world audio: street recordings, meetings with bad acoustics.

For these, pyannote.audio is currently the best open-source diarisation pipeline, and NeMo is a more enterprise-grade alternative. Both are meant to run on a GPU and require ~1-5 GB of model weights, so neither is browser-friendly without significant work.

What SignalLab does

The SignalLab Indexer (browser-side preview) uses the signal-only turn segmentation pattern described above. The full server-side version routes to pyannote when diarisation accuracy matters. The browser-side version is good enough for QA and discovery; the server-side version is good enough for production archives.

The decision tree:

Are you doing fast QA at ingest? Use signal-only segmentation.
Are you producing per-speaker metadata for an archive? Use pyannote.
Are you doing realtime transcription with speaker labels? Use the ASR provider’s built-in diarisation (for example Deepgram or Amazon Transcribe).
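The same decision tree can be written as a small routing function. This is a sketch under assumed names (`Backend`, `choose_backend`, and the two boolean flags are hypothetical) and does not describe how SignalLab actually routes jobs.

```python
from enum import Enum

class Backend(Enum):
    SIGNAL_ONLY = "signal-only segmentation"
    PYANNOTE = "pyannote"
    ASR_BUILT_IN = "ASR provider's built-in diarisation"

def choose_backend(realtime_transcription, needs_speaker_labels):
    """Map the decision tree's questions onto a backend."""
    if realtime_transcription and needs_speaker_labels:
        return Backend.ASR_BUILT_IN
    if needs_speaker_labels:
        # Per-speaker metadata for an archive.
        return Backend.PYANNOTE
    # Fast QA at ingest: only the turn boundaries are needed.
    return Backend.SIGNAL_ONLY

print(choose_backend(realtime_transcription=False, needs_speaker_labels=False).value)
```

The only question that forces a heavier backend is whether speaker labels are needed; everything else can stay on the signal-only path.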