Why filler density matters more than filler count
Counting "ums" is the wrong measurement. Density tells you whether they are a problem.
Every podcast QA tool wants to count your fillers. Most of them stop there. Counting fillers as a raw number is mostly noise: a 5-minute interview and a 90-minute discussion produce very different totals, but the experience of listening to them depends on something else entirely.
That something is density: fillers per minute of actual speech.
The measurement
filler_density = filler_count / speech_seconds * 60
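Note the divisor: speech seconds, not total file duration. Dividing by file duration lets silence dilute the number. A recording with five-second bursts of filler-heavy speech separated by silence is a different listening experience from one where fillers are spread thinly across continuous speech, and only a speech-time divisor tells them apart.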
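A minimal sketch of the formula in Python, to make the effect of the divisor concrete. The function name `filler_density` and the example figures are illustrative assumptions, not part of VoiceLab's API.

```python
def filler_density(filler_count: int, speech_seconds: float) -> float:
    """Fillers per minute of speech. Pass speech time, not file duration."""
    if speech_seconds <= 0:
        raise ValueError("speech_seconds must be positive")
    speech_minutes = speech_seconds / 60
    return filler_count / speech_minutes


# Hypothetical file: 24 fillers, 10 minutes long, of which 6 minutes are speech.
by_speech = filler_density(24, speech_seconds=360)  # divisor used in this article
by_file = filler_density(24, speech_seconds=600)    # what file duration would give
```

With these made-up numbers the speech-time density is 4 per minute, while dividing by file duration would report 2.4 per minute and understate the problem.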
The thresholds we use in VoiceLab:
| Density | Listener experience |
|---|---|
| < 3 /min | Smooth, professional |
| 3–6 /min | Conversational, fine for talk-shows |
| 6–10 /min | Noticeable, may need editing |
| > 10 /min | Distracting — heavy edit pass needed |
These are guidelines, not laws. A high-energy improv podcast at 12 fillers/min can be brilliant. A scripted explainer at 5 fillers/min is broken.
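The table can be read as a simple lookup. The sketch below is one way to encode it; the function name is hypothetical, and assigning the shared boundary values (exactly 3, 6, and 10) to a band is an assumption, since the table leaves 6 in two rows.

```python
def listener_experience(density_per_min: float) -> str:
    """Map a filler density to the guideline bands in the table above."""
    if density_per_min < 3:
        return "Smooth, professional"
    if density_per_min <= 6:   # assumption: exactly 6 stays in this band
        return "Conversational, fine for talk-shows"
    if density_per_min <= 10:  # "> 10" in the table puts exactly 10 here
        return "Noticeable, may need editing"
    return "Distracting, heavy edit pass needed"
```

Because the bands are guidelines, the label is a prompt for a listening check rather than a verdict; it does not know whether the show is improv or a scripted explainer.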
Why count alone misleads
Take two podcasts:
- Podcast A: 60 minutes, 40 fillers. Total: 40 fillers.
- Podcast B: 10 minutes, 30 fillers. Total: 30 fillers.
By count, Podcast A is “worse.” By density (treating each duration as speech time), Podcast A is 0.67/min and Podcast B is 3/min, about 4.5 times higher. Density is the more honest summary of what a listener will feel.
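The same comparison as a short Python sketch, showing that ranking by count and ranking by density pick opposite episodes. The data structure is illustrative, and the durations are assumed to be speech time.

```python
episodes = [
    {"name": "Podcast A", "minutes": 60, "fillers": 40},
    {"name": "Podcast B", "minutes": 10, "fillers": 30},
]


def density(episode: dict) -> float:
    """Fillers per minute, assuming `minutes` is speech time."""
    return episode["fillers"] / episode["minutes"]


# "Worst" episode under each measurement.
worst_by_count = max(episodes, key=lambda ep: ep["fillers"])["name"]
worst_by_density = max(episodes, key=density)["name"]
```

Here `worst_by_count` is Podcast A and `worst_by_density` is Podcast B.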
Pause length matters too
A 200 ms pause usually goes unnoticed. A 600 ms pause reads as dramatic. A 1500 ms pause reads as hesitation, and often stands in for a filler. VoiceLab tracks pause distribution alongside filler density, because replacing a filler with a long hesitation isn’t an improvement: it’s the same problem in a different shape.
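One way to picture a pause distribution is a three-bucket histogram. The sketch below is an assumption about how such tracking could look, not VoiceLab's implementation: the cut points of 400 ms and 1000 ms are placeholders chosen to fall between the example durations above, and the bucket names are hypothetical.

```python
def pause_distribution(pause_ms, short_max=400, long_min=1000):
    """Count pauses per bucket. Cut points are illustrative, not VoiceLab's."""
    counts = {"unnoticed": 0, "dramatic": 0, "hesitation": 0}
    for duration in pause_ms:
        if duration < short_max:
            counts["unnoticed"] += 1
        elif duration < long_min:
            counts["dramatic"] += 1
        else:
            counts["hesitation"] += 1
    return counts


# Hypothetical pause durations in milliseconds, before and after an edit pass.
before = pause_distribution([200, 180, 600, 220])
after = pause_distribution([200, 180, 600, 1500, 1400])
```

If filler density drops after an edit while the `hesitation` bucket grows, as in the `after` example, the fillers were probably traded for long pauses rather than removed.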
How VoiceLab detects fillers without ASR
Without speech-to-text, exact filler counting isn’t possible. Two signal-level proxies are still useful:
- Sub-syllabic energy bursts: short envelope peaks, shorter than a typical word but longer than a click, bracketed by silence. These correlate strongly with “um”, “uh”, “like”, “you know”.
- Pause-cluster density: regions where pauses of 80–400ms occur more than 4 times per 10 seconds. This catches the speech rhythm typical of filler-heavy delivery even when individual fillers aren’t detected.
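The pause-cluster proxy can be sketched as a sliding window over detected pauses. This is an illustration under stated assumptions, not VoiceLab's detector: it assumes pauses have already been extracted as `(start_seconds, duration_ms)` pairs by some earlier silence-detection step, and the function name and window anchoring (one window starting at each qualifying pause) are choices made for the example.

```python
def pause_cluster_regions(pauses, window_s=10.0, min_ms=80, max_ms=400, max_count=4):
    """Return (window_start, count) for 10 s windows with too many short pauses.

    pauses: iterable of (start_seconds, duration_ms) pairs.
    A window is flagged when it holds more than `max_count` pauses whose
    duration lies in [min_ms, max_ms].
    """
    starts = sorted(s for s, d in pauses if min_ms <= d <= max_ms)
    flagged = []
    end = 0  # index of the first qualifying pause outside the current window
    for i, t in enumerate(starts):
        while end < len(starts) and starts[end] < t + window_s:
            end += 1
        count = end - i
        if count > max_count:
            flagged.append((t, count))
    return flagged


# Hypothetical pauses: five short ones in the first 6 s, then sparse speech.
pauses = [(0.5, 120), (2.0, 200), (3.1, 90), (4.8, 350), (6.0, 150),
          (30.0, 200), (31.0, 900)]
regions = pause_cluster_regions(pauses)
```

On this made-up input only the window starting at 0.5 s is flagged, with five qualifying pauses; the 900 ms pause is ignored because it falls outside the 80–400ms range.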
The signal-level layer is good enough for QA. For exact counts and word-level work, you need ASR (Whisper, Deepgram, AWS Transcribe) downstream.
What to do with the number
If your density is above 6/min and you’re editing the show, here are three tactics from working editors:
- Cut, don’t replace. Empty space between sentences sounds better than a filler in 80% of cases.
- Tighten before you cut fillers. Most “filler” perception is actually pace. Tightening overall delivery makes fillers fade into the background.
- Leave intentional ones. A “you know what I mean?” with character beats a sterile script. Don’t edit the personality out.
Related
- A pragmatic loudness target for podcasts — once you’ve cleaned the delivery
- Estimating room echo without RT60 — the other thing that makes voices sound amateur