
Turn-taking: the art of not interrupting

End-of-turn prediction, interruption handling, and the timing tradeoffs behind natural voice AI.


We thought we had solved turn-taking once our AI stopped cutting people off.

We didn’t.

In insurance tests, the system no longer interrupted — but conversations still felt broken.

Users finished speaking. → The AI waited. → The silence lingered just long enough to feel wrong.

People weren’t annoyed.

They were unsure.

What turn-taking really controls

Humans don’t say when they’re done talking.

They signal it. → Pitch shifts. → Micro-pauses. → Breath. → Rhythm.

We read these cues without thinking.

The AI didn’t.

Even with perfect transcription and correct answers, weak turn-taking made the system feel robotic.

Users would:

  • wait in silence
  • start speaking and get talked over
  • repeat themselves to force a response

Nothing was “wrong.”

The timing was.

The two ways turn-taking failed

Every failure fell into one of two buckets:

  1. Interrupting during thinking pauses
  2. Waiting after the turn had clearly ended

Fixing one usually broke the other.

We had to solve both at once.

Problem 1: knowing when a turn ends

Silence isn’t a signal.

Context is.

“I went to the doctor and then…” isn’t finished.
“I went to the doctor.” probably is.

Early systems treated both pauses the same.

That worked until real speech showed up.

People pause to think.

To breathe.

To hold the floor.

Silence alone was useless.
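Here's a toy sketch of the idea: decide from what was said, not just from how long the pause is. This is an illustrative heuristic, not our production model; the trailing-word list and the 300ms figure are stand-ins.

```python
# Toy context-based endpointing. Cue words and thresholds are illustrative.

TRAILING_CONTINUATIONS = {"and", "then", "but", "so", "because", "or"}

def looks_finished(transcript: str, silence_ms: int) -> bool:
    """Decide end-of-turn from what was said, not just how long the pause is."""
    words = transcript.lower().rstrip(".?!,… ").split()
    if not words:
        return False
    # "I went to the doctor and then..." ends on a connective: keep waiting.
    if words[-1] in TRAILING_CONTINUATIONS:
        return False
    # A complete-sounding utterance needs far less silence to confirm.
    return silence_ms >= 300

print(looks_finished("I went to the doctor and then", 800))  # False
print(looks_finished("I went to the doctor.", 350))          # True
```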

Problem 2: handling real interruptions

Interruptions aren’t accidents.

They’re intentional.

Correction.

Urgency.

Clarification.

In healthcare especially, users cut in constantly.

If the system kept talking for even 500ms after being interrupted, trust collapsed.

Backchannels can lag.

Interruptions can’t.

When someone cuts in, the system has to stop — immediately.
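To make that concrete, here's a toy barge-in loop: poll a voice-activity detector on every audio frame and stop playback the instant it fires. StubVAD is a stand-in for a real detector, not an actual API.

```python
import time

class StubVAD:
    """Stand-in for a real voice-activity detector, for illustration only."""
    def __init__(self, speech_starts_at_frame: int):
        self.frames_seen = 0
        self.speech_starts_at_frame = speech_starts_at_frame

    def is_user_speaking(self) -> bool:
        self.frames_seen += 1
        return self.frames_seen >= self.speech_starts_at_frame

def speak_with_barge_in(frames, vad, frame_ms=20):
    """Play TTS frames, but stop the moment the user starts talking."""
    for i, _frame in enumerate(frames):
        if vad.is_user_speaking():
            # Halt output immediately; don't finish the sentence.
            return f"interrupted after {i * frame_ms}ms"
        time.sleep(frame_ms / 1000)  # stand-in for playing one 20ms frame
    return "completed"

print(speak_with_barge_in(range(100), StubVAD(speech_starts_at_frame=10)))
```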

What we tried

Audio-only

Fast.

Responsive.

It tracked:

  • pitch drops
  • volume changes
  • pauses

It worked for ideal speakers.

Then accents, flat intonation, and atypical rhythm broke it.

The system reacted quickly — and cut people off.

Text-only

Accurate.

Context-aware.

It used:

  • syntax
  • discourse markers
  • explicit cues

It understood intent.

But transcription latency mattered.

Even correct responses arrived too late.

The system always felt behind.

Audio + text (what worked)

We stopped choosing between speed and understanding.

Audio handled immediacy.

Text handled intent.

  • When both agreed → respond
  • When they conflicted → wait just long enough

Audio might signal an ending.

Text might signal continuation.

The fused model reads both.

That balance held up in real conversations.
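A stripped-down version of the fusion rule looks like this. The booleans and wait windows are placeholders; the real model scores each channel continuously.

```python
# Toy audio + text fusion rule. Thresholds and wait times are illustrative.

def fuse(audio_says_done: bool, text_says_done: bool) -> tuple[str, int]:
    """Return (action, ms to wait before acting)."""
    if audio_says_done and text_says_done:
        return ("respond", 300)   # both channels agree: go
    if audio_says_done != text_says_done:
        return ("wait", 600)      # conflict: wait just long enough
    return ("listen", 0)          # neither thinks the turn ended

# Falling pitch, but the text ends in "and then": hold off.
print(fuse(audio_says_done=True, text_says_done=False))  # ('wait', 600)
```

The conflict branch is deliberately asymmetric: a wrong "respond" interrupts the user, while a wrong "wait" only costs a few hundred milliseconds.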

The edge cases

Hesitation ≠ turn end

“um”
“uh”
“you know”

These hold the floor.

Early models heard the pause and jumped in.

We had to treat fillers as continuation markers.

Harder than it sounds — fillers vary wildly across accents.
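In toy form, the filler rule is a floor-holding check on the last tokens. The filler lists here are illustrative and, as noted, very English-centric.

```python
# Trailing fillers hold the floor. Lists are illustrative, not exhaustive.

FILLERS_ONE = {"um", "uh", "er", "hmm", "like"}
FILLERS_TWO = {"you know", "i mean"}

def holds_the_floor(tokens: list[str]) -> bool:
    """A trailing filler means the speaker is still thinking, not done."""
    if not tokens:
        return False
    last = tokens[-1].lower()
    last_two = " ".join(tokens[-2:]).lower()
    return last in FILLERS_ONE or last_two in FILLERS_TWO

print(holds_the_floor(["I", "was", "thinking", "um"]))  # True -> keep waiting
print(holds_the_floor(["that", "is", "everything"]))    # False
```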

Story pauses vs. real endings

“Yesterday I woke up early, then…
[pause]
I went to work.”

Same acoustics as a true ending.

Different meaning.

Prosody helped:

  • rising pitch → continuation
  • falling pitch → completion

Text alone wasn’t enough.

Audio alone wasn’t either.

Speed vs. accuracy

Too fast → interruptions

Too slow → lag

Testing gave us ranges:

  • 250–400ms felt natural for standard responses
  • 500–800ms worked better for complex or sensitive topics

There is no universal number.

Timing has to adapt to what’s happening, not just silence.
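As a sketch, a context-dependent window might look like this, using the ranges from our testing. The topic labels are made up; classifying "sensitive" is the hard part.

```python
# Toy context-dependent wait windows. Topic labels are illustrative.

def wait_window_ms(topic: str) -> tuple[int, int]:
    """(min, max) silence window before responding, by conversation context."""
    sensitive = {"diagnosis", "claim_denial", "billing_dispute"}
    if topic in sensitive:
        return (500, 800)   # complex or sensitive: give more room
    return (250, 400)       # standard exchange: respond quickly

print(wait_window_ms("scheduling"))    # (250, 400)
print(wait_window_ms("claim_denial"))  # (500, 800)
```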

People don’t all speak the same way

Fast speakers sound finished early.

Slow speakers stretch turns.

Non-native speakers flatten pitch.

We saw massive regional variation.

The fix wasn’t better averages.

It was adaptation.

The system now learns a user’s rhythm within the first few turns.

By turn three or four, it’s calibrated.
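In sketch form, that calibration can be as simple as scaling the silence threshold from the median pause observed so far. The class, the constants, and the 1.5x multiplier are illustrative assumptions, not the shipped system.

```python
# Toy per-speaker calibration from the first few observed pauses.

class RhythmCalibrator:
    """Learns a user's pause rhythm over the first few turns."""
    def __init__(self, default_ms: float = 350.0):
        self.pauses: list[float] = []
        self.default_ms = default_ms

    def observe_pause(self, pause_ms: float) -> None:
        self.pauses.append(pause_ms)

    def threshold_ms(self) -> float:
        if len(self.pauses) < 3:          # not calibrated before turn three
            return self.default_ms
        typical = sorted(self.pauses)[len(self.pauses) // 2]  # median pause
        return max(250.0, 1.5 * typical)  # slow speakers get a longer window

cal = RhythmCalibrator()
for p in (420, 510, 480):                 # mid-utterance pauses, turns 1-3
    cal.observe_pause(p)
print(cal.threshold_ms())                 # 720.0 -> adapted to this speaker
```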

What we learned

Turn-taking is invisible when it works.

Catastrophic when it doesn’t.

It runs in milliseconds but depends on context built across the entire conversation.

Every new domain breaks assumptions:

  • healthcare ≠ insurance
  • support ≠ scheduling

We keep refining.

Current results:

  • 94% end-of-turn accuracy
  • <2% false interruptions

At that point, users stop managing the system.

They just talk.

Ready to transform your customer conversations?

Join leading enterprises using AveraLabs to deliver human-level service at AI speed.

Schedule a demo