End-of-turn prediction, interruption handling, and the timing tradeoffs behind natural voice AI.
We thought we solved turn-taking once our AI stopped cutting people off.
We didn’t.
In insurance tests, the system no longer interrupted — but conversations still felt broken.
Users finished speaking. → The AI waited. → The silence lingered just long enough to feel wrong.
People weren’t annoyed.
They were unsure.
Humans don’t say when they’re done talking.
They signal it. → Pitch shifts. → Micro-pauses. → Breath. → Rhythm.
We read these cues without thinking.
The AI didn’t.
Even with perfect transcription and correct answers, weak turn-taking made the system feel robotic.
Users started managing the system instead of just talking to it.
Nothing was “wrong.”
The timing was.
Every failure fell into one of two buckets: cutting in too early, or waiting too long.
Fixing one usually broke the other.
We had to solve both at once.
Silence isn’t a signal.
Context is.
“I went to the doctor and then…” isn’t finished.
“I went to the doctor.” probably is.
Early systems treated both pauses the same.
That worked until real speech showed up.
People pause to think.
To breathe.
To hold the floor.
Silence alone was useless.
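A toy sketch of the "context over silence" idea: if the transcript trails off on a conjunction, the thought probably isn't finished. The word list and function here are illustrative, not a production end-of-turn model.

```python
# Minimal sketch: judge whether a transcript fragment reads as finished.
# A real system would use a trained end-of-turn classifier; this heuristic
# only illustrates why text context beats raw silence.

CONTINUATION_ENDINGS = ("and", "then", "but", "so", "because", "or", "and then")

def looks_finished(transcript: str) -> bool:
    """Return True if the text reads like a completed thought."""
    text = transcript.strip().lower().rstrip(".!?")
    if not text:
        return False
    words = text.split()
    last_two = " ".join(words[-2:])
    # Trailing conjunctions ("...and then") usually mean more is coming.
    if words[-1] in CONTINUATION_ENDINGS or last_two in CONTINUATION_ENDINGS:
        return False
    return True

print(looks_finished("I went to the doctor and then"))  # False: keep listening
print(looks_finished("I went to the doctor."))          # True: likely done
```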
Interruptions aren’t accidents.
They’re intentional.
Correction.
Urgency.
Clarification.
In healthcare especially, users cut in constantly.
If the system kept talking for even 500ms after being interrupted, trust collapsed.
Backchannels can lag.
Interruptions can’t.
When someone cuts in, the system has to stop — immediately.
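A rough sketch of what "stop immediately" means in code, assuming a hypothetical audio player and a stream of VAD events; the names are stand-ins, not a real pipeline API.

```python
import asyncio

# Sketch of barge-in handling: the moment the VAD flags user speech while
# the agent is still talking, playback is cancelled. `vad_events` and
# `player` are hypothetical stand-ins for a real audio pipeline.

async def barge_in_watcher(vad_events: asyncio.Queue, player) -> None:
    while True:
        event = await vad_events.get()
        if event == "speech_start" and player.is_speaking():
            # Stop output right away; even ~500 ms of overlap erodes trust.
            player.stop()          # halt audio at the device buffer, not just the queue
            player.flush_queue()   # drop any synthesized audio not yet played
```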
First approach: audio-only.
Fast.
Responsive.
It tracked raw acoustic cues: silence, pitch, rhythm.
It worked for ideal speakers.
Then accents, flat intonation, and atypical rhythm broke it.
The system reacted quickly — and cut people off.
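That first endpointer was roughly this shape, reduced here to the silence part; the frame size and thresholds are placeholders, not the values anyone shipped.

```python
import numpy as np

# Sketch of a purely acoustic endpointer: energy-based VAD plus a fixed
# silence timeout. Illustrative constants only.

FRAME_MS = 20
ENERGY_THRESHOLD = 1e-4   # below this, a frame counts as silence
END_SILENCE_MS = 400      # fixed timeout: the root of the "cut people off" failures

def is_silent(frame: np.ndarray) -> bool:
    return float(np.mean(frame ** 2)) < ENERGY_THRESHOLD

def end_of_turn(frames: list[np.ndarray]) -> bool:
    """True once the trailing silence exceeds the fixed timeout."""
    silent_ms = 0
    for frame in reversed(frames):
        if not is_silent(frame):
            break
        silent_ms += FRAME_MS
    return silent_ms >= END_SILENCE_MS
```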
Second approach: text-based.
Accurate.
Context-aware.
It used the transcript to judge whether a thought was complete.
It understood intent.
But transcription latency mattered.
Even correct responses arrived too late.
The system always felt behind.
We stopped choosing between speed and understanding.
Audio handled immediacy.
Text handled intent.
Audio might signal an ending.
Text might signal continuation.
The fused model reads both.
That balance held up in real conversations.
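A stripped-down sketch of the fusion logic, with illustrative scores and weights rather than trained models:

```python
from dataclasses import dataclass

# Sketch of the fusion idea: neither signal decides alone.

@dataclass
class TurnSignals:
    acoustic_end_prob: float   # audio model: does this sound finished?
    text_continue_prob: float  # text model: does the transcript expect more?
    trailing_silence_ms: int

def should_respond(s: TurnSignals, threshold: float = 0.6) -> bool:
    # Audio may say "ending" while text says "continuation"; weigh both.
    end_score = 0.5 * s.acoustic_end_prob + 0.5 * (1.0 - s.text_continue_prob)
    # Long silence nudges the decision, but never decides on its own.
    if s.trailing_silence_ms > 1200:
        end_score += 0.15
    return end_score >= threshold

# Falling pitch but the transcript still expects more: keep waiting.
print(should_respond(TurnSignals(0.8, 0.9, 600)))  # False
print(should_respond(TurnSignals(0.8, 0.1, 600)))  # True
```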
Fillers hold the floor: “um,” “uh,” “you know.”
Early models heard the pause and jumped in.
We had to treat fillers as continuation markers.
Harder than it sounds — fillers vary wildly across accents.
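A minimal illustration, assuming a fixed English filler list, which is exactly the assumption that breaks across accents and languages:

```python
# Sketch: treat trailing fillers as hold-the-floor markers and extend the
# wait. The list and timing values are illustrative only.

FILLERS = {"um", "uh", "er", "hmm", "you know", "like"}

def ends_with_filler(transcript: str) -> bool:
    words = transcript.lower().strip().rstrip(".,").split()
    if not words:
        return False
    return words[-1] in FILLERS or " ".join(words[-2:]) in FILLERS

def wait_budget_ms(transcript: str, base_ms: int = 700) -> int:
    """Give the speaker extra time when they are audibly holding the floor."""
    return base_ms * 2 if ends_with_filler(transcript) else base_ms

print(wait_budget_ms("So my claim number is, um"))  # 1400
print(wait_budget_ms("My claim number is 42."))     # 700
```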
“Yesterday I woke up early, then…
[pause]
I went to work.”
Same acoustics as a true ending.
Different meaning.
Prosody helped: the pitch and rhythm leading into the pause carried the difference.
Text alone wasn’t enough.
Audio alone wasn’t either.
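As one example of the prosody side, here is a sketch that fits a slope over the final stretch of an f0 contour produced by any pitch tracker; a clear fall leans toward completion, a flat or rising contour toward continuation. The window size and threshold are assumptions.

```python
import numpy as np

# Sketch of one prosodic cue: the pitch trend just before the pause.

def final_pitch_slope(f0: np.ndarray, tail_frames: int = 30) -> float:
    voiced = f0[~np.isnan(f0)]        # drop unvoiced frames
    tail = voiced[-tail_frames:]
    if len(tail) < 2:
        return 0.0
    slope, _ = np.polyfit(np.arange(len(tail)), tail, deg=1)
    return float(slope)               # Hz per frame

def sounds_final(f0: np.ndarray, falling_threshold: float = -0.5) -> bool:
    # Falling pitch at the end suggests a finished turn.
    return final_pitch_slope(f0) < falling_threshold
```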
Too fast → interruptions
Too slow → lag
Testing gave us ranges, not a single answer.
There is no universal number.
Timing has to adapt to what’s happening, not just silence.
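A sketch of what "adapting to what's happening" can look like; the millisecond values are placeholders, not the ranges testing produced:

```python
# Sketch: the silence threshold is chosen per moment, not fixed globally.

def pause_threshold_ms(*, looks_complete: bool, ends_with_filler: bool,
                       answering_a_question: bool) -> int:
    if ends_with_filler:
        return 1500   # speaker is holding the floor: wait longer
    if not looks_complete:
        return 1200   # transcript reads unfinished: wait longer
    if answering_a_question:
        return 500    # short answers tend to end crisply: respond sooner
    return 800        # default middle ground
```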
Fast speakers sound finished early.
Slow speakers stretch turns.
Non-native speakers flatten pitch.
We saw massive regional variation.
The fix wasn’t better averages.
It was adaptation.
The system now learns a user’s rhythm within the first few turns.
By turn three or four, it’s calibrated.
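A simplified sketch of that calibration, using an exponential moving average of the speaker's mid-turn pauses; the constants are illustrative.

```python
# Sketch: adapt the wait threshold to a speaker's rhythm over the first turns.

class PauseCalibrator:
    def __init__(self, default_ms: float = 800.0, alpha: float = 0.4):
        self.estimate_ms = default_ms   # start from a generic prior
        self.alpha = alpha              # how fast we trust the new speaker

    def observe_pause(self, pause_ms: float) -> None:
        """Feed each mid-turn pause that did NOT end the user's turn."""
        self.estimate_ms = (1 - self.alpha) * self.estimate_ms + self.alpha * pause_ms

    def threshold_ms(self) -> float:
        # Wait a bit longer than the speaker's typical thinking pause.
        return min(2000.0, max(400.0, 1.5 * self.estimate_ms))

cal = PauseCalibrator()
for pause in [350, 900, 1100]:       # first few turns of a slower speaker
    cal.observe_pause(pause)
print(round(cal.threshold_ms()))     # threshold drifts up toward their rhythm
```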
Turn-taking is invisible when it works.
Catastrophic when it doesn’t.
It runs in milliseconds but depends on context built across the entire conversation.
Every new domain breaks assumptions.
We keep refining.
The results now hold up well enough that users stop managing the system.
They just talk.