Control Voice AI response timing with Start Speaking Plan

By Abhishek Sharma

Voice AI agents fail in production when they can't determine when a user has finished speaking. Most systems assume 0.5 seconds of silence means the turn is over. But users pause while thinking, reading numbers, or looking at screens. The agent jumps in too early.

The 0.5-second rule breaks in 3 situations:

  • Thinking Pauses: Users frequently pause to think, look at screens, or formulate their answers.
  • Slow Dictation: When users dictate numbers or other information slowly, the system misinterprets the pauses as the end of their turn.
  • Processing Delays: If the agent's response is delayed, users fill the silence with "hello, are you there?", and the agent ends up talking over them.

In each of these scenarios the agent cuts in too early, so the LLM works from an incomplete picture of the user's intent and downstream tool calls fail.
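
To see why a single threshold fails, here's a minimal sketch of the naive approach: one fixed 0.5-second cutoff applied to every pause, regardless of context. The function and the pause values are illustrative only, not Telnyx internals.

```python
# Naive endpointing: one fixed silence threshold for every pause.
# Illustrative sketch only -- not Telnyx's implementation.

FIXED_SILENCE_THRESHOLD = 0.5  # seconds

def turn_is_over(seconds_of_silence: float) -> bool:
    """Declare the user's turn finished after a fixed amount of silence."""
    return seconds_of_silence >= FIXED_SILENCE_THRESHOLD

# A user dictating an order number: "4" ... "7" ... "2" ... "9"
pauses_between_digits = [0.8, 0.7, 0.9]  # natural gaps while reading from a screen

for pause in pauses_between_digits:
    if turn_is_over(pause):
        print(f"Agent responds after a {pause}s pause -- mid-dictation, too early.")
```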

Start Speaking Plan fixes this.

You can now configure wait times for Telnyx Voice AI Agents based on speech patterns and context.

How it works

Telnyx Voice AI Agents now distinguish between 4 types of pauses:

  1. Wait seconds sets your baseline. A customer service agent might use 0.3 seconds for snappy responses. An agent calling into an IVR system needs 1.5 seconds to account for slower robotic speech.
  2. On punctuation seconds handles high-confidence endpoints. When the transcription ends with a period or question mark, the user likely finished their thought. Set this to 0.1 seconds for minimal delay.
  3. On No punctuation seconds handles uncertainty. The user said "my order number is" and paused. They're probably looking at their screen. Set this to 1.5 seconds so the agent doesn't interrupt with "I didn't catch that" while they're reading the digits.
  4. On number seconds handles digit sequences specifically. People read numbers slowly: "4... 7... 2... 9." Each pause could trigger a response. Extending this to 1.0 seconds prevents the agent from cutting them off at "4... 7..."

Here's a configuration for an agent collecting order information:

```json
{
  "interruption_settings": {
    "start_speaking_plan": {
      "wait_seconds": 0.4,
      "transcription_endpointing_plan": {
        "on_punctuation_seconds": 0.1,
        "on_no_punctuation_seconds": 1.5,
        "on_number_seconds": 0.5
      }
    }
  }
}
```
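
To make the intent of those three timeouts concrete, here's a rough sketch of how they could be chosen from a partial transcript using the values above. The selection logic, function name, and digit check are assumptions for illustration, not Telnyx's actual endpointing implementation.

```python
import re

# Illustrative sketch: pick a post-silence wait time based on how the
# partial transcript ends. Values mirror the configuration above.
ENDPOINTING = {
    "on_punctuation_seconds": 0.1,
    "on_no_punctuation_seconds": 1.5,
    "on_number_seconds": 0.5,
}

def silence_to_wait(partial_transcript: str) -> float:
    """Decide how long to wait after silence before the agent speaks."""
    text = partial_transcript.rstrip()
    if re.search(r"\d\s*$", text):
        # Ends in a digit: the user is probably mid-number, so wait longer.
        return ENDPOINTING["on_number_seconds"]
    if text.endswith((".", "?", "!")):
        # Clear sentence boundary: respond almost immediately.
        return ENDPOINTING["on_punctuation_seconds"]
    # Trailed off without punctuation ("my order number is"): give them time.
    return ENDPOINTING["on_no_punctuation_seconds"]

print(silence_to_wait("My order number is 47"))       # 0.5 -- still dictating digits
print(silence_to_wait("My order number is"))          # 1.5 -- thinking or reading
print(silence_to_wait("That's everything, thanks."))  # 0.1 -- done talking
```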

And here's one optimized for IVR interactions where the agent is navigating another automated system:

```json
{
  "interruption_settings": {
    "start_speaking_plan": {
      "wait_seconds": 1.5,
      "transcription_endpointing_plan": {
        "on_punctuation_seconds": 1.0,
        "on_no_punctuation_seconds": 2.0,
        "on_number_seconds": 1.5
      }
    }
  }
}
```

Smart endpointing vs. explicit control

Some platforms offer "smart endpointing" that uses ML to predict when users are done talking. This works well for standard English conversations. It breaks on edge cases: users with accents, users speaking languages other than English, users dictating numbers, and agents interacting with IVR systems.

We're shipping explicit controls instead. You know your use case better than an algorithm does. If you're building a healthcare scheduler, you need different timing than an e-commerce order tracker. If your users are older and speak more slowly, you need different timing than a tech support bot for developers.
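
If it helps to see that reasoning as configuration, here's a rough sketch of how two deployments might diverge. The profile names and timing values are illustrative assumptions, not recommendations.

```python
# Rough, illustrative timing profiles for different deployments.
# Names and values are assumptions for discussion, not recommendations.
START_SPEAKING_PROFILES = {
    # Developers on a support line speak quickly; keep the agent snappy.
    "dev_support_bot": {
        "wait_seconds": 0.3,
        "transcription_endpointing_plan": {
            "on_punctuation_seconds": 0.1,
            "on_no_punctuation_seconds": 1.0,
            "on_number_seconds": 0.5,
        },
    },
    # Healthcare scheduling with older callers: allow longer thinking pauses.
    "healthcare_scheduler": {
        "wait_seconds": 0.8,
        "transcription_endpointing_plan": {
            "on_punctuation_seconds": 0.3,
            "on_no_punctuation_seconds": 2.0,
            "on_number_seconds": 1.2,
        },
    },
}

def start_speaking_plan(use_case: str) -> dict:
    """Return the Start Speaking Plan settings for a given deployment."""
    return {"start_speaking_plan": START_SPEAKING_PROFILES[use_case]}

print(start_speaking_plan("healthcare_scheduler")["start_speaking_plan"]["wait_seconds"])  # 0.8
```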

Start Speaking Plan is available now in the Mission Control Portal under Voice AI Assistant settings.

Configure it, test with your worst-case scenarios, and deploy with confidence that your webhooks will actually fire.

Check out our documentation for configuration examples covering common use cases, and a debugging guide if you're seeing tool-calling failures.
