Adding a second human to an active voice AI call breaks assumptions your conversation layer has held since day one. Here is what changes in the audio layer, the transcript state, and the tool design when you build for multiple speakers.
Multi-participant voice AI calls look simple from the outside. The assistant is on a call. Someone new joins. Three people are talking. What could go wrong?
Quite a bit. Adding a second human to an active voice AI call breaks assumptions your conversation layer has held since the first call. Audio handling becomes more complex. The turn loop no longer has one speaker to track. Transcript state that was safe to accumulate across interruptions becomes a liability when two different people are talking. And the assistant needs to know, explicitly, when it should speak and when it should stay out of a conversation not directed at it.
This post walks through what it actually takes to build multi-participant voice AI calls: why diarization should start before transcription, a transcript attribution bug worth understanding, and the invite tool design that keeps the assistant from dialing anyone a user mentions.
A multi-participant voice AI call is a voice session with more than two active participants, where one participant is an AI assistant. Unlike a conference bridge or a recording bot, the AI is an active participant: it identifies who is speaking, understands who each person is addressing, and chooses to respond or stay silent based on that context.
This is different from:

- A conference bridge, which mixes audio between legs but has no participant reasoning about the conversation
- A recording or transcription bot, which listens passively and never decides whether to respond
The technical requirements are specific: preserving speaker identity, independent speech detection per participant, speaker attribution in the conversation context, and a mechanism for the assistant to skip its turn when the humans are talking to each other.
Most voice AI conversation agents are built around a 1:1 shape. One user stream, one assistant turn loop, one active speaker at a time. The conversation state machine assumes there is exactly one person who just spoke.
Adding a second human breaks at least four things.
Audio routing. In a 1:1 call, media flows between two legs. Adding a third participant changes the call into a shared conversation. The system now has to decide what every participant hears, how the assistant hears each participant, and how speaker identity is preserved across the call.
Diarization. Multi-participant voice AI depends on reliable diarization: the system has to know who said what, not just what was said. Diarization can work well, but availability and behavior vary by speech-to-text engine, and it can still fail. Preserving participant identity before transcription gives the assistant cleaner context and more reliable turn ownership.
Turn management. A conversation agent typically responds after every user turn. In a conversation between two humans, that behavior is disruptive. You need a mechanism that lets the assistant deliberately skip a turn when participants are clearly talking to each other.
Transcript state. This one is subtle, and we will cover it in detail below.
The most important design choice is where speaker identity is created.
One approach is to mix all participant audio into a single stream and rely on the underlying speech-to-text engine to separate speakers afterward. Diarization can work well, but not every transcription engine supports it, and speaker separation can fail. A real-time assistant is making turn-by-turn decisions while people are talking. It needs reliable speaker identity before it decides whether to answer, stay silent, or take an action.
A better pattern for real-time voice AI is to preserve participant identity as early as possible. The exact media topology can vary, but the goal is the same: keep speech detection, transcription, and turn ownership tied to a known participant from the start. The model receives conversation history with speaker identity attached, which makes it easier to reason about whether someone is addressing the assistant, another participant, or the group.
The principle is simple: do not recover speaker identity later if you can preserve it earlier.
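A minimal sketch of what "preserve identity early" can look like in practice, assuming one speech-detection and transcription pipeline per participant leg. The names here (TranscriptEvent, ParticipantPipeline, the stt session object) are illustrative, not a specific SDK's API:

```python
from dataclasses import dataclass

@dataclass
class TranscriptEvent:
    participant_id: str    # identity attached at the source, never inferred later
    participant_name: str
    text: str
    is_final: bool

class ParticipantPipeline:
    """One speech-detection + transcription pipeline per participant leg."""

    def __init__(self, participant_id: str, participant_name: str, stt_session):
        self.participant_id = participant_id
        self.participant_name = participant_name
        self.stt = stt_session  # per-participant STT session (illustrative)

    def on_audio_frame(self, frame: bytes) -> list[TranscriptEvent]:
        # Transcribe only this participant's audio, so every transcript event
        # is born with a known speaker instead of relying on post-hoc diarization.
        return [
            TranscriptEvent(self.participant_id, self.participant_name, text, final)
            for text, final in self.stt.transcribe(frame)
        ]
```

Because every event already carries its owner, downstream turn logic never has to guess which participant a fragment of text belongs to.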
The most instructive bug we hit during development was a transcript cross-contamination issue in the conversation turn loop: a partial transcript carried over from one speaker could end up attached to the next utterance from a different speaker, so the model saw words attributed to the wrong person.
This bug was invisible in 1:1 calls. In single-speaker sessions, appending a prior partial transcript across an interruption is a reasonable strategy. The same person keeps speaking, and carrying forward the unfinished fragment builds the complete utterance. The assumption that "previous transcript belongs to the current speaker" is trivially true when there is only one speaker.
In multi-participant calls, that assumption is false.
The root cause was a transcript concatenation function that did not check whether the prior transcript came from the same sender as the current channel. The fix required tracking sender identity alongside the transcript in conversation state, and only concatenating when the sender matched. Anywhere the transcript was cleared, the sender identity also had to be cleared.
The conceptual shift is the important part: any state in a voice AI conversation agent that crosses turn boundaries needs an explicit owner. In a single-speaker system, ownership is implicit. In a multi-speaker system, it must be tracked.
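Here is a minimal sketch of the shape of that fix, not the actual code; PendingTranscript and merge_partial are illustrative names:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PendingTranscript:
    sender_id: Optional[str] = None  # explicit owner of the carried-over fragment
    text: str = ""

def merge_partial(pending: PendingTranscript, sender_id: str, new_text: str) -> str:
    """Concatenate a carried-over fragment only if it belongs to the same speaker."""
    if pending.sender_id == sender_id and pending.text:
        merged = f"{pending.text} {new_text}"
    else:
        # Different speaker (or no fragment): the old fragment must not leak
        # into this speaker's utterance.
        merged = new_text
    # Clearing the transcript also clears its owner, so ownership never goes stale.
    pending.sender_id = None
    pending.text = ""
    return merged
```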
The assistant does not place a call by guessing at phone numbers. Instead, the invite mechanism works through a tool that operators configure with a set of allowed destinations.
The behavior has three modes depending on how many targets are configured:
| Targets configured | LLM behavior | Result |
|---|---|---|
| 0 | LLM provides phone or SIP directly | Full flexibility, no operator constraints |
| 1 | No LLM argument needed | Auto-selects the single configured target |
| 2 or more | LLM picks by name from the allowed set | LLM chooses between Specialist, Billing, etc. |
One detail worth understanding: participantName is not just a label. When set, the name is passed directly into the user messages the LLM receives. If Sarah says something, the model sees {name: "Sarah", content: "..."} in the conversation history. The AI knows who said what, turn by turn. This is what makes intelligent turn-taking reliable. The model can reason that two specific people are talking to each other and not addressing it.
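For illustration, this is roughly the shape of the conversation history the model receives when participantName is set. The field names mirror the {name, content} form described above rather than a specific SDK type, and the dialogue is invented:

```python
history = [
    {"role": "user", "name": "Sarah", "content": "Can you loop in billing?"},
    {"role": "assistant", "content": "Sure, adding Billing to the call now."},
    {"role": "user", "name": "Sarah", "content": "Thanks. Mark, did you get the invoice?"},
    {"role": "user", "name": "Mark", "content": "Yes, I'm looking at it now."},
    # With names attached turn by turn, the model can infer that Sarah is
    # addressing Mark here and choose to skip its own turn.
]
```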
This keeps operators in control of who the assistant can reach. The LLM makes a choice, but only within the set of destinations the operator has approved. A user saying "call my lawyer" does not cause the assistant to search for and dial a number. It matches a configured target or cannot proceed.
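A sketch of how the three modes from the table can resolve to a destination. The function and argument names are assumptions for illustration; the point is that the LLM never produces a dialable number unless the operator configured zero targets:

```python
def resolve_invite_target(configured_targets: dict[str, str], llm_args: dict) -> str:
    """Illustrative target resolution for the three modes in the table above."""
    if not configured_targets:
        # 0 targets: the LLM supplies a phone number or SIP URI directly.
        return llm_args["destination"]
    if len(configured_targets) == 1:
        # 1 target: no LLM argument needed; the single destination is used.
        return next(iter(configured_targets.values()))
    # 2+ targets: the LLM picks by name, but only from the operator-approved set.
    name = llm_args["target_name"]
    if name not in configured_targets:
        raise ValueError(f"'{name}' is not an allowed destination")
    return configured_targets[name]
```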
Multi-participant calling only feels natural when the assistant knows when not to speak.
The skip_turn tool gives the assistant an explicit way to stay silent for one turn. The assistant calls this tool when it determines that participants are talking to each other rather than addressing it. Skip Turn does not end the call or disable the assistant. It skips exactly one response cycle; on the next turn, the assistant is ready to respond again.
The instruction pattern that works:
When participants are talking to each other and not addressing you, use Skip Turn.
Only respond when someone addresses you directly or asks you to take an action.
Configure both tools via the Assistants API:
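A hedged sketch of what that configuration payload can look like. The exact endpoint and field names depend on your platform's Assistants API; the tool names, target fields, and overall shape below are illustrative assumptions, while skip_turn itself and the instruction text come from above:

```python
# Illustrative assistant configuration payload (field names are assumptions).
assistant_config = {
    "tools": [
        {
            # Lets the model explicitly pass on one response cycle.
            "type": "skip_turn",
        },
        {
            # Operator-approved destinations the LLM may pick from by name.
            "type": "invite",
            "targets": [
                {"name": "Specialist", "to": "+15550100001"},
                {"name": "Billing", "to": "sip:billing@example.com"},
            ],
        },
    ],
    "instructions": (
        "When participants are talking to each other and not addressing you, use Skip Turn. "
        "Only respond when someone addresses you directly or asks you to take an action."
    ),
}
```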
Hangup behavior is configurable. By default, when an invited participant leaves, the conversation continues for the original caller. When the original caller leaves, the conversation ends for everyone. You can change both behaviors via on_invited_participant_hangup and on_original_participant_hangup on the Invite tool config.
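For example, a sketch of those two options on the Invite tool config. The option names come from above, but the accepted values shown here are assumptions, not documented constants:

```python
invite_tool = {
    "type": "invite",
    # Default: the call continues for the original caller when an invitee leaves.
    "on_invited_participant_hangup": "continue_conversation",
    # Default: the call ends for everyone when the original caller leaves.
    "on_original_participant_hangup": "end_conversation",
}
```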
Full implementation guide and configuration options: Multi-Participant Calls guide