The State of Voice AI Agent Performance


There is a lot of excitement around real-time interactive Voice AI agents these days. We’re seeing Voice AI agents eat up existing B2B call volume, create novel experiences with character-based LLMs, and replace existing website workflows. And the Voice AI transformation is just beginning. There is so much more growth ahead of us.

Voice AI agents are the new website. They are a means for customers to learn more about a business and transact with a business. In the current state of technology, how effective are Voice AI agents at fulfilling their promise? In this blog post, we take a look at production Voice AI agent data to answer that question.

About Our Dataset

We’re building an analytics and evaluation platform for Voice AI agents. Our dataset gives us a unique vantage point on the state of modern LLM-based Voice AI agents.

Our customers build Voice AI agents for many different verticals, including agents for car dealerships, home service businesses, medical practices, financial services, and more. We have calls on our platform from all of the popular Voice AI orchestration platforms.

We randomly sampled 50,000 calls from our dataset. The call audio files and transcripts were uploaded to our platform during the month of February 2025. The calls are mostly in English and mostly in the US.

Voice AI Performance

Let's first take a look at success rates and at the most telling characteristic of failed Voice AI agent calls.

Voice AI Call Success Rate

We figured the best place to start was to look at Voice AI agent success rates. How effective are Voice AI agents at achieving their objective?

On our platform, Voice AI developers give us a list of successful call outcomes. If a Voice AI agent call has a successful outcome, then it achieved its objective. An example of a successful outcome is the Voice AI agent booked a follow-up appointment with a sales representative. If the developer does not provide us with the successful outcomes, then we take a sample of transcripts and ask an LLM to generate them.
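As a rough illustration of this outcome check, here's a minimal sketch that grades a transcript against a list of successful outcomes. The `llm_complete` helper is a hypothetical placeholder for whatever LLM API you use; this is not our actual grading pipeline.

```python
# Minimal sketch of outcome classification. `llm_complete(prompt) -> str` is a
# hypothetical wrapper around your LLM provider of choice.

def classify_call_outcome(transcript: str, successful_outcomes: list[str]) -> bool:
    """Return True if the transcript matches any developer-defined successful outcome."""
    prompt = (
        "You are grading a voice agent call.\n"
        "Successful outcomes:\n"
        + "\n".join(f"- {o}" for o in successful_outcomes)
        + "\n\nTranscript:\n" + transcript
        + "\n\nDid the call achieve any of the successful outcomes? Answer YES or NO."
    )
    answer = llm_complete(prompt)  # hypothetical LLM call
    return answer.strip().upper().startswith("YES")

# Example usage:
# classify_call_outcome(transcript, ["Booked a follow-up appointment with a sales representative"])
```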

[Figure: Voice AI agent call success rate]

Our dataset shows that 42% of Voice AI agent calls have successful outcomes and meet their objective.

Is that a good number or a bad number? It’s difficult to benchmark because our dataset includes a very broad range of use cases. For example, you would expect a lower success rate on upselling a new customer on additional services compared to answering ‘Where is my order?’. We can, however, look to human call center data for a reference point. Human call center representatives typically resolve the customer issue on the first call (i.e., First Call Resolution) for about 70% of calls.

In other words, Voice AI agent success rate is just a little better than half the success rate of human agents.

This matches what we see when we inspect the call data on our platform. Some use cases are easy to automate and have success rates greater than 90%. Other use cases are very difficult to automate and have success rates below 10%. But most use cases are neither simple nor impossible. Instead, they're use cases with a lot of edge cases (e.g., it's 'mummy', not 'mommy', in British English). And you can't find these edge cases by randomly sampling calls or just reading transcripts.

Your Voice AI agent may seem to be doing well enough in production based on the few calls you manually sample. So why are your call success rates not as high as you and your customers expect? Analyze your Voice AI calls and you'll see. It takes time and craftsmanship to build highly performant Voice AI agents.

Voice AI Call Duration

When a Voice AI call does not meet its objective, the most likely reason is that the caller hung up. Voice AI call duration is the best predictor of Voice AI call success.

[Figure: raincloud plot of call duration for one agent]

We use raincloud plots like the one above on our platform to visualize call duration. In the raincloud plot, the cloud shows the probability distribution of call duration across many calls from one agent. The blue raindrops are successful calls where the call objective was met. The red raindrops are failed calls. The earth shows a boxplot with the quartiles, mean, and range. As you can see in the raindrops, most of the failed calls are short. The example above is indicative of our broader dataset.
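If you'd like to build a similar view of your own calls, here's a rough matplotlib sketch of a raincloud-style plot. The durations and outcomes are placeholder data, and the styling is simplified compared to our platform's charts.

```python
# Raincloud-style view of call durations: a distribution "cloud", jittered per-call
# "raindrops" colored by outcome, and a boxplot "earth". Data here is illustrative.
import numpy as np
import matplotlib.pyplot as plt

durations = np.array([34, 51, 92, 18, 140, 75, 22, 63])      # placeholder durations (s)
succeeded = np.array([0, 1, 1, 0, 1, 1, 0, 1], dtype=bool)   # placeholder outcomes

fig, ax = plt.subplots(figsize=(7, 3))

# Cloud: the overall duration distribution.
ax.violinplot(durations, positions=[1.2], vert=False, showextrema=False)

# Rain: one jittered point per call, blue for success, red for failure.
jitter = np.random.default_rng(0).uniform(-0.08, 0.08, size=len(durations))
ax.scatter(durations[succeeded], 0.8 + jitter[succeeded], color="tab:blue", label="successful")
ax.scatter(durations[~succeeded], 0.8 + jitter[~succeeded], color="tab:red", label="failed")

# Earth: quartiles, median, and range.
ax.boxplot(durations, positions=[1.0], vert=False, widths=0.1, showfliers=False)

ax.set_xlabel("Call duration (seconds)")
ax.set_yticks([])
ax.legend(loc="lower right")
plt.tight_layout()
plt.show()
```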

[Figure: call duration for successful vs. failed calls]

In our dataset, the median duration for a successful call is 1 minute and 32 seconds. The median duration for a call that fails to meet the objective is 34 seconds. The median for all calls is 51 seconds.

Let’s again turn to human call center data for context. According to TalkDesk’s contact center report, TalkDesk’s customers have an average talk time with their customers of 2 minutes and 50 seconds.

So far, we’ve found that Voice AI agent calls fail more often and are shorter than human calls.

Let’s next dig deeper into the audio metrics and conversation metrics to understand why Voice AI calls are failing.

Voice AI Call Audio Metrics

Voice AI calls are failing because people are hanging up before giving the Voice AI a chance. What are callers hearing that clues them into the fact that they’re not talking to a human?

On our platform, we analyze the recordings from Voice AI agent calls for latency, interruptions, transcription errors, and so on. Let’s take a look and see if we can find out.

A quick note about the dataset. Some closed voice orchestration platforms only provide mono recordings (i.e., both the human and the AI are on the same channel). If you're trying to calculate the speaking rate of the AI, you need to know when the AI is speaking, and it's difficult to diarize recordings with just one channel. As a result, our audio metrics from mono recordings are less accurate than those from stereo recordings. For this section, we subset our dataset to stereo calls only (i.e., the human and the AI are on different channels).
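For example, a filter like the following sketch (using the soundfile library, with an assumed directory of WAV recordings) is enough to keep only two-channel calls:

```python
# Keep only stereo recordings (caller and agent on separate channels)
# before computing audio metrics. Directory layout is an assumption.
import soundfile as sf
from pathlib import Path

def stereo_calls(recordings_dir: str) -> list[Path]:
    """Return paths of recordings with two channels (human and AI separated)."""
    keep = []
    for path in Path(recordings_dir).glob("*.wav"):
        if sf.info(str(path)).channels == 2:
            keep.append(path)
    return keep
```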

Voice AI Latency

When we talk about latency in Voice AI, we mean the delay between when the human stops speaking and the AI starts speaking. Latency was the first major hurdle in bringing voices to LLMs. A long pause between responses is a dead giveaway that you’re talking to an AI. At the dawn of LLM-based Voice AI in late 2023, latency was the glaring user experience problem. These days, I sometimes hear people say the issue is solved.

[Figure: Voice AI response latency]

In our analysis, we measured latency as the time between when one speaker stops speaking and the other starts speaking. The median latency between when the human stops speaking and the AI starts speaking is 1.95 seconds.
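Conceptually, the measurement looks something like the sketch below, which assumes you already have per-channel speech segments (for example, from a VAD). It's a simplification of our pipeline; real calls need extra handling for overlaps and backchannels.

```python
# Turn-taking latency from per-channel speech segments. `human_segments` and
# `ai_segments` are lists of (start, end) times in seconds; illustrative only.
def response_latencies(human_segments, ai_segments):
    """For each human turn, find the delay until the next AI speech onset."""
    latencies = []
    for _, human_end in human_segments:
        next_ai_starts = [start for start, _ in ai_segments if start >= human_end]
        if next_ai_starts:
            latencies.append(min(next_ai_starts) - human_end)
    return latencies

# Example: median AI response latency for one call.
# import statistics; statistics.median(response_latencies(human_segments, ai_segments))
```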

In the Voice AI community, we frequently say that humans respond in 500 milliseconds or less. In academic studies, the average response latency in English is reported to be about 250 milliseconds and tends to stay under 500 milliseconds. 1.95 seconds is well above that threshold.

We shouldn’t evaluate Voice AI latency in a vacuum. Consider, human reader, conversations you have had with other people in various contexts. Latency is a function of how well the speakers know each other, their emotional states, whether they are agreeing, how well they hear each other, the complexity of the conversation’s subject matter, the latency tendency of each of the speakers, the cultural upbringing of each of the speakers, and so on (see Section 5 here for a nice review).

This raised an interesting question for us. In our dataset, how long does it take for humans to respond to AI? Answer: the median is 1.82 seconds.

Well, that’s puzzling. A latency of 1.82 seconds is a far cry from the values reported in academic studies. What’s going on?

When a population statistic yields a surprising result, it’s always best to return to the source and examine the data. We looked at the data from numerous calls one by one. It’s hard to be sure what’s happening due to all the factors that influence turn-taking, but here’s our take.

First, in many calls, people realize they’re talking to a machine. Often, people hang up once they realize it, but others continue the conversation. And when they knowingly speak to an AI, they seem to speak with more deliberation.

Second, it is also a fairly regular occurrence that the human doesn't seem to realize they're talking to a machine. Real-time conversation is a complicated dance – even for humans! We've learned to make all sorts of adaptations, consciously and unconsciously, during a conversation to facilitate communication. Even in calls where people don't realize they're talking to an AI, they're still adjusting how they speak based on the flow of the conversation – and in these calls, the AI is leading the dance.

In conclusion, we think the current state of Voice AI with respect to latency is passable. A solid C. Whether they realize it or not, latency is cluing people into the fact that something is amiss and they’re that much more likely to hang up.

C’s get degrees, but they probably won’t get you much contract expansion with your customers. And you know where most startups get their growth? Not those new business deals that are enthralled by your shiny sandboxed demo. It’s contract expansion. The performance of your agent in production drives the growth of your startup.

[Figure: example latency visualization from our platform]

Fortunately, our platform makes it easy to understand the latency in your Voice AI conversations. Here's an example of one of our latency visualizations.

Voice AI Call Initiation

What about the latency from picking up the call to when the AI starts speaking? When I accidentally pick up a telemarketing call, I know right away. The giveaway is the delay from when I answer to when the person starts speaking. The telemarketer is using an autodialer. There’s a delay between when the autodialer detects that I have answered and when the call is routed to a human to pitch me on buying solar panels.

[Figure: time from call pickup to when the AI starts speaking]

In our dataset, the time from pickup to when the AI starts speaking is 880 milliseconds. The root cause isn't an autodialer; the calls are simply taking too long to initiate. Given that humans easily handle conversation turn delays of less than 250 milliseconds, 880 milliseconds for the Voice AI agent to start speaking is a long time.
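Here's a rough sketch of how a time-to-first-word metric like this can be estimated from a stereo recording with a simple energy threshold. The channel index and threshold are assumptions; a production pipeline would use a proper VAD.

```python
# Estimate time from pickup to the AI's first words, assuming the AI is on channel 1.
import numpy as np
import soundfile as sf

def time_to_first_ai_speech(path: str, ai_channel: int = 1, frame_ms: int = 20,
                            threshold: float = 0.01):
    audio, sr = sf.read(path)          # shape: (samples, channels) for stereo files
    ai = audio[:, ai_channel]
    frame = int(sr * frame_ms / 1000)
    for i in range(0, len(ai) - frame, frame):
        rms = np.sqrt(np.mean(ai[i:i + frame] ** 2))
        if rms > threshold:            # first frame that looks like speech
            return i / sr
    return None                        # the AI never spoke
```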

880 milliseconds is ample time for me to hang up.

Percent Silence In Voice AI Calls

When researchers introduced a delay in telephone calls between two humans, callers waited longer between turns to make sure the other person had finished speaking. When there are call orchestration issues, speakers are unsure when to speak and therefore wait longer to speak.

In other words, latency isn't always an orchestration problem in itself. Sometimes it's a symptom of Voice AI orchestration problems.

In our dataset, if 10% to 30% of the call recording is silence, then the call generally has gone well and the mean duration is 1 minute and 20 seconds. But when the percent silence increases above 30%, the duration drops to 56 seconds.
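For reference, a simplified version of the percent-silence metric looks like this sketch, where the 20 ms frame size and RMS threshold are illustrative choices rather than our exact parameters:

```python
# Percent silence: the share of short frames in which neither channel carries
# speech-level energy. Threshold and frame size are illustrative.
import numpy as np
import soundfile as sf

def percent_silence(path: str, frame_ms: int = 20, threshold: float = 0.01) -> float:
    audio, sr = sf.read(path)
    mono = audio.mean(axis=1) if audio.ndim == 2 else audio   # mix channels
    frame = int(sr * frame_ms / 1000)
    frames = [mono[i:i + frame] for i in range(0, len(mono) - frame, frame)]
    silent = [np.sqrt(np.mean(f ** 2)) < threshold for f in frames]
    return 100.0 * sum(silent) / max(len(silent), 1)
```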

Voice AI Transcription Error and Call Success Rates

When I’m talking on the phone with someone who is in a noisy environment, I tell them that I can’t hear them well and ask them if they can speak in a quieter place or call me later. I have yet to encounter a Voice AI agent that realizes the call is noisy. Instead, the Voice AI agent just tries to power through it. It leads to transcription errors and call failures.

Some Automatic Speech Recognition (ASR) services such as Deepgram provide a confidence score. In our dataset, we can trace the path from low signal-to-noise ratio to low transcription confidence to lower rates of successful calls.

First, a note about how to interpret the confidence scores. We have to consider that ASR confidence scores are generally high. In our analysis, we take the average of all the word-level transcription confidence scores in a call to yield an average call-level transcription score. The 25th, 50th, and 75th percentiles for the population of average call confidence scores in our dataset are 91%, 96%, and 99%, respectively.
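In code, the roll-up is just a plain average over word-level scores. The sketch below assumes the word confidences have already been pulled out of the ASR response (field names vary by provider):

```python
# Roll word-level ASR confidence up to a call-level score with a plain average.
# `words` is assumed to be a list of dicts with a "confidence" field.
def call_confidence(words: list[dict]) -> float:
    scores = [w["confidence"] for w in words if "confidence" in w]
    return sum(scores) / len(scores) if scores else float("nan")

# Example: call_confidence([{"word": "hello", "confidence": 0.98}, ...])
```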

In our dataset, the mean and standard deviation for Signal-To-Noise Ratio (SNR) on the human caller channel are 34 dB and 18 dB, respectively. An SNR of 34 dB is good. When I listen to calls with the human channel SNR at least one standard deviation below the mean (i.e., <= 16 dB), I begin to have trouble understanding the person.
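For readers who want to compute something comparable, here's a rough sketch of a per-channel SNR estimate that treats the quietest frames as the noise floor and the loudest as speech. The percentile cutoffs are illustrative assumptions, not our exact method.

```python
# Rough SNR (dB) estimate on the caller channel from frame energies.
# A real implementation would lean on a VAD to separate speech from noise.
import numpy as np
import soundfile as sf

def channel_snr_db(path: str, human_channel: int = 0, frame_ms: int = 20) -> float:
    audio, sr = sf.read(path)
    ch = audio[:, human_channel] if audio.ndim == 2 else audio
    frame = int(sr * frame_ms / 1000)
    energies = np.array([
        np.mean(ch[i:i + frame] ** 2)
        for i in range(0, len(ch) - frame, frame)
    ])
    noise = np.percentile(energies, 10) + 1e-12   # quietest frames ~ background noise
    signal = np.percentile(energies, 90) + 1e-12  # loudest frames ~ speech
    return 10.0 * np.log10(signal / noise)
```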

[Figure: SNR, ASR confidence, and call success rate]

For calls where the SNR is 34 dB or higher, the average ASR confidence is 96%. For calls that are one standard deviation below the mean SNR, the average ASR confidence is 91%. In other words, more background noise and less speaker signal means more transcription errors. And if the SNR is below 16 dB or the ASR confidence is below 91%, calls are only 27% as likely to have a successful outcome. Put differently, calls with good SNR are about 3.7 times as likely to have a successful outcome as noisy calls.

Curious about how the ASR model is performing for your Voice AI agent? You can track it on our Voice AI analytics platform.

Voice AI Words Per Minute and Pitch

Most Voice AI developers recognize that the primary reason their calls are failing is that callers are hanging up as soon as they realize they’re talking to an AI. And so the most frequently recurring question we hear in the Voice AI community is, “What’s the most realistic text-to-speech voice?”.

There is a lot of art and science to making a synthetic voice sound human. There are some very tough issues. For example, most of the training data for TTS is from people reading books, so Voice AI often sounds like someone reading you a story.

[Figure: speaking rate (words per minute) of humans vs. Voice AI agents]

One of the telltale signs that I'm talking to an AI is the pace. In our dataset, humans speak at 156 words per minute. But the Voice AIs speak about 15% faster, at 180 words per minute. Our friends at Rime told us the increased speaking rate is a vestige of the TTS model training process. And although speech speed is an adjustable parameter, we've found that decreasing the speech speed introduces more instability into the generated voice.
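Measuring this yourself is straightforward once you have diarized, time-stamped transcripts. Here's a minimal sketch, assuming each speaker turn carries text plus start and end times (illustrative field names):

```python
# Words per minute for one speaker, from their transcript turns.
def words_per_minute(turns: list[dict]) -> float:
    """`turns` items look like {"text": "...", "start": 1.2, "end": 4.8} (seconds)."""
    total_words = sum(len(t["text"].split()) for t in turns)
    speaking_minutes = sum(t["end"] - t["start"] for t in turns) / 60.0
    return total_words / speaking_minutes if speaking_minutes > 0 else 0.0
```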

Another interesting thing we found in the data is related to pitch. We monitor pitch on our platform so our users can identify when their TTS model becomes unstable. It’s another one of those “we’ve already solved that” problems that we definitely have not solved. If we use 170 Hz as the arbitrary and admittedly very imperfect cutoff for the average frequency between male and female speakers, we find that 51% of human callers are female. That seems representative of human population statistics. On the other hand, the percentage of Voice AIs that use female voices is significantly higher than 51%, probably because male AIs are much more likely to turn out evil (e.g., HAL 9000).
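As an example of how average pitch can be estimated, here's a sketch using librosa's pyin pitch tracker on a single-speaker channel, with the same rough 170 Hz cutoff. The sample rate and frequency bounds are assumptions, not our production settings.

```python
# Estimate a speaker's average fundamental frequency (F0) and apply the crude
# 170 Hz cutoff discussed above. Assumes a mono file for one channel.
import numpy as np
import librosa

def average_pitch_hz(path: str) -> float:
    y, sr = librosa.load(path, sr=16000, mono=True)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    return float(np.nanmean(f0[voiced_flag]))   # mean F0 over voiced frames only

def likely_female(path: str, cutoff_hz: float = 170.0) -> bool:
    return average_pitch_hz(path) > cutoff_hz
```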

Voice AI Conversational Metrics

So far, we've focused on audio metrics in an effort to understand why Voice AI agent calls fail. Now I'd like to turn our attention to what we can learn from the conversational content of failed calls.

On our platform, we alert our customers about call anomalies. If a caller passed through most of the usual call stages, then the call generally has a successful outcome. If it doesn’t, we alert the customer in real-time. These calls have a lot of signal about how to improve the Voice AI agent. Similarly, we alert customers about calls that have a duration that typically results in success, but actually failed. Again, these are high-signal calls for improving Voice AI agents.

We took our dataset and subset it to the calls with anomalies. We know the obvious reasons calls fail (the caller hangs up, the call goes to voicemail). By looking at the anomalies in happy-path calls and long-duration calls, we can find the reasons calls fail even when the caller stays on the line with the AI.
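One of those anomaly filters, long calls that still failed, can be sketched in a few lines of pandas. The column names below are illustrative, not our schema:

```python
# Flag failed calls that lasted as long as a typically successful call.
# Assumes a DataFrame with a boolean `success` column and `duration_s` in seconds.
import pandas as pd

def long_failed_calls(calls: pd.DataFrame) -> pd.DataFrame:
    typical_success_duration = calls.loc[calls["success"], "duration_s"].median()
    anomalies = calls[(~calls["success"]) & (calls["duration_s"] >= typical_success_duration)]
    return anomalies.sort_values("duration_s", ascending=False)
```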

In looking through the data, some of the recurring issues are specific to certain types of Voice AI agents. For example, lead qualification Voice AI agents sometimes fail when the caller does not recall signing up for a service and refuses to confirm their identity. If we disregard the failure cases that seem application-specific, here are the top three general issues we found.

  1. The AI fails to understand the user. This includes ASR errors on specific items. But it also includes cases where the AI fails to recognize a user request that wasn't accounted for in the system prompt. Humans are nature's most unpredictable force. You really have to diligently and continuously analyze production calls to build great Voice AI agents.

  2. The human requests to speak to a representative. Remember, we’re looking at data from fairly long calls. A request to speak to a representative means that the caller attempted to work with the AI but the AI still failed.

  3. Caller does not confirm the next step. Voice AIs are great at asking for the next step. In calls where a human might read the emotions of the caller and refrain from asking for the upsell, the AI just goes for it – even when it clearly should not. This is a limitation of the STT-LLM-TTS stack that requires the utmost care in designing band-aids and workarounds.

Conclusion

Ultimately, building effective Voice AI agents comes down to craftsmanship. There are a lot of rough edges in the models and tooling for building Voice AI agents. After all, we’ve only just started trying to make LLMs speak.

Voice AI developers have lots of levers for building these agents. They can change system prompts, tool call APIs, model configurations, and orchestration stacks. The first step to crafting great Voice AI agents, and winning more revenue, is figuring out what is going wrong. We can help with our Voice AI agent analytics platform.

Thanks for reading! If you have questions or ideas, please reach out to us. We’d love to hear from you.

Tom and Adrian
March 2025