The Complete Guide To AI Voice Evaluation
Most Voice AI developers are flying blind: they don't know how their product is performing in production. For most teams, Voice AI evaluation and quality control means listening to a few randomly selected calls every two weeks. Or worse, waiting until the customer complains.
This is not how to build a great Voice AI agent:
- Wait to receive a complaint from a customer.
- Listen to the call recording from the complaint.
- Guess what went wrong from a sample size of n = 1.
You don't want to find out from your customer that a caller said to your Voice AI agent, "Transfer me to a carbon-based life form!" This is one of the most important lessons I learned from years of running tech products: when something goes wrong with your product, the worst thing you can do is hope your customer doesn't notice.
And with more and more developers swarming into Voice AI, agent quality will be the differentiator that closes deals.
This blog post is a guide on how to analyze and improve your Voice AI assistant.
Improving Your Voice AI Agent Through Data-Driven Analysis
Great product people are obsessed with finding and fixing the rough edges of their UI and UX. I’ve seen a very broad range of quality in production Voice AIs, and craftsmanship is what sets the great agents apart. The best Voice AI developers are continuously finding and fixing issues with their Voice AIs in production. Every day they find a new surprise in how humans interact with their agent, and every day they sand down the rough edges: improving their system prompts, pushing changes to their tool call APIs, reconfiguring their third-party model providers, and so on.
To improve your Voice AI agent, use these analytical methods and data visualizations.
Voice AI Caller Journey Maps
The first step to improving your Voice AI agent is simply to orient yourself in how it is performing. Caller journey maps show you how callers are interacting with your agent at scale. The best Voice AI developers we’ve met are constantly refreshing the caller journey maps on our dashboard, fixated on learning what their agent is doing and how they can improve it.
Caller journey maps start with understanding the stages of a call. The call stages are categories of the interactions callers have with your Voice AI assistant. For example, if you’re building a Voice AI agent for answering a service business’s after-hours calls, then the stages may be something like: Greeting, Determine Call Type (i.e., Emergency, Non-Emergency), Transfer To On-Call Manager For Emergency, Schedule Callback for Non-Emergency, Conclude Call.
When a Voice AI developer integrates a new assistant with our platform, we take the first fifteen calls to determine the call stages (developers can edit the stages to their liking afterwards). Once we have the call stages, we assign each conversation turn in a call to one of the stages. This yields a call path for each call. Add all the call paths to one visualization and you get a caller journey map.
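To make this concrete, here is a minimal sketch of turn-by-turn stage assignment, using the OpenAI Python SDK as a stand-in LLM. The stage list, prompts, and helpers are illustrative assumptions, not our production pipeline:

```python
# A minimal sketch: classify each conversation turn into a stage with an LLM,
# then collapse consecutive labels into a call path. Stage list and prompts
# are illustrative, not Canonical AI's production pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

STAGES = [
    "Greeting", "Determine Call Type", "Transfer To On-Call Manager",
    "Schedule Callback", "Conclude Call",
]

def classify_turn(turn_text: str) -> str:
    """Ask the LLM which stage a single turn belongs to."""
    prompt = (
        f"Classify this conversation turn into exactly one stage from {STAGES}. "
        f"Reply with the stage name only.\n\nTurn: {turn_text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def call_path(turns: list[str]) -> list[str]:
    """Label every turn, then deduplicate consecutive stages into a path."""
    labels = [classify_turn(t) for t in turns]
    path = labels[:1]
    for label in labels[1:]:
        if label != path[-1]:
            path.append(label)
    return path
```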
We have two types of caller journey map visualizations: Call Flows and Call Maps.
Call Flows
Call Flows are Sankey diagrams of your calls. They quickly give you the big picture of what’s happening in your Voice AI calls. If you’re wondering what your agent did yesterday and how it performed, look at the Sankey diagram. From a Sankey diagram of a Voice AI agent, I can readily see the happy path most calls go down, and I can easily check that the stages progress the way I expect.
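Here is a rough sketch of how call paths could be rolled up into a Sankey diagram with Plotly. The example paths are hypothetical, shaped like the output of the `call_path()` helper sketched above:

```python
# A sketch: count stage-to-stage transitions across call paths and render
# them as a Sankey diagram with Plotly. The paths below are hypothetical.
from collections import Counter
import plotly.graph_objects as go

paths = [
    ["Greeting", "Determine Call Type", "Schedule Callback", "Conclude Call"],
    ["Greeting", "Determine Call Type", "Schedule Callback", "Conclude Call"],
    ["Greeting", "Determine Call Type", "Transfer To On-Call Manager"],
]

transitions = Counter()
for path in paths:
    for src, dst in zip(path, path[1:]):
        transitions[(src, dst)] += 1

stages = sorted({stage for pair in transitions for stage in pair})
index = {stage: i for i, stage in enumerate(stages)}

fig = go.Figure(go.Sankey(
    node=dict(label=stages),
    link=dict(
        source=[index[src] for src, _ in transitions],
        target=[index[dst] for _, dst in transitions],
        value=list(transitions.values()),
    ),
))
fig.show()
```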
Call Maps
Call Maps are Mermaid diagrams of the calls. Although Call Maps are less visually appealing than Call Flows, I personally prefer them for surfacing which calls need attention. When I look at a Call Map, I can do the following (a sketch of how a Call Map could be generated follows this list).
- Identify where callers are dropping off. The red node marks the stage where the most calls dropped off; the yellow nodes mark any stage where at least one call dropped off.
- Spot unexpected branches off the happy path (i.e., the sad paths). Humans can take a call in unexpected directions. The branches off the main path tell you what cases need to be improved in your system prompt.
- Inspect individual calls on drop-off nodes and sad paths. The most consistent way to improve an agent is by digging into the data in calls that did not go as expected.
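For the curious, here is a simplified sketch of emitting a Call Map as Mermaid text, with the drop-off coloring described above. The "Conclude Call" terminal stage, the drop-off rule, and the colors are illustrative assumptions:

```python
# A simplified sketch: emit a Mermaid flowchart from call paths, coloring the
# stage with the most drop-offs red and any other drop-off stage yellow.
from collections import Counter

def to_mermaid(paths: list[list[str]]) -> str:
    edges = Counter()
    dropoffs = Counter()
    for path in paths:
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
        if path[-1] != "Conclude Call":  # call never reached a clean ending
            dropoffs[path[-1]] += 1

    ids: dict[str, str] = {}  # stable short node ids for Mermaid
    def node_id(stage: str) -> str:
        return ids.setdefault(stage, f"n{len(ids)}")

    lines = ["flowchart TD"]
    for (src, dst), count in edges.items():
        lines.append(f'    {node_id(src)}["{src}"] -->|{count}| {node_id(dst)}["{dst}"]')

    worst = max(dropoffs, key=dropoffs.get, default=None)
    for stage, nid in ids.items():
        if stage == worst:
            lines.append(f"    style {nid} fill:#f88")  # red: most drop-offs
        elif stage in dropoffs:
            lines.append(f"    style {nid} fill:#ff8")  # yellow: any drop-off
    return "\n".join(lines)

print(to_mermaid([
    ["Greeting", "Determine Call Type", "Schedule Callback", "Conclude Call"],
    ["Greeting", "Determine Call Type"],  # a drop-off during call typing
]))
```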
Voice AI Call Outcomes
Ultimately, you and the business for whom you built the Voice AI agent care most about the outcome of the call. Was it successful? Did the Voice AI assistant fulfill the caller’s objective?
Voice orchestration platforms provide call outcome analysis for each call. Speaking from my own experience building Voice AI assistants, and from talking to, well, any and all Voice AI builders, the call outcome analysis on these platforms is inaccurate and inconsistent. This is unacceptable if you’re getting paid based on outcome (e.g., for each call where the caller agrees to something).
Here’s the issue with out-of-the-box evals from the voice orchestration platforms. It’s the same problem you encounter when using LLM eval platforms on data from multi-turn conversations. They work well on short inputs, say a user input of one or a few sentences. But LLMs get lost in longer inputs, like multi-turn Voice AI conversations.
To classify call outcomes accurately, we take a novel approach: break the problem into smaller pieces. Whether you’re human or AI, that’s the best way to solve a hard problem.
When developers upload calls for a new Voice AI agent to our platform, we determine the default call stages, as I mentioned above. In addition, we determine the potential results of the call. These are not judgment-based results; they are factual results. For example, the list of potential results may be: 'Voice AI scheduled a callback' and 'Voice AI did not schedule a callback'. (Like the call stages, the Voice AI developer can edit the list of results.)
Then, for each call that gets sent to our platform, we ask the LLM to classify the call as one of the potential results. For example, did the call end in a callback getting scheduled?
Finally, we ask the LLM to classify the result as either a successful outcome, where the caller’s objective was met, or a failed outcome, where it was not. In the end, we get an accurate classification of the call outcome.
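Here is a minimal sketch of that two-step classification, again with the OpenAI SDK standing in for the LLM. The prompts and result list are illustrative, not our exact pipeline:

```python
# A sketch of the two-step outcome classification: first pin down the factual
# result, then judge success against the caller's objective.
from openai import OpenAI

client = OpenAI()

RESULTS = [
    "Voice AI scheduled a callback",
    "Voice AI did not schedule a callback",
]

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def classify_outcome(transcript: str) -> dict:
    # Step 1: a factual question, far easier for an LLM than a judgment call
    # over the whole conversation.
    result = ask(
        f"Which of these results describes the call? {RESULTS}\n"
        f"Reply with one result verbatim.\n\nTranscript:\n{transcript}"
    )
    # Step 2: a judgment question over the already-distilled result.
    verdict = ask(
        f"The call ended with: '{result}'. Given the transcript below, was the "
        "caller's objective met? Reply 'success' or 'failure'.\n\n"
        f"Transcript:\n{transcript}"
    )
    return {"result": result, "outcome": verdict}
```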
The best Voice AI developers are most interested in the failed call outcomes. It is in the failures that you find the insights to improve your agent. With our sunburst plots, we subset to just the failed calls and make it easy to see what went wrong.
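As one way to picture such a view, here is a sketch of a sunburst over failed calls with Plotly Express, where each ring is one step deeper into the call path. The failed-call paths and column layout are assumptions:

```python
# A sketch: subset to failed calls and render their paths as a sunburst with
# Plotly Express, one ring per step into the call. Paths are illustrative.
import pandas as pd
import plotly.express as px

failed_paths = [
    ["Greeting", "Determine Call Type", "Schedule Callback"],
    ["Greeting", "Determine Call Type", "Transfer To On-Call Manager"],
    ["Greeting", "Determine Call Type", "Schedule Callback"],
]

df = pd.DataFrame(failed_paths, columns=["step_1", "step_2", "step_3"])
fig = px.sunburst(df, path=["step_1", "step_2", "step_3"])  # rows counted per leaf
fig.show()
```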
Voice AI Custom Analysis Metrics
When a caller interacts with your Voice AI, it’s a journey. Building a great Voice AI means tracking milestones in the journey. It isn’t enough just to know the outcome. It’s important to know how the caller reached the destination. This is where our user-defined custom metrics are helpful.
One of my favorite custom metrics is the following: ‘Did the caller ask to speak to a representative?’. The LLM-based era of Voice AI is still nascent. Many callers ask to speak to a representative just because they have such low expectations for Voice AI after years of fighting with IVRs. But other callers ask to speak with a representative because your Voice AI did something unexpected, like taking too long to respond.
On our platform, Voice AI developers use our custom metrics to track milestones in their Voice AI calls. Rather than sifting through WAV files looking for cases where a particular event happened in a call, Voice AI builders can simply subset to the calls where the custom metric evaluated as true.
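Here is a sketch of how such a boolean custom metric could be evaluated and used to filter calls. The metric wording, model, and the tiny call store are all illustrative:

```python
# A sketch of a user-defined boolean metric and the subsetting it enables.
from openai import OpenAI

client = OpenAI()

METRIC = ("Did the caller ask to speak to a human representative? "
          "Reply 'true' or 'false'.")

def evaluate_metric(transcript: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"{METRIC}\n\nTranscript:\n{transcript}"}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower() == "true"

calls = {  # hypothetical transcripts keyed by call id
    "call_001": "Caller: Can I talk to a real person?\nAgent: Of course...",
    "call_002": "Caller: I'd like to book a callback.\nAgent: Sure...",
}
flagged = [cid for cid, text in calls.items() if evaluate_metric(text)]
print(flagged)  # e.g. ["call_001"]
```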
Voice AI Automated Anomaly Detection
Here are two types of Voice AI calls that normally produce successful outcomes:
- Calls with duration at or above the 75th percentile.
- Calls that proceed through all stages of the most common path (i.e., the happiest of happy paths).
When a Voice AI call is long, or follows the happy path, but still doesn't succeed, it tells us that something went wrong. These are the calls you want to examine as the Voice AI developer. These calls are solid gold for improving your agent.
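A small sketch of that anomaly rule: flag failed calls whose duration is at or above the 75th percentile, or whose path matches the most common (happiest) path. The `Call` record is a hypothetical shape, not our schema:

```python
# A sketch of the anomaly rule: failed calls that were long (>= 75th
# percentile duration) or followed the most common path.
from collections import Counter
from dataclasses import dataclass

import numpy as np

@dataclass
class Call:
    call_id: str
    duration_s: float
    path: tuple[str, ...]
    outcome: str  # "success" or "failure"

def flag_anomalies(calls: list[Call]) -> list[Call]:
    p75 = np.percentile([c.duration_s for c in calls], 75)
    # The most common full path = the happiest of happy paths.
    happy_path = Counter(c.path for c in calls).most_common(1)[0][0]
    return [
        c for c in calls
        if c.outcome == "failure"
        and (c.duration_s >= p75 or c.path == happy_path)
    ]
```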
It’s Silicon Valley gospel to talk to your users. But there are so many demands on builders that they often go weeks without talking to a customer. Similarly, Voice AI developers know they should be spending more time with their call data, but each Monday’s ambition to get it done somehow slips away by the next weekend.
So we’ve built ways to spoon-feed you the calls that need your attention. With our Slack integration, we identify calls that should have ended in a successful outcome but instead had a failed outcome, and we push them to a dedicated channel in your Slack workspace. You can also build your own custom integration with webhooks.
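As an illustration, here is how flagged calls could be pushed to Slack with a standard incoming webhook, reusing the `flag_anomalies()` helper from the sketch above. The webhook URL is a placeholder, and the message format is not our integration's actual payload:

```python
# A sketch of pushing flagged calls to a Slack channel via a standard
# incoming webhook.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def notify(call_id: str, reason: str) -> None:
    message = f":rotating_light: Call {call_id} needs attention: {reason}"
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    resp.raise_for_status()

for call in flag_anomalies(all_calls):  # all_calls: list[Call] from your store
    notify(call.call_id, "expected a success, got a failed outcome")
```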
Voice AI Audio Metrics (Latency, Interruptions, Silence, and More)
As a shortcut, some developers read the transcripts rather than listening to calls. But if you’re only reading the transcript of a Voice AI call, you’re missing most of the story.
The power and peril of Voice AI is in the richness of voice as a medium. Compared to text, voice is a far more information-rich medium for communication. When we speak, we communicate much more than just the words. We are not Vulcans. We communicate with our pauses, our tone, our inflections, our pace, and so on.
The Canonical AI platform processes the audio to distill the richness of voice interactions into metrics. Latency is always highest on the list of concerns for Voice AI builders. But other audio metrics offer a trove of insights for improving your Voice AI agent.
- Interruptions are a strong indicator of the caller’s experience (and frustration level).
- The percent of the caller’s time spent silent, rather than speaking, tells you if the caller is confused or finding it necessary to coach your Voice AI towards their objective.
- Call duration, as mentioned earlier, has predictive power of call success likelihood.
- ASR model confidence helps identify transcription errors and words you should add to the speech-to-text custom dictionary.
- Signal-to-noise ratio (SNR) tells you about background noise on the caller side. One of our customers changed the time of day their outbound agent called because background noise (and hence ASR errors) was worst at particular times of day.
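To make one of these metrics concrete, here is a rough sketch of estimating the caller's silence percentage from a mono caller-channel WAV with a simple RMS energy threshold. A production system would use a real voice activity detector; the file name and threshold are guesses:

```python
# A rough sketch: percent of the caller's channel spent silent, estimated
# with an RMS energy threshold over 30 ms frames. Assumes a mono, 16-bit WAV.
import wave

import numpy as np

def silence_percent(path: str, frame_ms: int = 30, threshold: float = 0.01) -> float:
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        samples = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0  # normalize to [-1, 1]
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))  # energy per frame
    return 100.0 * float((rms < threshold).mean())

print(f"Caller silent {silence_percent('caller_channel.wav'):.1f}% of the call")
```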
By now, reader, you have likely picked up on the fact that we love data visualizations. Raincloud plots are a wonderful new type of data visualization. Here is an example raincloud plot of call duration for a Voice AI agent. The probability density distribution is the cloud, the individual data points are the rain, and the boxplot is the earth. Our raincloud plots are a beautiful way to understand the rich, multifaceted interactions people are having with your Voice AI agent.
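If you want to roll your own, here is a sketch of a raincloud plot in plain matplotlib with synthetic durations: a half-violin for the cloud, jittered points for the rain, and a boxplot for the earth:

```python
# A sketch of a raincloud plot in plain matplotlib: a half-violin (cloud),
# jittered points (rain), and a boxplot (earth). Durations are synthetic.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
durations = rng.lognormal(mean=4.0, sigma=0.5, size=200)  # fake durations (s)

fig, ax = plt.subplots(figsize=(8, 3))

# Cloud: a horizontal violin clipped to its upper half.
parts = ax.violinplot(durations, positions=[0], vert=False, showextrema=False)
for body in parts["bodies"]:
    verts = body.get_paths()[0].vertices
    verts[:, 1] = np.clip(verts[:, 1], 0, None)  # keep only the top half

# Rain: individual calls, jittered below the cloud.
ax.scatter(durations, rng.normal(-0.25, 0.04, len(durations)), s=6, alpha=0.4)

# Earth: a thin boxplot between cloud and rain.
ax.boxplot(durations, positions=[-0.1], vert=False, widths=0.08, showfliers=False)

ax.set_yticks([])
ax.set_xlabel("Call duration (s)")
plt.show()
```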
Why We're Building This
Voice AI assistants are the future, and we’re seeing the early signs of it. Voice AI developers are landing contracts with accounts that have enormous revenue potential. But what’s the biggest problem we’re all facing as an emerging industry? Getting customers to scale up usage of Voice AI assistants.
The SMBs and enterprises exploring Voice AI assistants have their brand reputation and sales funnels on the line. They can't afford for your Voice AI assistant to mess up, so they funnel just a small portion of their call volume to Voice AI. And because they only hear a small sample of calls, they never learn to trust what the Voice AI is doing.
But there’s a solution. We’re seeing it working with our users. You can embed all the data visualizations and analyses from this post in your customer-facing dashboards. When Voice AI builders give their customers an easy way to see what's happening in the Voice AI assistant calls, it builds trust in the assistant and it unlocks growth. It's that simple.
Next Step
The winners in Voice AI will be the ones who obsess over their product's rough edges and continuously improve their Voice AI assistant based on real user interactions. If you're not showing your customers what's happening in the calls, you're missing deals and losing out on upsell opportunities. The future belongs to Voice AI developers who embrace transparency and data-driven improvement.
Try out our platform. It’s easy to upload calls with our GUI. Click on sign up, then click on upload calls. And when you’re ready, integration is easy!
If you’d like to embed our data visualizations in your dashboard, here is how to get started.
We love meeting Voice AI builders. Please reach out to us!
Tom and Adrian
January 2025