How To Run A/B Testing For Voice AI Agents
The biggest problem in Voice AI is getting the agent to reliably do its job.
Perhaps you find yourself in a similar position to most Voice AI developers. You’ve landed some stunning logos, but your customers have only turned on your agent for a few of their stores. You’re getting their innovation budget, but not the operational budget that’ll make or break your A round. Your customers just don’t trust your Voice AI agents because they still do unexpected things when they interact with real humans.
There are an infinite number of permutations you can try to make your Voice AI agent more reliable. Which changes should you make to your system prompt, your tool call APIs, your model configurations (e.g., TTS voice selection), and your orchestration stack? And how do you know which change actually moved your Voice AI agent closer to consistently executing its objectives?
The only way to know is to test different versions of your Voice AI agent as it interacts with real humans. Offline testing and simulation won’t get you there.
A/B testing (i.e., split testing) helps you make data-driven decisions about your Voice AI assistants. In this blog post, we walk you through how to use split testing on production call data to deliver the most effective Voice AI agents.
Voice AI agents are the new websites. Just like websites, a Voice AI agent is a means for customers to learn more about a business or transact with it. In website development, split testing is a common practice for building great websites with high conversion rates. The same process can be applied to Voice AI agents.
In A/B testing, we start with our production Voice AI agent and a hypothesis about how to make it better. Then we create a new version of the agent. Next, we create controlled experiments (i.e., campaigns) by splitting traffic between the production version and the new variant. Finally, we measure the impact on key metrics.
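As a back-of-the-envelope illustration of the traffic-splitting step, here is a minimal sketch of how incoming calls might be routed between the two versions. The `assign_variant` helper, the 50/50 split, and the caller-ID hashing are assumptions for illustration only, not part of Canonical AI's product.

```python
import hashlib

# Hypothetical traffic splitter: deterministically assigns each caller to the
# production agent or the new variant by hashing the caller ID, so repeat
# callers always hear the same version.
VARIANTS = {"production": 0.5, "variant_b": 0.5}  # 50/50 split (assumption)

def assign_variant(caller_id: str) -> str:
    # Hash the caller ID into a number in [0, 1) and walk the cumulative shares.
    digest = hashlib.sha256(caller_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for variant, share in VARIANTS.items():
        cumulative += share
        if bucket < cumulative:
            return variant
    return "production"  # fallback for the bucket == 1.0 edge case

print(assign_variant("+15551234567"))  # e.g. "variant_b"
```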
Create Hypothesis
The first step is to create a hypothesis. What is a plausible reason your agent isn’t performing like you want? Your hypothesis may call for nothing more than a simple change to the system prompt. Or you may want to rethink one of your tool call APIs. Or maybe you need to fiddle with the rate of speech generation. Or maybe you need to migrate to a new orchestration stack.
Let’s use one of my toy Voice AI agents as an example. I built a Voice AI agent that interviews YC founders about their product, their company, or their founder journey. It then turns the interview into social media posts. In some interviews, the founder opened up and the resulting social content was interesting. In other interviews, the founder was closed off and skeptical, and the resulting content was bland.
My hypothesis was that the Voice AI assistant didn’t seem personable enough. So I created a new version of my agent that first asks the founder how their day is going.
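To make the change concrete, here is roughly what the two variants looked like at the system-prompt level. The prompt text below is a simplified stand-in for illustration, not the exact prompt from my agent.

```python
# Illustrative only: the real system prompts are longer and more detailed.
PROMPT_A = (  # production version
    "You are an interviewer for YC founders. Ask about their product, "
    "their company, and their founder journey."
)

PROMPT_B = PROMPT_A + (  # new variant under test
    " Before diving into the interview, warmly ask the founder how their "
    "day is going and respond to what they share."
)
```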
Create A/B Testing Campaign For Your Voice AI Agent
Once you have your hypothesis, then you need to create a campaign. On the Canonical AI dashboard, click on Campaigns at the top of the dashboard, then click on Create Campaign.
Next, type in the name of the campaign and a description. You’ll end up having a lot of campaigns, so we recommend verbose inputs. Note that neither of these fields is used in any LLM prompt downstream.
For my podcast host example, I named my campaign ‘Podcast Guest Authenticity’.
In the Campaign Outcomes section, we add metrics for measuring changes in the agent’s performance. In other words, what outcome are you hoping your new version will achieve more often than your current production version?
For my podcast host agent, I wanted to see if more calls led to the founder sharing something personal about themselves.
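If it helps to picture the metric, here is a minimal sketch of how a binary campaign outcome could be tallied per variant from labeled call records. The `Call` structure and the hard-coded labels are assumptions for illustration; in practice the outcome labels would come from reviewing transcripts, not from code like this.

```python
from dataclasses import dataclass

# Hypothetical call record with a binary campaign outcome attached.
@dataclass
class Call:
    variant: str            # "production" or "variant_b"
    shared_personal: bool   # outcome: did the founder share something personal?

def outcome_rate(calls: list[Call], variant: str) -> float:
    # Fraction of this variant's calls that achieved the outcome.
    subset = [c for c in calls if c.variant == variant]
    return sum(c.shared_personal for c in subset) / len(subset)

calls = [
    Call("production", False), Call("production", True),
    Call("variant_b", True),   Call("variant_b", True),
]
print(outcome_rate(calls, "production"))  # 0.5
print(outcome_rate(calls, "variant_b"))   # 1.0
```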
We now specify whether we want to compare two different agents, or to test the same agent across two periods of time (i.e., before and after we made a change to it).
In my example, I uploaded the two versions of my agent as two different agents. Hence, I’m comparing two different agents in my Voice AI A/B test.
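Here is a small sketch of the difference between the two comparison modes, assuming hypothetical `agent_id` and `started_at` fields on each call record (these field names are my own, not Canonical AI's schema).

```python
from datetime import datetime

# Mode 1: compare two different agents by their IDs.
def split_by_agent(calls, agent_a_id, agent_b_id):
    a = [c for c in calls if c["agent_id"] == agent_a_id]
    b = [c for c in calls if c["agent_id"] == agent_b_id]
    return a, b

# Mode 2: compare the same agent before and after a change was deployed.
def split_by_time(calls, change_deployed_at: datetime):
    a = [c for c in calls if c["started_at"] < change_deployed_at]   # before
    b = [c for c in calls if c["started_at"] >= change_deployed_at]  # after
    return a, b
```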
Interpret Results from Your Voice AI Split Testing Campaign
Finally, here’s the most fun part: learning from the results! You can see whether the change to your agent led to a change in users’ behavior, and you can use the results to decide whether to merge the new version of your agent into production.
In my example, we see that asking the founder about their day didn’t lead to a statistically significant increase in the rate of founders sharing more about themselves. People are nature’s most unpredictable force. They never do what you expect. I guess it’s back to the drawing board.
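To make “statistically significant” concrete, here is a sketch of the kind of check you could run yourself on two outcome rates: a two-sided two-proportion z-test. The call counts below are made up for illustration and are not my actual campaign numbers.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two outcome rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up numbers: 18/60 production calls vs 24/62 variant calls hit the outcome.
z, p = two_proportion_z_test(18, 60, 24, 62)
print(f"z = {z:.2f}, p = {p:.3f}")  # p > 0.05 here, so no significant difference
```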
Next Steps
Garry Tan recently posted that building great products requires testing with real users. You can’t figure out what people want from simulations – at least not until we have AGI. You have to try new ideas and see if real people want what you’ve built.
We’d love it if you tried out A/B testing on our platform. It’s easy to upload calls for two versions of your agent with our GUI and then create your first campaign. Click on sign up, then click on upload calls, and follow the steps from this post. And when you’re ready, integration is easy!
And we love meeting Voice AI builders. Please reach out to us!
Tom and Adrian
January 2025