Automatic Speech Recognition For Email Addresses in Voice AI

[Embedded media: asr_spoken_letters]

Voice AI agents don’t understand spoken letters. In our Voice AI agent analytics platform, we’ve seen many production agents fail spectacularly on a caller’s email address. And does it ever infuriate the caller!

Because many letters are acoustically similar (e.g., ‘b’ and ‘v’), even humans have trouble correctly understanding spelled names, emails and mailing addresses. Automatic Speech Recognition (i.e. Speech-To-Text) systems are especially bad at it.

Modern ASR systems are language models. Context is a key ingredient in the transcription process. This works great for transcribing words in sentences. But when there is no context and no meaning, as in the sequence of spoken letters in an email address, they fail.

Transcription of spoken letters is one of the last-mile problems in Voice AI, and a fascinating one. Here is a description of two of our own attempts at addressing it.

Team Of Experts

In our first attempt, we wanted to avoid training a specialized model. That seemed like a lot of work.

Instead, we modeled a solution after the way people solve this problem in real life. If you have a recording of someone saying something you don’t understand, what do you do? You share the recording with a few friends, ask them what they hear, distill their opinions into the most likely result, and try to verify the output.

So we built a system that hands a recording of an email address to many different ASR systems, asks GPT to format the results as email addresses and rank the candidates by likelihood, and finally sends the result to an email verification service to see whether the email exists. It’s a Team-of-Experts approach.
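Sketched in code, the flow is roughly the following. The function and parameter names are ours for illustration only; the individual ASR calls, the LLM ranking step, and the verification call are passed in as plain callables rather than tied to any particular provider.

```python
from typing import Callable, List


def recognize_email(
    audio_path: str,
    transcribers: List[Callable[[str], str]],
    rank_candidates: Callable[[List[str]], List[str]],
    verify: Callable[[str], bool],
) -> str:
    """Team-of-Experts flow: many ASR opinions -> LLM ranking -> verification."""
    # 1. Ask each "expert" ASR system what it hears in the recording.
    transcripts = [transcribe(audio_path) for transcribe in transcribers]

    # 2. Have an LLM turn the transcripts into candidate email addresses,
    #    ordered from most likely to least likely.
    candidates = rank_candidates(transcripts)

    # 3. Prefer the most likely candidate that the verification service
    #    confirms exists; otherwise fall back to the top-ranked guess.
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return candidates[0] if candidates else ""
```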

Dataset

Our test dataset was about 50 recordings of spelled email addresses, mostly from different people in various parts of the US.

Methodology

We wanted to compare our Team of Experts to the current practice in Voice AI. In a production Voice AI agent, the caller’s audio is streamed to the Speech-To-Text service, the text is passed to an LLM, and the LLM’s output is passed to a Text-To-Speech service (i.e., the Voice AI agent cascade).

For our baseline, we sent the entire recording of the email address to Deepgram’s Nova-2-general model, then asked GPT to format the result as an email. In this sense, it doesn’t perfectly emulate current practice: the ASR model gets to look both backward and ahead in the audio, rather than receiving only current and past chunks, which improves accuracy at least somewhat. Moreover, Voice AI developers could run a tool call that asks an LLM to reformat the email address during the call, but that doesn’t seem to be done frequently.
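A minimal sketch of that baseline, assuming Deepgram’s prerecorded /v1/listen endpoint and OpenAI’s chat completions API; the prompt wording and response handling here are illustrative, not the exact code we ran.

```python
import os
import requests


def baseline_email(audio_path: str) -> str:
    # Transcribe the whole (not-streamed) recording with Deepgram Nova-2-general.
    with open(audio_path, "rb") as f:
        dg = requests.post(
            "https://api.deepgram.com/v1/listen?model=nova-2-general",
            headers={
                "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
                "Content-Type": "audio/wav",
            },
            data=f.read(),
        ).json()
    transcript = dg["results"]["channels"][0]["alternatives"][0]["transcript"]

    # Ask GPT to turn the raw transcript into a well-formed email address.
    chat = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4o",
            "messages": [{
                "role": "user",
                "content": (
                    "This is an ASR transcript of someone spelling their "
                    f"email address: '{transcript}'. "
                    "Reply with only the most likely email address."
                ),
            }],
        },
    ).json()
    return chat["choices"][0]["message"]["content"].strip()
```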

We tried a variety of permutations for our Team of Experts, including commercial providers and open-source models like wav2letter that take different approaches. Our best-performing pipeline first transcribed the entire (not-streamed) email address with Deepgram Nova-1-phonecall (more accurate than other Deepgram models at letter transcription), AssemblyAI’s Universal-1 (best tier) model, and OpenAI’s audio-in, text-out GPT-4o-audio-preview-2024-10-01. Then the pipeline asked GPT-4o to generate and rank the most likely email addresses from the three transcriptions. Finally, it sent the candidates to Emailable for verification. The output was the most likely verified email; if no emails were verified, the output was the most likely email.
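The ranking and verification stages look roughly like the sketch below (the three transcription calls are omitted). The prompt wording, the response parsing, and the Emailable /v1/verify request shown here are assumptions for illustration, not a verbatim copy of our pipeline.

```python
import os
import requests


def rank_email_candidates(transcripts: list[str]) -> list[str]:
    # GPT-4o merges the three ASR opinions into a ranked list of candidates.
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4o",
            "messages": [{
                "role": "user",
                "content": (
                    "Three ASR systems transcribed the same spelled-out email address:\n"
                    + "\n".join(transcripts)
                    + "\nList the most likely email addresses, one per line, "
                    "most likely first. Output only the addresses."
                ),
            }],
        },
    ).json()
    text = resp["choices"][0]["message"]["content"]
    return [line.strip() for line in text.splitlines() if "@" in line]


def most_likely_verified(candidates: list[str]) -> str:
    # Return the top-ranked candidate that the verification service reports
    # as deliverable; if none verify, fall back to the top-ranked candidate.
    for email in candidates:
        check = requests.get(
            "https://api.emailable.com/v1/verify",
            params={"email": email, "api_key": os.environ["EMAILABLE_API_KEY"]},
        ).json()
        if check.get("state") == "deliverable":  # assumed response field
            return email
    return candidates[0] if candidates else ""
```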

Note that we didn’t preprocess the audio in our pipeline. Even the most sophisticated noise reducing models and voice isolators introduced distortions to the speaker’s audio that negatively affected transcription accuracy.

Results

The baseline approach, which emulates current practice in Voice AI, produced the correct email address as its final output for 53% of the recordings.

Our Team of Experts approach produced the correct email address for 74% of the recordings.

The academic literature seems to point to roughly 90% for human-level recognition of spelled letters over the phone (we couldn’t find a definitive source). Both the current-practice baseline and our Team of Experts approach (the blind leading the blind, perhaps) were well below that.

An ASR Model For Spoken Letters

One reason ASR systems fail to accurately transcribe spoken letters is that they try to find context where there is none. They’re looking up at the clouds, seeing faces and animals.

One of the things we learned from our last company was the unreasonable effectiveness of computer vision models. We trained models that looked at iPhone pictures of crops to assess their thirstiness or diagnose viral disease symptoms. Computer vision has handily conquered handwritten digit recognition on the MNIST dataset. Could computer vision learn to recognize visual representations of individual spoken letters better than modern ASR systems do? Converting the audio to a spectrogram is generally part of an ASR pipeline, so what if we just stopped there rather than bringing in transformers and looking for context?
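To make “stopping at the spectrogram” concrete, here is a minimal sketch that turns a wav file into a mel spectrogram image a vision model could train on. The use of librosa and the 8 kHz telephone sample rate are our choices for illustration, not necessarily what our pipeline used.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np


def audio_to_spectrogram_image(wav_path: str, out_path: str) -> None:
    # Load the audio at 8 kHz, a typical telephone sample rate.
    y, sr = librosa.load(wav_path, sr=8000)

    # Mel spectrogram: time on the x-axis, frequency on the y-axis.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    S_db = librosa.power_to_db(S, ref=np.max)

    # Save the spectrogram as an image a computer vision model can consume.
    plt.figure(figsize=(4, 2))
    librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close()
```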

Dataset

At first, we thought we could just generate individual files of spoken letters with Text-To-Speech models. But this had some obvious problems.

First, nobody in real life says their email address like they’re narrating a bedtime story. For this project, I listened to a lot of synthetically-generated audio from the best TTS providers. It was painfully clear just how much training content for even the highest-praised TTS models comes from people reading books aloud.

Second, check out the video above. Is this how you say the name of one of the world’s most valuable companies? If TTS is getting ‘apple’ wrong, what else is it screwing up when your Voice AI agent speaks to callers? The state of Voice AI is developers just hacking these systems together with no appreciation of the final user experience. The code runs, let’s ship it! Why should I be bothered with what my users think?! I don’t have time to listen to calls! Who needs a boat that floats when we’ve got this bucket for bailing water?! Fortunately, for the real craftspeople, there is tooling that helps you see your agent’s problems and fix them.

So instead of using TTS models to generate the training data, we found an old academic dataset of about 1,000 real humans reciting the alphabet over the phone. We segmented the audio files into individual letters. Next, we wanted to approximate the data we would get from an audio frame processor, so we concatenated three to four individual letters in random order, sometimes leaving partial letters at the start and end. Lastly, we converted the wav files to spectrograms, which are time (x-axis) by frequency (y-axis) visual representations of the audio files.
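A sketch of that concatenation step, assuming each letter has already been segmented into its own mono wav file at a common sample rate; the clipping probabilities and file layout here are illustrative.

```python
import random

import numpy as np
import soundfile as sf


def make_training_clip(letter_wavs: dict[str, str], out_path: str,
                       sr: int = 8000) -> list[str]:
    """Concatenate 3-4 random letter recordings into one training clip."""
    letters = random.sample(list(letter_wavs), k=random.choice([3, 4]))
    segments = []
    for i, letter in enumerate(letters):
        audio, _ = sf.read(letter_wavs[letter])  # assumes mono wavs at `sr`
        # Sometimes keep only part of the first / last letter, the way a
        # streaming frame boundary would cut a letter off mid-utterance.
        if i == 0 and random.random() < 0.3:
            audio = audio[len(audio) // 2:]
        if i == len(letters) - 1 and random.random() < 0.3:
            audio = audio[: len(audio) // 2]
        segments.append(audio)
    sf.write(out_path, np.concatenate(segments), sr)
    return letters  # the labels for this clip
```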

Methodology

We split the dataset into training, validation, and test sets, keeping speakers disjoint: if a speaker was in the training data, that speaker appeared only in the training data, not in the validation or test data. In other words, when we tested the model, it had never encountered the test speakers during training. We decided to train an object detection model because we didn’t want to introduce potential error from a separate model that segments letters (VAD models also struggle with spoken letters); we wanted just one model that did both segmentation and classification. We used Google’s Vertex AI to train the model.
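One way to express that speaker-disjoint split is with scikit-learn’s GroupShuffleSplit; the proportions below are illustrative rather than our exact split.

```python
from sklearn.model_selection import GroupShuffleSplit


def split_by_speaker(clips, speakers, seed=0):
    """Return (train, val, test) index lists with no speaker shared across splits."""
    # Carve off ~20% of speakers as a held-out pool.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, held_idx = next(outer.split(clips, groups=speakers))

    # Split the held-out pool's speakers evenly into validation and test.
    held_clips = [clips[i] for i in held_idx]
    held_speakers = [speakers[i] for i in held_idx]
    inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_rel, test_rel = next(inner.split(held_clips, groups=held_speakers))

    return (list(train_idx),
            [held_idx[i] for i in val_rel],
            [held_idx[i] for i in test_rel])
```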

Results

[Embedded media: asr_spoken_letters]

The precision and recall were both 87%. This is on par with the approximately 90% human-level accuracy for spoken letter recognition over the phone.

In this example, the test audio file was C D B. When I hover over a label on the right of the screen, you can see the bounding box light up on the spectrogram on the left side of the screen. C D B. Nailed it. First try. And look at how well it finds the boundaries between letters.

In this example, the test audio file was F N B. Like ASR systems and humans, the model struggles with distinguishing between ‘b’ and ‘v’.

What happens when you hand the model a frame of someone spelling their email address? The person in this recording slowly said the start of their email address, A N. The model identifies the two letters as separate, but mistakes the first letter for ‘n’ instead of the actual ‘a’. It gets the second letter correct.

And what happens when you hand the model a frame of someone spelling their email address at normal speed? Answer: total failure. All these labels are wrong. My training dataset expects pauses between letters.

Conclusion

We think the spectrogram approach could work for identifying spoken letters. After all, computer vision made easy work of the MNIST handwritten digit dataset. However, spectrograms of spoken letters are a lot trickier than images of written characters. For one thing, people speak in cursive; even when you ask them to speak slowly, they still blend letters together. And then there are issues with audio quality, accents, and so on. We think that if we grew the dataset ten to one hundred times, we might be able to get there. But that’s a much bigger project than we can do on the side while building our core Voice AI agent analytics product.

Thanks for reading! If you have questions or ideas about this project, please reach out to us. We’d love to hear from you.

Tom and Adrian
February 2025