Modeling Human Communication To Build Context-Awareness Into LLM Caching
Semantic caching for simple, single-turn LLM applications, like question and answer apps, is easy. All you need is vector similarity search. The semantic cache that comes with your database software works fine for these applications.
But getting semantic caching to work in multi-turn, interactive Generative AI applications is a different beast. Effective LLM caching for conversational AI requires understanding the context of each user request. After all, the phrase, “how much does it cost?” can mean two very different things at different points in the same conversation.
How do you make semantic caching work for interactive, real-time conversational AI, like Voice AI? We’ve looked to human communication for inspiration. Here are a few of the mechanisms we’ve developed at Canonical AI to add contextual awareness to semantic caching.
A Guide To Semantic Caching
Before we discuss the complexity of semantic caching for interactive context-aware AI, let’s revisit the basics of semantic caching. Here are the steps.
- When a user makes a request, the code first searches the semantic cache for questions with the same intent.
- If a question with the same intent is found in the cache, regardless of how it was phrased, then send the answer from the cache to the user.
- If the question is not found in the semantic cache, then send the query to the LLM.
- Send the LLM response to the user.
- Update the cache with the user query and the LLM response. That way, the next time someone asks a question with the same intent, the code can return the answer from the semantic cache.
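Here’s a minimal sketch of those steps in Python. The toy embed() function, the similarity threshold, and the in-memory lists are simplified stand-ins; a production cache would use a real embedding model and a vector database.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy bag-of-words embedding for illustration only; in practice,
    # use a real sentence-embedding model.
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

SIMILARITY_THRESHOLD = 0.9          # assumed value; tune for your application
cache_keys: list[np.ndarray] = []   # embedded user queries
cache_values: list[str] = []        # corresponding LLM responses

def answer(user_query: str, call_llm) -> str:
    query_vec = embed(user_query)
    # Search the cache for a previous query with the same intent.
    if cache_keys:
        scores = [float(np.dot(query_vec, key)) for key in cache_keys]
        best = int(np.argmax(scores))
        if scores[best] >= SIMILARITY_THRESHOLD:
            return cache_values[best]           # cache hit
    # Cache miss: send the query to the LLM.
    response = call_llm(user_query)
    # Update the cache so the next query with this intent is a hit.
    cache_keys.append(query_vec)
    cache_values.append(response)
    return response

# Example usage with a stand-in LLM call; with a real embedding model,
# a paraphrase like "What's the price of a cleaning?" would also hit the cache.
print(answer("How much does a cleaning cost?", call_llm=lambda q: "A cleaning is $120."))
print(answer("How much does a cleaning cost?", call_llm=lambda q: "never called on a cache hit"))
```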
Multi-turn Cache Keys For High Precision Contextual Semantic Search
Have you ever noticed that when someone says something to you out of the blue, like the person next to you on the train, you often don't understand what they said? There's no context so their words seem garbled. The phenomenon is even more salient when you're speaking a foreign language. Years ago when I was making wine in Spain, my Spanish was good enough to pick up a conversation with anyone, but I was terrified of talking to the grocery store clerk.
In human conversation, we take into account the conversation history to understand what someone’s words mean. Similarly, you can improve the accuracy of semantic caching in conversational AI by incorporating contextual history into the cache's structure.
A cache is made up of key-value pairs. A key is a user query. A value is the LLM response to the user query.
In a simple cosine similarity semantic cache, when a user makes a new query, the code searches the semantic cache to see if any previous user queries match the new user query. This approach leads to lots of false positives in conversational AI. After all, you don’t want your AI app to respond with the same answer to the question “How much does it cost?” regardless of what the user is contextually referring to.
In the Canonical AI semantic cache, each cache key contains previous conversation turns. In many AI applications, including the previous LLM response alongside the new user query is required to achieve the target precision (and in some applications, more turns are needed).
Here is an example multi-turn semantic cache key.
LLM: Hi, thank you for calling Oak Grove Dental. Can I have your name please?
User: My name is Ben.
And here is the corresponding cache value.
LLM: What can I do for you today?
By looking over a longer segment of the conversation, our semantic cache uses context as part of its search criteria to increase the precision of cache hits. You can think of it like language translation – the more history you pull into the translation request, the more accurate the translation.
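As a rough sketch of that key structure (the turn count and formatting here are assumptions, not our exact implementation), the last few turns can be folded into a single string before embedding:

```python
def build_cache_key(conversation: list[dict[str, str]], turns: int = 2) -> str:
    # Fold the last `turns` turns (here, the previous LLM response plus the
    # new user query) into one string; this string is what gets embedded
    # and searched, instead of the bare user query.
    recent = conversation[-turns:]
    return "\n".join(f"{turn['role']}: {turn['content']}" for turn in recent)

conversation = [
    {"role": "LLM", "content": "Hi, thank you for calling Oak Grove Dental. "
                               "Can I have your name please?"},
    {"role": "User", "content": "My name is Ben."},
]
print(build_cache_key(conversation))
# LLM: Hi, thank you for calling Oak Grove Dental. Can I have your name please?
# User: My name is Ben.
```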
Understanding Your Audience With Multi-tenant LLM Caching
Knowing your audience is essential for understanding what your audience’s words mean. In human conversation, we remember what a set of people in a particular context have said to us previously.
For example, when I talk to developers, we’re generally talking about technology. In order to understand what a developer is saying to me, I map what they say to me to the history of technology-related conversations I’ve had with developers. In a sense, I am a different person with a different conversational cache for each of my social groups. Multi-tenancy applies this concept from human conversation to LLM caching.
Let’s call each individual cache a bucket (like buckets on AWS) to distinguish the general idea of a semantic cache from an individual cache within a multi-tenant caching system that holds many caches.
In a simple cosine similarity semantic cache, there is one bucket for all user interactions, so a cache hit has a high likelihood of being a false positive. For example, let’s say you’re building a Voice AI receptionist for dental offices. A false positive would be when the cache returns a response cached for the Oak Grove Dentistry receptionist to a user who is speaking to the Ygnacio Valley Dentistry receptionist.
In the Canonical cache, developers set the scope of the cache. Typically, developers set the cache scope at the AI persona level. In other words, each system prompt and LLM model pair has its own bucket (e.g., `SYSTEM_PROMPT_1` running on GPT-4 Turbo is one bucket, and `SYSTEM_PROMPT_1` running on GPT-3.5 Turbo is a different bucket). With our cache, it’s as if the AI persona has a memory of its interactions with the particular user type.
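As a rough sketch of that scoping (the hashing scheme here is an assumption; in our cache the developer chooses the scope), a bucket can be derived from the system prompt and model name:

```python
import hashlib

def bucket_id(system_prompt: str, model: str) -> str:
    # One bucket per (system prompt, model) pair, i.e. per AI persona.
    return hashlib.sha256(f"{model}::{system_prompt}".encode()).hexdigest()[:16]

# Each persona gets its own key-value store, so Oak Grove's receptionist
# never serves a response that was cached for Ygnacio Valley's receptionist.
buckets: dict[str, dict[str, list]] = {}

def get_bucket(system_prompt: str, model: str) -> dict[str, list]:
    return buckets.setdefault(
        bucket_id(system_prompt, model), {"keys": [], "values": []}
    )

oak_grove = get_bucket("You are the receptionist for Oak Grove Dental...", "gpt-4-turbo")
ygnacio = get_bucket("You are the receptionist for Ygnacio Valley Dentistry...", "gpt-4-turbo")
assert oak_grove is not ygnacio  # different personas, different buckets
```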
Sometimes developers ask us if they should set the bucket at the user level. This works well when the app has repeat users (e.g., someone calls a Voice AI to schedule a ride each week), but it isn’t necessary for achieving cache hit rates of 20% and above. We’ve found that developers often greatly underestimate the amount of AI traffic that can be served by a semantic cache. Even seemingly bespoke interactive Gen AI applications have a high opportunity for caching. You can email us your prompt and we’ll run it through our cache to show you its hit rate.
Metadata Tagging In Semantic Caching To Increase Precision And Recall
In human conversation, it’s rather unbecoming when you call someone by the wrong name. In conversational AI applications, it’s a business risk. By using templatization and metadata tagging, you can ensure you return the correct information (i.e., increase precision) and return more cache hits (i.e., increase recall).
Here’s how we take a simple semantic cache, like the one described above, and add metadata tagging and templatization. Because latency is paramount, we first search for a hit on the raw user query before searching again with a templated version of the query.
- When we update the cache with a new LLM response, we substitute template placeholders for the named entities (e.g., “Nice to meet you, Sid. How can I help you today?” becomes “Nice to meet you, <FIRST NAME>. How can I help you today?”).
- When a new user query arrives (e.g., “My name is Ben”), we use clustering, vector search, and other techniques to look for a match in the semantic cache.
- If we find a match, we return the response from the semantic cache.
- If it’s a cache miss, we substitute out the named entities (“My name is Ben” becomes “My name is <FIRST NAME>”) and assign the metatag `FIRST NAME: Ben` to the query. We then run another search with the templated version of the query, “My name is <FIRST NAME>”.
- If we find a match on the templated query, we substitute the data from the metatag into the cached response and return it to the user. “Nice to meet you, <FIRST NAME>. How can I help you today?” with the metatag `FIRST NAME: Ben` becomes “Nice to meet you, Ben. How can I help you today?”
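Here is a minimal sketch of that flow, assuming a toy regex entity extractor for illustration; a real system would use named entity recognition and handle many more entity types than first names.

```python
import re

# Hypothetical regex-based entity extractor for illustration only; in practice
# a proper NER step would identify names, dates, account numbers, and so on.
NAME_PATTERN = re.compile(r"[Mm]y name is (\w+)")

def templatize(query: str) -> tuple[str, dict[str, str]]:
    """Replace named entities with placeholders and return the metadata tags."""
    tags: dict[str, str] = {}
    match = NAME_PATTERN.search(query)
    if match:
        tags["FIRST NAME"] = match.group(1)
        query = query.replace(match.group(1), "<FIRST NAME>")
    return query, tags

def fill(cached_response: str, tags: dict[str, str]) -> str:
    """Substitute metadata back into a templated cached response."""
    for key, value in tags.items():
        cached_response = cached_response.replace(f"<{key}>", value)
    return cached_response

query, tags = templatize("My name is Ben")
# query == "My name is <FIRST NAME>", tags == {"FIRST NAME": "Ben"}
cached = "Nice to meet you, <FIRST NAME>. How can I help you today?"
print(fill(cached, tags))  # Nice to meet you, Ben. How can I help you today?
```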
Beyond the Basics: Advanced Techniques In Context-Aware Semantic Caching
We’ve described a few techniques for making a semantic cache aware of the conversational context. But we’ve only scratched the surface. You have to also consider cache invalidation (the only difficult computer science problem other than variable naming), query elaboration, cache variability temperature, context-conditional Retrieval Augmented Generation (RAG), function call caching, and more.
And all this context-aware functionality is moot without accurate semantic search. Accurate semantic search requires much more than simple clustering and cosine distance vector search. Our methods for fast, accurate semantic search are the topic of another blog post.
Moreover, the contextual awareness and the search need to be extremely fast to meet the latency requirements of Voice AI, video AI, and tomorrow’s AI. Users accepted the slowness of LLM applications in 2023 and early 2024. We can’t expect the same patience in the era of GPT-4o.
Get There Faster With The Canonical AI Semantic Cache
We’ve talked to many developers who try LLM caching with a simple cosine similarity search, see the unsurprisingly poor accuracy of this context-agnostic approach, and kick the can down the road on caching’s cost and latency improvements.
An accurate and effective LLM cache needs to understand the context of the conversation with the user. It’s lifetimes of work. Lifetimes that AI developers should spend building their core user product rather than infrastructure.
If you're interested in exploring how the Canonical AI context-aware semantic cache can help you improve your AI quality, then generate an API key on our homepage and get started with our GitHub repo examples. Or reach out to us! We’d love to hear from you!
Tom and Adrian
May 2024