Semantic Caching FAQ
What is semantic caching?
A semantic cache (also called an LLM cache) is a store of questions (keys) and answers (values) that obviates the need to call an LLM. If the user query has been asked previously, even if the phrasing is different, then the semantic cache returns the response instead of the LLM. It gets its name because the cache semantically searches the previous queries for a match. That is, an exact string match is not required for a cache hit.
Here’s an example of a simple semantic cache (a minimal code sketch follows the list).
- When a user makes a request, the code first searches the semantic cache for questions with the same intent.
- If a question with the same intent is found in the cache, regardless of how it was phrased, then send the answer from the cache to the user.
- If the question is not found in the semantic cache, then send the query to the LLM.
- Send the LLM response to the user.
- Update the cache with the user query and the LLM response. That way, the next time someone asks a question with the same intent, the code can return the answer from the semantic cache.
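Here is that flow as a minimal Python sketch. The `embed` and `call_llm` functions, the in-memory `cache` list, and the 0.9 similarity threshold are placeholders for whatever embedding model, LLM, and store you use; this is an illustration of the idea, not the Canonical AI implementation.

```python
import numpy as np

cache = []  # list of (query_embedding, query_text, response_text)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(user_query, embed, call_llm, threshold=0.9):
    query_vec = embed(user_query)

    # 1. Search the cache for a previous query with the same intent.
    for cached_vec, cached_query, cached_response in cache:
        if cosine(query_vec, cached_vec) >= threshold:
            return cached_response          # cache hit: skip the LLM

    # 2. Cache miss: call the LLM, update the cache, serve the response.
    response = call_llm(user_query)
    cache.append((query_vec, user_query, response))
    return response
```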
How is semantic caching different from key-value caching?
Key-value caching, or KV caching, is shorthand for caching that requires an exact string match for the cache key. In other words, the cache only returns a response if the user query exactly matches a previous user query.
In semantic caching, you don’t need an exact string match. Instead, you search the cache for queries that are semantically the same. That is, you search the cache for previously asked queries that have the same intent, even if the phrasing is different.
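To make the contrast concrete, here is an illustrative comparison of the two lookup styles, reusing the `cosine` helper and `cache` list from the earlier sketch. The dictionary, threshold, and example strings are assumptions for illustration only.

```python
kv_cache = {}   # exact-string key-value cache: {query_text: response_text}

def kv_lookup(query):
    # KV caching: a hit requires the exact same string as a previous query.
    return kv_cache.get(query)

def semantic_lookup(query, embed, threshold=0.9):
    # Semantic caching: a hit only requires the same intent.
    query_vec = embed(query)
    for cached_vec, cached_query, cached_response in cache:
        if cosine(query_vec, cached_vec) >= threshold:
            return cached_response
    return None

# After "What are your hours?" has been cached:
#   kv_lookup("When are you open?")              -> None (different string)
#   semantic_lookup("When are you open?", embed) -> the cached hours answer
```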
Where does a semantic cache fit in the LLM tech stack?
The semantic cache sits in front of the Large Language Model (LLM). Before calling the LLM, developers call the semantic cache. If they find a match to the user query in the semantic cache, then there’s no reason to call the LLM.
Does semantic caching improve LLM latency?
Yes. If you self-host our semantic cache, it responds in about 50 milliseconds. If you call the cache via an API, it responds in about 200 milliseconds.
Does semantic caching help save on LLM API costs?
Yes. A response from a cache is cheaper (and faster) than a response from an LLM. For the Canonical AI semantic cache, we only charge on cache hits. A cache hit saves you 50% on your LLM API fees.
Can I cache audio files and video files?
Yes. Text-to-speech and video generation are expensive, so semantic caching is a no-brainer for these applications. Here is how it works. On a cache miss, you update the cache with a) the user query, b) the LLM text response, and c) a URL to the audio or video file as metadata attached to the LLM text response. Then, when a user makes a new query, we search the cache for previous user queries that match it. If there’s a match in the cache, we return the cached LLM response with the URL to the audio/video file as metadata. The developer then serves the audio/video file to the user.
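As a rough sketch, a cached entry in this setup might look like the following. The field names and `audio_url` are illustrative, not a documented schema.

```python
# Illustrative cache entry for a text-to-speech application.
# On a cache miss, store the query, the LLM text response, and a URL
# to the generated audio file as metadata.
cache_entry = {
    "query": "What are your opening hours?",
    "response": "We are open from 9am to 5pm, Monday through Friday.",
    "metadata": {
        "audio_url": "https://example.com/audio/opening-hours.mp3",
    },
}

# On a cache hit, return the cached text response plus the metadata so the
# application can serve the already-generated audio file instead of
# re-running text-to-speech or video generation.
```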
Won’t caching reduce the accuracy of my LLM application?
Not if you do it well. Semantic caching is essentially a search problem. We break the search into two steps. The first step is identifying candidate matches based on the conversational context and the user query. In the second step, we use different search technologies that each work in different ways to make sure the candidate match really is a match.
That is, in match candidates where the string match is high, we perform an additional embedding search with a higher threshold than the first-pass embedding search. With this approach, we correctly classify “What are the fees on the Chase Sapphire card?” and “What are the fees on the Chase Sapphire Reserve card?” as different queries and don’t return a cache hit, even though the cosine similarity of these two strings is 0.99.
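Here is a rough sketch of that two-step idea, reusing the `cosine` helper from the first sketch. The `second_embed` model, the string-similarity check, and all thresholds are assumptions for illustration; the production checks differ.

```python
from difflib import SequenceMatcher

def string_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_cache_hit(query, candidate_query, embed, second_embed,
                 first_threshold=0.90, second_threshold=0.97):
    # First pass: the candidate is proposed by the main embedding search.
    if cosine(embed(query), embed(candidate_query)) < first_threshold:
        return False

    # Second pass: when the strings are nearly identical, a small wording
    # difference ("Sapphire" vs "Sapphire Reserve") can change the meaning.
    # Re-check with a second search method and a stricter threshold before
    # accepting the hit.
    if string_similarity(query, candidate_query) > 0.85:
        return cosine(second_embed(query),
                      second_embed(candidate_query)) >= second_threshold
    return True
```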
What is the hit rate for a semantic cache?
It depends on the application. The more repetition there is in a typical user-LLM interaction, the higher the hit rate. But even in applications where each user interaction seems bespoke, there is more opportunity for caching than developers tend to expect.
How do I make sure the semantic cache returns the right answer for each of my customers?
The Canonical AI semantic cache is multitenant. Each AI persona has its own cache, so a user doesn’t get a cached response from the wrong AI persona. For example, let’s say a developer is building an appointment-booking Voice AI for medical clinics. The developer sets the scope of the caching so that there is one cache per clinic. This ensures that responses from one clinic don’t get returned to a caller who is talking to another clinic.
In practice, we accomplish this by creating a new cache for each system prompt and model pair. For example, if you change a system prompt but keep using the same LLM model (say, gpt-4-turbo-2024-04-09), then you get a new, empty cache. The same is true if you change the LLM model but keep using the same prompt. You can also specify the bucket in the Canonical AI cache client configuration header. You can read more about it here.
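One way to picture the scoping is to derive a bucket ID from the (system prompt, model) pair. The hashing scheme below is an assumption for illustration, not how the Canonical AI buckets are actually derived.

```python
import hashlib

def cache_bucket(system_prompt: str, model: str) -> str:
    # Derive a stable bucket ID from the (system prompt, model) pair.
    # Changing either one produces a new, empty cache.
    digest = hashlib.sha256(f"{model}\n{system_prompt}".encode()).hexdigest()
    return digest[:16]

clinic_a = cache_bucket("You book appointments for Clinic A...",
                        "gpt-4-turbo-2024-04-09")
clinic_b = cache_bucket("You book appointments for Clinic B...",
                        "gpt-4-turbo-2024-04-09")
assert clinic_a != clinic_b   # each clinic gets its own cache
```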
Will the cache leak PII?
No, we filter out Personally Identifiable Information (PII) from the LLM response before updating the cache.
I’m already making a call to the cache. Can I also run RAG on the same call?
Yes, the Canonical AI cache has retrieval built into it. When you call our API, we search the cache. On a cache miss, we run a vector search on the uploaded data and return the relevant data chunks. The developer then passes the data into an LLM and gets a response. The response is served to the user and the response is sent to our cache.
On a cache hit, we return the response from the cache. There's no call to a RAG system or a LLM.
You can read more about it here.
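Here is a sketch of that flow, reusing the `cache` list and `cosine` helper from the first sketch. The `vector_search` and `call_llm` functions, the prompt format, and the threshold are placeholders, not the Canonical AI API.

```python
def answer_with_rag(user_query, embed, vector_search, call_llm, threshold=0.9):
    query_vec = embed(user_query)

    # Check the semantic cache first.
    for cached_vec, _, cached_response in cache:
        if cosine(query_vec, cached_vec) >= threshold:
            return cached_response              # cache hit: no RAG, no LLM call

    # Cache miss: retrieve relevant chunks, call the LLM, update the cache.
    chunks = vector_search(user_query)          # retrieval over uploaded data
    response = call_llm(f"Context:\n{chunks}\n\nQuestion: {user_query}")
    cache.append((query_vec, user_query, response))
    return response
```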
Is semantic caching only useful for RAG applications or can it work in conversational applications as well?
We're focused on making semantic caching work for conversational applications. Semantic caching for Q&A RAG applications (i.e., one-off questions rather than conversational AI) is straightforward, but it also has limited utility.
It's a lot harder to get semantic caching to work when a user is conversing with the AI. We've built techniques like adding multiple conversation turns to each cache key, metadata tagging, templatization, and other methods to make semantic caching work for conversations. You can read more about it here.
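As a simplified illustration of the "multiple conversation turns per cache key" idea, one way to fold context into the key is to include the last few turns alongside the new user query. The key format below is an assumption for illustration only.

```python
def conversation_cache_key(history, user_query, turns=2):
    # Include the last few turns so "Yes, book it" means something different
    # after "Do you want the 3pm slot?" than after "Do you want the premium plan?".
    recent = history[-turns:]
    context = " | ".join(f"{role}: {text}" for role, text in recent)
    return f"{context} | user: {user_query}"

key = conversation_cache_key(
    [("assistant", "Do you want the 3pm slot?")],
    "Yes, book it",
)
# The key (and therefore its embedding) now carries the conversational
# context, not just the literal user query.
```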
Do you use templating to improve the semantic cache hit rate?
Yes. When we get a user query, we first look for a match in the cache. In some cases, there’s no match in the cache because the user query mentions Personally Identifiable Information (PII). We filter out the PII from the user query (i.e., "Tom" becomes "<FIRST NAME>"), then search the cache again. The cache has previous user queries that also have PII filtered out. If there’s a match between the templated user query and the templated previous queries in the cache, we have a cache hit. We then take the cache response, substitute back in the PII, and return it to you.
You can read more about it here.
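Here is a simplified sketch of that templating round trip. The simple string replacement below stands in for a real PII filter, and the placeholder names are illustrative.

```python
def redact(text, pii_values):
    # Replace known PII values with placeholders before searching the cache.
    for placeholder, value in pii_values.items():
        text = text.replace(value, placeholder)
    return text

def fill(text, pii_values):
    # Substitute the PII back into the cached response before returning it.
    for placeholder, value in pii_values.items():
        text = text.replace(placeholder, value)
    return text

pii = {"<FIRST NAME>": "Tom"}
templated_query = redact("Can Tom reschedule his appointment?", pii)
# -> "Can <FIRST NAME> reschedule his appointment?"

cached_response = "Sure, I can reschedule <FIRST NAME>'s appointment."
print(fill(cached_response, pii))
# -> "Sure, I can reschedule Tom's appointment."
```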
Won’t semantic caching make my AI sound robotic?
We recognize that LLM response variability can be a feature, not a bug. For this reason, we added a temperature parameter to our semantic cache.
At a semantic cache temperature of zero, there is only one value for each key. On a cache hit, there is only one possible answer that gets returned to the user from the cache.
As the semantic cache temperature increases, the number of values associated with each key also increases. On a cache hit, the code randomly picks from the list of predetermined answers. In effect, the AI doesn’t repeat itself verbatim in a robotic way, but it still sticks to its script.
You can read more about it here.
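A toy illustration of the temperature idea, where each key stores several vetted phrasings of the same answer; the data and the way temperature maps to the number of allowed phrasings are assumptions for illustration.

```python
import random

# Each cache key maps to a list of acceptable phrasings of the same answer.
cache_values = {
    "opening hours": [
        "We're open 9am to 5pm, Monday through Friday.",
        "Our hours are 9 to 5 on weekdays.",
        "You can reach us weekdays between 9am and 5pm.",
    ],
}

def cached_answer(key, temperature):
    values = cache_values[key]
    # At temperature 0, only one value per key; higher temperatures allow
    # more stored phrasings, picked at random on each cache hit.
    allowed = values[: 1 + temperature]
    return random.choice(allowed)
```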
Is semantic caching based on embeddings?
Yes. Semantic caching is based on embeddings: we use them to find the intent of the query. If we find a previously asked query with the same intent, even if the phrasing is different, we can return the response from the cache.
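For instance, with any sentence-embedding model (sentence-transformers and the model name below are used purely as an example, not what our cache runs), two differently worded questions with the same intent land close together in embedding space:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
a = model.encode("What are your opening hours?")
b = model.encode("When are you open?")
c = model.encode("How do I reset my password?")

print(util.cos_sim(a, b))  # high similarity: same intent, different phrasing
print(util.cos_sim(a, c))  # low similarity: different intent
```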
How do you populate a semantic cache?
For most applications, developers let the LLM organically populate the Canonical AI semantic cache. That is, a new cache starts empty. The first user query is by definition a miss. The LLM responds. The response gets served to the user and becomes the first cached response.
You can also pre-populate a semantic cache. This approach is helpful if you’re trying to prevent LLM hallucinations.
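A minimal sketch of pre-populating, reusing the `cache` list and `embed` placeholder from the first sketch: seed the cache with reviewed question/answer pairs so those answers always come from the cache rather than the LLM. The seed data below is made up for illustration.

```python
def prepopulate(cache, embed, seed_qa):
    # Seed the cache with vetted question/answer pairs before going live.
    for question, answer_text in seed_qa:
        cache.append((embed(question), question, answer_text))

seed_qa = [
    ("What is your cancellation policy?",
     "You can cancel up to 24 hours before your appointment at no charge."),
    ("Do you take walk-ins?",
     "We see walk-ins on weekdays, but booked appointments are seen first."),
]
# prepopulate(cache, embed, seed_qa)
```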
Can I invalidate the cache?
Yes, cache invalidation is easy, just like variable naming.
Can I try out the Canonical AI semantic cache?
Yes, generate an API key on our homepage for a free two-week trial!
If you would like our help getting set up, reach out! We’ll set up a Slack channel to help you get started. We’d love to meet you!