Semantic Cache RAG
If your Generative AI app uses Retrieval-Augmented Generation (RAG), you can likely cut your Large Language Model (LLM) requests by 30% or more with semantic caching.
Your LLM API fees are part of your Cost of Goods Sold (COGS). Times are good for LLM companies right now, but when the music stops, investors will start paying much closer attention to your COGS and gross margin. Semantic caching is the low-hanging fruit for lowering your COGS.
Context-Aware Semantic Caching for Conversational RAG
If your users have more than one conversation turn with your app, a simple semantic cache (e.g., similarity search with GPTCache, Redis, or LangChain) won’t be accurate. The semantic cache won’t know what the user means when they refer to something they said earlier in the conversation. For caching to work with conversational RAG, the semantic cache has to be aware of the context of the user query.
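To make the limitation concrete, here is a minimal sketch of that kind of single-query cache, using sentence-transformers embeddings and a cosine-similarity threshold. It is an illustration of the generic approach, not any particular product: the lookup embeds only the latest message, so a follow-up pronoun carries none of the earlier conversation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

# A naive semantic cache: cached query text -> (embedding, stored response).
cache = {}


def cache_put(query, response):
    cache[query] = (model.encode(query), response)


def cache_get(query, threshold=0.85):
    """Return a cached response if any cached query is similar enough to this one."""
    q = model.encode(query)
    for emb, response in cache.values():
        cos = float(np.dot(q, emb) / (np.linalg.norm(q) * np.linalg.norm(emb)))
        if cos >= threshold:
            return response
    return None


# The problem: only the latest message is embedded. "How much does it cost?"
# says nothing about what "it" refers to, so a user who meant a different plan
# can still be served the premium-plan price from the cache.
cache_put("How much does the premium plan cost?", "$49/month")
print(cache_get("How much does it cost?"))
```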
At Canonical AI, we’ve developed a number of innovative methods for addressing the query-context issue in semantic caching. These techniques include multi-turn cache keys, developer-defined cache scoping, cache temperature, and others (you can read more about them here).
We’ve also built the Retrieval part of RAG into our caching layer. Here’s more about how it works.
Semantic Cache RAG
Here are the steps in the Canonical AI Semantic Cache Retrieval pipeline (a code sketch of the full loop follows the list):
Step 0. The developer uploads documents to Canonical AI via an API call.
Step 1. The user makes a query. We search the semantic cache for previously asked queries with the same intent – even if the phrasing is different (hence, “semantic” cache).
Step 2. If we find a cached query with the same intent (i.e., a cache hit), we return the response from the cache. No LLM call needed!
Step 3. If the query has not been previously cached (i.e., a cache miss), we run a retrieval search on the uploaded knowledge.
Step 4. We return the retrieved knowledge to the developer.
Step 5. The developer passes the retrieved knowledge to an LLM to synthesize a response for the user.
Step 6. The developer sends the LLM response to the user.
Step 7. The developer updates our cache with the LLM response. The next time the same question is asked, we return the answer from the cache.
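To make the loop concrete, here is a minimal developer-side sketch in Python. The base URL, endpoint paths, field names, and the CANONICAL_API_KEY environment variable are illustrative assumptions rather than Canonical AI’s documented API; only the control flow (cache lookup, retrieval on a miss, LLM synthesis, cache update) follows the steps above.

```python
import os
import requests

# Hypothetical base URL and auth header -- Canonical AI's real API may differ.
BASE_URL = "https://api.example-canonical.ai"
HEADERS = {"Authorization": f"Bearer {os.environ['CANONICAL_API_KEY']}"}


def upload_documents(paths):
    """Step 0: upload knowledge documents to the caching layer."""
    for path in paths:
        with open(path, "rb") as f:
            requests.post(f"{BASE_URL}/documents", headers=HEADERS, files={"file": f})


def answer(user_query, call_llm):
    """Steps 1-7: cache lookup, retrieval on a miss, LLM synthesis, cache update."""
    # Steps 1-2: look for a previously cached query with the same intent.
    lookup = requests.post(
        f"{BASE_URL}/cache/lookup", headers=HEADERS, json={"query": user_query}
    ).json()
    if lookup.get("hit"):
        return lookup["response"]  # cache hit: no LLM call needed

    # Steps 3-4: cache miss -- retrieve relevant chunks from the uploaded knowledge.
    retrieved = requests.post(
        f"{BASE_URL}/retrieve", headers=HEADERS, json={"query": user_query}
    ).json()["chunks"]

    # Step 5: the developer's own LLM call synthesizes a response from the chunks.
    llm_response = call_llm(user_query, retrieved)

    # Step 7: write the response back so the next query with the same intent is a hit.
    requests.post(
        f"{BASE_URL}/cache/update",
        headers=HEADERS,
        json={"query": user_query, "response": llm_response},
    )
    return llm_response  # Step 6: send this back to the user
```

The point of this shape is that a cache hit skips the LLM call (and its fee) entirely; only misses pay for retrieval plus generation.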
Context-Aware RAG for Conversational AI
Sometimes the appropriate answer to a user query is in the uploaded knowledge. And sometimes the appropriate answer is the unaugmented LLM response. In a conversational interaction with a user, the context of the user’s query dictates whether or not to run RAG.
To address this issue, on each and every cache miss, Canonical AI runs the retrieval search using the last four conversation turns as search input.
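In code, the change from the single-query lookup shown earlier is simply what gets used as the search input. A rough sketch of the idea (the turn formatting and helper are illustrative, not Canonical AI’s code):

```python
def retrieval_query(turns, window=4):
    """Build the retrieval input from the last four conversation turns, so the
    search string carries the context a lone user message would lack."""
    return "\n".join(turns[-window:])


turns = [
    "user: Tell me about the premium plan.",
    "assistant: It includes priority support and SSO.",
    "user: How much does it cost?",
]
print(retrieval_query(turns))
# The retrieval search now sees the earlier mention of the premium plan,
# not just the ambiguous final question.
```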
The Resourceful Builder
At our last company, we helped farmers make irrigation decisions. The technology, developed at UC Davis, used an entirely new approach. It monitored not the water in the soil, but the water the plants were using. It took ten years for farmers to shift the way they thought about irrigation management, but eventually they did. We were able to keep going until the market caught up. We succeeded because we were resourceful.
The most successful founders we know in the YC community aren't smarter, or better connected, or prescient. They're resourceful.
Adding smart, context-aware semantic caching to your RAG application is resourceful.
If you would like to try it out, you can generate an API key on our homepage for a free two-week trial.
If you would like our help getting set up, reach out! We’ll set up a Slack channel to help you get started. We’d love to meet you!
Tom and Adrian
June 2024