RAG Is a Band-Aid. Gemini 2.0 Flash-Lite Is All You Need
If you’ve built with Retrieval-Augmented Generation (RAG), you know how hard it is to reliably retrieve the right passage from your content. But if you put a knowledge base in Gemini 2.0 Flash-Lite’s system prompt, rather than using RAG, you can get the right answer almost every time.
Previously, this approach was too slow and expensive for conversational AI applications, such as Voice AI agents. But with the newly released Gemini 2.0 Flash-Lite and prompt caching, it’s cheap enough, fast enough, and has a long enough context window (1M tokens, ~1,500 pages) to work.
There’s still a place for RAG. If you have a massive corpus that exceeds the context window, you need RAG. If your queries need to generate far more output tokens than a typical conversational exchange, the long generation adds latency and you may still want RAG. And there are other cases where you may choose RAG for one reason or another.
But for conversational AI, Gemini 2.0 Flash-Lite is all you need.
How To Use The Context Window Instead of RAG
Here’s how Model-Assisted Generation (MAG) works.
First, you put the knowledge base content in a Gemini 2.0 Flash-Lite system prompt.
Second, you make a tool call to Gemini 2.0 Flash-Lite. For example, in your Voice AI agent’s system prompt, you write, “When the user asks about the rule book, use the query_knowledge_base tool.”
Third, you get the right answer from your knowledge base.
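Here’s a minimal sketch of those three steps using the google-generativeai Python SDK. The model id, file name, and prompt wording are assumptions on our part; the query_knowledge_base wrapper is what your voice agent’s tool call would invoke (the Pipecat demo below shows the real wiring).

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")

# Step 1: put the entire knowledge base in the system prompt -- no chunking, no vector store.
with open("crossfit_games_rulebook.txt") as f:
    knowledge_base = f.read()

model = genai.GenerativeModel(
    model_name="gemini-2.0-flash-lite",
    system_instruction=(
        "Answer the user's question using only the knowledge base below. "
        "Keep answers short enough to speak aloud.\n\n"
        f"KNOWLEDGE BASE:\n{knowledge_base}"
    ),
)

# Steps 2 and 3: the voice agent's query_knowledge_base tool is a thin wrapper
# around a single Gemini call, and the model's reply is the answer.
def query_knowledge_base(question: str) -> str:
    return model.generate_content(question).text
```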
MAG Accuracy, Latency and Cost
Results
Accuracy: In our test dataset, a closed Voice orchestration platform’s built-in RAG found the right answer 36% of the time. Gemini 2.0 Flash-Lite got it right every single time.
Latency: Time To Last Token was 0.93 seconds for a 38.5k input token knowledge base (~50 pages) and ~40 output tokens, even without prompt caching.
Cost: Without prompt caching, 300 queries per day on a 38.5k token knowledge base (~50 pages) costs about $26 per month in Gemini 2.0 Flash-Lite token costs. With prompt caching, it’s about $7 per month in Gemini 2.0 Flash-Lite token costs.
MAG Accuracy
The accuracy result is not surprising. The leading LLMs are benchmarked against the Needle In A Haystack (NIAH) problem, and they’re excellent at finding what they’re looking for in the system prompt. We can’t find NIAH benchmarks for Gemini 2.0 Flash-Lite yet, but here we see Gemini 1.5 Pro get 99.7% recall. It just works. Period.
RAG, on the other hand, is difficult to get right. When it fails, it’s not in the generation step. It fails in retrieving the relevant chunks from the content. With painstaking dataset preparation, you can get good results. But you have to spend a lot of time and care on munging your dataset and configuring your RAG pipeline. There is no one solution that will work for every dataset.
This challenge becomes even more acute when dealing with voice interactions, where users speak in natural, conversational language rather than crafting precise search queries. Natural speech patterns – with their vague references, contextual assumptions, and incomplete thoughts – make it extremely difficult for RAG systems to retrieve the right information. Our tests showed that traditional RAG approaches achieved only 37% accuracy in voice interactions, while a context-loaded LLM approach reached 100% accuracy.
The results from our test dataset tell the same story. We used the CrossFit Games Rulebook as our content. To test RAG, we uploaded the content as a text file to one of the leading closed-source Voice orchestration platforms. To test Gemini 2.0 Flash-Lite, we put the content in the system prompt. We asked 30 questions and hand-labeled the results (i.e., Adrian and I, who are both humans and not judgy LLMs, reviewed the output to assess accuracy). We found that the native RAG solution fails most of the time (it can’t find the answer for nearly two-thirds of the questions), but Gemini 2.0 Flash-Lite gets the answer right every… single… time.
MAG Latency
Our mean, minimum, and maximum Time To Last Token were 0.93 seconds, 0.73 seconds, and 1.18 seconds. (We weren’t streaming, hence we used the Time To Last Token metric.) The golden latency threshold in Voice AI is 500 milliseconds for Time To First Token. At first blush, 930 milliseconds for Time To Last Token for generating 40 tokens may seem a bit too slow.
But two things…
First, Voice AI orchestration platforms are not actually achieving 500-millisecond Time To First Token. You can use our Voice AI analytics platform to check your own agent. If you think you’re getting 500 milliseconds, you’re being lied to.
Second, we’re making a function call, so you can use the usual function-call tricks to mask the latency. In the Voice AI world, it’s common to instruct the agent to tell the user to hold on a second before making a function call (i.e., “First, tell the user to hold on a second, then run the query_knowledge_base function”). After all, we have a similar experience when we talk to real humans on the phone and ask them to do something, like reserving a table at a restaurant.
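As a concrete (and hypothetical) example, here’s what that instruction and the tool declaration might look like. The schema format and wording are illustrative; adapt them to whatever orchestration framework you use.

```python
# Hypothetical tool declaration the voice agent exposes to its LLM. The exact
# schema depends on your orchestration framework; only the names matter here.
QUERY_KNOWLEDGE_BASE_TOOL = {
    "name": "query_knowledge_base",
    "description": "Look up an answer in the CrossFit Games Rulebook.",
    "parameters": {
        "type": "object",
        "properties": {
            "question": {"type": "string", "description": "The user's question, verbatim."},
        },
        "required": ["question"],
    },
}

# The latency trick lives in the agent's system prompt: speak first, then call the tool.
VOICE_AGENT_SYSTEM_PROMPT = """\
You are a voice assistant for the CrossFit Games.
When the user asks about the rule book, first tell them to hold on a second,
then call the query_knowledge_base function with their question.
Keep spoken responses to one or two short sentences.
"""
```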
MAG Cost
If you’re making 300 queries per day against a 50-page knowledge base, the Gemini 2.0 Flash-Lite token cost, without prompt caching, works out to about $26 per month. That’s on par with what RAG-as-a-service providers charge for similar usage.
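Here’s the back-of-the-envelope math, assuming Gemini 2.0 Flash-Lite’s list pricing at the time of writing (roughly $0.075 per million input tokens and $0.30 per million output tokens; check current prices before relying on this).

```python
queries_per_day = 300
kb_tokens = 38_500        # ~50-page knowledge base sent as the system prompt
output_tokens = 40        # short, conversational answers

monthly_input_tokens = queries_per_day * 30 * kb_tokens       # ~346.5M tokens
monthly_output_tokens = queries_per_day * 30 * output_tokens  # ~0.36M tokens

input_cost = monthly_input_tokens / 1e6 * 0.075    # ~$25.99
output_cost = monthly_output_tokens / 1e6 * 0.30   # ~$0.11
print(f"~${input_cost + output_cost:.0f}/month without prompt caching")  # ~$26
```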
Prompt caching (‘Context Caching’ in Google-speak) isn’t available for Gemini 2.0 Flash-Lite yet. But when it becomes available, the cost drops to about $7 per month. Imagine that: running 9,000 queries per month against 38.5k tokens in the context window for about $7. Maybe that’s less stunning in the age of DeepSeek, but it’s still a marvel.
The prompt caching part is a little complicated because you have to rent the prompt cache storage from Google. Our cost model assumes users only query the knowledge base 12 hours per day (i.e., we only need to rent the prompt cache tokens for 12 hours a day).
The Gemini suite, by the way, is the only top-tier model family that gives you control over the cache lifetime (TTL). The other model providers, perhaps to discourage you from taking advantage of the cost savings, invalidate the cache after anywhere from five minutes to about an hour.
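When caching does arrive for Flash-Lite, we expect it to look like the existing Gemini context caching API, sketched below against a Gemini 1.5 model. The model id, display name, and TTL are illustrative, and cached prompts have a minimum size, so check the current docs before copying this.

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_GEMINI_API_KEY")

with open("crossfit_games_rulebook.txt") as f:
    knowledge_base = f.read()   # the ~38.5k-token rulebook text

# Cache the rulebook once, and choose how long the cache lives (the TTL).
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",   # caching isn't on 2.0 Flash-Lite yet
    display_name="crossfit-rulebook-cache",
    system_instruction="Answer questions using only the attached rulebook.",
    contents=[knowledge_base],
    ttl=datetime.timedelta(hours=12),      # rent the cache only while users are querying
)

# Subsequent queries reuse the cached tokens instead of resending the rulebook.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("How many athletes advance from the Quarterfinals?")
print(response.text)
```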
And remember that RAG isn’t perfectly efficient with tokens either. Passing three retrieved chunks of 512 tokens each into a model adds roughly 1,500 input tokens to every request, which adds up significantly across thousands of queries.
Model-Assisted Generation with Pipecat Demo
We used Pipecat to build a demo of MAG for Voice AI. Here’s a video of us talking to the agent, asking questions about the CrossFit Games Rulebook. You can find the Pipecat Model-Assisted Generation code here.
Model-Assisted Generation For Claude and GPT
Do GPT and Claude perform well enough to replace RAG? No.
Gemini 2.0 Flash-Lite is the only model that responds fast enough, stays reliable with a long system prompt, is priced comparably to RAG-as-a-service providers, and (coming soon) has a configurable cache.
Voice AI Agent Memory
For some Voice AI applications, you may want the agent to remember something about a returning user. When a Voice AI agent starts a new session, the agent has no memory of its previous conversations with a user.
Gemini 2.0 Flash-Lite’s near-perfect recall and ~1,500-page context window could enable agent memory. After each conversation with a user, you would ask Gemini 2.0 Flash-Lite to add the key points from the transcript to the appropriate place in the memory system prompt. It would be loosely analogous to human memory consolidation and formation.
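A loose sketch of that consolidation step, under the same assumptions as the earlier snippets (the file name, prompt wording, and model id are ours, not a prescribed API):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
MEMORY_FILE = "user_123_memory.txt"   # illustrative per-user memory store

def consolidate_memory(transcript: str) -> None:
    """After a session ends, fold the transcript's key points into the memory notes."""
    try:
        with open(MEMORY_FILE) as f:
            existing_memory = f.read()
    except FileNotFoundError:
        existing_memory = ""

    model = genai.GenerativeModel("gemini-2.0-flash-lite")
    response = model.generate_content(
        "Here is what we already remember about this user:\n"
        f"{existing_memory}\n\n"
        "Here is the transcript of their latest conversation:\n"
        f"{transcript}\n\n"
        "Rewrite the memory notes, adding any new key points in the appropriate place. "
        "Return only the updated notes."
    )

    with open(MEMORY_FILE, "w") as f:
        f.write(response.text)

# At the start of the next session, prepend the contents of MEMORY_FILE
# to the agent's system prompt so it "remembers" the returning user.
```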
In technology, where you find the band-aids and workarounds is where you can see the future. It’s obvious that it will soon seem strange that AI apps didn’t remember us between sessions.
RAG itself is one of these band-aids - a workaround for the current limitations of LLMs. Rather than truly understanding and remembering information, we’re forcing search mechanics into a conversational interface. The fact that engineers are spending countless hours fine-tuning RAG pipelines just to achieve mediocre results signals that we’re solving the wrong problem. The future isn’t about better search - it’s about AI systems that can hold and process information as naturally as humans do.
Next Steps
Stop messing up your product with RAG. Use Gemini 2.0 Flash-Lite instead.
If you have questions about how to build this, check out our Pipecat demo in the link above. If you still have questions, please reach out to us. We’d love to hear from you.
Tom and Adrian
February 2025