How To Integrate The Canonical Semantic Cache
We all have stacked engineering roadmaps. But every once in a while, you find easy wins that are quick to implement. The Canonical AI context-aware semantic cache is an easy win. For an hour of engineering work, you get 50-200 ms LLM response times and save 50% of your token costs.
You can integrate the Canonical LLM Cache on-premise or over the network. Either way, the workflow is the same: first, check the semantic cache. If there's a cache hit, return the response from the cache. If there's a cache miss, call the LLM API, respond to the user, and update the cache.
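In outline, that flow looks like the sketch below. The check_cache, call_llm, and update_cache helpers are placeholders for whichever integration path (API or on-premise) you choose; concrete API calls follow in the next section.

# A minimal sketch of the cache-aside flow. check_cache(), call_llm(), and
# update_cache() are hypothetical helpers standing in for your integration.
def respond(messages: list[dict]) -> str:
    cached = check_cache(messages)      # 1. Check the semantic cache first.
    if cached is not None:
        return cached                   # 2. Cache hit: return the cached response.
    response = call_llm(messages)       # 3. Cache miss: call the LLM API.
    update_cache(messages, response)    # 4. Update the cache for next time.
    return response                     # 5. Respond to the user.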
Let's start with the over-the-network integration.
API
Many of our users start with over-the-network calls. It's a low-risk and easy way to get started. Latency is lower for on-premise (~50 ms) compared to the API (~200 ms), so most developers eventually switch to on-premise deployments.
Here's how calling the API works. You make the Canonical Cache API call before you call the LLM API. If the Canonical Cache finds a hit, it returns the cached response. If there is no cache hit, Canonical returns a 404, and you then call the LLM.
import json

import httpx
import openai
import requests

# Create an OpenAI client that points to Canonical instead of your LLM provider.
client = openai.OpenAI(
    base_url="https://cacheapp.canonical.chat/",
    http_client=httpx.Client(
        headers={
            "X-Canonical-Api-Key": "<api_key>",
        },
    ),
)

# Instead of sending the request to your LLM, send it to Canonical.
# Catch the 404 on a cache miss.
try:
    completion = client.chat.completions.create(...)
except openai.NotFoundError:
    # Cache miss: send the request to your LLM as usual.
    # After receiving the response from the LLM, send it to Canonical to cache it.
    requests.request(
        method="POST",
        url="https://cacheapp.canonical.chat/api/v1/cache",
        headers={
            "Content-Type": "application/json",
            "X-Canonical-Api-Key": "<api_key>",
        },
        data=json.dumps({
            "temperature": "<temperature>",
            "messages": "<msglist>",
        }),
    )
You can also integrate Canonical as a proxy using the OpenAI base URL. More details are on our GitHub page.
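As a rough idea of what the proxy-style setup can look like, here is a sketch only; the proxy URL, headers, and parameters below are placeholders and assumptions, so follow the GitHub examples for the real configuration.

import httpx
import openai

# Sketch of a proxy-style integration: point the OpenAI SDK's base URL at
# Canonical and let the proxy serve cache hits and forward misses to your
# LLM provider. "<proxy_base_url>" is a placeholder, not the real endpoint.
client = openai.OpenAI(
    api_key="<llm_api_key>",
    base_url="<proxy_base_url>",
    http_client=httpx.Client(
        headers={"X-Canonical-Api-Key": "<api_key>"},
    ),
)

completion = client.chat.completions.create(
    model="<model>",
    messages=[{"role": "user", "content": "What's a semantic cache?"}],
)
print(completion.choices[0].message.content)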
On-Premise
We've built our on-premise LLM Cache to minimize configuration. Non-deterministic machine learning models require enough fussing over as it is. You shouldn't have to fuss over your LLM cache as well.
We'll need to know more about your infrastructure design first, and we can share more on how to deploy the Canonical Cache on-premise. Please reach out to us to get started!
Next Steps
If you're interested in trying out our context-aware semantic cache, generate an API key on our homepage for a free two-week demo. Our GitHub examples can help you get started.
Or just reach out! We would love to meet you!
Tom and Adrian
April 2024