Serverless single tenant RAG with DuckDB
Introducing DuckRAG, our architecture for single tenant, secure, and performant RAG workflows. Powered by DuckDB.
Summer’s forthcoming AI features allow prompting for SQL generation, chart generation, information about data governance, and even summaries or pointers to conversations happening around interesting datapoints. The challenge is that each user within an org has a different view of all these component pieces, depending on what access they’ve been granted. To provide these details to an LLM alongside user generated prompts, without overwhelming the context window or driving up cost, we need to first assemble the user’s view of Summer and then filter it down to the most relevant pieces.
This is the standard use case for a RAG workflow, where pieces of content are reduced by a model to arrays of floats, called embeddings, and math™️ is used to determine how closely related other embeddings (say of a prompt) are. The closer the relation, the more relevant the content, the more likely it is to be helpful if included in the prompt being sent to the LLM.
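To make the math™️ concrete, here is a minimal sketch of cosine distance between two embedding vectors, in plain Python with no libraries (real systems use optimized vector math, like DuckDB's array functions):

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity.
    # 0 means the vectors point the same direction (very related content);
    # values near 1 mean the content is unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# Identical direction -> distance 0; orthogonal -> distance 1.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```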
Here is our RAG complication: pulling together a user’s entire view of Summer is a heavy operation. We need to account for tables, queries, dashboards, connectors, catalog data, comment threads (spoiler alert), and more. Pulling this data every time a user prompts our AI would be a burden on our authorization and product systems, so we challenged ourselves to design a RAG system that is:
- Pre-populated. When a prompt is issued, the only step is to query the embeddings.
- Single tenant. Each “instance” of the system is safe to query without fear of someone accessing content they don’t have permission to.
- Zero, or close to zero, burden on our existing production systems.
Introducing DuckRAG
DuckRAG is our system for creating a pre-computed, pre-authorized, and isolated view of Summer components for each individual user. The output of DuckRAG is a per-user DuckDB database file, stored on S3, which contains the id, type, embeddings, and text content of every component a user has access to on Summer. When a user enters an AI prompt, we generate embeddings for it, attach the user's DuckRAG database to a DuckDB instance, and, leveraging DuckDB's VSS extension, issue an array_cosine_distance() vector query to pull the most similar results. We don't interact with our authorization or product databases, and the load on our existing systems is zero.
Getting data into DuckRAG
In order to get data into each user’s DuckRAG, we employ the “fan-out-write” strategy made popular by companies like Twitter for generating personalized user feeds. When a new component is created or modified on Summer, we generate the embeddings for it, pull the list of all users with access, and place one event per user onto a stream. A stream processor picks up the message, pulls the DuckDB file for the user, writes the new record, and saves it back to S3.
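The fan-out step can be sketched roughly like this; the function and field names are illustrative, not our actual internals:

```python
def fan_out(component, user_ids, embedding):
    """Turn one component change into one event per user with access.

    Each event carries everything the stream processor needs to write a
    row into that user's DuckRAG file: the component's id, type, its
    embedding, and the text content that may later be sent to an LLM.
    """
    return [
        {
            "user_id": user_id,
            "entity_id": component["id"],
            "entity_type": component["type"],
            "embedding": embedding,
            "content": component["content"],
        }
        for user_id in user_ids
    ]

# One modified table, two users with access -> two events on the stream.
events = fan_out(
    {"id": "tbl_123", "type": "table", "content": "schema text"},
    ["user_a", "user_b"],
    [0.1, 0.2],  # stand-in for a real 1536-dim embedding
)
```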
The obvious tradeoff is that we are storing the same piece of information N times, where N is the number of users who have access. The tradeoff works out because storage is inexpensive, and for our use case the amount of replication isn't something to worry about. For example, within Summer's own Summer account, we have eight users, each of whom has a DuckRAG file containing between 150 and 260 tables, between 110 and 215 queries, and various other bits and pieces. Each file weighs between 4MB and 6MB. Rounding up, that's just shy of 50MB stored on S3 per month for a cost of $0.001. Even factoring in growth over time, the system more than pays for itself with the reduction of load on our other production deployments.
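That arithmetic checks out directly (assuming S3 Standard pricing of roughly $0.023 per GB-month):

```python
users = 8
max_file_mb = 6  # largest per-user DuckRAG file observed
s3_price_per_gb_month = 0.023  # approximate S3 Standard pricing

total_mb = users * max_file_mb           # 48 MB, "just shy of 50MB"
total_gb = total_mb / 1024
monthly_cost = total_gb * s3_price_per_gb_month
print(f"{total_mb} MB stored, ~${monthly_cost:.4f}/month")
```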
Getting data out of DuckRAG
This is the fun part! Let’s say a request comes in to our prompt API with the following attributes:
```json
{
  "user_id": "user123",
  "prompt": "What are the 35 best reviewed ice cream flavors?",
  "context": "sql"
}
```
DuckRAG takes the following steps:
```python
import duckdb
from openai import OpenAI

client = OpenAI()

# Set a maximum distance filter to get only the most relevant results
score_filter = 0.65

# Step 1: Generate the embedding of the user's prompt
# This is an OpenAI SDK client, but the embeddings can be generated by
# any hosted embedding API or even something self run!
response = client.embeddings.create(
    input=req.prompt,
    model="text-embedding-3-small"
)
prompt_embedding = response.data[0].embedding

# Step 2: Attach the user's DuckRAG file.
conn = duckdb.connect()
conn.execute(f"ATTACH 's3://duckrag-bucket/{req.user_id}.duckdb' AS duckrag (READ_ONLY); USE duckrag;")

# Step 3: Query DuckDB for the closest items
# Since our prompt had a "sql" context, let's pull just the
# tables that are most relevant
result = conn.execute("""
    SELECT array_cosine_distance(embedding, ?::FLOAT[1536]) AS score,
           entity_id,
           content
    FROM embeddings
    WHERE score <= ?
      AND entity_type = 'table'
    ORDER BY score
""", (prompt_embedding, score_filter)).fetchall()
```
In this case, for the prompt issued above we get back two tables, sample_data_icecream_reviews.icecream_reviews and sample_data_icecream_products.icecream_products:
score | entity_id | content
---|---|---
0.58011603 | tbl_rxc0nojLY4Hx8P… | summer sample_data_icecream_reviews icecream_reviews : "x_summer_id" VARCHAR NOT NULL, "_ab_source_file_last_modified" VARCHAR, "_ab_source_file_url" VARCHAR, "brand" VARCHAR,…
0.62355006 | tbl_gZzH87v9X0jamn… | summer sample_data_icecream_products icecream_products : "x_summer_id" VARCHAR NOT NULL, "_ab_source_file_last_modified" VARCHAR, "_ab_source_file_url" VARCHAR, "brand" VARCHA…
As you can see, we got back the two most relevant tables and their text representations, which we can now happily send along with the user's prompt to an LLM for SQL generation. To see this in action, here is a test harness we have for DuckRAG. You can watch the prompt above delivering both the tables and existing queries that are most relevant:
Next Steps
We’re not quite done with DuckRAG, and we have a few improvements to make before open sourcing it.
- A light form of hot caching DuckRAG files on query instances with simple invalidate-on-write mechanics, so that for a user doing multiple prompts in a short period of time, we won't be constantly attaching/detaching their DuckRAG file.
- Multiple writes per user when available. If a stream shard processor pulls 100 events at a time, and N of those apply to the same user, we can batch them and write them at once. Today all writes are one at a time. Shards are keyed by user_id, so it's safe to do this.
- Experimentation with other embedding models and vector search functions.
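The batching improvement above can be sketched in a few lines (names hypothetical): because shards are keyed by user_id, every event for a given user lands on the same shard, so grouping within a batch is safe.

```python
from collections import defaultdict

def group_by_user(events):
    """Group a shard's batch of events by user, so each user's DuckRAG
    file is opened, written, and saved once per batch instead of once
    per event."""
    batches = defaultdict(list)
    for event in events:
        batches[event["user_id"]].append(event)
    return batches

batch = [
    {"user_id": "user_a", "entity_id": "tbl_1"},
    {"user_id": "user_a", "entity_id": "tbl_2"},
    {"user_id": "user_b", "entity_id": "qry_9"},
]
# user_a's two writes collapse into one open/write/save cycle.
grouped = group_by_user(batch)
```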
We’re pleased with the performance of DuckRAG so far and impressed with how fast DuckDB is as a stand-in vector database. We have a lot more AI powered features coming, and DuckRAG is proving to be a reliable piece of scaffolding on which to build.
While discussing DuckRAG with some of our early partners and customers, we’ve had requests to make this architecture available as a product offering. If you would be interested in DuckRAG, as either an open source library or a hosted solution, reach out on Slack and let us know!