DuckDB is a protocol
DuckDB is more than a database. It's the protocol on which the next generation of data products is being built.
by Avery Max
You’ve probably heard that DuckDB is the fastest OLAP database, beating brand names like ClickHouse on their own benchmarks. You’ve probably read that DuckDB is the world’s most downloaded database, with 30k stars on GitHub and over 20M monthly downloads via Python alone. You’ve probably bought into DuckDB’s marketing, which states that it is a “fast analytical database system”.
Unfortunately, it’s all a lie. DuckDB isn’t a database at all. It is a protocol masquerading as a database, and understanding this is key to knowing what is going to happen next in data.
With few exceptions, every application developed in the last fifty years has been built atop two things: protocols and databases. Databases are useful constructs to store and retrieve data, but protocols drive innovation.
A protocol can yield a seismic shift for humanity (TCP/IP), alter the way we communicate (SMTP/IMAP/POP3), or even just simplify the way computers talk to each other (JSON). A great database is a wonderful tool, and DuckDB is that, but it’s also already proving itself to be the building block for the next generation of data.
Let’s test our hypothesis and hold DuckDB’s little webbed feet to the “it’s a protocol” fire.
Our “Technology Is A Protocol” rubric:
- Embeds into every part of the application stack; front, back, glue in between.
- Provides a common language to facilitate communication between layers.
- Opens a channel for external communication.
- Serves as raw scaffolding on which developers can build products.
Embed everywhere
Need a data-wrangling engine for the front-end layer? DuckDB WASM. Need a transformation workhorse for ingest pipelines? DuckDB via Python. Need a blazing-fast query engine for an enterprise lakehouse? DuckDB in a container. Need a one-off or scheduled process that generates Parquet and squirrels it away in object storage for later? DuckDB in a Lambda. No question on this one, we pass the embed test.
Common language for all layers
All databases have their own SQL syntax and DuckDB is no different. DuckDB’s is ergonomic and dripping with syntactic sugar, but a SQL syntax does not a protocol make. Yet, there is something special here, a compounding effect. DuckDB’s embed-ability, combined with its SQL syntax, creates a universal language that all layers of the stack speak. Having a single data language, without transmogrification, within each component of your application stack is as much of an unlock as is, say, using gRPC/Protocol Buffers (a protocol!) to pass raw data around.
A tiny tangent
Speaking of Protobufs, maybe the best way to show that DuckDB is a protocol is to take something that is definitively a protocol and draw parallels. Protobufs give us benefits like cross-programming-language compatibility, generated clients, and strong typing. With DuckDB we get similar, but not identical, effects. Cross-programming-language compatibility is replaced by the fact that DuckDB SQL is DuckDB SQL regardless of whether you are calling it from C, Python, or Node. “Generated clients” are served by DuckDB releases that speak said syntax, available for virtually any programming language and operating system you might be building on. Finally, strong typing guarantees are made by virtue of the fact that DuckDB SQL run in different environments, but against the same dataset, will result in consistent output (excepting of course the situations listed here).
Back to the rubric, connecting with others
Embed-ability and common language are the obvious givens, but to see the true power of DuckDB as a protocol we need to make guesses as to what is going to happen in the near future. How long will it be until AWS replaces S3 Select 🪦 with S3 Query powered by DuckDB reading S3 Tables? Inevitable. How long will it be until someone cracks the “distributed DuckDB” puzzle (getting closer, Smallpond!) and knocks down the final “range anxiety” data volume mental trap locking people into their legacy data warehouses? It’s coming.
There are countless more future DuckDB scenarios, but they all arrive at the same point. These unlocks, tools, and next-generation systems are being built on the same query execution engine and speaking the same SQL syntax. It doesn’t matter what the rails of communication are between them. HTTP, gRPC, ODBC, or Arrow Flight, they will all work seamlessly together because DuckDB is itself the protocol and the communication rails are just an implementation detail.
But can you build on it?
The final point, and the one that matters most: does this technology serve as scaffolding for developers to build products? I don’t think we’ve ever seen a tool, specifically in the data space, that has been adopted into products as quickly as we’re seeing with DuckDB. Speaking from our own experience, by using DuckDB in every part of our product we’ve been able to build BI, warehousing, and ETL at what seems like an unbelievable pace.
Confirmed, it’s a protocol
All checks passed: DuckDB is a protocol. We believe embracing this view helps data practitioners maximize the value they get from DuckDB and stay ahead of coming industry trends. We believe DuckDB is a protocol worth building products on top of.
Speaking of building, in the next few days we’ll be announcing more about our company, our DuckDB-powered products, and our protocol-enabled roadmap.
See you soon.