Tool Spotlight: DuckDB

This article was originally posted on LinkedIn and was only recently moved here.
DuckDB is a free, open-source, lightweight relational database management system designed for OLAP workloads, which are characterized by complex, longer-running queries issued less frequently than OLTP workloads. It was initially developed by the Database Architectures group at CWI in Amsterdam.
Drawing inspiration from SQLite, DuckDB has no dependencies and runs in-process, which means it does not have to make any remote calls to external processes, making it very fast and portable. Despite its simplicity, it supports complex SQL queries with complex data types like maps and structs, and boasts a decent function library. It also has client APIs in popular languages like Python, Java, C++, and Node.js.
One area where it particularly shines is interactive data analysis. Data professionals typically use tools like Python and R for exploratory data analysis as well as insight discovery and reporting. Libraries like Pandas and dplyr have become very popular for these kinds of tasks, and DuckDB complements them by interoperating with them. For example, DuckDB can read various file formats such as Parquet and CSV and query them using SQL, and the query results can be returned as Pandas dataframes. In addition to SQL, there is a relational API for dataframe-style querying via method chaining.
To explore how easy it is to install and use DuckDB, first we install it (assuming a satisfactory Python environment already exists):
pip install duckdb==0.5.1
Next, we can import it and create a connection object:
import duckdb
conn = duckdb.connect()
We can create a table within the main schema, which is the default namespace, and populate it using the popular Iris dataset (see References):
conn.execute("""
CREATE TABLE IF NOT EXISTS iris_data(sepal_length FLOAT,
sepal_width FLOAT,
petal_length FLOAT,
petal_width FLOAT,
flower_class VARCHAR );
COPY iris_data FROM '~/downloads/iris.data' ( DELIMITER ',', HEADER );
""")
This table can then be queried with SQL as normal. The result can be converted into a Pandas dataframe for easy viewing in an interactive environment like a Jupyter notebook:
result_df = conn.execute('SELECT * FROM iris_data LIMIT 5').fetch_df()
We can use aggregations as well:
print(conn.execute("""
    SELECT flower_class,
           COUNT(*) AS flower_group_cnt
    FROM iris_data
    GROUP BY 1
""").fetchall())
With results that can be easily unpacked:
[('Iris-versicolor', 50), ('Iris-virginica', 50), ('Iris-setosa', 49)]
The relational API is still under active development, but it provides an alternative to SQL for querying. The same grouping as above can be done as follows:
# creating the relation object
iris_rel = conn.table("iris_data")
# querying the relation object
iris_rel.aggregate("flower_class, count()")
The SQL example above was quite simple, but it does not reflect the full power of DuckDB, which can handle much more complex SQL queries:
res = conn.execute("""
WITH init AS (
SELECT *,
petal_length + petal_width AS petal_area
FROM iris_data
)
SELECT flower_class,
petal_length,
petal_width,
       AVG(petal_area)
           OVER (PARTITION BY flower_class)
           AS class_petal_area_avg
FROM init
QUALIFY class_petal_area_avg >= 2.5
LIMIT 5
""").fetchdf()
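For comparison, the same per-class average broadcast back to every row can be expressed in Pandas with groupby().transform(), which is roughly what the window function above does in SQL. The values below are arbitrary stand-ins:

```python
import pandas as pd

# Toy data standing in for the Iris table (hypothetical values).
df = pd.DataFrame({
    "flower_class": ["a", "a", "b", "b"],
    "petal_length": [1.0, 2.0, 4.0, 5.0],
    "petal_width":  [0.5, 0.5, 1.0, 1.0],
})

# Derived column, then the class-wide average broadcast to each row,
# mirroring AVG(...) OVER (PARTITION BY flower_class).
df["petal_area"] = df["petal_length"] + df["petal_width"]
df["class_petal_area_avg"] = (
    df.groupby("flower_class")["petal_area"].transform("mean")
)

# Equivalent of the QUALIFY clause: keep rows by the windowed value.
filtered = df[df["class_petal_area_avg"] >= 2.5]
print(filtered)
```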
Finally, the database can be exported to a directory in a given file storage format such as Parquet. The directory contains the data files along with two other files: schema.sql, which holds the table definitions, and load.sql, which holds the COPY commands to load the data into the tables.
This command:
conn.execute("""
EXPORT DATABASE '/Users/test_space/iris_db'
(FORMAT PARQUET);
""")
produces the following directory:
iris_db
├── iris_data.parquet
├── load.sql
└── schema.sql
This directory can then be used with an IMPORT DATABASE command to easily get the database back up and running.
Even from this little demo, the features of DuckDB are quite exciting and have the potential to change the way interactive analytics is done, by supporting larger datasets than typical tools like Pandas without the need for distributed platforms like Spark. However, many tools come and go, and if this one is to remain, it will need to keep providing value to users.
What do you think are the best features of this tool and how are you using it? Would love to hear your thoughts!
The code snippets above can be found in this Colab notebook
References
https://duckdb.org/docs/
https://duckdb.org/pdf/CIDR2020-raasveldt-muehleisen-duckdb.pdf
https://archive.ics.uci.edu/ml/datasets/Iris
