Nageen Yerramsetty

Originally published at learn2infiniti.com

Using DuckDB for Ad-Hoc Analysis: A SQL-Lover's Alternative to Pandas

Ad-hoc analysis is an integral part of working in the data field. We regularly have to combine data from various sources like CSV files, Parquet files and databases for some ad-hoc testing or quick reporting. The most common tool at our disposal is Python's Pandas, where we read data from different sources into dataframes and then do the analysis. However, for someone who is more comfortable with SQL than Python, DuckDB is an excellent alternative. It lets you query data using SQL, without needing to load it into a database or convert it into a Pandas dataframe.

But first, what is DuckDB?

DuckDB is a modern, in-process analytical database. It supports a feature-rich SQL dialect and, thanks to its columnar engine, it is blazingly fast. DuckDB is super quick to install (yes, you can get it up and running in less than a minute). Unlike traditional databases, it doesn't require a server, meaning you can embed it directly into your applications or run it on your local machine with minimal setup. It can read and write file formats such as CSV, Parquet, and JSON, both on the local file system and on remote endpoints such as S3 buckets. You can also pull data from databases like MySQL, BigQuery and others.
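
To make this concrete, here is a minimal sketch of querying a local Parquet file directly (sales.parquet and its region and amount columns are made-up examples):

-- Aggregate a Parquet file in place, no import step needed
-- (sales.parquet, region and amount are hypothetical names)
SELECT region,
       SUM(amount) AS total_amount
FROM read_parquet('sales.parquet')
GROUP BY region;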

Why use DuckDB for Ad-Hoc Analysis?

If you're comfortable with SQL, DuckDB lets you connect to various sources and process data using its SQL dialect. Given its in-memory columnar engine, it is extremely fast (yes, it leaves Pandas in the dust). It handles medium-sized data loads (up to a few GBs) comfortably on your local machine, and it can even handle larger-than-memory workloads by spilling to disk. I hope I have convinced you how cool DuckDB is.
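
If you do push past your machine's RAM, a couple of settings control that spilling behaviour. A small sketch, with example values you would tune yourself:

-- Cap DuckDB's memory usage; work beyond this spills to disk
SET memory_limit = '4GB';

-- Directory for temporary spill files (example path)
SET temp_directory = '/tmp/duckdb_spill';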

If you prefer using a SQL client, DBeaver currently supports DuckDB. Simply select DuckDB on the connection page in DBeaver and either point it at a file location on your local machine if you want persistent storage, or run it completely in-memory by entering ":memory:" as the path. More detailed instructions here.

Once DBeaver is connected to your DuckDB, you can read from a CSV file using the following (yes, that simple!):

SELECT * FROM read_csv('file_name.csv');
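
read_csv auto-detects the delimiter, header and column types, but you can also spell the options out when auto-detection needs help (the option values below are just examples):

-- Override the auto-detection explicitly
SELECT * FROM read_csv('file_name.csv', header = true, delim = ',');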

You can connect to a MySQL database and read from the database directly using:

INSTALL mysql;
LOAD mysql;

ATTACH 'host=host_name user=user_name port=3306 database=db_name password=password' AS mysql_db (TYPE MYSQL, READ_ONLY);

The above lines install and load the MySQL extension, then create a connection to your MySQL database in read-only mode. "mysql_db" is the alias you give the connection, and you use it to qualify table names.
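
As a quick sanity check, you can list the tables DuckDB now sees through the attached connection:

-- Lists tables across all attached databases, including mysql_db
SHOW ALL TABLES;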

To read a table from the MySQL database, simply use the connection alias with your regular SQL syntax (I know, equally simple!):

SELECT *
FROM mysql_db.table_name;
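
If you will hit the same table repeatedly, a handy pattern is to snapshot it into a local DuckDB table so MySQL is queried only once (a sketch; local_employees is a made-up name):

-- Materialize a local copy of the remote table
CREATE TABLE local_employees AS
SELECT * FROM mysql_db.employees;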

Now to the interesting part. Let's say you have a departments.csv file that holds departments and department codes, like the below:

department_code | department_name
FI              | Finance
HR              | Human Resources

And let's say you have an employees table in your MySQL database that holds the employee name and the department code, as below:

department_code | employee_name
FI              | Raghu
HR              | Himesh

To combine both in DuckDB to get the employee name and department name, we can use the code below.

-- Connect to the MySQL database
ATTACH 'host=host_name user=user_name port=3306 database=db_name password=password' AS mysql_db (TYPE MYSQL, READ_ONLY);

-- Enable filter pushdown
SET mysql_experimental_filter_pushdown = true;

-- Join the CSV file with the MySQL table
SELECT d.department_name,
       e.employee_name
FROM read_csv('departments.csv') d
INNER JOIN mysql_db.employees e
     ON d.department_code = e.department_code;

Setting "mysql_experimental_filter_pushdown" to true will push down any filters to the database and only read the filtered data out. You can use simple SQL dialects to read from multiple sources and combine them in a single SQL statement using DuckDB. I will leave you here to play with DuckDB yourself and explore what possibilities it offers.
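
To see the pushdown in action, add a filter on the remote table; with the setting enabled, the WHERE clause is evaluated inside MySQL rather than after the rows reach DuckDB (a sketch against the same example tables):

-- Only rows matching the filter travel over the wire
SELECT e.employee_name
FROM mysql_db.employees e
WHERE e.department_code = 'FI';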

To leave you with more inspiration, here is a blog post on how the author handled 450 GB of data in DuckDB - https://towardsdatascience.com/my-first-billion-of-rows-in-duckdb-11873e5edbb5. And the cherry on top is that you can save costs as well, by pulling data to your local machine and processing it in DuckDB.
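
One simple way to do that cost-saving pull is to cache a remote query's result as a local Parquet file and work against that from then on (a sketch; the file name is a made-up example):

-- Cache the remote table locally as Parquet
COPY (SELECT * FROM mysql_db.employees)
TO 'employees.parquet' (FORMAT parquet);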

And you can read more on what is possible in DuckDB from their documentation here - https://duckdb.org/docs/

Do comment and share your thoughts on how you want to use DuckDB in your day-to-day work. Let's wait and watch how this amazing piece of technology will evolve in the coming years.

Thanks for reading. Do share any comments and feedback!
