Varun Gujarathi

Exporting a large PostgreSQL table using Python

When working with large datasets, it's common to extract PostgreSQL tables to CSV files for further analysis and sharing. However, naively exporting a large table can crash your machine by consuming all available RAM.

In this article, I'll explain an efficient technique to export big PostgreSQL tables to CSV using Python, based on my experience as a Research Assistant analyzing criminal record relief data.

The Naive Approach

You can export a table like this:


import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")
cursor = conn.cursor()

# Execute a SELECT query
cursor.execute("SELECT * FROM employee")

# Fetch all rows into memory at once
rows = cursor.fetchall()

# Write rows to CSV
with open("employees.csv", "w") as f:
  for row in rows:
    f.write(",".join([str(cell) for cell in row]) + "\n")

However, fetchall() loads the entire table into memory, which will exhaust RAM and crash for a big table!

Streaming Data in Chunks with psycopg2

The key is to stream the data in chunks from the database directly to disk. The psycopg2 module exposes PostgreSQL's COPY command through copy_expert(), which lets us do just that.

First, connect to the database and open the output CSV:


import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")

with open('out.csv', 'w') as f:

Next, use copy_expert() to stream the table to the file in chunks:

  cur = conn.cursor()

  cur.copy_expert('COPY employee TO STDOUT WITH CSV HEADER', f)

The table is streamed to the CSV without loading everything into memory!

Putting It All Together
Here is the full script to efficiently export a large PostgreSQL table to a CSV:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()

with open('out.csv', 'w') as f:
  cur.copy_expert('COPY employee TO STDOUT WITH CSV HEADER', f)

cur.close()
conn.close()

print("Table exported to CSV!")

By streaming the data in chunks directly to a file, we avoid RAM issues and efficiently export the table.

Exporting the Results of a SQL Query

The method above copies all the data from the employee table to a CSV file. If you want to export the results of a specific query instead, you have to modify it slightly, as shown below.

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")

cur = conn.cursor()

# list of departments from which we want all employees
dept_list = ['dept1', 'dept2']

query = """SELECT *
        FROM employee
        WHERE department IN %s"""

query_params = (tuple(dept_list),)

# mogrify() binds the parameters and returns the fully rendered query as bytes
query = cur.mogrify(query, query_params)
query = query.decode('utf-8')

with open('out.csv', 'w') as f:
  # Wrap the rendered query in a COPY ... TO STDOUT statement
  cur.copy_expert("COPY ({}) TO STDOUT WITH CSV HEADER".format(query), f)

print("Query results exported to CSV!")
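As a side note, if you would rather not render the query yourself with mogrify(), here is a minimal sketch of an alternative under the same assumptions (an employee table with a department column and the dept_list above): compose the COPY statement with psycopg2's sql module, using sql.Literal to quote the values and as_string() to turn the composed statement into plain SQL for copy_expert().

from psycopg2 import sql

# Compose the inner SELECT with safely quoted literal values
inner_query = sql.SQL("SELECT * FROM employee WHERE department IN ({})").format(
  sql.SQL(", ").join(sql.Literal(d) for d in dept_list)
)

# Wrap it in COPY ... TO STDOUT and render it to a plain SQL string
copy_stmt = sql.SQL("COPY ({}) TO STDOUT WITH CSV HEADER").format(inner_query)

with open('out.csv', 'w') as f:
  cur.copy_expert(copy_stmt.as_string(conn), f)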

This technique has helped me quickly extract large criminal record relief datasets from PostgreSQL databases for analysis. Let me know if you have any other tips and tricks!

Top comments (3)

Aman Kumar

\COPY my_table TO 'filename' CSV HEADER

Aarya Jha

Hey, could you also elaborate on how we can control the chunk size being processed here?

Varun Gujarathi

There is no built-in way to control the chunk size; the COPY protocol is designed for efficient bulk transfer as a single stream. If you do need that control, you can either tune the PostgreSQL server configuration or iterate through the data in batches yourself, as sketched below.
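For example, here is a minimal sketch of the batching approach, assuming the same employee table and connection string as in the article. It uses a psycopg2 server-side (named) cursor with fetchmany(), which keeps the result set on the server and pulls a fixed number of rows per batch; it is slower than COPY but gives you explicit control over the chunk size.

import csv
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")

# A named (server-side) cursor keeps the result set on the server
# and fetches it in batches instead of all at once
cur = conn.cursor(name='employee_export')
cur.execute("SELECT * FROM employee")

with open('out.csv', 'w', newline='') as f:
  writer = csv.writer(f)
  while True:
    rows = cur.fetchmany(10000)  # chunk size: 10,000 rows per batch
    if not rows:
      break
    writer.writerows(rows)

cur.close()
conn.close()
print("Table exported to CSV in chunks!")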