Recently I had to import CSV files into an external application at work, but the application couldn't process all the rows: it kept triggering timeouts after a few minutes. The files weren't exactly huge (only 8 MB and at most 200,000 lines), but that was already too much for this application... 😢
I had no control over the external software, but I needed smaller files. I could have adapted the application that creates them, but that would have meant fetching the data from several APIs again, which takes a long time. ⌛
Instead, I decided to try splitting the files directly. After a quick web search, I came across a Gist in Python that does exactly what I wanted. My CSV files used semicolons (;) instead of commas (,) as the delimiter, so I only had to adapt the script slightly to accommodate this (delimiter=';').
Here is my adapted version, also saved in a forked Gist, in case I need it later:
import csv
import sys
import os
# example usage: python split.py example.csv 200
# the above command splits `example.csv` into smaller CSV files of 200 rows each (the header row is repeated in each output file)
# if example.csv has 401 rows for instance, this creates 3 files in the same directory:
# - `example_1.csv` (row 1 - 200)
# - `example_2.csv` (row 201 - 400)
# - `example_3.csv` (row 401)
CURRENT_DIR = os.path.dirname(os.path.realpath(__file__))
filename = sys.argv[1]
full_file_path = os.path.join(CURRENT_DIR, filename)
file_name = os.path.splitext(full_file_path)[0]
rows_per_csv = int(sys.argv[2]) if len(sys.argv) > 2 else 5000
with open(filename) as infile:
    reader = csv.DictReader(infile, delimiter=';')
    header = reader.fieldnames
    rows = [row for row in reader]

pages = []
row_count = len(rows)
start_index = 0

# here, we slice the total rows into pages, each page having [rows_per_csv] rows
while start_index < row_count:
    pages.append(rows[start_index: start_index + rows_per_csv])
    start_index += rows_per_csv

for i, page in enumerate(pages):
    with open('{}_{}.csv'.format(file_name, i + 1), 'w+') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=header, delimiter=';')
        writer.writeheader()
        for row in page:
            writer.writerow(row)

print('DONE splitting {} into {} files'.format(filename, len(pages)))
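The script loads every row into memory before writing the chunks, which was perfectly fine for my 200,000-line files. If I ever had to split something much larger, a streaming variant that writes each chunk while reading the input should also work. The sketch below is only my own rough take on that idea under the same assumptions (semicolon delimiter, header repeated in every output file, a made-up split_streaming.py name); it is not part of the original Gist:

import csv
import sys

# rough sketch (my assumption, not from the original Gist): stream the rows
# and write each chunk as it is read, so the whole input never sits in memory
# usage: python split_streaming.py example.csv 200
filename = sys.argv[1]
rows_per_csv = int(sys.argv[2]) if len(sys.argv) > 2 else 5000
base_name = filename.rsplit('.', 1)[0]

with open(filename, newline='') as infile:
    reader = csv.DictReader(infile, delimiter=';')
    header = reader.fieldnames

    outfile = None
    writer = None
    file_index = 0

    for row_index, row in enumerate(reader):
        # start a new output file every `rows_per_csv` rows
        if row_index % rows_per_csv == 0:
            if outfile is not None:
                outfile.close()
            file_index += 1
            outfile = open('{}_{}.csv'.format(base_name, file_index), 'w', newline='')
            writer = csv.DictWriter(outfile, fieldnames=header, delimiter=';')
            writer.writeheader()
        writer.writerow(row)

    if outfile is not None:
        outfile.close()

print('DONE splitting {} into {} files'.format(filename, file_index))

The output files should be the same; the only difference is that rows are written as they are read instead of being collected into one big list first.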
Thanks to Kelvin Tay for his script; it saved me a lot of time! 🥳