Every once in a while, I’ll have the need to load data from a spreadsheet into a Python program, but one question always comes up: what’s the best way to parse a spreadsheet in Python? The goal of today’s article is to find out!
Problem Introduction
Recently, I was learning a visualization library in Python called VTK, and I needed to find a way to visualize some data from a spreadsheet. Unfortunately, I had two problems:
- I didn’t know how to read a spreadsheet
- I didn’t know how to parse the data that I read
In other words, what are some good ways to read spreadsheet data? And, what are some good ways to model that data in a Python program?
Parsing CSV Files
That first question will be the target of this article. In particular, we’ll be looking at various ways to read data from a CSV file. For example, does it make sense to try to write our own parsing implementation? After all, a CSV is one of the easier file formats to parse (as seen below), and Python is great for working with strings:
Name,Age,Favorite Color
Jeremy,25,Blue
Ally,41,Magenta
Jasmine,29,Aqua
That said, we may prefer to use some of the utilities provided by Python like the csv package. What’s the point of reinventing the wheel when there’s a battle-tested library built right into the language?
As usual, we’ll tackle a handful of solutions and discuss their pros and cons. By the end of this article, you should feel comfortable parsing CSV files yourself.
Internal Representation
As for the second question, we have a couple of options. For starters, we could interpret the spreadsheet as a giant list of lists. In this example, we could give each row of data its own list and store these rows in a list:
# Heading: Name, Age, Favorite Color
csv_matrix = [
["Jeremy", 25, "Blue"],
["Ally", 41, "Magenta"],
["Jasmine", 29, "Aqua"]
]
Alternatively, we could give each column of data its own list and store those lists in a list:
# Heading: Name, Age, Favorite Color
csv_matrix = [
["Jeremy", "Ally", "Jasmine"],
[25, 41, 29],
["Blue", "Magenta", "Aqua"]
]
In either case, we’d have a giant matrix of data that would closely resemble the original CSV file.
Of course, I’m a bit partial to dictionaries, so I might like to use one of those. For instance, what’s stopping us from creating a dictionary where each key provides us with an entire column of data?
csv_dict = {
"Name": ["Jeremy", "Ally", "Jasmine"],
"Age": [25, 41, 29],
"Favorite Color": ["Blue", "Magenta", "Aqua"]
}
Alternatively, we could flip the relationship so we’re storing a list of dictionaries. That way, the data rows are mappings:
csv_mapping_list = [
    {
        "Name": "Jeremy",
        "Age": 25,
        "Favorite Color": "Blue"
    },
    {
        "Name": "Ally",
        "Age": 41,
        "Favorite Color": "Magenta"
    },
    {
        "Name": "Jasmine",
        "Age": 29,
        "Favorite Color": "Aqua"
    }
]
Personally, I prefer this last representation because data samples stick together. In other words, there’s no risk of distorting the original data set during actions like sorting. Meanwhile, the other three representations rely on parallel lists that have to be kept in sync. That’s enough to give this last representation the win.
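To make that concrete, sorting the list of dictionaries keeps every value in a row attached to its neighbors. Here’s a minimal sketch, assuming the csv_mapping_list defined just above:
# Sort people by age; each row travels as a unit, so no column can fall out of sync
people_by_age = sorted(csv_mapping_list, key=lambda row: row["Age"])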
In the next section, we’ll start digging into some parsing solutions which will all leverage this last representation. If you have any questions about the other three, feel free to drop them in the comments below.
Solutions
As always, we’ll cover a handful of common solutions to today’s problem. Feel free to grab what you need and run, but I recommend reading through the pros and cons of each solution. After all, you never know when some of this might be useful.
Parse a Spreadsheet with Brute Force
At its core, a CSV spreadsheet is just a plain text file where values are separated by commas. If we want to parse one by hand, we need to be aware of a few things:
- The header row
- The line endings
- The delimiter (in this case a comma)
In our original example, we had a spreadsheet that looked something like the following:
Name,Age,Favorite Color
Jeremy,25,Blue
Ally,41,Magenta
Jasmine,29,Aqua
To parse it by hand, we’d want to read each line and split it by comma. After that, we’d need to do some post-processing to get the format we want:
csv_mapping_list = []
with open("/path/to/data.csv") as my_data:
    line_count = 0
    for line in my_data:
        row_list = [val.strip() for val in line.split(",")]
        if line_count == 0:
            header = row_list
        else:
            row_dict = {}
            for i, key in enumerate(header):
                row_dict[key] = row_list[i]
            csv_mapping_list.append(row_dict)
        line_count += 1
Here is my very rough attempt at parsing a spreadsheet by hand. In this example, we open the CSV and read it line by line. For each line, we split it by comma and systematically trim each value using a list comprehension.
From there, we decide if the line we just parsed is the header or not. If it is, we save its value for later. Otherwise, we iterate over the row of values and map them into a dictionary using the header. I suppose it may have been simpler to use one of the methods from our How to Convert Two Lists into a Dictionary in Python article:
csv_mapping_list = []
with open("/path/to/data.csv") as my_data:
    line_count = 0
    for line in my_data:
        row_list = [val.strip() for val in line.split(",")]
        if line_count == 0:
            header = row_list
        else:
            row_dict = {key: value for key, value in zip(header, row_list)}
            csv_mapping_list.append(row_dict)
        line_count += 1
In either case, this solution leaves a lot of room for modification. For instance, if your file has some other kind of delimiter, this may be the solution for you.
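For example, if a file happened to use semicolons instead of commas, the only change would be the argument to split. Here’s a rough sketch, assuming a hypothetical semicolon-delimited file at /path/to/data_semicolon.csv:
# Hypothetical: the same brute force loop with the delimiter swapped to a semicolon
csv_mapping_list = []
with open("/path/to/data_semicolon.csv") as my_data:
    line_count = 0
    for line in my_data:
        row_list = [val.strip() for val in line.split(";")]  # only the delimiter changed
        if line_count == 0:
            header = row_list
        else:
            csv_mapping_list.append({key: value for key, value in zip(header, row_list)})
        line_count += 1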
That said, be aware that the brute force solution has a major drawback. As Juha-Matti Santala points out, splitting by comma will fail if one of the rows contains a quoted field with a comma inside it (e.g. "Grifski, Jeremy",25,Blue).
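To see why, here’s a quick demonstration of a naive split on that row; it produces four fields instead of three and leaves the quotes in place:
# A naive split ignores quoting entirely
line = '"Grifski, Jeremy",25,Blue'
print(line.split(","))  # ['"Grifski', ' Jeremy"', '25', 'Blue']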
To make matters worse, this problem changes depending on which delimiter you use. Fortunately, there are better solutions to follow!
Parse a Spreadsheet with the CSV Reader Object
As mentioned previously, we don’t have to write our own CSV parser if we don’t want to. Instead, we can get by with the csv package. As you can probably imagine, it has a ton of CSV parsing functionality. In particular, it contains the reader object which we can use to read a CSV file like we did previously:
import csv
csv_mapping_list = []
with open("/path/to/data.csv") as my_data:
    csv_reader = csv.reader(my_data, delimiter=",")
    line_count = 0
    for line in csv_reader:
        if line_count == 0:
            header = line
        else:
            row_dict = {key: value for key, value in zip(header, line)}
            csv_mapping_list.append(row_dict)
        line_count += 1
With the reader object, we haven’t managed to simplify our code that much. In fact, all we did was replace the list comprehension with the reader instantiation. Regardless, it’s a nice option for those who would prefer not to write their own parser.
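That said, the reader object does solve the quoted-comma problem from the previous section. Here’s a small, self-contained sketch that feeds the troublesome row through csv.reader by way of io.StringIO:
import csv
from io import StringIO

# The reader respects quoting, so the row parses into exactly three fields
problem_row = StringIO('"Grifski, Jeremy",25,Blue')
print(next(csv.reader(problem_row)))  # ['Grifski, Jeremy', '25', 'Blue']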
Parse a Spreadsheet with the CSV DictReader Object
At this point, you’re probably wondering why we would even use the csv library. After all, it barely made a difference in the last two examples. Fortunately, there is a way to reduce our code a bit using a special class in the csv library called DictReader:
import csv
with open("/path/to/dict.csv") as my_data:
    csv_mapping_list = list(csv.DictReader(my_data))
And, there we have it! All the parsing we did in the previous two solutions has been reduced to just three lines of code.
That said, there is a minor difference in this solution. Depending on our version of Python, we may have created a list of OrderedDict objects rather than plain dictionaries (DictReader returned OrderedDict before Python 3.8). Either way, the rows can be used just like dictionaries with their keys in header order, and the elegance of this solution should more than make up for any difference.
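For the curious, here’s a quick sketch of how we might use the parsed rows, assuming the same data.csv from before. Keep in mind that DictReader leaves every value as a string:
# Each row behaves like a regular dictionary keyed by the header values
for row in csv_mapping_list:
    print(row["Name"], "is", row["Age"], "and likes", row["Favorite Color"])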
A Little Recap
At this point, I find it nice to look at all the potential solutions at once:
# Brute force solution
csv_mapping_list = []
with open("/path/to/data.csv") as my_data:
    line_count = 0
    for line in my_data:
        row_list = [val.strip() for val in line.split(",")]
        if line_count == 0:
            header = row_list
        else:
            row_dict = {key: value for key, value in zip(header, row_list)}
            csv_mapping_list.append(row_dict)
        line_count += 1
# CSV reader solution
import csv
csv_mapping_list = []
with open("/path/to/data.csv") as my_data:
    csv_reader = csv.reader(my_data, delimiter=",")
    line_count = 0
    for line in csv_reader:
        if line_count == 0:
            header = line
        else:
            row_dict = {key: value for key, value in zip(header, line)}
            csv_mapping_list.append(row_dict)
        line_count += 1
# CSV DictReader solution
import csv
with open("/path/to/dict.csv") as my_data:
    csv_mapping_list = list(csv.DictReader(my_data))
Up next, I’m looking to write an article on how to sort a list of dictionaries, so look out for that! Until then, thanks for sticking around. Hopefully, this article was helpful. If so, why not give it a share?