Sarah Schlueter

Posted on Jan 10, 2024

Getting Started with Pandas Library

#python #pandas #tutorial #beginners

I recently completed the US States Game project from the Python Pro Bootcamp course on Udemy, in which we used the Pandas library to read data from a .csv file and build the game’s functionality.

Since there is a lot you can do with Pandas, I wanted to dive into a few of the things I learned and solidify the concepts. Pandas also has really great documentation, and I encourage you to check it out: https://pandas.pydata.org/docs

Installing & Importing Pandas into Your Project

Installing and importing Pandas is easy. For this post, I will be working in PyCharm.

Start by importing Pandas into your project by writing the following line of code at the top of your file:

import pandas

In PyCharm, you may see a red squiggly line under pandas, which means the Pandas package still needs to be installed. On macOS, you can hover over it, click the red light bulb that pops up, and select ‘Install package pandas’. On Windows, you can install the package by opening the ‘Python Packages’ window, searching for pandas, and installing the current stable version (listed in the docs). At the time of this writing, the current version is 2.1.4.

💡 Note: We can also write import pandas as pd so that we can use the shorthand pd in our code instead of writing out pandas each time.

Handling Import Errors

I did not come across any errors using Pandas in PyCharm, but when I opened my project in VS Code, I ran into errors when running the program. This was because VS Code was using the system Python interpreter instead of a virtual environment. It is recommended to use Pandas from a virtual environment (venv) rather than a system environment. I won’t go into great detail here, but you’ll want to make sure your Python interpreter is set up in a virtual environment, as recommended in the documentation.
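If you are not sure which interpreter your editor is using, a quick check like the one below can help. This is only a sketch using the standard library; the variable names are my own, and it assumes a virtual environment created with venv.

```python
import importlib.util
import sys

# Path of the Python interpreter that is running this script.
interpreter_path = sys.executable

# Inside a venv, sys.prefix points at the virtual environment while
# sys.base_prefix still points at the base Python installation.
in_virtual_env = sys.prefix != sys.base_prefix

# Check whether this interpreter can find pandas, without importing it.
pandas_available = importlib.util.find_spec("pandas") is not None

print(f"Interpreter: {interpreter_path}")
print(f"Virtual environment active: {in_virtual_env}")
print(f"pandas importable: {pandas_available}")
```

If the last line prints False, the package is installed for a different interpreter than the one running your code.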

Basic Operations

Creating a DataFrame

Pandas uses a data structure called a DataFrame, which is a two-dimensional, labeled table of rows and columns. You can think of it as a dictionary in which each key is a column name and each value is that column’s data. You can create a DataFrame manually or from a .csv file.

To create a DataFrame manually, we can start by creating a dictionary of data:

user_list = {
    "first name": ["Scott", "Kevin", "Johnny"],
    "last name": ["Woodruff", "Bong", "Cosmic"],
    "email address": ["scott@example.com", "kevin@example.com", "johnny@example.com"],
    "user id": [123, 234, 345]
}

Then we will create a variable called user_data and call pandas.DataFrame on our user_list.

After that, we can call the to_csv() method on our user_data DataFrame object and pass in the name of the file we want to create:

user_data = pandas.DataFrame(user_list)
user_data.to_csv("user_data.csv")

After we hit run, a .csv file called user_data appears in the same directory as our main.py file. Opening it up, we can see that our key names are the column names and each row is indexed. All of the values are separated by commas, as .csv stands for Comma Separated Values.
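The numbers at the start of each row are the DataFrame’s index, which to_csv() writes by default. If you would rather leave that column out, you can pass index=False. Here is a minimal sketch of the difference, using a trimmed two-column stand-in for user_data:

```python
import pandas

# Trimmed stand-in for the user_data DataFrame from the example above.
user_data = pandas.DataFrame({
    "first name": ["Scott", "Kevin", "Johnny"],
    "user id": [123, 234, 345],
})

# With no file name, to_csv() returns the CSV text instead of writing a file.
with_index = user_data.to_csv()
without_index = user_data.to_csv(index=False)

print(with_index)     # header starts with an empty name for the index column
print(without_index)  # header starts directly with "first name"
```

The same index=False argument works when you pass a file name, such as "user_data.csv".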

We can also open it up in a spreadsheet and see the values there as well:

Let’s say we already have a .csv file of data that we want to use. To start working with that data in your project, you can place the .csv file into your project folder, and then use the read_csv() method. Here, I have a sample dataset of housing info I grabbed from PyDataset that I created a .csv file from just to show:

💡 Note: you can access sample datasets with PyDataset by adding from pydataset import data to the top of your file and choosing the data you want to work with. You can find more info on how to do this here. It’s pretty easy.

To start working with our data, we need to pass the name of our file to pandas.read_csv() and save the result into a variable:

data = pandas.read_csv("housing_data.csv")

You’ll want to make sure that the spelling and file path match the file exactly; otherwise, you’ll get an error.
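To see what that error looks like, here is a small sketch that writes a made-up two-row file to a temporary folder, reads it back, and then tries a misspelled name. The file names and values are placeholders of my own, not the real housing data.

```python
import tempfile
from pathlib import Path

import pandas

# Made-up stand-in rows; the real housing_data.csv is not reproduced here.
sample = pandas.DataFrame({"price": [42000.0, 38500.0], "bedrooms": [3, 2]})

with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "housing_sample.csv"
    sample.to_csv(csv_path, index=False)

    # A path that exists is read into a DataFrame.
    loaded = pandas.read_csv(csv_path)

    # A misspelled file name raises FileNotFoundError.
    try:
        pandas.read_csv(Path(folder) / "housing_sampel.csv")
        error_message = None
    except FileNotFoundError as err:
        error_message = str(err)

print(loaded)
print(error_message)
```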

Now that we have our housing data in a variable, we can work with it. You’ll notice if you print data you can see the output shows a nice table with the column names and values.

You can confirm that we’ve created a DataFrame by checking the type of data and seeing that the output in the console says it’s a pandas.core.frame.DataFrame.
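A quick way to run that check, along with a couple of other attributes I find handy for a first look, is sketched below. The small DataFrame is a made-up stand-in for the housing data.

```python
import pandas

# Made-up stand-in for the housing data loaded with read_csv().
data = pandas.DataFrame({"price": [42000.0, 38500.0], "bedrooms": [3, 2]})

# Print the type of the object.
print(type(data))

# isinstance() gives a True/False answer you can use in code.
is_frame = isinstance(data, pandas.DataFrame)

# shape is (number of rows, number of columns); columns holds the column names.
row_count, column_count = data.shape
column_names = list(data.columns)

print(is_frame, row_count, column_count, column_names)
```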

Accessing Data by Column or Row

Each column in a DataFrame is called a Series and there are loads of different things you can do with them. There is a whole section in the docs dedicated to Series here.

Pandas exposes a column as an attribute when its name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so we can access a whole column using dot notation after our DataFrame variable name. In our case, if we want to get only the column of prices, we can write:

print(data.price)

This will return just the items in the price column:
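Dot notation only works when the column name is a valid Python attribute name. For column names with spaces, like the ones in the user_list example earlier, bracket notation is the way to go. A small sketch, using a trimmed stand-in for that DataFrame:

```python
import pandas

# Trimmed stand-in for the user_data DataFrame from earlier in the post.
user_data = pandas.DataFrame({
    "first name": ["Scott", "Kevin", "Johnny"],
    "user id": [123, 234, 345],
})

# Bracket notation works for any column name, including ones with spaces.
# (user_data.first name would be a syntax error.)
first_names = user_data["first name"]

print(first_names)
print(type(first_names))  # a single column comes back as a Series
```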

To access a particular row of data, we put a condition inside brackets after our variable. The condition compares a column (accessed with dot notation, as before) to the value we are looking for, and Pandas returns every row where that comparison is true. For example:

print(data[data.price == 38500.0])

This will return:

You can see the row of data here where the price column equals 38500.0 as indicated in the code.
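It helped me to split this into two steps. The comparison on its own produces a Series of True/False values, and passing that Series into the brackets keeps only the True rows. Here is a sketch with three made-up rows standing in for the housing data:

```python
import pandas

# Made-up stand-in rows for the housing data.
data = pandas.DataFrame({
    "price": [42000.0, 38500.0, 49500.0],
    "bedrooms": [3, 2, 3],
})

# Step 1: the comparison gives one True/False value per row.
mask = data["price"] == 38500.0
print(mask.tolist())  # [False, True, False]

# Step 2: indexing with the mask keeps only the rows marked True.
matching_rows = data[mask]
print(matching_rows)
```

Note that the matching row keeps its original index label (1 here) rather than being renumbered from 0.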

Filtering and manipulating data

Now that we know how to obtain specific columns and rows, let’s use that to filter our data. Let’s say we want to find the lowest and highest home prices in this list. We can use the min() and max() methods for this.

print(data.price.max())
print(data.price.min())

OUTPUT:
190000.0
25000.0
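max() and min() return only the values. If you also want the rest of the row for the cheapest or most expensive home, one option is idxmin() and idxmax(), which return the index label of that row so you can look it up with .loc. A sketch with made-up rows:

```python
import pandas

# Made-up stand-in rows for the housing data.
data = pandas.DataFrame({
    "price": [42000.0, 25000.0, 190000.0],
    "bedrooms": [3, 2, 4],
})

# idxmin()/idxmax() give the index label of the smallest/largest value.
cheapest_label = data["price"].idxmin()
priciest_label = data["price"].idxmax()

# .loc looks up a full row by its index label.
cheapest_home = data.loc[cheapest_label]
priciest_home = data.loc[priciest_label]

print(cheapest_home)
print(priciest_home)
```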

Now say we are looking for a house and would like to have a list of only the homes that fit our budget and criteria. Let’s create a list that only contains homes that are under $50,000 and have exactly 2 bedrooms and 1 bathroom.

Step 1: set the criteria as variables

max_price = 50000.0
num_bedrooms = 2
num_bathrooms = 1

Step 2: Apply the filters

filtered_data = data[(data['price'] < max_price) &
                     (data['bedrooms'] == num_bedrooms) &
                     (data['bathrms'] == num_bathrooms)]

Step 3: Create new .csv file with the filtered values

filtered_data.to_csv("our_home_choices.csv")

Our new file will appear in our project folder. We can open it to see that it includes only homes that are under $50,000 and have 2 bedrooms and 1 bathroom:

We can also check to see how many records there are in our new list by running the following line of code:

print(len(filtered_data))

In this case, it returns 61. So we can see that we have 61 homes to choose from with the criteria that we set. This can be further filtered down with the same method using the other columns as attributes.
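Because the filter above uses ==, it only matches homes with exactly 2 bedrooms and 1 bathroom. If you want "at least" instead, you can swap in >=. Here is a sketch with four made-up rows; the min_bedrooms and min_bathrooms names are my own.

```python
import pandas

# Made-up stand-in rows for the housing data.
data = pandas.DataFrame({
    "price": [42000.0, 38500.0, 49500.0, 60500.0],
    "bedrooms": [3, 2, 1, 3],
    "bathrms": [1, 1, 1, 2],
})

max_price = 50000.0
min_bedrooms = 2
min_bathrooms = 1

# Each condition needs its own parentheses because & binds more
# tightly than the comparison operators.
at_least_data = data[(data["price"] < max_price) &
                     (data["bedrooms"] >= min_bedrooms) &
                     (data["bathrms"] >= min_bathrooms)]

print(at_least_data)
print(len(at_least_data))  # 2
```

The third row drops out because it has only 1 bedroom, and the fourth because it is over budget.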

Some Other Useful Methods

to_dict() Method

The to_dict() method is used to convert a DataFrame into a dictionary. This can be particularly useful when you need to transform your data into a format that is more compatible with certain types of processing. The method offers various formats for the dictionary, like orienting by rows, columns, or even index.

Example:

# Converting the entire DataFrame to a dictionary
data_dict = data.to_dict()

# Converting the DataFrame to a dictionary with a specific orientation
data_dict_oriented = data.to_dict(orient='records')  # Each row becomes a dictionary
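To make the difference between the two calls concrete, here is what each one produces for a made-up two-row DataFrame:

```python
import pandas

# Made-up stand-in rows for the housing data.
data = pandas.DataFrame({"price": [42000.0, 38500.0], "bedrooms": [3, 2]})

# Default orientation: column name -> {index label -> value}
by_column = data.to_dict()
print(by_column)
# {'price': {0: 42000.0, 1: 38500.0}, 'bedrooms': {0: 3, 1: 2}}

# orient='records': one dictionary per row, collected in a list
by_row = data.to_dict(orient="records")
print(by_row)
# [{'price': 42000.0, 'bedrooms': 3}, {'price': 38500.0, 'bedrooms': 2}]
```

The 'records' form is the one I would reach for when looping over rows one at a time.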

to_list() Method

The to_list() method is used with Pandas Series objects to convert the values of the Series into a list. This is particularly useful when you need to extract the data from a DataFrame column for use in a context where a list is more appropriate, such as in loops, certain calculations, or data visualizations.

Example:

# Converting a DataFrame column to a list
price_list = data['price'].to_list()

# Using the list for further operations
average_price = sum(price_list) / len(price_list)
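For an average specifically, you do not have to convert to a list first, since a Series has its own mean() method. A sketch comparing the two approaches on made-up prices:

```python
import pandas

# Made-up stand-in prices for the housing data.
data = pandas.DataFrame({"price": [42000.0, 38500.0, 49500.0]})

# Approach from above: convert to a list and average by hand.
price_list = data["price"].to_list()
manual_average = sum(price_list) / len(price_list)

# Built-in alternative: let the Series compute its own mean.
builtin_average = data["price"].mean()

print(manual_average)
print(builtin_average)
```

One difference to be aware of: mean() skips missing values (NaN) by default, while the manual version would return nan if the list contained one.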

Both of these methods, to_dict() and to_list(), are part of Pandas' powerful suite of tools that make data manipulation and conversion simple and efficient, allowing for a smooth workflow between different data formats and structures.

Conclusion

I hope you enjoyed learning a bit about some of the things you can do with the Python Pandas library. This is just the tip of the iceberg, as there are many more capabilities to explore. I am excited to continue learning and sharing with you. I also encourage you to explore the documentation and experiment with your own datasets!

If you have any questions or if there's anything else you'd like to know about Pandas, please don't hesitate to reach out. I'll do my best to write a post about it. Writing also helps me learn more! 😊

Thanks for reading and happy coding!

Connect with me:
Twitter: @sarah_schlueter
Discord: sarahmarie73