Data professionals rely on Exploratory Data Analysis (EDA) to understand a dataset and how its variables relate to one another. There are various tools for performing EDA, but the most important of them all is visualization. Through visualizations, we can quickly see what the data looks like and form assumptions that guide the rest of the analysis.
We will use Google Colab for this demonstration to show that you do not need to install Python locally to uncover insights in your data. Google Colab is a powerful platform that lets you write and execute Python code in your browser, which makes it convenient for your data analysis needs.
Core EDA libraries in Python
Python has numerous libraries tailored for manipulating and analyzing data. Below are some of the libraries that you will need for your EDA:
- Pandas - This library handles loading and cleaning the data
- Numpy - This library supports numerical computations in Python. It works hand in hand with Pandas and is well suited to manipulating large datasets
- Matplotlib and Seaborn - These libraries are used for visualizing the data
Loading the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Loading the data
The data for this demonstration was sourced from the City of Milwaukee's 2023 arms-length property sales dataset.
property_df = pd.read_csv('/content/armslengthsales_2023_valid.csv')
Overview of the data
Head - This function shows the top rows of the data (the first five by default)
property_df.head()
This is an incomplete snapshot of the data (the table is too large to capture in full).
Shape - This shows the number of rows and columns in the data
property_df.shape
(5831, 20)
The output (5831, 20) shows that there are 5831 rows and 20 columns
Data types - This shows the data types of the variables in the dataset.
property_df.dtypes
From the output above, we can see that our dataset contains the data types int64, float64, and object. Specifically, we have 6 categorical variables (object) and 14 numerical variables (int64 and float64).
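As a quick check on these counts (a small addition, not part of the original walkthrough), we can tally the columns by dtype directly:
property_df.dtypes.value_counts()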
Missing values
Another important part of EDA is checking for missing values. Missing values are unknown, unspecified, or unrecorded values in the dataset. In Pandas, missing values are usually represented as NaN.
From the table above, we can see that the CondoProject, Rooms, and Bdrms columns have missing values, shown as NaN.
The best way to see the null values in your dataset is the .info() method:
property_df.info()
We can see from this output that while the dataset has 5831 rows in total, some columns contain fewer than 5831 non-null entries. The variables with missing values include CondoProject, Style, Extwall, Stories, Year_Built, Rooms, FinishedSqft, and Bdrms. Let's count the missing values in each column:
property_df.isna().sum()
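To judge how severe the gaps are, it can also help to express the missing values as a percentage of all rows (a small extra step, not in the original):
# Percentage of missing values per column, largest first
(property_df.isna().mean() * 100).sort_values(ascending=False).round(2)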
When a variable has a large proportion of missing values, we can drop the affected column. For variables with only a few missing values, we can drop the affected rows or impute estimates in place of the missing values, as sketched below.
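Here is a sketch of the imputation alternative (illustration only; we do not apply it in this tutorial, and the choice of median and mode as fill strategies is our assumption):
# Illustration only: impute on a copy instead of dropping rows
imputed_df = property_df.copy()
# Numeric column: fill gaps with the column median
imputed_df['Rooms'] = imputed_df['Rooms'].fillna(imputed_df['Rooms'].median())
# Categorical column: fill gaps with the most frequent value (mode)
imputed_df['Style'] = imputed_df['Style'].fillna(imputed_df['Style'].mode()[0])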
Dealing with the column with the most missing values
From the previous outputs, we saw that the CondoProject variable has more than 80% missing values. The best way to deal with this variable is to drop it entirely, as done below.
# Drop the CondoProject column, which is mostly missing
property_df.drop(columns='CondoProject', inplace=True)
For the remaining variables, which have only small proportions of missing values, we can simply drop the rows that contain them.
property_df = property_df.dropna()
After dropping the CondoProject column and the rows with null values, the row count has dropped from the initial 5831 to 4690, and we have 19 columns instead of the initial 20.
We can confirm this by checking the shape again:
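property_df.shape
(4690, 19)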
Now we have clean data, and we can move on to summary statistics and visualization.
Summary statistics
property_df.describe().T
The summary statistics show the count, mean, standard deviation, minimum, first quartile, median, third quartile, and maximum values for each numerical variable in the dataset.
Univariate analysis
1. Histogram of Year Built variable
sns.histplot(property_df['Year_Built'])
We can see from the histogram that most of the houses were built in the 1950s and 1920s. Since the 1980s, the number of houses built in Milwaukee has been declining.
2. Distribution of Rooms variable
sns.histplot(property_df['Rooms'])
plt.title("Distribution of Rooms Variable")
From the histogram of the rooms variable, we can conclude that many properties have between 5 and 12 rooms. Only a few properties have more than 20 rooms.
3. Distribution of stories variable
sns.histplot(property_df['Stories'])
plt.title("Distribution of Stories variable")
Most of the properties are between 1 and 2 stories tall. There are a few properties that have between 2.5 and 4 stories.
4. Distribution of Sales Price variable
sns.histplot(property_df['Sale_price'])
plt.title("Distribution of Sales Price")
From the plot above, we can see that most properties are concentrated between roughly $15,000 and $500,000. A few properties cost more than $1,000,000, but they are rare.
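Because the distribution is so heavily right-skewed, a log-scaled version of the same histogram (an optional extra, not part of the original walkthrough) can make the long tail easier to read:
sns.histplot(property_df['Sale_price'], log_scale=True)
plt.title("Distribution of Sales Price (log scale)")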
5. Property style distribution
style_count = property_df['Style'].value_counts()
order = style_count.index
plt.figure(figsize=(12, 6))
sns.barplot(x=style_count.index, y=style_count.values, order=order, palette='viridis')
plt.ylabel('Frequency')
plt.title('Frequency of property styles')
plt.xticks(rotation=45)
plt.show()
The frequency plot above shows that the most common property styles in Milwaukee are Ranch and Cape Cod while the least popular property styles are Office and Store buildings.
6. Property type distribution
sns.histplot(property_df['PropType'])
We can see that the most common property type is residential property.
Bivariate analysis
Scatterplot for finished square feet and sales price
plt.figure(figsize=(8, 6))
sns.scatterplot(x='FinishedSqft', y='Sale_price', data=property_df)
plt.title('Relationship between Finished Sqft and Sale Price')
plt.xlabel('Finished Sqft')
plt.ylabel('Sale Price')
plt.show()
Sale Price and Finished Sqft have a positive linear relationship: properties with more finished square feet tend to sell for higher prices.
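To attach a number to this relationship (a quick check we are adding here, not in the original walkthrough), we can compute the Pearson correlation between the two columns:
property_df['FinishedSqft'].corr(property_df['Sale_price'])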
Multivariate analysis
Correlation plot
# Computing the correlation matrix
corr_matrix = property_df.select_dtypes(include='number').drop(columns=['PropertyID', 'taxkey', 'District']).corr()
# Plotting the heatmap of correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Numerical Variables')
plt.show()
The correlation matrix above shows how the variables are related. For instance, we can see that the Rooms and Bdrms variables are highly correlated, with a correlation of 0.86.
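To pull that single pair out of the matrix programmatically (assuming, as above, that the bedrooms column is named Bdrms):
corr_matrix.loc['Rooms', 'Bdrms']  # approximately 0.86, per the heatmap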
Conclusion
This guide has walked you through the fundamental steps of exploring a dataset. Throughout this tutorial, you have learned to load a dataset and get an overview of it, handle missing values, perform both univariate and bivariate analysis, and finally examine multivariate relationships using correlation analysis.
From this analysis, we have gained valuable insights about houses in the city of Milwaukee, such as their price range, sizes, and architectural styles. The insights we uncovered highlight the power of EDA and why every data practitioner should be good at it. This guide has given you a solid foundation; keep exploring and visualizing your data to unlock more insights!