Exploratory data analysis (EDA) is the process of studying data with visualization and statistical methods in order to understand it. In other words, it is your first look at the data, and a vital step before the actual analysis begins. EDA helps you discover relationships within the data and identify patterns and outliers in the dataset. Data scientists use EDA to ensure the results they produce are valid and applicable to the desired business outcomes and goals.
Objectives of EDA
The main objectives of EDA are to:
- Confirm that the data makes sense in the context of the problem being solved. Where it doesn't, come up with other strategies, such as collecting more data.
- Uncover and resolve data quality issues such as duplicates, missing values, incorrect data types and incorrect values.
- Get insights about the data, for example through descriptive statistics.
- Detect anomalies and outliers that may cause problems during analysis. Outliers are values that lie unusually far from the rest of the data.
- Uncover patterns in the data and correlations between variables.
Types of EDA
Exploratory data analysis is classified into three broad categories:
- Univariate: examining one variable at a time
- Bivariate: examining the relationship between two variables
- Multivariate: examining relationships among three or more variables
Steps in EDA
The following is a step-by-step approach to undertaking an exploratory data analysis:
1. Data collection
Gather relevant and sufficient data for your project. There are various sites online where you can get data, whatever sector you're in. Here are a few examples to check out: Kaggle, Datahub.io, BFI film industry statistics.
2. Familiarize yourself with the data
This step is important as it helps you to determine whether the data is adequate for the analysis about to be done.
3. Data cleaning
This is where missing values, outliers and duplicates are identified and handled. Data that is irrelevant to the anticipated analysis is also removed at this stage.
4. Identify associations in the dataset
Look for correlations between variables. A heatmap or scatterplots can make the correlations easier to spot.
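For instance, a correlation heatmap can be sketched as follows. The data frame and column names here are illustrative stand-ins; in practice you would call this on your own dataset:

```python
import pandas as pd
import seaborn as sns

# Illustrative numeric data; replace with your own data frame
df = pd.DataFrame({
    "tripduration": [320, 610, 450, 980],
    "hour": [8, 9, 17, 18],
})

# Pairwise correlations between the numeric variables
corr = df.corr(numeric_only=True)

# Heatmap with the correlation values annotated in each cell
sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Values near 1 or -1 indicate a strong positive or negative correlation; values near 0 indicate little linear association.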
Example: Exploratory Data Analysis using NYC Citi Bike data.
We will now perform an exploratory data analysis on NYC Citi Bike data to get a better understanding of the process. You can access the data here.
1. Import the data
The first step is to import all the modules you are going to use in your project. In this case, we need pandas for data wrangling and seaborn for data visualization. This is how I would do it:
```python
import pandas as pd
import seaborn as sns
```
Then import your dataset. If you're using Google Colab, this is how you would load the data:

```python
from google.colab import files
uploaded = files.upload()
```
You will then read in the data as a pandas data frame like this:
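A minimal sketch of the read-in step. The file name below is a placeholder for whatever you uploaded; a tiny sample CSV is written first only so the snippet runs on its own:

```python
import pandas as pd

# Write a tiny sample CSV so the snippet is self-contained;
# in Colab, skip this and pass the name of the file you uploaded instead.
pd.DataFrame({
    "tripduration": [320, 610],
    "usertype": ["Subscriber", "Customer"],
}).to_csv("citibike-sample.csv", index=False)

# Read the CSV into a pandas data frame
data = pd.read_csv("citibike-sample.csv")
print(data.shape)  # (2, 2)
```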
2. Get an overview of the data
You can approach this in various ways. For example, using .info() shows the data types, number of columns, column names, and number of non-null values in the data frame. The following is an example:
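On a small stand-in frame (the real Citi Bike frame will list many more columns), it looks like this:

```python
import pandas as pd

# Stand-in for the Citi Bike data frame loaded earlier
data = pd.DataFrame({
    "tripduration": [320, 610, 450],
    "usertype": ["Subscriber", "Customer", "Subscriber"],
})

# Prints column names, dtypes, non-null counts and memory usage
data.info()
```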
An alternative is to use .describe(), which gives you a statistical summary of the numeric columns. Here's an example:
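Again on a small stand-in frame for illustration:

```python
import pandas as pd

# Stand-in for the Citi Bike data frame loaded earlier
data = pd.DataFrame({"tripduration": [320, 610, 450, 980]})

# Count, mean, std, min, quartiles and max for each numeric column
summary = data.describe()
print(summary.loc["mean", "tripduration"])  # 590.0
```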
3. Visualize the distribution of trip duration
This gives us a glimpse of how long most trips took. Using seaborn, this is how I would do it:
```python
# visualize the distribution of trip duration
sns.histplot(data['tripduration'])
```
Here's the sample output:
From the output, it is evident that most trips took around 10 minutes or less.
4. Visualize the correlation between gender and trip duration
```python
# check for an association between tripduration and gender using scatterplots
sns.pairplot(data[['tripduration', 'gender']])
```
Sample output is as follows:
5. Calculate the percentage of subscribers
We need to find out the share of subscribers among the total number of riders in New York City. Here's how to find out:
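A minimal sketch of this calculation, using a small stand-in frame in place of the Citi Bike data loaded earlier (the `usertype` values match those in the dataset):

```python
import pandas as pd

# Stand-in sample; replace with the full Citi Bike data frame
data = pd.DataFrame({
    "usertype": ["Subscriber", "Customer", "Subscriber", "Subscriber"],
})

# Fraction of rows whose usertype is "Subscriber", as a percentage
subscriber_pct = (data["usertype"] == "Subscriber").mean() * 100
print(subscriber_pct)  # 75.0
```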
6. Evaluate how trip length varies with trip start time
```python
# extract the start hour from the starttime string
data['hour'] = data.starttime.apply(lambda x: x[11:13]).astype('str')
data

# visualize the relationship
sns.scatterplot(x='hour', y='tripduration', data=data, hue='usertype')
```
The output is as follows:
7. Determine the bike stations where most trips start
First we get the count of trips from each station and store the output as a new data frame. We then drop the duplicates from the original data frame and merge the two data frames for visualization.
```python
# Get the count of trips from each station
new_data = data.groupby(['start station id']).size().reset_index(name='counts')

# Remove duplicate values from the start station id column
temp_data = data.drop_duplicates('start station id')

# Left join to merge the new_data and temp_data data frames
newdata2 = pd.merge(new_data,
                    temp_data[['start station id', 'start station name',
                               'start station latitude', 'start station longitude']],
                    how='left', on=['start station id'])

# Install and import folium for mapping
!pip install folium
import folium

# Initialize a map centred on New York City
m = folium.Map(location=[40.691966, -73.981302], tiles='OpenStreetMap', zoom_start=12)

# Add a marker for each station, sized by its trip count
for _, row in newdata2.iterrows():
    folium.CircleMarker(
        location=[row['start station latitude'], row['start station longitude']],
        radius=max(row['counts'] / newdata2['counts'].max() * 10, 2),
        popup=row['start station name'],
    ).add_to(m)
m
```
The output is as follows:
Conclusion
EDA is crucial, as it affects the quality of the findings in the final analysis. The success of any EDA depends on the quality and quantity of the data, the tools and visualizations used, and proper interpretation by the data scientist.