Happy to be here again. Today's article defines the two keywords in its title: Exploratory Data Analysis and Data Visualization. With those definitions and a sample project to illustrate them, everything should become clear. I have discovered that Exploratory Data Analysis is a step that cannot be skipped in any Data Science project, whether one likes it or not.
Exploratory Data Analysis
This is the process of investigating a dataset to come up with summaries and hypotheses based on our understanding of the data: discovering patterns, detecting outliers, and gaining insights through various techniques. Data visualization is one of those techniques.
Data Visualization
This is the graphical representation of information and data.
Importance of Data Visualization
- In the cleaning process, it helps identify incorrect data or missing values.
- It makes the results clear, so they can be interpreted and acted on.
- It enables us to visualize things that cannot be observed by looking directly at the data, such as phenomena like weather patterns and medical conditions, as well as mathematical relationships, e.g. in financial analysis.
- It helps us construct and select variables, so we can choose which to use and which to discard.
- It bridges the gap between technical and non-technical users by showing in figures what has been written in code.
It is also important to know the different types of analysis used in data visualization.
Univariate Analysis: In this type, we analyze the properties of a single feature.
Bivariate Analysis: Here, we analyze exactly two features and compare them.
Multivariate Analysis: Here, we compare more than two variables.
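To make the three types concrete, here is a minimal sketch on a tiny made-up table. The values and the `toy` name are purely illustrative assumptions; only the column names are borrowed from the students dataset used later.

```python
import pandas as pd

# Made-up toy data, only to illustrate the three kinds of analysis.
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "lunch": ["standard", "standard", "free/reduced",
              "standard", "standard", "free/reduced"],
    "math score": [70, 65, 50, 80, 75, 55],
    "reading score": [78, 60, 58, 74, 82, 50],
})

# Univariate: summarize one feature on its own.
math_summary = toy["math score"].describe()

# Bivariate: compare exactly two features with each other.
math_reading_corr = toy["math score"].corr(toy["reading score"])

# Multivariate: more than two features, here two categories and one score.
mean_math = toy.groupby(["gender", "lunch"])["math score"].mean()

print(math_summary)
print(round(math_reading_corr, 2))
print(mean_math)
```

Which summary you pick for each type is a matter of taste; `describe()`, `corr()` and `groupby()` are just one reasonable choice per case.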
Let's get right to it.
Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import seaborn as sns
Reading the data
Step two is reading the data, which is usually in CSV format, using the pandas library. I used this dataset.
df = pd.read_csv('StudentsPerformance.csv')
df.head()
Our dataset looks like this after running df.head(), which outputs the first 5 rows.
You can easily tell just by looking at the dataset that it contains data about different students at a school/college, and their scores in 3 subjects.
Describe the Data
After loading the dataset, the next step is to summarize the data and its main characteristics with df.describe(). Consider it a way to get summary statistics such as the mean, the minimum and maximum values, and the 25th percentile for the different columns in a data frame.
The output is something like this.
Please also note that if you want to include categorical features (features that are not represented by numbers) in your output, just run df.describe(include='all').
Now the output also fills in count, unique, and the most frequent value (top). See below.
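If you want to try the two calls side by side without the CSV, here is a small sketch. The rows are made up and `toy` is a hypothetical stand-in for `df`; with the real data you would call the same methods on `df`.

```python
import pandas as pd

# Made-up rows that mimic the shape of the students dataset.
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "female"],
    "math score": [72, 69, 90, 47],
})

# Default: numeric columns only (count, mean, std, min, quartiles, max).
numeric_summary = toy.describe()

# include='all' also reports count, unique, top and freq for text columns.
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the second table, statistics that do not apply to a column (for example the mean of gender) show up as NaN.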
Check for missing values
In case of any missing entries, it is advisable to fill them: categorical features with the mode, and numerical features with the median or mean. To count the missing values in each column, run df.isnull().sum()
Phew!! We don't have any missing values.
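Our dataset has no gaps, but here is a minimal sketch of the filling strategy described above, assuming a made-up table with one missing value per column. The `toy` and `filled` names are illustrative, not part of the project.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps, since the students dataset itself has none.
toy = pd.DataFrame({
    "lunch": ["standard", None, "standard", "free/reduced"],
    "math score": [70.0, np.nan, 50.0, 90.0],
})

print(toy.isnull().sum())  # number of missing values per column

filled = toy.copy()
# Categorical column: fill with the most frequent value (the mode).
filled["lunch"] = filled["lunch"].fillna(filled["lunch"].mode().iloc[0])
# Numerical column: fill with the median (the mean would also work).
filled["math score"] = filled["math score"].fillna(filled["math score"].median())

print(filled)
```

The median is often preferred over the mean when a column has outliers, because a few extreme values do not pull it around as much.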
We can now proceed to observe any underlying patterns, analyze the data and identify any outliers using visual representations. I loooovee this part. Let's do it!
Graphs
Remember the three types of analysis we mentioned before? Let's look at them, starting with univariate analysis using bar graphs. We'll look at the distribution of the students across gender, race/ethnicity, lunch status, and whether or not they completed a test preparation course.
plt.figure(figsize=(16, 9))
plt.subplot(221)
df['gender'].value_counts().plot(kind='bar', title='Gender of students')
plt.xticks(rotation=0)
plt.subplot(222)
df['race/ethnicity'].value_counts().plot(kind='bar', title='Race/ethnicity of students')
plt.xticks(rotation=0)
plt.subplot(223)
df['lunch'].value_counts().plot(kind='bar', title='Lunch status of students')
plt.xticks(rotation=0)
plt.subplot(224)
df['test preparation course'].value_counts().plot(kind='bar', title='Test preparation course')
plt.xticks(rotation=0)
plt.show()
The output:
We can draw several conclusions. For instance,
- There are more girls than boys.
- The majority of students belong to race groups C and D.
- More than 60% of the students have a standard lunch.
- More than 60% of the students have not taken any test preparation course.
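Percentages like these are easier to read off as numbers than to estimate from bar heights. A small sketch, assuming a made-up `lunch` column (with the real data you would call the same method on df['lunch']):

```python
import pandas as pd

# Made-up sample; the real column has many more rows.
lunch = pd.Series(
    ["standard", "standard", "free/reduced", "standard", "free/reduced"],
    name="lunch",
)

# normalize=True returns fractions instead of raw counts.
shares = lunch.value_counts(normalize=True) * 100

print(shares.round(1))
```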
Next, let's stay with univariate analysis and use a boxplot. A boxplot helps us visualize the data in terms of quartiles, and it works very well for numerical columns. We use the function df.boxplot().
- The horizontal green line in the middle represents the median of the data.
- The hollow circles near the tails represent outliers in the dataset.
- The box in the middle represents the interquartile range (IQR).
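The pieces of the boxplot listed above can also be computed by hand. This is a sketch on made-up scores; it assumes the default whisker setting, where points more than 1.5 times the IQR away from the box are drawn as outliers.

```python
import pandas as pd

# Made-up scores with one unusually low value.
scores = pd.Series([10, 55, 60, 62, 65, 68, 70, 72, 75, 80], name="math score")

q1 = scores.quantile(0.25)  # bottom edge of the box
q3 = scores.quantile(0.75)  # top edge of the box
iqr = q3 - q1               # height of the box

# Default rule: anything beyond 1.5 * IQR from the box is an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(f"median={scores.median()}, IQR={iqr}")
print(outliers.tolist())
```

Running the same steps on a column of the real dataset tells you exactly which rows appear as hollow circles in the plot.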
From those points, we can see that a box plot shows the distribution of the data: how far the values are dispersed, or spread, around the middle value. So let's plot some distribution plots to see this. We'll start with the math score.
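# distplot is deprecated in recent seaborn releases; histplot with kde=True gives a similar histogram plus density curve
sns.histplot(df['math score'], kde=True)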
Well, the peak of the curve is at around 65 marks, which is roughly the mean math score of the students in the dataset. We can do the same for the reading score and the writing score.
- For our reading score curve, it's not a perfect bell curve. Its peak is at around 72 marks; because the curve is not symmetric, treat this as a rough estimate of the mean rather than the exact value.
- For our writing score, it's also not a perfect bell curve. Its peak is at around 70 marks, again only a rough estimate of the mean.
So far so good, right? One more thing: let's look at the correlation between the three scores using a heatmap. Correlation measures the linear relationship between variables: if one variable changes, how does the other change with it?
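# numeric_only=True (pandas 1.5+) skips the text columns, which newer pandas versions would otherwise raise an error on
corr = df.corr(numeric_only=True)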
sns.heatmap(corr, annot=True, square=True)
plt.yticks(rotation=0)
plt.show()
- The 3 scores are highly correlated.
- Reading score has a correlation coefficient of 0.95 with the writing score. Math score has a correlation coefficient of 0.82 with the reading score.
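If you are curious what a correlation coefficient actually is, here is a sketch that writes out the Pearson formula by hand and compares it with pandas. The five score pairs are made up for illustration, so the result will not match the 0.95 from the real data.

```python
import numpy as np
import pandas as pd

# Made-up scores for five students.
toy = pd.DataFrame({
    "reading score": [50, 60, 70, 80, 90],
    "writing score": [48, 62, 69, 83, 88],
})

x = toy["reading score"].to_numpy(dtype=float)
y = toy["writing score"].to_numpy(dtype=float)

# Pearson r: co-variation of x and y divided by their individual spreads.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same coefficient (Pearson is its default method).
r_pandas = toy["reading score"].corr(toy["writing score"])

print(round(r_manual, 3), round(r_pandas, 3))
```

A value near 1 means the two scores rise together almost in a straight line, a value near 0 means no linear relationship, and a value near -1 means one falls as the other rises.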
Bivariate analysis: Here we try to understand the relationship between two variables across different subsets of the dataset. For example, we can look at the relationship between the math score and the writing score for students of different genders.
sns.relplot(x='math score', y='writing score', hue='gender', data=df)
The graph shows a clear difference in scores between the male and female students. For a given math score, female students tend to have a higher writing score than male students. For a given writing score, male students tend to have a higher math score than female students.
Finally, let’s look at the impact of the test preparation course on students’ performance using a horizontal bar graph.
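The plotting code for this chart is not shown in the article, so here is one possible way to draw it, not necessarily how the original was made. The rows below are made up and `toy` is a hypothetical stand-in for `df`; the column names follow the students dataset.

```python
import pandas as pd

# Made-up rows; column names follow the students dataset.
toy = pd.DataFrame({
    "test preparation course": ["none", "completed", "none", "completed"],
    "math score": [60, 70, 64, 74],
    "reading score": [62, 75, 66, 77],
    "writing score": [58, 76, 62, 78],
})

score_cols = ["math score", "reading score", "writing score"]

# Average each score within the two groups.
means = toy.groupby("test preparation course")[score_cols].mean()

# kind='barh' draws horizontal bars, one cluster per course status.
ax = means.plot(kind="barh", title="Mean scores by test preparation course")
ax.set_xlabel("Mean score")

print(means)
```

In a notebook with %matplotlib inline the chart renders on its own; in a plain script you would also import matplotlib.pyplot and call plt.show().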
It is very evident that students who completed the test preparation course performed better than those who didn't.
That's the end, guys.
Thank you for following through.
YOU CAN DO IT!