Before we can carry out our Data Science project, we must first try to understand the data and ask ourselves some questions. Exploratory Data Analysis (EDA) is the preliminary phase of a Data Science project that allows us to extract important information from the data and to understand which questions it can answer and which ones it cannot.
We can perform EDA using either visual or quantitative techniques. In this article, we focus on visual techniques. Many different types of graphs can be used to analyze data visually, including line charts, bar charts, scatter plots, area plots, table charts, histograms, lollipop charts, maps, and many more.
During the Visual EDA phase, the type of chart we use depends on the type of question we want to answer. We do not focus on aesthetics during this phase, because we are only interested in answering our questions. Aesthetics will be attended to in the final data narrative phase.
We can perform two types of EDA:
- univariate analysis, which focuses on a single variable at a time
- multivariate analysis, which focuses on multiple variables at a time.
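The difference between the two can be sketched on a tiny, made-up table. The column names and values below are hypothetical and are not taken from the survey; the point is only that a univariate step looks at one column, while a multivariate step relates columns to each other.

```python
import pandas as pd

# Toy data (hypothetical, for illustration only)
toy = pd.DataFrame({
    "seniority": ["Junior", "Senior", "Senior", "Junior"],
    "salary": [50000, 80000, 90000, 54000],
})

# Univariate: summarize a single column on its own
salary_summary = toy["salary"].describe()
print(salary_summary)

# Multivariate: relate two columns to each other
mean_by_level = toy.groupby("seniority")["salary"].mean()
print(mean_by_level)
```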
When performing EDA, we can have the following types of variables:
- Numerical: a variable that can be quantified. It can be either discrete or continuous.
- Categorical: a variable that can assume only a limited number of values, with no inherent order among them.
- Ordinal: a categorical variable whose values have a natural order, such as a seniority level.
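In pandas, these variable types can be made explicit through column dtypes. The following is a minimal sketch on made-up data; the column names, the values, and the ordering of the seniority levels are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the survey)
toy = pd.DataFrame({
    "age": [26, 31, 45],                       # numerical, discrete
    "salary": [52000.0, 70500.5, 91000.0],     # numerical, continuous
    "city": ["Berlin", "Munich", "Berlin"],    # categorical, no order
    "seniority": ["Lead", "Junior", "Senior"], # ordinal
})

# Categorical: a limited set of labels without any ordering
toy["city"] = toy["city"].astype("category")

# Ordinal: the same idea, but with an explicitly declared order (assumed here)
levels = ["Junior", "Middle", "Senior", "Lead"]
toy["seniority"] = pd.Categorical(toy["seniority"], categories=levels, ordered=True)

# Sorting now follows the declared order rather than the alphabet
sorted_levels = toy["seniority"].sort_values().tolist()
print(toy.dtypes)
print(sorted_levels)
```

Declaring the order matters mostly for sorting and for the order in which categories appear on a chart axis.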
In this article, I show you some of the most common visual techniques for EDA through a practical example that uses the matplotlib and seaborn Python libraries. The described concepts are general, so you can easily adapt them to other Python libraries or programming languages.
The article is organized as follows:
- Setup of the Scenario
- Visual Techniques for Univariate Analysis
- Visual Techniques for Multivariate Analysis
1 Setup of the Scenario
The purpose of this scenario is to illustrate the main graphs for Visual EDA. As a sample dataset, we use the IT Salary Survey for EU Region, available under the CC0 license. I would like to thank Parul Pandey, who wrote a fantastic article about 5 real-world datasets for EDA. I discovered the dataset used in this article there.
First, we load the dataset as a pandas DataFrame:
import pandas as pd
df = pd.read_csv('../Datasets/IT Salary Survey EU 2020.csv', parse_dates=['Timestamp'])
df.head()
The dataset contains 1253 rows and the following 23 columns:
'Timestamp',
'Age',
'Gender',
'City',
'Position ',
'Total years of experience',
'Years of experience in Germany',
'Seniority level',
'Your main technology / programming language',
'Other technologies/programming languages you use often',
'Yearly brutto salary (without bonus and stocks) in EUR',
'Yearly bonus + stocks in EUR',
'Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country',
'Annual bonus+stocks one year ago. Only answer if staying in same country',
'Number of vacation days',
'Employment status',
'Сontract duration',
'Main language at work',
'Company size',
'Company type',
'Have you lost your job due to the coronavirus outbreak?',
'Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week',
'Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR'
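Note that some names in this list, such as 'Position ', end with a space, so they must be typed exactly when selecting a column. A small sketch of how such names could be detected is shown below; it runs on a toy frame that reuses two of the column names with made-up values, and the cleanup step is optional and not part of the article's code.

```python
import pandas as pd

# Toy frame reusing two column names from the list (values are made up)
toy = pd.DataFrame({"Position ": ["Data Scientist"], "Age": [30]})

# Names that differ from their stripped version carry stray whitespace
padded = [c for c in toy.columns if c != c.strip()]
print(padded)

# Optional cleanup on a copy; the article's code keeps the original names
clean = toy.rename(columns=str.strip)
print(list(clean.columns))
```

The rest of this article keeps the original names, which is why the code refers to `df['Position ']` with the trailing space.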
2 Visual Techniques for Univariate Analysis
Univariate Analysis considers a single variable at a time. We can consider two types of univariate analysis:
- categorical variables
- numerical variables
2.1 Categorical Variables
The first graph we can plot is the count plot, which shows the frequency of each category. In our example, we can plot the frequency of the Position column, considering only positions that occur more than 10 times. First, we count the occurrences of each position (stored in the variable mask) and keep only the rows whose position passes the threshold:
mask = df['Position '].value_counts()
df_10 = df[df['Position '].isin(mask.index[mask > 10])]
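To see what this filtering does, here is the same pattern applied to a toy column with a lower threshold. The data and the threshold of 1 are hypothetical; only the sequence of `value_counts`, boolean indexing, and `isin` mirrors the code above.

```python
import pandas as pd

# Toy data (hypothetical): three categories with different frequencies
toy = pd.DataFrame({"Position ": ["Dev"] * 3 + ["QA"] * 2 + ["PM"]})
threshold = 1  # the article uses 10 on the real data

counts = toy["Position "].value_counts()      # frequency of each category
frequent = counts.index[counts > threshold]   # labels above the threshold
toy_kept = toy[toy["Position "].isin(frequent)]

print(counts.to_dict())
print(sorted(toy_kept["Position "].unique()))
```

Rows belonging to the rare category are dropped, while all rows of the frequent categories are kept.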
Then, we build the graph:
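import matplotlib.pyplot as plt
import seaborn as sns
colors = sns.color_palette('rocket_r')
plt.figure(figsize=(15,6))
sns.set(font_scale=1.2)
# Recent seaborn releases expect the data to be passed by keyword
sns.countplot(x=df_10['Position '], palette=colors)
# Rotate the labels after plotting so that it acts on the final ticks
plt.xticks(rotation=45)
plt.show()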
The second type of graph we can plot is the pie chart, which shows the same information as the count plot, but also adds the percentage of each category:
values = df_10['Position '].value_counts()
plt.figure(figsize=(10,10))
values.plot(kind='pie', colors=colors, fontsize=17, autopct='%.2f')
# Take the legend labels from the plotted series so they match the wedges
plt.legend(labels=values.index, loc="best")
plt.show()
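The percentages that `autopct` prints on the wedges can also be computed directly, which is handy for checking the chart. This is a minimal sketch on a made-up series, not on the survey data.

```python
import pandas as pd

# Toy data (hypothetical): ten answers across three positions
positions = pd.Series(["Dev"] * 5 + ["QA"] * 3 + ["PM"] * 2, name="Position ")

# normalize=True returns relative frequencies instead of raw counts
shares = positions.value_counts(normalize=True).mul(100).round(2)
print(shares)
```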
2.2 Numerical Variables
In this case, we may be interested in the distribution of the data, so we could plot a histogram. A histogram splits the range of values into bins and counts how many values fall into each bin. In our example, we could plot a histogram of the top 10 salaries, which first requires building a mask.
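As an aside, the binning step itself can be illustrated independently of the salary data. The sketch below uses NumPy on made-up numbers and is not the article's own code; the choice of three bins is an assumption for illustration.

```python
import numpy as np

# Toy values (hypothetical, not salaries from the survey)
values = np.array([1, 2, 2, 3, 5, 7, 8, 8, 9, 10])

# Split the range [min, max] into 3 equal-width bins and count per bin
counts, edges = np.histogram(values, bins=3)

print(edges)   # bin boundaries
print(counts)  # number of values that fall into each bin
```

Each value is counted in exactly one bin, so the counts always add up to the number of values.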
Continue Reading on Towards Data Science