Karen Ngala

Exploratory Data Analysis: Ultimate Guide

*Note: Some terms can be confusing for beginners when used interchangeably in articles (even when they shouldn't be). I thought it'd be neat to define them before we jump in.*

  1. Variable vs Value
    • In a dataset, a variable is a characteristic or attribute that is being measured or observed for each individual or unit in the dataset. For example, in a dataset of student grades, variables could include the student's name, class, subject, and test scores.
    • On the other hand, a value is a specific measurement or observation of that variable for a particular individual or unit in the dataset. For example, if there were 20 students in the dataset, there would be 20 values for each variable.
  2. Column vs Feature
    • A column in a dataset can also be referred to as a feature. The variables we talked about appear as columns in a dataset, and these columns are considered features. Therefore, the terms "column" and "feature" can be used interchangeably to refer to a variable or attribute in a dataset that is used to build a model (see the sketch below).
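For instance, here is a minimal sketch of the student-grades example as a pandas DataFrame (the names and scores are made up for illustration):

import pandas as pd

# Each column is a variable/feature; each cell holds one value.
students = pd.DataFrame({
    "name": ["Amina", "Brian", "Cheryl"],
    "class": ["4A", "4B", "4A"],
    "test_score": [78, 85, 91],
})
print(students.columns.tolist()) # the features: ['name', 'class', 'test_score']
print(students.shape)            # (3, 3): 3 values for each of the 3 variables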

What is covered in this guide:

  1. What is Exploratory Data Analysis?
  2. Why is it important?
  3. Common EDA techniques
  4. Types of EDA

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is a technique used by data professionals to examine and understand datasets before modelling them. Simply put, the goal of EDA is to discover the underlying patterns, trends, relations, structures, and anomalies in the data.

EDA plays two main roles: cleaning data as well as understanding variables and the relationships between them.

Analyzing data enables analysts to derive meaningful insights that help identify data cleaning issues, inform the choice of modelling technique, and guide hypothesis testing. EDA is an iterative process consisting of activities such as data cleaning, manipulation, and visualization, and it can be revisited at any stage of the data analysis process if need be.

Importance of EDA

EDA allows data analysts to understand the data better by:

  • identifying important variables.
  • understanding the relationships between variables.
  • identifying issues in data that can affect the accuracy of your models, such as missing values and outliers.
  • uncovering hidden patterns in a dataset that were not obvious to the naked eye.
  • drawing new insights that inform associated hypotheses. These hypotheses are then tested and explored to gain a better understanding of the dataset.

Components & Techniques in EDA

The techniques or steps you choose to employ are determined by the task you are performing and the dataset you are working with. You may not need to follow all the steps below.

1. Understand the Data

It is important to understand the nature of the data you are working with. In this step, you need to:

1. Import the libraries you will need for analysis

#Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

The next natural step is to load your data into your working environment:

data = pd.read_csv("file.csv")

2. Conduct preliminary analyses on the data. This involves answering the following questions:

a. What is the size of my dataset and what are the variable data types?

data.shape # (number of rows, number of columns) in the dataset

data.columns # the names of all the columns

data.dtypes # the data type of each column

b. What does my data look like?

data.head() # view the first few records of the data

data.describe() # count, mean, standard deviation, min, quartiles, and max for numeric columns

c. Are there any missing variables?

data.isnull().sum() # count the missing values in each column

data.info() # non-null counts and the data type of each column

# Checking for wrong entries (symbols such as -, ?, #, *)
for col in data.columns:
    print('{} : {}'.format(col, data[col].unique()))

data['<column_name>'].unique() # returns a list of unique values in that column

There can be many reasons for missing values, such as:

  • There was no response recorded
  • Error while recording the data
  • Error in reading the data

Categorize your variables:

After finding the missing values in your data, you need to determine what category each variable falls into. This will help you choose the best method of handling the missing values, as well as the statistical and visualization methods that work with your dataset (a quick pandas heuristic follows the list below).

  • Categorical variables take on a limited, fixed set of values (e.g. a car's body style).
  • Continuous variables can take an infinite number of values within a range (e.g. price).
  • Discrete variables take countable numeric values (e.g. number of doors).
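A quick, rough way to make this split in pandas is to look at the dtypes. This is only a heuristic - categorical data is often encoded as numbers - so treat the result as a starting point:

# Heuristic split of columns by dtype
numeric_cols = data.select_dtypes(include='number').columns.tolist()
categorical_cols = data.select_dtypes(exclude='number').columns.tolist()

print('Numeric (continuous or discrete):', numeric_cols)
print('Categorical:', categorical_cols)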

How we handle missing values depends on the situation and the relationships the affected variables have with other variables. We can:

  • Delete the rows with missing values from the dataset before training the model.
  • Impute: fill in the missing values using one of various methods.

Ways of imputing missing values:
For **continuous** data, you can:

  • Replace the missing value with the mean, median, or mode value (see the sketch after this list)
  • Train a linear model to predict the missing value

For **categorical** data, you can:

  • Replace the missing value with the mode value
  • Train a classification model to predict the missing value
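Here is a minimal sketch of the fill-based approaches in pandas. The column names ('price', 'horsepower', 'num_of_doors') are hypothetical stand-ins from a car dataset:

# Continuous column: fill with the mean (use the median if the data is skewed)
data['price'] = data['price'].fillna(data['price'].mean())
data['horsepower'] = data['horsepower'].fillna(data['horsepower'].median())

# Categorical column: fill with the mode (most frequent value)
data['num_of_doors'] = data['num_of_doors'].fillna(data['num_of_doors'].mode()[0])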

2. Clean the Data

The above steps are some of the many ways through which you can understand the data you are working with. The insights gained will be used in this step to help you correct some of the issues in your dataset, so as to make it more usable.
a. Remove redundant variables

# Drop columns that add no information (the column names here are placeholders)
cleaned_data = data.drop(['variableA', 'variableB', 'variableC'], axis=1)

b. Remove rows with null values

# Using dropna(axis=0) to drop rows with null values
cleaned_data = cleaned_data.dropna(axis=0)
cleaned_data.shape # to see the change in dataset size

c. Remove outliers
Outliers are data points that are noticeably different from the rest. They can represent errors in measurement, bad data collection, or variables not considered when collecting the data.

You can identify outliers through visualization (discussed later in the article), the z-score method, the interquartile range (IQR) method, or machine learning-based methods. Under the IQR method, for X to be an outlier, it should satisfy the criterion:

X > (Q3 + 1.5*IQR) OR X < (Q1 - 1.5*IQR)
# where:
# Q1: first quartile - the value below which 25% of the observations fall when sorted in ascending order
# Q3: third quartile - the value below which 75% of the observations fall
# IQR: interquartile range = Q3 - Q1
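A minimal pandas sketch of this rule, assuming a numeric column named 'columnA':

# Compute the IQR bounds for one column
q1 = cleaned_data['columnA'].quantile(0.25)
q3 = cleaned_data['columnA'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows that fall inside the bounds
cleaned_data = cleaned_data[cleaned_data['columnA'].between(lower, upper)]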

So, what do you do when you have skewed data and outliers?

  • Replace outlier values with more suitable values using the quartile or interquartile range (IQR) methods.
  • Use a machine learning model that is not sensitive to outliers, e.g. a Naive Bayes classifier or a decision tree regressor.
  • Use a lot of training data to improve the signal-to-noise ratio. Outliers have less impact on the statistical average when you are working with a lot of data.
  • Remove outliers by excluding them from further processing.
  • Use transformation methods to remove skewness and make your data closer to normally distributed.

Normalization:
Transformation methods reduce skewness and the influence of outliers, bringing the dataset closer to a normal distribution. Some common variable transformations are log, square root, and Box-Cox. Before transforming a column, make sure its missing values are handled - for example, by replacing them with the column mean:

# Coerce the columns to numeric and replace missing values with the column mean:
num_col = ['columnA', 'columnB', 'columnC']
for col in num_col:
    data[col] = pd.to_numeric(data[col])
    data[col] = data[col].fillna(data[col].mean())
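And a small sketch of the log transform for a right-skewed column (np.log1p computes log(1 + x), so zero values stay finite; 'columnA' is a placeholder):

data['columnA_log'] = np.log1p(data['columnA'])
data['columnA_log'].plot(kind='hist', bins=50) # compare with the original distribution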

Normalization is important to ensure all features are on a similar scale, which improves the accuracy and integrity of your data. If some features are on a much bigger scale than others, they dominate the model and lead to inaccurate results. Un-normalized inputs can also cause a model to get stuck in very flat regions of the loss surface, which can stop it from learning.
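A minimal min-max scaling sketch in plain pandas, which rescales each listed column to the [0, 1] range (the column names are placeholders):

# Min-max scaling: (x - min) / (max - min)
for col in ['columnA', 'columnB']:
    col_min, col_max = data[col].min(), data[col].max()
    data[col] = (data[col] - col_min) / (col_max - col_min)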

3. Analyze variable relationships

Correlation Matrix:
A correlation matrix is a table that shows how strongly different pairs of variables in a dataset are related to each other. Two variables have a:

  • Positive correlation when one goes up and the other also goes up.
  • Negative correlation when one goes up and the other goes down.
  • No correlation when there is no clear relationship between them.

A correlation matrix is the fastest way to get a general understanding of all your variables. It helps identify which variables are important for predicting or explaining a particular outcome of interest.

# Calculate and plot the correlation matrix
corr = cleaned_data.corr(numeric_only=True) # correlations between numeric columns
plt.figure(figsize=(10, 10))
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
            annot=True, cmap=sns.diverging_palette(220, 20, as_cmap=True))

Visualization:
By drawing visual representations of your data, such as histograms, scatter plots and pie charts, you can get a better understanding of the distribution of your data. Further, visualization helps in identifying patterns and detecting outliers in a dataset.

How do I know which charts to generate?
Visualizations are all about asking analytical questions. Once you have understood your data - such as the columns (also known as features) - you can ask questions to understand their relationships.

For example, if you have a dataset containing different car features such as horsepower, engine quality, and price, you can ask: "How does engine quality affect price?" From this question, you can generate a scatter plot or histogram to show the relationship.
1. Histogram - shows the frequency distribution of a numeric variable by grouping its values into bins.

cleaned_data['columnX'].plot(kind='hist', bins=50, figsize=(12,6), facecolor='grey',edgecolor='black')
cleaned_data['columnY'].plot(kind='hist', bins=20, figsize=(12,6), facecolor='grey',edgecolor='black')

2. Pie Chart - commonly used to display the distribution of a single categorical variable as a percentage of a whole.

data['columnA'].value_counts().iloc[:5].plot.pie(
    autopct="%1.2f%%", fontsize=13, startangle=90, labels=[''] * 5,
    cmap='Set2', explode=[0.05] * 5, pctdistance=1.2)

3. Box Plot - visualizes the distribution of a variable, showing its median, quartiles, and outliers.

cleaned_data.boxplot('columnA')

A box plot can also be used to compare a numeric variable across the categories of another. For example, in the car dataset, a box plot of price by number of doors might show that the average price of a two-door vehicle is 10000 while the average price of a four-door vehicle is 12000.

sns.boxplot(x='price', y='num_of_doors', data=auto) # 'auto' is the car dataset from the example above

4. Scatter plot - plots the values of two variables along two axes. Like a correlation matrix, it shows the relationship between variables and helps identify outliers.

cleaned_data.plot(kind='scatter', x='columnA', y='columnB')
sns.pairplot(cleaned_data) # creates scatter plots between every pair of numeric variables

Types of EDA

There are a few types of EDA techniques:

  1. Univariate analysis: This involves examining the distribution of a single variable. The goal is to understand the central tendency (mean, median, mode), variability (range, interquartile range, standard deviation), and shape (skewness, kurtosis) of the variable.
    When exploring a single variable, we can use the following methods:
    a. For continuous data:
    • Tabular Method: describe the central tendencies, dispersion, and missing values.
    • Graphical Method: histograms for the distribution and box plots for detecting outliers.
    b. For categorical data:
    • Tabular Method: the .value_counts() method in pandas gives a tabular view of the frequencies.
    • Graphical Method: a bar plot is the best graph for a categorical variable.
  2. Bivariate analysis: This involves analyzing the relationship between two variables. The goal is to understand how changes in one variable relate to changes in another. Common bivariate analysis techniques include scatter plots, line charts, and correlation analysis.
    When exploring two variables, we can use the following methods (a short sketch of the statistical tests follows this list):
    a. For continuous-continuous types: scatter plots and correlation analysis.
    b. For categorical-continuous types: bar plots and t-tests.
    c. For categorical-categorical types: two-way tables and the chi-square test.

  3. Multivariate analysis: This involves analyzing the relationship between multiple variables. The goal is to understand how multiple variables interact with each other and to identify any patterns or relationships that may exist. Common multivariate analysis techniques include principal component analysis (PCA) and factor analysis.
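As a hedged sketch of the two tests mentioned under bivariate analysis, using scipy.stats (the car-dataset column names 'price', 'num_of_doors', and 'body_style' are assumptions for illustration):

from scipy import stats

# t-test: does the mean price differ between two-door and four-door cars?
two_door = data.loc[data['num_of_doors'] == 'two', 'price'].dropna()
four_door = data.loc[data['num_of_doors'] == 'four', 'price'].dropna()
t_stat, p_value = stats.ttest_ind(two_door, four_door)
print('t = {:.2f}, p = {:.3f}'.format(t_stat, p_value))

# Chi-square: are two categorical variables independent?
table = pd.crosstab(data['num_of_doors'], data['body_style'])
chi2, p, dof, expected = stats.chi2_contingency(table)
print('chi2 = {:.2f}, p = {:.3f}'.format(chi2, p))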

Conclusion

I hope this article gave you a better understanding of Exploratory Data Analysis and how to apply EDA techniques to your dataset.

Feedback is very welcome and highly appreciated.
