Introduction
On the journey of learning data science, one of the most essential steps is understanding the data you are working with. Before building complex models, it is critical to first perform Exploratory Data Analysis (EDA). EDA enables data scientists to make sense of their data, reveal patterns, and detect anomalies. This article aims to guide you through the essentials of EDA, especially if you are just starting your data science journey.
What Is Exploratory Data Analysis?
Exploratory data analysis (EDA) is an approach data scientists use to investigate data sets and summarize their main characteristics, often with data visualization methods. It is an essential step in every data science project because the insights it produces shape what comes next, whether that is choosing the right model or understanding the underlying structure of the data.
In short, EDA is not just about looking at numbers; it is about gaining a deeper understanding of the data at hand.
Key Steps in EDA
1. Data Collection and Cleaning
Before you start exploring your data, you first need to make sure it is clean and in an organized format. This involves handling null values and correcting inconsistencies.
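As a rough illustration of this cleaning step, the sketch below works on a small made-up DataFrame. The column names `age` and `city`, the median fill, and the duplicate removal are assumptions chosen for the example; the right strategy depends on your own data.

```python
import pandas as pd

# Hypothetical raw data: one missing age, inconsistent city labels, one repeated row
raw = pd.DataFrame({
    "age": [25, None, 31, 31, 40],
    "city": [" London", "london", "Paris", "Paris", "PARIS "],
})

# Count missing values in each column
missing_counts = raw.isnull().sum()

cleaned = raw.copy()
# Fill missing ages with the median age (one possible strategy among many)
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
# Make the text labels consistent: trim whitespace and unify capitalization
cleaned["city"] = cleaned["city"].str.strip().str.title()
# Drop rows that are exact duplicates after normalization
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(missing_counts)
print(cleaned)
```

Working on a copy keeps the raw data available for comparison, which is handy when you want to check what the cleaning actually changed.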
2. Descriptive Statistics
Once the data is clean, a natural first step in EDA is to compute basic descriptive statistics such as the mean, median, mode, and standard deviation. These statistics give a summary of the data and help us understand its central tendency and spread.
Example in Python code:
import pandas as pd
# Load a sample dataset
data = pd.read_csv('sample_data.csv')
# Calculate descriptive statistics
mean = data['column_name'].mean()
median = data['column_name'].median()
# mode() returns a Series because several values can tie; take the first one
mode = data['column_name'].mode().iloc[0]
std_dev = data['column_name'].std()
variance = data['column_name'].var()
print(f"Mean: {mean}, Median: {median}, Mode: {mode}, Standard Deviation: {std_dev}, Variance: {variance}")
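pandas can also produce most of these figures in a single call. A minimal sketch, using a small made-up Series in place of `data['column_name']` (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical numeric column standing in for data['column_name']
values = pd.Series([2, 4, 4, 5, 7, 9, 11], name="column_name")

# describe() returns count, mean, std, min, quartiles, and max in one Series
summary = values.describe()
print(summary)

# Individual entries can be read by label, e.g. the median is the 50% quantile
print("Median:", summary["50%"])
```

Comparing the mean with the median is a quick first check for skew: when the two differ noticeably, the distribution is probably not symmetric.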
3. Data Visualization
Data visualization helps you discover features and trends and reveals relationships between the values in the data set. Common plot types include:
Histograms: Show the distribution of a single variable across its range of values.
Box Plots: Summarize the distribution of values and highlight outliers.
Scatter Plots: Examine the relationship between two continuous variables.
Bar Charts: Compare counts or values across categories.
Example in Python code:
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram
sns.histplot(data['column_name'], bins=30)
plt.title('Histogram of Column Name')
plt.xlabel('Column Name')
plt.ylabel('Frequency')
plt.show()
# Box Plot
# Pass the column as x explicitly so the box is horizontal and matches the x-axis label
sns.boxplot(x=data['column_name'])
plt.title('Box Plot of Column Name')
plt.xlabel('Column Name')
plt.show()
# Scatter Plot
sns.scatterplot(x='variable_x', y='variable_y', data=data)
plt.title('Scatter Plot of Variable X vs. Variable Y')
plt.xlabel('Variable X')
plt.ylabel('Variable Y')
plt.show()
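Bar charts are listed above but not shown, so here is a minimal sketch of one. It uses a made-up categorical column (the name `category` and its values are assumptions for illustration) and plain matplotlib.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical column
categories = pd.Series(["A", "B", "A", "C", "A", "B"], name="category")

# Count how often each category appears
counts = categories.value_counts()

# Bar Chart
plt.bar(counts.index, counts.values)
plt.title('Bar Chart of Category Counts')
plt.xlabel('Category')
plt.ylabel('Count')
plt.show()
```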
4. Correlation Analysis
Correlation analysis examines how variables move together and whether the relationship between them is positive or negative. This helps in assessing the degree of dependency between variables and in spotting multicollinearity in the data set.
Pearson Correlation: Measures the strength and direction of the linear relationship between two continuous variables.
Spearman Rank Correlation: Measures the strength and direction of a monotonic relationship, based on the ranks of the values rather than the values themselves.
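Example in Python code:
# Calculate the Pearson correlation matrix for the numeric columns
# (the numeric_only argument requires pandas 1.5 or later)
correlation_matrix = data.corr(numeric_only=True)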
# Visualize the correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
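To see how the Pearson and Spearman measures differ, the sketch below compares them on made-up data where `y` always increases with `x` but not in a straight line. The DataFrame `df` and its columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but the relationship is cubic
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
df["y"] = df["x"] ** 3

# Pearson measures how well a straight line fits the relationship
pearson = df.corr(method="pearson").loc["x", "y"]
# Spearman works on ranks, so any strictly increasing relationship scores 1
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Here Spearman comes out as a perfect 1 while Pearson stays below 1, which is a hint that the relationship is monotonic but not linear.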
5. Detecting Outliers
Outliers can distort the results and conclusions drawn from an analysis. During EDA, you identify unusual values and decide whether they should be kept as they are, adjusted, or removed.
Z-Score Method: Flag values that lie more than a chosen number of standard deviations (commonly 3) from the mean.
Interquartile Range (IQR): Flag values that fall more than 1.5 times the IQR below the first quartile or above the third quartile.
Example in Python code:
# Detect outliers using IQR
Q1 = data['column_name'].quantile(0.25)
Q3 = data['column_name'].quantile(0.75)
IQR = Q3 - Q1
outliers = data[(data['column_name'] < (Q1 - 1.5 * IQR)) | (data['column_name'] > (Q3 + 1.5 * IQR))]
# Visualize outliers
sns.boxplot(x=data['column_name'])
plt.title('Box Plot with Outliers')
plt.xlabel('Column Name')
plt.show()
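The Z-score method mentioned above can be sketched in a similar way. The sample values and the threshold of 3 are assumptions for illustration; 3 is a common rule of thumb rather than a fixed rule.

```python
import pandas as pd

# Hypothetical column: 19 ordinary values and one extreme value
values = pd.Series(
    [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
     12, 10, 11, 12, 13, 11, 12, 10, 11, 95],
    name="column_name",
)

# Z-score: how many standard deviations each value lies from the mean
z_scores = (values - values.mean()) / values.std()

# Flag values whose absolute Z-score exceeds the chosen threshold
threshold = 3
z_outliers = values[z_scores.abs() > threshold]

print(z_outliers)
```

Keep in mind that an extreme value also inflates the mean and standard deviation it is measured against, so on small samples the Z-score method can miss outliers that the IQR rule catches.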
Conclusion
In conclusion, Exploratory Data Analysis is a crucial step in any data analysis. It helps you understand what your data contains, supports important decisions, and lays the foundation for subsequent analysis. For beginners, EDA is also an enjoyable exercise that builds core data science skills. The more practiced you become with EDA, the easier it will be to work with larger and more complex data sets and derive insights from them.