Exploratory Data Analysis (EDA) is the process of studying and exploring data sets to understand their main characteristics, discover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analysis or modeling.
Goals of Exploratory Data Analysis
1. Data Cleaning: Handling missing values, removing outliers, and ensuring data quality. An often-quoted estimate is that data scientists spend about 80% of their time cleaning data.
2. Data Exploration: Mostly involves identifying patterns in the cleaned data.
3. Data Visualization: Visualizations such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts assist in identifying patterns, trends, and relationships within the data.
4. Hypothesis Generation: EDA aids in generating research questions based on the preliminary exploration of the data. It helps form the foundation for further analysis and model building.
5. Correlation and Relationships: EDA helps discover relationships and dependencies between variables. Techniques such as correlation analysis, scatter plots, and cross-tabulations offer insights into the strength and direction of relationships between variables.
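As a rough illustration of the correlation analysis mentioned in goal 5, the sketch below computes pairwise Pearson correlations with pandas. The table is a small made-up example: the column names mirror the Titanic dataset used later in this article, but the values are invented purely for illustration.

```python
import pandas as pd

# Toy data with invented values; only the column names come from the Titanic dataset
toy = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "survived": [0, 1, 1, 1, 0, 0],
})

# Pairwise Pearson correlation between the numeric columns; values lie in [-1, 1]
corr_matrix = toy.corr()
print(corr_matrix.round(2))

# Correlation of every column with one column of interest, strongest first
print(corr_matrix["survived"].sort_values(ascending=False))
```

The sign of each coefficient gives the direction of the relationship and its absolute value gives the strength. With so few rows the numbers themselves mean little; the point is only the shape of the workflow.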
Steps Involved in Exploratory Data Analysis
1. Importing and Reading Data: Involves importing the libraries required for data cleaning, description, analysis, and visualization, and then reading in the data.
Let's use the Titanic dataset, which is freely available here:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
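# Option 1: load the example dataset through seaborn (downloaded on first use)
titanic_data = sns.load_dataset('titanic')
# Option 2: read your own copy of the dataset with pandas (uncomment and set the path)
# titanic_data = pd.read_csv("path_to_your_dataset_location")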
2. Understanding the Data: Find the characteristics of your data, such as its content and structure.
Example
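titanic_data.shape       # (number of rows, number of columns)
titanic_data.head(10)    # first 10 rows
titanic_data.dtypes      # data type of each column
titanic_data.describe()  # summary statistics for the numeric columns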
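Two further checks that are often useful at this stage are `info()` and a per-column count of missing values. The sketch below uses a small invented table in place of `titanic_data` so that it runs on its own; the column names are borrowed from the Titanic dataset, the values are not.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_data (invented values)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, "C"],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Column names, non-null counts and dtypes in one printed summary
toy.info()

# Number of missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Share of missing values in each column (between 0 and 1)
missing_share = toy.isna().mean()
print(missing_share)
```

Knowing which columns are mostly empty helps decide, in the next step, whether to drop them or fill them.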
3. Data Preparation: Identify and remove duplicates, drop irrelevant data, and make the data ready for analysis.
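# Drop every row that contains a missing value (pass axis=1 to drop columns instead).
# dropna() returns a new DataFrame; titanic_data itself is left unchanged.
titanic_data.dropna()
# Fill key missing data; as an example, replace missing ages with the median age
titanic_data['age'] = titanic_data['age'].fillna(titanic_data['age'].median())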
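The duplicate removal and column dropping named in this step could look roughly like the sketch below. It runs on a small invented table rather than the real dataset; treating `alive` as redundant is an assumption made for the example, based on it repeating the information in `survived`.

```python
import pandas as pd

# Toy stand-in for titanic_data (invented values); rows 1 and 2 are identical
toy = pd.DataFrame({
    "survived": [0, 1, 1, 1],
    "age": [22.0, 38.0, 38.0, 26.0],
    "alive": ["no", "yes", "yes", "yes"],
})

# Count rows that are exact copies of an earlier row
n_duplicates = toy.duplicated().sum()
print("duplicate rows:", n_duplicates)

# Keep only the first occurrence of each duplicated row
deduped = toy.drop_duplicates()

# Drop a column judged irrelevant or redundant for the analysis
prepared = deduped.drop(columns=["alive"])
print(prepared)
```

Both `drop_duplicates()` and `drop()` return new DataFrames, so the original table stays available for comparison.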
4. Data Exploration: Examine summary statistics, visualize the data distributions, and identify patterns.
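A minimal sketch of this step, assuming grouped summaries are the statistics of interest. The table is invented; only the column names are taken from the Titanic dataset.

```python
import pandas as pd

# Toy stand-in for titanic_data (invented values)
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "sex": ["female", "male", "female", "male", "male", "female"],
    "survived": [1, 0, 1, 0, 0, 1],
    "fare": [71.28, 53.10, 13.00, 7.25, 8.05, 7.92],
})

# How often each category occurs
print(toy["pclass"].value_counts())

# The mean of a 0/1 column is the share of ones, here the survival rate per class
survival_by_class = toy.groupby("pclass")["survived"].mean()
print(survival_by_class)

# Several statistics at once for a numeric column, per group
fare_summary = toy.groupby("pclass")["fare"].agg(["mean", "median", "max"])
print(fare_summary)
```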
5. Feature Engineering: Involves creating new variables, or transforming existing features in the dataset, to improve the performance of a machine learning model.
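The sketch below shows both ideas on a small invented table: creating new features from existing ones, and transforming a continuous feature into categories. The feature names `family_size`, `is_alone` and `age_group`, and the bin edges, are hypothetical choices made for this example.

```python
import pandas as pd

# Toy stand-in for titanic_data (invented values)
toy = pd.DataFrame({
    "sibsp": [1, 0, 3, 0],   # siblings/spouses aboard
    "parch": [0, 0, 1, 2],   # parents/children aboard
    "age": [22.0, 38.0, 4.0, 61.0],
})

# New feature: relatives aboard plus the passenger themselves
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

# New boolean feature derived from the one above
toy["is_alone"] = toy["family_size"] == 1

# Transform a continuous feature into categories (bin edges chosen arbitrarily)
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)
print(toy)
```

Whether such features actually help a model has to be checked empirically; the sketch only shows the mechanics.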
6. Data Visualization: Presenting the insights derived from the features through plots, charts, and graphs to communicate findings and tell the story effectively.
Types of Exploratory Data Analysis
1. Univariate Non-graphical: Here the data consists of a single variable, so the analysis does not deal with relationships between variables; it describes that one variable using summary statistics.
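For instance, a single numeric variable can be summarized with a few pandas calls. The ages below are invented values used only to keep the sketch self-contained.

```python
import pandas as pd

# Invented ages standing in for a single column such as titanic_data['age']
ages = pd.Series([22, 38, 26, 35, 35, 54, 2, 27], name="age")

# Count, mean, standard deviation, min, quartiles and max in one call
print(ages.describe())

# Individual measures of central tendency and shape
print("median:", ages.median())
print("mode:", ages.mode().tolist())
print("skewness:", ages.skew())
```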
2. Univariate Graphical: It involves summarizing and visualizing a single variable at a time to understand its distribution, central tendency, spread, and other relevant statistics. Techniques like histograms, box plots, and bar charts are generally used in univariate analysis.
Example: drawing a histogram using the Titanic dataset.
import seaborn as sns
import matplotlib.pyplot as plt
titanic_data = sns.load_dataset('titanic')
plt.figure(figsize=(8, 5))
sns.histplot(titanic_data['age'].dropna(), bins=30, kde=True, color='blue')
plt.xlabel('Age')
plt.title('Distribution of Passenger Ages')
plt.show()
3. Multivariate Non-graphical: In this type, the data arises from more than one variable. It shows the relationship between two or more variables, for example through cross-tabulation or summary statistics.
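A cross-tabulation of two categorical variables can be sketched with `pd.crosstab`. The table below is invented for illustration; the column names follow the Titanic dataset.

```python
import pandas as pd

# Toy stand-in for titanic_data (invented values)
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "male", "female"],
    "survived": [1, 0, 1, 0, 1, 0],
})

# Raw counts for every combination of the two variables
counts = pd.crosstab(toy["sex"], toy["survived"])
print(counts)

# Same table as row-wise shares, so each row sums to 1
row_shares = pd.crosstab(toy["sex"], toy["survived"], normalize="index")
print(row_shares.round(2))
```

Normalizing by row makes groups of different sizes comparable, which raw counts alone do not.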
4. Multivariate Graphical: Uses graphics to display relationships between two or more variables, for example a grouped bar plot or a scatter plot.
Example: drawing a grouped bar plot using the Titanic dataset.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Load the Titanic dataset
titanic = sns.load_dataset('titanic')
# Grouped bar plot showing survival count by class and sex
sns.set(style="whitegrid")
g = sns.catplot(x="class", hue="sex", col="survived",
                data=titanic, kind="count",
                height=4, aspect=0.7, palette="pastel")
# Customize labels and title
g.set_axis_labels("Class", "Count")
plt.subplots_adjust(top=0.85)
g.fig.suptitle('Survival Count by Class and Sex')
plt.show()
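Output: a figure with two panels of grouped bars (one for survived = 0 and one for survived = 1), showing passenger counts by class and sex.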
In this article, we have mainly focused on using Python for our exploratory data analysis examples. The R programming language can be used as well.
In summary, EDA is a crucial phase of everyday data analysis. It helps reveal valuable knowledge hidden within data, leading businesses to ask the right questions, make better decisions through better insights, and solve problems effectively.