Clustering is an unsupervised machine learning technique that groups rows of data in a dataset based on their features or properties.
In unsupervised machine learning there is no target feature. Unlike supervised machine learning models, which predict a value, a clustering model has only the dataset itself, which it separates into clusters. Clustering techniques are applied to target segmentation, social network analysis, image segmentation and so on. In this article, we’ll explore one of sklearn's unsupervised machine learning algorithms, KMeans. The link for the dataset is here
Training the model.
Import useful libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
We start by importing useful libraries:
NumPy: A Python library for numerical computing.
Pandas: A Python library for reading and manipulating tabular data, such as the CSV file used here.
Matplotlib: A Python plotting library, which Seaborn is built on.
Seaborn: A Python visualization library.
Sklearn: A Python library for data preprocessing and for supervised and unsupervised machine learning. In this article, we’ll use its KMeans clustering algorithm to segment the data into different groups.
Reading the data.
After importing the necessary modules, we read the data.
data=pd.read_csv('/Users/user/Downloads/covid_worldwide.csv')
We can check the contents of our data by using the pandas head command.
data.head()
Output:
Exploratory Data Analysis
Here we find out more about the data. To understand the columns, we use the pandas info method, which shows the datatype and the number of non-null entries in each column.
data.info()
Output:
From the output, we notice some significant issues with the dataset.
- There are some missing entries or values in some columns.
- The columns are of datatype ‘object’, so we need to convert them to a numeric type.
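Both issues can also be checked directly instead of being read off the info() output. Here is a minimal sketch on a small made-up DataFrame (the values and the name toy are illustrative stand-ins, not the real dataset): it counts the missing entries per column and lists which columns pandas already treats as numeric.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: numbers stored as text, with gaps.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Total Cases": ["1,000", "2,500", None],
    "Total Deaths": ["10", None, "3"],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Columns pandas already treats as numeric (none yet, since everything is text).
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

On the real data, the same two calls on data should point to the columns that need cleaning.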
Data Cleaning And Preparation
In order to make our data more readable and to use it to train a clustering model, we need to solve those issues stated above.
To handle the missing values, we can drop the rows that contain them. (This method is not advisable in some cases as it results in data loss; in future articles, we’ll see other ways we can handle missing values.)
To drop the rows with missing values, execute:
data.dropna(inplace=True)
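Since dropping rows loses data, it can help to measure how much is lost before committing to it. The sketch below does that on a made-up frame (toy and its values are assumptions for illustration); it uses dropna() without inplace so the original frame stays available for comparison.

```python
import pandas as pd

# Hypothetical data: rows B and C each have one missing value.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Total Cases": ["1,000", "2,500", None, "40"],
    "Total Deaths": ["10", None, "3", "1"],
})

rows_before = len(toy)
cleaned = toy.dropna()  # new frame without any row that has a missing value
rows_after = len(cleaned)

print(f"dropped {rows_before - rows_after} of {rows_before} rows")
```

If a large share of the rows disappears, dropping is probably too costly for that dataset.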
Next, we want to remove the commas (,) present in the numbers so that we can convert them to numeric values. We do so with a simple replace call that removes every comma in the dataset.
data.replace(",","",regex=True , inplace=True)
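The regex=True argument matters here. A small sketch on made-up values (the frame toy is an assumption for illustration) shows the difference: without it, replace only swaps cells whose whole value equals ",", while with it the comma is removed inside each string.

```python
import pandas as pd

# Hypothetical values formatted the way the dataset stores its numbers.
toy = pd.DataFrame({
    "Total Cases": ["1,000,000", "2,500"],
    "Population": ["334,805,269", "1,406"],
})

# Without regex=True only cells that are exactly "," would change, so nothing does.
unchanged = toy.replace(",", "")

# With regex=True the comma is treated as a pattern and removed inside each string.
no_commas = toy.replace(",", "", regex=True)

print(unchanged)
print(no_commas)
```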
We then collect the useful numerical columns into a variable named X.
X=data.drop(['Serial Number','Country'], axis=1)
We then convert the remaining columns’ datatype to float using the code below.
def tofloat(X, col):
    # Cast one column to float and return the converted column.
    X[col] = X[col].astype(float)
    return X[col]

for col in X.columns:
    X[col] = tofloat(X, col)
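As an aside, pandas can do the same conversion without a helper function. The sketch below is one possible alternative, shown on made-up data (toy and its values are assumptions): astype(float) converts every column at once, and pd.to_numeric with errors="coerce" turns text it cannot parse into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical frame with the commas already removed.
toy = pd.DataFrame({"Total Cases": ["1000", "2500"], "Total Deaths": ["10", "25"]})

# Convert every column to float in a single call.
converted = toy.astype(float)
print(converted.dtypes)

# For messier columns, to_numeric with errors="coerce" maps unparseable text to NaN.
coerced = pd.to_numeric(pd.Series(["12", "N/A"]), errors="coerce")
print(coerced)
```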
Then we call the pandas info() method on X to check our data again.
Output:
We can also use the pandas describe function to see summary statistics of the data.
X.describe()
Output:
Model Building
To train the model, we first create an instance of it.
kmeans=KMeans(n_clusters=2,n_init=10)
The n_clusters parameter specifies the number of clusters, or groups, you want to create from the dataset. The n_init parameter sets the number of times KMeans is run with different initial centroids; the run with the lowest inertia is kept.
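This article fixes n_clusters at 2. One common way to sanity-check such a choice, not used in the article itself, is to compare the inertia (the sum of squared distances from each point to its nearest centroid) for several values of k. The sketch below does this on synthetic data; the points array and random_state=0 are assumptions added for illustration and reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs (illustrative data, not the COVID dataset).
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])

inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_  # lower means points sit closer to their centroids

for k, value in inertias.items():
    print(k, round(value, 1))
```

With data like this, the inertia falls sharply from k=1 to k=2 and only slowly afterwards, which suggests two clusters are a reasonable fit.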
In the next few lines of code, we use the fit_predict method to both fit the model to the dataset and store the predicted cluster labels as a new column in X, which we then cast to a categorical datatype. We use the total cases and total deaths columns to cluster the data.
X["Cluster"] = kmeans.fit_predict(X[['Total Cases','Total Deaths']])
X["Cluster"] = X["Cluster"].astype("category")
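Before plotting, it can be useful to look at how many rows landed in each cluster and where the cluster centers are. The sketch below shows one way to do that on a small made-up frame (toy, its values and random_state=0 are assumptions for illustration); cluster_centers_ is the fitted model's array of centroids, one row per cluster.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: three small outbreaks and two large ones.
toy = pd.DataFrame({
    "Total Cases": [100.0, 120.0, 90.0, 10000.0, 11000.0],
    "Total Deaths": [1.0, 2.0, 1.0, 200.0, 210.0],
})
features = ["Total Cases", "Total Deaths"]

model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["Cluster"] = model.fit_predict(toy[features])

# Number of rows assigned to each cluster label.
cluster_sizes = toy["Cluster"].value_counts()
print(cluster_sizes)

# Centroid coordinates, one row per cluster, labelled with the feature names.
centers = pd.DataFrame(model.cluster_centers_, columns=features)
print(centers)
```

The cluster labels themselves (0 or 1) are arbitrary, so it is the sizes and centers that carry the meaning.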
We can visualize the clusters in a simple way with the code below.
sns.relplot(
    x="Total Cases", y="Total Deaths", hue="Cluster", data=X, height=6,
);
Output:
I hope you enjoyed reading this article as much as I enjoyed writing it. Feel free to like it and share your thoughts in the comments below.