Hello, Data Enthusiasts!
When diving into the world of Unsupervised Learning, we encounter tasks where we aim to find hidden patterns in data without explicit labels. One of the most popular techniques for such tasks is Clustering. Today, let's look at K-Means Clustering and how we can implement it with a hands-on Python example!
What is Unsupervised Learning?
Unsupervised learning is a type of machine learning where the model is provided with data that has no labels. The goal here is to uncover patterns, structures, or relationships within the data. The model tries to learn from the input data without any guidance on what the output should be.
Examples include clustering, anomaly detection, and dimensionality reduction.
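To make the "no labels" idea concrete, here's a minimal sketch (the two estimators are just illustrative picks, not part of this tutorial): an unsupervised estimator's `fit` takes only the features, while a supervised one also needs targets.

```python
import numpy as np
from sklearn.cluster import KMeans                    # unsupervised
from sklearn.linear_model import LogisticRegression   # supervised, for contrast

X = np.random.rand(100, 2)               # features only: no labels exist
KMeans(n_clusters=2, n_init=10).fit(X)   # unsupervised: fit(X)

y = (X[:, 0] > 0.5).astype(int)          # a supervised model needs labels too
LogisticRegression().fit(X, y)           # supervised: fit(X, y)
```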
What is Clustering?
Clustering is an unsupervised learning technique that groups data points based on their similarities. The most common clustering algorithm is K-Means, where the "K" represents the number of clusters you want to divide your data into.
K-Means Clustering Algorithm: Steps
- Initialize centroids: Choose K random points in the data as the initial centroids (cluster centers).
- Assign data points to clusters: Each data point is assigned to the nearest centroid, creating K clusters.
- Update centroids: After assignment, calculate the new centroids based on the mean of the data points in each cluster.
- Repeat: Steps 2 and 3 are repeated until the centroids no longer change, i.e., the algorithm converges (see the NumPy sketch below).
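To see those four steps without any library, here's a minimal NumPy sketch of the loop (the function name and convergence check are my own, not part of scikit-learn):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (for simplicity, this sketch assumes no cluster ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

scikit-learn's `KMeans` does the same thing, plus smarter initialization (`k-means++`) and multiple restarts.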
Now, Let's Dive Into the Code
Here's an example of implementing K-Means Clustering using Python. I'll walk you through every step and explain what's happening at each stage!
Step 1: Importing Libraries
```python
import pandas as pd                 # data handling
import numpy as np                  # numerical operations
import matplotlib.pyplot as plt     # plotting
import seaborn as sns               # statistical visualization
from warnings import filterwarnings
filterwarnings('ignore')            # suppress warnings for cleaner output
```
We start by importing the essential libraries:

- `pandas` and `numpy` help us handle data.
- `matplotlib` and `seaborn` are for visualizing the data.
- `filterwarnings` is used to suppress any warnings in our code.
Step 2: Creating Synthetic Data
```python
from sklearn.datasets import make_blobs

# Generate 1000 two-dimensional points grouped around 3 cluster centers
x, y = make_blobs(n_samples=1000, centers=3, n_features=2)

# Color each point by its true cluster label
plt.scatter(x[:, 0], x[:, 1], c=y)
plt.show()
```
Here, we generate a synthetic dataset with 1000 samples and 3 centers (clusters).

- `make_blobs` creates 2D data with separable clusters.
- `plt.scatter` helps us visualize the data points, with colors indicating the actual clusters.
Output:
A scatter plot shows 3 distinct clusters.
Step 3: Standardizing the Data
```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Hold out 25% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
```
- Standardization is crucial in K-Means because the algorithm relies on distances, so all features should be on a similar scale.
- `train_test_split` splits the data into training and testing sets.
- `StandardScaler` transforms each feature to zero mean and unit variance.
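Under the hood, `StandardScaler` simply subtracts each feature's training-set mean and divides by its standard deviation. Here's a quick sanity check, reusing the arrays from the code above:

```python
import numpy as np

# Recompute the transformation by hand: z = (x - mean) / std
manual = (x_train - x_train.mean(axis=0)) / x_train.std(axis=0)
print(np.allclose(manual, x_train_scaled))  # True

# The scaled features now have (roughly) zero mean and unit variance
print(x_train_scaled.mean(axis=0).round(6), x_train_scaled.std(axis=0).round(6))
```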
Step 4: Elbow Method for Optimal K
```python
from sklearn.cluster import KMeans

wcss = []  # within-cluster sum of squares (inertia) for each K
for k in range(1, 11):
    kmean = KMeans(n_clusters=k, init='k-means++')
    kmean.fit(x_train_scaled)    # K-Means is unsupervised: no labels are passed
    wcss.append(kmean.inertia_)
```
In this part, we use the Elbow Method to determine the optimal number of clusters (K). The inertia is the within-cluster sum of squares (WCSS): the sum of squared distances from each point to its assigned centroid.

- `KMeans` fits the data for each value of K from 1 to 10.
- We store the inertia values in the list `wcss` to evaluate the "elbow."
Output (`wcss`):

```
[1499.99, 594.74, 65.69, 58.75, 51.76, 41.96, 37.2, 34.92, 29.05, 27.66]
```
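If you're curious what `inertia_` actually computes, it's exactly the WCSS: the sum of squared distances from each point to its assigned centroid. A small check against the last model fitted in the loop above (the K=10 one):

```python
import numpy as np

# Centroid assigned to each point, looked up via the model's labels
assigned = kmean.cluster_centers_[kmean.labels_]
manual_wcss = ((x_train_scaled - assigned) ** 2).sum()

print(np.isclose(manual_wcss, kmean.inertia_))  # True
```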
Step 5: Plotting the Elbow Curve
```python
plt.plot(range(1, 11), wcss)
plt.xticks(range(1, 11))
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
```
The Elbow Curve helps us determine the best K.
- As K increases, inertia decreases.
- The "elbow" is where the decrease in inertia starts to slow down, suggesting the optimal K.
Step 6: Knee Locator for K Value
```python
from kneed import KneeLocator

# Find the point of maximum curvature on the convex, decreasing WCSS curve
kl = KneeLocator(range(1, 11), wcss, curve='convex', direction='decreasing')
kl.elbow
```
Using KneeLocator, we can find the "elbow" point in the curve programmatically.

- The `elbow` attribute gives us the optimal K based on the inertia curve.
Output:
3
Thus, the optimal number of clusters is 3!
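If you'd rather not add the `kneed` dependency, a rough do-it-yourself heuristic (my own sketch, not something from `kneed`) is to pick the K where the WCSS curve has its sharpest kink, i.e., the largest second difference:

```python
import numpy as np

# Second differences of the WCSS values; the biggest kink marks the elbow
second_diff = np.diff(wcss, n=2)
elbow_k = int(np.argmax(second_diff)) + 2  # +2 maps the index back to K (K starts at 1)
print(elbow_k)  # 3 for the values above
```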
Step 7: Silhouette Score for Validation
```python
from sklearn.metrics import silhouette_score

silhouette_coefficients = []
for k in range(2, 11):  # the silhouette score needs at least 2 clusters
    kmean = KMeans(n_clusters=k, init='k-means++')
    kmean.fit(x_train_scaled)
    score = silhouette_score(x_train_scaled, kmean.labels_)
    silhouette_coefficients.append(score)

plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficients')
plt.show()
```
The Silhouette Score measures how similar each point is to its own cluster compared to the other clusters; it ranges from -1 to 1.

- Higher scores indicate compact, well-separated clusters.
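With K settled at 3, we can fit the final model and plot the resulting clusters. A short sketch reusing the scaled arrays from Step 3 (the variable names here are my own, and the cluster colors are arbitrary, so they won't necessarily match the original labels):

```python
# Fit the final model with the chosen K
final_km = KMeans(n_clusters=3, init='k-means++')
train_labels = final_km.fit_predict(x_train_scaled)

# Unseen points can be assigned to the learned clusters as well
test_labels = final_km.predict(x_test_scaled)

# Training points colored by predicted cluster, centroids marked with an X
plt.scatter(x_train_scaled[:, 0], x_train_scaled[:, 1], c=train_labels)
plt.scatter(final_km.cluster_centers_[:, 0], final_km.cluster_centers_[:, 1],
            c='red', marker='x', s=100)
plt.show()
```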
Conclusion: K-Means in Action!
- K-Means Clustering is a powerful technique to group similar data points into K clusters.
- The Elbow Method and Silhouette Score are effective for determining the optimal K.
- By using `scikit-learn`, you can easily implement K-Means, visualize results, and evaluate the quality of your clustering.
So, next time you're working with unsupervised data, try K-Means Clustering and see how well it works for your dataset!
Happy clustering!