DEV Community

Abhinav Yadav

Unsupervised Learning: Unveiling the Hidden Secrets in Your Data

Imagine walking into a room full of people, but none of them have name tags. Unsupervised learning is like being tasked with organising them into groups based on what you see. Unlike supervised learning, where we have labelled data, here we're on our own to find hidden patterns and structures.

Table of Contents

  • Introduction to Unsupervised Learning
  • Types of Unsupervised Learning
  • Practical Example: Implementing Clustering with k-Means
  • Practical Example: Dimensionality Reduction with PCA
  • Applications and Challenges of Unsupervised Learning

Introduction to Unsupervised Learning

Unsupervised learning is a branch of machine learning in which models learn from data without human supervision. Unlike supervised learning, unsupervised models are given unlabelled data and left to discover patterns and insights without any explicit guidance or instruction.

This type of learning is incredibly useful for tasks like:

  • Customer segmentation: Unsupervised learning can group customers based on their buying habits, allowing businesses to target specific demographics with personalised marketing campaigns.

  • Anomaly detection: Ever wondered how spam filters catch suspicious emails? Unsupervised learning can identify outliers in data, making it perfect for detecting fraudulent transactions or security threats.

  • Data compression: Images and videos can take up a lot of storage space. Unsupervised learning can compress data by reducing its dimensions while preserving key information.
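To make the anomaly-detection idea concrete, here is a minimal sketch using scikit-learn's IsolationForest on made-up transaction amounts (the data and the `contamination` setting are illustrative assumptions, not from this article):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly "normal" transaction amounts around 50, plus two obvious outliers
rng = np.random.default_rng(0)
amounts = np.concatenate([rng.normal(50, 10, 100), [500.0, 750.0]]).reshape(-1, 1)

# IsolationForest labels easy-to-isolate points as anomalies (-1), the rest as 1
model = IsolationForest(contamination=0.02, random_state=0)
labels = model.fit_predict(amounts)

outliers = amounts[labels == -1].ravel()
print(outliers)
```

No labels were needed: the model flags the extreme amounts purely because they are easy to separate from the bulk of the data, which is exactly the unsupervised framing described above.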

Types of Unsupervised Learning

There are two main approaches to unsupervised learning:

  1. Clustering: This is like sorting those people in the room. We group data points together based on their similarities. Popular clustering algorithms include k-Means (think of it as creating k distinct groups) and Hierarchical clustering (building a hierarchy of clusters like a family tree).

  2. Dimensionality Reduction: Sometimes, data has too many variables, making it hard to visualise or analyse. Dimensionality reduction techniques like PCA (Principal Component Analysis) help us reduce the number of variables while keeping the most important information.
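Hierarchical clustering is just as easy to try as k-Means. Here is a small sketch using scikit-learn's AgglomerativeClustering on toy 2-D points (the data is made up for illustration):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two visually obvious groups of 2-D points
X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]])

# Agglomerative clustering builds the hierarchy bottom-up,
# repeatedly merging the two closest clusters
model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(X)
print(labels)
```

The first three points end up in one cluster and the last three in the other, mirroring the "family tree" merging described above.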

Practical Example: Implementing Clustering with k-Means

Let's get hands-on! We can use k-Means clustering to group customers based on their spending habits:

1. Importing Libraries

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

2. Sample data: customer spending habits

data = {
    'CustomerID': range(1, 11),
    'Annual Income (k$)': [15, 16, 17, 18, 19, 20, 21, 22, 23, 24],
    'Spending Score (1-100)': [39, 81, 6, 77, 40, 76, 6, 94, 3, 72]
}
df = pd.DataFrame(data)

3. Selecting features

X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

4. Standardising the data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

5. Applying k-Means clustering

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)  # 3 clusters, fixed seed for reproducibility
kmeans.fit(X_scaled)
df['Cluster'] = kmeans.labels_  # assign each customer its cluster label

6. Plotting the clusters

plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=df['Cluster'], cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', marker='X')
plt.title('Customer Clusters')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.show()

Practical Example: Dimensionality Reduction with PCA

Similarly, PCA can be used to reduce the dimensions of a dataset for better visualization:

1. Importing Libraries

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

2. Sample data: customer spending habits

data = {
    'CustomerID': range(1, 11),
    'Annual Income (k$)': [15, 16, 17, 18, 19, 20, 21, 22, 23, 24],
    'Spending Score (1-100)': [39, 81, 6, 77, 40, 76, 6, 94, 3, 72],
    'Age': [25, 34, 22, 35, 40, 30, 26, 32, 28, 45]
}
df = pd.DataFrame(data)

3. Selecting features

X = df[['Annual Income (k$)', 'Spending Score (1-100)', 'Age']]

4. Standardising the data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

5. Applying PCA

pca = PCA(n_components=2)  # project the three features onto two principal components
X_pca = pca.fit_transform(X_scaled)

6. Explained variance

explained_variance = pca.explained_variance_ratio_

7. Plotting the results

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1])
for i, txt in enumerate(df['CustomerID']):
    plt.annotate(txt, (X_pca[i, 0], X_pca[i, 1]))
plt.title('PCA of Customer Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

print(f"Explained variance by component: {explained_variance}")

Applications and Challenges of Unsupervised Learning

Unsupervised learning unlocks a treasure trove of possibilities. It helps us segment markets, detect anomalies, compress data, and uncover hidden patterns in complex datasets. But like any adventure, there are challenges:

  • Choosing the right number of clusters: How many groups should we create in our k-Means example? Techniques like the Elbow Method can help us decide.

  • High-dimensional data: When dealing with many variables, it can be tricky to manage and visualize the data.

  • Interpretation: Making sense of the clusters and reduced dimensions requires careful analysis.
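The Elbow Method mentioned above takes only a few lines: fit k-Means for several values of k and watch where the inertia (within-cluster sum of squared distances) stops dropping sharply. The synthetic blobs below are an illustrative assumption, not data from the article:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs, so the elbow should appear around k=3
rng = np.random.default_rng(0)
centers = [(0, 0), (5, 5), (10, 0)]
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers])

# Inertia always falls as k grows; the k where the drop levels off
# is a reasonable number of clusters
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 7), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting inertia against k makes the "elbow" visible at a glance: the curve falls steeply up to k=3 and then flattens out.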

With careful planning and the right tools, unsupervised learning can be a powerful tool in your data science arsenal. So, next time you look at a crowd of unlabeled data, remember – there's a hidden story waiting to be discovered!

Happy Learning!

Please do comment below and let me know whether you liked the content.

If you have any questions or ideas, or want to collaborate on a project, here is my LinkedIn.
