
Wilbert Misingo


Unsupervised Machine Learning: Non-Text Clustering with DBSCAN

Introduction

Unsupervised machine learning is a type of machine learning where the model is not provided with labeled training data. Instead, it must find patterns or relationships in the data on its own. There are different types of unsupervised learning, such as clustering and dimensionality reduction. Clustering, in particular, is the task of grouping similar examples together, without being provided with a specific target variable.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. Unlike other clustering algorithms, such as k-means, DBSCAN does not require the number of clusters to be specified in advance. Instead, it discovers the number of clusters automatically from the density of the data. DBSCAN treats a region as dense when at least a specified number of points lie within a certain distance (epsilon) of one another; connected dense regions become clusters, and points that belong to no dense region are labeled as noise.

The DBSCAN algorithm is implemented in the DBSCAN class from the sklearn.cluster module. It takes two main parameters: eps (epsilon) and min_samples. eps defines the radius of the neighborhood around a point, and min_samples defines the number of points required to form a dense region. Calling the fit method on X computes the clustering, and the labels_ attribute holds the cluster label for each data point, with -1 marking noise.
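
As a quick, self-contained illustration of that API, here is a minimal sketch on a tiny hand-made array (separate from the full walkthrough below):

from sklearn.cluster import DBSCAN
import numpy as np

# two tight groups of 2-D points plus one far-away outlier
points = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
                   [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
                   [20.0, 20.0]])

# eps is the neighborhood radius, min_samples the points needed for a dense region
db = DBSCAN(eps=0.5, min_samples=3).fit(points)
print(db.labels_)  # e.g. [ 0  0  0  1  1  1 -1] -- two clusters, one noise point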

In this article we will cover how to create clusters for non-text data using DBSCAN. Along the way, the following libraries will be used:

  • Scikit-learn

  • NumPy

  • Matplotlib

The process

  1. Importing libraries and modules
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

  2. Defining model and data configurations
num_samples_total = 1000            # total number of points to generate
cluster_centers = [(3,3), (7,7)]    # centers of the two blobs
num_classes = len(cluster_centers)  # number of clusters in the generated data
epsilon = 1.0                       # DBSCAN neighborhood radius (eps)
min_samples = 13                    # points required to form a dense region

  3. Generating training data
# y holds the ground-truth blob labels, which DBSCAN does not use
X, y = make_blobs(n_samples=num_samples_total, centers=cluster_centers, n_features=2, cluster_std=0.5)

  4. Saving the data for future use and loading it back
np.save('./clusters.npy', X)   # persist the generated points
X = np.load('./clusters.npy')  # reload them, e.g. in a later session

  5. Training the model
db = DBSCAN(eps=epsilon, min_samples=min_samples).fit(X)
labels = db.labels_  # cluster label for each point; -1 marks noise

  6. Getting information about the clusters
unique_labels = np.unique(labels)
# the label -1 marks noise, so it should not be counted as a cluster
no_clusters = len(unique_labels) - (1 if -1 in unique_labels else 0)
no_noise = np.sum(labels == -1)

print('Estimated no. of clusters: %d' % no_clusters)
print('Estimated no. of noise points: %d' % no_noise)
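
Beyond the totals above, it can be useful to see how many points ended up in each cluster. A small optional addition, using only what has already been imported:

# count the points per label; label -1 groups all noise points
unique_labels, counts = np.unique(labels, return_counts=True)
for label, count in zip(unique_labels, counts):
    name = 'noise' if label == -1 else 'cluster %d' % label
    print('%s: %d points' % (name, count))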

  7. Visualizing the clusters
# label 1 is drawn in blue; everything else (cluster 0 and any noise, label -1) in red
colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426', labels))
plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True)
plt.title('Two clusters with data')
plt.xlabel('Axis X[0]')
plt.ylabel('Axis X[1]')
plt.show()
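
The color mapping above only separates label 1 from everything else, so any noise points get the same color as cluster 0. If you would rather have noise stand out, one possible variation (not part of the original walkthrough) is to give each label its own color:

# blue for cluster 1, red for cluster 0, grey for noise (label -1)
palette = {1: '#3b4cc0', 0: '#b40426', -1: '#999999'}
colors = [palette.get(label, '#999999') for label in labels]

plt.scatter(X[:,0], X[:,1], c=colors, marker="o")
plt.title('DBSCAN clusters with noise highlighted')
plt.xlabel('Axis X[0]')
plt.ylabel('Axis X[1]')
plt.show()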

That's all, I hope this helps!!

Do you have a project πŸš€ that you want me to assist with? Email me 🀝😊: wilbertmisingo@gmail.com
Have a question, or want to be the first to know about my posts?
Follow βœ… me on Twitter/X 𝕏
Follow βœ… me on LinkedIn πŸ’Ό
