What is K-means Clustering?
K-means clustering is an unsupervised learning technique that falls under the class of clustering algorithms. Clustering algorithms find similarities in the data in order to group it into clusters.
The K in K-means represents the number of clusters the data points are to be grouped into, while the "means" comes from the fact that, after creating the clusters, K-means computes the mean of each cluster and uses it as the new centroid (center of the cluster).
The number of clusters (K) is chosen in advance; K-means then creates that number of clusters from unlabeled, multidimensional data.
The following two assumptions are the basis for the K-means model:
- Each cluster center is the arithmetic mean of all the points belonging to that cluster.
- Each point is closer to its own cluster center than to the other cluster centers.
Steps for K-means clustering
- Guess random cluster centers
- Assign each point to the nearest cluster center
- Compute the mean of each cluster and take it as the new cluster center
- Repeat steps 2 and 3 until convergence (the cluster assignments no longer change between iterations), the cluster centers stop moving, or the maximum number of iterations is reached (see the sketch after this list)
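To make these steps concrete, here is a minimal from-scratch sketch using NumPy. The function name simple_kmeans and its arguments are made up for this illustration and are not part of any library; the rest of the article uses scikit-learn's KMeans, which runs the same loop with a smarter initialization (k-means++) and several restarts.

import numpy as np

def simple_kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: guess initial centers by picking k random data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 2: assign each point to its nearest center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 3: the mean of each cluster becomes the new center
        # (this simple sketch assumes no cluster ends up empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers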
How to choose the right K (number of clusters)
There are quite a number of methods for choosing the number of clusters when using K-means. These include the elbow method, the silhouette method, and the sum of squares method, among others. We are going to discuss the elbow method in detail.
A major property of clusters is that the data points within a cluster should be similar. This means that a clustering algorithm should form clusters such that the intra-cluster variation (WCSS) is minimized. WCSS, which stands for within-cluster sum of squares, is the sum of the squared distances between each member of a cluster and its centroid.
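As a rough sketch (not library code), the WCSS for a given clustering could be computed as follows; the helper name wcss is made up for this illustration, and later we simply read the same quantity from scikit-learn's inertia_ attribute.

def wcss(X, labels, centers):
    # X, labels, and centers are assumed to be NumPy arrays
    # sum of squared distances from each point to its own cluster's centroid
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centers))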
In the elbow method, the WCSS is calculated and plotted for each number of clusters. One should choose the number of clusters at which adding another cluster no longer reduces the WCSS by much. This is the point where the slope changes from steep to shallow (an elbow).
The following are the steps for performing the elbow method:
- Run K-means clustering for different values of K, for example K ranging from 1 to 12
- For each value of K, calculate the WCSS
- Plot a graph of WCSS against the number of clusters K
- Spot the point where the slope changes from steep to shallow (an elbow). This is the optimal number of clusters.
Implementing K-means with Python
We are going to use the mall_customers dataset from Kaggle.
The following snippet loads the dataset and displays the first 5 rows of the data:
import pandas as pd

# loading the dataset
mall_customers = pd.read_csv("data path")
mall_customers.head()
The dataset has 5 columns. We are going to use the columns 'Age' and 'Spending Score (1-100)' to make the clusters, grouping customers by age according to their spending score.
We will select only the part of the dataset that we need:
X = mall_customers[['Age', 'Spending Score (1-100)']]
First, we will determine the number of clusters needed using the elbow method. We will run K-means for values of K from 1 to 10, calculate the WCSS for each, and plot a graph.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

wcss = []
for i in range(1, 11):
    # fit K-means with i clusters and record its WCSS (exposed as inertia_)
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Plot of WCSS against number of clusters')
plt.ylabel('WCSS')
plt.xlabel('Number of clusters')
plt.show()
From the graph above, 4 is our optimal number of clusters.
We will now proceed to cluster the data into 4 clusters and visualize them.
# defining a K-means model with k=4
kmeans_4 = KMeans(n_clusters=4, random_state=0)
# fitting the model and getting the cluster assignment of each point
assignments = kmeans_4.fit_predict(X)
# creating a dictionary that maps cluster numbers to colours for visualization
col_dic = {0: 'blue', 1: 'green', 2: 'orange', 3: 'magenta'}
# mapping each point's cluster number to a colour
assign_colour = [col_dic[x] for x in assignments]
# visualization
plt.scatter(X['Age'], X['Spending Score (1-100)'], c=assign_colour)
plt.ylabel('Spending score')
plt.xlabel('Age')
plt.show()
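As a quick sanity check, the fitted model's centroids can be printed; each centroid is simply the mean age and mean spending score of its cluster, which is exactly the "means" in K-means.

# each row is [mean age, mean spending score] for one cluster
print(kmeans_4.cluster_centers_)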
Conclusion
In this article, we have discussed the K-means clustering algorithm: what it is, the steps it takes to form clusters, and how to implement it in Python. I hope you enjoyed reading through and found the article helpful.
Feel free to give feedback or leave a comment if you have any, so that we all keep learning.