K-means Clustering in Machine Learning

Sheersh Jain

DevOps Engineer

Published Jul 20, 2021

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.

A cluster refers to a collection of data points aggregated together because of certain similarities.

The K-means algorithm identifies k centroids and then allocates every data point to the nearest cluster, while keeping the within-cluster variance (the sum of squared distances between each point and its centroid) as small as possible.

The ‘means’ in K-means refers to averaging the data; that is, finding the centroid.
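As a small illustration of what "finding the centroid" means, the sketch below averages a hypothetical toy cluster and measures how tightly its points sit around that average. The point values and variable names are assumptions made up for this example.

```python
import numpy as np

# Hypothetical toy cluster of four 2-D points (illustrative values only).
points = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [2.0, 2.0]])

# The centroid is the per-feature mean of the points in the cluster.
centroid = points.mean(axis=0)

# Sum of squared distances from each point to the centroid,
# the quantity K-means tries to keep small within every cluster.
wcss = ((points - centroid) ** 2).sum()

print(centroid)  # [2. 2.]
print(wcss)      # 4.0
```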

The diagram below explains how the K-means clustering algorithm works:

How the K-means algorithm works

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)

Step-3: Assign each data point to its closest centroid, which forms the K predefined clusters.

Step-4: Compute the mean of the points in each cluster and place that cluster's new centroid there.

Step-5: Repeat step 3; that is, reassign each data point to the new closest centroid.

Step-6: If any reassignment occurred, go to step-4; otherwise go to FINISH.

Step-7: The model is ready.
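The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production implementation: the function name `kmeans`, its parameters, and the synthetic two-blob data are all hypothetical, and the initial centroids are drawn from the data points themselves.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: stop when no assignment changed.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Synthetic data: two groups of points (assumed values for illustration).
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                   rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels = kmeans(X_toy, k=2)
print(centroids)
```

Because the starting centroids are random, different seeds can converge to different clusterings; library implementations usually run several initializations and keep the best one.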

K-means algorithm example problem

We’ll use the scikit-learn library and a sample dataset, Social_Network_Ads.csv, to walk through a simple example.

Step 1: Importing important libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set()

Step 2: Now let’s import the dataset and slice the important features

ds=pd.read_csv('Social_Network_Ads.csv')   #loading the dataset

ds   #previewing the data

ds.columns   #listing the available columns

X=ds[['Age','EstimatedSalary']]   #slicing important features

y=ds['Purchased']   #label column

Step 3: Plotting the data

sns.scatterplot(data=ds, x='Age', y='EstimatedSalary', hue='Purchased')

Step 4: The next step is to split our data into two chunks: one will be used to train the model and the other to test its results.

from sklearn.model_selection import train_test_split 

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size = 0.20, random_state=32)

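Step 5: Age and EstimatedSalary are measured on very different scales, so the salary values would dominate any distance calculation. To resolve this magnitude problem, we have to scale the attributes. For this we use the StandardScaler from sklearn.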

from sklearn.preprocessing import StandardScaler

sc=StandardScaler()

X_train_scaled=sc.fit_transform(X_train)
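To see what the scaling step does, here is a small sketch on made-up rows. The values in `X_demo` are hypothetical stand-ins for Age and EstimatedSalary, not rows from the CSV.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in rows of (Age, EstimatedSalary); not taken from the CSV.
X_demo = np.array([[25, 30000.0],
                   [35, 60000.0],
                   [45, 90000.0],
                   [55, 120000.0]])

scaler = StandardScaler()
X_demo_scaled = scaler.fit_transform(X_demo)

# After scaling, each column has mean 0 and standard deviation 1,
# so both features contribute comparably to distances.
print(X_demo_scaled.mean(axis=0))
print(X_demo_scaled.std(axis=0))
```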

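Step 6: The next step is to train the model. Note that the code below fits scikit-learn's KNeighborsClassifier, a supervised k-nearest neighbors classifier that learns from the Purchased labels; it is a different algorithm from K-means clustering, which uses no labels.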

# Fitting classifier to the Training set

from sklearn.neighbors import KNeighborsClassifier 

model=KNeighborsClassifier(n_neighbors=15) 

model.fit(X_train_scaled,y_train) 

X_test_scaled=sc.transform(X_test)

y_pred=model.predict(X_test_scaled)

If you want to predict the classes for new observations, transform them with the same fitted scaler before calling predict. For the test set, you can use the following code:

# Predicting the Test set results

y_pred=model.predict(X_test_scaled)

y_test
y_pred
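K-means itself is available in scikit-learn as the KMeans class, which, unlike the KNeighborsClassifier fitted above, takes no labels. The sketch below shows how such a fit might look. It runs on synthetic stand-in values for Age and EstimatedSalary because the CSV is not included here, and the choice of two clusters, the variable names, and the random seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Age and EstimatedSalary columns (assumed values).
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=200)
salary = rng.integers(15000, 150001, size=200)
X_demo = np.column_stack([age, salary]).astype(float)

# Scale first so that salary does not dominate the distance computation.
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# K-means is unsupervised: fit_predict() receives features only, no labels.
kmeans_model = KMeans(n_clusters=2, n_init=10, random_state=32)
cluster_labels = kmeans_model.fit_predict(X_demo_scaled)

print(kmeans_model.cluster_centers_)   # one centroid per cluster, in scaled units
print(np.bincount(cluster_labels))     # number of points in each cluster
```

The cluster labels returned here are arbitrary group identifiers, so they cannot be compared directly with the Purchased column the way classifier predictions can.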

Step 7: The next step is to evaluate our model. For this we will use a Confusion Matrix.

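# Making the Confusion Matrix

from sklearn.metrics import confusion_matrix

# Note: scikit-learn's signature is confusion_matrix(y_true, y_pred);
# with the arguments in this order, rows are predictions and columns are true labels.
confusion_matrix(y_pred,y_test)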

The resulting confusion matrix is:

array([[44, 2],
       [5, 29]], dtype=int64)

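Reading rows as predicted labels and columns as true labels (the order in which the arguments were passed), we have only 5 false positives and only 2 false negatives out of 80 test samples, which is a good result (here the accuracy is about 91%).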
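The accuracy figure can be checked directly from the matrix. The sketch below re-enters the reported values by hand; the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix as reported above.
cm = np.array([[44, 2],
               [5, 29]])

# Correct predictions sit on the diagonal; accuracy is their share of all samples.
accuracy = np.trace(cm) / cm.sum()
print(f"{accuracy:.4f}")  # 0.9125
```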

Use-Cases in the Security Domain

1. Identifying crime localities

Given data on crimes in specific localities of a city, clustering on the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.

2. Cyber-profiling criminals

Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiles, which give the investigation division information for classifying the types of criminals who were at the crime scene.

3. Call record detail analysis

Call detail records capture information about customers' calls, SMS, and internet activity. This information provides greater insight into customers' needs when used with customer demographics. We can cluster customer activity over 24 hours with the unsupervised K-means clustering algorithm to understand segments of customers by their hourly usage.

4. Anomaly detection

Anomaly detection refers to methods that provide warnings of unusual behavior that may compromise the security and performance of communication networks. Anomalous behavior can be identified by comparing the distance between observed data points and cluster centroids: points that lie far from every centroid are candidates for investigation. Identifying network anomalies early is essential for the communication networks of enterprises and institutions.
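A minimal sketch of this distance-to-centroid idea follows. The synthetic "traffic" features, the injected outlier, the choice of two clusters, and the 99th-percentile cutoff are all assumptions made for illustration; a real system would choose its features and threshold from domain knowledge.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic "normal" traffic features in two groups, plus one injected outlier
# (all values are illustrative assumptions).
rng = np.random.default_rng(0)
normal = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
                    rng.normal(5.0, 0.3, size=(100, 2))])
outlier = np.array([[2.5, 10.0]])
X_net = np.vstack([normal, outlier])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_net)

# Distance from each point to the centroid of its assigned cluster.
dist = np.linalg.norm(X_net - km.cluster_centers_[km.labels_], axis=1)

# Flag points whose distance exceeds a chosen cutoff (here the 99th percentile).
threshold = np.percentile(dist, 99)
anomalies = np.where(dist > threshold)[0]
print(anomalies)  # indices of the flagged points
```

A percentile cutoff always flags a fixed share of points, so some flagged points may be ordinary; an absolute distance threshold is an alternative when a meaningful scale is known.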

These are only a few use cases, and the list goes on. In the security domain or any other, K-means is an effective and easy way of clustering in machine learning.

Some other applications of K-means clustering:

Diagnostic systems: The medical profession uses K-means in creating smarter medical decision support systems, especially in the treatment of liver ailments.
Search engines: When a search is performed, the search results need to be grouped, and search engines very often use clustering to do this.

Conclusion

K-means clustering is an unsupervised machine learning algorithm that is part of a much deeper pool of data techniques and operations in the realm of data science. It is a fast and efficient algorithm for categorizing data points into groups even when very little information is available about the data.

Moreover, as with other unsupervised learning methods, it is necessary to understand the data before deciding which technique fits a given dataset. Choosing the correct algorithm can, in turn, save time and effort and help obtain more accurate results.

Thank you for reading!!
