K-means Clustering in Machine Learning
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.
A cluster is a collection of data points grouped together because of certain similarities.
The K-means algorithm identifies k centroids and then allocates every data point to the nearest centroid, while keeping each cluster as compact as possible (that is, keeping the distances between the points and their cluster’s centroid small).
The ‘means’ in K-means refers to averaging the data; that is, finding the centroid.
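More precisely, what K-means keeps small is the total squared distance between each point and the centroid of its cluster. As a sketch, using notation introduced here rather than taken from the article: let C_1, ..., C_k be the clusters, x_i the data points, and \mu_j the centroid of cluster C_j.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

The second formula is the "mean" in K-means: each centroid is the average of the points currently assigned to its cluster.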
The diagram below illustrates how the K-means clustering algorithm works:
How the K-means algorithm works
Step-1: Choose the number K of clusters.
Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the K clusters.
Step-4: Compute the mean of the points in each cluster and place that cluster’s new centroid there.
Step-5: Repeat step 3, that is, reassign each data point to the new closest centroid.
Step-6: If any reassignment occurred, go to step 4; otherwise go to FINISH.
Step-7: The model is ready.
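The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production implementation: the function name kmeans, its parameters, and the synthetic data are all made up for this example, and the initial centroids are drawn from the data points themselves.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch following the steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 3 / Step 5: assign each point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 6: stop as soon as no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two synthetic blobs (made-up data, for illustration only).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans(X, k=2)
print(centroids)
```

Because the starting centroids are random, different seeds can give different results on harder data; library implementations usually rerun the algorithm from several starts and keep the best one.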
K-means algorithm example problem
We’ll use the Scikit-learn library and a small sample dataset (Social_Network_Ads.csv) to walk through a simple example.
Step 1: Importing important libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set()
Step 2: Now let’s import the dataset and slice the important features
ds=pd.read_csv('Social_Network_Ads.csv')
ds
ds.columns
X=ds[['Age','EstimatedSalary']] #slicing important features
y=ds['Purchased']
Step 3: Plotting the data
sns.scatterplot(data=ds, x='Age', y='EstimatedSalary', hue='Purchased')
Step 4: The next step is to split our data into two chunks: one will be used to train our model and the other will be used to test its results.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size = 0.20, random_state=32)
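Step 5: Age and EstimatedSalary are on very different scales, so the salary values would dominate any distance calculation. To resolve this magnitude problem, we have to scale the attributes. For this we use the StandardScaler from sklearn.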
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train_scaled=sc.fit_transform(X_train)
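To make the effect of scaling concrete, here is a small sketch of what StandardScaler computes: each column has its mean subtracted and is divided by its standard deviation, so a large-valued feature such as EstimatedSalary no longer outweighs Age. The four rows below are made-up values, not taken from Social_Network_Ads.csv.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows of (Age, EstimatedSalary), for illustration only.
X_demo = np.array([
    [20.0, 20000.0],
    [30.0, 50000.0],
    [40.0, 80000.0],
    [50.0, 110000.0],
])

scaler = StandardScaler()
X_demo_scaled = scaler.fit_transform(X_demo)

# Equivalent manual computation: subtract the column mean,
# then divide by the column (population) standard deviation.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_demo_scaled)
print(np.allclose(X_demo_scaled, manual))
```

After scaling, both columns have mean 0 and standard deviation 1, so they contribute comparably to Euclidean distances.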
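Step 6: The next step is to train the model. Note that the code below fits scikit-learn’s KNeighborsClassifier, a supervised k-nearest-neighbours classifier that learns from the labels in y_train. It is a different algorithm from K-means, which is unsupervised and does not use labels: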
# Fitting classifier to the Training set
from sklearn.neighbors import KNeighborsClassifier
model=KNeighborsClassifier(n_neighbors=15)
model.fit(X_train_scaled,y_train)
X_test_scaled=sc.transform(X_test)
y_pred=model.predict(X_test_scaled)
If you want to predict the classes for new observations, scale them with the same fitted scaler and then call predict:
# Predicting the Test set results (using the scaled features, as in training)
y_pred=model.predict(X_test_scaled)
y_test
y_pred
Step 7: The next step is to evaluate our model. For this we will use a Confusion Matrix.
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_pred,y_test)
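The resulting confusion matrix is:
array([[44,  2],
       [ 5, 29]], dtype=int64)
Because y_pred is passed as the first argument, rows correspond to predicted classes and columns to actual classes. As you can see, we have only 5 false positives and only 2 false negatives, which is a good result: 73 of the 80 test points are classified correctly, an accuracy of about 91%.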
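The classifier above relies on the Purchased labels, whereas K-means itself needs only the features. For comparison, here is a minimal sketch using scikit-learn's KMeans. It is an illustration under assumptions, not the article's own result: the data are synthetic blobs from make_blobs rather than Social_Network_Ads.csv, and n_clusters=2 is chosen only for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic, unlabeled 2-D data standing in for two numeric features.
X_blobs, _ = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=32)
X_blobs_scaled = StandardScaler().fit_transform(X_blobs)

# n_init=10 reruns K-means from 10 different starts and keeps the best run.
kmeans_model = KMeans(n_clusters=2, n_init=10, random_state=32)
cluster_labels = kmeans_model.fit_predict(X_blobs_scaled)

print(kmeans_model.cluster_centers_)  # one centroid per cluster
print(kmeans_model.inertia_)          # within-cluster sum of squared distances
```

Note that no y is passed to fit_predict. The cluster labels 0 and 1 are arbitrary identifiers, so a confusion matrix against true classes is only meaningful after matching clusters to classes.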
Use-Cases in the Security Domain
1. Identifying crime localities
Given data on crimes in specific localities of a city, the category of each crime, the area where it occurred, and the association between the two can give quality insight into crime-prone areas within a city or locality.
2. Cyber-profiling criminals
Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiles, which provide information to the investigation division to classify the types of criminals who were at the crime scene.
3. Call detail record analysis
A call detail record (CDR) captures information about a customer’s calls, SMS, and internet activity. This information provides greater insight into the customer’s needs when used together with customer demographics. We can cluster customer activity over 24 hours using the unsupervised K-means clustering algorithm, which helps to understand segments of customers with respect to their usage by hour.
4. Anomaly detection
Anomaly detection refers to methods that provide early warnings of unusual behaviors which may compromise the security and performance of communication networks. Anomalous behaviors can be identified by comparing the distance between real data and cluster centroids: a point that lies far from every centroid does not fit any known pattern. Identifying such network anomalies is essential for the communication networks of enterprises and institutions.
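The distance-to-centroid idea can be sketched as follows. Everything here is an assumption made for illustration: the data are synthetic, and the threshold rule (mean plus three standard deviations of the distances seen on normal data) is a hypothetical choice, not a rule from the article.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Made-up "normal" observations around two operating points.
normal = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2)),
])
# One made-up observation far away from both groups.
outlier = np.array([[20.0, -20.0]])
X_net = np.vstack([normal, outlier])

# Learn the centroids from normal behavior only.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(normal)

# transform() gives the distance from each sample to every centroid;
# the row-wise minimum is the distance to the nearest centroid.
normal_dist = km.transform(normal).min(axis=1)
nearest_dist = km.transform(X_net).min(axis=1)

# Hypothetical rule: flag points much farther away than normal points are.
threshold = normal_dist.mean() + 3 * normal_dist.std()
is_anomaly = nearest_dist > threshold

print(np.flatnonzero(is_anomaly))  # indices of flagged observations
```

In practice the threshold would be tuned on real traffic, since a rule this simple may also flag a few legitimate points at the edge of a cluster.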
These are only a few use cases, and the list goes on. Whether in the security domain or any other, K-means is an effective and easy way of clustering in machine learning.
Some other applications of K-means clustering:
Conclusion
K-means clustering is an unsupervised machine learning algorithm that is part of a much deeper pool of data techniques and operations in the realm of data science. It is a fast and efficient way to categorize data points into groups, even when very little information is available about the data.
Moreover, as with other unsupervised learning methods, it is necessary to understand the data before deciding which technique fits a given dataset. Choosing the right algorithm can, in turn, save time and effort and help in obtaining more accurate results.