K-means clustering
Clustering
Clustering is used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same cluster are very similar while data points in different clusters are very different.
Unlike supervised learning, clustering is an unsupervised learning method: there are no ground-truth labels against which the output of the clustering algorithm can be compared to evaluate its performance. Instead, we investigate the structure of the data by grouping the data points into distinct subgroups.
K-means clustering
K-means is a centroid-based algorithm, or a distance-based algorithm, where we calculate the distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.
The main objective of the K-Means algorithm is to minimize the sum of squared distances between the points and their respective cluster centroids.
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping clusters, where each data point belongs to only one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different from each other as possible.
It assigns data points to a cluster such that the sum of the squared distance between the data points and the cluster’s centroid is at the minimum. The less variation we have within clusters, the more homogeneous the data points are within the same cluster.
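To make this objective concrete, here is a minimal sketch that computes the sum of squared distances for two different groupings of the same four points. The toy data and the helper names (squared_distance, centroid, within_cluster_ss) are assumptions made for illustration only.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def within_cluster_ss(clusters):
    """Sum of squared distances from every point to its own cluster's centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(squared_distance(p, c) for p in points)
    return total


# Hypothetical toy data: the same four points grouped in two different ways.
good_split = [[(1.0, 1.0), (1.0, 3.0)], [(9.0, 1.0), (9.0, 3.0)]]
bad_split = [[(1.0, 1.0), (9.0, 1.0)], [(1.0, 3.0), (9.0, 3.0)]]

print(within_cluster_ss(good_split))  # 4.0
print(within_cluster_ss(bad_split))   # 64.0
```

The grouping that keeps nearby points together has the much smaller value, which is the kind of grouping K-means is trying to find.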
The K-means algorithm works as follows:
Diagrammatic Implementation of K Means Clustering
STEP 1: Choose the number of clusters K. Here we take K=2 to segregate the dataset into two clusters, and we choose 2 random points to act as the initial centroids.
STEP 2: Assign each data point on the scatter plot to its closest K-point or centroid. With two centroids, this is done by drawing the perpendicular bisector of the line segment joining them (called the median line in the steps below).
STEP 3: Points to the left of the line are nearer to the blue centroid, and points to the right of the line are nearer to the yellow centroid. The points on the left form a cluster with the blue centroid, and the points on the right form a cluster with the yellow centroid.
STEP 4: Repeat the process with new centroids. Each new centroid is the center of gravity (the mean) of the data points currently assigned to that cluster.
STEP 5: Next, reassign each data point to its nearest new centroid, using the same median-line procedure as above. A yellow data point that now falls on the blue side of the median line is included in the blue cluster.
STEP 6: Because a reassignment has taken place, repeat the step of finding new centroids.
STEP 7: Compute the center of gravity of each cluster again to obtain the updated centroids.
STEP 8: After finding the new centroids, draw the median line again and reassign the data points, as in the steps above.
STEP 9: The process stops when no data point changes cluster. The points are then segregated by the median line into two groups, with no dissimilar point included in either group.
The clusters formed at this point are the final clusters.
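The assign-and-recompute loop described in the steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than a production implementation: the function name kmeans, the parameters max_iter and seed, the choice of initial centroids as randomly sampled data points, and the toy dataset are all hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Center of gravity (component-wise mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k data points picked at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, labels


# Hypothetical toy data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

Which group ends up labeled 0 and which is labeled 1 depends on the random initial centroids, so only the grouping itself is meaningful, not the label numbers.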
Choosing The Right Number Of Clusters
The number of clusters that we choose for the algorithm shouldn't be random. Each cluster is formed by calculating and comparing the distances of the data points within the cluster from its centroid.
We can choose the right number of clusters with the help of the Within-Cluster-Sum-of-Squares (WCSS) method.
WCSS stands for the sum of the squares of the distances of the data points in each cluster from its centroid.
The main idea is to minimize the distance between the data points and the centroids of their clusters. Because WCSS keeps shrinking as K grows (it reaches zero when every data point is its own cluster), we compute WCSS for a range of K values and choose the K beyond which adding more clusters no longer reduces it by much.
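As a rough sketch of how WCSS can be compared across different values of K, the snippet below fits a small pure-Python K-means for K = 1 to 5 and reports the WCSS of each fit. The helper names (kmeans_wcss, wcss_curve), the restarts parameter, and the toy data with three visible groups are assumptions for illustration. Each K is run from several random starts and the smallest WCSS is kept, since a single run can settle on a poor grouping.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wcss(points, k, seed, max_iter=100):
    """Run one K-means fit from a random start and return its WCSS."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # WCSS: squared distance from each point to the centroid of its cluster.
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


def wcss_curve(points, k_values, restarts=10):
    """Best (smallest) WCSS found for each K over several random restarts."""
    return {
        k: min(kmeans_wcss(points, k, seed) for seed in range(restarts))
        for k in k_values
    }


# Hypothetical toy data with three visible groups.
points = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 1), (5, 10), (6, 10), (5, 11)]
curve = wcss_curve(points, range(1, 6))
for k, wcss in curve.items():
    print(k, round(wcss, 2))
```

When the printed values are plotted against K, the point where the curve flattens out is the usual candidate for the number of clusters.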
Use-Cases in the Security Domain
Here is a list of some of the interesting use cases of K-means in the Security Domain:
1. Identifying crime localities
When crime data is available for specific localities in a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.
2. Insurance fraud detection
Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
3. Cyber-profiling criminals
Cyber profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber profiling is derived from criminal profiles, which provide information to the investigation division to classify the types of criminals who were at the crime scene.
4. Call detail record analysis
A call detail record (CDR) is the information captured by telecom companies during the call, SMS, and internet activity of a customer. This information provides greater insight into the customer's needs when used with customer demographics. We can cluster customer activity over 24 hours using the unsupervised K-means clustering algorithm, which helps us understand segments of customers with respect to their usage by hour.
5. Automatic clustering of IT alerts
Large enterprise infrastructure technology components such as network, storage, or database generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened for prioritization for downstream processes. Clustering of data can provide insight into categories of alerts and mean time to repair, and help in failure predictions.
6. Rideshare data analysis
The publicly available Uber ride information dataset provides a large amount of valuable data around traffic, transit time, peak pickup localities, and more. Analyzing this data is useful not just in the context of Uber but also in providing insight into urban traffic patterns and helping us plan for the cities of the future.
7. Crime document classification
Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a very standard document-grouping problem, and K-means is a highly suitable algorithm for this purpose. Initial processing of the documents is needed to represent each document as a vector, using term frequency to identify commonly used terms that help categorize the document. The document vectors are then clustered to help identify similarity within document groups.
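The vectorization step can be sketched as follows. This is only an illustration under simple assumptions: the function name term_frequency_vectors is hypothetical, the sample documents are made up, and tokenization is a plain lowercase split on whitespace, whereas a real pipeline would typically also remove punctuation and very common words.

```python
from collections import Counter


def term_frequency_vectors(documents):
    """Return (vocabulary, vectors), one term-frequency vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct term, in a fixed (sorted) order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for an empty document
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors


# Hypothetical sample documents, invented for this sketch.
docs = [
    "burglary reported downtown",
    "downtown burglary suspect arrested",
    "phishing email fraud",
    "email fraud reported online",
]
vocabulary, vectors = term_frequency_vectors(docs)
print(vocabulary)
for vec in vectors:
    print([round(x, 2) for x in vec])
```

Every document becomes a vector of the same length, so the vectors can be passed to K-means like any other numeric data points.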
These are just a few use cases, and the list goes on. Whether in the security domain or any other, K-means is an effective and easy way of clustering in machine learning.