K-Means Clustering Algorithm.

K-Means Clustering is an unsupervised learning algorithm that solves clustering problems in machine learning and data science. In this article, we will learn what the K-means clustering algorithm is and how it works.

Before we dive into the algorithm, let's first understand clustering.



Clustering:



  • Clustering is the task that the K-means algorithm performs: grouping similar data points based on their characteristics or features.


  • The goal of clustering is to partition a set of data points into distinct clusters, where each cluster consists of data points that are more similar to each other than to those in other clusters.


  • Similarity is usually measured by distance: points that lie close together are treated as similar.


  • Clustering is an excellent tool for data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, dimensionality reduction, and more.


K-Means Algorithm:


  • K-means is an unsupervised learning method and belongs to the family of clustering algorithms.


  • K-means groups unlabeled data into clusters based on similarity.


  • Here K is the number of clusters to create, chosen in advance: if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.


  • “It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group that has similar properties.”


  • It allows us to cluster the data into different groups and is a convenient way to discover the groups in an unlabeled dataset on its own, without the need for any labeled training data.


  • It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of squared distances between the data points and the centroids of their assigned clusters.


  • K-means tries to find K points called centroids, each of which sits at the centre (least cumulative squared distance) of the points in its own cluster and farther away from the points of other clusters.


  • The algorithm takes the unlabeled dataset as input, divides it into K clusters, and repeats the process until the clusters stop changing. The value of K must be chosen before running the algorithm.



Aim of the K-Means Algorithm:


The k-means clustering algorithm mainly performs two tasks:


  • Determines the best positions for the K centre points (centroids) through an iterative process.


  • Assigns each data point to its closest centroid. The data points nearest to a particular centroid form a cluster.


How the K-Means Algorithm Works:


  1. Plot the data.
  2. Select the number K to decide the number of clusters.
  3. Select K random points as the initial centroids. (They do not have to be points from the input dataset.)
  4. Assign each data point to its closest centroid, which forms the K clusters.
  5. Recompute each centroid as the mean of the data points assigned to it.
  6. Repeat steps 4 and 5, reassigning each data point to its new closest centroid, until the assignments stop changing and the clusters no longer overlap.
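The loop described above can be sketched in plain Python. This is only a minimal illustration, not a production implementation: the function name `k_means`, the toy `points`, and the choice to draw the initial centroids from the data points themselves are our own assumptions. Between two assignment passes, each centroid is moved to the mean of the points currently assigned to it.

```python
import math
import random

def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Pick k distinct data points as the initial centroids (an assumption;
    # other initialisation schemes exist).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assign every point to the index of its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop as soon as no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Toy data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the random initial centroids, so only the grouping itself is meaningful.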



Let us go through the figures one by one.


  • Figure 1 shows data from two different items: the first item is shown in blue and the second in red. Here I choose K = 2 arbitrarily. There are different methods for choosing a suitable value of K.


  • In Figure 2, we join the two selected centroids with a line and draw a perpendicular line through it. Points on each side of the perpendicular line are closer to the centroid on that side, so they are assigned to it, and each centroid then moves to the centre of its points. If you look closely, you will see that some of the red points have now been reassigned to the blue group.


  • The same process continues in Figure 3. We join the two centroids, draw the perpendicular line, and recompute the centroids. The two centroids move again, and once more some of the red points are reassigned to the blue group.


  • The same process happens in Figure 4. It continues until we get two clearly separated clusters.

            

Main points:

  • Inter-cluster distance should be high: observations in different clusters should be far apart.
  • Intra-cluster distance should be low: observations within the same cluster should be close together.
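To make these two quantities concrete, here is a small sketch that measures them for two hand-made clusters. The helper names and the data are hypothetical, and "distance between clusters" is taken here to mean the distance between the two centroids, which is only one of several possible definitions.

```python
import math
from itertools import combinations

def mean_intra_cluster_distance(cluster):
    """Average distance over all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def centroid(cluster):
    """Mean of the points in a cluster, coordinate by coordinate."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

cluster_a = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5)]
cluster_b = [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

intra_a = mean_intra_cluster_distance(cluster_a)
intra_b = mean_intra_cluster_distance(cluster_b)
inter_ab = math.dist(centroid(cluster_a), centroid(cluster_b))

print("intra-cluster:", intra_a, intra_b)
print("inter-cluster:", inter_ab)
```

For a good clustering, the inter-cluster value should be much larger than both intra-cluster values.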



Measuring Distance:

Euclidean Distance Measure:

The most common case is determining the distance between two points. If we have point P and point Q, the Euclidean distance is the length of the ordinary straight line between them in Euclidean space.


For two points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn), the formula for the distance is:

    d(P, Q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)



Manhattan Distance Measure:

The Manhattan distance is the sum of the absolute differences along each axis, that is, the distance between two points measured along axes at right angles.


For two points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn), the formula is:

    d(P, Q) = |p1 - q1| + |p2 - q2| + ... + |pn - qn|
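Both distance measures take only a few lines of Python. The function names below are our own, and the two points are just an example chosen so the numbers are easy to check by hand.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan_distance(p, q):
    """Sum of the absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1, 2), (4, 6)
print(euclidean_distance(p, q))  # 5.0  (sqrt(3^2 + 4^2))
print(manhattan_distance(p, q))  # 7    (3 + 4)
```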


How to Evaluate a K-Means Model?


Silhouette Coefficient:

The silhouette coefficient, or silhouette score, is a metric used to measure how good a clustering is. Its value ranges from -1 to 1.


1: Clusters are well apart from each other and clearly distinguished.


0: Clusters overlap, or we can say that the distance between clusters is not significant.


-1: Points have been assigned to the wrong clusters.
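For a single point, the silhouette value is usually defined as s = (b - a) / max(a, b), where a is the mean distance from the point to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below averages s over all points. It is a plain-Python illustration with a function name of our choosing, it scores single-point clusters as 0 by convention, and it assumes at least two clusters and that the points are not all identical.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other points of the same cluster
        # (the point's distance to itself is 0, so it adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # sensible grouping
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # deliberately mixed up
print(good, bad)
```

The sensible grouping scores close to 1, while the mixed-up labels score much lower.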



Important Points:


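  • K-means relies on a distance measure.


  • Scaling is very important, because features with larger numeric ranges would otherwise dominate the distance.


  • Handling outliers is also important, because centroids are means and are pulled toward extreme values.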
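As an illustration of the scaling point, here is a minimal sketch of z-score standardization, one common way to put features on a comparable scale before computing distances. The function name and the customer data are hypothetical, and the sketch assumes no feature is constant (a constant column would have a standard deviation of zero).

```python
import statistics

def standardize(points):
    """Rescale every feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*points))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(p, means, stdevs))
        for p in points
    ]

# Hypothetical customers: (age in years, annual income).
# Without scaling, income differences would swamp age differences.
customers = [(25, 30_000), (40, 90_000), (55, 60_000)]
scaled = standardize(customers)
for row in scaled:
    print(row)
```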


How to select the optimal value for k?


  • Elbow Method:

The Elbow method is one of the most popular ways to find the optimal number of clusters. It is based on the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which measures the total variation within the clusters: the sum of squared distances between each point and the centroid of its cluster.
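Written as a formula, WCSS adds up the squared distance between every point and the centroid of the cluster it belongs to. The symbols are our own notation: C_j is the set of points in cluster j, \mu_j is that cluster's centroid, and k is the number of clusters.

```latex
\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```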



           

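How does it work?


  • Choose a range of k values, for example k=[2,3,4,5,6,...10].
  • Start with the first value: if k=2, apply K-means.
  • Compute the WCSS of the resulting clusters.
  • Repeat these steps for each of the other k values.
  • Plot a graph of k versus WCSS.
  • Choose the k value at the "elbow" of the curve, after which the WCSS value stays nearly constant.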
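The procedure can be sketched in plain Python as follows. Everything here is illustrative: the compact `k_means` helper, the toy data with three obvious groups, and the decision to keep the best of several random starts for each k are our own assumptions. Because the toy dataset has only nine points, k runs from 1 to 6 instead of up to 10.

```python
import math
import random

def k_means(points, k, max_iters=100, seed=0):
    """Compact K-means: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def wcss(points, centroids, labels):
    """Sum of squared distances from each point to its cluster's centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Toy data with three well-separated groups.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 0), (10, 1), (11, 0),
    (5, 10), (5, 11), (6, 10),
]

wcss_by_k = {}
for k in range(1, 7):
    # K-means depends on its random start, so keep the best of 10 runs.
    wcss_by_k[k] = min(
        wcss(points, *k_means(points, k, seed=s)) for s in range(10)
    )

for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

Plotting these values against k would show a sharp drop up to k = 3 and only small improvements afterwards, so the elbow for this toy data is at k = 3.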


[Figure: K-Means algorithm overview]


Applications of the K-Means Algorithm:


The k-means algorithm is a popular clustering algorithm used in various fields to group data points into distinct clusters. Here are some common applications of the k-means algorithm:


  • Image Segmentation: In image processing, k-means can be used to segment an image into different regions based on colour similarity. Each cluster represents a distinct region in the image, allowing for further analysis or processing.


  • Customer Segmentation: In marketing and customer analytics, k-means can be used to segment customers into groups based on their buying behaviour, demographics, or other relevant factors. This information can help businesses target specific customer segments with tailored marketing strategies.


  • Anomaly Detection: K-means can be used to identify outliers or anomalies in a dataset. By clustering the data, the algorithm can identify data points that deviate significantly from the rest of the data, which can be useful in detecting fraudulent transactions, network intrusions, or other irregularities.


  • Document Clustering: In natural language processing (NLP), k-means can be applied to group similar documents together based on their content. This can be useful for tasks such as document organization, topic modeling, and information retrieval.


  • Recommendation Systems: K-means can be used in collaborative filtering-based recommendation systems to cluster users with similar preferences or behaviours. By identifying similar user clusters, personalized recommendations can be generated based on the preferences of users in the same cluster.
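As a small illustration of the anomaly detection use case listed above, one simple approach is to flag points that lie far from every centroid. The sketch below assumes the centroids have already been found by a K-means run; the function name, the centroids, the points, and the threshold value are all hypothetical.

```python
import math

def flag_anomalies(points, centroids, threshold):
    """Return the points farther than `threshold` from every centroid."""
    return [
        p for p in points
        if min(math.dist(p, c) for c in centroids) > threshold
    ]

centroids = [(1.0, 1.0), (8.5, 9.0)]  # assumed output of an earlier K-means run
points = [(1.2, 0.8), (8.0, 9.5), (4.5, 15.0)]
print(flag_anomalies(points, centroids, threshold=3.0))  # [(4.5, 15.0)]
```

In practice the threshold has to be chosen for the data at hand, for example from the distribution of distances within each cluster.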


—------------------------------------------------------------------------------------------------------


If you learned something from this blog, make sure you give it a 👏🏼

Will meet you in some other article, till then, Peace ✌🏼.



Happy reading.


 

Thank_You..











More articles by Dishant Kharkar
