The document discusses various unsupervised learning techniques including clustering algorithms like k-means, k-medoids, hierarchical clustering and density-based clustering. It explains how k-means clustering works by selecting initial random centroids and iteratively reassigning data points to the closest centroid. The elbow method is described as a way to determine the optimal number of clusters k. The document also discusses how k-medoids clustering is more robust to outliers than k-means because it uses actual data points as cluster representatives rather than centroids.
The document provides an overview of clustering methods and algorithms. It defines clustering as the process of grouping objects that are similar to each other and dissimilar to objects in other groups. It discusses existing clustering methods like K-means, hierarchical clustering, and density-based clustering. For each method, it outlines the basic steps and provides an example application of K-means clustering to demonstrate how the algorithm works. The document also discusses evaluating clustering results and different measures used to assess cluster validity.
The document discusses the concept of clustering, which is an unsupervised machine learning technique used to group unlabeled data points that are similar. It describes how clustering algorithms aim to identify natural groups within data based on some measure of similarity, without any labels provided. The key types of clustering are partition-based (like k-means), hierarchical, density-based, and model-based. Applications include marketing, earth science, insurance, and more. Quality measures for clustering include intra-cluster similarity and inter-cluster dissimilarity.
This document discusses unsupervised machine learning techniques for clustering unlabeled data. It covers k-means clustering, which partitions data into k groups based on minimizing distance between points and cluster centroids. It also discusses agglomerative hierarchical clustering, which successively merges clusters based on their distance. As an example, it shows hierarchical clustering of texture images from five classes to group similar textures.
The method of identifying similar groups of data in a data set is called clustering. Entities in each group are more similar to the other entities in that group than to entities in other groups.
This document provides an overview of clustering and k-means clustering algorithms. It begins by defining clustering as the process of grouping similar objects together and dissimilar objects separately. K-means clustering is introduced as an algorithm that partitions data points into k clusters by minimizing total intra-cluster variance, iteratively updating cluster means. The k-means algorithm and an example are described in detail. Weaknesses and applications are discussed. Finally, vector quantization and principal component analysis are briefly introduced.
Clustering is an unsupervised learning technique used to group unlabeled data points into clusters based on similarity. It is widely used in data mining applications. The k-means algorithm is one of the simplest clustering algorithms that partitions data into k predefined clusters, where each data point belongs to the cluster with the nearest mean. It works by assigning data points to their closest cluster centroid and recalculating the centroids until clusters stabilize. The k-medoids algorithm is similar but uses actual data points as centroids instead of means, making it more robust to outliers.
Document clustering for forensic analysis: an approach for improving compute... — Madan Golla
The document proposes an approach to apply document clustering algorithms to forensic analysis of computers seized in police investigations. It discusses using six representative clustering algorithms - K-means, K-medoids, Single/Complete/Average Link hierarchical clustering, and CSPA ensemble clustering. The approach estimates the number of clusters automatically from the data using validity indexes like silhouette, in order to facilitate computer inspection and speed up the analysis process compared to examining each document individually.
Unsupervised learning techniques like clustering are used to explore intrinsic structures in unlabeled data and group similar data instances together. Clustering algorithms like k-means partition data into k clusters where each cluster has a centroid, and data points are assigned to the closest centroid. Hierarchical clustering creates nested clusters by iteratively merging or splitting clusters based on distance metrics. Choosing the right distance metric and clustering algorithm depends on factors like attribute ranges and presence of outliers.
Cluster analysis, or clustering, is the process of grouping data objects into subsets called clusters so that objects within a cluster are similar to each other but dissimilar to objects in other clusters. There are several approaches to clustering, including partitioning, hierarchical, density-based, and grid-based methods. The k-means and k-medoids algorithms are popular partitioning methods that aim to partition observations into k clusters by minimizing distances between observations and cluster centroids or medoids. K-medoids is more robust to outliers as it uses actual observations as cluster representatives rather than centroids. Both methods require specifying the number of clusters k in advance.
This document discusses various clustering methods used in data mining. It begins with an overview of clustering and its applications. It then describes five major categories of clustering methods: partitioning methods like k-means and k-medoids, hierarchical methods like agglomerative nesting and divisive analysis, density-based methods, grid-based methods, and model-based clustering methods. For each category, popular algorithms are provided as examples. The document also covers types of data for clustering and evaluating clustering results.
Unsupervised learning Algorithms and Assumptions — refedey275
Topics :
Introduction to unsupervised learning
Unsupervised learning Algorithms and Assumptions
K-Means algorithm – introduction
Implementation of K-means algorithm
Hierarchical Clustering – need and importance of hierarchical clustering
Agglomerative Hierarchical Clustering
Working of dendrogram
Steps for implementation of AHC using Python
Gaussian Mixture Models – Introduction, importance and need of the model
Normal (Gaussian) distribution
Implementation of Gaussian mixture model
Understand the different distance metrics used in clustering
Euclidean, Manhattan, Cosine, Mahalanobis
Features of a Cluster – Labels, Centroids, Inertia, Eigenvectors and Eigenvalues
Principal component analysis
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data are unknown
Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data
Types of Hierarchical Clustering
There are mainly two types of hierarchical clustering:
Agglomerative hierarchical clustering
Divisive Hierarchical clustering
A distribution in statistics is a function that shows the possible values for a variable and how often they occur.
In probability theory and statistics, the Normal Distribution, also called the Gaussian Distribution, is the most significant continuous probability distribution. It is sometimes also called a bell curve.
This document provides information about clustering and cluster analysis. It begins by defining clustering as the process of grouping objects into classes of similar objects. It then discusses what a cluster is and different types of clustering techniques, including partitioning methods like k-means clustering. K-means clustering is explained as an algorithm that assigns objects to clusters based on minimizing distance between objects and cluster centers, then updating the cluster centers. Examples are provided to demonstrate how k-means clustering works on a sample dataset.
This document provides an overview of unsupervised learning and clustering algorithms. It discusses the motivation for clustering as grouping similar data points without labels. It introduces common clustering algorithms like K-means, hierarchical clustering, and fuzzy C-means. It covers clustering criteria such as similarity functions, stopping criteria, and cluster quality. It also discusses techniques like data normalization and challenges in evaluating clusters without ground truths. The document aims to explain the concepts and applications of unsupervised learning for clustering unlabeled data.
Cluster analysis is an unsupervised machine learning technique used to group unlabeled data points into clusters based on similarities. It involves finding groups of objects such that objects within a cluster are more similar to each other than objects in different clusters. The key goals of cluster analysis are to maximize intra-cluster similarity while minimizing inter-cluster similarity. Common applications of cluster analysis include market segmentation, document classification, and identifying homogeneous groups in biological data.
2. Clustering
• Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as close to each other as possible.
• While data points in the same cluster are similar, those in separate clusters are dissimilar to one another.
• Clustering is a data mining (machine learning) technique
that finds similarities between data according to the
characteristics found in the data & groups similar data
objects into one cluster
3. Example: clustering
• The example below demonstrates the clustering of padlocks of the same kind. There are a total of 10 padlocks, which vary in color, size, shape, etc.
• How many possible clusters of padlocks can be identified?
– There are three different kinds of padlocks, which can be grouped into three different clusters.
– The padlocks of the same kind are clustered into a group as shown below:
4. Example: Clustering Application
• Text/Document Clustering:
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
– Approach:
• Identify content-bearing terms in each document.
• Form a similarity measure based on the frequencies
of different terms and use it to cluster documents.
– Application:
• Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
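As a rough illustration of this approach (not part of the original slides), the sketch below uses scikit-learn's TfidfVectorizer to weight content-bearing terms and KMeans to group documents with similar term profiles; the library choice and the three sample documents are assumptions made for the example.

# Hedged sketch of the document-clustering approach described above.
# Assumes scikit-learn is installed; the sample documents are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "k-means partitions data into k clusters around centroids",
    "hierarchical clustering merges the closest clusters step by step",
    "stock prices fell sharply after the earnings report",
]

# Identify content-bearing terms and weight them by frequency (TF-IDF).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Group documents with similar term profiles into k clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # cluster id for each document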
5. Quality: What Is Good Clustering?
• The quality of a clustering result depends
on both the similarity measure used by
the method and its implementation
– Key requirement of clustering: Need a
good measure of similarity between
instances.
• The quality of a clustering method is also
measured by its ability to discover some
or all of the hidden patterns in the given
datasets
• A good clustering method will produce
high quality clusters with
–high intra-class similarity
–low inter-class similarity
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
6. Cluster Evaluation: Hard Problem
• The quality of a clustering is very hard to
evaluate because
– We do not know the correct clusters/classes
• Some methods are used:
– User inspection
• Study centroids of the cluster, and spreads of data
items in each cluster
• For text documents, one can read some documents in
clusters to evaluate the quality of clustering
algorithms employed.
7. Cluster Evaluation: Ground Truth
• We use some labeled data (for classification)
– Assumption: Each class is a cluster.
• After clustering, a confusion matrix is constructed. From the matrix, we compute various measures: entropy, purity, precision, recall and F-score.
– Let the classes in the data D be C = (c1, c2, …, ck).
The clustering method produces k clusters, which
divides D into k disjoint subsets, D1, D2, …, Dk.
8. Evaluation of Cluster Quality using Purity
• Quality measured by its ability to discover some or all of the
hidden patterns or latent classes in gold standard data
• Assesses a clustering with respect to ground truth …
requires labeled data
• Assume documents with C gold standard classes, while our
clustering algorithms produce K clusters, ω1, ω2, …, ωK with
ni members
• Simple measure: purity, the ratio between the size of the dominant class in cluster ωi and the size of cluster ωi
• Others are entropy of classes in clusters (or mutual
information between classes and clusters)
$$\mathrm{Purity}(\omega_i) = \frac{1}{n_i} \max_{j=1,\dots,C} n_{ij}$$
where n_i is the size of cluster ω_i and n_ij is the number of members of class c_j assigned to cluster ω_i.
9.
Cluster I Cluster II Cluster III
• Assume that we cluster three categories of data items (colored red, blue and green) into three clusters as shown in the figures above. Calculate purity to measure the quality of each cluster.
Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6 = 83%
Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6 = 67%
Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5 = 60%
Purity example
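A short Python sketch (an illustration added here, not from the slides) that reproduces these purity values from the per-cluster class counts:

# Purity of each cluster = (size of its dominant class) / (size of the cluster).
# The counts below are the red/blue/green counts from the example above.
clusters = {
    "Cluster I":   [5, 1, 0],
    "Cluster II":  [1, 4, 1],
    "Cluster III": [2, 0, 3],
}

for name, counts in clusters.items():
    purity = max(counts) / sum(counts)
    print(f"{name}: purity = {purity:.0%}")
# Cluster I: 83%, Cluster II: 67%, Cluster III: 60%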
10. Indirect Evaluation
• In some applications, clustering is not the primary task,
but used to help perform another task.
• We can use the performance on the primary task to
compare clustering methods.
• For instance, in an application, the primary task is to
provide recommendations on book purchasing to online
shoppers.
– If we can cluster books according to their features, we might
be able to provide better recommendations.
– We can evaluate different clustering algorithms based on how
well they help with the recommendation task.
– Here, we assume that the recommendation can be reliably
evaluated.
11. Similarity/Dissimilarity Measures
• Each clustering problem is based on some kind of “distance”
or “nearness measurement” between data points.
• Distances are normally used to measure the similarity or
dissimilarity between two data objects
• A popular distance measure is the Minkowski distance:
$$dis(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q}$$
where X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) are two n-dimensional data objects, n is the size of the objects' attribute vectors, and q = 1, 2, 3, …
• If q = 1, dis is the Manhattan distance:
$$dis(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$$
12. Similarity and Dissimilarity Between Objects
• If q = 2, dis is the Euclidean distance:
$$dis(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
• Cosine Similarity
– If X and Y are two vector attributes of data objects, then the cosine similarity measure is given by:
$$sim(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|}$$
where · indicates the vector dot product and ||X|| is the length (norm) of vector X.
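A minimal Python sketch of these measures, written directly from the formulas above (the helper-function names are illustrative, not from the slides):

import math

def minkowski(x, y, q):
    # Minkowski distance: (sum of |x_i - y_i|^q) ^ (1/q)
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def manhattan(x, y):          # q = 1
    return minkowski(x, y, 1)

def euclidean(x, y):          # q = 2
    return minkowski(x, y, 2)

def cosine_similarity(x, y):  # dot product divided by the vector lengths
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(manhattan((2, 10), (5, 8)))         # 5
print(euclidean((2, 10), (5, 8)))         # ~3.61
print(cosine_similarity((2, 10), (5, 8))) # ~0.94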
14. The need for representative
• Key problem: as you build clusters, how do you
represent the location of each cluster, to tell
which pair of clusters is closest?
• For each cluster, assign a centroid (the point closest to all the other points), computed as the average of its points:
$$C_m = \frac{1}{N} \sum_{i=1}^{N} x_i$$
where N is the number of points in the cluster and the x_i are its points.
• Measure inter-cluster distances by the distances between centroids.
15. Major Clustering Approaches
• Partitioning clustering approach:
– Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
– Typical methods:
• distance-based: K-means clustering
• model-based: expectation maximization (EM) clustering.
• Hierarchical clustering approach:
– Create a hierarchical decomposition of the set of data (or
objects) using some criterion
– Typical methods:
• agglomerative vs. divisive
• single link vs. complete link
16. Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized
• Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means: Each cluster is represented by the center of the cluster
• K is the number of clusters to partition the dataset
• Means refers to the average location of members of a
particular cluster
– k-medoids or PAM (Partition Around Medoids): Each cluster is
represented by one of the objects in the cluster
17. The K-Means Clustering Method
• Algorithm:
• Select K cluster points as initial centroids (the initial
centroids are selected randomly)
– Given k, the k-means algorithm is implemented as
follows:
• Repeat
–Partition objects into k nonempty subsets
–Recompute the centroid of each of the k clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
–Assign each object to the cluster with the nearest seed point
• Until the centroids don't change
19. Example Problem
• Cluster the following eight points (with (x, y)
representing locations) into three clusters : A1(2,
10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4)
A7(1, 2) A8(4, 9).
– Assume that initial cluster centers are: A1(2, 10), A4(5,
8) and A7(1, 2).
• The distance function between two points a=(x1,
y1) and b=(x2, y2) is defined as:
dis(a, b) = |x2 – x1| + |y2 – y1| .
• Use k-means algorithm to find optimal centroids to
group the given data into three clusters.
20. Iteration 1
Point        Mean 1 (2, 10)   Mean 2 (5, 8)   Mean 3 (1, 2)   Cluster
A1 (2, 10)         0                5               9             1
A2 (2, 5)          5                6               4             3
A3 (8, 4)         12                7               9             2
A4 (5, 8)          5                0              10             2
A5 (7, 5)         10                5               9             2
A6 (6, 4)         10                5               7             2
A7 (1, 2)          9               10               0             3
A8 (4, 9)          3                2              10             2
First we list all points in the first column of the table above. The initial cluster centers (centroids) are (2, 10), (5, 8) and (1, 2), chosen randomly.
Next, we calculate the distance from each point to each of the three centroids, using the distance function:
dis(point i, mean j) = |x2 – x1| + |y2 – y1|
21. Iteration 1
• Starting from point A1 calculate the distance to each of the three means, by
using the distance function:
dis (A1, mean1) = |2 – 2| + |10 – 10| = 0 + 0 = 0
dis(A1, mean2) = |5 – 2| + |8 – 10| = 3 + 2 = 5
dis(A1, mean3) = |1 – 2| + |2 – 10| = 1 + 8 = 9
– Fill these values in the table and decide which cluster the point (2, 10) should be placed in: the one where the point has the shortest distance to the mean, i.e. mean 1 (cluster 1), since the distance is 0.
• Next go to the second point A2 and calculate the distance:
dis(A2, mean1) = |2 – 2| + |10 – 5| = 0 + 5 = 5
dis(A2, mean2) = |5 – 2| + |8 – 5| = 3 + 3 = 6
dis(A2, mean3) = |1 – 2| + |2 – 5| = 1 + 3 = 4
– So, we fill in these values in the table and assign the point (2, 5) to cluster 3, since mean 3 has the shortest distance from A2.
• Analogously, we fill in the rest of the table and place each point in one of the clusters
22. Iteration 1
• Next, we need to re-compute the new cluster centers (means). We
do so, by taking the mean of all points in each cluster.
• For Cluster 1, we only have one point A1(2, 10), which was the old
mean, so the cluster center remains the same.
• For Cluster 2, we have five points and need to take the average of them as the new centroid, i.e.
( (8+5+7+6+4)/5, (4+8+5+4+9)/5 ) = (6, 6)
• For Cluster 3, we have two points. The new centroid is:
( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
• That was Iteration 1 (epoch 1). Next, we go to Iteration 2 (epoch 2), Iteration 3, and so on, until the centroids do not change anymore.
– In Iteration 2, we basically repeat the process from Iteration 1, this time using the new means we computed.
23. Second epoch
• Using the new centroids we have to recompute the cluster members.
• After the 2nd epoch the results would be:
cluster 1: {A1, A8} with new centroid = (3, 9.5);
cluster 2: {A3, A4, A5, A6} with new centroid = (6.5, 5.25);
cluster 3: {A2, A7} with new centroid = (1.5, 3.5)
Distances are recomputed for A1–A8 against the centroids used in this epoch: Mean 1 = (2, 10), Mean 2 = (6, 6), Mean 3 = (1.5, 3.5).
24. Third epoch
• Using the new centroids we have to recompute the cluster members.
• After the 3rd epoch the results would be:
cluster 1: {A1, A4, A8} with new centroid = (3.66, 9);
cluster 2: {A3, A5, A6} with new centroid = (7, 4.33);
cluster 3: {A2, A7} with new centroid = (1.5, 3.5)
Distances are recomputed for A1–A8 against the centroids used in this epoch: Mean 1 = (3, 9.5), Mean 2 = (6.5, 5.25), Mean 3 = (1.5, 3.5).
25. Fourth epoch
• Using the new centroids we have to recompute the cluster members.
• After the 4th epoch the results would be:
cluster 1: {A1, A4, A8} with new centroid = (3.66, 9);
cluster 2: {A3, A5, A6} with new centroid = (7, 4.33);
cluster 3: {A2, A7} with new centroid = (1.5, 3.5)
Distances are recomputed for A1–A8 against the centroids used in this epoch: Mean 1 = (3.66, 9), Mean 2 = (7, 4.33), Mean 3 = (1.5, 3.5).
26. Final results
• Finally, in the 4th epoch there is no change in the members of the clusters or in the centroids, so the algorithm stops.
• The result of the clustering is shown in the following figure
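The short script below is an illustrative sketch (not from the slides) that replays this worked example: Manhattan distance, initial centroids A1, A4 and A7, iterating until the centroids stop changing. It reproduces the final clusters {A1, A4, A8}, {A3, A5, A6}, {A2, A7}.

# Replay of the worked example above with Manhattan distance.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centroids = [(2, 10), (5, 8), (1, 2)]                  # A1, A4, A7

while True:
    # Assign each point to the nearest centroid (Manhattan distance).
    clusters = [[] for _ in centroids]
    for name, (x, y) in points.items():
        d = [abs(x - cx) + abs(y - cy) for cx, cy in centroids]
        clusters[d.index(min(d))].append(name)
    # Recompute centroids as the mean of each cluster's points.
    new = [(sum(points[n][0] for n in cl) / len(cl),
            sum(points[n][1] for n in cl) / len(cl)) for cl in clusters]
    if new == centroids:                               # no change: stop
        break
    centroids = new

print(clusters)    # [['A1', 'A4', 'A8'], ['A3', 'A5', 'A6'], ['A2', 'A7']]
print(centroids)   # approximately [(3.67, 9.0), (7.0, 4.33), (1.5, 3.5)]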
27. Comments on the K-Means Method
• Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
• Weakness
–Applicable only when mean is defined, then what about
categorical data? Use hierarchical clustering
• Need to specify k, the number of clusters, in advance
–Unable to handle noisy data and outliers, since an object with an extremely large value may substantially distort the distribution of the data.
• K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
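As a small illustration of the medoid idea (an added sketch assuming Manhattan distance, not from the slides), the medoid of a cluster can be found as the member with the smallest total distance to all the other members:

def medoid(cluster):
    # Return the most centrally located member (smallest total Manhattan distance).
    return min(
        cluster,
        key=lambda p: sum(abs(p[0] - q[0]) + abs(p[1] - q[1]) for q in cluster),
    )

print(medoid([(2, 10), (5, 8), (4, 9)]))   # -> (4, 9) for this small cluster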
28. Hierarchical Clustering
• Compared to partitioning algorithms, in hierarchical clustering the data are not partitioned into a particular cluster in a single step.
–Instead, a series of partitions takes place,
which may run from a single cluster containing
all objects to n clusters each containing a
single object.
• Produces a set of nested clusters organized
as a hierarchical tree.
–Hierarchical clustering outputs a hierarchy, a
structure that is more informative than the
unstructured set of clusters returned by
partitioning clustering.
–Can be visualized as a dendrogram; a tree like
diagram that records the sequences of merges
or splits
[Figure: dendrogram over six points (1–6), with merge heights ranging from 0 to 0.2, alongside the corresponding grouping of the points]
29. Dendrogram: Shows How the Clusters are Merged
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level, then each connected
component forms a cluster.
31. Two main types of hierarchical clustering
• Agglomerative: it is a Bottom Up clustering technique
–Start with all sample units in n clusters of size 1.
–Then, at each step of the algorithm, the pair of clusters with the shortest
distance are combined into a single cluster.
–The algorithm stops when all sample units are grouped into one cluster of size n.
• Divisive: it is a Top Down clustering technique
–Start with all sample units in a single cluster of size n.
–Then, at each step of the algorithm, clusters are partitioned into a pair of
daughter clusters, selected to maximize the distance between each daughter.
–The algorithm stops when sample units are partitioned into n clusters of size 1.
Step 0 Step 1 Step 2 Step 3 Step 4
b
d
c
e
a
a b
d e
c d e
a b c d e
Step 4 Step 3 Step 2 Step 1 Step 0
agglomerative
divisive
32. Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward
1. Let each data point be a cluster
2. Compute the proximity matrix
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
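A rough Python sketch of this basic algorithm (added for illustration; the point coordinates are made up), using the single-link rule, i.e. the minimum pairwise distance, as the proximity between two clusters:

def single_link_agglomerative(points, dist):
    clusters = [[p] for p in points]               # 1. each data point starts as its own cluster
    merges = []
    while len(clusters) > 1:                       # repeat until only a single cluster remains
        # 2./5. (re)compute the proximity between every pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]    # 4. merge the two closest clusters
        del clusters[j]
    return merges

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
for left, right, d in single_link_agglomerative([(1, 1), (2, 1), (8, 8), (9, 9), (5, 5)], manhattan):
    print(left, "+", right, "merged at distance", d)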
33. Example
• Perform an agglomerative clustering of five samples using
two features X and Y. Calculate Manhattan distance
between each pair of samples to measure their
similarity.
• Assignment: apply divisive clustering of five samples given
above. Calculate Manhattan distance between each pair of
samples to measure their similarity/dissimilarity.
Data item X Y
1 4 4
2 8 4
3 15 8
4 24 4
5 24 12
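One way to carry this out in Python (assuming SciPy and matplotlib are available; this is an added sketch, not the assignment's expected solution) is with scipy.cluster.hierarchy, using the "cityblock" (Manhattan) metric on the five samples above:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

samples = np.array([[4, 4], [8, 4], [15, 8], [24, 4], [24, 12]])   # items 1-5 (X, Y)

# Single-link agglomerative clustering with Manhattan (cityblock) distances.
Z = linkage(samples, method="single", metric="cityblock")
print(Z)   # each row: the two clusters merged and the distance at which they merged

dendrogram(Z, labels=[1, 2, 3, 4, 5])   # visualize the sequence of merges
plt.show()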
35. Exercise: Hierarchical clustering
• Using the centroid method, apply the agglomerative clustering algorithm to cluster the following 8 examples. Show the dendrograms.
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8),
A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
• Also try single-link, complete-link and average-link agglomerative clustering on the data given above.
36. Strengths of Hierarchical Clustering
• Do not have to assume any particular number
of clusters
– Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
• They may correspond to meaningful
taxonomies
– Example in biological sciences (e.g., animal
kingdom, phylogeny reconstruction, …)