Clustering
• Clustering Quality
• Partitioning clustering
• Hierarchical clustering
Clustering
• Given a set of points, with a notion
of distance between points, group
the points into some number of
clusters, so that members of a
cluster are in some sense as close
to each other as possible.
• While data points in the same
cluster are similar, those in
separate clusters are dissimilar to
one another.
[Figure: scatter plot of data points falling into three well-separated groups (clusters)]
• Clustering is a data mining (machine learning) technique that finds similarities between data objects according to the characteristics found in the data and groups similar data objects into one cluster.
Example: clustering
• The example below demonstrates the clustering of padlocks of the same kind. There are a total of 10 padlocks, which vary in color, size, shape, etc.
• How many possible clusters of padlocks can be identified?
– There are three different kinds of padlocks, which can be grouped into three different clusters.
– The padlocks of the same kind are clustered into a group as shown below:
[Figure: the 10 padlocks grouped into three clusters by kind]
Example: Clustering Application
• Text/Document Clustering:
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
– Approach:
• Identify content-bearing terms in each document.
• Form a similarity measure based on the frequencies
of different terms and use it to cluster documents.
– Application:
• Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
Quality: What Is Good Clustering?
• The quality of a clustering result depends
on both the similarity measure used by
the method and its implementation
– Key requirement of clustering: Need a
good measure of similarity between
instances.
• The quality of a clustering method is also
measured by its ability to discover some
or all of the hidden patterns in the given
datasets
• A good clustering method will produce
high quality clusters with
–high intra-class similarity
–low inter-class similarity
[Figure: a good clustering minimizes intra-cluster distances and maximizes inter-cluster distances]
Cluster Evaluation: Hard Problem
• The quality of a clustering is very hard to
evaluate because
– We do not know the correct clusters/classes
• Some methods are used:
– User inspection
• Study the centroids and the spreads of data items in each cluster
• For text documents, one can read some documents in
clusters to evaluate the quality of clustering
algorithms employed.
Cluster Evaluation: Ground Truth
• We use some labeled data (for classification)
– Assumption: Each class is a cluster.
• After clustering, a confusion matrix is constructed. From the matrix, we compute various measures: entropy, purity, precision, recall, and F-score.
– Let the classes in the data D be C = (c1, c2, …, ck).
The clustering method produces k clusters, which
divides D into k disjoint subsets, D1, D2, …, Dk.
Evaluation of Cluster Quality using Purity
• Quality measured by its ability to discover some or all of the
hidden patterns or latent classes in gold standard data
• Assesses a clustering with respect to ground truth …
requires labeled data
• Assume documents with C gold standard classes, while our clustering algorithm produces K clusters, ω1, ω2, …, ωK, where cluster ωi has ni members
• Simple measure: purity, the ratio between the size of the dominant class in cluster ωi and the size of cluster ωi (see the formula below)
• Others are entropy of classes in clusters (or mutual
information between classes and clusters)
$\mathrm{Purity}(\omega_i) = \frac{1}{n_i}\,\max_{j=1,\dots,C} n_{ij}$

where $n_{ij}$ is the number of members of class $c_j$ assigned to cluster $\omega_i$.

[Figure: three example clusters of red, blue, and green items: Cluster I (6 items), Cluster II (6 items), Cluster III (5 items)]
• Assume that we cluster three categories of data items (colored red, blue, and green) into three clusters, as shown in the figure above. Calculate purity to measure the quality of each cluster.
Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6 = 83%
Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6 = 67%
Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5 = 60%
Purity example
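These per-cluster computations can be checked with a minimal Python sketch (the function name purity and the hard-coded class counts are illustrative, taken from the example above):

```python
import numpy as np

def purity(class_counts):
    """Purity of one cluster: size of the dominant class over cluster size."""
    counts = np.asarray(class_counts)
    return counts.max() / counts.sum()

# Per-cluster counts of (red, blue, green) items from the example above.
for name, counts in [("I", [5, 1, 0]), ("II", [1, 4, 1]), ("III", [2, 0, 3])]:
    print(f"Cluster {name}: purity = {purity(counts):.0%}")
# Cluster I: purity = 83%
# Cluster II: purity = 67%
# Cluster III: purity = 60%
```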
Indirect Evaluation
• In some applications, clustering is not the primary task,
but used to help perform another task.
• We can use the performance on the primary task to
compare clustering methods.
• For instance, in an application, the primary task is to
provide recommendations on book purchasing to online
shoppers.
– If we can cluster books according to their features, we might
be able to provide better recommendations.
– We can evaluate different clustering algorithms based on how
well they help with the recommendation task.
– Here, we assume that the recommendation can be reliably
evaluated.
Similarity/Dissimilarity Measures
• Each clustering problem is based on some kind of “distance”
or “nearness measurement” between data points.
• Distances are normally used to measure the similarity or
dissimilarity between two data objects
• A popular distance measure is the Minkowski distance:

$dis(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q}$

where X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) are two n-dimensional data objects, n is the number of attributes of each object, and q = 1, 2, 3, …
• If q = 1, dis is the Manhattan distance:

$dis(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$
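As an illustration, here is a small Python sketch of the Minkowski family; the function names are my own, and q selects the particular distance:

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two equal-length numeric vectors."""
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1 / q)

def manhattan(x, y):  # q = 1
    return minkowski(x, y, q=1)

def euclidean(x, y):  # q = 2
    return minkowski(x, y, q=2)

print(manhattan((2, 10), (5, 8)))  # |2-5| + |10-8| = 5.0
print(euclidean((0, 0), (3, 4)))   # sqrt(9 + 16) = 5.0
```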
Similarity and Dissimilarity Between Objects
• If q = 2, dis is the Euclidean distance:

$dis(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

• Cosine Similarity
– If X and Y are two vector attributes of data objects, then the cosine similarity measure is given by:

$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$

where $\cdot$ denotes the vector dot product and $\lVert X \rVert$ is the length (Euclidean norm) of vector X.
Example: Similarity measure
• Ex.: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)½ = (42)½ = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)½ = (17)½ = 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) = 0.94
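The same computation can be scripted; this minimal snippet (function name my own) reproduces the result above:

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi * xi for xi in x)) *
                  math.sqrt(sum(yi * yi for yi in y)))

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```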
The need for a cluster representative
• Key problem: as you build clusters, how do you represent the location of each cluster, so you can tell which pair of clusters is closest?
• Assign each cluster a centroid (the point closest, on average, to all the cluster's points): the average of its points.
• Measure inter-cluster distances by the distances between centroids.

$C_m = \frac{1}{N} \sum_{i=1}^{N} p_i$

where cluster C contains N points $p_i$ and $C_m$ is its centroid.
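For instance, computing a centroid is a one-liner with numpy; the five points here are illustrative (they form cluster 2 in the worked k-means example later in this section):

```python
import numpy as np

# The centroid of a cluster is the componentwise mean of its points.
cluster = np.array([(8, 4), (5, 8), (7, 5), (6, 4), (4, 9)])
print(cluster.mean(axis=0))  # [6. 6.]
```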
Major Clustering Approaches
• Partitioning clustering approach:
– Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
– Typical methods:
• distance-based: K-means clustering
• model-based: expectation maximization (EM) clustering.
• Hierarchical clustering approach:
– Create a hierarchical decomposition of the set of data (or
objects) using some criterion
– Typical methods:
• agglomerative vs. divisive
• single link vs. complete link
Partitioning Algorithms: Basic Concept
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized
• Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means: Each cluster is represented by the center of the cluster
• K is the number of clusters to partition the dataset
• Means refers to the average location of members of a
particular cluster
– k-medoids or PAM (Partitioning Around Medoids): Each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
• Algorithm:
• Select k points as initial centroids (the initial centroids are selected randomly)
– Given k, the k-means algorithm then proceeds as follows:
• Repeat
– Assign each object to the cluster with the nearest centroid (seed point), partitioning the objects into k nonempty subsets
– Recompute the centroid of each of the k clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
• Until the centroids do not change
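The loop above can be written as a short, generic Python sketch; kmeans, dist, and max_iter are illustrative names, not from the slides, and empty clusters are handled simplistically:

```python
import random

def kmeans(points, k, dist, max_iter=100):
    """Plain k-means sketch: random initial centroids, then an
    assign/update loop that stops once the centroids are stable."""
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stop when centroids are stable
            break
        centroids = new_centroids
    return centroids, clusters
```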
The K-Means Clustering Method
• Example
[Figure: K-means with K = 2 on a 2-D scatter, shown over five panels: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; update the means again, until the assignments no longer change]
Example Problem
• Cluster the following eight points (with (x, y) representing locations) into three clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9).
– Assume that the initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a=(x1,
y1) and b=(x2, y2) is defined as:
dis(a, b) = |x2 – x1| + |y2 – y1| .
• Use the k-means algorithm to find the final centroids that group the given data into three clusters.
Iteration 1
Point        Mean 1 (2, 10)   Mean 2 (5, 8)   Mean 3 (1, 2)   Cluster
A1 (2, 10)         0                5               9             1
A2 (2, 5)          5                6               4             3
A3 (8, 4)         12                7               9             2
A4 (5, 8)          5                0              10             2
A5 (7, 5)         10                5               9             2
A6 (6, 4)         10                5               7             2
A7 (1, 2)          9               10               0             3
A8 (4, 9)          3                2              10             2
First we list all points in the first column of the table above. The initial cluster centers (centroids) are (2, 10), (5, 8) and (1, 2), chosen randomly.
Next, we calculate the distance from each point to each of the three centroids, using the distance function:
dis(point i, mean j) = |x2 – x1| + |y2 – y1|
Iteration 1
• Starting from point A1 calculate the distance to each of the three means, by
using the distance function:
dis (A1, mean1) = |2 – 2| + |10 – 10| = 0 + 0 = 0
dis(A1, mean2) = |5 – 2| + |8 – 10| = 3 + 2 = 5
dis(A1, mean3) = |1 – 2| + |2 – 10| = 1 + 8 = 9
– Fill these values in the table and decide in which cluster the point (2, 10) should be placed: the one where the point has the shortest distance to the mean, i.e., mean 1 (cluster 1), since the distance is 0.
• Next go to the second point A2 and calculate the distance:
dis(A2, mean1) = |2 – 2| + |10 – 5| = 0 + 5 = 5
dis(A2, mean2) = |5 – 2| + |8 – 5| = 3 + 3 = 6
dis(A2, mean3) = |1 – 2| + |2 – 5| = 1 + 3 = 4
– So we fill in these values in the table and assign the point (2, 5) to cluster 3, since mean 3 is the closest to A2.
• Analogously, we fill in the rest of the table and place each point in one of the clusters.
Iteration 1
• Next, we need to re-compute the new cluster centers (means). We do so by taking the mean of all points in each cluster.
• For Cluster 1, we only have one point A1(2, 10), which was the old
mean, so the cluster center remains the same.
• For Cluster 2, we have five points and need to take their average as the new centroid, i.e.,
( (8+5+7+6+4)/5, (4+8+5+4+9)/5 ) = (6, 6)
• For Cluster 3, we have two points. The new centroid is:
( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
• That was Iteration1 (epoch1). Next, we go to Iteration2 (epoch2),
Iteration3, and so on until the centroids do not change anymore.
– In Iteration 2, we repeat the process from Iteration 1, this time using the new means we computed.
Second epoch
• Using the new centroids, we recompute the cluster memberships.
• After the 2nd epoch the results are:
cluster 1: {A1,A8} with new centroid=(3,9.5);
cluster 2: {A3,A4,A5,A6} with new centroid=(6.5,5.25);
cluster 3: {A2,A7} with new centroid=(1.5,3.5)
Point        Mean 1 (2, 10)   Mean 2 (6, 6)   Mean 3 (1.5, 3.5)   Cluster
A1 (2, 10)         0                8                 7               1
A2 (2, 5)          5                5                 2               3
A3 (8, 4)         12                4                 7               2
A4 (5, 8)          5                3                 8               2
A5 (7, 5)         10                2                 7               2
A6 (6, 4)         10                2                 5               2
A7 (1, 2)          9                9                 2               3
A8 (4, 9)          3                5                 8               1
Third epoch
• Using the new centroids, we recompute the cluster memberships.
• After the 3rd epoch the results are:
cluster 1: {A1,A4,A8} with new centroid=(3.66,9);
cluster 2: {A3,A5,A6} with new centroid=(7,4.33);
cluster 3: {A2,A7} with new centroid=(1.5,3.5)
Point        Mean 1 (3, 9.5)   Mean 2 (6.5, 5.25)   Mean 3 (1.5, 3.5)   Cluster
A1 (2, 10)        1.5                 9.25                 7               1
A2 (2, 5)         5.5                 4.75                 2               3
A3 (8, 4)        10.5                 2.75                 7               2
A4 (5, 8)         3.5                 4.25                 8               1
A5 (7, 5)         8.5                 0.75                 7               2
A6 (6, 4)         8.5                 1.75                 5               2
A7 (1, 2)         9.5                 8.75                 2               3
A8 (4, 9)         1.5                 6.25                 8               1
Fourth epoch
• Using the new centroids, we recompute the cluster memberships.
• After the 4th epoch the memberships do not change, and the results remain:
cluster 1: {A1,A4,A8} with centroid=(3.66,9);
cluster 2: {A3,A5,A6} with centroid=(7,4.33);
cluster 3: {A2,A7} with centroid=(1.5,3.5)
Point        Mean 1 (3.66, 9)   Mean 2 (7, 4.33)   Mean 3 (1.5, 3.5)   Cluster
A1 (2, 10)        2.66               10.67                7               1
A2 (2, 5)         5.66                5.67                2               3
A3 (8, 4)         9.34                1.33                7               2
A4 (5, 8)         2.34                5.67                8               1
A5 (7, 5)         7.34                0.67                7               2
A6 (6, 4)         7.34                1.33                5               2
A7 (1, 2)         9.66                8.33                2               3
A8 (4, 9)         0.34                7.67                8               1
Final results
• Finally, in the 4th epoch there is no change in the cluster memberships or the centroids, so the algorithm stops.
• The final result: cluster 1 = {A1, A4, A8}, cluster 2 = {A3, A5, A6}, cluster 3 = {A2, A7}.
[Figure: the eight points plotted with their final cluster assignments]
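To double-check the hand computation, here is a short, hypothetical script that uses the same Manhattan distance and the initial centroids A1, A4, A7; it converges to the same result (it assumes no cluster ever becomes empty, which holds for this data):

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
means = [(2, 10), (5, 8), (1, 2)]   # initial centroids A1, A4, A7

while True:
    # Assign each point to the nearest mean under Manhattan distance.
    clusters = [[] for _ in means]
    for name, p in points.items():
        i = min(range(len(means)), key=lambda c: manhattan(p, means[c]))
        clusters[i].append(name)
    # Recompute the means; stop when they no longer change.
    new_means = [(sum(points[n][0] for n in c) / len(c),
                  sum(points[n][1] for n in c) / len(c)) for c in clusters]
    if new_means == means:
        break
    means = new_means

print(clusters)  # [['A1', 'A4', 'A8'], ['A3', 'A5', 'A6'], ['A2', 'A7']]
print(means)     # [(3.67, 9.0), (7.0, 4.33), (1.5, 3.5)] after rounding
```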
Comments on the K-Means Method
• Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
• Weakness
– Applicable only when a mean is defined; so what about categorical data? (One workaround: use hierarchical clustering)
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers, since an object with an extremely large value may substantially distort the distribution of the data
• K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster (see the sketch below).
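A minimal sketch of the medoid idea; the helper name medoid is mine, and full PAM, which also swaps medoids to reduce total cost, is omitted:

```python
def medoid(cluster, dist):
    """The medoid is the cluster member with the smallest total
    distance to all other members (used in place of the mean)."""
    return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
print(medoid([(2, 10), (5, 8), (4, 9)], manhattan))  # (4, 9)
```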
Hierarchical Clustering
• Compared with partitioning algorithms, in hierarchical clustering the data are not partitioned into a particular set of clusters in a single step.
–Instead, a series of partitions takes place,
which may run from a single cluster containing
all objects to n clusters each containing a
single object.
• Produces a set of nested clusters organized
as a hierarchical tree.
–Hierarchical clustering outputs a hierarchy, a
structure that is more informative than the
unstructured set of clusters returned by
partitioning clustering.
– Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits
[Figure: six points grouped into nested clusters, and the corresponding dendrogram with merge heights between 0 and 0.2]
Dendrogram: Shows How the Clusters are Merged
Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
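As a sketch of this cutting step, SciPy's hierarchical-clustering utilities can build a linkage and extract flat clusters; the five sample points are illustrative (they reappear in the agglomerative example later in this section):

```python
from scipy.cluster.hierarchy import linkage, fcluster

# Single-link hierarchy over five 2-D samples, Manhattan (cityblock) metric.
X = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]
Z = linkage(X, method="single", metric="cityblock")

# Cut the dendrogram so that two flat clusters remain.
print(fcluster(Z, t=2, criterion="maxclust"))  # e.g. [1 1 1 2 2]
```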
Example of hierarchical clustering
[Figure: example of a hierarchical clustering of sample data]
Two main types of hierarchical clustering
• Agglomerative: a bottom-up clustering technique
– Start with all sample units in n clusters of size 1.
– Then, at each step of the algorithm, the pair of clusters with the shortest distance is combined into a single cluster.
– The algorithm stops when all sample units are grouped into one cluster of size n.
• Divisive: a top-down clustering technique
– Start with all sample units in a single cluster of size n.
– Then, at each step of the algorithm, a cluster is partitioned into a pair of daughter clusters, selected to maximize the distance between the two daughters.
– The algorithm stops when the sample units are partitioned into n clusters of size 1.
[Figure: five points a, b, c, d, e: agglomerative clustering (steps 0 to 4) merges {a, b}, then {d, e}, then {c, d, e}, and finally {a, b, c, d, e}; divisive clustering performs the same sequence in reverse]
Agglomerative Clustering Algorithm
• The more popular hierarchical clustering technique
• The basic algorithm is straightforward:
1. Let each data point be a cluster
2. Compute the proximity matrix
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
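A plain-Python sketch of steps 1 to 6 using single-link proximity; the function and variable names are illustrative, and the data are the five samples from the example that follows:

```python
def agglomerative(points, dist, target=1):
    """Bottom-up single-link clustering: start with singleton clusters
    and repeatedly merge the closest pair until `target` clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target:
        # Single-link proximity: the smallest point-to-point distance
        # between two clusters.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: min(dist(p, q)
                               for p in clusters[ab[0]]
                               for q in clusters[ab[1]]))
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return clusters

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
samples = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]
print(agglomerative(samples, manhattan, target=2))
# [[(4, 4), (8, 4), (15, 8)], [(24, 4), (24, 12)]]
```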
Example
• Perform an agglomerative clustering of five samples using two features, X and Y. Calculate the Manhattan distance between each pair of samples to measure their similarity.
• Assignment: apply divisive clustering to the five samples given above. Calculate the Manhattan distance between each pair of samples to measure their similarity/dissimilarity.
Data item X Y
1 4 4
2 8 4
3 15 8
4 24 4
5 24 12
Proximity Matrix: First epoch
              1     2     3     4     5
1 = (4,4)     0     4    15    20    28
2 = (8,4)     4     0    11    16    24
3 = (15,8)   15    11     0    13    13
4 = (24,4)   20    16    13     0     8
5 = (24,12)  28    24    13     8     0
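The entries above can be reproduced with a few lines of Python (the names are illustrative):

```python
points = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])

# Print the full proximity matrix row by row.
for i, p in points.items():
    print(i, [manhattan(p, q) for q in points.values()])
# 1 [0, 4, 15, 20, 28]
# 2 [4, 0, 11, 16, 24]
# 3 [15, 11, 0, 13, 13]
# 4 [20, 16, 13, 0, 8]
# 5 [28, 24, 13, 8, 0]
```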
Exercise: Hierarchical clustering
• Using the centroid method, apply the agglomerative clustering algorithm to cluster the following 8 examples. Show the dendrogram.
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8),
A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
• Also try single-link, complete-link, and average-link agglomerative clustering on the data given above.
Strengths of Hierarchical Clustering
• Do not have to assume any particular number
of clusters
– Any desired number of clusters can be obtained
by ‘cutting’ the dendrogram at the proper level
• They may correspond to meaningful
taxonomies
– Example in biological sciences (e.g., animal
kingdom, phylogeny reconstruction, …)