This document provides an overview of cluster analysis and clustering algorithms. It defines cluster analysis as grouping objects such that objects within a group are similar to each other and different from objects in other groups. The document discusses different types of clusters, including well-separated, prototype-based, contiguity-based, and density-based clusters. It also covers hierarchical and partitional clustering. Finally, it describes the widely used k-means clustering algorithm and its objective function.
Ensemble techniques construct multiple base classifiers from training data and combine their predictions, often by taking a majority vote. This document discusses ensemble methods like bagging and boosting. Bagging generates training data for each base classifier by sampling with replacement from the original training set, while boosting iteratively adjusts the weights of misclassified examples to focus learning. Both aim to reduce prediction variance and improve accuracy over single classifiers.
Hierarchical clustering builds a nested hierarchy of clusters by either merging or splitting clusters at each step. Agglomerative hierarchical clustering starts with each point as a separate cluster and successively merges the closest clusters based on a defined proximity measure between clusters. This results in a dendrogram showing the nested clustering structure. The basic algorithm computes a proximity matrix, then repeatedly merges the closest pair of clusters and updates the matrix until all points are in one cluster.
Clustering is an unsupervised learning technique used to group unlabeled data points into clusters based on similarity. It is widely used in data mining applications. The k-means algorithm is one of the simplest clustering algorithms that partitions data into k predefined clusters, where each data point belongs to the cluster with the nearest mean. It works by assigning data points to their closest cluster centroid and recalculating the centroids until clusters stabilize. The k-medoids algorithm is similar but uses actual data points as centroids instead of means, making it more robust to outliers.
Mat189: Cluster Analysis with NBA Sports Data (Kathlene Ngo)
The document discusses using cluster analysis techniques like K-Means and spectral clustering on NBA player statistics data. It begins by introducing machine learning concepts like supervised vs. unsupervised learning and definitions of clustering criteria. It then describes preprocessing the 27-dimensional player data into 2 dimensions using linear discriminant analysis (LDA) and principal component analysis (PCA) for visualization. K-Means clustering is applied to the LDA-reduced data, identifying distinct player groups. Spectral clustering will also be applied using PCA for comparison. The goal is to categorize players and determine the best athletes without prior basketball knowledge.
This document provides an overview of supervised and unsupervised learning, with a focus on clustering as an unsupervised learning technique. It describes the basic concepts of clustering, including how clustering groups similar data points together without labeled categories. It then covers two main clustering algorithms - k-means, a partitional clustering method, and hierarchical clustering. It discusses aspects like cluster representation, distance functions, strengths and weaknesses of different approaches. The document aims to introduce clustering and compare it with supervised learning.
This document provides an overview of machine learning techniques that can be applied in finance, including exploratory data analysis, clustering, classification, and regression methods. It discusses statistical learning approaches like data mining and modeling. For clustering, it describes techniques like k-means clustering, hierarchical clustering, Gaussian mixture models, and self-organizing maps. For classification, it mentions discriminant analysis, decision trees, neural networks, and support vector machines. It also provides summaries of regression, ensemble methods, and working with big data and distributed learning.
Pattern recognition: k-means clustering (binoy, 108kaushik)
This document discusses clustering and the k-means clustering algorithm. It defines clustering as grouping a set of data objects into clusters so that objects within the same cluster are similar to each other but dissimilar to objects in other clusters. The k-means algorithm is described as an iterative process that assigns each object to one of k predefined clusters based on the object's distance from the cluster's centroid, then recalculates the centroid, repeating until cluster assignments no longer change. A worked example demonstrates how k-means partitions 7 objects into 2 clusters over 3 iterations. The k-means algorithm is noted to be efficient but requires specifying k and can be impacted by outliers, noise, and non-convex cluster shapes.
This document provides an overview of unsupervised learning and clustering algorithms. It discusses the motivation for clustering as grouping similar data points without labels. It introduces common clustering algorithms like K-means, hierarchical clustering, and fuzzy C-means. It covers clustering criteria such as similarity functions, stopping criteria, and cluster quality. It also discusses techniques like data normalization and challenges in evaluating clusters without ground truths. The document aims to explain the concepts and applications of unsupervised learning for clustering unlabeled data.
The document provides an overview of clustering methods and algorithms. It defines clustering as the process of grouping objects that are similar to each other and dissimilar to objects in other groups. It discusses existing clustering methods like K-means, hierarchical clustering, and density-based clustering. For each method, it outlines the basic steps and provides an example application of K-means clustering to demonstrate how the algorithm works. The document also discusses evaluating clustering results and different measures used to assess cluster validity.
This document provides an overview of cluster analysis techniques. It defines cluster analysis as classifying cases into homogeneous groups based on a set of variables. The document then discusses how cluster analysis can be used in marketing research for market segmentation, understanding consumer behaviors, and identifying new product opportunities. It outlines the typical steps to conduct a cluster analysis, including selecting a distance measure and clustering algorithm, determining the number of clusters, and validating the analysis. Specific clustering methods like hierarchical, k-means, and deciding the number of clusters using the elbow rule are explained. The document concludes with an example of conducting a cluster analysis in SPSS.
The document discusses different clustering algorithms. It introduces hierarchical clustering and k-means clustering. For hierarchical clustering, it explains how to represent clusters and determine the distance between clusters. For k-means clustering, it describes the basic algorithm and approaches for initializing cluster centroids and selecting the optimal number of clusters k. It also introduces the BFR algorithm for clustering large datasets.
The document discusses K-means clustering, which partitions data into K clusters by minimizing total variance. It explains the K-means algorithm involves randomly selecting K initial centroids, assigning data points to the closest centroid, recalculating centroids as means of points in each cluster, and repeating until centroids do not change. The algorithm aims to minimize inertia by iteratively optimizing cluster centroids.
The document discusses K-means clustering, which partitions data into K clusters by minimizing total variance. It explains the K-means algorithm involves randomly selecting K initial centroids, assigning data points to the closest centroid, recalculating centroids as means of points in each cluster, and repeating until convergence. The algorithm aims to minimize overall variance by iterating through centroid calculations and reassignments.
It appears that you've provided a set of instructions or an input format for a machine learning task, specifically clustering with K-Means. Let's break down what each component means:
(number of clusters):
This is a placeholder for an actual numerical value that represents the desired number of clusters into which you want to divide your training data. In K-Means clustering, you need to specify in advance how many clusters (K) you want the algorithm to find in your data.
Training set:
The "training set" is your dataset, which contains the data points that you want to cluster. Each data point represents an observation or sample in your dataset.
(drop convention):
It's not clear from this input what "(drop convention)" refers to. It could be related to a specific data preprocessing or handling instruction, but without additional context or information, it's challenging to provide a precise explanation for this part.
In summary, you are expected to provide the number of clusters (K) that you want to discover in your training data, and the training data itself contains the observations or samples that will be used for clustering. The "(drop convention)" part may require further clarification or context to provide a meaningful explanation.
Clustering is a fundamental concept in the field of machine learning and data analysis that involves grouping similar data points together based on certain criteria or patterns. It is a technique used to discover inherent structures, relationships, or similarities within a dataset when there are no predefined labels or categories. Clustering is widely employed in various domains, including marketing, biology, image analysis, recommendation systems, and more. In this comprehensive explanation of clustering, we will explore its principles, methods, applications, and key considerations.
Table of Contents
Introduction to Clustering
Key Concepts and Terminology
Types of Clustering
3.1. Partitioning Clustering
3.2. Hierarchical Clustering
3.3. Density-Based Clustering
3.4. Model-Based Clustering
Distance Metrics and Similarity Measures
Common Clustering Algorithms
5.1. K-Means Clustering
5.2. Hierarchical Agglomerative Clustering
5.3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
5.4. Gaussian Mixture Models (GMM)
Evaluation of Clusters
Applications of Clustering
7.1. Customer Segmentation
7.2. Image Segmentation
7.3. Anomaly Detection
7.4. Document Clustering
7.5. Recommender Systems
7.6. Genomic Clustering
Challenges and Considerations
8.1. Determining the Number of Clusters (K)
8.2. Handling High-Dimensional Data
8.3. Initial Centroid Selection
8.4. Scaling and Normalization
8.5. Interpretation of Results
Best Practices in Clustering
Future Trends and Advances
Conclusion
1. Introduction to Clustering
Clustering, in the context of data analysis and machine learning, refers to the process of grouping a set of data points into subsets, or clusters, so that points within the same cluster are more similar to one another than to points in other clusters.
This document provides an overview of various machine learning algorithms and concepts, including supervised learning techniques like linear regression, logistic regression, decision trees, random forests, and support vector machines. It also discusses unsupervised learning methods like principal component analysis and kernel-based PCA. Key aspects of linear regression, logistic regression, and random forests are summarized, such as cost functions, gradient descent, sigmoid functions, and bagging. Kernel methods are also introduced, explaining how the kernel trick can allow solving non-linear problems by mapping data to a higher-dimensional feature space.
This document provides information about clustering and cluster analysis. It begins by defining clustering as the process of grouping objects into classes of similar objects. It then discusses what a cluster is and different types of clustering techniques, including partitioning methods like k-means clustering. K-means clustering is explained as an algorithm that assigns objects to clusters based on minimizing distance between objects and cluster centers, then updating the cluster centers. Examples are provided to demonstrate how k-means clustering works on a sample dataset.
2. Supervised learning vs. unsupervised learning
Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute. These patterns are then utilized to predict the values of the target attribute in future data instances.
Unsupervised learning: the data have no target attribute. We want to explore the data to find some intrinsic structure in them.
3. What is Cluster Analysis?
Finding groups of objects in data such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Intra-cluster distances are minimized; inter-cluster distances are maximized.
4. Applications of Cluster Analysis
Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
Summarization: reduce the size of large data sets.
Example of discovered clusters and their industry groups:
Cluster 1: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN -> Technology1-DOWN
Cluster 2: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN -> Technology2-DOWN
Cluster 3: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN -> Financial-DOWN
Cluster 4: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP -> Oil-UP
[Figure: clustering precipitation in Australia, used to summarize a large data set.]
5. Types of Clusterings
A clustering is a set of clusters. An important distinction is between hierarchical and partitional sets of clusters.
Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
9. K-means clustering
K-means is a partitional clustering algorithm. Let the set of data points (or instances) D be {x1, x2, ..., xn}, where xi = (xi1, xi2, ..., xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data. The k-means algorithm partitions the given data into k clusters. Each cluster has a cluster center, called the centroid. k is specified by the user.
11. Stopping/convergence criterion
1. No (or minimum) re-assignments of data points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE),
SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2
where C_j is the jth cluster, m_j is the centroid of cluster C_j (the mean vector of all the data points in C_j), and dist(x, m_j) is the distance between data point x and centroid m_j.
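To make the objective concrete, here is a minimal sketch of the SSE computation in Python, assuming NumPy is available; the function and variable names (sse, X, labels, centroids) are illustrative, not part of the original slides:

    import numpy as np

    def sse(X, labels, centroids):
        # Sum of squared Euclidean distances from each point to its assigned centroid.
        # X: (n, d) data array, labels: (n,) integer cluster assignments,
        # centroids: (k, d) array of cluster centers.
        total = 0.0
        for j, m_j in enumerate(centroids):
            members = X[labels == j]          # points currently assigned to cluster j
            if len(members) > 0:
                total += np.sum((members - m_j) ** 2)
        return total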
12. K-means Clustering: Details
Initial centroids are often chosen randomly, so the clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above, and most of the convergence happens in the first few iterations; often the stopping condition is changed to 'until relatively few points change clusters'. Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
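The basic loop described above (random initialization, assignment to the closest centroid, centroid recomputation, stop when centroids no longer move) can be sketched in a few lines of NumPy; this is an illustrative sketch, not the exact code behind the slides:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize centroids by picking k distinct data points at random.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Assignment step: each point goes to its closest centroid (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid becomes the mean of its assigned points.
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):   # stop when centroids no longer move
                break
            centroids = new_centroids
        return labels, centroids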
13. Two different K-means Clusterings
[Figure: the same set of original points clustered two ways; one run produces the optimal clustering, the other a sub-optimal clustering.]
14. Importance of Choosing Initial Centroids
[Figure: snapshots of K-means at iterations 1 through 6 for one choice of initial centroids.]
15. Importance of Choosing Initial Centroids
[Figure: the same run shown iteration by iteration, iterations 1 through 6.]
16. Evaluating K-means Clusters
The most common measure is the Sum of Squared Error (SSE). For each point, the error is the distance to the nearest cluster centroid:
SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2
Given two clusterings, we can choose the one with the smallest error. One easy way to reduce SSE is to increase K, the number of clusters, but a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
17. Importance of Choosing Initial Centroids
[Figure: snapshots of K-means at iterations 1 through 5 for a different choice of initial centroids.]
18. Importance of Choosing Initial Centroids
[Figure: the same run shown iteration by iteration, iterations 1 through 5.]
19. Problems with Selecting Initial Points
If there are K 'real' clusters, the chance of selecting one centroid from each cluster is small, and it is relatively small when K is large. If the clusters are all the same size, n, then the probability is (number of ways to select one centroid from each cluster) / (number of ways to select K centroids) = K! n^K / (Kn)^K = K!/K^K. For example, if K = 10, the probability is 10!/10^10 ≈ 0.00036. Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't. Consider an example of five pairs of clusters.
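As a quick numeric check of the quoted figure, a small Python snippet (names are illustrative) evaluates K!/K^K:

    import math

    def prob_one_centroid_per_cluster(k):
        # K! n^K favourable selections out of (K n)^K total, which simplifies to K!/K^K.
        return math.factorial(k) / k ** k

    print(prob_one_centroid_per_cluster(10))   # ~0.00036288, matching the slide's 0.00036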
20. 10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters.
[Figure: snapshots of K-means at iterations 1 through 4.]
21. 10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters.
[Figure: snapshots of K-means at iterations 1 through 4.]
22. 10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: snapshots of K-means at iterations 1 through 4.]
23. 10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: snapshots of K-means at iterations 1 through 4.]
24. Solutions to the Initial Centroids Problem
Multiple runs: helps, but probability is not on your side.
Sample the data and use hierarchical clustering to determine initial centroids.
Select more than k initial centroids and then select among these initial centroids, e.g., the most widely separated ones (a sketch of this idea follows below).
Postprocessing.
Bisecting K-means.
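The 'select the most widely separated' idea can be sketched as a farthest-first initialization; this is an illustrative NumPy sketch under that interpretation, not the exact procedure from the slides:

    import numpy as np

    def farthest_first_centroids(X, k, seed=0):
        # Pick the first centroid at random, then repeatedly pick the point
        # farthest from all centroids chosen so far (most widely separated).
        rng = np.random.default_rng(seed)
        centroids = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            dists = np.min(
                np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2),
                axis=1,
            )
            centroids.append(X[int(dists.argmax())])
        return np.array(centroids)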
25. Pre-processing and Post-processing
Pre-processing: normalize the data; eliminate outliers.
Post-processing: eliminate small clusters that may represent outliers; split 'loose' clusters, i.e., clusters with relatively high SSE; merge clusters that are 'close' and that have relatively low SSE.
These steps can also be used during the clustering process (e.g., ISODATA).
26. Limitations of K-means
K-means has problems when clusters have differing sizes, differing densities, or non-globular shapes, and when the data contains outliers.
33. Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
[Figure: a dendrogram over six points and the corresponding nested clusters.]
34. Strengths of Hierarchical Clustering
You do not have to assume any particular number of clusters; any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level. The resulting clusters may correspond to meaningful taxonomies, for example in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction).
35. Hierarchical Clustering
Two main types of hierarchical clustering:
Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
36. Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat:
4. Merge the two closest clusters.
5. Update the proximity matrix.
6. Until only a single cluster remains.
The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms. A minimal code sketch of the procedure follows this list.
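Assuming SciPy is available, the whole agglomerative procedure (proximity matrix, repeated merging, dendrogram cutting) can be run in a few lines; the toy data and the choice of single link here are illustrative only:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(0).random((20, 2))      # toy 2-D data; replace with real data

    # Build the full merge history using single-link (MIN) distances.
    Z = linkage(X, method='single', metric='euclidean')

    # Cut the resulting dendrogram to obtain, e.g., 3 flat clusters.
    labels = fcluster(Z, t=3, criterion='maxclust')
    print(labels)

    # scipy.cluster.hierarchy.dendrogram(Z) would draw the nested merge structure
    # (requires matplotlib).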
38. Intermediate Situation
After some merging steps, we have some clusters (C1 through C5) and their proximity matrix.
[Figure: five intermediate clusters C1 to C5 and the corresponding proximity matrix.]
39. Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
40. After Merging
The question is: how do we update the proximity matrix for the merged cluster C2 U C5?
[Figure: the proximity matrix after merging C2 and C5, with the entries for the merged cluster still to be determined.]
41-45. How to Define Inter-Cluster Similarity
[Figure: two groups of points p1 to p5 and their proximity matrix.]
Options for defining the similarity between two clusters: MIN, MAX, Group Average, Distance Between Centroids, and other methods driven by an objective function (Ward's Method uses squared error). Slides 42 to 45 highlight each of these options in turn on the same figure.
46. Cluster Similarity: MIN or Single Link
The similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points, i.e., by one link in the proximity graph.
       I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
[Figure: the single-link dendrogram over points 1 to 5.]
50. Cluster Similarity: MAX or Complete Linkage
The similarity of two clusters is based on the two least similar (most distant) points in the different clusters; it is determined by all pairs of points in the two clusters (same similarity matrix as on slide 46).
[Figure: the complete-link dendrogram over points 1 to 5.]
52. Limitations of MAX
[Figure: original points and the two clusters found by complete link (MAX).]
• Tends to break large clusters.
• Biased towards globular (spherical) clusters.
53. Cluster Similarity: Group Average
The proximity of two clusters is the average of the pairwise proximities between points in the two clusters:
proximity(Cluster_i, Cluster_j) = ( \sum_{p_i \in Cluster_i, \, p_j \in Cluster_j} proximity(p_i, p_j) ) / ( |Cluster_i| \cdot |Cluster_j| )
Average connectivity is needed for scalability, since total proximity favors large clusters.
(Same similarity matrix as on slide 46.)
[Figure: the group-average dendrogram over points 1 to 5.]
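A minimal NumPy sketch of the group-average proximity between two clusters, using Euclidean distance as the pairwise proximity (an assumption; the slides leave the proximity function generic):

    import numpy as np

    def group_average_proximity(A, B):
        # Average of all |A|*|B| pairwise proximities between the two clusters,
        # here taken to be Euclidean distances between points.
        pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
        return pairwise.sum() / (len(A) * len(B))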
55. Hierarchical Clustering: Group Average
Group average is a compromise between single and complete link.
Strengths: less susceptible to noise and outliers.
Limitations: biased towards globular (spherical) clusters.
56. Cluster Similarity: Ward's Method
The similarity of two clusters is based on the increase in squared error when the two clusters are merged. It is similar to group average if the distance between points is the squared distance. Less susceptible to noise and outliers, but biased towards globular clusters. It is the hierarchical analogue of K-means and can be used to initialize K-means.
57. Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall. For cluster analysis, the analogous question is how to evaluate the 'goodness' of the resulting clusters. But 'clusters are in the eye of the beholder'! Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, to compare two sets of clusters, and to compare two clusters.
58. Clusters found in Random Data
[Figure: the same set of random points clustered by K-means, DBSCAN, and complete link; each algorithm 'finds' clusters even though the data are random.]
59. Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (using only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
60. Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into three types:
External index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy).
Internal index: measures the goodness of a clustering structure without respect to external information (e.g., Sum of Squared Error, SSE).
Relative index: compares two different clusterings or clusters; often an external or internal index is used for this purpose, e.g., SSE or entropy.
These are sometimes referred to as criteria instead of indices; however, sometimes 'criterion' denotes the general strategy and 'index' the numerical measure that implements it.
61. Measuring Cluster Validity Via Correlation
Two matrices are used: the proximity matrix and the 'incidence' matrix. The incidence matrix has one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster and 0 if the pair belongs to different clusters. Compute the correlation between the two matrices; since the matrices are symmetric, only the correlation between the n(n-1)/2 distinct entries needs to be calculated. A high correlation (in magnitude) indicates that points belonging to the same cluster are close to each other. This is not a good measure for some density-based or contiguity-based clusters. A sketch of the computation follows.
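A minimal NumPy sketch of this correlation check, using Euclidean distance as the proximity (so, as in the examples that follow, a good clustering shows a strongly negative correlation); names are illustrative:

    import numpy as np

    def incidence_proximity_correlation(X, labels):
        labels = np.asarray(labels)
        n = len(X)
        # Proximity matrix: pairwise Euclidean distances.
        prox = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        # Incidence matrix: 1 if the pair is in the same cluster, 0 otherwise.
        inc = (labels[:, None] == labels[None, :]).astype(float)
        # Both matrices are symmetric, so correlate only the n(n-1)/2 upper-triangle entries.
        iu = np.triu_indices(n, k=1)
        return np.corrcoef(prox[iu], inc[iu])[0, 1]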
62. Measuring Cluster Validity Via Correlation
Correlation of the incidence and proximity matrices for the K-means clusterings of two data sets:
[Figure: two data sets and their K-means clusterings; Corr = -0.9235 for one and Corr = -0.5810 for the other.]
63. Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect it visually.
[Figure: a clustered data set and its similarity matrix ordered by cluster labels.]
64-66. Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figures: similarity matrices, ordered by cluster labels, for clusterings of random data found by DBSCAN, K-means, and complete link.]
68. Internal Measures: Cohesion and Separation
Cluster cohesion measures how closely related the objects in a cluster are; cluster separation measures how distinct or well-separated a cluster is from the other clusters. Cohesion is measured by the within-cluster sum of squares (SSE):
WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2
Separation is measured by the between-cluster sum of squares:
BSS = \sum_{i} |C_i| (m - m_i)^2
where |C_i| is the size of cluster i, m_i is the centroid of cluster i, and m is the overall mean of the data.
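A minimal NumPy sketch of both quantities; taking m as the overall data mean is an assumption consistent with the usual WSS/BSS decomposition:

    import numpy as np

    def cohesion_and_separation(X, labels):
        labels = np.asarray(labels)
        m = X.mean(axis=0)                          # overall mean of the data
        wss, bss = 0.0, 0.0
        for j in np.unique(labels):
            C = X[labels == j]
            m_j = C.mean(axis=0)                    # centroid of cluster j
            wss += np.sum((C - m_j) ** 2)           # within-cluster sum of squares (cohesion)
            bss += len(C) * np.sum((m - m_j) ** 2)  # between-cluster sum of squares (separation)
        return wss, bss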
70. Internal Measures: SSE
Clusters in more complicated figures aren't well separated. An internal index measures the goodness of a clustering structure without respect to external information. SSE is good for comparing two clusterings or two clusters (average SSE), and it can also be used to estimate the number of clusters.
[Figure: a data set with ten clusters and the corresponding SSE-versus-K curve.]
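One common way to use SSE to estimate the number of clusters is to sweep K and look for the knee of the curve; a sketch assuming scikit-learn is available (the toy data is a placeholder for the real data set):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).random((300, 2))   # placeholder data

    # SSE (inertia) for a range of K; look for the knee where the curve flattens.
    for k in range(2, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, km.inertia_)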
71. Internal Measures: SSE
SSE curve for a more complicated data set.
[Figure: a data set with seven groups and the SSE of clusters found using K-means.]
72. Framework for Cluster Validity
We need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor? Statistics provide a framework for cluster validity: the more 'atypical' a clustering result is, the more likely it is to represent valid structure in the data. We can compare the values of an index obtained from random data or random clusterings to those of an actual clustering result; if the value of the index is unlikely under randomness, then the cluster results are valid. These approaches are more complicated and harder to understand. For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is still the question of whether the difference between two index values is significant.
73. Statistical Framework for SSE
Example: compare an SSE of 0.005 against three clusters found in random data. The histogram shows the SSE of three clusters in 500 sets of random data points of size 100, distributed over the range 0.2 to 0.8 for the x and y values.
[Figure: the clustered data set and the histogram of SSE values obtained from random data, concentrated roughly between 0.016 and 0.034, i.e., well above 0.005.]
A sketch of this Monte Carlo comparison follows.
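A sketch of the statistical comparison, assuming scikit-learn is available; the data sizes and range follow the slide's set-up, and everything else is illustrative:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    baseline_sse = []
    for _ in range(500):
        # 100 points uniform over [0.2, 0.8] in x and y, as in the slide's set-up.
        R = rng.uniform(0.2, 0.8, size=(100, 2))
        baseline_sse.append(KMeans(n_clusters=3, n_init=10, random_state=0).fit(R).inertia_)

    # An observed SSE far below this distribution (e.g. 0.005) is unlikely under randomness,
    # suggesting the clustering reflects real structure.
    print(np.percentile(baseline_sse, [1, 50, 99]))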
74. Statistical Framework for Correlation
Correlation of the incidence and proximity matrices for the K-means clusterings of the two data sets shown earlier: Corr = -0.9235 and Corr = -0.5810.
[Figure: the two data sets and their K-means clusterings.]