Big Data Clustering
Algorithms & Strategies
FARZAD NOZARIAN
AMIRKABIR UNIVERSITY OF TECHNOLOGY – MARCH 2015
Preprocessing
Goals:
1. To assure the quality of the data by reducing the noise and irrelevant information it may contain
2. To reduce the size of the dataset, so that the computational cost of the discovery task is also reduced.
Reducing the size of dataset:
◦ Number of instances
◦ addressed by sampling (the sampled dataset should hold the same information as the whole dataset)
◦ Dimensionality reduction
◦ Feature selection
◦ Feature extraction
Clustering algorithms
Hierarchical methods
◦ Divisive
◦ Agglomerative
These methods are based on a similarity matrix computed for each pair of examples.
Some algorithms treat this matrix as a graph;
other algorithms reduce the matrix at each iteration by merging two groups.
The main drawback of these algorithms is their computational cost (O(n²)),
and they scan the dataset many times!
Prototype/model based clustering
Prototype- and model-based clustering assume that clusters fit a specific shape.
Goal: Discover how different numbers of these shapes can explain the spatial distribution of
the data.
The most used prototype-based clustering algorithm is K-Means.
◦ K-Means assumes that clusters are defined by their center (the prototype) and have spherical shapes.
◦ To fit this shape, K-Means minimizes the distances from the examples to these centers.
◦ The optimization is solved iteratively using a gradient-descent-style procedure (a minimal sketch follows).
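Below is a minimal NumPy sketch of this center-based minimization (Lloyd's algorithm). The dataset X, number of clusters k, and iteration cap are illustrative assumptions, not part of the deck:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignments and center updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial prototypes
    for _ in range(iters):
        # Assignment step: each example goes to its nearest center (spherical shape).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned examples.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):  # centers stopped moving: converged
            break
        centers = new
    return centers, labels
```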
Density based clustering
DBSCAN
OPTICS is an extension of the original DBSCAN that uses heuristics to find good values for
DBSCAN parameters.
The main drawback of these methods comes from the cost of finding the nearest neighbors of an example.
Indexing is a solution, but as the number of dimensions grows its performance may degrade to that of a linear search.
Grid based clustering
The basic idea: divide the space of instances into hyper-rectangular cells by discretizing the
attributes of the dataset.
This captures clusters of arbitrary shapes.
Each cell is summarized by the sufficient statistics of the examples it contains.
These methods usually scale well, but that depends on the granularity of the discretization of the example space.
The strategies used to prune the search space largely reduce the computational cost.
Scalability strategies
One-pass strategies
Summarization strategies
Sampling/batch strategies
Approximation strategies
Divide and conquer strategies
One-pass strategies
Reduce the number of scans of the data to only one.
This constraint is usually forced by a dataset that cannot fit in memory and has to be read from disk.
The one-pass stage is used to preprocess the dataset.
This results in two-stage algorithms: a first stage that applies the one-pass strategy, and a
second stage that processes in memory a summary of the data produced by the first.
Summarization Strategies
Purpose: obtain a coarse approximation of the data without losing the information that
represents the different densities of examples.
Sufficient statistics, such as the mean and variance, are kept for each group (see the sketch below).
The summarization can be performed at a single level, as a preprocessing step whose output is fed to a clustering algorithm.
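A sketch of one such summary, assuming the usual (count, linear sum, sum of squares) triple; the class name and interface are hypothetical:

```python
import numpy as np

class ClusterSummary:
    """Sufficient statistics for a group of examples: the mean and variance
    are recoverable, and two summaries merge by simple addition."""
    def __init__(self, dim):
        self.n = 0
        self.ls = np.zeros(dim)  # linear sum of the examples
        self.ss = np.zeros(dim)  # per-dimension sum of squares

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    @property
    def mean(self):
        return self.ls / self.n

    @property
    def variance(self):
        return self.ss / self.n - self.mean ** 2
```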
Sampling/batch strategies
Purpose: allow a part of the dataset at a time to be processed in main memory.
When more than one sample of the data is used, the algorithm should be able to process both raw data
and cluster summaries.
These methods scale with the size of the sample, not with the size of the whole dataset.
The use of batches assumes that the data can be processed sequentially and that, after applying
a clustering algorithm to a batch, the result can be merged with the results from previous
batches.
Data streams!
Approximation strategies
These strategies assume that some computations can be saved or approximated with reduced
or no impact on the final result.
They are algorithm dependent.
The most costly part of clustering algorithms corresponds to the distance computations among
instances, or among instances and prototypes.
E.g., some of these algorithms are iterative, and the partition an example is assigned to
stops changing after a few iterations. If this can be determined at an early stage, all
these distance computations can be avoided in successive iterations (see the sketch below).
This strategy is usually combined with a summarization strategy, where a group of examples is
reduced to a single point that is used to decide whether the decision can be made using only that point,
or the distances to all the examples have to be computed.
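A hypothetical illustration of the early-decision idea in a k-means-style loop: an example whose assignment has not changed for a few iterations is frozen, and its distance computations are skipped afterwards. The `patience` rule and all names are assumptions for illustration:

```python
import numpy as np

def kmeans_skip_stable(X, centers, iters=20, patience=2):
    """Approximation sketch: freeze examples with stable assignments."""
    labels = np.full(len(X), -1)
    stable = np.zeros(len(X), dtype=int)  # consecutive rounds without a change
    for _ in range(iters):
        active = stable < patience        # only these examples are re-examined
        idx = np.flatnonzero(active)
        d = np.linalg.norm(X[idx, None, :] - centers[None, :, :], axis=2)
        new = d.argmin(axis=1)
        stable[idx] = np.where(new == labels[idx], stable[idx] + 1, 0)
        labels[idx] = new
        for j in range(len(centers)):     # centers are still updated from all points
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```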
Divide and conquer strategies
The data can be divided into multiple independent datasets, and the clustering results can then be
merged into a final model.
Hierarchical Algorithms
PINK: A Scalable Algorithm for Single-Linkage Hierarchical Clustering on
Distributed-Memory Architectures (2013, Northwestern)
A scalable parallel algorithm for single-linkage hierarchical clustering based on decomposing a
problem instance into two different types of subproblems.
As PINK does not explicitly store a distance matrix, it can be applied to much larger problem
sizes.
Algorithm:
◦ Divide a large hierarchical clustering problem instance into a set of smaller sub-problems
◦ Calculate the hierarchical clustering dendrogram for each of these sub-problems
◦ Reconstruct the solution for the original dataset by combining the solutions to the sub-problems.
Leader-single-link (l-SL): A distance based clustering
method for arbitrary shaped clusters in large datasets (2011)
It divides the clustering process into two steps:
◦ A one-pass clustering algorithm, resulting in a set of cluster summaries that reduce the size of the
dataset.
◦ This new dataset fits in memory and can be processed using a single-link hierarchical clustering
algorithm.
Leaders clustering method: a single-data-scan, distance-based partitional clustering method.
For a given threshold distance τ, it produces a set of leaders L incrementally. For each pattern
𝑥, if there is a leader 𝑙 ∈ L such that ‖𝑥 − 𝑙‖ ≤ 𝜏, then 𝑥 is assigned to the cluster represented by 𝑙.
If there is no such leader, then 𝑥 becomes a new leader (see the sketch below).
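A minimal one-pass sketch of the leaders method as described above; the first-match rule and the follower bookkeeping are the only assumptions:

```python
import numpy as np

def leaders(X, tau):
    """One-pass leader clustering: each example joins the first leader
    within distance tau, otherwise it becomes a new leader."""
    L = []          # leader vectors
    followers = []  # indices of the examples each leader represents
    for i, x in enumerate(X):
        for j, l in enumerate(L):
            if np.linalg.norm(x - l) <= tau:
                followers[j].append(i)
                break
        else:       # no leader within tau: x starts a new cluster
            L.append(x)
            followers.append([i])
    return np.array(L), followers
```

The leader set is then small enough to be clustered in memory with single-link.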
Strategy tags: One-pass, Summarization
Leader-single-link (l-SL) (cont.)
K-means is also a leader-style algorithm, but it is applicable to numerical datasets only and scans
the dataset more than once before convergence.
After producing the leaders, the leader set is further clustered using the SL method with cut-off
distance ℎ, which results in a clustering of the leaders.
Finally, each leader is replaced by its followers to produce the final clustering.
Density Based Algorithms
PDBSCAN
1. Divide the input into several partitions, and distribute these partitions to the available
computers.
2. Cluster the partitions concurrently using DBSCAN.
3. Combine or merge the clusterings of the partitions into a clustering of the whole database.
4. In a distributed environment we should care about data placement:
◦ Load balancing: the partitions should be of almost equal size, assuming all computers have the
same performance.
◦ Minimized communication cost: avoid accessing data located on any of the other
computers.
◦ Distributed data access: this is not applicable for MR!
Strategy tags: Divide and conquer, dR*-tree
PDBSCAN (Cont.)
The algorithm is based on the R*-tree; it provides not only a spatial data placement strategy for
clustering, but also efficient access to spatial data in a shared-nothing architecture, through the
replication of indices.
Proposed data placement solution: group the MBRs of the leaf nodes of the R*-tree into N
partitions, such that nearby MBRs are assigned to the same partition and the
partitions are of almost equal size with respect to the number of MBRs.
How can this be achieved? Use space-filling Hilbert curves.
For a given R*-tree, this method works as follows:
◦ Every data page of the R*-tree is assigned a Hilbert value according to its center of gravity, so
successive data pages will be close in space.
◦ Sort the list of (page, value) pairs by ascending Hilbert value.
◦ If the R*-tree has d data pages and we have n slaves, every slave obtains d/n data pages of the sorted
list (see the sketch below).
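A sketch of the placement step. For brevity it substitutes a Morton (Z-order) key for the Hilbert value; both are space-filling curves that keep nearby page centers together, though Hilbert preserves locality somewhat better. Integer page-center coordinates are assumed:

```python
def morton_key(x, y, bits=16):
    """Interleave coordinate bits: a Z-order stand-in for the Hilbert
    value of a data page's center of gravity."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b) | ((y >> b) & 1) << (2 * b + 1)
    return key

def place_pages(page_centers, n_slaves, bits=16):
    """Sort pages by curve value and give d/n contiguous pages to each slave."""
    order = sorted(range(len(page_centers)),
                   key=lambda i: morton_key(*page_centers[i], bits))
    chunk = -(-len(order) // n_slaves)  # ceil(d / n)
    return [order[s * chunk:(s + 1) * chunk] for s in range(n_slaves)]
```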
PDBSCAN (Cont.)
Proposed efficient access to the distributed data solution: replicate the directory of the R*-tree
on all available computers (dR*-tree)
Now the PDBSCAN algorithm:
◦ It starts with an arbitrary point p within its partition S and retrieves all points which are density-reachable from p.
◦ If p is not a core point, no points are density-reachable from p: it visits the next point in partition S.
◦ If all members of a cluster C are contained in S: C is also a cluster of the whole dataset.
◦ If there are members of C outside of S: C may need to be merged with a cluster found elsewhere, so
C is called a merging candidate.
PDBSCAN (Cont.)
The master receives a list of merging candidates from every slave.
PDBSCAN collects all the lists L it receives into a list LL.
The merging function is nothing more than a nested loop that checks, for each pair of clusters,
whether their intersection is non-empty (see the sketch below).
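A sketch of that merging function, assuming each merging candidate arrives as a set of point ids:

```python
def merge_clusters(LL):
    """Nested loop over cluster pairs: merge two clusters whenever they
    share at least one point id (non-empty intersection)."""
    clusters = [set(c) for c in LL]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i] & clusters[j]:       # intersection non-empty
                    clusters[i] |= clusters.pop(j)  # merge j into i
                    merged = True
                    break
            if merged:
                break
    return clusters
```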
MR-DBSCAN (2011)
MR-DBSCAN implements DBSCAN as a 4-stage MapReduce job.
Contribution: a quick partitioning strategy for large-scale non-indexed data.
Challenges of designing DBSCAN in MapReduce:
◦ The data interchange mechanism is limited; transferring data between map and reduce is not encouraged.
◦ MapReduce doesn't provide any mechanism such as an R-tree or KD-tree to speed up multidimensional search.
◦ Maximum parallelism can be achieved only when the data is well balanced.
PDBSCAN has been the basis of this work.
However, PDBSCAN aggregates intermediate results in a single node, and MR-DBSCAN optimizes away this
bottleneck.
Strategy tags: Grid, MapReduce
MR-DBSCAN (2011) (Cont.)
Stage 1: Preprocessing:
◦ The main challenges for a partitioning strategy are:
◦ Load balancing
◦ Minimizing communication or shuffling cost (all related records, including the data within space Si and
its halo replication from bordering spaces, should map easily to the same key and be shuffled to the target
reducer)
◦ What is the problem with a spatial index? (disadvantages of indexing in MapReduce)
◦ Most of them require iteration or recursion to build a hierarchical structure, which is not practical
in MapReduce. (BUT WHAT ABOUT SPARK?!)
◦ For large-scale data the hierarchical index could reach one tenth of the original data size, which is huge
and hard to handle.
Proposed solution: a grid file (divide the data domain in dimension i into mi portions, each of
which is considered a mini bucket; see the sketch below).
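A hypothetical sketch of the stage-1 mapper logic for a 2-D grid: each record is keyed to its own cell, plus any bordering cell whose halo (within eps of the boundary) contains it. Diagonal (corner) neighbors are omitted for brevity:

```python
def grid_keys(point, mins, widths, m, eps):
    """Return the grid-cell keys a 2-D record should be shuffled to, so each
    reducer receives its space Si together with the replicated halo records."""
    keys = set()
    base = tuple(int((point[d] - mins[d]) // widths[d]) for d in range(2))
    keys.add(base)
    for d in range(2):
        offset = (point[d] - mins[d]) % widths[d]   # position inside the cell
        for step, near in ((-1, offset <= eps),
                           (1, widths[d] - offset <= eps)):
            nb = (base[0] + (step if d == 0 else 0),
                  base[1] + (step if d == 1 else 0))
            if near and 0 <= nb[d] < m[d]:          # stay inside the grid
                keys.add(nb)
    return keys  # the mapper emits (key, point) for every key
```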
MR-DBSCAN (2011) (Cont.)
Stage 2: Local DBSCAN:
◦ In PDBSCAN each thread could access not just its partition data but global data
during the processing of the local DBSCAN algorithm. !!BAD in MapReduce!!
The local DBSCAN algorithm will only scan data and extend core points
within space Si.
When the cluster scan extends outside Si, say a record q outside Si is directly density-reachable
from a core point p in Si, we no longer detect whether q is a core point.
q is marked with 'On-queue' status and put into the Merge Candidates set (MC set) together with core
point p.
MR-DBSCAN (2011) (Cont.)
Stage 3: Find Merging Mapping:
This stage removes the single-node aggregation bottleneck of PDBSCAN.
In PDBSCAN, to merge the clusters from different subspaces:
◦ Collect the entire MC sets into a big list LL.
◦ Among all the points in the list, execute a nested loop to find out whether two items with the
same point id belong to different clusters.
◦ If so, merge the clusters.
MR-DBSCAN (2011) (Cont.)
Stage 4: Merge
Stage 4.1: Build Global Mapping:
We get several id lists of clusters to be merged for each pair of bordering spaces: (i, c1) <-> (i+1, c2).
The output of this stage is the mapping ((gridID, localClusterID), globalClusterID) for each local
cluster in each partition (see the sketch below).
Stage 4.2: Merge and Relabel:
The final stage of the algorithm streams all the locally clustered records through the map-reduce
process and replaces their local cluster id with a new global cluster id (gid), based on the
mapping profile from Stage 4.1.
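A sketch of stage 4.1 with union-find over the merge pairs; the (gridID, localClusterID) tuples follow the slide, the rest is an assumption:

```python
def build_global_mapping(merge_pairs, local_ids):
    """Union-find over merge pairs such as ((i, c1), (i + 1, c2)), producing
    the ((gridID, localClusterID) -> globalClusterID) mapping of stage 4.1."""
    parent = {c: c for c in local_ids}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in merge_pairs:
        parent[find(a)] = find(b)          # clusters a and b become one
    roots = sorted({find(c) for c in local_ids})
    gid = {r: g for g, r in enumerate(roots)}
    return {c: gid[find(c)] for c in local_ids}
```

Stage 4.2 then only needs a dictionary lookup per record to relabel it.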
DBCURE (2014)
DBCURE utilizes ellipsoidal τ-neighborhoods instead of spherical ε-neighborhoods and has the
desirable property of being less sensitive to the density parameters.
DBCURE is more suitable than OPTICS for parallelization with MapReduce, since the
ellipsoidal τ-neighborhood of each point can be determined in parallel (see the sketch below).
It uses an R*-tree to efficiently find the ellipsoidal τ-neighborhoods of a given point.
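A hypothetical reading of the ellipsoidal neighborhood query: with a positive-definite matrix M it selects the points within Mahalanobis-style distance τ of p, and M = I recovers the spherical ε-neighborhood. DBCURE's precise neighborhood definition may differ in detail:

```python
import numpy as np

def tau_neighborhood(X, p, M, tau):
    """Indices of points x with (x - p)^T M (x - p) <= tau^2."""
    diff = X - p
    d2 = np.einsum('ij,jk,ik->i', diff, M, diff)  # quadratic form per row
    return np.flatnonzero(d2 <= tau ** 2)
```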
Strategy tags: R*-tree indexing, MapReduce, Grid
Partitioning Algorithms
K-Means Algorithms
Its popularity can be attributed to several reasons:
1. It is conceptually simple and easy to implement.
2. It is versatile, i.e., almost every aspect of the algorithm (initialization, distance function,
termination criterion, etc.) can be modified. (This is evidenced by hundreds of
publications over the last fifty years that extend k-means in a variety of ways.)
3. It has a time complexity that is linear in N, D, and K (in general, D ≪ N and K ≪ N)
4. It has a storage complexity that is linear in N, D, and K
5. It is guaranteed to converge (at a quadratic rate)
6. It is invariant to data ordering, i.e., to random shuffling of the data points (useful for
MapReduce load balance!)
K-Means Algorithms (Cont.)
k-means has several significant disadvantages:
1. It requires the number of clusters, K, to be specified in advance.
◦ Can be determined automatically by means of various internal/relative cluster validity
measures.
2. It can only detect compact, hyperspherical clusters that are well separated.
◦ This can be alleviated by using a more general distance function, such as the Mahalanobis distance,
which permits the detection of hyperellipsoidal clusters.
3. It is sensitive to noise and outlier points.
◦ Can be addressed by outlier pruning or by using a more robust distance function such as the
city-block (ℓ1) distance.
4. It often converges to a local minimum of the criterion function.
◦ For the same reason, it is highly sensitive to the selection of the initial centers
K-Means Algorithms (Cont.)
Obstacles to clustering very large datasets with K-Means:
◦ the computational complexity of the distance calculations;
◦ the number of iterations, which increases significantly as the number of samples grows.
Proposed ideas to overcome these obstacles:
◦ the first is addressed by using the MapReduce model to distribute the computations;
◦ the second by using a two-stage K-Means algorithm or the K-Means++ algorithm (a seeding sketch follows).
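A minimal sketch of k-means++ seeding, the part that typically cuts the iteration count; X and k are assumptions:

```python
import numpy as np

def kmeanspp_seeds(X, k, seed=0):
    """k-means++: each new center is drawn with probability proportional
    to the squared distance to the nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```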
K-Medoids
Both K-Means and K-Medoids attempt to minimize the distance between the points assigned to
a cluster and the point designated as the center of that cluster.
K-Medoids chooses data points as centers (medoids or exemplars) and works with an arbitrary
matrix of distances between data points instead of the ℓ2 norm.
PAM: Partitioning Around Medoids
1. Initialize: randomly select k of the n data points as the medoids.
2. Associate each data point with the closest medoid ("closest" here is defined using any
valid distance metric, most commonly the Euclidean, Manhattan, or Minkowski distance).
3. For each medoid m:
For each non-medoid data point o:
Swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids (a minimal sketch follows).
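A minimal sketch of PAM over a precomputed distance matrix D; the first-improvement swap order is an implementation choice, not prescribed by the slide:

```python
import numpy as np

def pam(D, k, seed=0):
    """PAM: accept any medoid/non-medoid swap that lowers the total cost,
    i.e. the sum of distances from every point to its closest medoid."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))
    cost = lambda ms: D[:, ms].min(axis=1).sum()
    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids[:i] + [o] + medoids[i + 1:]  # swap medoid i and o
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
    return medoids
```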
CLARA/CLARANS
These methods reduce the number of medoid calculations through sampling.
A small portion of the data is first selected from the whole dataset, and then PAM is used to
search for the cluster medoids.
Strategy tags: Sampling
Fast clustering using MapReduce (2011, KDD)
K-center: the goal is to choose the centers such that the maximum distance between a center and a point
assigned to it is minimized.
K-median: a variation of k-means clustering where, instead of calculating the mean of each cluster
to determine its centroid, one calculates the median (the 1-norm distance metric, as opposed to the
square of the 2-norm).
Assume the input is a weighted complete graph G = (V, E) that has an edge xy between any two
points in V, where the weight of edge xy is d(x, y).
First idea: adapt existing algorithms to MR:
◦ Partition the input across machines.
◦ Each machine performs computation to sparsify its data.
◦ The results are collected on a single machine, which computes the final solution.
Unfortunately, the total running time of this approach can be quite large:
it runs a costly clustering algorithm on Ω(√(kn)) points.
Strategy tags: Parallel sampling, MapReduce
Fast clustering using MapReduce (2011, KDD) (Cont.)
This algorithm uses Iterative-Sample as a subroutine:
◦ It performs the following computation in parallel across the machines:
◦ In each round, it adds a small sample of points to the final sample, determines which points are "well
represented" by the sample, and recursively considers only the points that are not well represented.
After obtaining a good sample, the sampled points are put on a single machine and a clustering
algorithm is run on just those points.
The paper also devotes about three pages to the mathematical proof that this iterative sampling
is good.
PK-Means: Parallel K-Means Clustering Based on MapReduce (2009)
Map function: assigns each sample to the closest center.
Reduce function: performs the procedure of updating the new centers.
Combiner function: deals with the partial combination of the intermediate values with the same
key within the same map task (a minimal sketch follows).
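A plain-Python sketch of the three functions (the paper targets Hadoop; the generator and dict plumbing here merely stands in for the MapReduce runtime):

```python
import numpy as np

def pk_map(points, centers):
    """Map: emit (closest-center id, point) for each sample in the split."""
    for x in points:
        j = int(np.argmin([np.linalg.norm(x - c) for c in centers]))
        yield j, x

def pk_combine(mapped):
    """Combiner: partially sum points per center id inside one map task,
    so only (key, (partial_sum, count)) pairs cross the network."""
    acc = {}
    for j, x in mapped:
        s, n = acc.get(j, (0.0, 0))
        acc[j] = (s + x, n + 1)
    return acc.items()

def pk_reduce(combined):
    """Reduce: total the partial sums per key and emit the new centers."""
    tot = {}
    for j, (s, n) in combined:
        S, N = tot.get(j, (0.0, 0))
        tot[j] = (S + s, N + n)
    return {j: S / N for j, (S, N) in tot.items()}
```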
Strategy tags: MapReduce
PK-Means: Parallel K-Means Clustering Based on MapReduce (2009) (Cont.)
[Slides 38-40 contained only figures; no text is recoverable.]
FMR.K-Means: Fast K-Means Clustering for Very Large Datasets Based on
MapReduce Combined with a New Cutting Method (2015)
Presents a new approach for reducing the number of iterations of the K-Means algorithm.
It is based on parallel K-Means over MapReduce.
It proposes a new method, called cutting off the last iterations, based on the differences between the
centers of each cluster in two adjacent iterations.
Strategy tags: MapReduce, Iteration elimination
Canopy Clustering (KDD 2000)
Canopy targets datasets that have any of:
◦ millions of data points
◦ thousands of dimensions
◦ thousands of clusters
Key idea: use a cheap, approximate distance measure to efficiently divide the data into
overlapping subsets (canopies); clustering is then performed by measuring exact distances only
between points that occur in a common canopy (see the sketch below).
Use domain-specific features to design a cheap distance metric and efficiently create
canopies using that metric.
Fast distance metrics for text, used by search engines, are based on the inverted index.
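A minimal sketch with the usual two thresholds T1 > T2 and a caller-supplied cheap distance, following the common description of the algorithm:

```python
def canopies(points, cheap_dist, T1, T2):
    """Canopy clustering: points within T1 of a chosen center join its
    (possibly overlapping) canopy; points within T2 stop being candidates."""
    assert T1 > T2
    remaining = list(range(len(points)))
    result = []
    while remaining:
        c = remaining[0]                  # pick an arbitrary remaining point
        canopy = [i for i in remaining
                  if cheap_dist(points[c], points[i]) < T1]
        remaining = [i for i in remaining
                     if cheap_dist(points[c], points[i]) >= T2]
        result.append((c, canopy))
    return result
```

Points falling between T2 and T1 of a center stay in the candidate list, which is what makes canopies overlap.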
Strategy tags: Approximation, Two-stage
Fuzzy C-Means (FCM)
Given a finite set of data, the algorithm returns a list of c cluster centers and a partition matrix,
where each element wij of the matrix gives the degree to which element xi belongs to cluster cj.
Like the k-means algorithm, FCM aims to minimize an objective function: J = Σi Σj (wij)^m ‖xi − cj‖².
This differs from the k-means objective function by the addition of the membership values wij
and the fuzzifier m.
The fuzzifier m determines the level of cluster fuzziness (a minimal sketch follows).
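A minimal sketch of the alternating FCM updates; the iteration cap and the missing convergence test are assumptions:

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Fuzzy C-Means: alternate membership and center updates to minimize
    sum_i sum_j (w_ij)^m ||x_i - c_j||^2."""
    rng = np.random.default_rng(seed)
    W = rng.random((len(X), c))
    W /= W.sum(axis=1, keepdims=True)          # rows are fuzzy memberships
    for _ in range(iters):
        Wm = W ** m
        C = (Wm.T @ X) / Wm.sum(axis=0)[:, None]            # weighted centers
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        W = 1.0 / d ** (2 / (m - 1))           # closer centers get more weight
        W /= W.sum(axis=1, keepdims=True)
    return C, W
```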
K-Means + Canopy: An Integrated Clustering Framework
Using Optimized K-means with Firefly and Canopies (2015)
The proposal integrates two algorithms: the Firefly meta-heuristic and Canopy clustering.
Strategy tags: Approximation, Two-stage
K-medoids Clustering Based on MapReduce and
Optimal Search of Medoids (2014)
This work proposes an improved algorithm based on MapReduce and an optimal search of the medoids.
Using the basic properties of triangle geometry (the triangle inequality), the paper reduces the
number of distance calculations among data elements, helping to find medoids quickly and reducing
the computational complexity of k-medoids (see the sketch below).
Strategy tags: MapReduce, Optimal search
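A hypothetical sketch of the triangle-inequality idea (the paper's exact bounds may differ): when a medoid moves from m_old to m_new, |d(x, m_old) − d(m_old, m_new)| lower-bounds d(x, m_new), so many exact distances can be skipped:

```python
def needs_exact_distance(D_to_old, d_old_new, best_d):
    """Return the indices whose distance to the new medoid must actually be
    computed; for all others the lower bound already rules the medoid out."""
    need = []
    for i, d_xo in enumerate(D_to_old):
        lower = abs(d_xo - d_old_new)  # triangle-inequality lower bound
        if lower < best_d[i]:          # m_new could be closer: check exactly
            need.append(i)
    return need
```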
Ad

More Related Content

What's hot (20)

K-Means Clustering Algorithm.pptx
K-Means Clustering Algorithm.pptxK-Means Clustering Algorithm.pptx
K-Means Clustering Algorithm.pptx
JebaRaj26
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
Pabna University of Science & Technology
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
Valerii Klymchuk
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
Krish_ver2
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
pyingkodi maran
 
Customer segmentation.pptx
Customer segmentation.pptxCustomer segmentation.pptx
Customer segmentation.pptx
Addalashashikumar
 
4.3 multimedia datamining
4.3 multimedia datamining4.3 multimedia datamining
4.3 multimedia datamining
Krish_ver2
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Fraboni Ec
 
Clustering ppt
Clustering pptClustering ppt
Clustering ppt
sreedevibalasubraman
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
hadifar
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
vineeta vineeta
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
6535ANURAGANURAG
 
DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)
Cory Cook
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
Abhinav Tyagi
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering Algorithm
Pınar Yahşi
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
Kasun Ranga Wijeweera
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
Xueping Peng
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
Jon Lederman
 
Using python to analyze spatial data
Using python to analyze spatial dataUsing python to analyze spatial data
Using python to analyze spatial data
Kudos S.A.S
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
Krish_ver2
 
K-Means Clustering Algorithm.pptx
K-Means Clustering Algorithm.pptxK-Means Clustering Algorithm.pptx
K-Means Clustering Algorithm.pptx
JebaRaj26
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
Valerii Klymchuk
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
Krish_ver2
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
pyingkodi maran
 
4.3 multimedia datamining
4.3 multimedia datamining4.3 multimedia datamining
4.3 multimedia datamining
Krish_ver2
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Fraboni Ec
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
hadifar
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
vineeta vineeta
 
DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)
Cory Cook
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering Algorithm
Pınar Yahşi
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
Xueping Peng
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
Jon Lederman
 
Using python to analyze spatial data
Using python to analyze spatial dataUsing python to analyze spatial data
Using python to analyze spatial data
Kudos S.A.S
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
Krish_ver2
 

Similar to Big data Clustering Algorithms And Strategies (20)

An Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsAn Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data Fragments
IJMER
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
GandhiMathy6
 
[ML]-Unsupervised-learning_Unit2.ppt.pdf
[ML]-Unsupervised-learning_Unit2.ppt.pdf[ML]-Unsupervised-learning_Unit2.ppt.pdf
[ML]-Unsupervised-learning_Unit2.ppt.pdf
4NM20IS025BHUSHANNAY
 
clustering using different methods in .pdf
clustering using different methods in .pdfclustering using different methods in .pdf
clustering using different methods in .pdf
officialnovice7
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
IRJET Journal
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
Dr Nisha Arora
 
Parallel Machine Learning
Parallel Machine LearningParallel Machine Learning
Parallel Machine Learning
Janani C
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
IJDKP
 
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
cscpconf
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
SowmyaJyothi3
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
IJMIT JOURNAL
 
M5.pptx
M5.pptxM5.pptx
M5.pptx
MayuraD1
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
PRAWEEN KUMAR
 
F04463437
F04463437F04463437
F04463437
IOSR-JEN
 
50120140505013
5012014050501350120140505013
50120140505013
IAEME Publication
 
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
IOSR Journals
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
Anil Yadav
 
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
IJCSIS Research Publications
 
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTION
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTIONDECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTION
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTION
cscpconf
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
IJMIT JOURNAL
 
An Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsAn Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data Fragments
IJMER
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
GandhiMathy6
 
[ML]-Unsupervised-learning_Unit2.ppt.pdf
[ML]-Unsupervised-learning_Unit2.ppt.pdf[ML]-Unsupervised-learning_Unit2.ppt.pdf
[ML]-Unsupervised-learning_Unit2.ppt.pdf
4NM20IS025BHUSHANNAY
 
clustering using different methods in .pdf
clustering using different methods in .pdfclustering using different methods in .pdf
clustering using different methods in .pdf
officialnovice7
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
IRJET Journal
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
Dr Nisha Arora
 
Parallel Machine Learning
Parallel Machine LearningParallel Machine Learning
Parallel Machine Learning
Janani C
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
IJDKP
 
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
cscpconf
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
SowmyaJyothi3
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
IJMIT JOURNAL
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
PRAWEEN KUMAR
 
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
IOSR Journals
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
Anil Yadav
 
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
IJCSIS Research Publications
 
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTION
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTIONDECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTION
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTION
cscpconf
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
IJMIT JOURNAL
 
Ad

More from Farzad Nozarian (14)

SHARE Interface in Flash Storage for Relational and NoSQL Databases
SHARE Interface in Flash Storage for Relational and NoSQL DatabasesSHARE Interface in Flash Storage for Relational and NoSQL Databases
SHARE Interface in Flash Storage for Relational and NoSQL Databases
Farzad Nozarian
 
Object Based Databases
Object Based DatabasesObject Based Databases
Object Based Databases
Farzad Nozarian
 
Ultimate Goals In Robotics
Ultimate Goals In RoboticsUltimate Goals In Robotics
Ultimate Goals In Robotics
Farzad Nozarian
 
Tank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engineTank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engine
Farzad Nozarian
 
The Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring ModelThe Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring Model
Farzad Nozarian
 
Shark - Lab Assignment
Shark - Lab AssignmentShark - Lab Assignment
Shark - Lab Assignment
Farzad Nozarian
 
Apache HBase - Lab Assignment
Apache HBase - Lab AssignmentApache HBase - Lab Assignment
Apache HBase - Lab Assignment
Farzad Nozarian
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab Assignment
Farzad Nozarian
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
Farzad Nozarian
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Farzad Nozarian
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm Tutorial
Farzad Nozarian
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
Farzad Nozarian
 
Big Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing EnvironmentsBig Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing Environments
Farzad Nozarian
 
S4: Distributed Stream Computing Platform
S4: Distributed Stream Computing PlatformS4: Distributed Stream Computing Platform
S4: Distributed Stream Computing Platform
Farzad Nozarian
 
SHARE Interface in Flash Storage for Relational and NoSQL Databases
SHARE Interface in Flash Storage for Relational and NoSQL DatabasesSHARE Interface in Flash Storage for Relational and NoSQL Databases
SHARE Interface in Flash Storage for Relational and NoSQL Databases
Farzad Nozarian
 
Ultimate Goals In Robotics
Ultimate Goals In RoboticsUltimate Goals In Robotics
Ultimate Goals In Robotics
Farzad Nozarian
 
Tank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engineTank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engine
Farzad Nozarian
 
The Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring ModelThe Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring Model
Farzad Nozarian
 
Apache HBase - Lab Assignment
Apache HBase - Lab AssignmentApache HBase - Lab Assignment
Apache HBase - Lab Assignment
Farzad Nozarian
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab Assignment
Farzad Nozarian
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
Farzad Nozarian
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
Farzad Nozarian
 
Big Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing EnvironmentsBig Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing Environments
Farzad Nozarian
 
S4: Distributed Stream Computing Platform
S4: Distributed Stream Computing PlatformS4: Distributed Stream Computing Platform
S4: Distributed Stream Computing Platform
Farzad Nozarian
 
Ad

Recently uploaded (20)

GDS SYSTEM | GLOBAL DISTRIBUTION SYSTEM
GDS SYSTEM | GLOBAL  DISTRIBUTION SYSTEMGDS SYSTEM | GLOBAL  DISTRIBUTION SYSTEM
GDS SYSTEM | GLOBAL DISTRIBUTION SYSTEM
philipnathen82
 
Top 12 Most Useful AngularJS Development Tools to Use in 2025
Top 12 Most Useful AngularJS Development Tools to Use in 2025Top 12 Most Useful AngularJS Development Tools to Use in 2025
Top 12 Most Useful AngularJS Development Tools to Use in 2025
GrapesTech Solutions
 
Medical Device Cybersecurity Threat & Risk Scoring
Medical Device Cybersecurity Threat & Risk ScoringMedical Device Cybersecurity Threat & Risk Scoring
Medical Device Cybersecurity Threat & Risk Scoring
ICS
 
Exchange Migration Tool- Shoviv Software
Exchange Migration Tool- Shoviv SoftwareExchange Migration Tool- Shoviv Software
Exchange Migration Tool- Shoviv Software
Shoviv Software
 
A Comprehensive Guide to CRM Software Benefits for Every Business Stage
A Comprehensive Guide to CRM Software Benefits for Every Business StageA Comprehensive Guide to CRM Software Benefits for Every Business Stage
A Comprehensive Guide to CRM Software Benefits for Every Business Stage
SynapseIndia
 
Beyond the code. Complexity - 2025.05 - SwiftCraft
Beyond the code. Complexity - 2025.05 - SwiftCraftBeyond the code. Complexity - 2025.05 - SwiftCraft
Beyond the code. Complexity - 2025.05 - SwiftCraft
Dmitrii Ivanov
 
Tools of the Trade: Linux and SQL - Google Certificate
Tools of the Trade: Linux and SQL - Google CertificateTools of the Trade: Linux and SQL - Google Certificate
Tools of the Trade: Linux and SQL - Google Certificate
VICTOR MAESTRE RAMIREZ
 
Serato DJ Pro Crack Latest Version 2025??
Serato DJ Pro Crack Latest Version 2025??Serato DJ Pro Crack Latest Version 2025??
Serato DJ Pro Crack Latest Version 2025??
Web Designer
 
What Do Candidates Really Think About AI-Powered Recruitment Tools?
What Do Candidates Really Think About AI-Powered Recruitment Tools?What Do Candidates Really Think About AI-Powered Recruitment Tools?
What Do Candidates Really Think About AI-Powered Recruitment Tools?
HireME
 
Gojek Clone App for Multi-Service Business
Gojek Clone App for Multi-Service BusinessGojek Clone App for Multi-Service Business
Gojek Clone App for Multi-Service Business
XongoLab Technologies LLP
 
Meet the New Kid in the Sandbox - Integrating Visualization with Prometheus
Meet the New Kid in the Sandbox - Integrating Visualization with PrometheusMeet the New Kid in the Sandbox - Integrating Visualization with Prometheus
Meet the New Kid in the Sandbox - Integrating Visualization with Prometheus
Eric D. Schabell
 
Wilcom Embroidery Studio Crack 2025 For Windows
Wilcom Embroidery Studio Crack 2025 For WindowsWilcom Embroidery Studio Crack 2025 For Windows
Wilcom Embroidery Studio Crack 2025 For Windows
Google
 
Mastering Selenium WebDriver: A Comprehensive Tutorial with Real-World Examples
Mastering Selenium WebDriver: A Comprehensive Tutorial with Real-World ExamplesMastering Selenium WebDriver: A Comprehensive Tutorial with Real-World Examples
Mastering Selenium WebDriver: A Comprehensive Tutorial with Real-World Examples
jamescantor38
 
Passive House Canada Conference 2025 Presentation [Final]_v4.ppt
Passive House Canada Conference 2025 Presentation [Final]_v4.pptPassive House Canada Conference 2025 Presentation [Final]_v4.ppt
Passive House Canada Conference 2025 Presentation [Final]_v4.ppt
IES VE
 
Artificial hand using embedded system.pptx
Artificial hand using embedded system.pptxArtificial hand using embedded system.pptx
Artificial hand using embedded system.pptx
bhoomigowda12345
 
The-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptx
The-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptxThe-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptx
The-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptx
james brownuae
 
From Vibe Coding to Vibe Testing - Complete PowerPoint Presentation
From Vibe Coding to Vibe Testing - Complete PowerPoint PresentationFrom Vibe Coding to Vibe Testing - Complete PowerPoint Presentation
From Vibe Coding to Vibe Testing - Complete PowerPoint Presentation
Shay Ginsbourg
 
AEM User Group DACH - 2025 Inaugural Meeting
AEM User Group DACH - 2025 Inaugural MeetingAEM User Group DACH - 2025 Inaugural Meeting
AEM User Group DACH - 2025 Inaugural Meeting
jennaf3
 
Autodesk Inventor Crack (2025) Latest
Autodesk Inventor    Crack (2025) LatestAutodesk Inventor    Crack (2025) Latest
Autodesk Inventor Crack (2025) Latest
Google
 
How to avoid IT Asset Management mistakes during implementation_PDF.pdf
How to avoid IT Asset Management mistakes during implementation_PDF.pdfHow to avoid IT Asset Management mistakes during implementation_PDF.pdf
How to avoid IT Asset Management mistakes during implementation_PDF.pdf
victordsane
 
GDS SYSTEM | GLOBAL DISTRIBUTION SYSTEM
GDS SYSTEM | GLOBAL  DISTRIBUTION SYSTEMGDS SYSTEM | GLOBAL  DISTRIBUTION SYSTEM
GDS SYSTEM | GLOBAL DISTRIBUTION SYSTEM
philipnathen82
 
Top 12 Most Useful AngularJS Development Tools to Use in 2025
Top 12 Most Useful AngularJS Development Tools to Use in 2025Top 12 Most Useful AngularJS Development Tools to Use in 2025
Top 12 Most Useful AngularJS Development Tools to Use in 2025
GrapesTech Solutions
 
Medical Device Cybersecurity Threat & Risk Scoring
Medical Device Cybersecurity Threat & Risk ScoringMedical Device Cybersecurity Threat & Risk Scoring
Medical Device Cybersecurity Threat & Risk Scoring
ICS
 
Exchange Migration Tool- Shoviv Software
Exchange Migration Tool- Shoviv SoftwareExchange Migration Tool- Shoviv Software
Exchange Migration Tool- Shoviv Software
Shoviv Software
 
A Comprehensive Guide to CRM Software Benefits for Every Business Stage
A Comprehensive Guide to CRM Software Benefits for Every Business StageA Comprehensive Guide to CRM Software Benefits for Every Business Stage
A Comprehensive Guide to CRM Software Benefits for Every Business Stage
SynapseIndia
 
Beyond the code. Complexity - 2025.05 - SwiftCraft
Beyond the code. Complexity - 2025.05 - SwiftCraftBeyond the code. Complexity - 2025.05 - SwiftCraft
Beyond the code. Complexity - 2025.05 - SwiftCraft
Dmitrii Ivanov
 
Tools of the Trade: Linux and SQL - Google Certificate
Tools of the Trade: Linux and SQL - Google CertificateTools of the Trade: Linux and SQL - Google Certificate
Tools of the Trade: Linux and SQL - Google Certificate
VICTOR MAESTRE RAMIREZ
 
Serato DJ Pro Crack Latest Version 2025??
Serato DJ Pro Crack Latest Version 2025??Serato DJ Pro Crack Latest Version 2025??
Serato DJ Pro Crack Latest Version 2025??
Web Designer
 
What Do Candidates Really Think About AI-Powered Recruitment Tools?
What Do Candidates Really Think About AI-Powered Recruitment Tools?What Do Candidates Really Think About AI-Powered Recruitment Tools?
What Do Candidates Really Think About AI-Powered Recruitment Tools?
HireME
 
Meet the New Kid in the Sandbox - Integrating Visualization with Prometheus
Meet the New Kid in the Sandbox - Integrating Visualization with PrometheusMeet the New Kid in the Sandbox - Integrating Visualization with Prometheus
Meet the New Kid in the Sandbox - Integrating Visualization with Prometheus
Eric D. Schabell
 
Wilcom Embroidery Studio Crack 2025 For Windows
Wilcom Embroidery Studio Crack 2025 For WindowsWilcom Embroidery Studio Crack 2025 For Windows
Wilcom Embroidery Studio Crack 2025 For Windows
Google
 
Mastering Selenium WebDriver: A Comprehensive Tutorial with Real-World Examples
Mastering Selenium WebDriver: A Comprehensive Tutorial with Real-World ExamplesMastering Selenium WebDriver: A Comprehensive Tutorial with Real-World Examples
Mastering Selenium WebDriver: A Comprehensive Tutorial with Real-World Examples
jamescantor38
 
Passive House Canada Conference 2025 Presentation [Final]_v4.ppt
Passive House Canada Conference 2025 Presentation [Final]_v4.pptPassive House Canada Conference 2025 Presentation [Final]_v4.ppt
Passive House Canada Conference 2025 Presentation [Final]_v4.ppt
IES VE
 
Artificial hand using embedded system.pptx
Artificial hand using embedded system.pptxArtificial hand using embedded system.pptx
Artificial hand using embedded system.pptx
bhoomigowda12345
 
The-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptx
The-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptxThe-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptx
The-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptx
james brownuae
 
From Vibe Coding to Vibe Testing - Complete PowerPoint Presentation
From Vibe Coding to Vibe Testing - Complete PowerPoint PresentationFrom Vibe Coding to Vibe Testing - Complete PowerPoint Presentation
From Vibe Coding to Vibe Testing - Complete PowerPoint Presentation
Shay Ginsbourg
 
AEM User Group DACH - 2025 Inaugural Meeting
AEM User Group DACH - 2025 Inaugural MeetingAEM User Group DACH - 2025 Inaugural Meeting
AEM User Group DACH - 2025 Inaugural Meeting
jennaf3
 
Autodesk Inventor Crack (2025) Latest
Autodesk Inventor    Crack (2025) LatestAutodesk Inventor    Crack (2025) Latest
Autodesk Inventor Crack (2025) Latest
Google
 
How to avoid IT Asset Management mistakes during implementation_PDF.pdf
How to avoid IT Asset Management mistakes during implementation_PDF.pdfHow to avoid IT Asset Management mistakes during implementation_PDF.pdf
How to avoid IT Asset Management mistakes during implementation_PDF.pdf
victordsane
 

Big data Clustering Algorithms And Strategies

  • 1. Big data Clustering Algorithms & Strategies FARZAD NOZARIAN AMIRKABIR UNIVERSITY OF TECHNOLOGY – MARCH 2015 1
  • 2. Preprocessing Goals: 1. To assure the quality of the data by reducing the noisy and irrelevant information that it could contain 2. To reduce the size of the dataset, so the computational cost of the discovery task is also reduced. Reducing the size of dataset: ◦ Number of instances ◦ addressed by sampling (the sampled dataset should holds the same information that the whole dataset) ◦ Dimensionality reduction ◦ Feature selection ◦ Feature extraction 2
  • 3. Clustering algorithms Hierarchical methods ◦ Divisive ◦ Agglomerative Based on similarity matrix for each pair of examples Some algorithm consider this matrix as Graph; Other algorithm reduce the matrix each iteration by merging two groups. The main drawback of these algorithms is their computational cost. (o(n2)) Scanning the dataset many times! 3
  • 4. Prototype/model based clustering Prototype and model based clustering assume that clusters fit to a specific shape. Goal: Discover how different numbers of these shapes can explain the spatial distribution of the data. Must used prototype based clustering is K-Means. ◦ K-Means assumes that clusters are defined by their center (the prototype) and have spherical shapes. ◦ To feet this shape K-Means minimizing the distances from the examples to these centers. ◦ solved iteratively using a gradient descent algorithm. 4
  • 5. Density based clustering DBSCAN OPTICS is an extension of the original DBSCAN that uses heuristics to find good values for DBSCAN parameters. The main drawback of this methods comes from the cost of finding the nearest neighbors for an example. Indexing is a solution, but may be degraded with the number of dimensions to a linear search. 5
  • 6. Grid based clustering The basic idea: divide the space of instances in hyperrectangular cells by discretizing the attributes of the dataset. Clusters of arbitrary shapes. Each cell is summarized by the sufficient statistics of the examples it contains. Usually scale well, but it depends on the granularity of the discretization of the space of examples. The strategies used to prune the search space allow to largely reduce the computational cost 6
  • 7. Scalability strategies One-pass strategies Summarization strategies Sampling/batch strategies Approximation strategies Divide and conquer strategies 7
  • 8. One-pass strategies Reduce the number of scans of the data to only one. This constraint may be usually forced by the circumstance that the dataset can not fit in memory and it has to be obtained from disk. This is used to perform a preprocess of the dataset. This results in two stages algorithms, a first one that applies the one-pass strategy and a second one that process in memory a summary of the data obtained by the first stage. 8
  • 9. Summarization Strategies Purpose: obtain a coarse approximation of the data without losing the information that represent the different densities of examples. Sufficient statistics like mean and variance. The summarization can be performed single level, as a preprocess that is feed to a cluster algorithm. 9
  • 10. Sampling/batch strategies Purpose: Allow to perform the processing in main memory for a part of the dataset. In case of more than one sample of the data: The algorithm should be able to process raw data and cluster summaries. They scale on the size of the sampling and not on the size of the whole dataset. The use of batches assume that the data can be processed sequentially and that after applying a clustering algorithm to a batch, the result can be merged with the results from previous batches. Data stream! 10
  • 11. Approximation strategies These strategies assume that some computations can be saved or approximated with reduced or null impact on the final result. Algorithm dependent. Most costly part of clustering algorithms corresponds to distance computation among instances or among instances and prototypes. E.g, some of these algorithms are iterative and the decision about what partition is assigned to an example does not change after a few iterations. If this can be determined at an early stage, all these distance computations can be avoided in successive iterations. This strategy is usually combined with a summarization strategy where groups of examples are reduced to a point that is used to decide if the decision can be performed using only that point or the distances to all the examples have to be computed. 11
  • 12. Divide and conquer strategies Data can be divided in multiple independent datasets and that the clustering results can be then merged on a final model. 12
  • 14. PINK: A Scalable Algorithm for Single-Linkage Hierarchical Clustering on Distributed-Memory Architectures (2013) (northwestern) A scalable parallel algorithm for single-linkage hierarchical clustering based on decomposing a problem instance into two different types of subproblems. As PINK does not explicitly store a distance matrix, it can be applied to much larger problem sizes. Algorithm: ◦ Divide a large hierarchical clustering problem instance into a set of smaller sub-problems ◦ Calculate the hierarchical clustering dendrogram for each of these sub-problems ◦ Reconstruct the solution for the original dataset by combining the solutions to the sub-problems. 14
  • 15. Leader-single-link (l-SL): A distance based clustering method for arbitrary shaped clusters in large datasets (2011) Divides the clustering process in two steps: ◦ One pass clustering algorithm: resulting in a set of cluster summaries that reduce the size of the dataset. ◦ This new dataset fits in memory and can be processed using a single link hierarchical clustering algorithm. Leaders clustering method: is a single data-scan distance based partitional clustering method. For a given threshold distance τ, it produces a set of leaders L incrementally. For each pattern 𝑥, if there is a leader 𝑙 ϵ L such that 𝑥 − 𝑙 ≤ 𝜏, then 𝑥 is assigned to a cluster represented by 𝑙. If there is no such leader, then 𝑥 becomes a new leader. 15 One-passSummarization
  • 16. Leader-single-link (l-SL) (cont.) The k-means also is a leader algorithm but it is applicable to numerical dataset only and scans dataset more than once before convergence. After producing the leaders, the leaders set is further clustered using SL method with cut-off distance ℎ which results in clustering of leaders. Finally, each leader is replaced by its followers to produce final clustering. 16
  • 18. PDBSCAN 1. Divide the input into several partitions, and distribute these partitions to the available computers 2. Cluster partitions concurrently using DBSCAN 3. Combine or merge the clustering's of the partitions into a clustering of the whole database. 4. In distributed environment we should care about data placement: ◦ Load balancing: the partitions should be almost of equal size if we assume that all computers have the same performance ◦ Minimized communication cost: should avoid accessing those data located on any of the other computers ◦ Distributed data access: This is not applicable for MR! 18 DivideandconquerdR*-tree
  • 19. PDBSCAN (Cont.) Algorithm is based on the R*-tree, provides not only a spatial data placement strategy for clustering, but also efficient access to spatial data in a shared nothing architecture through the replication of indices. Proposed data placement solution: grouping the MBRs of leaf nodes of the R*-tree into N partitions such that the nearby MBRs should be assigned to the same partition and the partitions should be almost of equal size with respect to the number of MBRs. How this solution can be achieved? use space filling Hilbert curves For a given R*-tree, this method works as follows: ◦ Every data page of the R*-tree is assigned to a Hilbert value according to its center of gravity. So, successive data pages will be close in space. ◦ Sort the list of pairs by ascending Hilbert values. ◦ If the R*-tree has d data pages and we have n slaves, every slave obtains d/n data pages of the sorted list 19
  • 20. PDBSCAN (Cont.) Proposed efficient access to the distributed data solution: replicate the directory of the R*-tree on all available computers (dR*-tree) Now the PDBSCAN algorithm: ◦ Starts with an arbitrary point p within S and retrieves all points which are density-reachable from p ◦ If p is not a core point, no points are density-reachable from p: visits the next point in partition S ◦ If all members of C are contained in S: C is also a cluster ◦ If there are members of C outside of S: C may need to be merged with another cluster found call C a merging candidate 20
  • 21. PDBSCAN (Cont.) The master PDBSCAN receives a list of merging candidates from every SLAVE. PDBSCAN collects all the lists L it receives and assigns them to a list LL. A merging function is noting else a nested loop that check for each pair of cluster if their intersection aren’t empty! 21
  • 22. MR-DBSCAN (2011) Implement it by a 4-stages MapReduce paradigm. Contributions: quick partitioning strategy for large scale non-indexed data. Challenges of designing DBSCAN in MapReduce: ◦ Data interchange mechanism is limited. Data transferring between map and reduce is not encouraged. ◦ MapReduce doesn’t provide any mechanism such as R-tree, KD-tree to improve multidimensional search. ◦ Maximum parallelism can be achieved when the data is well balanced. PDBSCAN was has been the basis of their work. However it aggregate intermediate results in a single node, and MR-DBSCAN optimize this issue. 22 GridMR
  • 23. MR-DBSCAN (2011) (Cont.) Stage 1: Preprocessing: ◦ Main challenges for a partitioning strategy are: ◦ Load balancing ◦ Minimize communication or shuffling cost (all related records, including the data within space Si and its halo replication from bordering spaces, should easily map to a same key and be shuffled to target reducer) ◦ What is the problem of spatial index? (disadvantages of indexing in MapReduce) ◦ Most of them are required to do iteration recursion to get a hierarchical structure that is not practical in MapReduce. (BUT WHAT ABOUT SPARK?!) ◦ For large scale data its hierarchical index could reach one tenth of its original data size, which in huge and hard to handle. Proposed solution: grid file (divide the data domain in dimension i into mi portions, each of which is considered as a mini bucket.) 23
  • 24. MR-DBSCAN (2011) (Cont.) Stage 2: Local DBSCAN : ◦ In PDBSCAN each thread could access not just its partition data but global data during the processing of local DBSCAN algorithm. !!BAD in MapReduce!! The local DBSCAN algorithm will only scan data and extend core points within space Si. 24 When the cluster scan extends outside Si, assumed that a record q outside Si is directly-density- reachable from a core point p in Si, we will not detect whether q is a core point anymore. q will be marked as ‘On-queue’ status and put into Merge Candidates set (MC set) with core point p as well.
  • 25. MR-DBSCAN (2011) (Cont.) Stage 3: Find Merging Mapping: this is where the single-node aggregation bottleneck is optimized away! In PDBSCAN, merging the clusters from different subspaces works like this: ◦ Collect all the MC sets into one big list LL ◦ Among all the points in the list, run a nested loop to find out whether two items with the same point id belong to different clusters. ◦ If found, merge those clusters. 25
  • 26. MR-DBSCAN (2011) (Cont.) Stage 4: Merge Stage 4.1: Build Global Mapping: for each pair of bordering spaces we get lists of cluster ids to be merged, e.g. (i, c1) <-> (i+1, c2). The output of this stage is the mapping ((gridID, localClusterID), globalClusterID) for every local cluster in every partition; a union-find sketch of this step follows. Stage 4.2: Merge and Relabel: the final stage streams all locally clustered records through a map-reduce pass and replaces each local cluster id with its new global cluster id (gid) according to the mapping profile from Stage 4.1. 26
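Building the global mapping amounts to computing connected components over the merge pairs. A compact union-find sketch (illustrative names, not the paper's code; it assumes every id appearing in a pair is also listed in all_local_ids):

```python
def build_global_mapping(merge_pairs, all_local_ids):
    """merge_pairs: iterable of ((gridID, localClusterID), (gridID, localClusterID))
    pairs to be unified. Returns {(gridID, localClusterID): globalClusterID}."""
    parent = {lid: lid for lid in all_local_ids}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for a, b in merge_pairs:
        parent[find(a)] = find(b)           # union the two components

    roots, mapping = {}, {}
    for lid in all_local_ids:
        r = find(lid)
        mapping[lid] = roots.setdefault(r, len(roots))  # dense global ids
    return mapping
```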
  • 27. DBCURE (2014) DBCURE utilizes ellipsoidal τ-neighborhoods instead of spherical ε-neighborhoods and has the desirable property of being less sensitive to density parameters. DBCURE is more suitable than OPTICS for parallelization with MapReduce, since the ellipsoidal τ-neighborhood of each point can be determined in parallel. Uses the R*-tree to efficiently find the ellipsoidal τ-neighborhoods of a given point. 27 R*-tree Indexing MR Grid
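An ellipsoidal neighborhood is a quadratic-form ball, so a minimal membership test might look as follows. The matrix M is assumed here to be some positive-definite matrix (e.g. derived from a local covariance estimate); DBCURE's exact construction of the ellipsoid differs and should be taken from the paper:

```python
import numpy as np

def in_tau_neighborhood(p, x, M, tau):
    """True if x lies in the ellipsoid {x : (x - p)^T M (x - p) <= tau} around p."""
    diff = np.asarray(x, dtype=float) - np.asarray(p, dtype=float)
    return float(diff @ M @ diff) <= tau
```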
  • 29. K-Means Algorithms Its popularity can be attributed to several reasons: 1. It is conceptually simple and easy to implement. 2. It is versatile, i.e., almost every aspect of the algorithm (initialization, distance function, termination criterion, etc.) can be modified. (This is evidenced by hundreds of publications over the last fifty years that extend k-means in a variety of ways.) 3. It has a time complexity that is linear in N, D, and K (in general, D ≪ N and K ≪ N) 4. It has a storage complexity that is linear in N, D, and K 5. It is guaranteed to converge at a quadratic rate 6. It is invariant to data ordering, i.e., random shuffling of the data points (MapReduce balance!) 29
  • 30. K-Means Algorithms (Cont.) k-means has several significant disadvantages: 1. It requires the number of clusters, K, to be specified in advance. ◦ K can be determined automatically by means of various internal/relative cluster validity measures. 2. It can only detect compact, hyperspherical clusters that are well separated. ◦ Can be alleviated by using a more general distance function, such as the Mahalanobis distance, which permits the detection of hyperellipsoidal clusters. 3. It is sensitive to noise and outlier points. ◦ Can be addressed by outlier pruning or by using a more robust distance function such as the city-block (ℓ1) distance. 4. It often converges to a local minimum of the criterion function. ◦ For the same reason, it is highly sensitive to the selection of the initial centers. 30
  • 31. K-Means Algorithms (Cont.) Obstacles to clustering very large datasets with K-Means: ◦ the computational complexity of the distance calculations; ◦ the number of iterations, which grows significantly as the number of samples increases. Proposed ideas to overcome these obstacles: ◦ the first is addressed by using the MapReduce model to distribute the computations ◦ the second by using a two-stage K-Means algorithm or the K-Means++ algorithm 31
  • 32. K-Medoids Both K-Means and K-Medoids attempt to minimize the distance between points labeled to be in a cluster and a point designated as the center of that cluster. K-Medoids chooses data points as centers (medoids or exemplars) and works with an arbitrary matrix of distances between data points instead of ℓ2 32
  • 33. PAM: Partitioning Around Medoids 1. Initialize: randomly select k of the n data points as the medoids 2. Associate each data point to the closest medoid ("closest" here is defined using any valid distance metric, most commonly Euclidean, Manhattan or Minkowski distance) 3. For each medoid m, for each non-medoid data point o: swap m and o and compute the total cost of the configuration 4. Select the configuration with the lowest cost 5. Repeat steps 2 to 4 until there is no change in the medoids 33
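A single-machine PAM sketch following the five steps above (the Euclidean metric and the naive O(n²) distance matrix are simplifying assumptions):

```python
import numpy as np

def pam(X, k, max_iter=100, seed=None):
    """Minimal PAM sketch: X is an (n, d) array; returns medoid indices and labels."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Full pairwise distance matrix (any valid metric would work here).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(n, size=k, replace=False)       # step 1: random medoids

    def total_cost(meds):
        return D[:, meds].min(axis=1).sum()              # step 2: closest medoid

    best = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for mi in range(k):                              # step 3: for each medoid m
            for o in range(n):                           # ... and each non-medoid o
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = o                            # swap m and o
                cost = total_cost(trial)
                if cost < best:                          # step 4: keep the cheapest
                    best, medoids, improved = cost, trial, True
        if not improved:                                 # step 5: medoids unchanged
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```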
  • 34. CLARA/CLARANS Reduces the number of medoid calculations through sampling: a small portion of the data is first drawn from the whole dataset, and then PAM is used to search for the cluster medoids 34 Sampling
  • 35. Fast clustering using MapReduce (2011, KDD) K-center: the goal is to choose the centers such that the maximum distance between a center and a point assigned to it is minimized. K-median: a variation of k-means clustering where, instead of calculating the mean of each cluster to determine its centroid, one calculates the median (the 1-norm distance metric, as opposed to the square of the 2-norm). Assume that the input is a weighted complete graph G = (V, E) that has an edge xy between any two points in V, where the weight of edge xy is d(x, y). First idea: adapt existing algorithms to MR: ◦ Partition the input across machines ◦ Each machine performs computation to sparsify its data ◦ The results are collected on a single machine, which computes the final solution. Unfortunately the total running time of this approach can be quite large: it runs a costly clustering algorithm on Ω(kn) points 35 ParallelSamplingMR
  • 36. Fast clustering using MapReduce (2011, KDD) (Cont.) This algorithm uses Iterative-Sample as a subroutine, which performs the following computation in parallel across the machines: ◦ In each round, it adds a small sample of points to the final sample, determines which points are "well represented" by the sample, and recursively considers only the points that are not well represented. After this sampling, the sampled points are put on a single machine and a clustering algorithm is run on just those points. The paper also devotes about three pages to the mathematical proof that this iterative sampling yields a good sample. 36
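A heavily simplified, single-machine sketch of the iterative-sampling idea: sample a small fraction, declare points close to the sample "well represented", and recurse on the rest. The fixed distance threshold and stopping size are assumptions; the paper chooses the cutoff differently and executes each round in parallel:

```python
import random

def iterative_sample(points, dist, threshold, sample_frac=0.1, min_size=100):
    """Return a small sample; points within `threshold` of the growing sample
    are dropped as well represented, the rest are reconsidered next round."""
    sample = []
    remaining = list(points)
    while len(remaining) > min_size:
        batch = random.sample(remaining, max(1, int(sample_frac * len(remaining))))
        sample.extend(batch)
        # Keep only the points that are NOT yet well represented by the sample.
        remaining = [p for p in remaining
                     if min(dist(p, s) for s in sample) > threshold]
    return sample + remaining    # the last few points join the sample directly
```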
  • 37. PK-Means: Parallel K-Means Clustering Based on MapReduce (2009) Map function: assigns each sample to the closest center. Reduce function: performs the procedure of updating the new centers. Combiner function: deals with the partial combination of the intermediate values sharing a key within the same map task (see the sketch after slide 40) 37 MR
  • 38. PK-Means: Parallel K-Means Clustering Based on MapReduce (2009) (Cont.) 38
  • 39. PK-Means: Parallel K-Means Clustering Based on MapReduce (2009) (Cont.) 39
  • 40. PK-Means: Parallel K-Means Clustering Based on MapReduce (2009) (Cont.) 40
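A minimal sketch of the three PK-Means functions described on slide 37, written as plain Python generators rather than Hadoop classes (the Hadoop plumbing, serialization, and iteration driver are omitted):

```python
from collections import defaultdict
import numpy as np

def kmeans_map(points, centers):
    """Map: assign each sample to the closest center, emit (center_id, point)."""
    for p in points:
        p = np.asarray(p, dtype=float)
        cid = int(np.argmin([np.linalg.norm(p - c) for c in centers]))
        yield cid, p

def kmeans_combine(mapped):
    """Combiner: partial sums and counts per center within one map task."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cid, p in mapped:
        sums[cid] = sums[cid] + p        # 0.0 + ndarray broadcasts to an ndarray
        counts[cid] += 1
    for cid in sums:
        yield cid, (sums[cid], counts[cid])

def kmeans_reduce(combined):
    """Reduce: merge the partial sums and emit the new center per cluster id."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cid, (s, n) in combined:
        sums[cid] = sums[cid] + s
        counts[cid] += n
    return {cid: sums[cid] / counts[cid] for cid in sums}
```

One iteration is kmeans_reduce(kmeans_combine(kmeans_map(points, centers))); a driver repeats this until the centers stabilize. The combiner is what keeps shuffle traffic small: only per-cluster sums leave each map task, not individual points.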
  • 41. FMR.K-Means: Fast K-Means Clustering for Very Large Datasets Based on MapReduce Combined with a New Cutting Method (2015) Presents a new approach for reducing the number of iterations of the K-Means algorithm, built on parallel K-Means over MapReduce. Proposes a method called "cutting off the last iterations", based on the differences between the centers of each cluster in two adjacent iterations. 41 MRIterationElimination
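The cutting idea can be sketched as a stopping test on how far each cluster center moved between two adjacent iterations (the max-norm test and the threshold are assumptions; the paper's exact criterion may differ):

```python
import numpy as np

def should_cut(prev_centers, new_centers, threshold):
    """Stop iterating once every cluster center moved less than `threshold`
    between two adjacent iterations."""
    shifts = np.linalg.norm(np.asarray(new_centers) - np.asarray(prev_centers),
                            axis=1)
    return bool(shifts.max() < threshold)
```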
  • 42. Canopy Clustering (KDD 2000) Canopy targets datasets that have: ◦ millions of data points ◦ thousands of dimensions ◦ thousands of clusters Key idea: use a cheap, approximate distance measure to efficiently divide the data into overlapping subsets (canopies); clustering is then performed by measuring exact distances only between points that occur in a common canopy (see the sketch below). Use domain-specific features to design the cheap distance metric and efficiently create canopies with it. Fast distance metrics for text, as used by search engines, are based on the inverted index. 42 ApproximationTwo-stage
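A compact, single-machine sketch of canopy formation with a cheap distance and the two thresholds T1 > T2 (illustrative names, not the paper's code):

```python
import random

def canopy(points, dist_cheap, t1, t2):
    """Build overlapping canopies: t1 is the loose (membership) threshold,
    t2 < t1 is the tight threshold that retires points as future centers."""
    assert t1 > t2
    remaining = set(range(len(points)))
    canopies = []
    while remaining:
        center = random.choice(tuple(remaining))
        members = {i for i in remaining
                   if dist_cheap(points[center], points[i]) < t1}
        canopies.append(members | {center})
        # Points within t2 of the center (including the center itself)
        # can no longer become canopy centers.
        remaining -= {i for i in members
                      if dist_cheap(points[center], points[i]) < t2}
        remaining.discard(center)
    return canopies
```

Exact distances are then computed only between points that share a canopy, which is where the speedup comes from.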
  • 43. Fuzzy C-Means (FCM) Given a finite set of data, the algorithm returns a list of c cluster centers and a partition matrix, where each element w_ij of the matrix tells the degree to which element x_i belongs to cluster c_j. Like the k-means algorithm, FCM aims to minimize an objective function: J = Σ_i Σ_j w_ij^m ‖x_i − c_j‖². This differs from the k-means objective function by the addition of the membership values w_ij and the fuzzifier m. The fuzzifier m determines the level of cluster fuzziness. 43
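Under the standard FCM formulation, the two alternating updates that minimize J can be sketched as follows (a textbook sketch using squared Euclidean distance, not tied to any particular paper):

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """Standard membership update: w_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)  # (n, c)
    d = np.fmax(d, 1e-12)                         # guard against division by zero
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)                # (n, c) partition matrix

def fcm_centers(X, W, m=2.0):
    """Standard center update: c_j = sum_i w_ij^m x_i / sum_i w_ij^m."""
    Wm = W ** m
    return (Wm.T @ X) / Wm.sum(axis=0)[:, None]
```

Alternating these two updates until the partition matrix stops changing is the whole algorithm; with m -> 1 the memberships harden and FCM degenerates toward k-means.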
  • 44. K-Means + Canopy: An Integrated Clustering Framework Using Optimized K-means with Firefly and Canopies (2015) Proposed as an integration of two algorithms: the Firefly meta-heuristic and Canopy clustering 44 ApproximationTwo-stage
  • 45. K-medoids Clustering Based on MapReduce and Optimal Search of Medoids (2014) Proposes an improved algorithm based on MapReduce and an optimal search of medoids. Using basic properties of triangle geometry (the triangle inequality), the paper reduces the number of distance calculations between data elements, helping to find medoids quickly and lowering the computational complexity of k-medoids. 45 MROptimalSearch
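A common form of such triangle-inequality pruning, shown here as a generic sketch rather than the paper's exact procedure, lower-bounds an unknown distance from two known ones and skips the exact computation when the bound cannot beat the current best:

```python
def can_skip(d_x_m, d_m_c, best_so_far):
    """By the triangle inequality, d(x, c) >= |d(x, m) - d(m, c)|.
    If this lower bound already meets the best distance found so far,
    the exact distance d(x, c) need not be computed at all."""
    return abs(d_x_m - d_m_c) >= best_so_far
```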