Running Head: CLUSTERING TECHNIQUES
CLUSTERING TECHNIQUES
NAME:
INSTITUTION:
Introduction
Clustering and classification are both fundamental activities in data mining. Classification is
used mostly as a supervised learning process, while clustering is used for unsupervised learning,
although some clustering models serve both purposes (Raymond & Jiawei, 2003). The main aim of
clustering is descriptive, whereas that of classification is predictive. Because the aim of
clustering is to discover a new set of categories, the new categories are of interest in
themselves and their assessment is intrinsic. In classification, by contrast, an important part
of the assessment is extrinsic, since the groups must reflect a reference set of classes
(Jiawei & Michelle, 2001). Similarity, on the other hand, is a measure of how alike two items
are. It can be seen as a numerical score derived from the distance between data objects,
typically represented as a value between 0 (not similar) and 1 (completely similar). The triangle
inequality may or may not hold, depending on the similarity metric used, but two properties must
be maintained: the similarity value must fall between 0 and 1, and the measure must be symmetric
(Dunn, 2004). Symmetry means that for all x and y, the similarity of x and y is the same as the
similarity of y and x (Achtert et al., 2007).
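The two required properties can be checked on a small example. The sketch below assumes one common way of turning a distance into a similarity, s = 1 / (1 + d) with Euclidean distance d; the conversion and the function names are illustrative choices, not something the cited sources prescribe.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similarity(x, y):
    """Assumed conversion: map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + euclidean(x, y))

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
for p in points:
    for q in points:
        s = similarity(p, q)
        assert 0.0 <= s <= 1.0        # range property: value lies between 0 and 1
        assert s == similarity(q, p)  # symmetry property: s(x, y) equals s(y, x)
```

Identical objects receive a similarity of 1, and the score approaches 0 as the objects move further apart.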
Advantages of the Clustering Techniques
Change requests can easily be consolidated according to structured data whose value domain is
completely defined. Questions such as how many change requests have been submitted for a given
priority, severity level, or category, and which ones are still in the Open state, can be answered
with a simple IBM® Rational Team Concert™ query (Raymond & Jiawei, 2003).
Some advantages can be obtained only by deriving intelligence from unstructured data, and
combining change requests based on unstructured data is not trivial. One approach is to
investigate Rational Team Concert change request patterns by tokenizing the text attributes and
applying machine learning techniques, specifically clustering algorithms that group change
requests by similarity. With this type of analysis, software development teams benefit in the
following areas (Achtert et al., 2007).
• Quality improvement: If many change requests are associated with a common theme, there might
be an opportunity to improve the process related to that area, which would reduce the number of
future issues (Jiawei & Michelle, 2001).
• Reuse: Change requests in the same group might be solved by a similar approach, by following
an overall framework or applying the same solution pattern.
• Finding duplicates: Before submitting a new change request, it is more efficient to search for
duplicates by checking similar requests.
• Collaboration patterns: Understanding which team members contribute to solving related change
requests can help substantiate organizational change decisions, refine career goals, and develop
or improve skills (Achtert et al., 2007).
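As a rough illustration of tokenizing text attributes and comparing change requests, the sketch below splits each summary into lowercase word tokens and scores pairs by token overlap. The overlap score used here is the Jaccard coefficient, which is a stand-in chosen for brevity rather than the method of the cited approach; the sample request texts, the function names, and the 0.5 threshold are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token overlap of two texts: shared tokens divided by all tokens (0..1)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def likely_duplicates(new_request, existing, threshold=0.5):
    """Return existing requests whose overlap with the new one meets the threshold."""
    return [r for r in existing if jaccard(new_request, r) >= threshold]

# Hypothetical change request summaries, invented for this example.
requests = [
    "Login page crashes on empty password",
    "Crash on login page when password is empty",
    "Add export to CSV for reports",
]

candidates = likely_duplicates(requests[1], [requests[0], requests[2]])
```

Here the first two summaries share five of nine distinct tokens, so the first request is flagged as a possible duplicate of the second, while the unrelated export request is not.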
Clustering methods
Clustering and classification are fundamental activities in data mining. Classification is
mostly used as a supervised learning procedure, while clustering is used for unsupervised
learning, and some clustering models serve both (Raymond & Jiawei, 2003). The aim of
clustering is descriptive, and that of classification is predictive. The new groups found by
clustering are of interest in themselves, and their assessment is intrinsic. In classification
tasks, an important part of the assessment is extrinsic, since the groups must reflect some
reference set of classes (Dunn, 2004). Clustering groups data instances into subsets in such a
manner that similar instances are grouped together, while different instances belong to different
groups. The instances are thereby organized into an efficient representation that characterizes
the population being sampled (Achtert et al., 2007).
Formally, the clustering structure is represented as a set of subsets $C = \{C_1, \ldots, C_k\}$
of $S$, such that $S = \bigcup_{i=1}^{k} C_i$ and $C_i \cap C_j = \emptyset$ for $i \neq j$.
Consequently, any instance in S belongs to exactly one subset. Clustering of objects is as
ancient as the human need for describing the salient characteristics of men and objects and
identifying them with a type (Raymond & Jiawei, 2003). It therefore embraces various scientific
disciplines, from mathematics and statistics to biology and genetics, each of which uses
different terms to describe the topologies formed using this analysis. From biological taxonomies
to medical syndromes and genetic genotypes to manufacturing group technology, the problem is
identical: forming categories of entities and assigning individuals to the proper groups within
it (Jiawei & Michelle, 2001).
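The requirement that the subsets cover S and do not overlap can be tested directly. The following is a minimal sketch, with a hypothetical helper name, that checks whether a list of clusters forms such a partition of a set of instances.

```python
def is_partition(clusters, universe):
    """Return True if the clusters are pairwise disjoint and their union is universe."""
    seen = set()
    for cluster in clusters:
        members = set(cluster)
        if seen & members:          # an instance already placed in another subset
            return False
        seen |= members
    return seen == set(universe)    # every instance of S is covered, and nothing else

S = {1, 2, 3, 4, 5}
valid = is_partition([{1, 2}, {3}, {4, 5}], S)       # disjoint and complete
overlapping = is_partition([{1, 2}, {2, 3, 4, 5}], S)  # instance 2 appears twice
incomplete = is_partition([{1, 2}, {3}], S)          # instances 4 and 5 are missing
```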
Since clustering is the grouping of similar objects, some measure that can determine whether two
objects are similar or dissimilar is required (Dunn, 2004). Two main types of measures are used
to estimate this relation: distance measures and similarity measures. Many clustering methods use
distance measures to determine the similarity or dissimilarity between any pair of objects. It is
useful to denote the distance between two instances xi and xj as d(xi, xj). A valid distance
measure should be symmetric and should obtain its minimum value (usually zero) in the case of
identical vectors (Raymond & Jiawei, 2003). The distance measure is called a metric distance
measure if it also satisfies the following properties:
1. Triangle inequality: d(xi, xk) ≤ d(xi, xj) + d(xj, xk) for all xi, xj, xk in S.
2. d(xi, xj) = 0 implies xi = xj for all xi, xj in S.
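For a metric distance measure, the additional properties are the triangle inequality, d(xi, xk) ≤ d(xi, xj) + d(xj, xk), and the requirement that a distance of zero occurs only between identical points. The sketch below checks these properties, together with symmetry, on a small sample of points; the helper names are illustrative, and passing the check on a sample does not prove the property in general. It also shows that the squared Euclidean distance, although symmetric, fails the triangle inequality.

```python
import math
from itertools import product

def euclidean(x, y):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Squared Euclidean distance: symmetric, but not a metric."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def is_metric_on(points, d, tol=1e-12):
    """Check symmetry, identity, and the triangle inequality on the given points."""
    for x, y in product(points, repeat=2):
        if abs(d(x, y) - d(y, x)) > tol:
            return False                      # not symmetric
        if (d(x, y) <= tol) != (x == y):
            return False                      # zero distance must mean identical points
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False                      # triangle inequality violated
    return True

sample = [(0.0,), (1.0,), (2.0,)]
euclidean_ok = is_metric_on(sample, euclidean)
squared_ok = is_metric_on(sample, squared_euclidean)  # d(0, 2) = 4 > 1 + 1
```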
K-means clustering aims to partition n observations into k clusters in which each observation
belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. It
results in a partitioning of the data space into Voronoi cells. K-means is one of the simplest
unsupervised learning algorithms that solve the well-known clustering problem (Jiawei & Michelle,
2001). The procedure follows a simple way to classify a given data set through a certain number
of clusters (assume k clusters) fixed a priori. The essential idea is to define k centroids, one
for each cluster. These centroids should be placed carefully, because different locations cause
different results. A good choice is therefore to place them as far away from each other as
possible.
The next step is to take each point belonging to the given data set and associate it with the
nearest centroid (Achtert et al., 2007). When no point is pending, the first step is completed
and an early grouping is done. At this point, k new centroids are recalculated as the barycenters
of the clusters resulting from the previous step. With these k new centroids, a new binding is
made between the same data points and the nearest new centroid. A loop has thus been generated.
As a result of this loop, the k centroids change their location step by step until no more
changes occur, that is, until the centroids do not move any further.
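Finally, the algorithm aims at minimizing an objective function, in this case a squared error
function:
$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$
where $\left\| x_i^{(j)} - c_j \right\|^2$ is a chosen distance measure between a data point
$x_i^{(j)}$ and the cluster center $c_j$, and $J$ is an indicator of the distance of the n data
points from their respective cluster centers.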
The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered.
These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a partition
of the objects into groups from which the metric to be minimized can be calculated.
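The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the initial centroids are drawn at random from the data, squared Euclidean distance is used, and an empty group simply keeps its previous centroid. The function names, the seed parameter, and the sample points are all choices made for this example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean (barycenter) of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # Step 1: initial centroids
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:                           # Step 2: assign to closest centroid
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            groups[nearest].append(p)
        new_centroids = [mean(g) if g else centroids[i]   # Step 3: recalculate
                         for i, g in enumerate(groups)]
        if new_centroids == centroids:             # Step 4: stop when nothing moves
            break
        centroids = new_centroids
    # Value of the squared-error metric for the final grouping.
    objective = sum(squared_distance(p, centroids[i])
                    for i, g in enumerate(groups) for p in g)
    return centroids, groups, objective

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, groups, objective = kmeans(points, k=2)
```

On this small, well-separated data set the loop settles on the two visually obvious groups of three points each.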
Although it can be proved that the procedure will always terminate, the k-means algorithm does
not necessarily find the optimal configuration, corresponding to the global objective function
minimum (Raymond & Jiawei, 2003). The algorithm is also significantly sensitive to the initial
randomly selected cluster centers. The k-means algorithm can be run several times to reduce this
effect (Jiawei & Michelle, 2001).
K-means is a simple algorithm that has been adapted to many problem domains. It is also a good
candidate for extension to work with fuzzy feature vectors.
For example, suppose that we have n sample feature vectors x1, x2, ..., xn all from the same
class, and that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in
cluster i. If the clusters are well separated, we can use a minimum-distance classifier to
separate them. That is, we can say that x is in cluster i if || x - mi || is the minimum of all
the k distances. This suggests the following procedure for finding the k means: make initial
guesses for the means m1, m2, ..., mk; then, until no change is noticed in any mean, use the
estimated means to classify the samples into clusters and replace each mi with the mean of the
samples assigned to cluster i (Achtert et al., 2007).
The k-means procedure is a simple version; it can be viewed as a greedy algorithm for partitioning
the n samples into k clusters so as to minimize the sum of the squared distances to the
cluster centers. It has some weaknesses:
• The way to initialize the means is not specified. One popular way to start is to choose k of
the samples randomly.
• The results produced depend on the initial values for the means, and it frequently happens
that suboptimal partitions are found. The standard solution is to try a number of different
starting points.
• It might happen that the set of samples nearest to mi is empty, so that mi cannot be updated.
This situation must be handled explicitly in an implementation.
• The results depend on the metric used to measure || x - mi ||. A popular solution is to
normalize each variable by its standard deviation, though this is not always desirable.
• The results depend on the value of k.
This last problem is particularly troublesome, since we often have no way of knowing how
many clusters exist.
Unfortunately, there is no general theoretical solution for finding the optimal number of
clusters for any given data set. A simple approach is to compare the results of multiple runs
with different values of k and choose the best one according to a given criterion. Care is
required, however, because increasing k results in smaller error function values by definition,
but also in an increasing risk of overfitting (Raymond & Jiawei, 2003).
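The effect of k on the error function can be seen on a toy data set. The sketch below is an illustration under assumptions: it uses one-dimensional data, random initial centers drawn from the samples, and several restarts per value of k, keeping the lowest sum of squared errors (SSE). The data values, function names, and number of restarts are hypothetical choices.

```python
import random

def kmeans_1d_sse(values, k, rng, max_iter=100):
    """One k-means run on 1-D data; returns the sum of squared errors (SSE)."""
    centers = rng.sample(values, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: (v - centers[i]) ** 2)
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        updated = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sum((v - centers[i]) ** 2 for i, g in enumerate(groups) for v in g)

def best_sse(values, k, restarts=10, seed=0):
    """Lowest SSE over several random starting points, to reduce sensitivity to them."""
    rng = random.Random(seed)
    return min(kmeans_1d_sse(values, k, rng) for _ in range(restarts))

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 20.0, 21.0]
sse_by_k = {k: best_sse(values, k) for k in range(1, len(values) + 1)}
```

With a single cluster the SSE is the total squared deviation from the overall mean, and with as many clusters as samples it drops to zero, which is why the raw error alone cannot be used to pick k.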
Advantages of k-means clustering
Time Complexity
According to Shehroz Khan (2015), in terms of execution time, K-means is linear in
the number of data objects, i.e. O(n), where n is the number of data objects. The time
complexity of most hierarchical clustering algorithms is quadratic, i.e. O(n²). For the same
amount of data, hierarchical clustering therefore takes an amount of time that grows
quadratically with the number of objects (Jiawei & Michelle, 2001).
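A back-of-the-envelope count makes the difference concrete. The sketch below assumes that k-means evaluates one distance per object and centroid in each iteration, and that a hierarchical method starts from all pairwise distances; the values of k and the iteration count are hypothetical.

```python
def kmeans_distance_evals(n, k, iterations):
    """Distance evaluations for k-means: every object against every centroid, per iteration."""
    return n * k * iterations

def pairwise_distance_evals(n):
    """Distance evaluations for a full pairwise distance matrix: n choose 2."""
    return n * (n - 1) // 2

for n in (1_000, 10_000):
    linear = kmeans_distance_evals(n, k=10, iterations=20)
    quadratic = pairwise_distance_evals(n)
    print(n, linear, quadratic)
```

Multiplying n by ten multiplies the k-means count by ten but the pairwise count by roughly one hundred.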
Shape of Clusters
K-means works well when the shape of the clusters is hyper-spherical (or
circular in 2 dimensions). If the natural clusters occurring in the dataset are not spherical,
K-means is probably not the best option (Dunn, 2004).
Repeatability
K-means starts with a random choice of cluster centers; it may, therefore, yield
different clustering results on different runs of the algorithm (Achtert et al., 2007). Hence,
the results might lack consistency and not be repeatable. Hierarchical clustering, by contrast,
will most definitely result in the same clustering on every run.
Cosine similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product
space that measures the cosine of the angle between them (Dunn, 2004). The cosine of 0 degrees
is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not of
magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at
90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1,
independent of their magnitude. Cosine similarity is particularly used in positive space, where
the outcome is neatly bounded in [0, 1] (Raymond & Jiawei, 2003).
Note that these bounds apply for any number of dimensions, and cosine similarity is most
commonly used in high-dimensional positive spaces (Achtert et al., 2007). For example, in text
mining and information retrieval, each term is notionally assigned a different dimension, and a
document is characterized by a vector where the value of each dimension corresponds to the number
of times that term appears in the document. Cosine similarity then gives a useful measure of
how similar two documents are likely to be in terms of their subject matter.
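The document-vector idea can be sketched as follows. Each document is reduced to a term-count vector, stored sparsely as a `Counter`, and the cosine of the angle between two such vectors is computed from their dot product and norms. Whitespace tokenization and the sample sentences are simplifying assumptions made for this example.

```python
import math
from collections import Counter

def term_vector(text):
    """Term-count vector of a document, using naive whitespace tokenization."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an empty document")
    return dot / (norm_a * norm_b)

doc = "data mining finds patterns in data"
same_topic_longer = doc + " " + doc      # same term proportions, twice the length
unrelated = "birds fly south"

same_orientation = cosine_similarity(term_vector(doc), term_vector(same_topic_longer))
no_shared_terms = cosine_similarity(term_vector(doc), term_vector(unrelated))
```

Doubling the document changes the magnitude of its vector but not its orientation, so the similarity stays at 1 (up to floating-point rounding), while documents with no shared terms score 0.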
The technique is also used to measure cohesion within clusters in the field of data mining.
Cosine distance is a term used for the complement of cosine similarity in positive space. It is
important to note that it is not a proper distance metric, as it does not have the triangle
inequality property and it violates the coincidence axiom; to repair the triangle inequality
property while keeping the original ordering, it is necessary to convert to angular distance
(Dunn, 2004). One reason for the popularity of cosine similarity is that it is very efficient to
evaluate, especially for sparse vectors, since only the non-zero dimensions need to be considered
(Achtert et al., 2007).
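A small numeric check shows both defects of cosine distance and the effect of converting to angular distance. The sketch assumes the angular distance is the angle between the vectors divided by π, which is one common normalization; the three vectors are chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two dense, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Complement of cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

def angular_distance(u, v):
    """Angle between the vectors divided by pi (assumed normalization)."""
    s = max(-1.0, min(1.0, cosine_similarity(u, v)))  # guard against rounding
    return math.acos(s) / math.pi

a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)   # at 0, 45 and 90 degrees

# Cosine distance: going directly from a to c "costs" more than the detour via b.
violates = cosine_distance(a, c) > cosine_distance(a, b) + cosine_distance(b, c)

# Angular distance: the detour via b is never shorter than the direct path.
holds = angular_distance(a, c) <= angular_distance(a, b) + angular_distance(b, c) + 1e-12

# Coincidence axiom: distinct vectors with the same direction have distance 0.
zero_for_distinct = cosine_distance((1.0, 0.0), (2.0, 0.0)) == 0.0
```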
References
Achtert, E., Böhm, C., Kriegel, H. P., Kröger, P., & Zimek, A. (2007). On Exploring Complex
Relationships of Correlation Clusters. 19th International Conference on Scientific and
Statistical Database Management.
Dunn, J. (2004). Well separated clusters and optimal fuzzy partitions.
Journal of Cybernetics.
Jiawei Han & Michelle Kamber (2001). Data Mining: Concepts & Techniques.
Morgan Kaufmann.
Raymond, T. N. & Jiawei, H. (2003). Efficient and Effective Clustering Methods for Spatial Data
Mining. Santiago, Chile: Morgan Kaufmann.
Ad

More Related Content

What's hot (17)

Survey on Unsupervised Learning in Datamining
Survey on Unsupervised Learning in DataminingSurvey on Unsupervised Learning in Datamining
Survey on Unsupervised Learning in Datamining
IOSR Journals
 
Cluster Analysis
Cluster Analysis Cluster Analysis
Cluster Analysis
Dr Athar Khan
 
Cluster Analysis
Cluster Analysis Cluster Analysis
Cluster Analysis
Baivab Nag
 
Program_Cluster_Analysis
Program_Cluster_AnalysisProgram_Cluster_Analysis
Program_Cluster_Analysis
Sammya Sengupta
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basic
Houw Liong The
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
IJRES Journal
 
Cluster Analysis Assignment 2013-2014(2)
Cluster Analysis Assignment 2013-2014(2)Cluster Analysis Assignment 2013-2014(2)
Cluster Analysis Assignment 2013-2014(2)
TIEZHENG YUAN
 
4 image segmentation through clustering
4 image segmentation through clustering4 image segmentation through clustering
4 image segmentation through clustering
IAEME Publication
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Jewel Refran
 
Cs501 cluster analysis
Cs501 cluster analysisCs501 cluster analysis
Cs501 cluster analysis
Kamal Singh Lodhi
 
Marketing analytics - clustering Types
Marketing analytics - clustering TypesMarketing analytics - clustering Types
Marketing analytics - clustering Types
Suryakumar Thangarasu
 
L4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics CourseL4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics Course
Mohaiminur Rahman
 
Clusters techniques
Clusters techniquesClusters techniques
Clusters techniques
rajshreemuthiah
 
K044055762
K044055762K044055762
K044055762
IJERA Editor
 
Rajia cluster analysis
Rajia cluster analysisRajia cluster analysis
Rajia cluster analysis
College of Fisheries, KVAFSU, Mangalore, Karnataka
 
Novel text categorization by amalgamation of augmented k nearest neighbourhoo...
Novel text categorization by amalgamation of augmented k nearest neighbourhoo...Novel text categorization by amalgamation of augmented k nearest neighbourhoo...
Novel text categorization by amalgamation of augmented k nearest neighbourhoo...
ijcsity
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
s v
 
Survey on Unsupervised Learning in Datamining
Survey on Unsupervised Learning in DataminingSurvey on Unsupervised Learning in Datamining
Survey on Unsupervised Learning in Datamining
IOSR Journals
 
Cluster Analysis
Cluster Analysis Cluster Analysis
Cluster Analysis
Baivab Nag
 
Program_Cluster_Analysis
Program_Cluster_AnalysisProgram_Cluster_Analysis
Program_Cluster_Analysis
Sammya Sengupta
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basic
Houw Liong The
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
IJRES Journal
 
Cluster Analysis Assignment 2013-2014(2)
Cluster Analysis Assignment 2013-2014(2)Cluster Analysis Assignment 2013-2014(2)
Cluster Analysis Assignment 2013-2014(2)
TIEZHENG YUAN
 
4 image segmentation through clustering
4 image segmentation through clustering4 image segmentation through clustering
4 image segmentation through clustering
IAEME Publication
 
Marketing analytics - clustering Types
Marketing analytics - clustering TypesMarketing analytics - clustering Types
Marketing analytics - clustering Types
Suryakumar Thangarasu
 
L4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics CourseL4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics Course
Mohaiminur Rahman
 
Novel text categorization by amalgamation of augmented k nearest neighbourhoo...
Novel text categorization by amalgamation of augmented k nearest neighbourhoo...Novel text categorization by amalgamation of augmented k nearest neighbourhoo...
Novel text categorization by amalgamation of augmented k nearest neighbourhoo...
ijcsity
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
s v
 

Similar to Clustering techniques final (20)

8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
Laura Petrosanu
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
Nandhini S
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
refedey275
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithm
ijsrd.com
 
Max stable set problem to found the initial centroids in clustering problem
Max stable set problem to found the initial centroids in clustering problemMax stable set problem to found the initial centroids in clustering problem
Max stable set problem to found the initial centroids in clustering problem
nooriasukmaningtyas
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
SowmyaJyothi3
 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
Eng. Dr. Dennis N. Mwighusa
 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
IJERD Editor
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt
Premkumar R
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt
PalaniKumarR2
 
A comprehensive survey of contemporary
A comprehensive survey of contemporaryA comprehensive survey of contemporary
A comprehensive survey of contemporary
prjpublications
 
4 image segmentation through clustering
4 image segmentation through clustering4 image segmentation through clustering
4 image segmentation through clustering
prjpublications
 
partitioning methods in data mining .pptx
partitioning methods in data mining .pptxpartitioning methods in data mining .pptx
partitioning methods in data mining .pptx
BodhanLaxman1
 
K means report
K means reportK means report
K means report
Gaurav Handa
 
Analysis On Classification Techniques In Mammographic Mass Data Set
Analysis On Classification Techniques In Mammographic Mass Data SetAnalysis On Classification Techniques In Mammographic Mass Data Set
Analysis On Classification Techniques In Mammographic Mass Data Set
IJERA Editor
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
IOSR Journals
 
Optimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmOptimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering Algorithm
IJERA Editor
 
Clustering
ClusteringClustering
Clustering
NLPseminar
 
1376846406 14447221
1376846406  144472211376846406  14447221
1376846406 14447221
Editor Jacotech
 
47 292-298
47 292-29847 292-298
47 292-298
idescitation
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
Laura Petrosanu
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
Nandhini S
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
refedey275
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithm
ijsrd.com
 
Max stable set problem to found the initial centroids in clustering problem
Max stable set problem to found the initial centroids in clustering problemMax stable set problem to found the initial centroids in clustering problem
Max stable set problem to found the initial centroids in clustering problem
nooriasukmaningtyas
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
SowmyaJyothi3
 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
IJERD Editor
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt
Premkumar R
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt
PalaniKumarR2
 
A comprehensive survey of contemporary
A comprehensive survey of contemporaryA comprehensive survey of contemporary
A comprehensive survey of contemporary
prjpublications
 
4 image segmentation through clustering
4 image segmentation through clustering4 image segmentation through clustering
4 image segmentation through clustering
prjpublications
 
partitioning methods in data mining .pptx
partitioning methods in data mining .pptxpartitioning methods in data mining .pptx
partitioning methods in data mining .pptx
BodhanLaxman1
 
Analysis On Classification Techniques In Mammographic Mass Data Set
Analysis On Classification Techniques In Mammographic Mass Data SetAnalysis On Classification Techniques In Mammographic Mass Data Set
Analysis On Classification Techniques In Mammographic Mass Data Set
IJERA Editor
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
IOSR Journals
 
Optimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmOptimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering Algorithm
IJERA Editor
 
Ad

Recently uploaded (20)

Ethics of Bird Watching : Guide for the Birders
Ethics of Bird Watching : Guide for the BirdersEthics of Bird Watching : Guide for the Birders
Ethics of Bird Watching : Guide for the Birders
Rahim Shaikh
 
Water-Pollution-A-Growing-Threat_(1)_(2)[2].pptx
Water-Pollution-A-Growing-Threat_(1)_(2)[2].pptxWater-Pollution-A-Growing-Threat_(1)_(2)[2].pptx
Water-Pollution-A-Growing-Threat_(1)_(2)[2].pptx
baquaullah786
 
Environmental Studies : Types of Ecosystem.pptx
Environmental Studies : Types of Ecosystem.pptxEnvironmental Studies : Types of Ecosystem.pptx
Environmental Studies : Types of Ecosystem.pptx
vvsasane
 
Land Utilization (Agricultural, Pastoral, Horticultural.pdf
Land Utilization (Agricultural, Pastoral, Horticultural.pdfLand Utilization (Agricultural, Pastoral, Horticultural.pdf
Land Utilization (Agricultural, Pastoral, Horticultural.pdf
Nistarini College, Purulia (W.B) India
 
Chisinau Team „Tomorrow Together” Project
Chisinau Team „Tomorrow Together” ProjectChisinau Team „Tomorrow Together” Project
Chisinau Team „Tomorrow Together” Project
Daniela Munca-Aftenev
 
文凭新西兰AIS文凭奥克兰商学院录取通知书假毕业证购买
文凭新西兰AIS文凭奥克兰商学院录取通知书假毕业证购买文凭新西兰AIS文凭奥克兰商学院录取通知书假毕业证购买
文凭新西兰AIS文凭奥克兰商学院录取通知书假毕业证购买
Taqyea
 
The Role of Technology in Modern Flood Risk Management Services
The Role of Technology in Modern Flood Risk Management ServicesThe Role of Technology in Modern Flood Risk Management Services
The Role of Technology in Modern Flood Risk Management Services
wrightcontractingseo
 
10 Air and Water Pollution Events Before and After 2010.pptx
10 Air and Water Pollution Events Before and After 2010.pptx10 Air and Water Pollution Events Before and After 2010.pptx
10 Air and Water Pollution Events Before and After 2010.pptx
Monoarul Haq Omy
 
Study of Certain Behavior of Rhesus Macaques
Study of Certain Behavior of Rhesus MacaquesStudy of Certain Behavior of Rhesus Macaques
Study of Certain Behavior of Rhesus Macaques
Rahim Shaikh
 
Optimisation of the wastewater treatment plant.pdf
Optimisation of the wastewater treatment plant.pdfOptimisation of the wastewater treatment plant.pdf
Optimisation of the wastewater treatment plant.pdf
ssuser6a09bd
 
Nicaragua Jacqueline Villachica 2024.pptx
Nicaragua Jacqueline Villachica 2024.pptxNicaragua Jacqueline Villachica 2024.pptx
Nicaragua Jacqueline Villachica 2024.pptx
pruebasgratisnica
 
Patterns of Evolution ; Patterns of Evolution ; Patterns of Evolution
Patterns of Evolution ; Patterns of Evolution ; Patterns of EvolutionPatterns of Evolution ; Patterns of Evolution ; Patterns of Evolution
Patterns of Evolution ; Patterns of Evolution ; Patterns of Evolution
DrSnehaVerma1
 
德国法兰克福大学学历认证范本学生卡(成绩单复刻)制作
德国法兰克福大学学历认证范本学生卡(成绩单复刻)制作德国法兰克福大学学历认证范本学生卡(成绩单复刻)制作
德国法兰克福大学学历认证范本学生卡(成绩单复刻)制作
Taqyea
 
Eco-web Udaipur Newsletter, April 2025 Vol 1(2).pdf
Eco-web Udaipur Newsletter, April 2025 Vol 1(2).pdfEco-web Udaipur Newsletter, April 2025 Vol 1(2).pdf
Eco-web Udaipur Newsletter, April 2025 Vol 1(2).pdf
llsharma1
 
Agroforestry for Ecosystem Services.pptx
Agroforestry for Ecosystem Services.pptxAgroforestry for Ecosystem Services.pptx
Agroforestry for Ecosystem Services.pptx
RitikaMaurya20
 
Indian Monsoon and it's impact of the region of India
Indian Monsoon and it's impact of the region of IndiaIndian Monsoon and it's impact of the region of India
Indian Monsoon and it's impact of the region of India
adobefirefly678
 
Annual Action Plan for Agriculture and Allied Department
Annual Action Plan for Agriculture and Allied DepartmentAnnual Action Plan for Agriculture and Allied Department
Annual Action Plan for Agriculture and Allied Department
kvksatna1
 
life below water, a presentation by kendall page
life below water, a presentation by kendall pagelife below water, a presentation by kendall page
life below water, a presentation by kendall page
wallowcity
 
SULFUR CYCLE powerpoint presentation for UG
SULFUR CYCLE powerpoint presentation for UGSULFUR CYCLE powerpoint presentation for UG
SULFUR CYCLE powerpoint presentation for UG
agritricks2000
 
Soil & Nuclear Pollution, Solid Waste, Noise pollution, Hazardous Waste
Soil & Nuclear Pollution, Solid Waste, Noise pollution, Hazardous WasteSoil & Nuclear Pollution, Solid Waste, Noise pollution, Hazardous Waste
Soil & Nuclear Pollution, Solid Waste, Noise pollution, Hazardous Waste
Dr. Manoj Garg
 
Ethics of Bird Watching : Guide for the Birders
Ethics of Bird Watching : Guide for the BirdersEthics of Bird Watching : Guide for the Birders
Ethics of Bird Watching : Guide for the Birders
Rahim Shaikh
 
Water-Pollution-A-Growing-Threat_(1)_(2)[2].pptx
Clustering techniques final

  • 1. Running Head: CLUSTERING TECHNIQUES CLUSTERING TECHNIQUES NAME: INSTITUTION:
  • 2. Introduction
    Clustering and classification are both fundamental tasks in data mining. Classification is used mostly as a supervised learning method and clustering as an unsupervised one, although some clustering models serve both purposes (Raymond & Jiawei, 2003). The goal of clustering is descriptive; that of classification is predictive. Because clustering aims to discover a new set of categories, the new groups are of interest in themselves and their assessment is intrinsic. In classification, by contrast, an important part of the assessment is extrinsic, since the groups must reflect a reference set of classes (Jiawei & Michelle, 2001). Similarity is a measure of how alike two or more items are. It can be viewed as a numerical measure of closeness between data objects, typically expressed as a value between 0 (not similar) and 1 (completely similar). The triangle inequality may or may not hold, depending on the similarity measure used, but two properties must always hold: the similarity value must lie between 0 and 1, and the measure must be symmetric (Dunn, 2004). Symmetry means that for all x and y, the similarity of x and y equals the similarity of y and x (Achtert et al., 2007).
    Advantages of the Clustering Techniques
    Change requests can easily be consolidated according to structured data whose value domain is completely defined. Questions such as how many change requests have been submitted for a given priority, severity level, or category, and which ones are still in the Open state, can be answered with a simple IBM® Rational Team Concert™ query (Raymond & Jiawei, 2003). Some advantages can be obtained only by deriving intelligence from unstructured data, and combining change requests based on unstructured data is not trivial.
  • 3. This paper therefore describes an approach for investigating Rational Team Concert change request patterns by tokenizing the text attributes and applying machine learning techniques, specifically clustering algorithms that group change requests by similarity. With this type of analysis, software development teams benefit in the following areas (Achtert et al., 2007).
    Quality improvement: if many change requests are associated with a common theme, there may be an opportunity to improve the process related to that area, which would reduce the number of future issues (Jiawei & Michelle, 2001).
    Reuse: change requests in the same group might be solved by a similar approach, by following an overall framework or applying the same solution pattern.
    Finding duplicates: checking for similar requests before submitting a new change request is an efficient way to search for duplicates.
    Collaboration patterns: understanding which team members contribute to solving related change requests can help substantiate organizational change decisions, refine career goals, and develop or improve skills (Achtert et al., 2007).
    Clustering methods
    Clustering and classification are fundamental activities in data mining. Classification is mostly used as a supervised learning procedure while clustering is used for unsupervised learning, and some clustering models serve both (Raymond & Jiawei, 2003). The aim of clustering is descriptive, and that of classification is predictive. In clustering, the new groups are of interest in themselves and their assessment is intrinsic. In classification tasks, an important part of the assessment is extrinsic, since the groups must reflect some reference set of classes (Dunn, 2004).
  • 4. Clustering groups data instances into subsets in such a manner that similar instances are grouped together, while different instances belong to different groups. The instances are thereby organized into an efficient representation that characterizes the population being sampled (Achtert et al., 2007). Formally, the clustering structure is represented as a set of subsets $C = \{C_1, \ldots, C_k\}$ of $S$ such that $S = \bigcup_{i=1}^{k} C_i$ and $C_i \cap C_j = \emptyset$ for $i \neq j$. Consequently, any instance in S belongs to exactly one subset. Clustering of objects is as ancient as the human need for describing the salient characteristics of men and objects and identifying them with a type (Raymond & Jiawei, 2003). It therefore embraces various scientific disciplines, from mathematics and statistics to biology and genetics, each of which uses different terms to describe the topologies formed using this analysis. From biological taxonomies to medical syndromes and from genetic genotypes to manufacturing group technology, the problem is identical: forming categories of entities and assigning individuals to the proper groups within them (Jiawei & Michelle, 2001). Since clustering is the grouping of similar objects, some measure that can determine whether two objects are similar or dissimilar is required (Dunn, 2004). Two types of measures are used to estimate this relation: similarity measures and distance measures. Many clustering methods use distance measures to determine the similarity or dissimilarity between pairs of objects. It is useful to denote the distance between two instances $x_i$ and $x_j$ as $d(x_i, x_j)$. A valid distance measure should be symmetric and should attain its minimum value (usually zero) for identical vectors (Raymond & Jiawei, 2003).
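The partition condition and the basic requirements on a distance measure can be checked mechanically. The following is a minimal sketch, assuming instances are hashable values (for the partition check) or numeric tuples (for the distance); the function names `is_partition` and `euclidean` are illustrative and not taken from the source.

```python
import math


def is_partition(clusters, universe):
    """Return True if the clusters are pairwise disjoint and cover the universe."""
    covered = set()
    for cluster in clusters:
        members = set(cluster)
        if covered & members:  # an instance already belongs to another subset
            return False
        covered |= members
    return covered == set(universe)


def euclidean(x, y):
    """Euclidean distance: symmetric, and zero for identical vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


if __name__ == "__main__":
    S = {1, 2, 3, 4}
    print(is_partition([{1, 2}, {3, 4}], S))     # every instance in exactly one subset
    print(is_partition([{1, 2}, {2, 3, 4}], S))  # instance 2 appears twice
    print(euclidean((0, 0), (3, 4)) == euclidean((3, 4), (0, 0)))
```

Euclidean distance is used here only as a familiar example of a valid distance measure; any other symmetric measure with a zero minimum for identical vectors would fit the same description.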
    The distance measure is called a metric distance measure if it also satisfies the following properties: the triangle inequality, $d(x_i, x_k) \le d(x_i, x_j) + d(x_j, x_k)$ for all $x_i, x_j, x_k \in S$, and $d(x_i, x_j) = 0 \Rightarrow x_i = x_j$ for all $x_i, x_j \in S$.
  • 5. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem (Jiawei & Michelle, 2001). The procedure follows a simple way to classify a given data set into a certain number of clusters, say k clusters, fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations lead to different results; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid (Achtert et al., 2007). When no point is pending, the first step is completed and an early grouping is done. At this point, k new centroids are recalculated as the barycenters of the clusters resulting from the previous step. Once these k new centroids are available, a new binding is made between the same data points and the nearest new centroid. A loop has thus been generated. As a result of this loop, the k centroids change their location step by step until no more changes occur, that is, until the centroids do not move any further. The algorithm aims to minimize a squared error objective function, $J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2$, where $\lVert x_i - c_j \rVert^2$ is a chosen distance measure between a data point $x_i$ and the cluster center $c_j$, and $J$ is an indicator of the distance of the n data points from their respective cluster centers.
  • 6. The algorithm is composed of the following steps:
    1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
    2. Assign each object to the group that has the closest centroid.
    3. When all objects have been assigned, recalculate the positions of the K centroids.
    4. Repeat steps 2 and 3 until the centroids no longer move.
    This produces a partition of the objects into groups from which the metric to be minimized can be calculated. Although it can be proved that the procedure always terminates, the k-means algorithm does not necessarily find the optimal configuration corresponding to the global minimum of the objective function (Raymond & Jiawei, 2003). The algorithm is also significantly sensitive to the initial, randomly selected cluster centers. The k-means algorithm can be run several times to reduce this effect (Jiawei & Michelle, 2001). K-means is a simple algorithm that has been adapted to many problem domains, and it is a good candidate for extension to work with fuzzy feature vectors. Suppose that we have n sample feature vectors x1, x2, ..., xn, all from the same class, and that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. If the clusters are well separated, we can use a minimum-distance classifier to separate them. That is, we can say that x is in cluster i if || x - mi || is the minimum of all the k distances. This suggests the following procedure for finding the k means: make initial guesses for the means m1, m2, ..., mk; then use the estimated means to classify the samples into clusters and replace each mi with the mean of all the samples assigned to cluster i, repeating until there are no changes in any mean (Achtert et al., 2007).
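The four steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: points are numeric tuples, the initial centroids are k samples drawn at random with a fixed `seed`, an empty group simply keeps its previous centroid, and the names `k_means`, `squared_distance` and `mean` are illustrative.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(points):
    """Component-wise mean (barycenter) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k of the samples as the initial centroids.
    centroids = rng.sample(points, k)
    groups = []
    for _ in range(max_iter):
        # Step 2: assign each point to the group with the closest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            groups[nearest].append(p)
        # Step 3: recompute each centroid; an empty group keeps its old centroid.
        new_centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, clusters = k_means(data, k=2)
    print(centers)
    print(clusters)
```

Because the starting centroids are random, different seeds can give different partitions on less well-separated data, which is the sensitivity to initialization described above.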
  • 7. This is a simple version of the k-means procedure. It can be viewed as a greedy algorithm for partitioning the n samples into k clusters so as to minimize the sum of the squared distances to the cluster centers. It has some weaknesses:
    • The way to initialize the means is not specified. One popular way to start is to choose k of the samples at random.
    • The results produced depend on the initial values of the means, and it frequently happens that suboptimal partitions are found. The standard solution is to try a number of different starting points.
    • It can happen that the set of samples closest to mi is empty, so that mi cannot be updated. This is an annoyance that must be handled in an implementation.
    • The results depend on the metric used to measure || x - mi ||. A popular solution is to normalize each variable by its standard deviation, though this is not always desirable.
    • The results depend on the value of k.
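The dependence on the metric can be reduced by rescaling each variable before clustering. The sketch below shows one common reading of that normalization, assuming it means subtracting the mean of each variable and dividing by its population standard deviation; the function name `standardize` is illustrative, and a constant column is mapped to zeros as a design choice of this example.

```python
import statistics


def standardize(points):
    """Rescale each variable (column) to zero mean and unit standard deviation."""
    columns = list(zip(*points))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(p, means, stdevs))
        for p in points
    ]


if __name__ == "__main__":
    # The second variable is on a much larger scale and would dominate || x - mi ||.
    raw = [(1, 100), (2, 200), (3, 300)]
    for row in standardize(raw):
        print(row)
```

After rescaling, both variables contribute equally to the distance. As noted above, this is not always desirable, for example when the larger spread of one variable is itself meaningful.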
  • 8. This last problem is particularly troublesome, since we often have no way of knowing how many clusters exist. Unfortunately, there is no general theoretical solution for finding the optimal number of clusters for a given data set. A simple approach is to compare the results of multiple runs with different values of k and choose the best one according to a given criterion. Care is needed, however: increasing k results in smaller error function values by definition, but it also increases the risk of overfitting (Raymond & Jiawei, 2003).
    Advantages of k-means clustering
    Time Complexity
    According to Shehroz Khan (2015), in terms of execution time K-means is linear in the number of data objects, i.e. O(n), where n is the number of data objects. The time complexity of most hierarchical clustering algorithms is quadratic, i.e. O(n^2). For the same amount of data, hierarchical clustering therefore takes time that grows quadratically with n (Jiawei & Michelle, 2001).
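A small worked example shows why the error function alone cannot be used to choose k. The sketch below assumes one-dimensional data and hand-specified partitions; the function name `sse` (sum of squared errors) and the data values are illustrative.

```python
def sse(clusters):
    """Sum of squared distances of every value to the mean of its own cluster."""
    total = 0.0
    for cluster in clusters:
        center = sum(cluster) / len(cluster)
        total += sum((x - center) ** 2 for x in cluster)
    return total


data = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]

error_k1 = sse([data])                   # one cluster holding everything
error_k2 = sse([data[:3], data[3:]])     # the two natural groups
error_k6 = sse([[x] for x in data])      # every point in its own cluster

print(error_k1, error_k2, error_k6)      # 154.0 4.0 0.0
```

The error drops sharply from k = 1 to k = 2 and reaches zero when every point is its own cluster, even though k = 6 describes nothing useful about the data. A selection criterion therefore has to weigh the reduction in error against the number of clusters.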
  • 9. Shape of Clusters
    K-means works well when the clusters are hyper-spherical (circular in two dimensions). If the natural clusters in the data set are not spherical, K-means is probably not the best option (Dunn, 2004).
    Repeatability
    K-means starts with a random choice of cluster centers, so it may yield different clustering results on different runs of the algorithm (Achtert et al., 2007). Hence, the results might lack consistency and not be repeatable. Hierarchical clustering, by contrast, will definitely produce the same clustering results each time.
    Cosine similarity
    Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them (Dunn, 2004). The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is used particularly in positive space, where the outcome is neatly bounded in [0, 1] (Raymond & Jiawei, 2003). Note that these bounds apply for any number of dimensions, and cosine similarity is most commonly used in high-dimensional positive spaces (Achtert et al., 2007).
  • 10. For example, in information retrieval and text mining, each term is notionally assigned a different dimension, and a document is characterized by a vector in which the value of each dimension corresponds to the number of times that term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter. The technique is also used to measure cohesion within clusters in the field of data mining. Cosine distance is a term often used for the complement in positive space, that is, one minus the cosine similarity. It is important to note that this is not a proper distance metric, as it does not have the triangle inequality property and it violates the coincidence axiom; to repair the triangle inequality property while maintaining the same ordering, it is necessary to convert to angular distance (Dunn, 2004). One reason for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors, as only the non-zero dimensions need to be considered (Achtert et al., 2007).
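The document-vector use of cosine similarity can be sketched as follows. This is a minimal example assuming documents are represented as sparse term-count dictionaries built by whitespace splitting; the function names `term_counts` and `cosine_similarity` are illustrative, and returning 0.0 for an empty document is a convention chosen for this sketch.

```python
import math
from collections import Counter


def term_counts(text):
    """Sparse document vector: term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product,
    # so zero dimensions are never visited.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    d1 = term_counts("clustering groups similar objects")
    d2 = term_counts("clustering groups similar objects clustering groups similar objects")
    d3 = term_counts("water pollution report")
    print(cosine_similarity(d1, d2))  # same orientation, different magnitude
    print(cosine_similarity(d1, d3))  # no shared terms
```

Repeating a document changes its magnitude but not its orientation, so the similarity stays at (approximately) 1, while documents with no terms in common score 0. Because term counts are never negative, the result always falls in [0, 1].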
  • 11. References
    Achtert, E., Bohm, C., Kriegel, H. P., Kröger, P., & Zimek, A. (2007). On Exploring Complex Relationships of Correlation Clusters. 19th International Conference on Scientific and Statistical Database Management.
    Dunn, J. (2004). Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics.
    Jiawei Han & Michelle Kamber (2001). Data Mining: Concepts & Techniques. Morgan Kaufmann.
    Raymond, T. N., & Jiawei, H. (2003). Efficient and Effective Clustering Methods for Spatial Data Mining. Santiago, Chile: Morgan Kaufmann.