SlideShare a Scribd company logo
CSMR: A Scalable Algorithm for 
Text Clustering with Cosine 
Similarity and MapReduce 
Giannakouris – Salalidis Victor - Undergraduate Student 
Plerou Antonia - PhD Candidate 
Sioutas Spyros - Associate Professor
Introduction 
• Big Data: Massive amount of data as a result of the huge 
rate of growth 
• Big Data need to be faced in various domains: Business 
Intelligence, Bioinformatics, Social Media Analytics etc. 
• Text Mining: Classification/Clustering in digital libraries, 
e-mail, Sentiment Analysis on Social Media 
• CSMR: Performs pairwise text similarity, represents text 
data in a vector space and measures similarity in parallel 
manner using MapReduce
Background 
• Vector Space Model: An algebraic model for representing 
text documents as vectors 
• Efficient method for text similarity measurement
TF-IDF 
• Term Frequency – Inverse Document Frequency 
• A numerical statistic that reflects the significance of a 
term in a corpus of documents 
• Usually used in search engines, text mining, text 
similarity in the vector space 
푇퐹 × 퐼퐷퐹 = 
푛푖,푗 
푡 ∈ 푑푗 
× 푙표푔 
|퐷| 
|푑 ∈ 퐷: 푡 ∈ 푑|
Cosine Similarity 
• Cosine Similarity: A measure of similarity between two 
documents represented as vector 
• Measuring of the angle between two vectors 
A  B A  
B 
  
1 
1 2 2 
A  
B 
1 1 
cos(A,B) 
|| A|| || B|| 
( ) ( ) 
n 
i i 
n 
i 
i i 
i i 
 
  
 
 
Hadoop 
• Framework developed by Apache 
• Large-Scale Data Processing and Analytics 
• Scalable and parallel processing of data on large 
computer clusters using MapReduce 
• Runs on commodity, low-end hardware 
• Main Components: HDFS (Hadoop Distributed File 
System), MapReduce 
• Currently used by: Adobe, Yahoo!, Amazon, eBay, 
Facebook and many other companies
MapReduce 
• Programming Paradigm running on Apache Hadoop 
• The main component of Hadoop 
• Useful for processing of large data-sets 
• Breaks the data into key-value pairs 
• Model derived from map and reduce functions of 
Functional Programming 
• Every MR program constitutes of Mappers and Reducers
MapReduce Diagram
CSMR 
• The purposed method, CSMR combines all the above 
mentioned techniques 
• Scalable Algorithm for text clustering using MapReduce model 
• Applies MR model on TF-IDF and Cosine Similarity 
• 4 Phases: 
1. Word Counting 
2. Text Vectorization using term frequencies 
3. Apply TF-IDF on document vectors 
4. Cosine Similarity Measurement
Phase 1: Word Counting 
Algorithm 1: Word Count 
1: class Mapper 
2: method Map( document ) 
3: for each term ∈ document 
4: write ( ( term , docId ) , 1 ) 
5: 
6: class Reducer 
7: method Reduce( ( term , docId ) , ones[ 1 , 1 , … , n ] ) 
8: sum = 0 
9: for each one ∈ ones do 
10: sum = sum +1 
11: return ( ( term , docId ) , o ) 
12: 
13: /* { o ∈ N : the number of occurrences } */
Phase 2: Term Frequency 
Algorithm 2: Term Frequency 
1: class Mapper 
2: method Map( ( term , docId ) , o ) 
3: for each element ∈ ( term , docId ) 
4: write ( docId, ( term, o ) ) 
5: 
6: class Reducer 
7: method Reduce( docId, (term, o) ) 
8: N = 0 
9: for each tuple ∈ ( term, o ) do 
10: N = N + o 
return ( (docId, N), (term, o) )
Phase 3: TF-IDF 
Algorithm 3: Tf-Idf 
1: class Mapper 
2: method Map( ( docId , N ), ( term , o ) ) 
3: for each element ∈ ( term , o ) 
4: write ( term, ( docId, o, N ) ) 
5: 
6: class Reducer 
7: method Reduce( term, ( docId , o , N ) ) 
8: n = 0 
9: for each element ∈ ( docId , o , N ) do 
10: n = n + 1 
11: tf = o / N 
12: idf = log|D| /(1n) 
13: return ( docId, ( term , tf×idf ) ) 
14: 
15: /* Where |D| is the number of documents in the corpus */
Phase 4: Cosine Similarity 
Algorithm 4: Cosine Similarity 
1: class Mapper 
2: method Map( docs ) 
3: n = docs.length 
4: 
5: for i = 0 to docs.length 
6: for j = i+1 to docs.length 
7: write ( ( docs[i].id, docs[j].id ),( docs[i].tfidf, docs[j].tfidf ) ) 
8: 
9: class Reducer 
10: method Reduce( ( docId_A, docId_B ),( docA.tfidf, docB.tfidf ) ) 
11: A = docA.tfidf 
12: B = docB.tfidf 
13: cosine = sum( A×B )/ (sqrt( sum(A2) )× sqrt( sum(B2) )) 
14: return ( (docId_A, docId_B), cosine )
Phase 4: Diagram 
Map 
Doc1,Doc2 
[Doc1 TF-IDF], [Doc2 TF-IDF] 
Doc1,Doc3 
[Doc1 TF-IDF], [Doc3 TF-IDF] 
Doc1,Doc4 
Input [Doc1 TF-IDF], [Doc4 TF-IDF] 
Output 
Doc4,Doc10 
[Doc4 TF-IDF], [Doc10 TF-IDF] 
DocM,DocN 
[DocM TF-IDF], [DocN TF-IDF] 
Reduce 
Doc1,Doc3 
Cosine(Doc1, Doc3) 
Doc1,Doc4 
Cosine(Doc1 ,Doc4) 
Doc4,Doc10 
Cosine(Doc4, Doc10) 
DocM,DocN 
Cosine(DocM, DocN) 
Doc1,Doc2 
Cosine(Doc1, Doc2)
Conclusions & Future Work 
• Finalized proposed method 
• Implementation of the method 
• Experimental tests on real data and computer clusters 
• Deployment of an open-source project 
• Additional implementation using more efficient tools such 
as Apache Spark and Scala 
• Publication of test results
Ad

More Related Content

What's hot (20)

Ir 08
Ir   08Ir   08
Ir 08
Mohammed Romi
 
Vchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joinsVchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joins
Vijay Koushik
 
Scoring, term weighting and the vector space
Scoring, term weighting and the vector spaceScoring, term weighting and the vector space
Scoring, term weighting and the vector space
Ujjawal
 
Web clustering engines
Web clustering enginesWeb clustering engines
Web clustering engines
Yash Darak
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
Krish_ver2
 
Final proj 2 (1)
Final proj 2 (1)Final proj 2 (1)
Final proj 2 (1)
Praveen Kumar
 
Web clustring engine
Web clustring engineWeb clustring engine
Web clustring engine
factscomputersoftware
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
Farzad Nozarian
 
Lect4
Lect4Lect4
Lect4
sumit621
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniques
Poonam Kshirsagar
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor Algorithm
IJTET Journal
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
IJMER
 
Ghost
GhostGhost
Ghost
Jhih-Ming Chen
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
Krish_ver2
 
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
marxliouville
 
A survey of web clustering engines
A survey of web clustering enginesA survey of web clustering engines
A survey of web clustering engines
unyil96
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research Papers
SriTeja Allaparthi
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & Extraction
Deeksha thakur
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
hadifar
 
3.6 constraint based cluster analysis
3.6 constraint based cluster analysis3.6 constraint based cluster analysis
3.6 constraint based cluster analysis
Krish_ver2
 
Vchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joinsVchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joins
Vijay Koushik
 
Scoring, term weighting and the vector space
Scoring, term weighting and the vector spaceScoring, term weighting and the vector space
Scoring, term weighting and the vector space
Ujjawal
 
Web clustering engines
Web clustering enginesWeb clustering engines
Web clustering engines
Yash Darak
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
Krish_ver2
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
Farzad Nozarian
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniques
Poonam Kshirsagar
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor Algorithm
IJTET Journal
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
IJMER
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
Krish_ver2
 
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
marxliouville
 
A survey of web clustering engines
A survey of web clustering enginesA survey of web clustering engines
A survey of web clustering engines
unyil96
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research Papers
SriTeja Allaparthi
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & Extraction
Deeksha thakur
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
hadifar
 
3.6 constraint based cluster analysis
3.6 constraint based cluster analysis3.6 constraint based cluster analysis
3.6 constraint based cluster analysis
Krish_ver2
 

Viewers also liked (20)

OUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text ClassificationOUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text Classification
Florian Leitner
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreduce
makoto onizuka
 
MachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_SparkMachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_Spark
Xudong Brandon Liang
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
IJRES Journal
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
Subhas Kumar Ghosh
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, altic
ALTIC Altic
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
mobius.cn
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
MLconf
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi Mohammadzadeh
Hadi Mohammadzadeh
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
Subhas Kumar Ghosh
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Varad Meru
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
Varad Meru
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
Milind Bhandarkar
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
Tien-Yang (Aiden) Wu
 
Temporal Pattern Mining
Temporal Pattern MiningTemporal Pattern Mining
Temporal Pattern Mining
Prakhar Dhama
 
IntelliGO semantic similarity measure for Gene Ontology annotations
IntelliGO semantic similarity measure for Gene Ontology annotationsIntelliGO semantic similarity measure for Gene Ontology annotations
IntelliGO semantic similarity measure for Gene Ontology annotations
European Institute for Systems Biology & Medicine.
 
Exploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsExploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in Classics
Matteo Romanello
 
How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?
Nicolas Robinson-Garcia
 
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Seattle DAML meetup
 
Cloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseCloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBase
DATAVERSITY
 
OUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text ClassificationOUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text Classification
Florian Leitner
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreduce
makoto onizuka
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
IJRES Journal
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
Subhas Kumar Ghosh
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, altic
ALTIC Altic
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
mobius.cn
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
MLconf
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi Mohammadzadeh
Hadi Mohammadzadeh
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Varad Meru
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
Varad Meru
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
Milind Bhandarkar
 
Temporal Pattern Mining
Temporal Pattern MiningTemporal Pattern Mining
Temporal Pattern Mining
Prakhar Dhama
 
Exploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsExploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in Classics
Matteo Romanello
 
How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?
Nicolas Robinson-Garcia
 
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Seattle DAML meetup
 
Cloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseCloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBase
DATAVERSITY
 
Ad

Similar to CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce (20)

Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
Amund Tveit
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
Paco Nathan
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
Ralf Gommers
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fm
Mark Levy
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
CheeWeiTan10
 
Hadoop
HadoopHadoop
Hadoop
Bhushan Kulkarni
 
Expressiveness, Simplicity and Users
Expressiveness, Simplicity and UsersExpressiveness, Simplicity and Users
Expressiveness, Simplicity and Users
greenwop
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Dony Riyanto
 
CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tables
Enrico Daga
 
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
Matthäus Zloch
 
PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...
Feng Li
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examples
Yoshitomo Matsubara
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
Subhas Kumar Ghosh
 
Unit 2
Unit 2Unit 2
Unit 2
vishal choudhary
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
Paul Chao
 
Information technology Researhc Tools in IT
Information technology Researhc Tools in ITInformation technology Researhc Tools in IT
Information technology Researhc Tools in IT
AhamedShibly
 
Data Science
Data ScienceData Science
Data Science
Subhajit75
 
Introduction to Data Structures Sorting and searching
Introduction to Data Structures Sorting and searchingIntroduction to Data Structures Sorting and searching
Introduction to Data Structures Sorting and searching
Mvenkatarao
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
Vijay Srinivas Agneeswaran, Ph.D
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
Amund Tveit
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
Paco Nathan
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
Ralf Gommers
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fm
Mark Levy
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
CheeWeiTan10
 
Expressiveness, Simplicity and Users
Expressiveness, Simplicity and UsersExpressiveness, Simplicity and Users
Expressiveness, Simplicity and Users
greenwop
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Dony Riyanto
 
CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tables
Enrico Daga
 
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
Matthäus Zloch
 
PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...
Feng Li
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examples
Yoshitomo Matsubara
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
Paul Chao
 
Information technology Researhc Tools in IT
Information technology Researhc Tools in ITInformation technology Researhc Tools in IT
Information technology Researhc Tools in IT
AhamedShibly
 
Introduction to Data Structures Sorting and searching
Introduction to Data Structures Sorting and searchingIntroduction to Data Structures Sorting and searching
Introduction to Data Structures Sorting and searching
Mvenkatarao
 
Ad

Recently uploaded (20)

web-roadmap developer file information..
web-roadmap developer file information..web-roadmap developer file information..
web-roadmap developer file information..
pandeyarush01
 
Time series analysis & forecasting-Day1.pptx
Time series analysis & forecasting-Day1.pptxTime series analysis & forecasting-Day1.pptx
Time series analysis & forecasting-Day1.pptx
AsmaaMahmoud89
 
Professional Certificate in Applied AI and Machine Learning
Professional Certificate in Applied AI and Machine LearningProfessional Certificate in Applied AI and Machine Learning
Professional Certificate in Applied AI and Machine Learning
Nafisur Ahmed
 
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptxConcrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
ssuserd1f4a3
 
03_10_gender_men_masculinity_reforms_policy.pdf
03_10_gender_men_masculinity_reforms_policy.pdf03_10_gender_men_masculinity_reforms_policy.pdf
03_10_gender_men_masculinity_reforms_policy.pdf
LucaMariaPesando1
 
L7-SL_en_Slides - LLMsIntroduction .pptx
L7-SL_en_Slides - LLMsIntroduction .pptxL7-SL_en_Slides - LLMsIntroduction .pptx
L7-SL_en_Slides - LLMsIntroduction .pptx
kenryostanikegbo
 
chapter-6 (1).pdf immunology innate immunity
chapter-6 (1).pdf immunology innate immunitychapter-6 (1).pdf immunology innate immunity
chapter-6 (1).pdf immunology innate immunity
bedadadenbal50
 
Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2
Dalal2Ali
 
From Data to Insight: How News Aggregator APIs Deliver Contextual Intelligence
From Data to Insight: How News Aggregator APIs Deliver Contextual IntelligenceFrom Data to Insight: How News Aggregator APIs Deliver Contextual Intelligence
From Data to Insight: How News Aggregator APIs Deliver Contextual Intelligence
Contify
 
15 Data Quality Issues Identify & Resolve Errors.pdf
15 Data Quality Issues Identify & Resolve Errors.pdf15 Data Quality Issues Identify & Resolve Errors.pdf
15 Data Quality Issues Identify & Resolve Errors.pdf
AffinityCore
 
Time series analysis & forecasting day 2.pptx
Time series analysis & forecasting day 2.pptxTime series analysis & forecasting day 2.pptx
Time series analysis & forecasting day 2.pptx
AsmaaMahmoud89
 
DEWDHDIEFHIFHIHGIERHFIHIM SC ID (2).pptx
DEWDHDIEFHIFHIHGIERHFIHIM SC ID (2).pptxDEWDHDIEFHIFHIHGIERHFIHIM SC ID (2).pptx
DEWDHDIEFHIFHIHGIERHFIHIM SC ID (2).pptx
f8jyv28tjr
 
Group Presentation - Cyclic Redundancy Checks.pptx
Group Presentation - Cyclic Redundancy Checks.pptxGroup Presentation - Cyclic Redundancy Checks.pptx
Group Presentation - Cyclic Redundancy Checks.pptx
vimbaimapfumo25
 
Hootsuite Social Trends 2025 Report_en.pdf
Hootsuite Social Trends 2025 Report_en.pdfHootsuite Social Trends 2025 Report_en.pdf
Hootsuite Social Trends 2025 Report_en.pdf
lionardoadityabagask
 
Computer Applications: An International Journal (CAIJ)
Computer Applications: An International Journal (CAIJ)Computer Applications: An International Journal (CAIJ)
Computer Applications: An International Journal (CAIJ)
ijitcs
 
presentacion.slideshare.informáticaJuridica..pptx
presentacion.slideshare.informáticaJuridica..pptxpresentacion.slideshare.informáticaJuridica..pptx
presentacion.slideshare.informáticaJuridica..pptx
GersonVillatoro4
 
Nghien cuu cua GlobalFindings2024 ve nam
Nghien cuu cua GlobalFindings2024 ve namNghien cuu cua GlobalFindings2024 ve nam
Nghien cuu cua GlobalFindings2024 ve nam
PaAlu2
 
FT Partners Research - FinTech in Africa-2.pdf
FT Partners Research - FinTech in Africa-2.pdfFT Partners Research - FinTech in Africa-2.pdf
FT Partners Research - FinTech in Africa-2.pdf
Obinna8
 
CRITICAL JURNAL KUANTITATIF KEPERAWATAN.pptx
CRITICAL JURNAL KUANTITATIF KEPERAWATAN.pptxCRITICAL JURNAL KUANTITATIF KEPERAWATAN.pptx
CRITICAL JURNAL KUANTITATIF KEPERAWATAN.pptx
monarisdaralina1
 
463.8-Bitcoin from university of illinois
463.8-Bitcoin from university of illinois463.8-Bitcoin from university of illinois
463.8-Bitcoin from university of illinois
8gqtkfzwbb
 
web-roadmap developer file information..
web-roadmap developer file information..web-roadmap developer file information..
web-roadmap developer file information..
pandeyarush01
 
Time series analysis & forecasting-Day1.pptx
Time series analysis & forecasting-Day1.pptxTime series analysis & forecasting-Day1.pptx
Time series analysis & forecasting-Day1.pptx
AsmaaMahmoud89
 
Professional Certificate in Applied AI and Machine Learning
Professional Certificate in Applied AI and Machine LearningProfessional Certificate in Applied AI and Machine Learning
Professional Certificate in Applied AI and Machine Learning
Nafisur Ahmed
 
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptxConcrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
ssuserd1f4a3
 
03_10_gender_men_masculinity_reforms_policy.pdf
03_10_gender_men_masculinity_reforms_policy.pdf03_10_gender_men_masculinity_reforms_policy.pdf
03_10_gender_men_masculinity_reforms_policy.pdf
LucaMariaPesando1
 
L7-SL_en_Slides - LLMsIntroduction .pptx
L7-SL_en_Slides - LLMsIntroduction .pptxL7-SL_en_Slides - LLMsIntroduction .pptx
L7-SL_en_Slides - LLMsIntroduction .pptx
kenryostanikegbo
 
chapter-6 (1).pdf immunology innate immunity
chapter-6 (1).pdf immunology innate immunitychapter-6 (1).pdf immunology innate immunity
chapter-6 (1).pdf immunology innate immunity
bedadadenbal50
 
Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2
Dalal2Ali
 
From Data to Insight: How News Aggregator APIs Deliver Contextual Intelligence
From Data to Insight: How News Aggregator APIs Deliver Contextual IntelligenceFrom Data to Insight: How News Aggregator APIs Deliver Contextual Intelligence
From Data to Insight: How News Aggregator APIs Deliver Contextual Intelligence
Contify
 
15 Data Quality Issues Identify & Resolve Errors.pdf
15 Data Quality Issues Identify & Resolve Errors.pdf15 Data Quality Issues Identify & Resolve Errors.pdf
15 Data Quality Issues Identify & Resolve Errors.pdf
AffinityCore
 
Time series analysis & forecasting day 2.pptx
Time series analysis & forecasting day 2.pptxTime series analysis & forecasting day 2.pptx
Time series analysis & forecasting day 2.pptx
AsmaaMahmoud89
 
DEWDHDIEFHIFHIHGIERHFIHIM SC ID (2).pptx
DEWDHDIEFHIFHIHGIERHFIHIM SC ID (2).pptxDEWDHDIEFHIFHIHGIERHFIHIM SC ID (2).pptx
DEWDHDIEFHIFHIHGIERHFIHIM SC ID (2).pptx
f8jyv28tjr
 
Group Presentation - Cyclic Redundancy Checks.pptx
Group Presentation - Cyclic Redundancy Checks.pptxGroup Presentation - Cyclic Redundancy Checks.pptx
Group Presentation - Cyclic Redundancy Checks.pptx
vimbaimapfumo25
 
Hootsuite Social Trends 2025 Report_en.pdf
Hootsuite Social Trends 2025 Report_en.pdfHootsuite Social Trends 2025 Report_en.pdf
Hootsuite Social Trends 2025 Report_en.pdf
lionardoadityabagask
 
Computer Applications: An International Journal (CAIJ)
Computer Applications: An International Journal (CAIJ)Computer Applications: An International Journal (CAIJ)
Computer Applications: An International Journal (CAIJ)
ijitcs
 
presentacion.slideshare.informáticaJuridica..pptx
presentacion.slideshare.informáticaJuridica..pptxpresentacion.slideshare.informáticaJuridica..pptx
presentacion.slideshare.informáticaJuridica..pptx
GersonVillatoro4
 
Nghien cuu cua GlobalFindings2024 ve nam
Nghien cuu cua GlobalFindings2024 ve namNghien cuu cua GlobalFindings2024 ve nam
Nghien cuu cua GlobalFindings2024 ve nam
PaAlu2
 
FT Partners Research - FinTech in Africa-2.pdf
FT Partners Research - FinTech in Africa-2.pdfFT Partners Research - FinTech in Africa-2.pdf
FT Partners Research - FinTech in Africa-2.pdf
Obinna8
 
CRITICAL JURNAL KUANTITATIF KEPERAWATAN.pptx
CRITICAL JURNAL KUANTITATIF KEPERAWATAN.pptxCRITICAL JURNAL KUANTITATIF KEPERAWATAN.pptx
CRITICAL JURNAL KUANTITATIF KEPERAWATAN.pptx
monarisdaralina1
 
463.8-Bitcoin from university of illinois
463.8-Bitcoin from university of illinois463.8-Bitcoin from university of illinois
463.8-Bitcoin from university of illinois
8gqtkfzwbb
 

CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce

  • 1. CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce Giannakouris – Salalidis Victor - Undergraduate Student Plerou Antonia - PhD Candidate Sioutas Spyros - Associate Professor
  • 2. Introduction • Big Data: Massive amount of data as a result of the huge rate of growth • Big Data need to be faced in various domains: Business Intelligence, Bioinformatics, Social Media Analytics etc. • Text Mining: Classification/Clustering in digital libraries, e-mail, Sentiment Analysis on Social Media • CSMR: Performs pairwise text similarity, represents text data in a vector space and measures similarity in parallel manner using MapReduce
  • 3. Background • Vector Space Model: An algebraic model for representing text documents as vectors • Efficient method for text similarity measurement
  • 4. TF-IDF • Term Frequency – Inverse Document Frequency • A numerical statistic that reflects the significance of a term in a corpus of documents • Usually used in search engines, text mining, text similarity in the vector space 푇퐹 × 퐼퐷퐹 = 푛푖,푗 푡 ∈ 푑푗 × 푙표푔 |퐷| |푑 ∈ 퐷: 푡 ∈ 푑|
  • 5. Cosine Similarity • Cosine Similarity: A measure of similarity between two documents represented as vector • Measuring of the angle between two vectors A  B A  B   1 1 2 2 A  B 1 1 cos(A,B) || A|| || B|| ( ) ( ) n i i n i i i i i      
  • 6. Hadoop • Framework developed by Apache • Large-Scale Data Processing and Analytics • Scalable and parallel processing of data on large computer clusters using MapReduce • Runs on commodity, low-end hardware • Main Components: HDFS (Hadoop Distributed File System), MapReduce • Currently used by: Adobe, Yahoo!, Amazon, eBay, Facebook and many other companies
  • 7. MapReduce • Programming Paradigm running on Apache Hadoop • The main component of Hadoop • Useful for processing of large data-sets • Breaks the data into key-value pairs • Model derived from map and reduce functions of Functional Programming • Every MR program constitutes of Mappers and Reducers
  • 9. CSMR • The purposed method, CSMR combines all the above mentioned techniques • Scalable Algorithm for text clustering using MapReduce model • Applies MR model on TF-IDF and Cosine Similarity • 4 Phases: 1. Word Counting 2. Text Vectorization using term frequencies 3. Apply TF-IDF on document vectors 4. Cosine Similarity Measurement
  • 10. Phase 1: Word Counting Algorithm 1: Word Count 1: class Mapper 2: method Map( document ) 3: for each term ∈ document 4: write ( ( term , docId ) , 1 ) 5: 6: class Reducer 7: method Reduce( ( term , docId ) , ones[ 1 , 1 , … , n ] ) 8: sum = 0 9: for each one ∈ ones do 10: sum = sum +1 11: return ( ( term , docId ) , o ) 12: 13: /* { o ∈ N : the number of occurrences } */
  • 11. Phase 2: Term Frequency Algorithm 2: Term Frequency 1: class Mapper 2: method Map( ( term , docId ) , o ) 3: for each element ∈ ( term , docId ) 4: write ( docId, ( term, o ) ) 5: 6: class Reducer 7: method Reduce( docId, (term, o) ) 8: N = 0 9: for each tuple ∈ ( term, o ) do 10: N = N + o return ( (docId, N), (term, o) )
  • 12. Phase 3: TF-IDF Algorithm 3: Tf-Idf 1: class Mapper 2: method Map( ( docId , N ), ( term , o ) ) 3: for each element ∈ ( term , o ) 4: write ( term, ( docId, o, N ) ) 5: 6: class Reducer 7: method Reduce( term, ( docId , o , N ) ) 8: n = 0 9: for each element ∈ ( docId , o , N ) do 10: n = n + 1 11: tf = o / N 12: idf = log|D| /(1n) 13: return ( docId, ( term , tf×idf ) ) 14: 15: /* Where |D| is the number of documents in the corpus */
  • 13. Phase 4: Cosine Similarity Algorithm 4: Cosine Similarity 1: class Mapper 2: method Map( docs ) 3: n = docs.length 4: 5: for i = 0 to docs.length 6: for j = i+1 to docs.length 7: write ( ( docs[i].id, docs[j].id ),( docs[i].tfidf, docs[j].tfidf ) ) 8: 9: class Reducer 10: method Reduce( ( docId_A, docId_B ),( docA.tfidf, docB.tfidf ) ) 11: A = docA.tfidf 12: B = docB.tfidf 13: cosine = sum( A×B )/ (sqrt( sum(A2) )× sqrt( sum(B2) )) 14: return ( (docId_A, docId_B), cosine )
  • 14. Phase 4: Diagram Map Doc1,Doc2 [Doc1 TF-IDF], [Doc2 TF-IDF] Doc1,Doc3 [Doc1 TF-IDF], [Doc3 TF-IDF] Doc1,Doc4 Input [Doc1 TF-IDF], [Doc4 TF-IDF] Output Doc4,Doc10 [Doc4 TF-IDF], [Doc10 TF-IDF] DocM,DocN [DocM TF-IDF], [DocN TF-IDF] Reduce Doc1,Doc3 Cosine(Doc1, Doc3) Doc1,Doc4 Cosine(Doc1 ,Doc4) Doc4,Doc10 Cosine(Doc4, Doc10) DocM,DocN Cosine(DocM, DocN) Doc1,Doc2 Cosine(Doc1, Doc2)
  • 15. Conclusions & Future Work • Finalized proposed method • Implementation of the method • Experimental tests on real data and computer clusters • Deployment of an open-source project • Additional implementation using more efficient tools such as Apache Spark and Scala • Publication of test results
  翻译: