Computing Scientometrics in
Large-Scale Academic Search
Engines with MapReduce
Leonidas Akritidis
Panayiotis Bozanis
Department of Computer & Communication Engineering,
University of Thessaly, Greece
13th International Conference on Web Information System Engineering
WISE 2012, November 28-30, Paphos, Cyprus
Scientometrics
 Metrics evaluating the research work of a
scientist by assigning impact scores to his/her
articles.
 Usually expressed as definitions of the form:
 A scientist a is of value V if at least V of his/her
articles receive a score $S \geq V$.
 A researcher must author numerous high-quality
articles.
h-index
 The first and most popular of these metrics is
h-index (Hirsch, 2005).
 A researcher a has h-index h if h of his/her $P_a$
articles have received at least h citations each.
 This metric indicates the impact of an author.
 An author must not only produce numerous
papers;
 His/her papers must also be cited frequently.
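The h-index definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `h_index` and the sample citation counts are assumptions made for the example.

```python
def h_index(citation_counts):
    """Return the largest h such that h articles have at least h citations each."""
    # Rank the articles from most cited to least cited.
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the top `rank` articles all have >= rank citations
        else:
            break         # counts only decrease from here, so h cannot grow
    return h


# Hypothetical author with five articles.
example_counts = [10, 8, 5, 4, 3]
example_h = h_index(example_counts)
```

Sorting first keeps the scan linear after the sort, which is the same "sorted list, then iterate" pattern the later slides build per author.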
Time-Aware h-index Variants
 The contemporary and trend h-index
(Sidiropoulos, 2006) introduced temporal
aspects in the evaluation of a scientist's work.
 They assign to each article of an author time-
decaying scores:
 Contemporary score: $S_c^{p_i} = \gamma \, \dfrac{|P_c^{p_i}|}{(\Delta Y)_i^{\delta}}$
 Trend score: $S_t^{p_i} = \gamma \sum_{n=1}^{|P_c^{p_i}|} \dfrac{1}{(\Delta Y)_n^{\delta}}$
Contemporary and Trend
h-indices
 $(\Delta Y)_i$: The time (in years) elapsed since the
publication of the article i.
 $|P_c^{p_i}|$: The number of papers citing $p_i$.
 Contemporary score: The value of an article
decays over time.
 Trend score: An article is important if it
continues to be cited in the present.
 A scientist a has contemporary h-index $h_c$ if at
least $h_c$ of his/her articles receive a score $S_c \geq h_c$.
 Parameters: $\gamma = 4$, $\delta = 1$.
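A hedged Python sketch of the two time-decaying scores, using the parameter values 4 and 1 given on the slide. Two points are assumptions of this sketch: the article age is computed as `now - year + 1` so that an article published in the current year does not cause a division by zero, and all function names are hypothetical.

```python
GAMMA = 4.0  # scaling coefficient (value taken from the slide)
DELTA = 1.0  # decay exponent (value taken from the slide)


def age(year, now):
    # Assumption: +1 keeps the age positive for articles published this year.
    return now - year + 1


def contemporary_score(pub_year, n_citations, now):
    """Citation count of an article, damped by the article's own age."""
    return GAMMA * n_citations / age(pub_year, now) ** DELTA


def trend_score(citing_years, now):
    """Sum over citing papers, each damped by the age of the citation."""
    return GAMMA * sum(1.0 / age(y, now) ** DELTA for y in citing_years)


def generalized_h(scores):
    """Largest h such that h articles have a score of at least h."""
    h = 0
    for rank, s in enumerate(sorted(scores, reverse=True), start=1):
        if s >= rank:
            h = rank
        else:
            break
    return h
```

The same `generalized_h` scan serves the plain, contemporary, and trend variants; only the per-article score changes.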
Scientometrics Computation
 Easy for small datasets
 For h-index we just need to identify the
articles of each researcher and enumerate all
their incoming citations.
 However, for large datasets the computation
becomes more complex:
 The data (authors, citations, and metadata)
do not fit in main memory.
 Tens of millions of articles and authors
Academic Search Engines
MapReduce (1)
 MapReduce: A fault tolerant framework for
distributing problems to large clusters.
 The input data is split in chunks; each chunk is
processed by a single Worker process (Map).
 Mapper outputs are written in intermediate files.
 Subsequently, the Reducers process and merge
the Mappers’ outputs.
 The data is formatted in key-value pairs.
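The flow described above (map each input record, group the intermediate key-value pairs by key, then reduce each group) can be imitated in memory. The sketch below is a single-process toy, not a distributed framework; `run_mapreduce` and the word-count functions are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process imitation of the map -> shuffle -> reduce flow."""
    groups = defaultdict(list)
    # Map phase: each input pair may produce any number of intermediate pairs.
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)      # "shuffle": group values by key
    # Reduce phase: keys are visited in sorted order, as a real job would do.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def count_map(doc_id, text):
    for word in text.split():
        yield word, 1


def count_reduce(word, counts):
    yield word, sum(counts)


result = run_mapreduce([("d1", "a b a"), ("d2", "b c")], count_map, count_reduce)
```

A real deployment would additionally split the input into chunks, write intermediate files, and move data between nodes; none of that is modeled here.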
MapReduce (2)
 MapReduce algorithms are expressed by
writing only two functions:
 Map: $map(k_1, v_1) \rightarrow list(k_2, v_2)$
 Reduce: $reduce(k_2, list(v_2)) \rightarrow list(k_3, v_3)$
 MapReduce jobs are deployed on top of
a distributed file system which serves files,
replicates data, and transparently handles
hardware failures.
Combiners
 An optional component standing between the
Mappers and the Reducers.
 It serves as a mini-Reducer; it merges the
values of the same keys for each Map Job.
 It reduces the data exchanged among system
nodes, thus saving bandwidth.
 Implementations
 Explicit: Declare a Combine function
 In-Mapper: merge the values of the same key
within map.
• Guaranteed execution!
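To make the in-mapper variant concrete, here is a small Python sketch that contrasts a plain mapper with one that merges values per key before emitting. The function names and the sample chunk are assumptions for illustration; summation stands in for whatever merge operation the job needs.

```python
from collections import defaultdict


def map_plain(chunk):
    """Emit one record per input pair (no combining)."""
    for key, value in chunk:
        yield key, value


def map_with_in_mapper_combiner(chunk):
    """Merge the values of equal keys inside the map task, then emit once per key."""
    partial = defaultdict(int)
    for key, value in chunk:
        partial[key] += value
    # Emission happens only after the whole chunk has been consumed,
    # so the merge always runs; it does not depend on the framework.
    yield from partial.items()


chunk = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
plain_records = list(map_plain(chunk))
combined_records = list(map_with_in_mapper_combiner(chunk))
```

The trade-off is memory: the mapper must hold one entry per distinct key of its chunk until the end of the task.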
Parallelizing the Problem
 Goal: Compute scientometrics in parallel.
 Input: $(p_i, C^{p_i})$ = (paperID, paperContent)
 Output: $(a, M_x^a)$ = (author, metric)
 To reach our goal, we have to construct, for
each author, a list of his/her articles sorted by
decreasing score:
$(a, SortedList[(p_1, S_x^{p_1}), (p_2, S_x^{p_2}), \ldots, (p_N, S_x^{p_N})])$
 Then, we just iterate through the list and
compute the desired metric value.
Methods 1,2 – Map Phase
 Method 1 emits: (author, <paper,score>)
 Method 2 emits: (<author,paper>, score)
 Note: We emit the authors of the references, not
the authors of the citing paper.
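A possible shape for the two map functions, written as Python generators. The layout of the parsed paper (a dict with a `references` list, each reference carrying an `id` and its `authors`) is a hypothetical data model, and the partial score of 1 per citation corresponds to the plain h-index case only.

```python
def map_method1(paper_id, paper):
    """Method 1: key = cited author, value = <cited paper, partial score>."""
    for ref in paper["references"]:
        for author in ref["authors"]:      # authors of the reference
            yield author, (ref["id"], 1)   # one citation = partial score 1


def map_method2(paper_id, paper):
    """Method 2: key = <cited author, cited paper>, value = partial score."""
    for ref in paper["references"]:
        for author in ref["authors"]:
            yield (author, ref["id"]), 1


# Hypothetical citing paper p9 with two references.
paper = {
    "references": [
        {"id": "p1", "authors": ["A", "B"]},
        {"id": "p2", "authors": ["A"]},
    ]
}
out1 = list(map_method1("p9", paper))
out2 = list(map_method2("p9", paper))
```

Both functions emit the same number of records; they differ only in where the paper identifier sits, which decides whether the framework or the reducer groups the scores per paper.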
Method 1, Reduce Phase
 Reducer Input: (author, pair <paper,score>)
 Create an associative
array which stores for
each author, a list of
<paper,score> pairs.
 Sum partial paper scores
 Sort the array by
descending score.
 Compute metric.
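The reducer steps listed above might look as follows in Python. This is a sketch under assumptions: `reduce_method1` is a hypothetical name, and the metric computed is the plain h-index over summed partial scores.

```python
from collections import defaultdict


def reduce_method1(author, pairs):
    """Receive all <paper, score> pairs of one author and emit (author, metric)."""
    # Sum the partial scores of each paper in an associative array.
    totals = defaultdict(float)
    for paper, score in pairs:
        totals[paper] += score
    # Sort the scores in descending order.
    ranked = sorted(totals.values(), reverse=True)
    # Scan the sorted list to compute the metric.
    metric = 0
    for rank, score in enumerate(ranked, start=1):
        if score >= rank:
            metric = rank
        else:
            break
    yield author, metric


pairs = [("p1", 1), ("p2", 1), ("p1", 1), ("p3", 1), ("p2", 1)]
result = list(reduce_method1("A", pairs))
```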
Method 2, Reduce Phase
 Reducer Input: (pair <author,paper>, score)
 Keys are sorted by author
and paper (secondary
sort).
 Create an associative
array which stores, for each
author, a list of
<paper,score> pairs.
 Compute metric.
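A rough Python imitation of the secondary-sort idea: once the composite keys arrive ordered by author and then by paper, the scores of one paper are adjacent and can be summed in a single pass. The explicit `sorted` call stands in for the ordering a framework would provide, and `reduce_method2` is a hypothetical name.

```python
from itertools import groupby


def reduce_method2(records):
    """records: iterable of ((author, paper), score) pairs, in any order."""
    # Stand-in for the framework's sort on the composite key.
    ordered = sorted(records, key=lambda kv: kv[0])
    for author, author_group in groupby(ordered, key=lambda kv: kv[0][0]):
        scores = []
        # Records of the same paper are now consecutive: sum each run.
        for _paper, run in groupby(author_group, key=lambda kv: kv[0][1]):
            scores.append(sum(score for _key, score in run))
        scores.sort(reverse=True)
        metric = 0
        for rank, s in enumerate(scores, start=1):
            if s >= rank:
                metric = rank
            else:
                break
        yield author, metric


records = [
    (("A", "p1"), 1), (("B", "p1"), 1), (("A", "p2"), 1),
    (("A", "p1"), 1), (("A", "p2"), 1),
]
result = list(reduce_method2(records))
```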
Method 1-C, Map-Reduce
 In-Mapper Combiner (unique keys).
 map emits author as key, and a list
of <paper,score> pairs as value.
 The Reducer merges the lists
associated with the same key.
 The list is sorted by score
[Figure label: Merge unique authors]
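One way to picture Method 1-C in Python: the mapper buffers one list of <paper,score> pairs per unique author for its whole chunk, and the reducer merges the lists that arrive from different mappers. The names, the dict-based paper layout, and the per-citation score of 1 are all assumptions of this sketch.

```python
from collections import defaultdict


def map_method1c(papers):
    """papers: the (paper_id, paper) records of one input chunk."""
    buffer = defaultdict(lambda: defaultdict(float))
    for _paper_id, paper in papers:
        for ref in paper["references"]:
            for author in ref["authors"]:
                buffer[author][ref["id"]] += 1   # combine inside the mapper
    # Exactly one record per unique author of the chunk.
    for author, scores in buffer.items():
        yield author, list(scores.items())


def reduce_method1c(author, lists):
    """Merge the per-mapper lists of one author and sort by score."""
    totals = defaultdict(float)
    for pairs in lists:
        for paper, score in pairs:
            totals[paper] += score
    ranked = sorted(totals.items(), key=lambda ps: ps[1], reverse=True)
    yield author, ranked


chunk = [
    ("p8", {"references": [{"id": "p1", "authors": ["A"]}]}),
    ("p9", {"references": [{"id": "p1", "authors": ["A", "B"]},
                           {"id": "p2", "authors": ["A"]}]}),
]
mapped = dict(map_method1c(chunk))
merged = list(reduce_method1c("A", [[("p1", 2.0), ("p2", 1.0)], [("p2", 3.0)]]))
```

The metric itself would then be computed by scanning the sorted list, as in the other methods.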
Method 2-C, Map-Reduce
 In-Mapper Combiner (unique keys).
 map emits <author,paper> pairs
as key, and <score> as value.
 The Reducer merges the lists
associated with the same key.
 The list is sorted by score
Experiments
 We applied our algorithms to the CiteSeerX
dataset, an open repository comprising 1.8
million research articles.
 We used the XML version of the dataset.
 Total input size: ~28 GB.
 Small, but the largest publicly available.
MapReduce I/O Sizes
 The methods that employ Combiners
perform better.
 Method 1-C: The Mappers produce 21.7 million
key-value records (gain ~41%). Total output
size = 600 MB (~13% less bandwidth).
 Method 2-C: 34.2 million records and 643 MB
(~7%).
Running Times
 We also measured the running times of the
four methods on two clusters:
 A local, small cluster comprising 8 Core i7
processing cores.
 A commercial Web cloud infrastructure with up
to 40 processing cores.
 On the first cluster, we replicated the input
data across all nodes. In the second case, we
do not know the physical location of the data.
Running Times
 All four methods have the same computational
complexity.
 We expect timings proportional to the size of
the data exchanged among the Mappers and
the Reducers.
 This favors the Methods 1-C and 2-C which
limit data transfers via the Combiners.
Running Times – Local Cluster
 All methods scale well to
the size of the cluster
 Method 1-C is the fastest
 It outperforms method
2-C by ~18%
 It outperforms method 1
by 30-35%
Running Times – Web Cluster
 We repeated the
experiment on a Web
cloud infrastructure.
 Running times between
the two clusters are not
comparable.
 Different hardware
and architecture
 Method 1-C is still the fastest
Conclusions
 We studied the problem of computing author
evaluation metrics (scientometrics) in large
academic search engines with MapReduce.
 We introduced four approaches.
 We showed that the most efficient strategy is to
create one list of <paper,score> pairs for
each unique author during the map phase.
 In this way, we reduce running times by at least
20% and save ~13% bandwidth.
Thank you!
Any Questions?
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce

  • 1. Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce Leonidas Akritidis Panayiotis Bozanis Department of Computer & Communication Engineering, University of Thessaly, Greece 13th International Conference on Web Information System Engineering WISE 2012, November 28-30, Paphos, Cyprus
  • 2. Scientometrics  Metrics evaluating the research work of a scientist by assigning impact scores to his/her articles.  Usually expressed as definitions of the form:  A scientist a is of value V if at least V of his articles receive a score  A researcher must author numerous qualitative articles. .S V
  • 3. h-index  The first and most popular of these metrics is h-index (Hirsch, 2005).  A researcher a has h-index h, if h of his/her Pa articles have received at least h citations.  This metric indicates the impact of an author.  An author must not only produce numerous papers;  His/her papers must also be cited frequently.
  • 4. Time-Aware h-index Variants  The contemporary and trend h-index (Sidiropoulos, 2006) introduced temporal aspects in the evaluation of a scientist's work.  They assign to each article of an author time- decaying scores:  Contemporary Score:  Trend Score:   i i p cp c i P S Y     1 1 pi c i P p t n n S Y      
  • 5. Contemporary and Trend h-indices  (ΔY)i: The time (in years) elapsed since the publication of the article i.  number of papers citing pi  Contemporary score: The value of an article decays over time.  Trend Score: An article is important if it continues to be cited in the present.  A scientist a has contemporary h-index hc if at least hc of his articles receive a score c cS h 4, 1, :ip cP  
  • 6. Scientometrics Computation
      • Easy for small datasets:
          • For the h-index we just need to identify the articles of each researcher and enumerate all their incoming citations.
      • However, for large datasets the computation becomes more complex:
          • The data (authors, citations, and metadata) do not fit in main memory.
          • Tens of millions of articles and authors.
  • 8. MapReduce (1)
      • MapReduce: a fault-tolerant framework for distributing problems to large clusters.
      • The input data is split into chunks; each chunk is processed by a single Worker process (Map).
      • Mapper outputs are written to intermediate files.
      • Subsequently, the Reducers process and merge the Mappers’ outputs.
      • The data is formatted as key-value pairs.
  • 9. MapReduce (2)
      • MapReduce algorithms are expressed by writing only two functions:
          • Map: $map(k_1, v_1) \rightarrow list(k_2, v_2)$
          • Reduce: $reduce(k_2, list(v_2)) \rightarrow list(k_3, v_3)$
      • MapReduce jobs are deployed on top of a distributed file system which serves files, replicates data, and transparently addresses hardware failures.
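The two function signatures can be illustrated with a tiny single-process driver. This is only a sketch of the programming model, not of a real framework: `run_mapreduce`, `count_map`, and `count_reduce` are hypothetical names, and the in-memory dictionary stands in for the intermediate files and the shuffle step.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every (k1, v1), group by k2, then apply reduce_fn."""
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):          # map(k1, v1) -> list(k2, v2)
            intermediate[k2].append(v2)
    output = []
    for k2 in sorted(intermediate):            # keys reach reducers in sorted order
        output.extend(reduce_fn(k2, intermediate[k2]))  # -> list(k3, v3)
    return output


def count_map(paper_id, cited_ids):
    """Emit (cited paper, 1) for every reference of the input paper."""
    for cited in cited_ids:
        yield cited, 1


def count_reduce(cited, ones):
    """Sum the partial counts of one cited paper."""
    yield cited, sum(ones)


# Hypothetical input: paper id -> ids of the papers it cites.
records = [("p1", ["p2", "p3"]), ("p4", ["p2"])]
citation_counts = run_mapreduce(records, count_map, count_reduce)
```

Here the job counts incoming citations per paper, which is the raw ingredient of the h-index.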
  • 10. Combiners
      • An optional component standing between the Mappers and the Reducers.
      • It serves as a mini-Reducer: it merges the values of the same keys for each Map job.
      • It reduces the data exchanged among system nodes, thus saving bandwidth.
      • Implementations:
          • Explicit: declare a Combine function.
          • In-Mapper: merge the values of the same key within map. Guaranteed execution!
  • 11. Parallelizing the Problem
      • Goal: compute scientometrics in parallel.
      • Input: $(p_i, C^{p_i}) = (paperID, paperContent)$
      • Output: $(a, M_x^a) = (author, metric)$
      • To reach our goal, we have to construct, for each author, a list of his/her articles sorted by decreasing score:
          $a \rightarrow SortedList[(p_1, S_x^{p_1}), (p_2, S_x^{p_2}), \ldots, (p_N, S_x^{p_N})]$
      • Then we just iterate through the list and compute the desired metric value.
  • 12. Methods 1, 2 – Map Phase
      • Method 1 emits (author, <paper,score>).
      • Method 2 emits (<author,paper>, score).
      • Notice: we emit the references’ authors, not the paper’s authors.
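The two map variants might look like the sketch below. It rests on several assumptions that the slides do not spell out: each parsed paper is a dictionary with a `references` list, each reference carries an `id` and its `authors`, and the partial score of one citation is 1 (the plain h-index case). All of these names are hypothetical.

```python
def map_method1(paper_id, paper):
    """Method 1: key = cited author, value = <cited paper, partial score>."""
    for ref in paper["references"]:
        for author in ref["authors"]:       # authors of the reference, not of paper_id
            yield author, (ref["id"], 1)


def map_method2(paper_id, paper):
    """Method 2: key = <cited author, cited paper>, value = partial score."""
    for ref in paper["references"]:
        for author in ref["authors"]:
            yield (author, ref["id"]), 1


# Hypothetical parsed paper citing two earlier articles.
sample_paper = {
    "references": [
        {"id": "r1", "authors": ["alice", "bob"]},
        {"id": "r2", "authors": ["alice"]},
    ]
}
out1 = list(map_method1("p1", sample_paper))
out2 = list(map_method2("p1", sample_paper))
```

Both variants emit the same information; they differ only in where the paper identifier sits, which decides whether the framework's sort can order records by paper as well as by author.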
  • 13. Method 1, Reduce Phase
      • Reducer input: (author, pair <paper,score>).
      • Create an associative array which stores, for each author, a list of <paper,score> pairs.
      • Sum the partial paper scores.
      • Sort the array by descending score.
      • Compute the metric.
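The reduce steps listed on this slide can be sketched for a single author as follows. The function name `reduce_method1` is hypothetical, and the metric computed at the end is assumed to be the plain h-index; the time-aware variants would only change the partial scores.

```python
from collections import defaultdict


def reduce_method1(author, pairs):
    """pairs: iterable of (paper, partial_score) emitted for this author."""
    totals = defaultdict(int)
    for paper, score in pairs:
        totals[paper] += score                     # sum the partial paper scores
    ranked = sorted(totals.values(), reverse=True) # sort by descending score
    metric = 0
    for rank, score in enumerate(ranked, start=1): # compute the metric
        if score >= rank:
            metric = rank
        else:
            break
    return author, metric


# Hypothetical reducer input: p1 cited 3 times, p2 twice, p3 once.
sample_pairs = [("p1", 1), ("p1", 1), ("p2", 1), ("p1", 1), ("p2", 1), ("p3", 1)]
reduced = reduce_method1("alice", sample_pairs)
```

Because the values arrive in arbitrary order, the reducer has to hold all papers of an author in memory before it can sort them.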
  • 14. Method 2, Reduce Phase
      • Reducer input: (pair <author,paper>, score).
      • Keys are sorted by author and paper (secondary sort).
      • We create an associative array which stores, for each author, a list of <paper,score> pairs.
      • Compute the metric.
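With keys sorted by author and then by paper, all partial scores of one paper arrive consecutively, so they can be summed in a streaming fashion. The sketch below simulates this locally with `itertools.groupby` over a pre-sorted list; it is an assumption about how the reducer could exploit the secondary sort, not a reproduction of the authors' code, and `reduce_method2` is a hypothetical name.

```python
from itertools import groupby


def reduce_method2(sorted_records):
    """sorted_records: ((author, paper), score) tuples sorted by key."""
    for author, author_group in groupby(sorted_records, key=lambda kv: kv[0][0]):
        paper_scores = []
        # Inside one author, records of the same paper are adjacent.
        for _paper, paper_group in groupby(author_group, key=lambda kv: kv[0][1]):
            paper_scores.append(sum(score for _key, score in paper_group))
        ranked = sorted(paper_scores, reverse=True)
        metric = 0
        for rank, score in enumerate(ranked, start=1):
            if score >= rank:
                metric = rank
            else:
                break
        yield author, metric


# Hypothetical mapper output; sorted() stands in for the framework's sort.
raw = [
    (("alice", "p1"), 1), (("bob", "p9"), 1), (("alice", "p2"), 1),
    (("alice", "p1"), 1), (("alice", "p2"), 1), (("alice", "p3"), 1),
]
metrics = list(reduce_method2(sorted(raw)))
```

The final sort is still by score, since the secondary sort orders papers by identifier rather than by their summed scores.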
  • 15. Method 1-C, Map-Reduce
      • In-Mapper Combiner (unique keys).
      • map emits the author as key and a list of <paper,score> pairs as value.
      • The Reducer merges the lists associated with the same key.
      • The list is sorted by score.
      • Figure label: merge unique authors.
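An in-mapper combiner for this method can be sketched as a mapper object that buffers its output and emits one record per unique author when the map task finishes. The class name, the `close` hook, and the input record layout (a `references` list with `id` and `authors` fields, partial score 1 per citation) are assumptions for illustration.

```python
from collections import defaultdict


class Method1CMapper:
    """Buffers <paper, score> pairs per author across all map() calls."""

    def __init__(self):
        # author -> {cited paper -> summed partial score}
        self.buffer = defaultdict(lambda: defaultdict(int))

    def map(self, paper_id, paper):
        for ref in paper["references"]:
            for author in ref["authors"]:
                self.buffer[author][ref["id"]] += 1   # merge inside the mapper

    def close(self):
        """Emit one (author, list of <paper, score>) record per unique author."""
        for author, scores in self.buffer.items():
            pairs = sorted(scores.items(), key=lambda ps: (-ps[1], ps[0]))
            yield author, pairs


# Hypothetical chunk of two papers handled by the same map task.
mapper = Method1CMapper()
mapper.map("p1", {"references": [{"id": "r1", "authors": ["alice"]}]})
mapper.map("p2", {"references": [{"id": "r1", "authors": ["alice"]},
                                 {"id": "r2", "authors": ["alice", "bob"]}]})
emitted = dict(mapper.close())
```

The trade-off of this design is memory: the buffer grows with the number of distinct authors in the chunk, so a production mapper would presumably need to flush it once it exceeds some size.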
  • 16. Method 2-C, Map-Reduce
      • In-Mapper Combiner (unique keys).
      • map emits <author,paper> pairs as key and <score> as value.
      • The Reducer merges the lists associated with the same key.
      • The list is sorted by score.
  • 17. Experiments
      • We applied our algorithms to the CiteSeerX dataset, an open repository comprising 1.8 million research articles.
      • We used the XML version of the dataset.
      • Total input size: ~28 GB.
      • Small, but the largest publicly available.
  • 18. MapReduce I/O Sizes
      • The methods which employ Combiners perform reasonably better.
      • Method 1-C: the Mappers produce 21.7 million key-value records (gain ~41%). Total output size = 600 MB (~13% less bandwidth).
      • Method 2-C: 34.2 million records and 643 MB (~7%).
  • 19. Running Times
      • We also measured the running times of the four methods on two clusters:
          • A small local cluster comprising 8 Core i7 processing cores.
          • A commercial Web cloud infrastructure with up to 40 processing cores.
      • On the first cluster, we replicated the input data across all nodes. In the second case, we are not aware of the physical location of the data.
  • 20. Running Times
      • All four methods have the same computational complexity.
      • We expect timings proportional to the size of the data exchanged among the Mappers and the Reducers.
      • This favors Methods 1-C and 2-C, which limit data transfers via the Combiners.
  • 21. Running Times – Local Cluster
      • All methods scale well with the size of the cluster.
      • Method 1-C is the fastest:
          • It outperforms Method 2-C by ~18%.
          • It outperforms Method 1 by 30-35%.
  • 22. Running Times – Web Cluster
      • We repeated the experiment on a Web cloud infrastructure.
      • Running times between the two clusters are not comparable:
          • Different hardware and architecture.
      • Method 1-C is still the fastest.
  • 23. Conclusions
      • We studied the problem of computing author evaluation metrics (scientometrics) in large academic search engines with MapReduce.
      • We introduced four approaches.
      • We showed that the most efficient strategy is to create one list of <paper,score> pairs for each unique author during the map phase.
      • In this way we reduce running times by at least 20% and save ~13% bandwidth.