Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce

Leonidas Akritidis
Panayiotis Bozanis
Department of Computer & Communication Engineering,
University of Thessaly, Greece
13th International Conference on Web Information System Engineering
WISE 2012, November 28-30, Paphos, Cyprus

Scientometrics
 Metrics evaluating the research work of a
scientist by assigning impact scores to his/her
articles.
 Usually expressed as definitions of the form:
 A scientist a is of value V if at least V of his/her
articles receive a score $S \ge V$.
 A researcher must therefore author numerous
high-quality articles.

h-index
 The first and most popular of these metrics is
the h-index (Hirsch, 2005).
 A researcher a has h-index h if h of his/her $P_a$
articles have received at least h citations each.
 This metric indicates the impact of an author.
 An author must not only produce numerous
papers;
 His/her papers must also be cited frequently.
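
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `h_index` and the sample citation counts are hypothetical.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank the papers from most cited to least cited.
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break         # counts only decrease from here, so we can stop
    return h


# Hypothetical citation counts for one author's five papers.
example_counts = [10, 8, 5, 4, 3]
print(h_index(example_counts))
```

Sorting first is what makes the single pass sufficient: once a paper fails the test, every later (less cited) paper fails it too.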

Time-Aware h-index Variants
 The contemporary and trend h-index
(Sidiropoulos, 2006) introduced temporal
aspects in the evaluation of a scientist's work.
 They assign to each article of an author time-
decaying scores:
 Contemporary Score: $S_c^{p_i} = \gamma \, \dfrac{|P_c^{p_i}|}{(\Delta Y)_i^{\delta}}$
 Trend Score: $S_t^{p_i} = \gamma \sum_{n=1}^{|P_c^{p_i}|} \dfrac{1}{(\Delta Y)_n^{\delta}}$

Contemporary and Trend
h-indices
 $(\Delta Y)_i$: The time (in years) elapsed since the
publication of the article i.
 $\gamma = 4$, $\delta = 1$, $|P_c^{p_i}|$: number of papers citing $p_i$.
 Contemporary score: The value of an article
decays over time.
 Trend Score: An article is important if it
continues to be cited in the present.
 A scientist a has contemporary h-index $h_c$ if at
least $h_c$ of his/her articles receive a score $S_c \ge h_c$.
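
The two time-decaying scores and the resulting contemporary h-index can be sketched as follows. The sketch assumes the formulation of Sidiropoulos et al., where the contemporary score is gamma times the citation count divided by the article's age raised to delta, and the trend score sums the same decay over the ages of the citing articles. Computing age as `current_year - year + 1` (so that a paper published this year has age 1) is an assumption, as are all function names and sample values.

```python
def contemporary_score(num_citations, pub_year, current_year, gamma=4.0, delta=1.0):
    """Citations of an article, discounted by the article's own age."""
    age = current_year - pub_year + 1  # assumption: +1 avoids division by zero
    return gamma * num_citations / (age ** delta)


def trend_score(citing_years, current_year, gamma=4.0, delta=1.0):
    """Each citation is discounted by the age of the citing article."""
    return gamma * sum(1.0 / ((current_year - y + 1) ** delta) for y in citing_years)


def score_based_h_index(scores):
    """Largest h such that h articles have a score of at least h."""
    ranked = sorted(scores, reverse=True)
    h = 0
    for rank, score in enumerate(ranked, start=1):
        if score >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical (citations, publication year) pairs for one author, evaluated in 2012.
papers = [(10, 2012), (20, 2003), (6, 2009), (2, 2005)]
scores = [contemporary_score(c, y, 2012) for c, y in papers]
print(scores)                       # contemporary score of each paper
print(score_based_h_index(scores))  # contemporary h-index of the author
```

Note how the 20-citation paper from 2003 scores lower than the 10-citation paper from 2012: the decay term rewards recent work.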

Scientometrics Computation
 Easy for small datasets
 For h-index we just need to identify the
articles of each researcher and enumerate all
their incoming citations.
 However, for large datasets the computation
becomes more complex:
 The data (authors, citations, and metadata)
do not fit in main memory.
 Tens of millions of articles and authors

MapReduce (1)
 MapReduce: A fault-tolerant framework for
distributing problems to large clusters.
 The input data is split into chunks; each chunk is
processed by a single Worker process (Map).
 Mapper outputs are written to intermediate files.
 Subsequently, the Reducers process and merge
the Mappers’ outputs.
 The data is formatted as key-value pairs.

MapReduce (2)
 MapReduce algorithms are expressed by
writing only two functions:
 Map: $map(k_1, v_1) \rightarrow list(k_2, v_2)$
 Reduce: $reduce(k_2, list(v_2)) \rightarrow list(k_3, v_3)$
 MapReduce jobs are deployed on top of
a distributed file system which serves files,
replicates data, and transparently addresses
hardware failures.

Combiners
 An optional component standing between the
Mappers and the Reducers.
 It serves as a mini-Reducer; it merges the
values of the same keys for each Map Job.
 It reduces the data exchanged among system
nodes, thus saving bandwidth.
 Implementations
 Explicit: Declare a Combine function
 In-Mapper: merge the values of the same key
within map.
• Guaranteed execution!
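
The effect of in-mapper combining on the number of emitted records can be sketched as below. The example counts citations per cited paper within one input chunk; the function names and the chunk contents are hypothetical.

```python
from collections import defaultdict


def map_plain(reference_lists):
    """Emit one (cited paper, 1) record per reference."""
    return [(cited, 1) for refs in reference_lists for cited in refs]


def map_with_in_mapper_combiner(reference_lists):
    """Merge the values of equal keys inside map; emit each key once."""
    partial = defaultdict(int)
    for refs in reference_lists:
        for cited in refs:
            partial[cited] += 1
    return sorted(partial.items())


# Hypothetical chunk: the reference lists of three papers.
chunk = [["p2", "p3"], ["p3"], ["p3", "p2"]]
plain = map_plain(chunk)
combined = map_with_in_mapper_combiner(chunk)
print(len(plain), len(combined))  # records emitted without / with combining
```

The trade-off is memory: the in-mapper variant has to hold the partial results of a whole chunk before it emits anything.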

Parallelizing the Problem
 Goal: Compute Scientometrics in parallel
 Input: $(p_i, C^{p_i})$ = (paperID, paperContent)
 Output: $(a, M_x^{a})$ = (author, metric)
 To reach our goal, we have to construct, for
each author, a list of his/her articles sorted by
decreasing score:
 $(a, SortedList[(p_1, S_x^{p_1}), (p_2, S_x^{p_2}), \ldots, (p_N, S_x^{p_N})])$
 Then, we just iterate through the list and
compute the desired metric value.

Methods 1,2 – Map Phase
 Method 1 emits: (author, <paper,score>)
 Method 2 emits: (<author,paper>, score)
 Note: We emit the authors of the references,
not the authors of the citing paper.

Method 1, Reduce Phase
 Reducer Input: (author, pair <paper,score>)
 Create an associative
array which stores for
each author, a list of
<paper,score> pairs.
 Sum partial paper scores
 Sort the array by
descending score.
 Compute metric.
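
A minimal sketch of Method 1 for the plain h-index follows, where every citation contributes a partial score of 1. The layout of the paper content (a dict with a `references` list, each reference carrying an `id` and its `authors`) is a hypothetical stand-in for the parsed input, and the shuffle step is simulated with a dictionary.

```python
from collections import defaultdict


def map_method1(paper_id, content):
    """Emit (author, <paper, score>) for each author of each referenced paper."""
    for ref in content["references"]:
        for author in ref["authors"]:
            yield author, (ref["id"], 1)   # partial score: one citation


def reduce_method1(author, pairs):
    """Sum partial scores per paper, sort by descending score, compute the metric."""
    totals = defaultdict(int)
    for cited_paper, partial in pairs:
        totals[cited_paper] += partial
    ranked = sorted(totals.values(), reverse=True)
    # Scores fall while ranks rise, so counting the passing ranks yields h.
    h = sum(1 for rank, score in enumerate(ranked, start=1) if score >= rank)
    return author, h


# Hypothetical input: two citing papers, each referencing p1 and p2.
refs = [{"id": "p1", "authors": ["alice"]}, {"id": "p2", "authors": ["alice", "bob"]}]
papers = {"p3": {"references": refs}, "p4": {"references": refs}}

shuffled = defaultdict(list)               # simulated shuffle: group values by author
for pid, content in papers.items():
    for author, pair in map_method1(pid, content):
        shuffled[author].append(pair)

result = dict(reduce_method1(a, pairs) for a, pairs in shuffled.items())
print(result)
```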

Method 2, Reduce Phase
 Reducer Input: (pair <author,paper>, score)
 Keys sorted by author and paper (secondary
sort).
 We create an associative array which stores for each
author, a list of <paper,score> pairs.
 Compute metric.
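
The secondary-sort idea can be sketched as follows: because records arrive ordered by the composite key, the scores of one paper are adjacent and can be summed while streaming, without a per-paper lookup table. Sorting is simulated here with `sorted`; in a real framework a partitioner would also have to route all keys of one author to the same Reducer, which this sketch does not model. All names and sample records are hypothetical.

```python
from itertools import groupby


def reduce_method2(sorted_records):
    """sorted_records: ((author, paper), score) tuples ordered by composite key."""
    results = {}
    # Outer grouping: consecutive records of the same author.
    for author, author_group in groupby(sorted_records, key=lambda r: r[0][0]):
        paper_scores = []
        # Inner grouping: consecutive records of the same paper.
        for _, paper_group in groupby(author_group, key=lambda r: r[0][1]):
            paper_scores.append(sum(score for _, score in paper_group))
        paper_scores.sort(reverse=True)
        results[author] = sum(
            1 for rank, s in enumerate(paper_scores, start=1) if s >= rank
        )
    return results


# Hypothetical mapper output, deliberately unordered.
records = [
    (("bob", "p2"), 1), (("alice", "p1"), 1), (("alice", "p2"), 1),
    (("alice", "p1"), 1), (("bob", "p2"), 1), (("alice", "p2"), 1),
]
result = reduce_method2(sorted(records))
print(result)
```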

Method 1-C, Map-Reduce
 In-Mapper Combiner (unique keys).
 map emits author as key, and a list
of <paper,score> pairs as value.
 The Reducer merges the lists
associated with the same key.
 The list is sorted by score
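
Method 1-C can be sketched in the same style: each map task keeps one list of <paper,score> pairs per unique author and emits it once, and the Reducer merges the lists that arrive from different map tasks. The input layout (`references`, `id`, `authors`) and the function names are hypothetical, as in any sketch of this kind.

```python
from collections import defaultdict


def map_method1c(chunk):
    """chunk: iterable of (paper_id, content). Emit each author once per map task."""
    per_author = defaultdict(lambda: defaultdict(int))
    for _, content in chunk:
        for ref in content["references"]:
            for author in ref["authors"]:
                per_author[author][ref["id"]] += 1   # combine inside the mapper
    for author, scores in per_author.items():
        yield author, list(scores.items())           # one list of <paper, score> pairs


def reduce_method1c(author, lists):
    """Merge the per-task lists of one author and sort by descending score."""
    merged = defaultdict(int)
    for pairs in lists:
        for paper, score in pairs:
            merged[paper] += score
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return author, ranked


# Two hypothetical input chunks, each processed by its own map task.
chunk_a = [
    ("p3", {"references": [{"id": "p1", "authors": ["alice"]},
                           {"id": "p2", "authors": ["alice"]}]}),
    ("p4", {"references": [{"id": "p1", "authors": ["alice"]}]}),
]
chunk_b = [("p5", {"references": [{"id": "p1", "authors": ["alice"]}]})]

out_a = list(map_method1c(chunk_a))
out_b = list(map_method1c(chunk_b))
merged = reduce_method1c("alice", [out_a[0][1], out_b[0][1]])
print(len(out_a), merged)
```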

Method 2-C, Map-Reduce
 In-Mapper Combiner (unique keys).
 map emits <author,paper> pairs
as key, and <score> as value.
 The Reducer merges the lists
associated with the same key.
 The list is sorted by score

Experiments
 We applied our algorithms to the CiteSeerX
dataset, an open repository comprising 1.8
million research articles.
 We used the XML version of the dataset.
 Total input size: ~28 GB.
 Small, but the largest publicly available.

MapReduce I/O Sizes
 The methods which employ Combiners
produce less intermediate data.
 Method 1-C: The Mappers produce 21.7 million
key-value records (gain ~41%). Total output
size = 600 MB (~13% less bandwidth).
 Method 2-C: 34.2 million records and 643 MB
(gain ~7%).

Running Times
 We also measured the running times of the
four methods on two clusters:
 A local, small cluster comprising 8 Core i7
processing cores.
 A commercial Web cloud infrastructure with up
to 40 processing cores.
 On the first cluster, we replicated the input
data across all nodes. In the second case, we
are not aware of the physical location of the data.

Running Times
 All four methods have the same computational
complexity.
 We expect timings proportional to the size of
the data exchanged among the Mappers and
the Reducers.
 This favors Methods 1-C and 2-C, which
limit data transfers via the Combiners.

Running Times – Local Cluster
 All methods scale well with
the size of the cluster
 Method 1-C is the fastest
 It outperforms method
2-C by ~18%
 It outperforms method 1
by 30-35%

Running Times – Web Cluster
 We repeated the
experiment on a Web
cloud infrastructure.
 Running times between
the two clusters are not
comparable.
 Different hardware
and architecture
 Method 1-C is still the fastest

Conclusions
 We studied the problem of computing author
evaluation metrics (scientometrics) in large
academic search engines with MapReduce.
 We introduced four approaches.
 We showed that the most efficient strategy is to
create one list of <paper,score> pairs for
each unique author during the map phase.
 In this way, we reduce running times by at least
20% and bandwidth usage by ~13%.
