Data-Intensive Computing for Text Analysis
                 CS395T / INF385T / LIN386M
            University of Texas at Austin, Fall 2011



                      Lecture 2
                  September 1, 2011

        Jason Baldridge                      Matt Lease
   Department of Linguistics           School of Information
  University of Texas at Austin    University of Texas at Austin
Jasonbaldridge at gmail dot com   ml at ischool dot utexas dot edu
Acknowledgments
      Course design and slides derived from
     Jimmy Lin’s cloud computing courses at
    the University of Maryland, College Park

Some figures courtesy of
• Chuck Lam’s Hadoop In Action (2011)
• Tom White’s Hadoop: The Definitive Guide,
  2nd Edition (2010)
Roots in Functional Programming




   [Diagram: Map applies a function f independently to each input element; Fold aggregates the results by repeatedly applying a function g with an accumulator]
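These functional roots can be sketched in plain Python (an illustrative example, not Hadoop code): `map` applies f to every element independently, so each application is parallelizable, while `reduce` (fold) threads an accumulator through g.

```python
from functools import reduce

# Map: apply f to each element independently -- embarrassingly parallel
squares = list(map(lambda x: x * x, [1, 2, 3, 4, 5]))

# Fold: aggregate with g, threading an accumulator left to right
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 55
```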
Divide and Conquer

   [Diagram: a large “Work” item is partitioned into pieces w1, w2, w3; each is processed by a “worker” producing results r1, r2, r3, which are combined into the final “Result”]
MapReduce
“Big Ideas”
    Scale “out”, not “up”
        Limits of SMP and large shared-memory machines
    Move processing to the data
        Clusters have limited bandwidth
    Process data sequentially, avoid random access
        Seeks are expensive, disk throughput is reasonable
    Seamless scalability
        From the mythical man-month to the tradable machine-hour
Typical Large-Data Problem
          Iterate over a large number of records
          Compute something of interest from each
          Shuffle and sort intermediate results
          Aggregate intermediate results
          Generate final output


             Key idea: provide a functional abstraction for
             these two operations




(Dean and Ghemawat, OSDI 2004)
MapReduce Data Flow




Courtesy of Chuck Lam’s Hadoop In Action
(2011), pp. 45, 52
MapReduce “Runtime”
   Handles scheduling
       Assigns workers to map and reduce tasks
   Handles “data distribution”
       Moves processes to data
   Handles synchronization
       Gathers, sorts, and shuffles intermediate data
   Handles errors and faults
       Detects worker failures and restarts
   Built on a distributed file system
MapReduce
Programmers specify two functions
   map ( K1, V1 ) → list ( K2, V2 )
   reduce ( K2, list(V2) ) → list ( K3, V3)
Note correspondence of types map output → reduce input


Data Flow
      Input → “input splits”: each a sequence of logical (K1,V1) “records”
      Map
        • Each split is processed by a single map task
        • map invoked iteratively: once per record in the split
        • For each record processed, map may emit 0-N (K2,V2) pairs

      Reduce
        • reduce invoked iteratively for each ( K2, list(V2) ) intermediate value
        • For each such tuple processed, reduce may emit 0-N (K3,V3) pairs
      Each reducer’s output written to a persistent file in HDFS
[Diagram: the InputFormat divides each Input File into InputSplits; a RecordReader turns each split into (K1,V1) records consumed by a Mapper, which emits Intermediates]
Source: redrawn from a slide by Cloudera, cc-licensed
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 30

Data Flow




     Input → “input splits”: each a sequence of logical (K1,V1) “records”
     For each split, for each record, do map(K1,V1)       (multiple calls)
     Each map call may emit any number of (K2,V2) pairs             (0-N)
Run-time
     Groups all values with the same key into ( K2, list(V2) )
     Determines which reducer will process each key
     Copies data across the network as needed to that reducer
     Ensures intra-node sort of keys processed by each reducer
       • No guarantee by default of inter-node total sort across reducers
“Hello World”: Word Count
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer)



    Map(String docid, String text):
      for each word w in text:
          Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
          sum += v;
      Emit(term, sum);
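The pseudocode above can be simulated end to end in a few lines of Python — a single-process sketch of map → shuffle/sort → reduce, not actual Hadoop code; `map_fn`, `shuffle`, and `reduce_fn` are illustrative names:

```python
from collections import defaultdict

def map_fn(docid, text):
    """map(K1=docid, V1=text) -> list of (K2=word, V2=1) pairs."""
    return [(w, 1) for w in text.split()]

def shuffle(pairs):
    """Group all values with the same key into (K2, list(V2))."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_fn(term, values):
    """reduce(K2, list(V2)) -> (K3=term, V3=sum)."""
    return (term, sum(values))

docs = {"d1": "a b a", "d2": "b c"}
intermediate = [p for docid, text in docs.items() for p in map_fn(docid, text)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(intermediate).items())
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```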
[Diagram: four map tasks emit (key, count) pairs from their input records, e.g. (a,1) (b,2) (c,3) (c,6) (a,5) (c,2) (b,7) (c,8); Shuffle and Sort aggregates values by key into a:[1,5], b:[2,7], c:[2,3,6,8]; three reduce tasks then emit the final (key, sum) results]
Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52
Partition
   Given:     map ( K1, V1 ) → list ( K2, V2 )
             reduce ( K2, list(V2) ) → list ( K3, V3)

partition (K2, N) → Rj maps K2 to some reducer Rj in [1..N]

       Each distinct key (with associated values) sent to a single reducer
         • Same reduce node may process multiple keys in separate reduce() calls

       Balances workload across reducers: roughly equal number of keys to each
         • Default: simple hash of the key, e.g., hash(k’) mod N (# reducers)

       Customizable
         • Some keys require more computation than others
             • e.g. value skew, or key-specific computation performed
             • For skew, sampling can dynamically estimate distribution & set partition
         • Secondary/Tertiary sorting (e.g. bigrams or arbitrary n-grams)?
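The default hash partitioner can be sketched in Python (an illustrative stand-in: Hadoop actually uses the key's Java `hashCode()` mod the number of reduce tasks; `crc32` is used here only because it is stable across runs):

```python
import zlib

def partition(key, num_reducers):
    """Default-style partitioner: stable hash of the key, mod N reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["apple", "banana", "cherry", "apple"]
assignments = {k: partition(k, 3) for k in keys}

# The same key always lands on the same reducer, so all of a key's
# values can be grouped there.
assert assignments["apple"] == partition("apple", 3)
print(assignments)
```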
Secondary Sorting (Lin 57, White 241)
    How to output sorted bigrams (1st word, then list of 2nds)?
        What if we use word1 as the key, word2 as the value?
        What if we use <first>--<second> as the key?
    Pattern
        Create a composite key of (first, second)
        Define a Key Comparator based on both words
          • This will produce the sort order we want (aa ab ac ba bb bc ca cb…)
        Define a partition function based only on first word
          • All bigrams with the same first word go to same reducer
          • How do you know when the first word changes across invocations?
        Preserve state in the reducer across invocations
          • Will be called separately for each bigram, but we want to remember
            the current first word across bigrams seen
        Hadoop also provides Group Comparator
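The composite-key pattern can be simulated in single-process Python (a sketch, not Hadoop code): sort on the full (first, second) key, partition on the first word only, and group reduce calls by first word — here `itertools.groupby` plays the role of Hadoop's Group Comparator.

```python
import zlib
from itertools import groupby

bigrams = [("a", "c"), ("b", "a"), ("a", "b"), ("b", "c"), ("a", "a")]
num_reducers = 2

def partition(bigram, n):
    """Partition on the FIRST word only, so all (first, *) pairs
    reach the same reducer."""
    return zlib.crc32(bigram[0].encode()) % n

results = {}
for r in range(num_reducers):
    # Key comparator: sort this reducer's input on the full composite key
    mine = sorted(b for b in bigrams if partition(b, num_reducers) == r)
    # Group comparator: one reduce() call per distinct first word
    for first, group in groupby(mine, key=lambda b: b[0]):
        results[first] = [second for _, second in group]

print(results)  # {'a': ['a', 'b', 'c'], 'b': ['a', 'c']}
```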
Combine
   Given:      map ( K1, V1 ) → list ( K2, V2 )
             reduce ( K2, list(V2) ) → list ( K3, V3)

combine ( K2, list(V2) ) → list ( K2, V2 )

   Optional optimization
       Local aggregation to reduce network traffic
       No guarantee it will be used, how many times it will be called
       Semantics of program cannot depend on its use
   Signature: same input as reduce, same output as map
       Combine may be run repeatedly on its own output
        Lin: associative & commutative ⇒ combiner = reducer
         • See next slide
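A word-count combiner illustrates why the semantics must not depend on how many times it runs — since summing is associative and commutative, applying the combiner once or twice yields the same aggregate (a Python sketch, not Hadoop code):

```python
from collections import defaultdict

def combine(pairs):
    """Local aggregation: takes (K2, V2) pairs, emits (K2, V2) pairs --
    same input shape as reduce, same output shape as map."""
    sums = defaultdict(int)
    for k, v in pairs:
        sums[k] += v
    return list(sums.items())

map_output = [("a", 1), ("b", 2), ("c", 3), ("c", 6)]
once = combine(map_output)
twice = combine(combine(map_output))  # safe to run on its own output

assert sorted(once) == sorted(twice) == [("a", 1), ("b", 2), ("c", 9)]
```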
Functional Properties
    Associative: f( a, f(b,c) ) = f( f(a,b), c )
        Grouping of operations doesn’t matter
        YES: Addition, multiplication, concatenation
        NO: division, subtraction, NAND
        NAND(1, NAND(1,0)) = 0 != 1 = NAND( NAND(1,0), 0 )
    Commutative: f(a,b) = f(b,a)
        Ordering of arguments doesn’t matter
        YES: addition, multiplication, NAND
        NO: division, subtraction, concatenation
        Concatenate(“a”,“b”) != Concatenate(“b”,“a”)
    Distributive
        White (p. 32) and Lam (p. 84) mention with regard to combiners
        But really, go with associative + commutative in Lin (pp. 20, 27)
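The NAND counterexample above is easy to verify exhaustively in Python — it fails the associativity check on the very inputs shown, while passing commutativity:

```python
def nand(a, b):
    """Boolean NAND on 0/1 inputs."""
    return 1 - (a & b)

# Associative?  f(a, f(b, c)) == f(f(a, b), c) for all inputs
assoc = all(nand(a, nand(b, c)) == nand(nand(a, b), c)
            for a in (0, 1) for b in (0, 1) for c in (0, 1))

# Commutative?  f(a, b) == f(b, a) for all inputs
comm = all(nand(a, b) == nand(b, a) for a in (0, 1) for b in (0, 1))

print(assoc, comm)  # False True
```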
[Diagram: the same word-count flow, now with a combiner after each map task: the map output (c,3) (c,6) is locally combined into (c,9) before partition and shuffle, so the reducer for c receives [9,2,8] rather than [3,6,2,8]; the other pairs pass through unchanged]
[Diagram: MapReduce execution overview — (1) the user program submits a job to the master; (2) the master schedules map and reduce tasks on workers; (3) map workers read input splits; (4) map output is written to local disk; (5) reduce workers remotely read the intermediate files; (6) each reduce worker writes one output file]
Adapted from (Dean and Ghemawat, OSDI 2004)
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178


    Shuffle and 2 Sorts




   As map emits values, local sorting
    runs in tandem (1st sort)
   Combine is optionally called
    0..N times for local aggregation
    on sorted (K2, list(V2)) tuples (more sorting of output)
   Partition determines which (logical) reducer Rj each key will go to
   Node’s TaskTracker tells JobTracker it has keys for Rj
   JobTracker determines node to run Rj based on data locality
   When local map/combine/sort finishes, sends data to Rj’s node
   Rj’s node iteratively merges inputs from map nodes as they arrive (2nd sort)
   For each (K, list(V)) tuple in merged output, call reduce(…)
Distributed File System
    Don’t move data… move computation to the data!
        Store data on the local disks of nodes in the cluster
        Start up the workers on the node that has the data local
    Why?
        Not enough RAM to hold all the data in memory
        Disk access is slow, but disk throughput is reasonable
    A distributed file system is the answer
        GFS (Google File System) for Google’s MapReduce
        HDFS (Hadoop Distributed File System) for Hadoop
GFS: Assumptions
           Commodity hardware over “exotic” hardware
                 Scale “out”, not “up”
           High component failure rates
                 Inexpensive commodity components fail all the time
           “Modest” number of huge files
                 Multi-gigabyte files are common, if not encouraged
           Files are write-once, mostly appended to
                 Perhaps concurrently
           Large streaming reads over random access
                 High sustained throughput over low latency




GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
GFS: Design Decisions
   Files stored as chunks
        Fixed size (64MB)
   Reliability through replication
        Each chunk replicated across 3+ chunkservers
   Single master to coordinate access, keep metadata
        Simple centralized management
   No data caching
        Little benefit due to large datasets, streaming reads
   Simplify the API
        Push some of the issues onto the client (e.g., data layout)

    HDFS = GFS clone (same basic ideas)
Basic Cluster Components
   1 “Manager” node (can be split onto 2 nodes)
       Namenode (NN)
       Jobtracker (JT)
   1-N “Worker” nodes
       Tasktracker (TT)
       Datanode (DN)
   Optional Secondary Namenode
       Periodic backups of Namenode in case of failure
Hadoop Architecture




   Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 24-25
Namenode Responsibilities
   Managing the file system namespace:
       Holds file/directory structure, metadata, file-to-block mapping,
        access permissions, etc.
   Coordinating file operations:
       Directs clients to datanodes for reads and writes
       No data is moved through the namenode
   Maintaining overall health:
       Periodic communication with the datanodes
       Block re-replication and rebalancing
       Garbage collection
Putting everything together…


[Diagram: the namenode machine runs the namenode daemon, and the job submission node runs the jobtracker; each slave node runs a tasktracker and a datanode daemon on top of the local Linux file system]
Anatomy of a Job
   MapReduce program in Hadoop = Hadoop job
       Jobs are divided into map and reduce tasks (+ more!)
       An instance of running a task is called a task attempt
       Multiple jobs can be composed into a workflow
   Job submission process
       Client (i.e., driver program) creates a job, configures it, and
        submits it to the jobtracker
       JobClient computes input splits (on client end)
       Job data (jar, configuration XML) are sent to JobTracker
       JobTracker puts job data in shared location, enqueues tasks
       TaskTrackers poll for tasks
       Off to the races…
Why have 1 API when you can have 2?
White pp. 25-27, Lam pp. 77-80
   Hadoop 0.19 and earlier had “old API”
   Hadoop 0.21 and forward has “new API”
   Hadoop 0.20 has both!
       Old API most stable, but deprecated
       Current books use old API predominantly, but discuss changes
         • Example code using new API available online from publisher
       Some old API classes/methods not yet ported to new API
       Cloud9 uses both, and you can too
Old API
   Mapper (interface)
       void map(K1 key, V1 value, OutputCollector<K2, V2> output,
        Reporter reporter)
       void configure(JobConf job)
       void close() throws IOException
   Reducer/Combiner
       void reduce(K2 key, Iterator<V2> values,
        OutputCollector<K3,V3> output, Reporter reporter)
       void configure(JobConf job)
       void close() throws IOException
   Partitioner
        int getPartition(K2 key, V2 value, int numPartitions)
New API
   org.apache.hadoop.mapred now deprecated; instead use
    org.apache.hadoop.mapreduce &
    org.apache.hadoop.mapreduce.lib
   Mapper, Reducer now abstract classes, not interfaces
   Use Context instead of OutputCollector and Reporter
       Context.write(), not OutputCollector.collect()
   Reduce takes value list as Iterable, not Iterator
        Can use Java’s for-each syntax for iterating
   Can throw InterruptedException as well as IOException
   JobConf & JobClient replaced by Configuration & Job
Matthew Lease
 
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Matthew Lease
 
AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd
Matthew Lease
 
Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation
Matthew Lease
 
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Matthew Lease
 
But Who Protects the Moderators?
But Who Protects the Moderators?But Who Protects the Moderators?
But Who Protects the Moderators?
Matthew Lease
 
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Matthew Lease
 
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Matthew Lease
 
Fact Checking & Information Retrieval
Fact Checking & Information RetrievalFact Checking & Information Retrieval
Fact Checking & Information Retrieval
Matthew Lease
 
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Matthew Lease
 
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
Matthew Lease
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Matthew Lease
 
Systematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s ClothingSystematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s Clothing
Matthew Lease
 
The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)
Matthew Lease
 
The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016
Matthew Lease
 
The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)
Matthew Lease
 
Toward Better Crowdsourcing Science
 Toward Better Crowdsourcing Science Toward Better Crowdsourcing Science
Toward Better Crowdsourcing Science
Matthew Lease
 
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work PlatformsBeyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Matthew Lease
 
Ad

Recently uploaded (20)

MEMS IC Substrate Technologies Guide 2025.pptx
MEMS IC Substrate Technologies Guide 2025.pptxMEMS IC Substrate Technologies Guide 2025.pptx
MEMS IC Substrate Technologies Guide 2025.pptx
IC substrate Shawn Wang
 
React Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for SuccessReact Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for Success
Amelia Swank
 
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptxUiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
anabulhac
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
Understanding SEO in the Age of AI.pdf
Understanding SEO in the Age of AI.pdfUnderstanding SEO in the Age of AI.pdf
Understanding SEO in the Age of AI.pdf
Fulcrum Concepts, LLC
 
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
Toru Tamaki
 
ACE Aarhus - Team'25 wrap-up presentation
ACE Aarhus - Team'25 wrap-up presentationACE Aarhus - Team'25 wrap-up presentation
ACE Aarhus - Team'25 wrap-up presentation
DanielEriksen5
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
Ivano Malavolta
 
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
DNF 2.0 Implementations Challenges in Nepal
DNF 2.0 Implementations Challenges in NepalDNF 2.0 Implementations Challenges in Nepal
DNF 2.0 Implementations Challenges in Nepal
ICT Frame Magazine Pvt. Ltd.
 
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdfICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
Eryk Budi Pratama
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
Slack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teamsSlack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teams
Nacho Cougil
 
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Gary Arora
 
Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)
Kaya Weers
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
MEMS IC Substrate Technologies Guide 2025.pptx
MEMS IC Substrate Technologies Guide 2025.pptxMEMS IC Substrate Technologies Guide 2025.pptx
MEMS IC Substrate Technologies Guide 2025.pptx
IC substrate Shawn Wang
 
React Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for SuccessReact Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for Success
Amelia Swank
 
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptxUiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
anabulhac
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
Understanding SEO in the Age of AI.pdf
Understanding SEO in the Age of AI.pdfUnderstanding SEO in the Age of AI.pdf
Understanding SEO in the Age of AI.pdf
Fulcrum Concepts, LLC
 
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
Toru Tamaki
 
ACE Aarhus - Team'25 wrap-up presentation
ACE Aarhus - Team'25 wrap-up presentationACE Aarhus - Team'25 wrap-up presentation
ACE Aarhus - Team'25 wrap-up presentation
DanielEriksen5
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
Ivano Malavolta
 
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdfICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
Eryk Budi Pratama
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
Slack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teamsSlack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teams
Nacho Cougil
 
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Gary Arora
 
Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)
Kaya Weers
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 

Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

  • 1. Data-Intensive Computing for Text Analysis
    CS395T / INF385T / LIN386M
    University of Texas at Austin, Fall 2011
    Lecture 2, September 1, 2011
    Jason Baldridge, Department of Linguistics (jasonbaldridge at gmail dot com)
    Matt Lease, School of Information (ml at ischool dot utexas dot edu)
  • 2. Acknowledgments
    Course design and slides derived from Jimmy Lin’s cloud computing courses
    at the University of Maryland, College Park.
    Some figures courtesy of:
    - Chuck Lam’s Hadoop in Action (2011)
    - Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
  • 3. Roots in Functional Programming
    [Figure: Map applies a function f independently to each list element;
    Fold aggregates the elements with a function g.]
  • 4. Divide and Conquer
    [Figure: the “work” is partitioned among workers w1..w3; each “worker”
    produces a result r1..r3; the results are combined into the final “result”.]
  • 6. “Big Ideas”
    - Scale “out”, not “up”: limits of SMP and large shared-memory machines
    - Move processing to the data: clusters have limited bandwidth
    - Process data sequentially, avoid random access: seeks are expensive, but disk throughput is reasonable
    - Seamless scalability: from the mythical man-month to the tradable machine-hour
  • 7. Typical Large-Data Problem
    - Iterate over a large number of records
    - Compute something of interest from each
    - Shuffle and sort intermediate results
    - Aggregate intermediate results
    - Generate final output
    Key idea: provide a functional abstraction for these two operations
    (Dean and Ghemawat, OSDI 2004)
  • 8. MapReduce Data Flow
    [Figure courtesy of Chuck Lam’s Hadoop in Action (2011), pp. 45, 52]
  • 9. MapReduce “Runtime”
    - Handles scheduling: assigns workers to map and reduce tasks
    - Handles “data distribution”: moves processes to data
    - Handles synchronization: gathers, sorts, and shuffles intermediate data
    - Handles errors and faults: detects worker failures and restarts
    - Built on a distributed file system
  • 10. MapReduce
    Programmers specify two functions:
      map ( K1, V1 ) → list ( K2, V2 )
      reduce ( K2, list(V2) ) → list ( K3, V3 )
    Note the correspondence of types: map output → reduce input.
    Data flow:
    - Input → “input splits”: each a sequence of logical (K1, V1) “records”
    - Map
      - Each split is processed entirely by the same map node
      - map is invoked iteratively: once per record in the split
      - For each record processed, map may emit 0-N (K2, V2) pairs
    - Reduce
      - reduce is invoked iteratively: once per intermediate ( K2, list(V2) ) value
      - For each value processed, reduce may emit 0-N (K3, V3) pairs
    - Each reducer’s output is written to a persistent file in HDFS
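The type signatures above can be sketched as a minimal in-memory simulation. This is illustrative only: `run_mapreduce` and its parameter names are invented for this sketch, and a real Hadoop job distributes each phase across many nodes.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """In-memory sketch of the MapReduce contract.

    records   : iterable of (K1, V1) input records
    map_fn    : (K1, V1) -> list of (K2, V2) pairs (may be empty)
    reduce_fn : (K2, [V2]) -> list of (K3, V3) pairs (may be empty)
    """
    # Map phase: each record may emit zero or more intermediate pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))

    # Shuffle and sort: group all values sharing the same intermediate key.
    intermediate.sort(key=itemgetter(0))
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        output.extend(reduce_fn(k2, values))
    return output
```

For example, a word count passes a tokenizing map and a summing reduce:
`run_mapreduce([("d1", "a b a")], lambda k, v: [(w, 1) for w in v.split()], lambda k, vs: [(k, sum(vs))])`.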
  • 11. [Figure: input pipeline. Input files are divided into InputSplits; an
    InputFormat supplies a RecordReader for each split, which feeds records to
    a Mapper, producing intermediates. Source: redrawn from a slide by
    Cloudera, cc-licensed]
  • 12. Data Flow (figure courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 30)
    - Input → “input splits”: each a sequence of logical (K1, V1) “records”
    - For each split, for each record, call map(K1, V1) (multiple calls)
    - Each map call may emit any number of (K2, V2) pairs (0-N)
    Run-time:
    - Groups all values with the same key into ( K2, list(V2) )
    - Determines which reducer will process each key
    - Copies data across the network as needed for the reducer
    - Ensures an intra-node sort of the keys processed by each reducer
      - No guarantee by default of an inter-node total sort across reducers
  • 13. “Hello World”: Word Count
    map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )
    reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )

    Map(String docid, String text):
      for each word w in text:
        Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
        sum += v;
      Emit(term, sum);
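The slide's pseudocode can be run directly as a small Python sketch. The shuffle is simulated with a dictionary; function names and the sample documents are invented for this illustration.

```python
from collections import defaultdict

def map_fn(docid, text):
    # Map: emit (word, 1) for every token in the document.
    return [(w, 1) for w in text.split()]

def reduce_fn(term, values):
    # Reduce: sum the partial counts for one term.
    return (term, sum(values))

# Simulated shuffle and sort: group intermediate values by key.
docs = [("d1", "to be or not to be"), ("d2", "to do")]
grouped = defaultdict(list)
for docid, text in docs:
    for k, v in map_fn(docid, text):
        grouped[k].append(v)

counts = dict(reduce_fn(k, vs) for k, vs in grouped.items())
# counts now maps each word to its total frequency across both documents.
```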
  • 14. [Figure: word-count data flow. Map tasks emit pairs such as (a,1),
    (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8); shuffle and sort
    aggregates values by key, e.g. a → [1,5], b → [2,7], c → [2,3,6,8];
    reduce then produces the final (key, sum) pairs. Courtesy of Chuck Lam’s
    Hadoop in Action (2011), pp. 45, 52]
  • 15. Partition
    Given:
      map ( K1, V1 ) → list ( K2, V2 )
      reduce ( K2, list(V2) ) → list ( K3, V3 )
      partition ( K2, N ) → Rj: maps K2 to some reducer Rj in [1..N]
    - Each distinct key (with its associated values) is sent to a single reducer
      - The same reduce node may process multiple keys in separate reduce() calls
    - Balances workload across reducers: an equal number of keys to each
      - Default: a simple hash of the key, e.g., hash(k’) mod N (# reducers)
    - Customizable
      - Some keys require more computation than others, e.g., value skew or key-specific computation
      - For skew, sampling can dynamically estimate the distribution and set the partition
      - Secondary/tertiary sorting (e.g., bigrams or arbitrary n-grams)?
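A minimal sketch of the default hash(k’) mod N scheme described above. The function name is invented, and crc32 stands in as a stable hash (Python's built-in string hash is salted per process, which would make partition assignments differ across runs).

```python
import zlib

def default_partition(key, num_reducers):
    # Default-style partitioner: a stable hash of the key modulo the
    # number of reducers, so every record carrying the same key is
    # routed to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```

A custom partitioner replaces only this function, e.g. to route keys by a prefix or to spread known heavy keys across reducers.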
  • 16. Secondary Sorting (Lin 57, White 241)
    - How to output sorted bigrams (1st word, then a sorted list of 2nd words)?
      - What if we use word1 as the key and word2 as the value?
      - What if we use <first>--<second> as the key?
    - Pattern:
      - Create a composite key of (first, second)
      - Define a key comparator based on both words
        - This produces the sort order we want (aa ab ac ba bb bc ca cb …)
      - Define a partition function based only on the first word
        - All bigrams with the same first word go to the same reducer
        - How do you know when the first word changes across invocations?
      - Preserve state in the reducer across invocations
        - reduce() is called separately for each bigram, but we want to remember the current first word across the bigrams seen
    - Hadoop also provides a Group Comparator
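The composite-key pattern above can be sketched as follows. Sorting and `groupby` stand in for Hadoop's key comparator and per-key reduce() invocations; the variable names and sample bigram counts are invented for this sketch.

```python
from itertools import groupby
import zlib

# Bigram counts keyed by a composite (first, second) key, in arrival order.
pairs = [(("a", "b"), 2), (("b", "a"), 1), (("a", "a"), 4), (("b", "c"), 3)]

def partition(key, num_reducers):
    # Partition on the FIRST word only, so all bigrams sharing a first
    # word reach the same reducer.
    return zlib.crc32(key[0].encode("utf-8")) % num_reducers

# Key comparator: sort on the full composite key, giving the order
# aa ab ac ba bb ... within each reducer's input.
sorted_pairs = sorted(pairs)

# Reducer-side pattern: detect when the first word changes across
# reduce() invocations (simulated here with groupby) and emit one
# sorted list of (second word, count) per first word.
grouped = [(first, [(k[1], v) for k, v in grp])
           for first, grp in groupby(sorted_pairs, key=lambda kv: kv[0][0])]
```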
  • 17. Combine
    Given:
      map ( K1, V1 ) → list ( K2, V2 )
      reduce ( K2, list(V2) ) → list ( K3, V3 )
      combine ( K2, list(V2) ) → list ( K2, V2 )
    - Optional optimization: local aggregation to reduce network traffic
    - No guarantee it will be used, or how many times it will be called
      - The semantics of the program cannot depend on its use
    - Signature: same input as reduce, same output as map
      - combine may be run repeatedly on its own output
    - Lin: if reduce is associative and commutative, then combiner = reducer
      (see next slide)
  • 18. Functional Properties
    - Associative: f( a, f(b, c) ) = f( f(a, b), c )
      - Grouping of operations doesn’t matter
      - YES: addition, multiplication, concatenation
      - NO: division, subtraction, NAND
        - NAND(1, NAND(1,0)) = 0 != 1 = NAND( NAND(1,0), 0 )
    - Commutative: f(a, b) = f(b, a)
      - Ordering of arguments doesn’t matter
      - YES: addition, multiplication, NAND
      - NO: division, subtraction, concatenation
        - concatenate(“a”, “b”) != concatenate(“b”, “a”)
    - Distributive
      - White (p. 32) and Lam (p. 84) mention it with regard to combiners
      - But really, go with associative + commutative as in Lin (pp. 20, 27)
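These properties are exactly what makes a combiner safe: an associative and commutative reduce gives the same answer no matter how the run-time groups or orders the partial aggregations. A small check (the groupings chosen are arbitrary, as a combiner's would be):

```python
from functools import reduce

# Sum is associative and commutative, so pre-summing arbitrary partial
# groups (what a combiner does) cannot change the final answer.
values = [3, 1, 4, 1, 5, 9]
direct = sum(values)                                        # no combiner
partials = [sum(values[:2]), sum(values[2:5]), sum(values[5:])]  # combiner ran
assert sum(partials) == direct

# Subtraction is neither associative nor commutative, so regrouping
# changes the result; it could not be used as a combiner.
left = reduce(lambda a, b: a - b, [10, 3, 2])   # (10 - 3) - 2 = 5
right = 10 - (3 - 2)                            # 10 - (3 - 2) = 9
assert left != right
```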
  • 19. [Figure: word-count data flow with combine and partition steps between
    map and shuffle/sort. Combiners locally aggregate each map task’s output,
    e.g. (c,3) and (c,6) become (c,9) before the shuffle; partition then routes
    each key to its reducer.]
  • 20. [Figure: MapReduce execution overview. (1) The user program submits the
    job to the master; (2) the master schedules map and reduce workers; (3) map
    workers read their input splits; (4) they write intermediate files to local
    disk; (5) reduce workers remotely read the intermediates; (6) reduce workers
    write the output files. Adapted from (Dean and Ghemawat, OSDI 2004)]
  • 21. Shuffle and 2 Sorts (figure courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178)
    - As map emits values, local sorting runs in tandem (1st sort)
    - combine is optionally called 0..N times for local aggregation on the sorted ( K2, list(V2) ) tuples (with further sorting of its output)
    - partition determines which (logical) reducer Rj each key will go to
    - The node’s TaskTracker tells the JobTracker it has keys for Rj
    - The JobTracker determines which node runs Rj based on data locality
    - When the local map/combine/sort finishes, the node sends its data to Rj’s node
    - Rj’s node iteratively merges inputs from the map nodes as the data arrives (2nd sort)
    - For each ( K, list(V) ) tuple in the merged output, reduce(…) is called
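The two sorts compose neatly: each map task ships runs that are already sorted by key, so the reducer only needs a streaming merge, never a full re-sort. A sketch with two invented runs:

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Intermediate outputs of two map tasks, each already sorted by key
# (the 1st sort happened map-side).
run_a = [("apple", 1), ("cat", 2), ("dog", 1)]
run_b = [("apple", 3), ("bird", 1), ("dog", 4)]

# Reducer-side streaming merge of the sorted runs (the 2nd sort) ...
merged = list(heapq.merge(run_a, run_b, key=itemgetter(0)))

# ... after which reduce() can be invoked once per key, in key order.
totals = {k: sum(v for _, v in grp)
          for k, grp in groupby(merged, key=itemgetter(0))}
```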
  • 22. Distributed File System
    - Don’t move data… move computation to the data!
      - Store data on the local disks of nodes in the cluster
      - Start up the workers on the node that has the data local
    - Why?
      - Not enough RAM to hold all the data in memory
      - Disk access is slow, but disk throughput is reasonable
    - A distributed file system is the answer
      - GFS (Google File System) for Google’s MapReduce
      - HDFS (Hadoop Distributed File System) for Hadoop
  • 23. GFS: Assumptions
    - Commodity hardware over “exotic” hardware: scale “out”, not “up”
    - High component failure rates: inexpensive commodity components fail all the time
    - A “modest” number of huge files: multi-gigabyte files are common, if not encouraged
    - Files are write-once, mostly appended to (perhaps concurrently)
    - Large streaming reads over random access
    - High sustained throughput over low latency
    GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
  • 24. GFS: Design Decisions
    - Files stored as chunks: fixed size (64 MB)
    - Reliability through replication: each chunk replicated across 3+ chunkservers
    - Single master to coordinate access and keep metadata: simple centralized management
    - No data caching: little benefit due to large datasets and streaming reads
    - Simplify the API: push some of the issues onto the client (e.g., data layout)
    HDFS = GFS clone (same basic ideas)
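The fixed-size chunking decision is easy to make concrete. A minimal sketch (the function name is invented; real GFS/HDFS also tracks chunk handles and replica placement, which is omitted here):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks, as on the slide

def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    # A file is stored as a sequence of fixed-size chunks; only the
    # last chunk may be shorter. Returns (chunk_index, chunk_length).
    return [(i, min(chunk_size, file_size - offset))
            for i, offset in enumerate(range(0, file_size, chunk_size))]

# A 200 MB file occupies three full 64 MB chunks plus one 8 MB chunk;
# each chunk would then be replicated on 3+ chunkservers.
chunks = split_into_chunks(200 * 1024 * 1024)
```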
  • 25. Basic Cluster Components
    - 1 “manager” node (can be split onto 2 nodes):
      - Namenode (NN)
      - JobTracker (JT)
    - 1-N “worker” nodes:
      - TaskTracker (TT)
      - Datanode (DN)
    - Optional secondary namenode: periodic backups of the namenode in case of failure
  • 26. Hadoop Architecture
    [Figure courtesy of Chuck Lam’s Hadoop in Action (2011), pp. 24-25]
  • 27. Namenode Responsibilities
    - Managing the file system namespace: holds the file/directory structure, metadata, file-to-block mapping, access permissions, etc.
    - Coordinating file operations: directs clients to datanodes for reads and writes; no data is moved through the namenode
    - Maintaining overall health: periodic communication with the datanodes, block re-replication and rebalancing, garbage collection
  • 28. Putting everything together…
    [Figure: the namenode daemon runs on the namenode; the jobtracker runs on
    the job submission node; each slave node runs a tasktracker and a datanode
    daemon on top of its local Linux file system.]
  • 29. Anatomy of a Job
    - A MapReduce program in Hadoop = a Hadoop job
      - Jobs are divided into map and reduce tasks (+ more!)
      - An instance of running a task is called a task attempt
      - Multiple jobs can be composed into a workflow
    - Job submission process:
      - The client (i.e., the driver program) creates a job, configures it, and submits it to the JobTracker
      - JobClient computes the input splits (on the client end)
      - Job data (jar, configuration XML) are sent to the JobTracker
      - The JobTracker puts the job data in a shared location and enqueues the tasks
      - TaskTrackers poll for tasks
      - Off to the races…
  • 30. Why have 1 API when you can have 2? (White pp. 25-27, Lam pp. 77-80)
    - Hadoop 0.19 and earlier had the “old API”
    - Hadoop 0.21 and later has the “new API”
    - Hadoop 0.20 has both!
    - The old API is the most stable, but deprecated
    - Current books use the old API predominantly, but discuss the changes
      - Example code using the new API is available online from the publisher
    - Some old-API classes/methods have not yet been ported to the new API
    - Cloud9 uses both, and you can too
  • 31. Old API
    - Mapper (interface)
      - void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
      - void configure(JobConf job)
      - void close() throws IOException
    - Reducer/Combiner
      - void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
      - void configure(JobConf job)
      - void close() throws IOException
    - Partitioner
      - int getPartition(K2 key, V2 value, int numPartitions)
  • 32. New API
    - org.apache.hadoop.mapred is now deprecated; instead use org.apache.hadoop.mapreduce and org.apache.hadoop.mapreduce.lib
    - Mapper and Reducer are now abstract classes, not interfaces
    - Use Context instead of OutputCollector and Reporter
      - Context.write(), not OutputCollector.collect()
    - reduce takes its value list as an Iterable, not an Iterator
      - Can use Java’s foreach syntax for iterating
    - Methods can throw InterruptedException as well as IOException
    - JobConf and JobClient are replaced by Configuration and Job