Data-Intensive Computing for Text Analysis
                 CS395T / INF385T / LIN386M
            University of Texas at Austin, Fall 2011



                      Lecture 2
                  September 1, 2011

        Jason Baldridge                      Matt Lease
   Department of Linguistics           School of Information
  University of Texas at Austin    University of Texas at Austin
Jasonbaldridge at gmail dot com   ml at ischool dot utexas dot edu
Acknowledgments
      Course design and slides derived from
     Jimmy Lin’s cloud computing courses at
    the University of Maryland, College Park

Some figures courtesy of
• Chuck Lam’s Hadoop In Action (2011)
• Tom White’s Hadoop: The Definitive Guide,
  2nd Edition (2010)
Roots in Functional Programming




   [Diagram: Map applies a function f independently to each input element; Fold aggregates the results by repeatedly applying a function g with an accumulator]
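These functional roots can be sketched in plain Python (an illustrative example, not Hadoop code): `map` applies f to every element independently, so each application is parallelizable, while `reduce` (fold) threads an accumulator through g.

```python
from functools import reduce

# Map: apply f to each element independently -- embarrassingly parallel
squares = list(map(lambda x: x * x, [1, 2, 3, 4, 5]))

# Fold: aggregate with g, threading an accumulator left to right
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 55
```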
Divide and Conquer

   [Diagram: a large “Work” item is partitioned into pieces w1, w2, w3; each is processed by a “worker” producing results r1, r2, r3, which are combined into the final “Result”]
MapReduce
“Big Ideas”
    Scale “out”, not “up”
        Limits of SMP and large shared-memory machines
    Move processing to the data
        Clusters have limited bandwidth
    Process data sequentially, avoid random access
        Seeks are expensive, disk throughput is reasonable
    Seamless scalability
        From the mythical man-month to the tradable machine-hour
Typical Large-Data Problem
          Iterate over a large number of records
          Compute something of interest from each
          Shuffle and sort intermediate results
          Aggregate intermediate results
          Generate final output


             Key idea: provide a functional abstraction for
             these two operations




(Dean and Ghemawat, OSDI 2004)
MapReduce Data Flow




Courtesy of Chuck Lam’s Hadoop In Action
(2011), pp. 45, 52
MapReduce “Runtime”
   Handles scheduling
       Assigns workers to map and reduce tasks
   Handles “data distribution”
       Moves processes to data
   Handles synchronization
       Gathers, sorts, and shuffles intermediate data
   Handles errors and faults
       Detects worker failures and restarts
   Built on a distributed file system
MapReduce
Programmers specify two functions
   map ( K1, V1 ) → list ( K2, V2 )
   reduce ( K2, list(V2) ) → list ( K3, V3)
Note correspondence of types map output → reduce input


Data Flow
      Input → “input splits”: each a sequence of logical (K1,V1) “records”
      Map
        • Each split is processed by a single map task
        • map invoked iteratively: once per record in the split
        • For each record processed, map may emit 0-N (K2,V2) pairs

      Reduce
        • reduce invoked iteratively for each ( K2, list(V2) ) intermediate value
        • For each such tuple processed, reduce may emit 0-N (K3,V3) pairs
      Each reducer’s output written to a persistent file in HDFS
[Diagram: the InputFormat divides each Input File into InputSplits; a RecordReader turns each split into (K1,V1) records consumed by a Mapper, which emits Intermediates]
Source: redrawn from a slide by Cloudera, cc-licensed
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 30

Data Flow




     Input → “input splits”: each a sequence of logical (K1,V1) “records”
     For each split, for each record, do map(K1,V1)       (multiple calls)
     Each map call may emit any number of (K2,V2) pairs             (0-N)
Run-time
     Groups all values with the same key into ( K2, list(V2) )
     Determines which reducer will process each key
     Copies data across the network as needed to that reducer
     Ensures intra-node sort of keys processed by each reducer
       • No guarantee by default of inter-node total sort across reducers
“Hello World”: Word Count
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer)



    Map(String docid, String text):
      for each word w in text:
          Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
          sum += v;
      Emit(term, sum);
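The pseudocode above can be simulated end to end in a few lines of Python — a single-process sketch of map → shuffle/sort → reduce, not actual Hadoop code; `map_fn`, `shuffle`, and `reduce_fn` are illustrative names:

```python
from collections import defaultdict

def map_fn(docid, text):
    """map(K1=docid, V1=text) -> list of (K2=word, V2=1) pairs."""
    return [(w, 1) for w in text.split()]

def shuffle(pairs):
    """Group all values with the same key into (K2, list(V2))."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_fn(term, values):
    """reduce(K2, list(V2)) -> (K3=term, V3=sum)."""
    return (term, sum(values))

docs = {"d1": "a b a", "d2": "b c"}
intermediate = [p for docid, text in docs.items() for p in map_fn(docid, text)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(intermediate).items())
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```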
[Diagram: four map tasks emit (key, count) pairs from their input records, e.g. (a,1) (b,2) (c,3) (c,6) (a,5) (c,2) (b,7) (c,8); Shuffle and Sort aggregates values by key into a:[1,5], b:[2,7], c:[2,3,6,8]; three reduce tasks then emit the final (key, sum) results]
Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52
Partition
   Given:     map ( K1, V1 ) → list ( K2, V2 )
             reduce ( K2, list(V2) ) → list ( K3, V3)

partition (K2, N) → Rj maps K2 to some reducer Rj in [1..N]

       Each distinct key (with associated values) sent to a single reducer
         • Same reduce node may process multiple keys in separate reduce() calls

       Balances workload across reducers: roughly equal number of keys to each
         • Default: simple hash of the key, e.g., hash(k’) mod N (# reducers)

       Customizable
         • Some keys require more computation than others
             • e.g. value skew, or key-specific computation performed
             • For skew, sampling can dynamically estimate distribution & set partition
         • Secondary/Tertiary sorting (e.g. bigrams or arbitrary n-grams)?
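The default hash partitioner can be sketched in Python (an illustrative stand-in: Hadoop actually uses the key's Java `hashCode()` mod the number of reduce tasks; `crc32` is used here only because it is stable across runs):

```python
import zlib

def partition(key, num_reducers):
    """Default-style partitioner: stable hash of the key, mod N reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["apple", "banana", "cherry", "apple"]
assignments = {k: partition(k, 3) for k in keys}

# The same key always lands on the same reducer, so all of a key's
# values can be grouped there.
assert assignments["apple"] == partition("apple", 3)
print(assignments)
```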
Secondary Sorting (Lin 57, White 241)
    How to output sorted bigrams (1st word, then list of 2nds)?
        What if we use word1 as the key, word2 as the value?
        What if we use <first>--<second> as the key?
    Pattern
        Create a composite key of (first, second)
        Define a Key Comparator based on both words
          • This will produce the sort order we want (aa ab ac ba bb bc ca cb…)
        Define a partition function based only on first word
          • All bigrams with the same first word go to same reducer
          • How do you know when the first word changes across invocations?
        Preserve state in the reducer across invocations
          • Will be called separately for each bigram, but we want to remember
            the current first word across bigrams seen
        Hadoop also provides Group Comparator
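The composite-key pattern can be simulated in single-process Python (a sketch, not Hadoop code): sort on the full (first, second) key, partition on the first word only, and group reduce calls by first word — here `itertools.groupby` plays the role of Hadoop's Group Comparator.

```python
import zlib
from itertools import groupby

bigrams = [("a", "c"), ("b", "a"), ("a", "b"), ("b", "c"), ("a", "a")]
num_reducers = 2

def partition(bigram, n):
    """Partition on the FIRST word only, so all (first, *) pairs
    reach the same reducer."""
    return zlib.crc32(bigram[0].encode()) % n

results = {}
for r in range(num_reducers):
    # Key comparator: sort this reducer's input on the full composite key
    mine = sorted(b for b in bigrams if partition(b, num_reducers) == r)
    # Group comparator: one reduce() call per distinct first word
    for first, group in groupby(mine, key=lambda b: b[0]):
        results[first] = [second for _, second in group]

print(results)  # {'a': ['a', 'b', 'c'], 'b': ['a', 'c']}
```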
Combine
   Given:      map ( K1, V1 ) → list ( K2, V2 )
             reduce ( K2, list(V2) ) → list ( K3, V3)

combine ( K2, list(V2) ) → list ( K2, V2 )

   Optional optimization
       Local aggregation to reduce network traffic
       No guarantee it will be used, how many times it will be called
       Semantics of program cannot depend on its use
   Signature: same input as reduce, same output as map
       Combine may be run repeatedly on its own output
        Lin: associative & commutative ⇒ combiner = reducer
         • See next slide
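A word-count combiner illustrates why the semantics must not depend on how many times it runs — since summing is associative and commutative, applying the combiner once or twice yields the same aggregate (a Python sketch, not Hadoop code):

```python
from collections import defaultdict

def combine(pairs):
    """Local aggregation: takes (K2, V2) pairs, emits (K2, V2) pairs --
    same input shape as reduce, same output shape as map."""
    sums = defaultdict(int)
    for k, v in pairs:
        sums[k] += v
    return list(sums.items())

map_output = [("a", 1), ("b", 2), ("c", 3), ("c", 6)]
once = combine(map_output)
twice = combine(combine(map_output))  # safe to run on its own output

assert sorted(once) == sorted(twice) == [("a", 1), ("b", 2), ("c", 9)]
```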
Functional Properties
    Associative: f( a, f(b,c) ) = f( f(a,b), c )
        Grouping of operations doesn’t matter
        YES: Addition, multiplication, concatenation
        NO: division, subtraction, NAND
        NAND(1, NAND(1,0)) = 0 != 1 = NAND( NAND(1,0), 0 )
    Commutative: f(a,b) = f(b,a)
        Ordering of arguments doesn’t matter
        YES: addition, multiplication, NAND
        NO: division, subtraction, concatenation
        Concatenate(“a”,“b”) != Concatenate(“b”,“a”)
    Distributive
        White (p. 32) and Lam (p. 84) mention with regard to combiners
        But really, go with associative + commutative in Lin (pp. 20, 27)
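The NAND counterexample above is easy to verify exhaustively in Python — it fails the associativity check on the very inputs shown, while passing commutativity:

```python
def nand(a, b):
    """Boolean NAND on 0/1 inputs."""
    return 1 - (a & b)

# Associative?  f(a, f(b, c)) == f(f(a, b), c) for all inputs
assoc = all(nand(a, nand(b, c)) == nand(nand(a, b), c)
            for a in (0, 1) for b in (0, 1) for c in (0, 1))

# Commutative?  f(a, b) == f(b, a) for all inputs
comm = all(nand(a, b) == nand(b, a) for a in (0, 1) for b in (0, 1))

print(assoc, comm)  # False True
```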
[Diagram: the same word-count flow, now with a combiner after each map task: the map output (c,3) (c,6) is locally combined into (c,9) before partition and shuffle, so the reducer for c receives [9,2,8] rather than [3,6,2,8]; the other pairs pass through unchanged]
[Diagram: MapReduce execution overview — (1) the user program submits a job to the master; (2) the master schedules map and reduce tasks on workers; (3) map workers read input splits; (4) map output is written to local disk; (5) reduce workers remotely read the intermediate files; (6) each reduce worker writes one output file]
Adapted from (Dean and Ghemawat, OSDI 2004)
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178


    Shuffle and 2 Sorts




   As map emits values, local sorting
    runs in tandem (1st sort)
   Combine is optionally called
    0..N times for local aggregation
    on sorted (K2, list(V2)) tuples (more sorting of output)
   Partition determines which (logical) reducer Rj each key will go to
   Node’s TaskTracker tells JobTracker it has keys for Rj
   JobTracker determines node to run Rj based on data locality
   When local map/combine/sort finishes, sends data to Rj’s node
   Rj’s node iteratively merges inputs from map nodes as they arrive (2nd sort)
   For each (K, list(V)) tuple in merged output, call reduce(…)
Distributed File System
    Don’t move data… move computation to the data!
        Store data on the local disks of nodes in the cluster
        Start up the workers on the node that has the data local
    Why?
        Not enough RAM to hold all the data in memory
        Disk access is slow, but disk throughput is reasonable
    A distributed file system is the answer
        GFS (Google File System) for Google’s MapReduce
        HDFS (Hadoop Distributed File System) for Hadoop
GFS: Assumptions
           Commodity hardware over “exotic” hardware
                 Scale “out”, not “up”
           High component failure rates
                 Inexpensive commodity components fail all the time
           “Modest” number of huge files
                 Multi-gigabyte files are common, if not encouraged
           Files are write-once, mostly appended to
                 Perhaps concurrently
           Large streaming reads over random access
                 High sustained throughput over low latency




GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
GFS: Design Decisions
   Files stored as chunks
        Fixed size (64MB)
   Reliability through replication
        Each chunk replicated across 3+ chunkservers
   Single master to coordinate access, keep metadata
        Simple centralized management
   No data caching
        Little benefit due to large datasets, streaming reads
   Simplify the API
        Push some of the issues onto the client (e.g., data layout)

    HDFS = GFS clone (same basic ideas)
Basic Cluster Components
   1 “Manager” node (can be split onto 2 nodes)
       Namenode (NN)
       Jobtracker (JT)
   1-N “Worker” nodes
       Tasktracker (TT)
       Datanode (DN)
   Optional Secondary Namenode
       Periodic backups of Namenode in case of failure
Hadoop Architecture




   Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 24-25
Namenode Responsibilities
   Managing the file system namespace:
       Holds file/directory structure, metadata, file-to-block mapping,
        access permissions, etc.
   Coordinating file operations:
       Directs clients to datanodes for reads and writes
       No data is moved through the namenode
   Maintaining overall health:
       Periodic communication with the datanodes
       Block re-replication and rebalancing
       Garbage collection
Putting everything together…


[Diagram: the namenode machine runs the namenode daemon, and the job submission node runs the jobtracker; each slave node runs a tasktracker and a datanode daemon on top of the local Linux file system]
Anatomy of a Job
   MapReduce program in Hadoop = Hadoop job
       Jobs are divided into map and reduce tasks (+ more!)
       An instance of running a task is called a task attempt
       Multiple jobs can be composed into a workflow
   Job submission process
       Client (i.e., driver program) creates a job, configures it, and
        submits it to the jobtracker
       JobClient computes input splits (on client end)
       Job data (jar, configuration XML) are sent to JobTracker
       JobTracker puts job data in shared location, enqueues tasks
       TaskTrackers poll for tasks
       Off to the races…
Why have 1 API when you can have 2?
White pp. 25-27, Lam pp. 77-80
   Hadoop 0.19 and earlier had “old API”
   Hadoop 0.21 and forward has “new API”
   Hadoop 0.20 has both!
       Old API most stable, but deprecated
       Current books use old API predominantly, but discuss changes
         • Example code using new API available online from publisher
       Some old API classes/methods not yet ported to new API
       Cloud9 uses both, and you can too
Old API
   Mapper (interface)
       void map(K1 key, V1 value, OutputCollector<K2, V2> output,
        Reporter reporter)
       void configure(JobConf job)
       void close() throws IOException
   Reducer/Combiner
       void reduce(K2 key, Iterator<V2> values,
        OutputCollector<K3,V3> output, Reporter reporter)
       void configure(JobConf job)
       void close() throws IOException
   Partitioner
        int getPartition(K2 key, V2 value, int numPartitions)
New API
   org.apache.hadoop.mapred now deprecated; instead use
    org.apache.hadoop.mapreduce &
    org.apache.hadoop.mapreduce.lib
   Mapper, Reducer now abstract classes, not interfaces
   Use Context instead of OutputCollector and Reporter
       Context.write(), not OutputCollector.collect()
   Reduce takes value list as Iterable, not Iterator
        Can use Java’s for-each syntax for iterating
   Can throw InterruptedException as well as IOException
   JobConf & JobClient replaced by Configuration & Job
Matthew Lease
 
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Matthew Lease
 
AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd
Matthew Lease
 
Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation
Matthew Lease
 
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Matthew Lease
 
But Who Protects the Moderators?
But Who Protects the Moderators?But Who Protects the Moderators?
But Who Protects the Moderators?
Matthew Lease
 
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Matthew Lease
 
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Matthew Lease
 
Fact Checking & Information Retrieval
Fact Checking & Information RetrievalFact Checking & Information Retrieval
Fact Checking & Information Retrieval
Matthew Lease
 
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Matthew Lease
 
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
Matthew Lease
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Matthew Lease
 
Systematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s ClothingSystematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s Clothing
Matthew Lease
 
The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)
Matthew Lease
 
The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016
Matthew Lease
 
The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)
Matthew Lease
 
Toward Better Crowdsourcing Science
 Toward Better Crowdsourcing Science Toward Better Crowdsourcing Science
Toward Better Crowdsourcing Science
Matthew Lease
 
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work PlatformsBeyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Matthew Lease
 
Ad

Recently uploaded (20)

MEMS IC Substrate Technologies Guide 2025.pptx
MEMS IC Substrate Technologies Guide 2025.pptxMEMS IC Substrate Technologies Guide 2025.pptx
MEMS IC Substrate Technologies Guide 2025.pptx
IC substrate Shawn Wang
 
React Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for SuccessReact Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for Success
Amelia Swank
 
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptxUiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
anabulhac
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
Understanding SEO in the Age of AI.pdf
Understanding SEO in the Age of AI.pdfUnderstanding SEO in the Age of AI.pdf
Understanding SEO in the Age of AI.pdf
Fulcrum Concepts, LLC
 
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
Toru Tamaki
 
ACE Aarhus - Team'25 wrap-up presentation
ACE Aarhus - Team'25 wrap-up presentationACE Aarhus - Team'25 wrap-up presentation
ACE Aarhus - Team'25 wrap-up presentation
DanielEriksen5
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
Ivano Malavolta
 
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
DNF 2.0 Implementations Challenges in Nepal
DNF 2.0 Implementations Challenges in NepalDNF 2.0 Implementations Challenges in Nepal
DNF 2.0 Implementations Challenges in Nepal
ICT Frame Magazine Pvt. Ltd.
 
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdfICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
Eryk Budi Pratama
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
Slack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teamsSlack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teams
Nacho Cougil
 
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Gary Arora
 
Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)
Kaya Weers
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
MEMS IC Substrate Technologies Guide 2025.pptx
MEMS IC Substrate Technologies Guide 2025.pptxMEMS IC Substrate Technologies Guide 2025.pptx
MEMS IC Substrate Technologies Guide 2025.pptx
IC substrate Shawn Wang
 
React Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for SuccessReact Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for Success
Amelia Swank
 
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptxUiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
anabulhac
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
Understanding SEO in the Age of AI.pdf
Understanding SEO in the Age of AI.pdfUnderstanding SEO in the Age of AI.pdf
Understanding SEO in the Age of AI.pdf
Fulcrum Concepts, LLC
 
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
Toru Tamaki
 
ACE Aarhus - Team'25 wrap-up presentation
ACE Aarhus - Team'25 wrap-up presentationACE Aarhus - Team'25 wrap-up presentation
ACE Aarhus - Team'25 wrap-up presentation
DanielEriksen5
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
Ivano Malavolta
 
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdfICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
Eryk Budi Pratama
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
Slack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teamsSlack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teams
Nacho Cougil
 
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Gary Arora
 
Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)
Kaya Weers
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 

Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

  • 1. Data-Intensive Computing for Text Analysis
    CS395T / INF385T / LIN386M
    University of Texas at Austin, Fall 2011
    Lecture 2, September 1, 2011
    Jason Baldridge, Department of Linguistics (jasonbaldridge at gmail dot com)
    Matt Lease, School of Information (ml at ischool dot utexas dot edu)
  • 2. Acknowledgments
    Course design and slides derived from Jimmy Lin’s cloud computing courses
    at the University of Maryland, College Park.
    Some figures courtesy of:
    - Chuck Lam’s Hadoop in Action (2011)
    - Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
  • 3. Roots in Functional Programming
    [Figure: Map applies a function f independently to each list element;
    Fold aggregates the elements with a function g.]
  • 4. Divide and Conquer
    [Figure: the “work” is partitioned among workers w1..w3; each “worker”
    produces a result r1..r3; the results are combined into the final “result”.]
  • 6. “Big Ideas”
    - Scale “out”, not “up”: limits of SMP and large shared-memory machines
    - Move processing to the data: clusters have limited bandwidth
    - Process data sequentially, avoid random access: seeks are expensive, but disk throughput is reasonable
    - Seamless scalability: from the mythical man-month to the tradable machine-hour
  • 7. Typical Large-Data Problem
    - Iterate over a large number of records
    - Compute something of interest from each
    - Shuffle and sort intermediate results
    - Aggregate intermediate results
    - Generate final output
    Key idea: provide a functional abstraction for these two operations
    (Dean and Ghemawat, OSDI 2004)
  • 8. MapReduce Data Flow
    [Figure courtesy of Chuck Lam’s Hadoop in Action (2011), pp. 45, 52]
  • 9. MapReduce “Runtime”
    - Handles scheduling: assigns workers to map and reduce tasks
    - Handles “data distribution”: moves processes to data
    - Handles synchronization: gathers, sorts, and shuffles intermediate data
    - Handles errors and faults: detects worker failures and restarts
    - Built on a distributed file system
  • 10. MapReduce
    Programmers specify two functions:
      map ( K1, V1 ) → list ( K2, V2 )
      reduce ( K2, list(V2) ) → list ( K3, V3 )
    Note the correspondence of types: map output → reduce input.
    Data flow:
    - Input → “input splits”: each a sequence of logical (K1, V1) “records”
    - Map
      - Each split is processed entirely by the same map node
      - map is invoked iteratively: once per record in the split
      - For each record processed, map may emit 0-N (K2, V2) pairs
    - Reduce
      - reduce is invoked iteratively: once per intermediate ( K2, list(V2) ) value
      - For each value processed, reduce may emit 0-N (K3, V3) pairs
    - Each reducer’s output is written to a persistent file in HDFS
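The type signatures above can be sketched as a minimal in-memory simulation. This is illustrative only: `run_mapreduce` and its parameter names are invented for this sketch, and a real Hadoop job distributes each phase across many nodes.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """In-memory sketch of the MapReduce contract.

    records   : iterable of (K1, V1) input records
    map_fn    : (K1, V1) -> list of (K2, V2) pairs (may be empty)
    reduce_fn : (K2, [V2]) -> list of (K3, V3) pairs (may be empty)
    """
    # Map phase: each record may emit zero or more intermediate pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))

    # Shuffle and sort: group all values sharing the same intermediate key.
    intermediate.sort(key=itemgetter(0))
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        output.extend(reduce_fn(k2, values))
    return output
```

For example, a word count passes a tokenizing map and a summing reduce:
`run_mapreduce([("d1", "a b a")], lambda k, v: [(w, 1) for w in v.split()], lambda k, vs: [(k, sum(vs))])`.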
  • 11. [Figure: input pipeline. Input files are divided into InputSplits; an
    InputFormat supplies a RecordReader for each split, which feeds records to
    a Mapper, producing intermediates. Source: redrawn from a slide by
    Cloudera, cc-licensed]
  • 12. Data Flow (figure courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 30)
    - Input → “input splits”: each a sequence of logical (K1, V1) “records”
    - For each split, for each record, call map(K1, V1) (multiple calls)
    - Each map call may emit any number of (K2, V2) pairs (0-N)
    Run-time:
    - Groups all values with the same key into ( K2, list(V2) )
    - Determines which reducer will process each key
    - Copies data across the network as needed for the reducer
    - Ensures an intra-node sort of the keys processed by each reducer
      - No guarantee by default of an inter-node total sort across reducers
  • 13. “Hello World”: Word Count
    map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )
    reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )

    Map(String docid, String text):
      for each word w in text:
        Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
        sum += v;
      Emit(term, sum);
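The slide's pseudocode can be run directly as a small Python sketch. The shuffle is simulated with a dictionary; function names and the sample documents are invented for this illustration.

```python
from collections import defaultdict

def map_fn(docid, text):
    # Map: emit (word, 1) for every token in the document.
    return [(w, 1) for w in text.split()]

def reduce_fn(term, values):
    # Reduce: sum the partial counts for one term.
    return (term, sum(values))

# Simulated shuffle and sort: group intermediate values by key.
docs = [("d1", "to be or not to be"), ("d2", "to do")]
grouped = defaultdict(list)
for docid, text in docs:
    for k, v in map_fn(docid, text):
        grouped[k].append(v)

counts = dict(reduce_fn(k, vs) for k, vs in grouped.items())
# counts now maps each word to its total frequency across both documents.
```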
  • 14. [Figure: word-count data flow. Map tasks emit pairs such as (a,1),
    (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8); shuffle and sort
    aggregates values by key, e.g. a → [1,5], b → [2,7], c → [2,3,6,8];
    reduce then produces the final (key, sum) pairs. Courtesy of Chuck Lam’s
    Hadoop in Action (2011), pp. 45, 52]
  • 15. Partition
    Given:
      map ( K1, V1 ) → list ( K2, V2 )
      reduce ( K2, list(V2) ) → list ( K3, V3 )
      partition ( K2, N ) → Rj: maps K2 to some reducer Rj in [1..N]
    - Each distinct key (with its associated values) is sent to a single reducer
      - The same reduce node may process multiple keys in separate reduce() calls
    - Balances workload across reducers: an equal number of keys to each
      - Default: a simple hash of the key, e.g., hash(k’) mod N (# reducers)
    - Customizable
      - Some keys require more computation than others, e.g., value skew or key-specific computation
      - For skew, sampling can dynamically estimate the distribution and set the partition
      - Secondary/tertiary sorting (e.g., bigrams or arbitrary n-grams)?
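A minimal sketch of the default hash(k’) mod N scheme described above. The function name is invented, and crc32 stands in as a stable hash (Python's built-in string hash is salted per process, which would make partition assignments differ across runs).

```python
import zlib

def default_partition(key, num_reducers):
    # Default-style partitioner: a stable hash of the key modulo the
    # number of reducers, so every record carrying the same key is
    # routed to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```

A custom partitioner replaces only this function, e.g. to route keys by a prefix or to spread known heavy keys across reducers.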
  • 16. Secondary Sorting (Lin 57, White 241)
    - How to output sorted bigrams (1st word, then a sorted list of 2nd words)?
      - What if we use word1 as the key and word2 as the value?
      - What if we use <first>--<second> as the key?
    - Pattern:
      - Create a composite key of (first, second)
      - Define a key comparator based on both words
        - This produces the sort order we want (aa ab ac ba bb bc ca cb …)
      - Define a partition function based only on the first word
        - All bigrams with the same first word go to the same reducer
        - How do you know when the first word changes across invocations?
      - Preserve state in the reducer across invocations
        - reduce() is called separately for each bigram, but we want to remember the current first word across the bigrams seen
    - Hadoop also provides a Group Comparator
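The composite-key pattern above can be sketched as follows. Sorting and `groupby` stand in for Hadoop's key comparator and per-key reduce() invocations; the variable names and sample bigram counts are invented for this sketch.

```python
from itertools import groupby
import zlib

# Bigram counts keyed by a composite (first, second) key, in arrival order.
pairs = [(("a", "b"), 2), (("b", "a"), 1), (("a", "a"), 4), (("b", "c"), 3)]

def partition(key, num_reducers):
    # Partition on the FIRST word only, so all bigrams sharing a first
    # word reach the same reducer.
    return zlib.crc32(key[0].encode("utf-8")) % num_reducers

# Key comparator: sort on the full composite key, giving the order
# aa ab ac ba bb ... within each reducer's input.
sorted_pairs = sorted(pairs)

# Reducer-side pattern: detect when the first word changes across
# reduce() invocations (simulated here with groupby) and emit one
# sorted list of (second word, count) per first word.
grouped = [(first, [(k[1], v) for k, v in grp])
           for first, grp in groupby(sorted_pairs, key=lambda kv: kv[0][0])]
```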
  • 17. Combine
    Given:
      map ( K1, V1 ) → list ( K2, V2 )
      reduce ( K2, list(V2) ) → list ( K3, V3 )
      combine ( K2, list(V2) ) → list ( K2, V2 )
    - Optional optimization: local aggregation to reduce network traffic
    - No guarantee it will be used, or how many times it will be called
      - The semantics of the program cannot depend on its use
    - Signature: same input as reduce, same output as map
      - combine may be run repeatedly on its own output
    - Lin: if reduce is associative and commutative, then combiner = reducer
      (see next slide)
  • 18. Functional Properties
    - Associative: f( a, f(b, c) ) = f( f(a, b), c )
      - Grouping of operations doesn’t matter
      - YES: addition, multiplication, concatenation
      - NO: division, subtraction, NAND
        - NAND(1, NAND(1,0)) = 0 != 1 = NAND( NAND(1,0), 0 )
    - Commutative: f(a, b) = f(b, a)
      - Ordering of arguments doesn’t matter
      - YES: addition, multiplication, NAND
      - NO: division, subtraction, concatenation
        - concatenate(“a”, “b”) != concatenate(“b”, “a”)
    - Distributive
      - White (p. 32) and Lam (p. 84) mention it with regard to combiners
      - But really, go with associative + commutative as in Lin (pp. 20, 27)
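These properties are exactly what makes a combiner safe: an associative and commutative reduce gives the same answer no matter how the run-time groups or orders the partial aggregations. A small check (the groupings chosen are arbitrary, as a combiner's would be):

```python
from functools import reduce

# Sum is associative and commutative, so pre-summing arbitrary partial
# groups (what a combiner does) cannot change the final answer.
values = [3, 1, 4, 1, 5, 9]
direct = sum(values)                                        # no combiner
partials = [sum(values[:2]), sum(values[2:5]), sum(values[5:])]  # combiner ran
assert sum(partials) == direct

# Subtraction is neither associative nor commutative, so regrouping
# changes the result; it could not be used as a combiner.
left = reduce(lambda a, b: a - b, [10, 3, 2])   # (10 - 3) - 2 = 5
right = 10 - (3 - 2)                            # 10 - (3 - 2) = 9
assert left != right
```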
  • 19. [Figure: word-count data flow with combine and partition steps between
    map and shuffle/sort. Combiners locally aggregate each map task’s output,
    e.g. (c,3) and (c,6) become (c,9) before the shuffle; partition then routes
    each key to its reducer.]
  • 20. [Figure: MapReduce execution overview. (1) The user program submits the
    job to the master; (2) the master schedules map and reduce workers; (3) map
    workers read their input splits; (4) they write intermediate files to local
    disk; (5) reduce workers remotely read the intermediates; (6) reduce workers
    write the output files. Adapted from (Dean and Ghemawat, OSDI 2004)]
  • 21. Shuffle and 2 Sorts (figure courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178)
    - As map emits values, local sorting runs in tandem (1st sort)
    - combine is optionally called 0..N times for local aggregation on the sorted ( K2, list(V2) ) tuples (with further sorting of its output)
    - partition determines which (logical) reducer Rj each key will go to
    - The node’s TaskTracker tells the JobTracker it has keys for Rj
    - The JobTracker determines which node runs Rj based on data locality
    - When the local map/combine/sort finishes, the node sends its data to Rj’s node
    - Rj’s node iteratively merges inputs from the map nodes as the data arrives (2nd sort)
    - For each ( K, list(V) ) tuple in the merged output, reduce(…) is called
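The two sorts compose neatly: each map task ships runs that are already sorted by key, so the reducer only needs a streaming merge, never a full re-sort. A sketch with two invented runs:

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Intermediate outputs of two map tasks, each already sorted by key
# (the 1st sort happened map-side).
run_a = [("apple", 1), ("cat", 2), ("dog", 1)]
run_b = [("apple", 3), ("bird", 1), ("dog", 4)]

# Reducer-side streaming merge of the sorted runs (the 2nd sort) ...
merged = list(heapq.merge(run_a, run_b, key=itemgetter(0)))

# ... after which reduce() can be invoked once per key, in key order.
totals = {k: sum(v for _, v in grp)
          for k, grp in groupby(merged, key=itemgetter(0))}
```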
  • 22. Distributed File System
    - Don’t move data… move computation to the data!
      - Store data on the local disks of nodes in the cluster
      - Start up the workers on the node that has the data local
    - Why?
      - Not enough RAM to hold all the data in memory
      - Disk access is slow, but disk throughput is reasonable
    - A distributed file system is the answer
      - GFS (Google File System) for Google’s MapReduce
      - HDFS (Hadoop Distributed File System) for Hadoop
  • 23. GFS: Assumptions
    - Commodity hardware over “exotic” hardware: scale “out”, not “up”
    - High component failure rates: inexpensive commodity components fail all the time
    - A “modest” number of huge files: multi-gigabyte files are common, if not encouraged
    - Files are write-once, mostly appended to (perhaps concurrently)
    - Large streaming reads over random access
    - High sustained throughput over low latency
    GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
  • 24. GFS: Design Decisions
    - Files stored as chunks: fixed size (64 MB)
    - Reliability through replication: each chunk replicated across 3+ chunkservers
    - Single master to coordinate access and keep metadata: simple centralized management
    - No data caching: little benefit due to large datasets and streaming reads
    - Simplify the API: push some of the issues onto the client (e.g., data layout)
    HDFS = GFS clone (same basic ideas)
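The fixed-size chunking decision is easy to make concrete. A minimal sketch (the function name is invented; real GFS/HDFS also tracks chunk handles and replica placement, which is omitted here):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks, as on the slide

def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    # A file is stored as a sequence of fixed-size chunks; only the
    # last chunk may be shorter. Returns (chunk_index, chunk_length).
    return [(i, min(chunk_size, file_size - offset))
            for i, offset in enumerate(range(0, file_size, chunk_size))]

# A 200 MB file occupies three full 64 MB chunks plus one 8 MB chunk;
# each chunk would then be replicated on 3+ chunkservers.
chunks = split_into_chunks(200 * 1024 * 1024)
```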
  • 25. Basic Cluster Components
    - 1 “manager” node (can be split onto 2 nodes):
      - Namenode (NN)
      - JobTracker (JT)
    - 1-N “worker” nodes:
      - TaskTracker (TT)
      - Datanode (DN)
    - Optional secondary namenode: periodic backups of the namenode in case of failure
  • 26. Hadoop Architecture
    [Figure courtesy of Chuck Lam’s Hadoop in Action (2011), pp. 24-25]
  • 27. Namenode Responsibilities
    - Managing the file system namespace: holds the file/directory structure, metadata, file-to-block mapping, access permissions, etc.
    - Coordinating file operations: directs clients to datanodes for reads and writes; no data is moved through the namenode
    - Maintaining overall health: periodic communication with the datanodes, block re-replication and rebalancing, garbage collection
  • 28. Putting everything together…
    [Figure: the namenode daemon runs on the namenode; the jobtracker runs on
    the job submission node; each slave node runs a tasktracker and a datanode
    daemon on top of its local Linux file system.]
  • 29. Anatomy of a Job
    - A MapReduce program in Hadoop = a Hadoop job
      - Jobs are divided into map and reduce tasks (+ more!)
      - An instance of running a task is called a task attempt
      - Multiple jobs can be composed into a workflow
    - Job submission process:
      - The client (i.e., the driver program) creates a job, configures it, and submits it to the JobTracker
      - JobClient computes the input splits (on the client end)
      - Job data (jar, configuration XML) are sent to the JobTracker
      - The JobTracker puts the job data in a shared location and enqueues the tasks
      - TaskTrackers poll for tasks
      - Off to the races…
  • 30. Why have 1 API when you can have 2? (White pp. 25-27, Lam pp. 77-80)
    - Hadoop 0.19 and earlier had the “old API”
    - Hadoop 0.21 and later has the “new API”
    - Hadoop 0.20 has both!
    - The old API is the most stable, but deprecated
    - Current books use the old API predominantly, but discuss the changes
      - Example code using the new API is available online from the publisher
    - Some old-API classes/methods have not yet been ported to the new API
    - Cloud9 uses both, and you can too
  • 31. Old API
    - Mapper (interface)
      - void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
      - void configure(JobConf job)
      - void close() throws IOException
    - Reducer/Combiner
      - void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
      - void configure(JobConf job)
      - void close() throws IOException
    - Partitioner
      - int getPartition(K2 key, V2 value, int numPartitions)
  • 32. New API
    - org.apache.hadoop.mapred is now deprecated; instead use org.apache.hadoop.mapreduce and org.apache.hadoop.mapreduce.lib
    - Mapper and Reducer are now abstract classes, not interfaces
    - Use Context instead of OutputCollector and Reporter
      - Context.write(), not OutputCollector.collect()
    - reduce takes its value list as an Iterable, not an Iterator
      - Can use Java’s foreach syntax for iterating
    - Methods can throw InterruptedException as well as IOException
    - JobConf and JobClient are replaced by Configuration and Job