Data-Intensive Computing for Text Analysis
                 CS395T / INF385T / LIN386M
            University of Texas at Austin, Fall 2011



                      Lecture 3
                  September 8, 2011

        Jason Baldridge                      Matt Lease
   Department of Linguistics           School of Information
  University of Texas at Austin    University of Texas at Austin
Jasonbaldridge at gmail dot com   ml at ischool dot utexas dot edu
Acknowledgments
      Course design and slides derived from
     Jimmy Lin’s cloud computing courses at
    the University of Maryland, College Park

Some figures courtesy of the following
excellent Hadoop books (order yours today!)
• Chuck Lam’s Hadoop In Action (2010)
• Tom White’s Hadoop: The Definitive Guide,
  2nd Edition (2010)
Today’s Agenda
• Review
• Toward MapReduce “design patterns”
  – Building block: preserving state across calls
  – In-Map & In-Mapper combining (vs. combiners)
  – Secondary sorting (via value-to-key conversion)
  – Pairs and Stripes
  – Order Inversion
• Group Work (examples)
  – Interlude: scaling counts, TF-IDF
Review
MapReduce: Recap
Required:
   map ( K1, V1 ) → list ( K2, V2 )
   reduce ( K2, list(V2) ) → list ( K3, V3)
All values with the same key are reduced together
Optional:
   partition (K2, N) → Rj      maps K2 to some reducer Rj in [1..N]
      Often a simple hash of the key, e.g., hash(k’) mod n
      Divides up key space for parallel reduce operations


   combine ( K2, list(V2) ) → list ( K2, V2 )
      Mini-reducers that run in memory after the map phase
      Used as an optimization to reduce network traffic


The execution framework handles everything else…
“Everything Else”
    The execution framework handles everything else…
        • Scheduling: assigns workers to map and reduce tasks
        • "Data distribution": moves processes to data
        • Synchronization: gathers, sorts, and shuffles intermediate data
        • Errors and faults: detects worker failures and restarts
    Limited control over data and execution flow
        • All algorithms must be expressed in m, r, c, p
    You don't know:
        • Where mappers and reducers run
        • When a mapper or reducer begins or finishes
        • Which input a particular mapper is processing
        • Which intermediate key a particular reducer is processing
[Figure: MapReduce data flow. Mappers consume input pairs (k1 v1 … k6 v6) and emit intermediate pairs, e.g., (a,1) (b,2), (c,3) (c,6), (a,5) (c,2), (b,7) (c,8); combiners aggregate locally, e.g., (c,3) and (c,6) become (c,9); partitioners route keys to reducers; shuffle and sort aggregates values by key: a → (1,5), b → (2,7), c → (2,9,8); reducers emit final outputs (r1,s1), (r2,s2), (r3,s3).]
Shuffle and Sort

[Figure: Hadoop's shuffle and sort. Map output accumulates in a circular in-memory buffer, which spills to disk; spills are merged on disk (with optional combiner passes) into per-reducer intermediate files; each reducer fetches its partition from every mapper and merges the inputs.]
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178


    Shuffle and 2 Sorts




   • As map emits values, local sorting runs in tandem (1st sort)
   • Combine is optionally called 0..N times for local aggregation
     on sorted (K2, list(V2)) tuples (more sorting of output)
   • Partition determines which (logical) reducer Rj each key will go to
   • Node's TaskTracker tells JobTracker it has keys for Rj
   • JobTracker determines the node to run Rj based on data locality
   • When local map/combine/sort finishes, data is sent to Rj's node
   • Rj's node iteratively merges inputs from map nodes as they arrive (2nd sort)
   • For each (K, list(V)) tuple in merged output, call reduce(…)
Scalable Hadoop Algorithms: Themes
   • Avoid object creation
       – Inherently costly operation
       – Garbage collection
   • Avoid buffering
       – Limited heap size
       – Works for small datasets, but won't scale!
         • Yet… we'll talk about patterns involving buffering…
Importance of Local Aggregation
   • Ideal scaling characteristics:
       – Twice the data, twice the running time
       – Twice the resources, half the running time
   • Why can't we achieve this?
       – Synchronization requires communication
       – Communication kills performance
   • Thus… avoid communication!
       – Reduce intermediate data via local aggregation
       – Combiners can help
Tools for Synchronization
   • Cleverly-constructed data structures
       – Bring partial results together
   • Sort order of intermediate keys
       – Control order in which reducers process keys
   • Partitioner
       – Control which reducer processes which keys
   • Preserving state in mappers and reducers
       – Capture dependencies across multiple keys and values
Secondary Sorting
   • MapReduce sorts input to reducers by key
       – Values may be arbitrarily ordered
   • What if we also want to sort by value?
       – E.g., k → (k1,v1), (k1,v3), (k1,v4), (k1,v8)…
   • Solutions?
       – Swap key and value to sort by value?
       – What if we use (k,v) as a joint key (and change nothing else)?
Secondary Sorting: Solutions
   • Solution 1: Buffer values in memory, then sort
       – Tradeoffs?
   • Solution 2: "Value-to-key conversion" design pattern
       – Form composite intermediate key: (k, v1)
       – Let the execution framework do the sorting
       – Preserve state across multiple key-value pairs
       – …how do we make this happen?
Secondary Sorting (Lin 57, White 241)
   • Create composite key: (k,v)
   • Define a Key Comparator to sort on both components
       – Possibly not needed in some cases (e.g., strings & concatenation)
   • Define a partition function based only on the (original) key
       – All pairs with the same key should go to the same reducer
       – Multiple keys may still go to the same reduce node; how do you
         know when the key changes across invocations of reduce()?
         • i.e., assume you want to do something with all values associated
           with a given key (e.g., print all on the same line, with no other keys)
   • Preserve state in the reducer across invocations
       – reduce() will be called separately for each pair, but we need to
         track the current key so we can detect when it changes

 Hadoop also provides a Group Comparator
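The pattern can be made concrete with a small plain-Python simulation (a sketch; `secondary_sort` and its parameters are our names, not Hadoop API): partition on the original key only, let a sort on the composite (k, v) order the values, and preserve the current key across "reduce" invocations to detect changes.

```python
def secondary_sort(pairs, num_reducers=2):
    """Simulate value-to-key conversion: partition on the original key,
    sort on the composite (k, v), group values per key in sorted order."""
    # Partition on the original key only, so all (k, *) reach one reducer.
    partitions = [[] for _ in range(num_reducers)]
    for k, v in pairs:
        partitions[hash(k) % num_reducers].append((k, v))
    out = []
    for part in partitions:
        # Stand-in for the framework's sort: composite (k, v) ordering
        # gives us the secondary sort on values for free.
        part.sort()
        current = None  # state preserved across reduce() invocations
        for k, v in part:
            if k != current:  # key changed between invocations
                current = k
                out.append((k, []))
            out[-1][1].append(v)
    return out
```

For example, `secondary_sort([("k1", 3), ("k2", 2), ("k1", 1), ("k1", 2)])` groups k1's values as [1, 2, 3].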
Preserving State in Hadoop


[Figure: One Mapper/Reducer object is created per task and can hold state across calls. configure() is the API initialization hook; map() is called once per input key-value pair and reduce() once per intermediate key; close() is the API cleanup hook.]
Combiner Design
   • Combiners and reducers share the same method signature
       – Sometimes, reducers can serve as combiners
       – Often, not…
   • Remember: combiners are optional optimizations
       – Should not affect algorithm correctness
       – May be run 0, 1, or multiple times
“Hello World”: Word Count
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer)



    Map(String docid, String text):
      for each word w in text:
          Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
          sum += v;
      Emit(term, sum);
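As a runnable illustration, the pseudocode above might translate to plain Python as follows (a sketch; the function names and the dict-based shuffle are ours, standing in for Hadoop's machinery):

```python
from collections import defaultdict

def map_doc(docid, text):
    # Emit (word, 1) for every token, as in the slide's Map().
    for word in text.split():
        yield (word, 1)

def reduce_term(term, values):
    # Sum the partial counts for one term, as in the slide's Reduce().
    return (term, sum(values))

def word_count(docs):
    # Simulate the shuffle: group all values by intermediate key.
    grouped = defaultdict(list)
    for docid, text in docs.items():
        for k, v in map_doc(docid, text):
            grouped[k].append(v)
    return dict(reduce_term(k, vs) for k, vs in grouped.items())
```

For example, `word_count({"d1": "a b a", "d2": "b c"})` yields `{"a": 2, "b": 2, "c": 1}`.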
    Combiner?
MapReduce Algorithm Design
Design Pattern for Local Aggregation
   • "In-mapper combining"
       – Fold the functionality of the combiner into the mapper,
         including preserving state across multiple map calls
   • Advantages
       – Speed
       – Why is this faster than actual combiners?
         • Construction/deconstruction, serialization/deserialization
         • Guaranteed and controlled use
   • Disadvantages
       – Buffering! Explicit memory management required
         • Can use a disk-backed buffer, based on # items or bytes in memory
         • What if multiple mappers are running on the same node? Do we know?
       – Potential for order-dependent bugs
For Word Count, the combiner can simply reuse the reducer: combine = reduce
Word Count: in-map combining




Are combiners still needed?
Word Count: in-mapper combining




Are combiners still needed?
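The in-mapper combining code is not reproduced in this copy; a plain-Python sketch of the pattern (class and method names are ours, mirroring Hadoop's configure/map/close lifecycle) might look like:

```python
from collections import defaultdict

class InMapperCombiningMapper:
    """Word count with in-mapper combining (a sketch, not Hadoop API).
    State (the counts dict) is preserved across map() calls; buffered
    totals are emitted once in close(), so far fewer intermediate
    pairs leave the mapper than one (word, 1) per token."""

    def configure(self):            # API initialization hook
        self.counts = defaultdict(int)

    def map(self, docid, text):     # called once per input pair
        for word in text.split():
            self.counts[word] += 1  # aggregate locally; emit nothing yet

    def close(self):                # API cleanup hook: emit buffered totals
        return list(self.counts.items())
```

The variant on the previous slide ("in-map combining") aggregates within a single map() call only, avoiding cross-call state at the cost of less aggregation.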
Example 2: Compute the Mean (v1)




Why can’t we use reducer as combiner?
Example 2: Compute the Mean (v2)




Why doesn’t this work?
Example 2: Compute the Mean (v3):
                 in-mapper combining




Are combiners still needed?
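The mean code for v1–v3 is not reproduced in this copy. The key point: a reducer that computes a mean cannot serve as its own combiner, because a mean of partial means is wrong (mean(1,2) = 1.5 and mean(6) = 6 give (1.5+6)/2 = 3.75, not mean(1,2,6) = 3). Emitting (sum, count) pairs keeps the computation associative. A plain-Python sketch (names are ours):

```python
from collections import defaultdict

class MeanMapper:
    """In-mapper combining for the mean (a sketch, not Hadoop API).
    Buffers per-key partial (sum, count) pairs across map() calls."""

    def configure(self):
        self.totals = defaultdict(lambda: [0, 0])  # key -> [sum, count]

    def map(self, key, value):
        s = self.totals[key]
        s[0] += value   # partial sum
        s[1] += 1       # partial count

    def close(self):
        return [(k, (s, c)) for k, (s, c) in self.totals.items()]

def reduce_mean(key, pairs):
    # Sum the partial sums and counts from all mappers, divide once.
    total = sum(s for s, c in pairs)
    count = sum(c for s, c in pairs)
    return (key, total / count)
```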
Example 3: Term Co-occurrence
   • Term co-occurrence matrix for a text collection
       – M = N x N matrix (N = vocabulary size)
       – Mij: number of times i and j co-occur in some context
         (for concreteness, let's say context = sentence)
   • Why?
       – Distributional profiles as a way of measuring semantic distance
       – Semantic distance useful for many language processing tasks
MapReduce: Large Counting Problems
   • Term co-occurrence matrix for a text collection
     = specific instance of a large counting problem
       – A large event space (number of terms)
       – A large number of observations (the collection itself)
       – Goal: keep track of interesting statistics about the events
   • Basic approach
       – Mappers generate partial counts
       – Reducers aggregate partial counts

        How do we aggregate partial counts efficiently?
Approach 1: “Pairs”
   • Each mapper takes a sentence:
       – Generate all co-occurring term pairs
       – For all pairs, emit (a, b) → count
   • Reducers sum up counts associated with these pairs
   • Use combiners!
Pairs: Pseudo-Code
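The pairs pseudocode is not reproduced in this copy; a plain-Python sketch of what it might look like (function names are ours):

```python
from itertools import permutations

def pairs_map(sentence):
    # Emit ((a, b), 1) for every ordered pair of distinct co-occurring
    # terms in the sentence (context = sentence, as on the earlier slide).
    words = sentence.split()
    for a, b in permutations(words, 2):
        if a != b:
            yield ((a, b), 1)

def pairs_reduce(pair, counts):
    # Sum the partial counts for one (a, b) pair.
    return (pair, sum(counts))
```

A combiner with the same body as `pairs_reduce` would work here, though as the next slide notes, it rarely finds many duplicates of the same pair to merge.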
“Pairs” Analysis
   • Advantages
       – Easy to implement, easy to understand
   • Disadvantages
       – Lots of pairs to sort and shuffle around (upper bound?)
       – Not many opportunities for combiners to work
Another Try: “Stripes”
   • Idea: group together pairs into an associative array
          (a, b) → 1
          (a, c) → 2
          (a, d) → 5                   a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
          (a, e) → 3
          (a, f) → 2

   • Each mapper takes a sentence:
       – Generate all co-occurring term pairs
       – For each term, emit a → { b: countb, c: countc, d: countd … }
   • Reducers perform element-wise sum of associative arrays
                a → { b: 1,       d: 5, e: 3 }
           +    a → { b: 1, c: 2, d: 2,       f: 2 }
                a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
Stripes: Pseudo-Code
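The stripes pseudocode is likewise missing from this copy; a plain-Python sketch (names are ours) using `Counter` as the associative array:

```python
from collections import Counter

def stripes_map(sentence):
    # For each term occurrence, emit one associative array ("stripe")
    # of co-occurring neighbor counts within the sentence.
    words = sentence.split()
    for i, a in enumerate(words):
        stripe = Counter(w for j, w in enumerate(words) if j != i and w != a)
        yield (a, stripe)

def stripes_reduce(term, stripes):
    # Element-wise sum of associative arrays, as on the previous slide.
    total = Counter()
    for s in stripes:
        total.update(s)
    return (term, dict(total))
```

With the slide's example stripes for `a`, the reducer produces `{b: 2, c: 2, d: 7, e: 3, f: 2}`.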
“Stripes” Analysis
   • Advantages
       – Far less sorting and shuffling of key-value pairs
       – Can make better use of combiners
   • Disadvantages
       – More difficult to implement
       – Underlying object is more heavyweight
       – Fundamental limitation in terms of size of event space
         • Buffering!
Cluster size: 38 cores
Data Source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3),
which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)
Relative Frequencies
   • How do we estimate relative frequencies from counts?

          f(B|A) = count(A, B) / count(A) = count(A, B) / Σ_B′ count(A, B′)

   • Why do we want to do this?
   • How do we do this with MapReduce?
f(B|A): “Stripes”

     a → {b1:3, b2 :12, b3 :7, b4 :1, … }


   • Easy!
       – One pass to compute (a, *)
       – Another pass to directly compute f(B|A)
f(B|A): “Pairs”

         (a, *) → 32     Reducer holds this value in memory

         (a, b1) → 3                        (a, b1) → 3 / 32
         (a, b2) → 12                       (a, b2) → 12 / 32
         (a, b3) → 7                        (a, b3) → 7 / 32
         (a, b4) → 1                        (a, b4) → 1 / 32
         …                                  …

   • For this to work:
       – Must emit an extra (a, *) for every bn in the mapper
       – Must make sure all a's get sent to the same reducer (use partitioner)
       – Must make sure (a, *) comes first (define sort order)
       – Must hold state in the reducer across different key-value pairs
“Order Inversion”
   • Common design pattern
       – Computing relative frequencies requires marginal counts
       – But the marginal cannot be computed until you see all the counts
       – Buffering is a bad idea!
       – Trick: get the marginal counts to arrive at the reducer before
         the joint counts
   • Optimizations
       – Apply the in-memory combining pattern to accumulate marginal counts
       – Should we apply combiners?
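The pattern can be sketched end-to-end in plain Python (our names, standing in for mapper, sort order, and reducer): emit an extra (a, '*') marginal for every pair, sort so '*' arrives first, and hold only the marginal in reducer state.

```python
def relative_frequencies(pair_counts):
    """Order-inversion sketch: compute f(b|a) from (a, b) counts."""
    # Mapper side: every joint count also contributes to the (a, '*') marginal.
    emitted = []
    for (a, b), n in pair_counts.items():
        emitted.append(((a, '*'), n))   # special key carrying the marginal
        emitted.append(((a, b), n))     # the joint count itself
    # Stand-in for the framework's sort: '*' precedes letters in ASCII,
    # so each key's marginal reaches the "reducer" before its joint counts.
    emitted.sort()
    # Reducer side: hold only current key and its marginal in memory.
    result = {}
    current_a, marginal = None, 0       # state held across reduce() calls
    for (a, b), n in emitted:
        if a != current_a:
            current_a, marginal = a, 0  # key changed: reset the marginal
        if b == '*':
            marginal += n               # accumulate count(a) first
        else:
            result[(a, b)] = n / marginal
    return result
```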
Synchronization: Pairs vs. Stripes
   • Approach 1: turn synchronization into an ordering problem
       – Sort keys into correct order of computation
       – Partition key space so that each reducer gets the appropriate set
         of partial results
       – Hold state in reducer across multiple key-value pairs to perform
         computation
       – Illustrated by the "pairs" approach
   • Approach 2: construct data structures that bring partial
     results together
       – Each reducer receives all the data it needs to complete the
         computation
       – Illustrated by the "stripes" approach
Recap: Tools for Synchronization
   • Cleverly-constructed data structures
       – Bring data together
   • Sort order of intermediate keys
       – Control order in which reducers process keys
   • Partitioner
       – Control which reducer processes which keys
   • Preserving state in mappers and reducers
       – Capture dependencies across multiple keys and values
Issues and Tradeoffs
   • Number of key-value pairs
       – Object creation overhead
       – Time for sorting and shuffling pairs across the network
   • Size of each key-value pair
       – De/serialization overhead
   • Local aggregation
       – Opportunities to perform local aggregation vary
       – Combiners make a big difference
       – Combiners vs. in-mapper combining
       – RAM vs. disk vs. network
Group Work (Examples)
Task 5
   • How many distinct words in the document collection start
     with each letter?
       – Note: "types" vs. "tokens"

    Mapper<String,String → String,String>
    Map(String docID, String document)
        for each word in document
             emit (first character, word)

    Reducer<String,String → String,Integer>
    Reduce(String letter, Iterator<String> words):
        set of words = empty set
        for each word
           add word to set
        emit(letter, size of word set)

   • Ways to make it more efficient?
Task 5b
   • How many distinct words in the document collection start
     with each letter?
       – How to use in-mapper combining and a separate combiner
       – Tradeoffs?

  Mapper<String,String → String,String>
  Map(String docID, String document)
      for each word in document
           emit (first character, word)

  Combiner<String,String → String,String>
  Combine(String letter, Iterator<String> words):
      set of words = empty set
      for each word
         add word to set
      for each word in set
          emit(letter, word)
Task 6: find median document length
    Mapper<K1,V1 → Integer,Integer>
    Map(K1 xx, V1 xx)
      10,000 / N times
         emit( length(generateRandomDocument()), 1)

    Reducer<Integer,Integer → Integer,V3>
    Reduce(Integer length, Iterator<Integer> values):
        static list lengths = empty list;
        for each value
           append length to list

    Close() { output median }

   • conf.setNumReduceTasks(1)
   • Problems with this solution?
Interlude: Scaling counts
   • Many applications require counts of words in some context
       – E.g., information retrieval, vector-based semantics
   • Counts from frequent words like "the" can overwhelm the
     signal from content words such as "stocks" and "football"
   • Two strategies for combating high-frequency words:
       – Use a stop list that excludes them
       – Scale the counts so that high-frequency words are downweighted
Interlude: Scaling counts, TF-IDF
   • TF-IDF, or term frequency-inverse document frequency,
     is a standard way of scaling.
   • Inverse document frequency for a term t is the ratio of the
     number of documents in the collection to the number of
     documents containing t, commonly taken with a logarithm:

          idf(t) = log( N / df(t) )

   • TF-IDF is just the term frequency times the idf:

          tf-idf(t, d) = tf(t, d) × idf(t)
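A small plain-Python sketch of one common TF-IDF formulation (the slide's exact formula images are not reproduced in this copy; the log form and the function name are our assumptions):

```python
import math

def tf_idf(term_counts_per_doc):
    """Weight raw term counts by log(N / df(t)) for each document.
    Input: {doc_id: {term: raw count}}. Output: same shape, weighted."""
    docs = term_counts_per_doc
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {}
    for counts in docs.values():
        for t in counts:
            df[t] = df.get(t, 0) + 1
    return {
        d: {t: tf * math.log(n_docs / df[t]) for t, tf in counts.items()}
        for d, counts in docs.items()
    }
```

Note that a term appearing in every document (like "the") gets idf = log(1) = 0, so its weight vanishes, which is exactly the downweighting of high-frequency words described above.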
Interlude: Scaling counts using DF
   • Recall the word co-occurrence counts task from the earlier slides.
       – mij represents the number of times word j has occurred in the
         neighborhood of word i.
       – The row mi gives a vector profile of word i that we can use for
         tasks like determining word similarity (e.g., using cosine distance)
       – Words like "the" will tend to have high counts that we want to
         scale down so they don't dominate this computation.
   • The counts in mij can be scaled down using dfj. Let's
     create a transformed matrix S where:
Task 7
   • Compute S, the co-occurrence counts scaled by document frequency.
       – First: do the simplest mapper
       – Then: simplify things for the reducer
turningpointinnospac
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
Gabriela Agustini
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
Kyong-Ha Lee
 
Hadoop
HadoopHadoop
Hadoop
Scott Leberknight
 
Lecture 04 big data analytics | map reduce
Lecture 04 big data analytics | map reduceLecture 04 big data analytics | map reduce
Lecture 04 big data analytics | map reduce
anasbro009
 
MapReduce Algorithm Design - Parallel Reduce Operations
MapReduce Algorithm Design - Parallel Reduce OperationsMapReduce Algorithm Design - Parallel Reduce Operations
MapReduce Algorithm Design - Parallel Reduce Operations
Jason J Pulikkottil
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilind
EMC
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
Sitamarhi Institute of Technology
 
MapReduce DesignPatterns
MapReduce DesignPatternsMapReduce DesignPatterns
MapReduce DesignPatterns
Evgeny Benediktov
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticians
attilacsordas
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
gothicane
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Somnath Mazumdar
 
MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012
Steven Francia
 
MapReduce and Hadoop Introcuctory Presentation
MapReduce and Hadoop Introcuctory PresentationMapReduce and Hadoop Introcuctory Presentation
MapReduce and Hadoop Introcuctory Presentation
ssuserb91a20
 
MongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous DataMongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous Data
Steven Francia
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
HARIKRISHNANU13
 
Scoobi - Scala for Startups
Scoobi - Scala for StartupsScoobi - Scala for Startups
Scoobi - Scala for Startups
bmlever
 
design mapping lecture6-mapreducealgorithmdesign.ppt
design mapping lecture6-mapreducealgorithmdesign.pptdesign mapping lecture6-mapreducealgorithmdesign.ppt
design mapping lecture6-mapreducealgorithmdesign.ppt
turningpointinnospac
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
Kyong-Ha Lee
 
Lecture 04 big data analytics | map reduce
Lecture 04 big data analytics | map reduceLecture 04 big data analytics | map reduce
Lecture 04 big data analytics | map reduce
anasbro009
 
MapReduce Algorithm Design - Parallel Reduce Operations
MapReduce Algorithm Design - Parallel Reduce OperationsMapReduce Algorithm Design - Parallel Reduce Operations
MapReduce Algorithm Design - Parallel Reduce Operations
Jason J Pulikkottil
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilind
EMC
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticians
attilacsordas
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
gothicane
 
Ad

More from Matthew Lease (20)

Automated Models for Quantifying Centrality of Survey Responses
Automated Models for Quantifying Centrality of Survey ResponsesAutomated Models for Quantifying Centrality of Survey Responses
Automated Models for Quantifying Centrality of Survey Responses
Matthew Lease
 
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Matthew Lease
 
Explainable Fact Checking with Humans in-the-loop
Explainable Fact Checking with Humans in-the-loopExplainable Fact Checking with Humans in-the-loop
Explainable Fact Checking with Humans in-the-loop
Matthew Lease
 
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Matthew Lease
 
AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd
Matthew Lease
 
Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation
Matthew Lease
 
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Matthew Lease
 
But Who Protects the Moderators?
But Who Protects the Moderators?But Who Protects the Moderators?
But Who Protects the Moderators?
Matthew Lease
 
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Matthew Lease
 
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Matthew Lease
 
Fact Checking & Information Retrieval
Fact Checking & Information RetrievalFact Checking & Information Retrieval
Fact Checking & Information Retrieval
Matthew Lease
 
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Matthew Lease
 
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
Matthew Lease
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Matthew Lease
 
Systematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s ClothingSystematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s Clothing
Matthew Lease
 
The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)
Matthew Lease
 
The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016
Matthew Lease
 
The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)
Matthew Lease
 
Toward Better Crowdsourcing Science
 Toward Better Crowdsourcing Science Toward Better Crowdsourcing Science
Toward Better Crowdsourcing Science
Matthew Lease
 
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work PlatformsBeyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Matthew Lease
 
Automated Models for Quantifying Centrality of Survey Responses
Automated Models for Quantifying Centrality of Survey ResponsesAutomated Models for Quantifying Centrality of Survey Responses
Automated Models for Quantifying Centrality of Survey Responses
Matthew Lease
 
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Matthew Lease
 
Explainable Fact Checking with Humans in-the-loop
Explainable Fact Checking with Humans in-the-loopExplainable Fact Checking with Humans in-the-loop
Explainable Fact Checking with Humans in-the-loop
Matthew Lease
 
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Matthew Lease
 
AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd
Matthew Lease
 
Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation
Matthew Lease
 
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Matthew Lease
 
But Who Protects the Moderators?
But Who Protects the Moderators?But Who Protects the Moderators?
But Who Protects the Moderators?
Matthew Lease
 
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Matthew Lease
 
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Matthew Lease
 
Fact Checking & Information Retrieval
Fact Checking & Information RetrievalFact Checking & Information Retrieval
Fact Checking & Information Retrieval
Matthew Lease
 
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Matthew Lease
 
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
Matthew Lease
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Matthew Lease
 
Systematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s ClothingSystematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s Clothing
Matthew Lease
 
The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)
Matthew Lease
 
The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016
Matthew Lease
 
The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)
Matthew Lease
 
Toward Better Crowdsourcing Science
 Toward Better Crowdsourcing Science Toward Better Crowdsourcing Science
Toward Better Crowdsourcing Science
Matthew Lease
 
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work PlatformsBeyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Matthew Lease
 

Recently uploaded (20)

Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
João Esperancinha
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Artificial_Intelligence_in_Everyday_Life.pptx
Artificial_Intelligence_in_Everyday_Life.pptxArtificial_Intelligence_in_Everyday_Life.pptx
Artificial_Intelligence_in_Everyday_Life.pptx
03ANMOLCHAURASIYA
 
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
CSUC - Consorci de Serveis Universitaris de Catalunya
 
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
SOFTTECHHUB
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?
Eric Torreborre
 
machines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdfmachines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdf
AmirStern2
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Building the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdfBuilding the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdf
Cheryl Hung
 
Agentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community MeetupAgentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community Meetup
Manoj Batra (1600 + Connections)
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
An Overview of Salesforce Health Cloud & How is it Transforming Patient Care
An Overview of Salesforce Health Cloud & How is it Transforming Patient CareAn Overview of Salesforce Health Cloud & How is it Transforming Patient Care
An Overview of Salesforce Health Cloud & How is it Transforming Patient Care
Cyntexa
 
How to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabberHow to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabber
eGrabber
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
Config 2025 presentation recap covering both days
Config 2025 presentation recap covering both daysConfig 2025 presentation recap covering both days
Config 2025 presentation recap covering both days
TrishAntoni1
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
João Esperancinha
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Artificial_Intelligence_in_Everyday_Life.pptx
Artificial_Intelligence_in_Everyday_Life.pptxArtificial_Intelligence_in_Everyday_Life.pptx
Artificial_Intelligence_in_Everyday_Life.pptx
03ANMOLCHAURASIYA
 
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
SOFTTECHHUB
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?
Eric Torreborre
 
machines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdfmachines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdf
AmirStern2
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Building the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdfBuilding the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdf
Cheryl Hung
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
An Overview of Salesforce Health Cloud & How is it Transforming Patient Care
An Overview of Salesforce Health Cloud & How is it Transforming Patient CareAn Overview of Salesforce Health Cloud & How is it Transforming Patient Care
An Overview of Salesforce Health Cloud & How is it Transforming Patient Care
Cyntexa
 
How to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabberHow to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabber
eGrabber
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
Config 2025 presentation recap covering both days
Config 2025 presentation recap covering both daysConfig 2025 presentation recap covering both days
Config 2025 presentation recap covering both days
TrishAntoni1
 

Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

  • 1. Data-Intensive Computing for Text Analysis, CS395T / INF385T / LIN386M, University of Texas at Austin, Fall 2011. Lecture 3, September 8, 2011. Jason Baldridge (Department of Linguistics, jasonbaldridge at gmail dot com) and Matt Lease (School of Information, ml at ischool dot utexas dot edu), University of Texas at Austin.
  • 2. Acknowledgments Course design and slides derived from Jimmy Lin’s cloud computing courses at the University of Maryland, College Park Some figures courtesy of the following excellent Hadoop books (order yours today!) • Chuck Lam’s Hadoop In Action (2010) • Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
  • 3. Today’s Agenda • Review • Toward MapReduce “design patterns” – Building block: preserving state across calls – In-Map & In-Mapper combining (vs. combiners) – Secondary sorting (via value-to-key Conversion) – Pairs and Stripes – Order Inversion • Group Work (examples) – Interlude: scaling counts, TF-IDF
  • 5. MapReduce: Recap Required: map ( K1, V1 ) → list ( K2, V2 ) reduce ( K2, list(V2) ) → list ( K3, V3) All values with the same key are reduced together Optional: partition (K2, N) → Rj maps K2 to some reducer Rj in [1..N]  Often a simple hash of the key, e.g., hash(k’) mod n  Divides up key space for parallel reduce operations combine ( K2, list(V2) ) → list ( K2, V2 )  Mini-reducers that run in memory after the map phase  Used as an optimization to reduce network traffic The execution framework handles everything else…
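The partition signature above can be sketched in plain Python (an illustrative stand-in, not Hadoop's actual HashPartitioner, which uses the key's hashCode()):

```python
# Sketch of the default partitioner: hash(key) mod n.
# Returns a 0-based reducer index rather than the slide's [1..N].

def partition(key: str, num_reducers: int) -> int:
    """Map an intermediate key K2 to one of num_reducers reducers."""
    h = sum(ord(c) for c in key)  # simple deterministic hash
    return h % num_reducers

# All pairs sharing a key go to the same reducer, dividing up
# the key space for parallel reduce operations:
assert partition("apple", 4) == partition("apple", 4)
assert 0 <= partition("banana", 4) < 4
```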
  • 6. “Everything Else”  The execution framework handles everything else…  Scheduling: assigns workers to map and reduce tasks  “Data distribution”: moves processes to data  Synchronization: gathers, sorts, and shuffles intermediate data  Errors and faults: detects worker failures and restarts  Limited control over data and execution flow  All algorithms must be expressed in m, r, c, p  You don’t know:  Where mappers and reducers run  When a mapper or reducer begins or finishes  Which input a particular mapper is processing  Which intermediate key a particular reducer is processing
  • 7. [Diagram: MapReduce data flow. Input pairs pass through parallel map tasks, local combine, and partition; shuffle and sort aggregates values by key; reduce tasks emit the final output pairs.]
  • 8. [Diagram: Shuffle and Sort. Each mapper writes to an in-memory circular buffer that spills to disk; spills are merged, with the combiner applied, into intermediate files on disk, which are fetched by this reducer and other reducers; the reducer likewise merges input arriving from other mappers.]
  • 9. Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178 Shuffle and 2 Sorts  As map emits values, local sorting runs in tandem (1st sort)  Combine is optionally called 0..N times for local aggregation on sorted (K2, list(V2)) tuples (more sorting of output)  Partition determines which (logical) reducer Rj each key will go to  Node’s TaskTracker tells JobTracker it has keys for Rj  JobTracker determines node to run Rj based on data locality  When local map/combine/sort finishes, sends data to Rj’s node  Rj’s node iteratively merges inputs from map nodes as they arrive (2nd sort)  For each (K, list(V)) tuple in merged output, call reduce(…)
  • 10. Scalable Hadoop Algorithms: Themes  Avoid object creation  Inherently costly operation  Garbage collection  Avoid buffering  Limited heap size  Works for small datasets, but won’t scale! • Yet… we’ll talk about patterns involving buffering…
  • 11. Importance of Local Aggregation  Ideal scaling characteristics:  Twice the data, twice the running time  Twice the resources, half the running time  Why can’t we achieve this?  Synchronization requires communication  Communication kills performance  Thus… avoid communication!  Reduce intermediate data via local aggregation  Combiners can help
  • 12. Tools for Synchronization  Cleverly-constructed data structures  Bring partial results together  Sort order of intermediate keys  Control order in which reducers process keys  Partitioner  Control which reducer processes which keys  Preserving state in mappers and reducers  Capture dependencies across multiple keys and values
  • 13. Secondary Sorting  MapReduce sorts input to reducers by key  Values may be arbitrarily ordered  What if we also want to sort the values?  E.g., k → (k1,v1), (k1,v3), (k1,v4), (k1,v8)…  Solutions?  Swap key and value to sort by value?  What if we use (k,v) as a joint key (and change nothing else)?
  • 14. Secondary Sorting: Solutions  Solution 1: Buffer values in memory, then sort  Tradeoffs?  Solution 2: “Value-to-key conversion” design pattern  Form composite intermediate key: (k, v1)  Let execution framework do the sorting  Preserve state across multiple key-value pairs  …how do we make this happen?
  • 15. Secondary Sorting (Lin 57, White 241)  Create composite key: (k,v)  Define a Key Comparator to sort via both  Possibly not needed in some cases (e.g. strings & concatenation)  Define a partition function based only on the (original) key  All pairs with same key should go to same reducer  Multiple keys may still go to the same reduce node; how do you know when the key changes across invocations of reduce()? • i.e. assume you want to do something with all values associated with a given key (e.g. print all on the same line, with no other keys)  Preserve state in the reducer across invocations  reduce() will be called separately for each pair, but we need to track the current key so we can detect when it changes Hadoop also provides Group Comparator
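The value-to-key conversion steps above can be sketched in plain Python, simulating the framework's sort and the reducer's key-change detection (a simulation, not Hadoop code; the variable names are illustrative):

```python
# Value-to-key conversion: form composite keys (k, v), let the sort
# order them, then partition and detect key changes on the original
# key only.

pairs = [("k1", 8), ("k2", 3), ("k1", 1), ("k2", 7), ("k1", 4)]

# Composite key: sorting by (k, v) yields values in order within each k.
shuffled = sorted(pairs)  # stand-in for the framework's sort

# Reducer side: reduce() is invoked per pair, so we preserve the
# current key across invocations to notice when it changes.
grouped, current = {}, None
for k, v in shuffled:
    if k != current:          # key changed across invocations
        current = k
        grouped[k] = []
    grouped[k].append(v)

assert grouped == {"k1": [1, 4, 8], "k2": [3, 7]}
```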
• 16. Preserving State in Hadoop  A Mapper or Reducer is one object per task, so instance state persists across calls:  configure(): API initialization hook, called once per task  map(): one call per input key-value pair / reduce(): one call per intermediate key  close(): API cleanup hook, called once per task
• 17. Combiner Design  Combiners and reducers share the same method signature  Sometimes, reducers can serve as combiners  Often, not…  Remember: combiners are optional optimizations  Should not affect algorithm correctness  May be run 0, 1, or multiple times
• 18. “Hello World”: Word Count map ( K1=String, V1=String ) → list ( K2=String, V2=Integer ) reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer) Map(String docid, String text): for each word w in text: Emit(w, 1); Reduce(String term, Iterator<Int> values): int sum = 0; for each v in values: sum += v; Emit(term, sum);
• 19. “Hello World”: Word Count map ( K1=String, V1=String ) → list ( K2=String, V2=Integer ) reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer) Map(String docid, String text): for each word w in text: Emit(w, 1); Reduce(String term, Iterator<Int> values): int sum = 0; for each v in values: sum += v; Emit(term, sum); Combiner?
• 21. Design Pattern for Local Aggregation  "In-mapper combining"  Fold the functionality of the combiner into the mapper, including preserving state across multiple map calls  Advantages  Speed  Why is this faster than actual combiners? • Avoids object construction/deconstruction and serialization/deserialization • Guaranteed and controllable use  Disadvantages  Buffering! Explicit memory management required • Can use a disk-backed buffer, flushed based on # of items or bytes in memory • What if multiple mappers are running on the same node? Do we know?  Potential for order-dependent bugs
• 22. “Hello World”: Word Count map ( K1=String, V1=String ) → list ( K2=String, V2=Integer ) reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer) Map(String docid, String text): for each word w in text: Emit(w, 1); Reduce(String term, Iterator<Int> values): int sum = 0; for each v in values: sum += v; Emit(term, sum); Combine = reduce
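Because word count's reduce is just a sum (associative and commutative), the reducer can double as the combiner. A small Python simulation (an illustrative sketch, not Hadoop code; for brevity the combine step here runs once over all map output rather than per mapper):

```python
from collections import defaultdict

def map_wc(docid, text):
    return [(w, 1) for w in text.split()]

def reduce_wc(term, values):
    return [(term, sum(values))]

def group(pairs):
    g = defaultdict(list)
    for k, v in pairs:
        g[k].append(v)
    return g

def run(docs, combine=None):
    inter = [kv for docid, text in docs for kv in map_wc(docid, text)]
    if combine:  # optional local aggregation before the shuffle
        inter = [kv for k, vs in group(inter).items() for kv in combine(k, vs)]
    return dict(kv for k, vs in group(inter).items() for kv in reduce_wc(k, vs))

docs = [("d1", "a b a"), ("d2", "b c")]
# with or without combine=reduce_wc, the result is {"a": 2, "b": 2, "c": 1}
```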
  • 23. Word Count: in-map combining Are combiners still needed?
  • 24. Word Count: in-mapper combining Are combiners still needed?
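The in-mapper combining variant can be sketched in Python (illustrative, not Hadoop code): the mapper object preserves a count dictionary across map() calls and emits only in close(), so far fewer pairs ever reach the shuffle.

```python
from collections import defaultdict

class WordCountMapper:
    """In-mapper combining: aggregate across map() calls, emit once in close()."""
    def __init__(self):
        self.counts = defaultdict(int)     # state preserved across calls
    def map(self, docid, text):
        for w in text.split():
            self.counts[w] += 1            # no (w, 1) pairs emitted here
    def close(self):
        return list(self.counts.items())   # one pair per distinct word

m = WordCountMapper()
m.map("d1", "a b a")
m.map("d2", "b c")
# close() emits 3 pairs instead of the 5 that Emit(w, 1) would produce
```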
  • 25. Example 2: Compute the Mean (v1) Why can’t we use reducer as combiner?
  • 26. Example 2: Compute the Mean (v2) Why doesn’t this work?
  • 27. Example 2: Compute the Mean (v3)
  • 28. Computing the Mean: in-mapper combining Are combiners still needed?
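Why the mean's reducer cannot serve as its combiner: a mean of partial means is wrong when the groups differ in size, but (sum, count) partials combine safely because addition is associative. A minimal Python sketch (illustrative, not Hadoop code):

```python
def combine_mean(key, partials):
    """partials: (sum, count) pairs; safe to combine in any grouping."""
    s = sum(p[0] for p in partials)
    n = sum(p[1] for p in partials)
    return [(key, (s, n))]

def reduce_mean(key, partials):
    s = sum(p[0] for p in partials)
    n = sum(p[1] for p in partials)
    return [(key, s / n)]

# two mappers see values [1, 2, 3] and [4]; averaging their partial means
# (2.0 and 4.0) would give 3.0, but (sum, count) partials give the true 2.5
partials = combine_mean("k", [(1, 1), (2, 1), (3, 1)]) + combine_mean("k", [(4, 1)])
```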
  • 29. Example 3: Term Co-occurrence  Term co-occurrence matrix for a text collection  M = N x N matrix (N = vocabulary size)  Mij: number of times i and j co-occur in some context (for concreteness, let’s say context = sentence)  Why?  Distributional profiles as a way of measuring semantic distance  Semantic distance useful for many language processing tasks
  • 30. MapReduce: Large Counting Problems  Term co-occurrence matrix for a text collection = specific instance of a large counting problem  A large event space (number of terms)  A large number of observations (the collection itself)  Goal: keep track of interesting statistics about the events  Basic approach  Mappers generate partial counts  Reducers aggregate partial counts How do we aggregate partial counts efficiently?
  • 31. Approach 1: “Pairs”  Each mapper takes a sentence:  Generate all co-occurring term pairs  For all pairs, emit (a, b) → count  Reducers sum up counts associated with these pairs  Use combiners!
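A minimal Python sketch of the "pairs" mapper (illustrative; here every ordered pair of distinct positions in a sentence counts as one co-occurrence):

```python
from itertools import permutations

def map_pairs(sentence):
    """Emit ((a, b), 1) for every ordered co-occurring pair in the sentence."""
    words = sentence.split()
    return [((a, b), 1) for a, b in permutations(words, 2)]

pairs = map_pairs("a b c")
# 6 ordered pairs, each with count 1; reducers then sum per (a, b) key
```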
  • 33. “Pairs” Analysis  Advantages  Easy to implement, easy to understand  Disadvantages  Lots of pairs to sort and shuffle around (upper bound?)  Not many opportunities for combiners to work
  • 34. Another Try: “Stripes”  Idea: group together pairs into an associative array (a, b) → 1 (a, c) → 2 (a, d) → 5 a → { b: 1, c: 2, d: 5, e: 3, f: 2 } (a, e) → 3 (a, f) → 2  Each mapper takes a sentence:  Generate all co-occurring term pairs  For each term, emit a → { b: countb, c: countc, d: countd … }  Reducers perform element-wise sum of associative arrays a → { b: 1, d: 5, e: 3 } + a → { b: 1, c: 2, d: 2, f: 2 } a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
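The stripes mapper and the element-wise reducer can be sketched in Python (illustrative, not Hadoop code), reproducing the merge shown on the slide:

```python
from collections import Counter, defaultdict

def map_stripes(sentence):
    """For each term a, emit one stripe a -> {b: count} per sentence."""
    words = sentence.split()
    out = defaultdict(Counter)
    for i, a in enumerate(words):
        for j, b in enumerate(words):
            if i != j:
                out[a][b] += 1
    return list(out.items())

def reduce_stripes(term, stripes):
    total = Counter()
    for s in stripes:            # element-wise sum of associative arrays
        total += Counter(s)
    return (term, dict(total))

term, merged = reduce_stripes("a", [{"b": 1, "d": 5, "e": 3},
                                    {"b": 1, "c": 2, "d": 2, "f": 2}])
# merged matches the slide: {"b": 2, "c": 2, "d": 7, "e": 3, "f": 2}
```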
  • 36. “Stripes” Analysis  Advantages  Far less sorting and shuffling of key-value pairs  Can make better use of combiners  Disadvantages  More difficult to implement  Underlying object more heavyweight  Fundamental limitation in terms of size of event space • Buffering!
  • 37. Cluster size: 38 cores Data Source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3), which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)
• 39. Relative Frequencies  How do we estimate relative frequencies from counts? f(B|A) = count(A, B) / count(A) = count(A, B) / Σ_B′ count(A, B′)  Why do we want to do this?  How do we do this with MapReduce?
  • 40. f(B|A): “Stripes” a → {b1:3, b2 :12, b3 :7, b4 :1, … }  Easy!  One pass to compute (a, *)  Another pass to directly compute f(B|A)
• 41. f(B|A): “Pairs”  Reducer input (sorted): (a, *) → 32, (a, b1) → 3, (a, b2) → 12, (a, b3) → 7, (a, b4) → 1, …  The reducer holds the marginal 32 in memory and emits: (a, b1) → 3 / 32, (a, b2) → 12 / 32, (a, b3) → 7 / 32, (a, b4) → 1 / 32, …  For this to work:  Must emit extra (a, *) for every bn in mapper  Must make sure all a’s get sent to same reducer (use partitioner)  Must make sure (a, *) comes first (define sort order)  Must hold state in reducer across different key-value pairs
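The reducer-side logic for the "pairs" relative-frequency computation can be sketched in Python (illustrative, not Hadoop code): the special (a, *) marginal arrives first because of the sort order and is held in reducer state.

```python
def relative_freqs(sorted_pairs):
    """sorted_pairs: ((a, b), count), with each (a, '*') marginal sorted
    before the (a, b) pairs for the same a (the order-inversion trick)."""
    marginal = None                    # state held across "reduce calls"
    out = {}
    for (a, b), count in sorted_pairs:
        if b == "*":
            marginal = count           # arrives before any (a, b) pair
        else:
            out[(a, b)] = count / marginal
    return out

freqs = relative_freqs([(("a", "*"), 32), (("a", "b1"), 3), (("a", "b2"), 12)])
# {("a", "b1"): 3/32, ("a", "b2"): 12/32}
```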
  • 42. “Order Inversion”  Common design pattern  Computing relative frequencies requires marginal counts  But marginal cannot be computed until you see all counts  Buffering is a bad idea!  Trick: getting the marginal counts to arrive at the reducer before the joint counts  Optimizations  Apply in-memory combining pattern to accumulate marginal counts  Should we apply combiners?
• 43. Synchronization: Pairs vs. Stripes  Approach 1: turn synchronization into an ordering problem  Sort keys into correct order of computation  Partition key space so that each reducer gets the appropriate set of partial results  Hold state in reducer across multiple key-value pairs to perform computation  Illustrated by the "pairs" approach  Approach 2: construct data structures that bring partial results together  Each reducer receives all the data it needs to complete the computation  Illustrated by the "stripes" approach
  • 44. Recap: Tools for Synchronization  Cleverly-constructed data structures  Bring data together  Sort order of intermediate keys  Control order in which reducers process keys  Partitioner  Control which reducer processes which keys  Preserving state in mappers and reducers  Capture dependencies across multiple keys and values
  • 45. Issues and Tradeoffs  Number of key-value pairs  Object creation overhead  Time for sorting and shuffling pairs across the network  Size of each key-value pair  De/serialization overhead  Local aggregation  Opportunities to perform local aggregation varies  Combiners make a big difference  Combiners vs. in-mapper combining  RAM vs. disk vs. network
• 47. Task 5  How many distinct words in the document collection start with each letter?  Note: "types" vs. "tokens"
• 48. Task 5  How many distinct words in the document collection start with each letter?  Note: "types" vs. "tokens" Mapper<String,String → String,String> Map(String docID, String document) for each word in document emit (first character, word)  Ways to make more efficient?
• 49. Task 5  How many distinct words in the document collection start with each letter?  Note: "types" vs. "tokens" Mapper<String,String → String,String> Map(String docID, String document) for each word in document emit (first character, word) Reducer<String,String → String,Integer> Reduce(String letter, Iterator<String> words): set of words = empty set; for each word add word to set emit(letter, size of set)  Ways to make more efficient?
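A Python sketch of Task 5 (illustrative, not Hadoop code; document names and contents are made up): the reducer's set keeps distinct types, so repeated tokens count once.

```python
from collections import defaultdict

def map_letter(docid, document):
    return [(w[0], w) for w in document.split()]

def reduce_letter(letter, words):
    return (letter, len(set(words)))   # distinct types, not tokens

def run(docs):
    grouped = defaultdict(list)        # simulate the shuffle
    for docid, text in docs:
        for k, v in map_letter(docid, text):
            grouped[k].append(v)
    return dict(reduce_letter(k, vs) for k, vs in grouped.items())

result = run([("d1", "apple ant bee"), ("d2", "ant bat")])
# {"a": 2, "b": 2}: "ant" is counted once even though it occurs twice
```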
• 50. Task 5b  How many distinct words in the document collection start with each letter?  How to use in-mapper combining and a separate combiner  Tradeoffs? Mapper<String,String → String,String> Map(String docID, String document) for each word in document emit (first character, word)
• 51. Task 5b  How many distinct words in the document collection start with each letter?  How to use in-mapper combining and a separate combiner  Tradeoffs? Mapper<String,String → String,String> Map(String docID, String document) for each word in document emit (first character, word) Combiner<String,String → String,String> Combine(String letter, Iterator<String> words): set of words = empty set; for each word add word to set for each word in set emit(letter, word)
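The combiner on the slide can be sketched in Python (illustrative): it deduplicates locally, shrinking shuffle volume at the cost of buffering a set in memory, which is the tradeoff the slide asks about.

```python
def combine_distinct(letter, words):
    """Emit each distinct word once per mapper instead of once per occurrence."""
    return [(letter, w) for w in set(words)]

out = combine_distinct("a", ["ant", "ant", "apple"])
# 2 pairs shuffled instead of 3; the reducer's set still sees every distinct word
```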
  • 52. Task 6: find median document length
• 53. Task 6: find median document length Mapper<K1,V1 → Integer,Integer> Map(K1 xx, V1 xx) 10,000 / N times emit( length(generateRandomDocument()), 1)
• 54. Task 6: find median document length Mapper<K1,V1 → Integer,Integer> Map(K1 xx, V1 xx) 10,000 / N times emit( length(generateRandomDocument()), 1) Reducer<Integer,Integer → Integer,V3> Reduce(Integer length, Iterator<Integer> values): static list lengths = empty list; for each value append length to list Close() { output median of lengths }  conf.setNumReduceTasks(1)  Problems with this solution?
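The single-reducer solution can be sketched in Python (illustrative, not Hadoop code; it assumes all lengths fit in the reducer's memory, one of the problems the slide asks about):

```python
def median_reducer(key_value_pairs):
    """key_value_pairs: (length, counts) per key, sorted by length, as the
    lone reducer would see them; buffer everything, then output the median."""
    lengths = []
    for length, counts in sorted(key_value_pairs):
        lengths.extend([length] * sum(counts))
    return lengths[len(lengths) // 2]

med = median_reducer([(10, [1, 1]), (5, [1]), (20, [1, 1])])
# lengths expand to [5, 10, 10, 20, 20], so the median is 10
```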
• 55. Interlude: Scaling counts  Many applications require counts of words in some context.  E.g. information retrieval, vector-based semantics  Counts from frequent words like "the" can overwhelm the signal from content words such as "stocks" and "football"  Two strategies for combating high-frequency words:  Use a stop list that excludes them  Scale the counts such that high-frequency words are downweighted.
• 56. Interlude: Scaling counts, TF-IDF  TF-IDF, or term frequency-inverse document frequency, is a standard way of scaling.  Inverse document frequency for a term t is the ratio of the number of documents in the collection (N) to the number of documents containing t (df_t): idf_t = N / df_t  TF-IDF is just the term frequency times the idf: tfidf_t,d = tf_t,d × idf_t
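Using the slide's ratio definition of idf (log-scaling, as in log(N/df_t), is also common but is an assumption beyond the slide), a tiny Python sketch with a made-up 1000-document collection:

```python
def idf(num_docs, doc_freq):
    """Slide's definition: collection size over documents containing the term."""
    return num_docs / doc_freq

def tf_idf(tf, num_docs, doc_freq):
    return tf * idf(num_docs, doc_freq)

# hypothetical collection of 1000 documents: "the" is in every document,
# "stocks" in only 10, so a lower raw count can still score far higher:
# tf_idf(50, 1000, 1000) == 50.0 while tf_idf(5, 1000, 10) == 500.0
```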
• 58. Interlude: Scaling counts using DF  Recall the word co-occurrence counts task from the earlier slides.  m_ij represents the number of times word j has occurred in the neighborhood of word i.  The row m_i gives a vector profile of word i that we can use for tasks like determining word similarity (e.g. using cosine distance)  Words like "the" will tend to have high counts that we want to scale down so they don’t dominate this computation.  The counts in m_ij can be scaled down using df_j. Let’s create a transformed matrix S where: s_ij = m_ij / df_j
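The scaling s_ij = m_ij / df_j can be sketched in Python over the stripes representation (illustrative; the terms and df values below are made up):

```python
def scale_stripe(term, stripe, df):
    """Compute one row of S from a row of M: s_ij = m_ij / df_j."""
    return (term, {j: count / df[j] for j, count in stripe.items()})

term, scaled = scale_stripe("football",
                            {"the": 100, "stocks": 4},
                            {"the": 1000, "stocks": 8})
# {"the": 0.1, "stocks": 0.5}: the frequent word's count is scaled down hard
```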
  • 59. Task 7  Compute S, the co-occurrence counts scaled by document frequency. • First: do the simplest mapper • Then: simplify things for the reducer