SlideShare a Scribd company logo
Counting @ Scale
       Part II

            Sharad Goel
        Columbia University
Computational Social Science: Lecture 4

          February 15, 2013
Descriptive statistics
    (as opposed to inferential statistics)
      is about counting
        contingency tables
    means, variances, quantiles
summaries of conditional distributions
Counting @ scale
  conceptually easy
computationally hard
MapReduce:
Simplifed Data Processing on Large Clusters
      Jeffrey Dean and Sanjay Ghemawat
                  OSDI, 2004
Map
assign each input line to one or more groups

                  Shuffle
             aggregate groups

                 Reduce
         operate on grouped data
Map
assign each input line to one or more groups
        v  [(k1, v1), …, (km, vm)]

                  Shuffle
             aggregate groups

                  Reduce
         operate on grouped data
        (k, [v1, …, vn])  [w1, …, wp]
Group Average

              Input
    views (user, movie, rating)

             Output
average (mean & median) by movie
Group Average

          Map
identity (key := movie)

       Reduce
movie group  average
The Insight of MapReduce
       One can efficiently group identical items

Many tasks are computationally easier on grouped data
Filter

                 Input
    arbitrary data & filter condition

                 Output
subset of input data satisfying condition
Filter

                   Map
input  input if condition(input) else pass

                 Reduce
                 identity
Distinct

        Input
     set of items

        Output
subset of distinct items
Distinct

                 Map
               identity

               Reduce
grouped items  single item from group
Sample

               Input
set of items & sample probability p

            Output
     random subset of items
Sample

                  Map
input  input if rand(0,1) < p else pass

                Reduce
                identity
Sort

          Input
set of items (and a key)

       Output
 ordered set of items
Sort

                       Map
identity, with all data assigned to the same key

                    Reduce
                    identity


     *all the work happens in the shuffle
Sort

                 Map
identity, with key := first letter of line

                Reduce
                identity


*all the work happens in the shuffle
Sort

                       Sample
generate a small sample of the data (with MapReduce)

                Determine breakpoints
      sort the sample and compute percentiles
Sort

                     Map
identity, with key determined by breakpoints

                  Reduce
                  identity


 *most of the work happens in the shuffle
Combining data

                  Example
    for each user, want to compute the
average popularity of the movies they watch

                   Problem
    one file contains views (user, movie);
another file contains popularity (movie, rank)
Joins
User    Movie

 23      829

 789     24             User   Movie   Rank

 234    5678            23      829     34

  7      24             789     24     100

                        234    5678     4
Movie   Rank
                         7      24     100
5678      4

 24      100

 829     34
Nested-Loop Joins


For each user in users:
      For each movie in movies:
            if user.movie_id == movie.id:
                    output user.id, movie.id, movie.rating
Sort-Merge Joins
User    Movie

 789     24

  7      24               User     Movie   Rank

 23      829              789       24     100

 234    5678               7        24     100

                           23       829     34
Movie   Rank
                          234      5678     4
 24      100

 829     34

5678      4
Hash Joins
User    Movie

 23      829

 789     24

 234    5678

  7      24


Movie   Rank

5678      4

 24      100

 829     34
Distributed Joins

                Map
     reduce key := hash(join key)

                Reduce
        local (sort-merge) join


*also need to keep track of which table
    is the left and which is the right
Joins
{ inner, left, right, outer }
User   Movie
User    Sex
                23      829
23     male
                789     24
789    female
                234    5678
234    female
                 7      24
 7     male
                789     90
26     male
                23      758
567    female
                23      39
 2     female
                 2      782
User    Sex
                User   Activity
23     male
                23        3
789    female
                789       2
234    female
                234       1
 7     male
                 7        1
26     male
                789      90
567    female
                 2        1
 2     female
User                Sex
                                        User       Activity
23                 male
                                        23            3
789                female
                                        789           2
234                female
                                        234           1
 7                 male
                                          7           1
26                 male
                                        789           90
567                female
                                          2           1
 2                 female


                          User    Sex          Activity
                            23   male             3
       Left Join



                          789    female           2
                          234    female           1
                            7    male             1
                            26   male
                          567    female
                            2    female           1
User                 Sex
                                         User       Activity
23                  male
                                         23            3
789                 female
                                         789           2
234                 female
                                         234           1
 7                  male
                                           7           1
26                  male
                                         789           90
567                 female
                                           2           1
 2                  female


                           User    Sex          Activity
       Inner Join


                             23   male             3
                           789    female           2
                           234    female           1
                             7    male             1
                             2    female           1
Inner join
          returns pairs of rows in tables A & B
               that match join condition

                      Left (outer) join
         returns all rows from an inner join plus
rows in the left table that do not match to the right table

                      Full (outer) join
         returns all rows from an inner join plus
   rows in either table that do not match to the other
Map-side Joins

                     Map
      load (smaller) table into memory
stream through (larger) table and find matches

                   Reduce
                   identity
MapReduce Ops

           Map-only
Filter, sample, map-side joins

      Map & Reduce
 groupby, distinct, sort, join
The long tail

             Input
      (user, movie) views

             Output
for each user, average popularity
      of movies they watch
Step 1. compute movie popularity
  group views by movie & count
Step 2. Rank movies
 sort by popularity
Step 3. merge view and ranking data
join views and movie popularity tables
Step 4. compute eccentricity
group views/ranking by user and
     compute eccentricity
Pig Latin:
A Not-So-Foreign Language for Data Processing
   Olston, Reed, Srivastava, Kumar, and Tomkins
                  SIGMOD, 2008
Ad

More Related Content

Viewers also liked (20)

Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regression
jakehofman
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
jakehofman
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part II
jakehofman
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part I
jakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
jakehofman
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
jakehofman
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
jakehofman
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scale
jakehofman
 
Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03
jakehofman
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scale
jakehofman
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in R
jakehofman
 
практ7
практ7практ7
практ7
slavinskiy1
 
Cerveceria LOS VIKINGOS
Cerveceria LOS VIKINGOSCerveceria LOS VIKINGOS
Cerveceria LOS VIKINGOS
jorchuk
 
Estancias en Guadalajara
Estancias en GuadalajaraEstancias en Guadalajara
Estancias en Guadalajara
Alice Listing
 
практ3
практ3практ3
практ3
slavinskiy1
 
Conferencia educación católica versión final - abril 24, 2009..[1]
Conferencia educación católica   versión final - abril 24, 2009..[1]Conferencia educación católica   versión final - abril 24, 2009..[1]
Conferencia educación católica versión final - abril 24, 2009..[1]
julian
 
LAS PLANTAS
LAS PLANTASLAS PLANTAS
LAS PLANTAS
rosayago
 
No Esperes
No EsperesNo Esperes
No Esperes
guest410c45
 
4 de cada barroco
4 de cada barroco4 de cada barroco
4 de cada barroco
anamaria35
 
Starbucks
StarbucksStarbucks
Starbucks
Anupam Sharma
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regression
jakehofman
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
jakehofman
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part II
jakehofman
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part I
jakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
jakehofman
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
jakehofman
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
jakehofman
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scale
jakehofman
 
Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03
jakehofman
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scale
jakehofman
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in R
jakehofman
 
Cerveceria LOS VIKINGOS
Cerveceria LOS VIKINGOSCerveceria LOS VIKINGOS
Cerveceria LOS VIKINGOS
jorchuk
 
Estancias en Guadalajara
Estancias en GuadalajaraEstancias en Guadalajara
Estancias en Guadalajara
Alice Listing
 
Conferencia educación católica versión final - abril 24, 2009..[1]
Conferencia educación católica   versión final - abril 24, 2009..[1]Conferencia educación católica   versión final - abril 24, 2009..[1]
Conferencia educación católica versión final - abril 24, 2009..[1]
julian
 
LAS PLANTAS
LAS PLANTASLAS PLANTAS
LAS PLANTAS
rosayago
 
4 de cada barroco
4 de cada barroco4 de cada barroco
4 de cada barroco
anamaria35
 

More from jakehofman (14)

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
jakehofman
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
jakehofman
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
jakehofman
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
jakehofman
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalization
jakehofman
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systems
jakehofman
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
jakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
jakehofman
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
jakehofman
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
jakehofman
 
Technical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal WabbitTechnical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal Wabbit
jakehofman
 
Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10
jakehofman
 
Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09
jakehofman
 
Using Data to Understand the Brain
Using Data to Understand the BrainUsing Data to Understand the Brain
Using Data to Understand the Brain
jakehofman
 
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
jakehofman
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
jakehofman
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
jakehofman
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
jakehofman
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalization
jakehofman
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systems
jakehofman
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
jakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
jakehofman
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
jakehofman
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
jakehofman
 
Technical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal WabbitTechnical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal Wabbit
jakehofman
 
Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10
jakehofman
 
Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09
jakehofman
 
Using Data to Understand the Brain
Using Data to Understand the BrainUsing Data to Understand the Brain
Using Data to Understand the Brain
jakehofman
 
Ad

Recently uploaded (20)

Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Leonel Morgado
 
TERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptx
TERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptxTERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptx
TERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptx
PoojaSen20
 
How to Share Accounts Between Companies in Odoo 18
How to Share Accounts Between Companies in Odoo 18How to Share Accounts Between Companies in Odoo 18
How to Share Accounts Between Companies in Odoo 18
Celine George
 
How to Manage Manual Reordering Rule in Odoo 18 Inventory
How to Manage Manual Reordering Rule in Odoo 18 InventoryHow to Manage Manual Reordering Rule in Odoo 18 Inventory
How to Manage Manual Reordering Rule in Odoo 18 Inventory
Celine George
 
Look Up, Look Down: Spotting Local History Everywhere
Look Up, Look Down: Spotting Local History EverywhereLook Up, Look Down: Spotting Local History Everywhere
Look Up, Look Down: Spotting Local History Everywhere
History of Stoke Newington
 
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon DolabaniHistory Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
fruinkamel7m
 
Module 1: Foundations of Research
Module 1: Foundations of ResearchModule 1: Foundations of Research
Module 1: Foundations of Research
drroxannekemp
 
Origin of Brahmi script: A breaking down of various theories
Origin of Brahmi script: A breaking down of various theoriesOrigin of Brahmi script: A breaking down of various theories
Origin of Brahmi script: A breaking down of various theories
PrachiSontakke5
 
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
Dr. Nasir Mustafa
 
Module_2_Types_and_Approaches_of_Research (2).pptx
Module_2_Types_and_Approaches_of_Research (2).pptxModule_2_Types_and_Approaches_of_Research (2).pptx
Module_2_Types_and_Approaches_of_Research (2).pptx
drroxannekemp
 
How to Add Button in Chatter in Odoo 18 - Odoo Slides
How to Add Button in Chatter in Odoo 18 - Odoo SlidesHow to Add Button in Chatter in Odoo 18 - Odoo Slides
How to Add Button in Chatter in Odoo 18 - Odoo Slides
Celine George
 
Myopathies (muscle disorders) for undergraduate
Myopathies (muscle disorders) for undergraduateMyopathies (muscle disorders) for undergraduate
Myopathies (muscle disorders) for undergraduate
Mohamed Rizk Khodair
 
DEATH & ITS TYPES AND PHYSIOLOGICAL CHANGES IN BODY AFTER DEATH, PATIENT WILL...
DEATH & ITS TYPES AND PHYSIOLOGICAL CHANGES IN BODY AFTER DEATH, PATIENT WILL...DEATH & ITS TYPES AND PHYSIOLOGICAL CHANGES IN BODY AFTER DEATH, PATIENT WILL...
DEATH & ITS TYPES AND PHYSIOLOGICAL CHANGES IN BODY AFTER DEATH, PATIENT WILL...
PoojaSen20
 
antiquity of writing in ancient India- literary & archaeological evidence
antiquity of writing in ancient India- literary & archaeological evidenceantiquity of writing in ancient India- literary & archaeological evidence
antiquity of writing in ancient India- literary & archaeological evidence
PrachiSontakke5
 
E-Filing_of_Income_Tax.pptx and concept of form 26AS
E-Filing_of_Income_Tax.pptx and concept of form 26ASE-Filing_of_Income_Tax.pptx and concept of form 26AS
E-Filing_of_Income_Tax.pptx and concept of form 26AS
Abinash Palangdar
 
The History of Kashmir Karkota Dynasty NEP.pptx
The History of Kashmir Karkota Dynasty NEP.pptxThe History of Kashmir Karkota Dynasty NEP.pptx
The History of Kashmir Karkota Dynasty NEP.pptx
Arya Mahila P. G. College, Banaras Hindu University, Varanasi, India.
 
libbys peer assesment.docx..............
libbys peer assesment.docx..............libbys peer assesment.docx..............
libbys peer assesment.docx..............
19lburrell
 
Peer Assessment_ Unit 2 Skills Development for Live Performance - for Libby.docx
Peer Assessment_ Unit 2 Skills Development for Live Performance - for Libby.docxPeer Assessment_ Unit 2 Skills Development for Live Performance - for Libby.docx
Peer Assessment_ Unit 2 Skills Development for Live Performance - for Libby.docx
19lburrell
 
MEDICAL BIOLOGY MCQS BY. DR NASIR MUSTAFA
MEDICAL BIOLOGY MCQS  BY. DR NASIR MUSTAFAMEDICAL BIOLOGY MCQS  BY. DR NASIR MUSTAFA
MEDICAL BIOLOGY MCQS BY. DR NASIR MUSTAFA
Dr. Nasir Mustafa
 
How to Configure Public Holidays & Mandatory Days in Odoo 18
How to Configure Public Holidays & Mandatory Days in Odoo 18How to Configure Public Holidays & Mandatory Days in Odoo 18
How to Configure Public Holidays & Mandatory Days in Odoo 18
Celine George
 
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Leonel Morgado
 
TERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptx
TERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptxTERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptx
TERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptx
PoojaSen20
 
How to Share Accounts Between Companies in Odoo 18
How to Share Accounts Between Companies in Odoo 18How to Share Accounts Between Companies in Odoo 18
How to Share Accounts Between Companies in Odoo 18
Celine George
 
How to Manage Manual Reordering Rule in Odoo 18 Inventory
How to Manage Manual Reordering Rule in Odoo 18 InventoryHow to Manage Manual Reordering Rule in Odoo 18 Inventory
How to Manage Manual Reordering Rule in Odoo 18 Inventory
Celine George
 
Look Up, Look Down: Spotting Local History Everywhere
Look Up, Look Down: Spotting Local History EverywhereLook Up, Look Down: Spotting Local History Everywhere
Look Up, Look Down: Spotting Local History Everywhere
History of Stoke Newington
 
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon DolabaniHistory Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
fruinkamel7m
 
Module 1: Foundations of Research
Module 1: Foundations of ResearchModule 1: Foundations of Research
Module 1: Foundations of Research
drroxannekemp
 
Origin of Brahmi script: A breaking down of various theories
Origin of Brahmi script: A breaking down of various theoriesOrigin of Brahmi script: A breaking down of various theories
Origin of Brahmi script: A breaking down of various theories
PrachiSontakke5
 
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
Dr. Nasir Mustafa
 
Module_2_Types_and_Approaches_of_Research (2).pptx
Module_2_Types_and_Approaches_of_Research (2).pptxModule_2_Types_and_Approaches_of_Research (2).pptx
Module_2_Types_and_Approaches_of_Research (2).pptx
drroxannekemp
 
How to Add Button in Chatter in Odoo 18 - Odoo Slides
How to Add Button in Chatter in Odoo 18 - Odoo SlidesHow to Add Button in Chatter in Odoo 18 - Odoo Slides
How to Add Button in Chatter in Odoo 18 - Odoo Slides
Celine George
 
Myopathies (muscle disorders) for undergraduate
Myopathies (muscle disorders) for undergraduateMyopathies (muscle disorders) for undergraduate
Myopathies (muscle disorders) for undergraduate
Mohamed Rizk Khodair
 
DEATH & ITS TYPES AND PHYSIOLOGICAL CHANGES IN BODY AFTER DEATH, PATIENT WILL...
DEATH & ITS TYPES AND PHYSIOLOGICAL CHANGES IN BODY AFTER DEATH, PATIENT WILL...DEATH & ITS TYPES AND PHYSIOLOGICAL CHANGES IN BODY AFTER DEATH, PATIENT WILL...
DEATH & ITS TYPES AND PHYSIOLOGICAL CHANGES IN BODY AFTER DEATH, PATIENT WILL...
PoojaSen20
 
antiquity of writing in ancient India- literary & archaeological evidence
antiquity of writing in ancient India- literary & archaeological evidenceantiquity of writing in ancient India- literary & archaeological evidence
antiquity of writing in ancient India- literary & archaeological evidence
PrachiSontakke5
 
E-Filing_of_Income_Tax.pptx and concept of form 26AS
E-Filing_of_Income_Tax.pptx and concept of form 26ASE-Filing_of_Income_Tax.pptx and concept of form 26AS
E-Filing_of_Income_Tax.pptx and concept of form 26AS
Abinash Palangdar
 
libbys peer assesment.docx..............
libbys peer assesment.docx..............libbys peer assesment.docx..............
libbys peer assesment.docx..............
19lburrell
 
Peer Assessment_ Unit 2 Skills Development for Live Performance - for Libby.docx
Peer Assessment_ Unit 2 Skills Development for Live Performance - for Libby.docxPeer Assessment_ Unit 2 Skills Development for Live Performance - for Libby.docx
Peer Assessment_ Unit 2 Skills Development for Live Performance - for Libby.docx
19lburrell
 
MEDICAL BIOLOGY MCQS BY. DR NASIR MUSTAFA
MEDICAL BIOLOGY MCQS  BY. DR NASIR MUSTAFAMEDICAL BIOLOGY MCQS  BY. DR NASIR MUSTAFA
MEDICAL BIOLOGY MCQS BY. DR NASIR MUSTAFA
Dr. Nasir Mustafa
 
How to Configure Public Holidays & Mandatory Days in Odoo 18
How to Configure Public Holidays & Mandatory Days in Odoo 18How to Configure Public Holidays & Mandatory Days in Odoo 18
How to Configure Public Holidays & Mandatory Days in Odoo 18
Celine George
 
Ad

Computational Social Science, Lecture 04: Counting at Scale, Part II

  • 1. Counting @ Scale Part II Sharad Goel Columbia University Computational Social Science: Lecture 4 February 15, 2013
  • 2. Descriptive statistics (as opposed to inferential statistics) is about counting contingency tables means, variances, quantiles summaries of conditional distributions
  • 3. Counting @ scale conceptually easy computationally hard
  • 4. MapReduce: Simplifed Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat OSDI, 2004
  • 5. Map assign each input line to one or more groups Shuffle aggregate groups Reduce operate on grouped data
  • 6. Map assign each input line to one or more groups v  [(k1, v1), …, (km, vm)] Shuffle aggregate groups Reduce operate on grouped data (k, [v1, …, vn])  [w1, …, wp]
  • 7. Group Average Input views (user, movie, rating) Output average (mean & median) by movie
  • 8. Group Average Map identity (key := movie) Reduce movie group  average
  • 9. The Insight of MapReduce One can efficiently group identical items Many tasks are computationally easier on grouped data
  • 10. Filter Input arbitrary data & filter condition Output subset of input data satisfying condition
  • 11. Filter Map input  input if condition(input) else pass Reduce identity
  • 12. Distinct Input set of items Output subset of distinct items
  • 13. Distinct Map identity Reduce grouped items  single item from group
  • 14. Sample Input set of items & sample probability p Output random subset of items
  • 15. Sample Map input  input if rand(0,1) < p else pass Reduce identity
  • 16. Sort Input set of items (and a key) Output ordered set of items
  • 17. Sort Map identity, with all data assigned to the same key Reduce identity *all the work happens in the shuffle
  • 18. Sort Map identity, with key := first letter of line Reduce identity *all the work happens in the shuffle
  • 19. Sort Sample generate a small sample of the data (with MapReduce) Determine breakpoints sort the sample and compute percentiles
  • 20. Sort Map identity, with key determined by breakpoints Reduce identity *most of the work happens in the shuffle
  • 21. Combining data Example for each user, want to compute the average popularity of the movies they watch Problem one file contains views (user, movie); another file contains popularity (movie, rank)
  • 22. Joins User Movie 23 829 789 24 User Movie Rank 234 5678 23 829 34 7 24 789 24 100 234 5678 4 Movie Rank 7 24 100 5678 4 24 100 829 34
  • 23. Nested-Loop Joins For each user in users: For each movie in movies: if user.movie_id == movie.id: output user.id, movie.id, movie.rating
  • 24. Sort-Merge Joins User Movie 789 24 7 24 User Movie Rank 23 829 789 24 100 234 5678 7 24 100 23 829 34 Movie Rank 234 5678 4 24 100 829 34 5678 4
  • 25. Hash Joins User Movie 23 829 789 24 234 5678 7 24 Movie Rank 5678 4 24 100 829 34
  • 26. Distributed Joins Map reduce key := hash(join key) Reduce local (sort-merge) join *also need to keep track of which table is the left and which is the right
  • 27. Joins { inner, left, right, outer }
  • 28. User Movie User Sex 23 829 23 male 789 24 789 female 234 5678 234 female 7 24 7 male 789 90 26 male 23 758 567 female 23 39 2 female 2 782
  • 29. User Sex User Activity 23 male 23 3 789 female 789 2 234 female 234 1 7 male 7 1 26 male 789 90 567 female 2 1 2 female
  • 30. User Sex User Activity 23 male 23 3 789 female 789 2 234 female 234 1 7 male 7 1 26 male 789 90 567 female 2 1 2 female User Sex Activity 23 male 3 Left Join 789 female 2 234 female 1 7 male 1 26 male 567 female 2 female 1
  • 31. User Sex User Activity 23 male 23 3 789 female 789 2 234 female 234 1 7 male 7 1 26 male 789 90 567 female 2 1 2 female User Sex Activity Inner Join 23 male 3 789 female 2 234 female 1 7 male 1 2 female 1
  • 32. Inner join returns pairs of rows in tables A & B that match join condition Left (outer) join returns all rows from an inner join plus rows in the left table that do not match to the right table Full (outer) join returns all rows from an inner join plus rows in either table that do not match to the other
  • 33. Map-side Joins Map load (smaller) table into memory stream through (larger) table and find matches Reduce identity
  • 34. MapReduce Ops Map-only Filter, sample, map-side joins Map & Reduce groupby, distinct, sort, join
  • 35. The long tail Input (user, movie) views Output for each user, average popularity of movies they watch
  • 36. Step 1. compute movie popularity group views by movie & count
  • 37. Step 2. Rank movies sort by popularity
  • 38. Step 3. merge view and ranking data join views and movie popularity tables
  • 39. Step 4. compute eccentricity group views/ranking by user and compute eccentricity
  • 40. Pig Latin: A Not-So-Foreign Language for Data Processing Olston, Reed, Srivastava, Kumar, and Tomkins SIGMOD, 2008
  翻译: