SlideShare a Scribd company logo
Duplicate detection
Kira Radinsky
Based on the Standford slides by Christopher Manning
and Prabhakar Raghavan
Duplication
• ~30% of the content on the Web is near-duplicate pages
– Pages with content that is nearly identical to that of other
pages
• Issues:
– Index duplicate content only once
– Return only one version in the search results
– How can near-duplicate pages be identified in a scalable
and reliable manner?
Duplicate/Near-Duplicate Detection
• Duplication: Exact match can be detected
with fingerprints
• Near-Duplication: Approximate match
– Compute syntactic similarity with an edit-distance
measure
– Use similarity threshold to detect near-duplicates
• E.g., Similarity > 80% => Documents are near
duplicates
• Not transitive though sometimes used transitively
Computing Similarity - Shingles
Features for Similarity (Shingles (Word N-Grams))
• Segments of a document (natural or artificial breakpoints)
• K-shingling of a document transforms the document into a
set containing all windows of k contiguous terms
• E.g., 4-shingling of
“My name is Inigo Montoya. You killed my father. Prepare to die”:
{
my name is inigo
name is inigo montoya
is inigo montoya you
inigo montoya you killed
montoya you killed my,
you killed my father
killed my father prepare
my father prepare to
father prepare to die
}
Computing Similarity – distance metric
Similarity Measurement
• Denote by Sk(d) the k-shingling of document d
• Definition: the resemblance of d1 and d2,
R(d1,d2) = |Sk(d1)  Sk(d2)| / |Sk(d1)  Sk(d2)|
• The distance measure Δ(d1,d2) = 1-R(d1,d2) is a
metric
Shingles + Set Intersection
1. Computing exact set intersection of shingles
between all pairs of documents is
expensive/intractable
– Approximate using a cleverly chosen subset of
shingles from each (a sketch)
2. Estimate (size_of_intersection /
size_of_union) based on a short sketch
Doc
A Shingle set A Sketch A
Doc B
Shingle set B Sketch B
Jaccard
Sketch of a document
Create a sketch vector (of size ~200) for each document
– Documents that share ≥ t (usually 80%) corresponding
vector elements will be considered near duplicates
– Definitions:
• Let f map all k-shingles in the universe to 0..2m (e.g., f =
fingerprinting)
• Let p be a random permutation on 0..2m
– For doc D, sketchD is as follows:
• Sketch Option 1:
Fm(d)=minm(Π(Sk(d)) be the m smallest numbers after
applying Π to the k-shingling of d
• Sketch Option 2:
Vn(d) = { t ε Π(Sk(d)) | t = 0 mod n }
From Shingles to Sketches (cont.)
• A function X is an unbiased estimator of a value Y if E(X)=Y
• Theorem: when choosing Π u.a.r., the following functions are
unbiased estimators of R(d1,d2):
• | minm(Fm(d1)  Fm(d2))  Fm(d1)  Fm(d2) | /
| minm(Fm(d1)  Fm(d2)) |
• | Vn(d1)  Vn(d2) | / | Vn(d1)  Vn(d2) |
• Either Fm(d) or Vn(d) can be chosen as d’s sketch
– Fm(d) has the advantage that it is of fixed size
– Vn(d) is easier to compute
Multiple Sketches
• Let pi be a random permutation on 0..2m
• For doc D, sketchD[ i ] is as follows:
– Let f map all shingles in the universe to 0..2m (e.g.,
f = fingerprinting)
– Let pi be a random permutation on 0..2m
– Pick MIN {pi(f(s))} over all shingles s in D
Computing Sketch[i] for Doc1
Document 1
264
264
264
264
Start with 64-bit f(shingles)
Permute on the number line
with pi
Pick the min value
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
Document 1 Document 2
264
264
264
264
264
264
264
264
Are these equal?
Test for 200 random permutations: p1, p2,… p200
A B
However…
Document 1 Document 2
264
264
264
264
264
264
264
264
A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is
common to both (i.e., lies in the intersection)
Claim: This happens with probability
Size_of_intersection / Size_of_union
Why?
BA
Set Similarity of sets Ci , Cj
• View sets as columns of a matrix A
– one row for each element in the universe.
– aij = 1 indicates presence of item i in set j
• Example
ji
ji
ji
CC
CC
)C,Jaccard(C



C1 C2
0 1
1 0
1 1 Jaccard(C1,C2) = 2/5 = 0.4
0 0
1 1
0 1
Key Observation
• For columns Ci, Cj, four types of rows
Ci Cj
A 1 1
B 1 0
C 0 1
D 0 0
• Let A = # of rows of type A
• Claim
CBA
A
)C,Jaccard(C ji


Min Hashing
• Randomly permute rows
• Hash h(Ci) = index of first row with 1 in
column Ci
• Surprising Property
• Why?
– Both are A/(A+B+C)
– Look down columns Ci, Cj until first non-Type-D
row
– h(Ci) = h(Cj)  type A row
   jiji C,CJaccard)h(C)h(CP 
Min-Hash sketches
• Pick P random row permutations
• MinHash sketch
SketchD = list of P indexes of first rows with 1 in
column C
• Similarity of signatures
– Let sim[sketch(Ci),sketch(Cj)] = fraction of
permutations where MinHash values agree
– Observe E[sim(sig(Ci),sig(Cj))] = Jaccard(Ci,Cj)
Example
C1 C2 C3
R1 1 0 1
R2 0 1 1
R3 1 0 0
R4 1 0 1
R5 0 1 0
Signatures
S1 S2 S3
Perm 1 = (12345) 1 2 1
Perm 2 = (54321) 4 5 4
Perm 3 = (34512) 3 5 4
Similarities
1-2 1-3 2-3
Col-Col 0.00 0.50 0.25
Sig-Sig 0.00 0.67 0.00
Algorithm for Clustering Near-Duplicate
Documents
1. Compute the sketch of each document
2. From each sketch, produce a list of <shingle, docID> pairs
3. Group all pairs by shingle value
4. For any shingle that is shared by more than one document, output a
triplet <smaller-docID, larger-docID, 1> for each pair of docIDs sharing
that shingle
5. Sort and aggregate the list of triplets, producing final triplets of the
form <smaller-docID, larger-docID, # common shingles>
6. Join any pair of documents whose number of common shingles
exceeds a chosen threshold using a “Union-Find” algorithm
7. Each resulting connected component of the UF algorithm is a cluster
of near-duplicate documents
Implementation nicely fits the “map-reduce” programming paradigm
Tutorial 4 (duplicate detection)
Implementation Trick
• Permuting universe even once is prohibitive
• Row Hashing
– Pick P hash functions hk: {1,…,n}{1,…,O(n)}
– Ordering under hk gives random permutation of
rows
• One-pass Implementation
– For each Ci and hk, keep slot for min-hash value
– Initialize all slot(Ci,hk) to infinity
– Scan rows in arbitrary order looking for 1’s
• Suppose row Rj has 1 in column Ci
• For each hk,
– if hk(j) < slot(Ci,hk), then slot(Ci,hk)  hk(j)
Example
C1 C2
R1 1 0
R2 0 1
R3 1 1
R4 1 0
R5 0 1
h(x) = x mod 5
g(x) = 2x+1 mod 5
h(1) = 1 1 -
g(1) = 3 3 -
h(2) = 2 1 2
g(2) = 0 3 0
h(3) = 3 1 2
g(3) = 2 2 0
h(4) = 4 1 2
g(4) = 4 2 0
h(5) = 0 1 0
g(5) = 1 2 0
C1 slots C2 slots
Ad

More Related Content

What's hot (20)

Basic Statistics & Data Analysis
Basic Statistics & Data AnalysisBasic Statistics & Data Analysis
Basic Statistics & Data Analysis
Ajendra Sharma
 
Analysis of statistical data in heath information management
Analysis of statistical data in heath information managementAnalysis of statistical data in heath information management
Analysis of statistical data in heath information management
Saleh Ahmed
 
Data analysis copy
Data analysis   copyData analysis   copy
Data analysis copy
KAVITHAMONTADKA
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
SOMASUNDARAM T
 
Descriptive Statistics
Descriptive StatisticsDescriptive Statistics
Descriptive Statistics
Neny Isharyanti
 
Methods of organizing data
Methods of organizing dataMethods of organizing data
Methods of organizing data
Roxane La'O
 
Data warehouse project on retail store
Data warehouse project on retail storeData warehouse project on retail store
Data warehouse project on retail store
Siddharth Chaudhary
 
SAS BASICS
SAS BASICSSAS BASICS
SAS BASICS
Bhuwanesh Rawat
 
Introduction to Computational Statistics
Introduction to Computational StatisticsIntroduction to Computational Statistics
Introduction to Computational Statistics
Setia Pramana
 
Basics stat ppt-types of data
Basics stat ppt-types of dataBasics stat ppt-types of data
Basics stat ppt-types of data
Farhana Shaheen
 
Introduction to Tableau
Introduction to Tableau Introduction to Tableau
Introduction to Tableau
Mithileysh Sathiyanarayanan
 
Statistics
StatisticsStatistics
Statistics
Arpit Sharma
 
DATA Types
DATA TypesDATA Types
DATA Types
Aniruddha Deshmukh
 
(Manual spss)
(Manual spss)(Manual spss)
(Manual spss)
Enas Ahmed
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
Vishwas N
 
How different between Big Data, Business Intelligence and Analytics ?
How different between Big Data, Business Intelligence and Analytics ?How different between Big Data, Business Intelligence and Analytics ?
How different between Big Data, Business Intelligence and Analytics ?
Thanakrit Lersmethasakul
 
What Is Unstructured Data And Why Is It So Important To Businesses?
What Is Unstructured Data And Why Is It So Important To Businesses?What Is Unstructured Data And Why Is It So Important To Businesses?
What Is Unstructured Data And Why Is It So Important To Businesses?
Bernard Marr
 
ETL Process
ETL ProcessETL Process
ETL Process
Rashmi Bhat
 
Introduction to Stata
Introduction to Stata Introduction to Stata
Introduction to Stata
Samaa Hazem Hosny
 
Descriptive Statistics
Descriptive StatisticsDescriptive Statistics
Descriptive Statistics
CIToolkit
 
Basic Statistics & Data Analysis
Basic Statistics & Data AnalysisBasic Statistics & Data Analysis
Basic Statistics & Data Analysis
Ajendra Sharma
 
Analysis of statistical data in heath information management
Analysis of statistical data in heath information managementAnalysis of statistical data in heath information management
Analysis of statistical data in heath information management
Saleh Ahmed
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
SOMASUNDARAM T
 
Methods of organizing data
Methods of organizing dataMethods of organizing data
Methods of organizing data
Roxane La'O
 
Data warehouse project on retail store
Data warehouse project on retail storeData warehouse project on retail store
Data warehouse project on retail store
Siddharth Chaudhary
 
Introduction to Computational Statistics
Introduction to Computational StatisticsIntroduction to Computational Statistics
Introduction to Computational Statistics
Setia Pramana
 
Basics stat ppt-types of data
Basics stat ppt-types of dataBasics stat ppt-types of data
Basics stat ppt-types of data
Farhana Shaheen
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
Vishwas N
 
How different between Big Data, Business Intelligence and Analytics ?
How different between Big Data, Business Intelligence and Analytics ?How different between Big Data, Business Intelligence and Analytics ?
How different between Big Data, Business Intelligence and Analytics ?
Thanakrit Lersmethasakul
 
What Is Unstructured Data And Why Is It So Important To Businesses?
What Is Unstructured Data And Why Is It So Important To Businesses?What Is Unstructured Data And Why Is It So Important To Businesses?
What Is Unstructured Data And Why Is It So Important To Businesses?
Bernard Marr
 
Descriptive Statistics
Descriptive StatisticsDescriptive Statistics
Descriptive Statistics
CIToolkit
 

Viewers also liked (20)

Progressive duplicate detection
Progressive duplicate detectionProgressive duplicate detection
Progressive duplicate detection
ieeepondy
 
Duplicate detection
Duplicate detectionDuplicate detection
Duplicate detection
jonecx
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
eSAT Journals
 
An adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate recordsAn adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate records
Likan Patra
 
novel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingnovel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawling
Vipin Kp
 
The Duplicitous Duplicate
The Duplicitous DuplicateThe Duplicitous Duplicate
The Duplicitous Duplicate
Anish Raivadera
 
Production&creative
Production&creativeProduction&creative
Production&creative
Red Keds
 
TIPOGRAMAS LEGOLAND
TIPOGRAMAS LEGOLANDTIPOGRAMAS LEGOLAND
TIPOGRAMAS LEGOLAND
Brandon Uriel
 
Post-buckling analysis of a simply supported compound beams made of two symme...
Post-buckling analysis of a simply supported compound beams made of two symme...Post-buckling analysis of a simply supported compound beams made of two symme...
Post-buckling analysis of a simply supported compound beams made of two symme...
IOSR Journals
 
Web components
Web componentsWeb components
Web components
Mario Mendonça
 
Pura Tirtha Empul (Bali)
Pura Tirtha Empul (Bali)Pura Tirtha Empul (Bali)
Pura Tirtha Empul (Bali)
F. Ovies
 
От кирпича к мультиканальности, от fuckupов - к лидерству.
От кирпича к мультиканальности, от fuckupов  - к лидерству.От кирпича к мультиканальности, от fuckupов  - к лидерству.
От кирпича к мультиканальности, от fuckupов - к лидерству.
Promodo
 
Engage 2013 - Webtrends Streams
Engage 2013 - Webtrends StreamsEngage 2013 - Webtrends Streams
Engage 2013 - Webtrends Streams
Webtrends
 
Engage 2013 - SEM Optimization
Engage 2013 - SEM OptimizationEngage 2013 - SEM Optimization
Engage 2013 - SEM Optimization
Webtrends
 
2009 Heinz Marketing Holiday Cocktail Collection
2009 Heinz Marketing Holiday Cocktail Collection2009 Heinz Marketing Holiday Cocktail Collection
2009 Heinz Marketing Holiday Cocktail Collection
Heinz Marketing Inc
 
From Mao to More: Catching up with the next generation of talent in China
From Mao to More: Catching up with the next generation of talent in China From Mao to More: Catching up with the next generation of talent in China
From Mao to More: Catching up with the next generation of talent in China
MSL
 
EAGLENEST
EAGLENESTEAGLENEST
EAGLENEST
karim jaria
 
Lean UX for Design Teams (Crushing the Boulder)
Lean UX for Design Teams (Crushing the Boulder)Lean UX for Design Teams (Crushing the Boulder)
Lean UX for Design Teams (Crushing the Boulder)
Janice Fraser
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data Sets
Pradeeban Kathiravelu, Ph.D.
 
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
GangSeok Lee
 
Progressive duplicate detection
Progressive duplicate detectionProgressive duplicate detection
Progressive duplicate detection
ieeepondy
 
Duplicate detection
Duplicate detectionDuplicate detection
Duplicate detection
jonecx
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
eSAT Journals
 
An adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate recordsAn adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate records
Likan Patra
 
novel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingnovel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawling
Vipin Kp
 
The Duplicitous Duplicate
The Duplicitous DuplicateThe Duplicitous Duplicate
The Duplicitous Duplicate
Anish Raivadera
 
Production&creative
Production&creativeProduction&creative
Production&creative
Red Keds
 
Post-buckling analysis of a simply supported compound beams made of two symme...
Post-buckling analysis of a simply supported compound beams made of two symme...Post-buckling analysis of a simply supported compound beams made of two symme...
Post-buckling analysis of a simply supported compound beams made of two symme...
IOSR Journals
 
Pura Tirtha Empul (Bali)
Pura Tirtha Empul (Bali)Pura Tirtha Empul (Bali)
Pura Tirtha Empul (Bali)
F. Ovies
 
От кирпича к мультиканальности, от fuckupов - к лидерству.
От кирпича к мультиканальности, от fuckupов  - к лидерству.От кирпича к мультиканальности, от fuckupов  - к лидерству.
От кирпича к мультиканальности, от fuckupов - к лидерству.
Promodo
 
Engage 2013 - Webtrends Streams
Engage 2013 - Webtrends StreamsEngage 2013 - Webtrends Streams
Engage 2013 - Webtrends Streams
Webtrends
 
Engage 2013 - SEM Optimization
Engage 2013 - SEM OptimizationEngage 2013 - SEM Optimization
Engage 2013 - SEM Optimization
Webtrends
 
2009 Heinz Marketing Holiday Cocktail Collection
2009 Heinz Marketing Holiday Cocktail Collection2009 Heinz Marketing Holiday Cocktail Collection
2009 Heinz Marketing Holiday Cocktail Collection
Heinz Marketing Inc
 
From Mao to More: Catching up with the next generation of talent in China
From Mao to More: Catching up with the next generation of talent in China From Mao to More: Catching up with the next generation of talent in China
From Mao to More: Catching up with the next generation of talent in China
MSL
 
Lean UX for Design Teams (Crushing the Boulder)
Lean UX for Design Teams (Crushing the Boulder)Lean UX for Design Teams (Crushing the Boulder)
Lean UX for Design Teams (Crushing the Boulder)
Janice Fraser
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data Sets
Pradeeban Kathiravelu, Ph.D.
 
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
GangSeok Lee
 
Ad

Similar to Tutorial 4 (duplicate detection) (20)

3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar items
Viet-Trung TRAN
 
image compresson
image compressonimage compresson
image compresson
Ajay Kumar
 
Image compression
Image compressionImage compression
Image compression
Ale Johnsan
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERING
George Simov
 
Nn
NnNn
Nn
mattriley
 
Finding similar items in high dimensional spaces locality sensitive hashing
Finding similar items in high dimensional spaces  locality sensitive hashingFinding similar items in high dimensional spaces  locality sensitive hashing
Finding similar items in high dimensional spaces locality sensitive hashing
Dmitriy Selivanov
 
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Mail.ru Group
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
vini89
 
Bekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesBekas for cognitive_speaker_series
Bekas for cognitive_speaker_series
diannepatricia
 
Bekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesBekas for cognitive_speaker_series
Bekas for cognitive_speaker_series
diannepatricia
 
Report on Efficient Estimation for High Similarities using Odd Sketches
Report on Efficient Estimation for High Similarities using Odd Sketches  Report on Efficient Estimation for High Similarities using Odd Sketches
Report on Efficient Estimation for High Similarities using Odd Sketches
AXEL FOTSO
 
Introduction to complex networks
Introduction to complex networksIntroduction to complex networks
Introduction to complex networks
Vincent Traag
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
eXascale Infolab
 
Lec10 matching
Lec10 matchingLec10 matching
Lec10 matching
Suravet Konsetthee
 
Exploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queriesExploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queries
Luiz Henrique Zambom Santana
 
ImageCompression.ppt
ImageCompression.pptImageCompression.ppt
ImageCompression.ppt
dudoo1
 
ImageCompression.ppt
ImageCompression.pptImageCompression.ppt
ImageCompression.ppt
ssuser6d1fca
 
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Matthew Lease
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)
fridolin.wild
 
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
The Statistical and Applied Mathematical Sciences Institute
 
3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar items
Viet-Trung TRAN
 
image compresson
image compressonimage compresson
image compresson
Ajay Kumar
 
Image compression
Image compressionImage compression
Image compression
Ale Johnsan
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERING
George Simov
 
Finding similar items in high dimensional spaces locality sensitive hashing
Finding similar items in high dimensional spaces  locality sensitive hashingFinding similar items in high dimensional spaces  locality sensitive hashing
Finding similar items in high dimensional spaces locality sensitive hashing
Dmitriy Selivanov
 
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Mail.ru Group
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
vini89
 
Bekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesBekas for cognitive_speaker_series
Bekas for cognitive_speaker_series
diannepatricia
 
Bekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesBekas for cognitive_speaker_series
Bekas for cognitive_speaker_series
diannepatricia
 
Report on Efficient Estimation for High Similarities using Odd Sketches
Report on Efficient Estimation for High Similarities using Odd Sketches  Report on Efficient Estimation for High Similarities using Odd Sketches
Report on Efficient Estimation for High Similarities using Odd Sketches
AXEL FOTSO
 
Introduction to complex networks
Introduction to complex networksIntroduction to complex networks
Introduction to complex networks
Vincent Traag
 
Exploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queriesExploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queries
Luiz Henrique Zambom Santana
 
ImageCompression.ppt
ImageCompression.pptImageCompression.ppt
ImageCompression.ppt
dudoo1
 
ImageCompression.ppt
ImageCompression.pptImageCompression.ppt
ImageCompression.ppt
ssuser6d1fca
 
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Matthew Lease
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)
fridolin.wild
 
Ad

More from Kira (13)

Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)
Kira
 
Tutorial 12 (click models)
Tutorial 12 (click models)Tutorial 12 (click models)
Tutorial 12 (click models)
Kira
 
Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)
Kira
 
Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)
Kira
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)
Kira
 
Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Tutorial 8 (web graph models)
Tutorial 8 (web graph models)
Kira
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)
Kira
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)
Kira
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)
Kira
 
Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)
Kira
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)
Kira
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
Kira
 
Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)
Kira
 
Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)
Kira
 
Tutorial 12 (click models)
Tutorial 12 (click models)Tutorial 12 (click models)
Tutorial 12 (click models)
Kira
 
Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)
Kira
 
Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)
Kira
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)
Kira
 
Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Tutorial 8 (web graph models)
Tutorial 8 (web graph models)
Kira
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)
Kira
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)
Kira
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)
Kira
 
Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)
Kira
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)
Kira
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
Kira
 
Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)
Kira
 

Recently uploaded (20)

Building the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdfBuilding the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdf
Cheryl Hung
 
Bepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firmBepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firm
Benard76
 
AI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of DocumentsAI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of Documents
UiPathCommunity
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Cyntexa
 
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent LasterAI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
All Things Open
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
Build With AI - In Person Session Slides.pdf
Build With AI - In Person Session Slides.pdfBuild With AI - In Person Session Slides.pdf
Build With AI - In Person Session Slides.pdf
Google Developer Group - Harare
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?
Eric Torreborre
 
Agentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community MeetupAgentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community Meetup
Manoj Batra (1600 + Connections)
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
machines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdfmachines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdf
AmirStern2
 
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptxSmart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Seasia Infotech
 
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
James Anderson
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Config 2025 presentation recap covering both days
Config 2025 presentation recap covering both daysConfig 2025 presentation recap covering both days
Config 2025 presentation recap covering both days
TrishAntoni1
 
Top-AI-Based-Tools-for-Game-Developers (1).pptx
Top-AI-Based-Tools-for-Game-Developers (1).pptxTop-AI-Based-Tools-for-Game-Developers (1).pptx
Top-AI-Based-Tools-for-Game-Developers (1).pptx
BR Softech
 
Building the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdfBuilding the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdf
Cheryl Hung
 
Bepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firmBepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firm
Benard76
 
AI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of DocumentsAI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of Documents
UiPathCommunity
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Cyntexa
 
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent LasterAI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
All Things Open
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?
Eric Torreborre
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
machines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdfmachines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdf
AmirStern2
 
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptxSmart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Seasia Infotech
 
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
James Anderson
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Config 2025 presentation recap covering both days
Config 2025 presentation recap covering both daysConfig 2025 presentation recap covering both days
Config 2025 presentation recap covering both days
TrishAntoni1
 
Top-AI-Based-Tools-for-Game-Developers (1).pptx
Top-AI-Based-Tools-for-Game-Developers (1).pptxTop-AI-Based-Tools-for-Game-Developers (1).pptx
Top-AI-Based-Tools-for-Game-Developers (1).pptx
BR Softech
 

Tutorial 4 (duplicate detection)

  • 1. Duplicate detection Kira Radinsky Based on the Standford slides by Christopher Manning and Prabhakar Raghavan
  • 2. Duplication • ~30% of the content on the Web is near-duplicate pages – Pages with content that is nearly identical to that of other pages • Issues: – Index duplicate content only once – Return only one version in the search results – How can near-duplicate pages be identified in a scalable and reliable manner?
  • 3. Duplicate/Near-Duplicate Detection • Duplication: Exact match can be detected with fingerprints • Near-Duplication: Approximate match – Compute syntactic similarity with an edit-distance measure – Use similarity threshold to detect near-duplicates • E.g., Similarity > 80% => Documents are near duplicates • Not transitive though sometimes used transitively
  • 4. Computing Similarity - Shingles Features for Similarity (Shingles (Word N-Grams)) • Segments of a document (natural or artificial breakpoints) • K-shingling of a document transforms the document into a set containing all windows of k contiguous terms • E.g., 4-shingling of “My name is Inigo Montoya. You killed my father. Prepare to die”: { my name is inigo name is inigo montoya is inigo montoya you inigo montoya you killed montoya you killed my, you killed my father killed my father prepare my father prepare to father prepare to die }
  • 5. Computing Similarity – distance metric Similarity Measurement • Denote by Sk(d) the k-shingling of document d • Definition: the resemblance of d1 and d2, R(d1,d2) = |Sk(d1)  Sk(d2)| / |Sk(d1)  Sk(d2)| • The distance measure Δ(d1,d2) = 1-R(d1,d2) is a metric
  • 6. Shingles + Set Intersection 1. Computing exact set intersection of shingles between all pairs of documents is expensive/intractable – Approximate using a cleverly chosen subset of shingles from each (a sketch) 2. Estimate (size_of_intersection / size_of_union) based on a short sketch Doc A Shingle set A Sketch A Doc B Shingle set B Sketch B Jaccard
  • 7. Sketch of a document Create a sketch vector (of size ~200) for each document – Documents that share ≥ t (usually 80%) corresponding vector elements will be considered near duplicates – Definitions: • Let f map all k-shingles in the universe to 0..2m (e.g., f = fingerprinting) • Let p be a random permutation on 0..2m – For doc D, sketchD is as follows: • Sketch Option 1: Fm(d)=minm(Π(Sk(d)) be the m smallest numbers after applying Π to the k-shingling of d • Sketch Option 2: Vn(d) = { t ε Π(Sk(d)) | t = 0 mod n }
  • 8. From Shingles to Sketches (cont.) • A function X is an unbiased estimator of a value Y if E(X)=Y • Theorem: when choosing Π u.a.r., the following functions are unbiased estimators of R(d1,d2): • | minm(Fm(d1)  Fm(d2))  Fm(d1)  Fm(d2) | / | minm(Fm(d1)  Fm(d2)) | • | Vn(d1)  Vn(d2) | / | Vn(d1)  Vn(d2) | • Either Fm(d) or Vn(d) can be chosen as d’s sketch – Fm(d) has the advantage that it is of fixed size – Vn(d) is easier to compute
  • 9. Multiple Sketches • Let pi be a random permutation on 0..2m • For doc D, sketchD[ i ] is as follows: – Let f map all shingles in the universe to 0..2m (e.g., f = fingerprinting) – Let pi be a random permutation on 0..2m – Pick MIN {pi(f(s))} over all shingles s in D
  • 10. Computing Sketch[i] for Doc1 Document 1 264 264 264 264 Start with 64-bit f(shingles) Permute on the number line with pi Pick the min value
  • 11. Test if Doc1.Sketch[i] = Doc2.Sketch[i] Document 1 Document 2 264 264 264 264 264 264 264 264 Are these equal? Test for 200 random permutations: p1, p2,… p200 A B
  • 12. However… Document 1 Document 2 264 264 264 264 264 264 264 264 A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection) Claim: This happens with probability Size_of_intersection / Size_of_union Why? BA
  • 13. Set Similarity of sets Ci , Cj • View sets as columns of a matrix A – one row for each element in the universe. – aij = 1 indicates presence of item i in set j • Example ji ji ji CC CC )C,Jaccard(C    C1 C2 0 1 1 0 1 1 Jaccard(C1,C2) = 2/5 = 0.4 0 0 1 1 0 1
  • 14. Key Observation • For columns Ci, Cj, four types of rows Ci Cj A 1 1 B 1 0 C 0 1 D 0 0 • Let A = # of rows of type A • Claim CBA A )C,Jaccard(C ji  
  • 15. Min Hashing • Randomly permute rows • Hash h(Ci) = index of first row with 1 in column Ci • Surprising Property • Why? – Both are A/(A+B+C) – Look down columns Ci, Cj until first non-Type-D row – h(Ci) = h(Cj)  type A row    jiji C,CJaccard)h(C)h(CP 
  • 16. Min-Hash sketches • Pick P random row permutations • MinHash sketch SketchD = list of P indexes of first rows with 1 in column C • Similarity of signatures – Let sim[sketch(Ci),sketch(Cj)] = fraction of permutations where MinHash values agree – Observe E[sim(sig(Ci),sig(Cj))] = Jaccard(Ci,Cj)
  • 17. Example C1 C2 C3 R1 1 0 1 R2 0 1 1 R3 1 0 0 R4 1 0 1 R5 0 1 0 Signatures S1 S2 S3 Perm 1 = (12345) 1 2 1 Perm 2 = (54321) 4 5 4 Perm 3 = (34512) 3 5 4 Similarities 1-2 1-3 2-3 Col-Col 0.00 0.50 0.25 Sig-Sig 0.00 0.67 0.00
  • 18. Algorithm for Clustering Near-Duplicate Documents 1. Compute the sketch of each document 2. From each sketch, produce a list of <shingle, docID> pairs 3. Group all pairs by shingle value 4. For any shingle that is shared by more than one document, output a triplet <smaller-docID, larger-docID, 1> for each pair of docIDs sharing that shingle 5. Sort and aggregate the list of triplets, producing final triplets of the form <smaller-docID, larger-docID, # common shingles> 6. Join any pair of documents whose number of common shingles exceeds a chosen threshold using a “Union-Find” algorithm 7. Each resulting connected component of the UF algorithm is a cluster of near-duplicate documents Implementation nicely fits the “map-reduce” programming paradigm
  • 20. Implementation Trick • Permuting universe even once is prohibitive • Row Hashing – Pick P hash functions hk: {1,…,n}{1,…,O(n)} – Ordering under hk gives random permutation of rows • One-pass Implementation – For each Ci and hk, keep slot for min-hash value – Initialize all slot(Ci,hk) to infinity – Scan rows in arbitrary order looking for 1’s • Suppose row Rj has 1 in column Ci • For each hk, – if hk(j) < slot(Ci,hk), then slot(Ci,hk)  hk(j)
  • 21. Example C1 C2 R1 1 0 R2 0 1 R3 1 1 R4 1 0 R5 0 1 h(x) = x mod 5 g(x) = 2x+1 mod 5 h(1) = 1 1 - g(1) = 3 3 - h(2) = 2 1 2 g(2) = 0 3 0 h(3) = 3 1 2 g(3) = 2 2 0 h(4) = 4 1 2 g(4) = 4 2 0 h(5) = 0 1 0 g(5) = 1 2 0 C1 slots C2 slots
  翻译: