Name Matching at Scale:
CPU, GPU or SPARK?
Wendell Kuling and Chris Broeren
ING Wholesale Banking Advanced Analytics Team
Chris Broeren,
Data Scientist
Wendell Kuling,
Data Scientist
Overview
• Introduction to problem
• Methods to solve problem
• Brute Force approach
• Metric tree approach
• Tokenised approach
• Current status
Introduction
Wholesale bank = dealing with companies
Interested in different data sets about companies
To join multiple data sets together, we need a common key: company name
However, one company may be known by several names:
McDonalds Corporation, McDonalds, McDonald’s Corp, etc.
Therefore we need to match approximately similar company names together
Introduction
Define an existing list of company names as the ground truth (G)
Aim: match new sets of names (S1, S2, S3, … ) with G:
Without loss of generality, let’s assume we’re going to match one set of names, S with G for this talk
G (Ground Truth) | S1 (Source 1)     | S2 (Source 2) | S3 (Source 3)
ABN Amro Bank    | ABN Amro N.V      | ABN Amro N.V  | ABN Amro N.V
RBS Bank         | RBS LLC           | RBS LLC       | RBS LLC
Rabobank         | Rabobank NV       | Rabobank N.V  | RABOBANK NV
JP Morgan        | JPM USA           | JPM USA       | JPM USA
ING Groep        | ING Groep N.V.    | ING Groep     | ING N.V.
ASN Bank         | ASN               | ASN           | ASN
Chase Bank       | Chase             | Chase         | Chase Bank
BINCK Bank       | BINCK N.V         | BINCK N.V     | BINCK N.V
HSBC Bank        | HSBC              | HSBC          | HSBC
Westpac Bank     | Westpac Australia | Westpac       | Westpac Aus
Goldman Sachs    | GS Global         | GS Global     | GS Global
Introduction
Many ways to look at problem:
• Approximate string match problem
• Nearest Neighbour Search problem
• Pattern matching
• etc…
We need to find the “closest” name in G to match to every name in S
Reality
In our first case:
• G has 12 million names
• S ranges in length between 3000 and 5 mln names
To make matters worse:
• On average, a name is 31 characters long, containing ~4 words
• The world isn’t UTF8 compliant, we have over 160 characters
• Although there are limited duplicates in G, some companies have similar
names and have hierarchical structures which must be observed
Brute Force Method
Define a function to measure word closeness:
The closer the names are to each other, the more similar they are
Calculate closeness for each word and choose the closest
Ensemble with different functions to get better results
Brute Force Method
There are many word similarity functions. An example is the Levenshtein distance.
Levenshtein distance calculates the minimum number of single-character edits
(substitutions, insertions or deletions) it takes to make two strings equal.
Example: levenshtein(“ABN Amro Bank”, “RBS Bank”)
• ABN Amro Bank -> RBN Amro Bank (replace A with R: 1 edit)
• RBN Amro Bank -> RBN Bank (remove “Amro” and one space: 5 edits)
• RBN Bank -> RBS Bank (replace N with S: 1 edit)
Therefore levenshtein(“ABN Amro Bank”, “RBS Bank”) = 1 + 5 + 1 = 7
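A minimal sketch of the Levenshtein distance using the standard dynamic-programming recurrence. The function name and layout are illustrative and not taken from the talk. The comparison is case-sensitive, and the space removed together with “Amro” counts as an edit, so the example pair comes out at 7.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # prev is the DP row for the previous prefix of a: prev[j] = distance to b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]


distance = levenshtein("ABN Amro Bank", "RBS Bank")
print(distance)
```

Each call costs O(mn) in the two string lengths, which is what makes the brute-force comparison of every pair so expensive.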
Brute Force Method
Each name in S is compared against every name in G:
• “ABN Amro Bank” vs {“ABN Amro N.V”, … , “GS Global”}
• “RBS Bank” vs {“ABN Amro N.V”, … , “GS Global”}
• …
• “Goldman Sachs” vs {“ABN Amro N.V”, … , “GS Global”}
(Figure: the lists S and G shown side by side)
Brute force method
• Problem: 12 million names in G, 5 million names in S
• This is 60,000,000,000,000 similarity calculations
• Levenshtein algorithm has time complexity of O(mn), where m, n are length
of strings.
• If we could calculate 10 similarity calculations a second…We would be
here for ~ 190,000 years
• Parallel: 10,000 cores … 19 years
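The arithmetic behind these figures can be checked in a few lines. This sketch only reuses the numbers on the slide (12 million names, 5 million names, 10 calculations per second per core); the variable names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

n_ground_truth = 12_000_000   # names in G
n_source = 5_000_000          # names in the largest S
comparisons = n_ground_truth * n_source

rate_per_core = 10            # similarity calculations per second, as assumed on the slide
years_single_core = comparisons / rate_per_core / SECONDS_PER_YEAR
years_10k_cores = years_single_core / 10_000

print(f"{comparisons:,} comparisons")
print(f"{years_single_core:,.0f} years on one core")
print(f"{years_10k_cores:.1f} years on 10,000 cores")
```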
Know which package to use for edit-based distances
Fuzzywuzzy: string matching like a boss… but for smaller sets only
Metric Tree Method
We can think of names as points in some metric space
We don’t necessarily need to know the absolute location of a word in that space, just the
relative distance between points
Therefore we still use a distance function (as per brute force), but define it so it
satisfies some mathematical properties:
1. d(x,y) >= 0, and d(x,y) = 0 <-> x = y
2. d(x,y) = d(y,x)
3. d(x,z) <= d(x,y) + d(y,z)
Such a function is known as a metric. We can save ourselves time by organising the words into a
tree structure that preserves metric distances between words
Metric Tree Method
Once we create this metric tree, we can query the nearest neighbour by
traversing the tree, blocking out “known far away words” - effectively
reducing the search space
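Figure: example metric tree rooted at “Book”; each edge is labelled with the distance to the parent word
Book
  1: Hook
    1: Cook
    2: Boek
  2: Bowl
    1: Bow
  4: Head
    1: Dead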
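One common metric tree with this shape is the BK-tree. The talk does not say which tree structure was used, so the sketch below is an assumption. It stores the example words under the Levenshtein distance and uses the triangle inequality to skip subtrees that cannot contain a match. The class and method names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class BKTree:
    """Metric tree: each child edge is labelled with its distance to the parent."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # a node is (word, {edge_distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already stored
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]  # descend along the edge with the same distance

    def query(self, word, radius):
        """Return all stored words within `radius` of `word`."""
        matches = []
        stack = [self.root] if self.root else []
        while stack:
            node_word, children = stack.pop()
            d = self.distance(word, node_word)
            if d <= radius:
                matches.append(node_word)
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - radius, d + radius] can contain a match.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return matches


tree = BKTree(levenshtein)
for w in ["Book", "Hook", "Bowl", "Head", "Cook", "Boek", "Bow", "Dead"]:
    tree.add(w)

near_book = sorted(tree.query("Book", radius=1))
print(near_book)
```

A query for “Book” with radius 1 never visits the “Bowl” and “Head” subtrees, which is the “blocking out known far away words” effect described above.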
Metric Tree Method
Building the tree is feasible with ~2.7 mln different words: O(n log(n))
Typically, all words with a distance of 1 are determined in ~1 sec
Build + query time is still years’ worth of calculation
• Added problem of building a tree in parallel
• Lots of space required
• Worst-case performance is actually bad
Tokenised Method
Break name up into components (tokenising)
Many different types of tokens available: words, grams
Do this for all names in both G and S (this creates two matrices [names x tokens])
Example: indicator-function word tokeniser:
              | ABN | RBS | BANK | Rabobank | NV
ABN Amro Bank |  1  |  0  |  1   |    0     |  0
RBS Bank      |  0  |  1  |  1   |    0     |  0
Rabobank NV   |  0  |  0  |  0   |    1     |  1
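A minimal sketch of the indicator tokeniser from the example. It assumes whitespace splitting and case-insensitive matching, and it takes the token columns as given on the slide (which omits “Amro”). The function name is illustrative.

```python
def indicator_matrix(names, vocabulary):
    """Return a [names x tokens] 0/1 matrix: 1 if the token occurs in the name."""
    columns = [token.lower() for token in vocabulary]
    rows = []
    for name in names:
        words = set(name.lower().split())
        rows.append([1 if token in words else 0 for token in columns])
    return rows


vocabulary = ["ABN", "RBS", "BANK", "Rabobank", "NV"]
names = ["ABN Amro Bank", "RBS Bank", "Rabobank NV"]
matrix = indicator_matrix(names, vocabulary)

for name, row in zip(names, matrix):
    print(f"{name:<15}", row)
```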
Tokenised Method
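• For a given token-vector length $d$:
• $G \in \mathbb{R}^{n_G \times d}$: matrix of names in G
• $S \in \mathbb{R}^{n_S \times d}$: matrix of names in S
• Dot product of $G$ and $S^T$ yields $C = G \cdot S^T$, with $C \in \mathbb{R}^{n_G \times n_S}$
• Row i, column j of $C$ corresponds to the inner product of the token vectors of the i-th name in
G and the j-th name in S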
Tokenised Method
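• Why the dot product?
• The elements of $G \cdot S^T$ look somewhat familiar to us:
$g_i \cdot s_j = \lVert g_i \rVert_2 \, \lVert s_j \rVert_2 \cos\theta_{ij}$, where $g_i$ and $s_j$ are the token vectors of the i-th name in G and the j-th name in S
• The elements are the cosine similarity of the individual name-token vectors
multiplied by their L2 norms
• If we normalise the token vectors on creation, we end up calculating the
cosine-similarity measure!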
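A small sketch of why normalising up front turns the dot product into the cosine similarity. The two indicator vectors and their four-token vocabulary are made up for illustration.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def normalise(v):
    """Scale v to unit L2 norm (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)


# Indicator vectors over the illustrative tokens [ABN, AMRO, BANK, NV]
g = [1, 1, 1, 0]   # "ABN Amro Bank"
s = [1, 1, 0, 1]   # "ABN Amro NV"

raw = dot(g, s)                            # |g| * |s| * cos(theta)
cosine = dot(normalise(g), normalise(s))   # cos(theta) directly

print(raw, round(cosine, 3))
```

Because the normalisation is done once per name when the matrices are built, the expensive pairwise step is reduced to plain inner products.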
Tokenised Method
• Same number of total comparisons as brute-force
• But inner-products are cheap to calculate
• Tokenised matrices can be computed offline cheaply
• Tokenised methods allow for vectorisation and allow for increased memory
and CPU efficiency
• We can even compute this on a GPU cluster
Preprocessing steps turn out relatively cheap (fast),
whereas the calculation is expensive
Preprocessing:
• Read data (Hive): <5 mins
• Clean data: <5 mins
• Build ‘G’ TFIDF matrix: <5 mins
• Build ‘S’ TFIDF matrix: <5 mins
Calculate: xxx hours
Things you would wish you knew before (1/4)…
Read data
(Hive)
Runs out of memory
(or use Python 3.x ;))
Clean data
Things you would wish you knew before (2/4)…
tokenize(‘McDonaldś’)
Build ‘G’ TFIDF
matrix
Things you would wish you knew before (3/4)…
Standard token_pattern (‘(?u)\b\w\w+\b’) ignores single letters:
‘Taxibedrijf M. van Seben’ -> [‘Taxibedrijf’, ‘van’, ‘Seben’]
Use token_pattern (‘(?u)\b\w+\b’) for ‘full’ tokenization
(token_pattern = u’(?u)\S’, ngram_range=(3, 3)) gives 3-gram matching
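The effect of the three patterns can be tried with Python's `re` module alone. This is only a sketch of the regular expressions: it does not reproduce scikit-learn's lowercasing or feature naming, and the 3-gram construction is an illustrative stand-in for `ngram_range=(3, 3)`.

```python
import re

name = "Taxibedrijf M. van Seben"

# Default pattern: tokens need at least two word characters, so "M" is dropped
default_tokens = re.findall(r"(?u)\b\w\w+\b", name)

# 'Full' pattern: single letters are kept
full_tokens = re.findall(r"(?u)\b\w+\b", name)

# Every non-whitespace character is a token; sliding windows of 3 give 3-grams
chars = re.findall(r"(?u)\S", name)
trigrams = ["".join(chars[i:i + 3]) for i in range(len(chars) - 2)]

print(default_tokens)
print(full_tokens)
print(trigrams[:4])
```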
Build ‘S’ TFIDF
matrix
Things you would wish you knew before (4/4)
Standard ‘transform’ function of the scikit-learn TfidfVectorizer ignores unseen tokens
-> either transform using a customized function, or tokenise on the combination of G and S
Otherwise: match(‘JonasTheMan Nederland’) -> 100% match with ‘Nederland Nederland’?
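A simplified sketch of the pitfall, using L2-normalised term counts instead of full TF-IDF weights (an assumption made to keep the example short). When the vocabulary is built from G only, the unseen token “JonasTheMan” is dropped and the remaining vector matches ‘Nederland Nederland’ perfectly. The helper names and the small ground-truth list are illustrative.

```python
import math
from collections import Counter


def vectorise(name, vocabulary):
    """L2-normalised term counts; tokens outside `vocabulary` are silently dropped."""
    counts = Counter(t for t in name.lower().split() if t in vocabulary)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(u, v):
    """Dot product of two sparse, already normalised vectors."""
    return sum(weight * v.get(token, 0.0) for token, weight in u.items())


ground_truth = ["Nederland Nederland", "ING Groep"]
query = "JonasTheMan Nederland"

# Vocabulary fitted on G only: the unseen token disappears from the query vector
vocab_g = {t for name in ground_truth for t in name.lower().split()}
score_g_only = cosine(vectorise(query, vocab_g), vectorise(ground_truth[0], vocab_g))

# Vocabulary fitted on G and S together: the unseen token lowers the score
vocab_both = vocab_g | set(query.lower().split())
score_combined = cosine(vectorise(query, vocab_both),
                        vectorise(ground_truth[0], vocab_both))

print(round(score_g_only, 3), round(score_combined, 3))
```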
Calculation of cosine similarity:
matrix multiplication using Numpy/Scipy
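Using Numpy and Scipy: fast matrix multiplication of sparse matrices. Suggested format: CSR.
$$
\underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & .7 & 0 & 0 & .7 \\ 0 & 0 & .6 & .6 & .6 \end{pmatrix}}_{G:\ \#\,\text{company names} \times \#\,\text{tokens}}
\times
\underbrace{\begin{pmatrix} .7 \\ 0 \\ 0 \\ 0 \\ .7 \end{pmatrix}}_{S^T:\ \#\,\text{tokens (transposed)}}
=
\begin{pmatrix} .7 \\ .49 \\ .42 \end{pmatrix}
$$
Argmax = best match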
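To show what the CSR format stores, here is a hand-rolled sketch of the multiplication above in plain Python. In practice the talk relies on Scipy's sparse matrices; the `data` / `indices` / `indptr` arrays below follow the usual CSR layout, and the function name is illustrative.

```python
# G from the example (3 names x 5 tokens) in CSR form: only non-zeros are stored.
data = [1.0, 0.7, 0.7, 0.6, 0.6, 0.6]   # non-zero values, row by row
indices = [0, 1, 4, 2, 3, 4]             # column index of each value
indptr = [0, 1, 3, 6]                    # row i occupies data[indptr[i]:indptr[i + 1]]

s = [0.7, 0.0, 0.0, 0.0, 0.7]            # one name from S as a dense token vector


def csr_matvec(data, indices, indptr, vector):
    """Multiply a CSR matrix by a dense vector, touching only the non-zeros."""
    result = []
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        result.append(sum(data[k] * vector[indices[k]] for k in range(start, end)))
    return result


scores = csr_matvec(data, indices, indptr, s)
best_match = max(range(len(scores)), key=scores.__getitem__)

print([round(x, 2) for x in scores], best_match)
```

The work per row is proportional to the number of non-zero tokens in that name, not to the number of token columns.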
Calculate
Look at 0.01% of the ‘G’ matrix:
what do you notice?
Input:
• Sparsity: ~0.0001% non-zero entries
(~3 tokens per 2.6 mln columns)
• Storage required: ~2 GB
Output:
• Sparsity: ~0.5% non-zero entries
• Storage required: ~10 TB
Depending on resolution, distance and eye-sight:
white dots can be seen for non-zero entries
Introducing the three contestants for the
calculation part…
• Cruncher: 48 cores, 512 GB RAM
• Tesla: GPUs: 3x2496 threads, 3x12 GB
• Spark cluster: 150 cores, 2.5 TB of memory
Numpy matrix multiplication:
first ~100 extra slices are cheap
Scipy/Numpy sparse matrix multiplication:
most expensive and highly-optimized function
Effectively using 1 core, 100 rows / iteration: ~140 matches per second
(additional memory usage: ~1 GB)
Tesla - GPU multiplication:
PyCuda is flexible, but requires deep C++ knowledge
Current custom kernel works with
Sparse Matrix x Dense Vector
(slice = 1)
Didn’t distribute the data across the GPU up-front
Using single GPU at the moment
…so, in short, further optimizations are possible!
Using 1 GPU, slice of 1 and Sparse x Dense multiplication:
~50 matches per second
Spark cluster: broadcast both sparse matrices,
use RDD with just the row-indices to work on
Step 1: driver pushes matrices G and S to the workers
(broadcast variable): every worker node holds G and S.T
Step 2: driver distributes an RDD with ‘chunks’
of row-indices: map ‘multiply & argmax’
• Worker node: work on rows 0-9, return argmax(G.dot(S.T)) for 0-9
• Worker node: work on rows 10-19, return argmax(G.dot(S.T)) for 10-19
• etc.
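The chunking pattern can be sketched without a cluster. The code below runs locally on small dense lists. It assumes that the row-indices refer to names in S and that the best match is the highest-scoring row of G; both are interpretations of the slide, and the helper names are hypothetical. The comment at the end shows roughly how the same function could be mapped over an RDD with PySpark broadcast variables.

```python
def chunks(n_rows, size):
    """Yield (start, stop) row-index ranges covering 0..n_rows."""
    for start in range(0, n_rows, size):
        yield start, min(start + size, n_rows)


def multiply_and_argmax(row_range, G, S):
    """For each row of S in the range, return (row index, index of best match in G)."""
    start, stop = row_range
    out = []
    for i in range(start, stop):
        scores = [sum(g * s for g, s in zip(g_row, S[i])) for g_row in G]
        out.append((i, max(range(len(G)), key=scores.__getitem__)))
    return out


# Tiny made-up matrices: 3 ground-truth names, 4 names to match, 3 tokens
G = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
S = [[0.0, 0.9, 0.1], [0.8, 0.0, 0.2], [0.1, 0.1, 0.9], [0.0, 1.0, 0.0]]

matches = [m for r in chunks(len(S), 2) for m in multiply_and_argmax(r, G, S)]
print(matches)

# With PySpark this would roughly become (not run here, sc is a SparkContext):
#   g_b, s_b = sc.broadcast(G), sc.broadcast(S)
#   sc.parallelize(list(chunks(len(S), 2))).flatMap(
#       lambda r: multiply_and_argmax(r, g_b.value, s_b.value)).collect()
```

Only the small index ranges travel with each task; the matrices are shipped to every worker once.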
Using standard TFIDF implementation from Spark MLlib:
vector by vector multiplication (scalable, but slow) + hashing
Spark cluster: scales with only small modifications
to original Python code
612,630 matches in 12 containers, 12 cores/container, chunks
of 20 rows in ~5 min: 2000 matches / sec
Concluding for name-matching using Python
Ad

More Related Content

What's hot (20)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Minh Pham
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
Databricks
 
Attention Is All You Need
Attention Is All You NeedAttention Is All You Need
Attention Is All You Need
SEMINARGROOT
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
Ding Li
 
Transformers in 2021
Transformers in 2021Transformers in 2021
Transformers in 2021
Grigory Sapunov
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentation
bhavesh_physics
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Deep Learning Italia
 
Transformer Zoo
Transformer ZooTransformer Zoo
Transformer Zoo
Grigory Sapunov
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
Nuwan Sriyantha Bandara
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Models
taeseon ryu
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMs
SylvainGugger
 
An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
Julien SIMON
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
Roelof Pieters
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Fine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsFine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP models
OVHcloud
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
Daiki Tanaka
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
shaurya uppal
 
Bert pre_training_of_deep_bidirectional_transformers_for_language_understanding
Bert  pre_training_of_deep_bidirectional_transformers_for_language_understandingBert  pre_training_of_deep_bidirectional_transformers_for_language_understanding
Bert pre_training_of_deep_bidirectional_transformers_for_language_understanding
ThyrixYang1
 
Knowledge Graphs and Generative AI
Knowledge Graphs and Generative AIKnowledge Graphs and Generative AI
Knowledge Graphs and Generative AI
Neo4j
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...
HostedbyConfluent
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Minh Pham
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
Databricks
 
Attention Is All You Need
Attention Is All You NeedAttention Is All You Need
Attention Is All You Need
SEMINARGROOT
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
Ding Li
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentation
bhavesh_physics
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Deep Learning Italia
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Models
taeseon ryu
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMs
SylvainGugger
 
An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
Julien SIMON
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
Roelof Pieters
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Fine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsFine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP models
OVHcloud
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
Daiki Tanaka
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
shaurya uppal
 
Bert pre_training_of_deep_bidirectional_transformers_for_language_understanding
Bert  pre_training_of_deep_bidirectional_transformers_for_language_understandingBert  pre_training_of_deep_bidirectional_transformers_for_language_understanding
Bert pre_training_of_deep_bidirectional_transformers_for_language_understanding
ThyrixYang1
 
Knowledge Graphs and Generative AI
Knowledge Graphs and Generative AIKnowledge Graphs and Generative AI
Knowledge Graphs and Generative AI
Neo4j
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...
HostedbyConfluent
 

Viewers also liked (20)

Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...
Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...
Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...
NoSQLmatters
 
Real time data driven applications (SQL vs NoSQL databases)
Real time data driven applications (SQL vs NoSQL databases)Real time data driven applications (SQL vs NoSQL databases)
Real time data driven applications (SQL vs NoSQL databases)
GoDataDriven
 
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksSpark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
GoDataDriven
 
Accelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUsAccelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUs
IBM
 
GPU Ecosystem
GPU EcosystemGPU Ecosystem
GPU Ecosystem
Ofer Rosenberg
 
GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014
StampedeCon
 
Computational Techniques for the Statistical Analysis of Big Data in R
Computational Techniques for the Statistical Analysis of Big Data in RComputational Techniques for the Statistical Analysis of Big Data in R
Computational Techniques for the Statistical Analysis of Big Data in R
herbps10
 
GTC 2012: GPU-Accelerated Path Rendering
GTC 2012: GPU-Accelerated Path RenderingGTC 2012: GPU-Accelerated Path Rendering
GTC 2012: GPU-Accelerated Path Rendering
Mark Kilgard
 
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
SIGGRAPH 2012: GPU-Accelerated 2D and Web RenderingSIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
Mark Kilgard
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
Kohei KaiGai
 
Deep learning on spark
Deep learning on sparkDeep learning on spark
Deep learning on spark
Satyendra Rana
 
Enabling Graph Analytics at Scale: The Opportunity for GPU-Acceleration of D...
Enabling Graph Analytics at Scale:  The Opportunity for GPU-Acceleration of D...Enabling Graph Analytics at Scale:  The Opportunity for GPU-Acceleration of D...
Enabling Graph Analytics at Scale: The Opportunity for GPU-Acceleration of D...
odsc
 
Heterogeneous System Architecture Overview
Heterogeneous System Architecture OverviewHeterogeneous System Architecture Overview
Heterogeneous System Architecture Overview
inside-BigData.com
 
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
Kohei KaiGai
 
Deep Learning on Hadoop
Deep Learning on HadoopDeep Learning on Hadoop
Deep Learning on Hadoop
DataWorks Summit
 
Hadoop + GPU
Hadoop + GPUHadoop + GPU
Hadoop + GPU
Vladimir Starostenkov
 
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
Spark Summit
 
DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François Garillot
sparktc
 
Tree distance algorithm
Tree distance algorithmTree distance algorithm
Tree distance algorithm
Trector Rancor
 
How to Solve Real-Time Data Problems
How to Solve Real-Time Data ProblemsHow to Solve Real-Time Data Problems
How to Solve Real-Time Data Problems
IBM Power Systems
 
Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...
Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...
Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...
NoSQLmatters
 
Real time data driven applications (SQL vs NoSQL databases)
Real time data driven applications (SQL vs NoSQL databases)Real time data driven applications (SQL vs NoSQL databases)
Real time data driven applications (SQL vs NoSQL databases)
GoDataDriven
 
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksSpark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
GoDataDriven
 
Accelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUsAccelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUs
IBM
 
GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014
StampedeCon
 
Computational Techniques for the Statistical Analysis of Big Data in R
Computational Techniques for the Statistical Analysis of Big Data in RComputational Techniques for the Statistical Analysis of Big Data in R
Computational Techniques for the Statistical Analysis of Big Data in R
herbps10
 
GTC 2012: GPU-Accelerated Path Rendering
GTC 2012: GPU-Accelerated Path RenderingGTC 2012: GPU-Accelerated Path Rendering
GTC 2012: GPU-Accelerated Path Rendering
Mark Kilgard
 
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
SIGGRAPH 2012: GPU-Accelerated 2D and Web RenderingSIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
Mark Kilgard
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
Kohei KaiGai
 
Deep learning on spark
Deep learning on sparkDeep learning on spark
Deep learning on spark
Satyendra Rana
 
Enabling Graph Analytics at Scale: The Opportunity for GPU-Acceleration of D...
Enabling Graph Analytics at Scale:  The Opportunity for GPU-Acceleration of D...Enabling Graph Analytics at Scale:  The Opportunity for GPU-Acceleration of D...
Enabling Graph Analytics at Scale: The Opportunity for GPU-Acceleration of D...
odsc
 
Heterogeneous System Architecture Overview
Heterogeneous System Architecture OverviewHeterogeneous System Architecture Overview
Heterogeneous System Architecture Overview
inside-BigData.com
 
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
Kohei KaiGai
 
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
Spark Summit
 
DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François Garillot
sparktc
 
Tree distance algorithm
Tree distance algorithmTree distance algorithm
Tree distance algorithm
Trector Rancor
 
How to Solve Real-Time Data Problems
How to Solve Real-Time Data ProblemsHow to Solve Real-Time Data Problems
How to Solve Real-Time Data Problems
IBM Power Systems
 
Ad

Similar to PyData Amsterdam - Name Matching at Scale (20)

Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
Mike Acton
 
Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015
Big Data Spain
 
14 query processing-sorting
14 query processing-sorting14 query processing-sorting
14 query processing-sorting
rameswara reddy venkat
 
Constrained text generation to measure reading performance: A new approach ba...
Constrained text generation to measure reading performance: A new approach ba...Constrained text generation to measure reading performance: A new approach ba...
Constrained text generation to measure reading performance: A new approach ba...
Förderverein Technische Fakultät
 
modeling.ppt
modeling.pptmodeling.ppt
modeling.ppt
ssuser1d6968
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
c.titus.brown
 
Query Optimization - Brandon Latronica
Query Optimization - Brandon LatronicaQuery Optimization - Brandon Latronica
Query Optimization - Brandon Latronica
"FENG "GEORGE"" YU
 
bigD3_mapReducebigD3_mapReducebigD3_mapReduce.pdf
bigD3_mapReducebigD3_mapReducebigD3_mapReduce.pdfbigD3_mapReducebigD3_mapReducebigD3_mapReduce.pdf
bigD3_mapReducebigD3_mapReducebigD3_mapReduce.pdf
sh5701
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
NUS-ISS
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)
Jinpyo Lee
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
Databricks
 
MongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: ShardingMongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: Sharding
MongoDB
 
GBM package in r
GBM package in rGBM package in r
GBM package in r
mark_landry
 
Overview of Genetic Algorithms in Computer Science
Overview of Genetic Algorithms in Computer ScienceOverview of Genetic Algorithms in Computer Science
Overview of Genetic Algorithms in Computer Science
ArjunPola1
 
bigD2_relatiobigD3_mapReducenalDatabases.pdf
bigD2_relatiobigD3_mapReducenalDatabases.pdfbigD2_relatiobigD3_mapReducenalDatabases.pdf
bigD2_relatiobigD3_mapReducenalDatabases.pdf
sh5701
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
PyData
 
5_RNN_LSTM.pdf
5_RNN_LSTM.pdf5_RNN_LSTM.pdf
5_RNN_LSTM.pdf
FEG
 
design mapping lecture6-mapreducealgorithmdesign.ppt
design mapping lecture6-mapreducealgorithmdesign.pptdesign mapping lecture6-mapreducealgorithmdesign.ppt
design mapping lecture6-mapreducealgorithmdesign.ppt
turningpointinnospac
 
Cloud flare jgc bigo meetup rolling hashes
Cloud flare jgc   bigo meetup rolling hashesCloud flare jgc   bigo meetup rolling hashes
Cloud flare jgc bigo meetup rolling hashes
Cloudflare
 
large_scale_search.pdf
large_scale_search.pdflarge_scale_search.pdf
large_scale_search.pdf
Emerald72
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
Mike Acton
 
Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015Building graphs to discover information by David Martínez at Big Data Spain 2015
PyData Amsterdam - Name Matching at Scale

  • 1. Name Matching at Scale: CPU, GPU or SPARK? Wendell Kuling and Chris Broeren ING Wholesale Banking Advanced Analytics Team
  • 2. Chris Broeren, Data Scientist Wendell Kuling, Data Scientist
  • 3. Overview • Introduction to problem • Methods to solve problem • Brute Force approach • Metric tree approach • Tokenised approach • Current status
  • 4. Introduction Wholesale bank = dealing with companies. Interested in different data sets about companies. To join multiple data sets together, we need a common key: company name. However, one company may be known by different names: McDonalds Corporation, McDonalds, McDonald’s Corp, etc… Therefore we need to match approximately similar company names together
  • 5. Introduction Define an existing list of company names as the ground truth (G). Aim: match new sets of names (S1, S2, S3, …) with G. Without loss of generality, let’s assume we’re going to match one set of names, S, with G for this talk. Ground Truth (G): ABN Amro Bank, RBS Bank, Rabobank, JP Morgan, ING Groep, ASN Bank, Chase Bank, BINCK Bank, HSBC Bank, Westpac Bank, Goldman Sachs. Source 1 (S1): ABN Amro N.V, RBS LLC, Rabobank NV, JPM USA, ING Groep N.V., ASN, Chase, BINCK N.V, HSBC, Westpac Australia, GS Global. Source 2 (S2): ABN Amro N.V, RBS LLC, Rabobank N.V, JPM USA, ING Groep, ASN, Chase, BINCK N.V, HSBC, Westpac, GS Global. Source 3 (S3): ABN Amro N.V, RBS LLC, RABOBANK NV, JPM USA, ING N.V., ASN, Chase Bank, BINCK N.V, HSBC, Westpac Aus, GS Global
  • 6. Introduction Many ways to look at problem: • Approximate string match problem • Nearest Neighbour Search problem • Pattern matching • etc… We need to find the “closest” name in G to match to every name in S
  • 7. Reality In our first case: • G has 12 million names • S ranges in length between 3000 and 5 mln names To make matters worse: • On average, a name is 31 characters long, containing ~4 words • The world isn’t UTF8 compliant, we have over 160 characters • Although there are limited duplicates in G, some companies have similar names and have hierarchical structures which must be observed
  • 8. Overview • Introduction to problem • Methods to solve problem • Brute Force approach • Metric tree approach • Tokenised approach • Current status
  • 9. Brute Force Method Define a function to measure word closeness: the closer the names are to each other, the more similar they are. Calculate closeness for each word and choose the closest. Ensemble with different functions to get better results
  • 10. Brute Force Method There are many word similarity functions. An example is the Levenshtein distance, which calculates the minimum number of single-character edits (substituting, inserting or deleting) it takes to make two strings equal. Example: levenshtein(“ABN Amro Bank”, “RBS Bank”) • ABN Amro Bank —> RBN Amro Bank (replace A with R) • RBN Amro Bank —> RBN Bank (remove “Amro” and one space: 5 deletions) • RBN Bank —> RBS Bank (replace N with S) Therefore levenshtein(“ABN Amro Bank”, “RBS Bank”) = 1 + 5 + 1 = 7
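The edit-distance definition on this slide can be sketched with the standard dynamic-programming recurrence. This is a minimal pure-Python illustration, not the package the speakers used; the function name `levenshtein` simply mirrors the slide.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # prev[j] = distance between the first i-1 characters of a and the first j of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


# 7: two substitutions plus five deletions ("Amro" and one space)
print(levenshtein("ABN Amro Bank", "RBS Bank"))
```

The nested loops make the O(mn) cost per pair visible: every character of one name is compared with every character of the other.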
  • 11. Brute Force Method • “ABN Amro Bank” vs {“ABN Amro N.V”, … , “GS Global”} [Figure: “ABN Amro Bank” in G is compared against every name in S; the G and S name lists are those of slide 5]
  • 12. Brute Force Method • “RBS Bank” vs {“ABN Amro N.V”, … , “GS Global”} [Figure: “RBS Bank” in G is compared against every name in S; the G and S name lists are those of slide 5]
  • 13. Brute Force Method • “Goldman Sachs” vs {“ABN Amro N.V”, … , “GS Global”} [Figure: “Goldman Sachs” in G is compared against every name in S; the G and S name lists are those of slide 5]
  • 14. Brute force method • Problem: 12 million names in G, 5 million names in S • This is 60,000,000,000,000 similarity calculations • Levenshtein algorithm has time complexity of O(mn), where m and n are the lengths of the two strings • If we could do 10 similarity calculations a second… we would be here for ~190,000 years • Parallel: 10,000 cores … 19 years
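The back-of-the-envelope numbers on this slide can be reproduced in a few lines. The helper name `brute_force_years` and the 365.25-day year are assumptions made for this sketch; the sizes and rates are the ones quoted on the slide.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: average calendar year


def brute_force_years(n_truth, n_source, comparisons_per_second, cores=1):
    """Years needed to compare every name in G with every name in S."""
    total_comparisons = n_truth * n_source
    seconds = total_comparisons / (comparisons_per_second * cores)
    return seconds / SECONDS_PER_YEAR


total = 12_000_000 * 5_000_000                      # 60,000,000,000,000 pairs
single_core = brute_force_years(12_000_000, 5_000_000, 10)
ten_thousand_cores = brute_force_years(12_000_000, 5_000_000, 10, cores=10_000)
print(f"{total:,} comparisons")
print(f"{single_core:,.0f} years on one core, {ten_thousand_cores:.1f} years on 10,000 cores")
```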
  • 15. Know which package to use for edit-based distances
  • 16. Fuzzywuzzy: string matching like a boss… but for smaller sets only
  • 17. Overview • Introduction to problem • Methods to solve problem • Brute Force approach • Metric tree approach • Tokenised approach • Current status
  • 18. Metric Tree Method We can think of names as points in some topological space. We don’t necessarily need to know the absolute location of a word in that space, just the relative distance between points. Therefore we still use a distance function (as per brute force), but define it so that it satisfies some mathematical properties: 1. d(x,y) = 0 <—> x = y 2. d(x,y) = d(y,x) 3. d(x,z) <= d(x,y) + d(y,z) Such a function is known as a metric. We can save ourselves time by organising the words into a tree structure that preserves metric distances between words
  • 19. Metric Tree Method Once we create this metric tree, we can query the nearest neighbour by traversing the tree, blocking out “known far away words” - effectively reducing the search space [Figure: metric tree over the words Book, Bowl, Hook, Head, Cook, Boek, Bow and Dead, with edges labelled by distances between 1 and 4]
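One common metric tree for edit distances is the BK-tree. The slides do not say which variant was used, so the following is a sketch under that assumption; the class name `BKTree` and its methods are hypothetical, and the words are the ones from the slide's figure.

```python
def levenshtein(a, b):
    """Edit distance; it satisfies the three metric properties listed on slide 18."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class BKTree:
    """Each node is (word, children); children maps an edge distance to a child node."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already in the tree
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def query(self, word, max_dist):
        """Return sorted (distance, word) pairs within max_dist of the query."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_dist:
                results.append((d, candidate))
            # Triangle inequality: matches can only sit under edges in [d - max_dist, d + max_dist]
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)


words = ["Book", "Bowl", "Hook", "Head", "Cook", "Boek", "Bow", "Dead"]
tree = BKTree(levenshtein)
for w in words:
    tree.add(w)
print(tree.query("Boot", 1))  # [(1, 'Book')]
```

The pruning step is where the "blocking out" of far-away words happens: whole subtrees are skipped without computing any distance to their members.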
  • 20. Metric Tree Method Building the tree is quite feasible with ~2.7 mln different words: O(n log(n)). Typically, all words within a distance of 1 are found in ~1 sec. Build + query time is still years’ worth of calculation • Added problem of building a tree in parallel • Lots of space required • Worst-case performance is actually bad
  • 21. Overview • Introduction to problem • Methods to solve problem • Brute Force approach • Metric tree approach • Tokenised approach • Current status
  • 22. Tokenised Method Break name up into components (tokenising). Many different types of tokens available: words, grams. Do this for all names in both G and S (this creates two matrices [names x tokens]). Example: indicator function word tokeniser, with token columns (ABN, RBS, BANK, Rabobank, NV): ABN Amro Bank —> (1, 0, 1, 0, 0); RBS Bank —> (0, 1, 1, 0, 0); Rabobank NV —> (0, 0, 0, 1, 1)
  • 23. Tokenised Method • For given token length d: • $G$: matrix of names in G [names x d] • $S$: matrix of names in S [names x d] • Dot product of $G$ and $S^{\top}$ yields $G S^{\top}$ • Row i, column j of $G S^{\top}$ corresponds to the inner product of the tokens of the i-th word in G and the j-th word in S
  • 24. Tokenised Method • Why the dot product? • The elements of $G S^{\top}$ look somewhat familiar to us: • elements are the cosine similarity of the individual name-token vectors multiplied by their L2 norms • If we normalise the token-vector on creation we end up calculating the cosine-similarity measure!
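The point about normalising on creation can be sketched without any libraries. This is a minimal illustration that assumes whitespace word tokens and plain token counts (no TFIDF weighting); `token_vector`, `cosine` and `best_match` are hypothetical helper names.

```python
import math
from collections import Counter


def token_vector(name):
    """Sparse, L2-normalised word-count vector stored as {token: weight}."""
    counts = Counter(name.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # empty name: no tokens to compare
    return {tok: c / norm for tok, c in counts.items()}


def cosine(u, v):
    """Inner product of two normalised vectors, which equals their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(weight * v.get(tok, 0.0) for tok, weight in u.items())


def best_match(name, truth_vectors):
    """Name in the ground truth with the highest cosine similarity to `name`."""
    query = token_vector(name)
    return max(truth_vectors, key=lambda g: cosine(query, truth_vectors[g]))


ground_truth = ["ABN Amro Bank", "RBS Bank", "Rabobank"]
truth_vectors = {g: token_vector(g) for g in ground_truth}
print(best_match("ABN Amro N.V", truth_vectors))  # ABN Amro Bank
print(best_match("RBS LLC", truth_vectors))       # RBS Bank
```

Because the vectors are normalised once, each comparison is a plain inner product, which is what allows the whole job to be written as one sparse matrix multiplication.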
  • 25. Tokenised Method • Same number of total comparisons as brute-force • But inner-products are cheap to calculate • Tokenised matrices can be computed offline cheaply • Tokenised methods allow for vectorisation and allow for increased memory and CPU efficiency • We can even compute this on a GPU cluster
  • 26. Overview • Introduction to problem • Methods to solve problem • Brute Force approach • Metric tree approach • Tokenised approach • Current status
  • 27. Preprocessing steps turn out to be relatively cheap (fast), whereas the calculation is expensive. Preprocessing: Read data (Hive) <5 mins —> Clean data <5 mins —> Build ‘G’ TFIDF matrix <5 mins —> Build ‘S’ TFIDF matrix <5 mins. Calculate: xxx hours
  • 28. Things you would wish you knew before (1/4)… Read data (Hive) Runs out of memory
  • 29. Things you would wish you knew before (2/4)… Clean data: tokenize(‘McDonaldś’) (or use Python 3.x ;))
  • 30. Things you would wish you knew before (3/4)… Build ‘G’ TFIDF matrix: Standard token_pattern (‘(?u)\b\w\w+\b’) ignores single letters: ‘Taxibedrijf M. van Seben’ —> [‘Taxibedrijf’, ‘van’, ‘Seben’]. Use token_pattern (‘(?u)\b\w+\b’) for ‘full’ tokenization. (token_pattern = u’(?u)\S', ngram_range=(3, 3)) gives 3-gram matching
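The effect of the two token patterns can be checked with the standard `re` module alone. scikit-learn's documented default `token_pattern` is `r"(?u)\b\w\w+\b"`; the rest of this sketch is an illustration that does not call scikit-learn, and `char_3grams` is a hypothetical helper that only approximates what the `\S` pattern with `ngram_range=(3, 3)` produces.

```python
import re

name = "Taxibedrijf M. van Seben"

# Default pattern: tokens need at least two word characters, so "M" is dropped
default_tokens = re.findall(r"(?u)\b\w\w+\b", name)

# 'Full' tokenisation: single-letter tokens are kept
full_tokens = re.findall(r"(?u)\b\w+\b", name)


def char_3grams(text):
    """Sliding windows of three consecutive non-whitespace characters."""
    chars = re.findall(r"(?u)\S", text)
    return ["".join(chars[i:i + 3]) for i in range(len(chars) - 2)]


print(default_tokens)        # ['Taxibedrijf', 'van', 'Seben']
print(full_tokens)           # ['Taxibedrijf', 'M', 'van', 'Seben']
print(char_3grams("Seben"))  # ['Seb', 'ebe', 'ben']
```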
  • 31. Things you would wish you knew before (4/4) Build ‘S’ TFIDF matrix: Standard ‘transform’ function of Sklearn TFIDFVectorizer ignores unseen tokens —> either transform using a customized function, or tokenise on the combination of G and S. match(‘JonasTheMan Nederland’) —> 100% match ‘Nederland Nederland’ ?
  • 32. Calculate: Calculation of cosine similarity: matrix multiplication using Numpy/Scipy. Using Numpy and Scipy, fast matrix multiplication of sparse matrices. Suggested format: CSR. [Figure: G (# company names x # tokens) multiplied by S.Transpose (# tokens x # company names) gives the similarity scores, e.g. .7, .49, .42; argmax = best match]
  • 33. Look at 0.01% of the ‘G’ matrix: what do you notice? Input: Sparsity: ~0.0001% (~3 tokens per 2.6 mln columns) Storage required: ~2 GB Output: Sparsity: ~0.5% Storage required: ~10 TB Depending on resolution, distance and eye-sight: white dots can be seen for non-zero entries
  • 34. Introducing the three contestants for the calculation part… Cruncher: 48 cores, 512 GB RAM. Tesla: GPUs: 3x2496 threads, 3x12 GB. Spark cluster: 150 cores, 2.5 TB of memory
  • 35. Numpy matrix multiplication: first ~100 extra slices are cheap
  • 36. Scipy/Numpy sparse matrix multiplication: most expensive and highly-optimized function Effectively using 1 core, 100 rows / iteration: ~140 matches per second (additional memory usage: ~1 GB)
  • 37. Tesla - GPU multiplication: PyCuda is flexible, but requires deep C++ knowledge • Current custom kernel works with Sparse Matrix x Dense Vector (slice = 1) • Didn’t distribute the data across the GPU up-front • Using single GPU at the moment …so, in short, further optimizations are possible! Using 1 GPU, slice of 1 and Sparse x Dense multiplication: ~50 matches per second
  • 38. Spark cluster: broadcast both sparse matrices, use RDD with just the row-indices to work on. Step 1: push matrix G and S to workers (broadcast variable): the driver broadcasts G and S.T to every worker node. Step 2: distribute RDD with ‘chunks’ of row-indices and map ‘multiply & argmax’: one worker node works on rows 0-9 and returns argmax(G.dot(S.T)) for 0-9, another works on rows 10-19 and returns argmax(G.dot(S.T)) for 10-19, etc. Using standard TFIDF implementation from Spark MLLib: vector by vector multiplication (scaleable, but slow) + hashing
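The "chunks of row-indices" idea can be sketched on a single machine with plain lists. The toy matrices and the helper names `chunks` and `multiply_and_argmax` are assumptions for illustration; on a cluster, G and S would be broadcast once and the chunks mapped across workers, whereas here they are simply processed one after another.

```python
def chunks(n_rows, size):
    """Yield consecutive ranges of row indices, e.g. rows 0-9, 10-19, ..."""
    for start in range(0, n_rows, size):
        yield range(start, min(start + size, n_rows))


def multiply_and_argmax(rows, G, S):
    """For each row i of G in the chunk, return (i, j) where S[j] gives the largest dot product."""
    matches = []
    for i in rows:
        scores = [sum(g * s for g, s in zip(G[i], s_row)) for s_row in S]
        matches.append((i, max(range(len(S)), key=scores.__getitem__)))
    return matches


# Toy, hypothetical token matrices: 3 names in G, 2 names in S, 3 tokens
G = [[0.7, 0.7, 0.0], [0.0, 0.7, 0.7], [1.0, 0.0, 0.0]]
S = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

# Each chunk only carries row indices; the matrices themselves are shared
matches = [pair for rows in chunks(len(G), 2) for pair in multiply_and_argmax(rows, G, S)]
print(matches)  # [(0, 1), (1, 0), (2, 1)]
```

Shipping only row indices keeps the distributed dataset tiny, so the cost of moving data is paid once at broadcast time rather than per task.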
  • 39. Spark cluster: scales with only small modifications to original Python code 612,630 matches in 12 containers, 12 cores/container, chunks of 20 rows in ~5 min: 2000 matches / sec