SlideShare a Scribd company logo
Mining High-Speed Data Streams
Davide Gallitelli
Politecnico di Torino – TELECOM ParisTech
@DGallitelli95
Mining High-Speed Data Streams 1
Pedro Domingos
University of Washington
Geoff Hulten
University of Washington
1. Introduction 2
Huge and Fast data streaming
1. Introduction 3
KDD systems
operating
continuously
and indefinitely
Limited by:
• Time
• Memory
• Sample Size
SPRINT
Tested on up to
a few million
examples.
Less than a
day’s worth!
41. Introduction
VERY
FAST
DECISION
TREE
Hoeffding Decision Tree
2. Hoeffding Trees 5
2. Hoeffding Trees 6
 Classical DT learners are limited by main memory size
 Probably, not all examples are needed to find the best attribute at a node
 How to decide how many are necessary? Hoeffding Bound!
«Suppose we have made 𝑛 independent observations of a variable 𝑟 with
domain 𝑅, and computed their mean 𝑟. The Hoeffding bound states that,
with probability 1 − 𝛿, the true mean of the variable is at least 𝑟 − 𝜖»
2. Hoeffding Trees 7
How many examples are enough?
• Let 𝐺 𝑋𝑖 be the heuristic measure of choice (Information Gain, Gini Index)
• 𝑋 𝑎 : the attribute with the highest attribute evaluation value after n examples
• 𝑋 𝑏 : the attribute with the second highest split evaluation function value after n
examples
• We can compute
∆ 𝐺 = 𝐺 𝑋 𝑎 − 𝐺 𝑋 𝑏 > 𝜖
• Thanks to Hoeffding Bound, we can infer that:
• ∆𝐺 ≥ ∆ 𝐺 − 𝜖 > 0 with probability 1 − 𝛿, where ∆𝐺 is the true difference in
heuristic measure
• This means that we can split the tree using 𝑋 𝑎, and the succeeding examples
will be passed to the new leaves (incremental approach)
82. Hoeffding Trees
• Compute the heuristic measure
for the attributes and determine
the best two attributes
• At each node chack for the
condition
∆ 𝐺 = 𝐺 𝑋 𝑎 − 𝐺 𝑋 𝑏 > 𝜖
• If true, create child nodes based
on the test at the node; else, get
more examples from stream.
HT Algorithm
2. Hoeffding Trees 9
In a nutshell
• Learning in Hoeffding tree is constant time per example (instance) and
this means Hoeffding tree is suitable for data stream mining.
• Requires each example to be read at most once (incrementally built).
• With high probability, a Hoeffding tree is asymptotically identical to the
decision tree built by a batch learner.
𝐸 ∆𝑖 𝐻𝑇𝛿, 𝐷𝑇∗ ≤
𝛿
𝑝
• Independent of the probability
distribution generating the observations
• Built incrementally by sequential reading
• Make class predictions in parallel
• What happens with ties?
• Memory used with tree expansion
• Number of candidate attributes
goo.gl/gBnm9h
goo.gl/QvZMC7
VFDT
3. VFDT System 10
113. VFDT System
VFDT (Very Fast Decision Tree)
• Hoeffding tree algorithm implementation is VFDT
• VFDT includes refinements to the HT algorithm:
• Tie-braking algorithm
• Recompute G after a user-defined #examples
• Deactivation of inactive leaves
• Drop of unpromising early attributes (if ∆𝐺 > 𝜖)
• Bootstrap with traditional learner on a small
subset of data
• Rescan of previously-seen examples
123. VFDT System
Comparison with C4.5
𝛿 = 10−7
𝜏 = 5%
𝑛 𝑚𝑖𝑛 = 200
134. Application
A VFDT application : Web Data
• Mining the stream of Web page requests emanating
from the whole University of Washington main
campus.
• Useful to improve Web Caching, by predicting which
hosts and pages will be requested in the near future.
145. Conclusion
Future Work
• Test other applications (such as Intrusion detection)
• Use of non-discretized numeric attributes
• Use of post-pruning
• Use of adaptive δ
• Compare with other incremental algorithms (ID5R or SLIQ/SPRINT)
• Adapt to time-changing domains (concept drift)
• Parallelization
5. Conclusion 15
QUESTIONS?
5. Conclusion 16
THANK YOU!
Ad

More Related Content

What's hot (20)

Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning Algorithms
Hichem Felouat
 
Should we be afraid of Transformers?
Should we be afraid of Transformers?Should we be afraid of Transformers?
Should we be afraid of Transformers?
Dominik Seisser
 
An introduction to Deep Learning
An introduction to Deep LearningAn introduction to Deep Learning
An introduction to Deep Learning
Julien SIMON
 
Feature selection
Feature selectionFeature selection
Feature selection
Dong Guo
 
Machine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree LearningMachine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree Learning
butest
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
Md. Ariful Hoque
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
Justin Cletus
 
Big data
Big dataBig data
Big data
Harsh Kishore Mishra
 
Rnn & Lstm
Rnn & LstmRnn & Lstm
Rnn & Lstm
Subash Chandra Pakhrin
 
Data mining introduction
Data mining introductionData mining introduction
Data mining introduction
Basma Gamal
 
Modelling and evaluation
Modelling and evaluationModelling and evaluation
Modelling and evaluation
eShikshak
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
error007
 
Extremely fast decision tree 論文紹介
Extremely fast decision tree 論文紹介Extremely fast decision tree 論文紹介
Extremely fast decision tree 論文紹介
Yu Sugawara
 
Time Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraTime Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del Pra
Data Science Milan
 
Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...
Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...
Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...
Simplilearn
 
Decision tree
Decision treeDecision tree
Decision tree
SEMINARGROOT
 
Report for Speech Emotion Recognition
Report for Speech Emotion RecognitionReport for Speech Emotion Recognition
Report for Speech Emotion Recognition
Dongang (Sean) Wang
 
Rnn and lstm
Rnn and lstmRnn and lstm
Rnn and lstm
Shreshth Saxena
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
Megha Sharma
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
Krish_ver2
 
Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning Algorithms
Hichem Felouat
 
Should we be afraid of Transformers?
Should we be afraid of Transformers?Should we be afraid of Transformers?
Should we be afraid of Transformers?
Dominik Seisser
 
An introduction to Deep Learning
An introduction to Deep LearningAn introduction to Deep Learning
An introduction to Deep Learning
Julien SIMON
 
Feature selection
Feature selectionFeature selection
Feature selection
Dong Guo
 
Machine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree LearningMachine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree Learning
butest
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
Justin Cletus
 
Data mining introduction
Data mining introductionData mining introduction
Data mining introduction
Basma Gamal
 
Modelling and evaluation
Modelling and evaluationModelling and evaluation
Modelling and evaluation
eShikshak
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
error007
 
Extremely fast decision tree 論文紹介
Extremely fast decision tree 論文紹介Extremely fast decision tree 論文紹介
Extremely fast decision tree 論文紹介
Yu Sugawara
 
Time Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraTime Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del Pra
Data Science Milan
 
Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...
Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...
Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...
Simplilearn
 
Report for Speech Emotion Recognition
Report for Speech Emotion RecognitionReport for Speech Emotion Recognition
Report for Speech Emotion Recognition
Dongang (Sean) Wang
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
Megha Sharma
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
Krish_ver2
 

Similar to Mining high speed data streams: Hoeffding and VFDT (20)

Evaluating Classification Algorithms Applied To Data Streams Esteban Donato
Evaluating Classification Algorithms Applied To Data Streams   Esteban DonatoEvaluating Classification Algorithms Applied To Data Streams   Esteban Donato
Evaluating Classification Algorithms Applied To Data Streams Esteban Donato
Esteban Donato
 
MSR 2009
MSR 2009MSR 2009
MSR 2009
swy351
 
Online machine learning in Streaming Applications
Online machine learning in Streaming ApplicationsOnline machine learning in Streaming Applications
Online machine learning in Streaming Applications
Stavros Kontopoulos
 
Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!
Maarten Smeets
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
Radek Maciaszek
 
Modern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High PerformanceModern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High Performance
inside-BigData.com
 
Mining data streams using option trees
Mining data streams using option treesMining data streams using option trees
Mining data streams using option trees
Alexander Decker
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
Mr SMAK
 
Nbvtalkatjntuvizianagaram
NbvtalkatjntuvizianagaramNbvtalkatjntuvizianagaram
Nbvtalkatjntuvizianagaram
Nagasuri Bala Venkateswarlu
 
Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine Learning
Sudarsun Santhiappan
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
David Martínez Rego
 
Matsunaga crowdsourcing IEEE e-science 2014
Matsunaga crowdsourcing IEEE e-science 2014Matsunaga crowdsourcing IEEE e-science 2014
Matsunaga crowdsourcing IEEE e-science 2014
Andrea Matsunaga
 
Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challenges
mustafa sarac
 
Lecture on the annotation of transposable elements
Lecture on the annotation of transposable elementsLecture on the annotation of transposable elements
Lecture on the annotation of transposable elements
fmaumus
 
Entity embeddings for categorical data
Entity embeddings for categorical dataEntity embeddings for categorical data
Entity embeddings for categorical data
Paul Skeie
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
c.titus.brown
 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Maninda Edirisooriya
 
Scaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitterScaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitter
lohitvijayarenu
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
Datamining Tools
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
DataminingTools Inc
 
Evaluating Classification Algorithms Applied To Data Streams Esteban Donato
Evaluating Classification Algorithms Applied To Data Streams   Esteban DonatoEvaluating Classification Algorithms Applied To Data Streams   Esteban Donato
Evaluating Classification Algorithms Applied To Data Streams Esteban Donato
Esteban Donato
 
MSR 2009
MSR 2009MSR 2009
MSR 2009
swy351
 
Online machine learning in Streaming Applications
Online machine learning in Streaming ApplicationsOnline machine learning in Streaming Applications
Online machine learning in Streaming Applications
Stavros Kontopoulos
 
Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!
Maarten Smeets
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
Radek Maciaszek
 
Modern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High PerformanceModern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High Performance
inside-BigData.com
 
Mining data streams using option trees
Mining data streams using option treesMining data streams using option trees
Mining data streams using option trees
Alexander Decker
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
Mr SMAK
 
Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine Learning
Sudarsun Santhiappan
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
David Martínez Rego
 
Matsunaga crowdsourcing IEEE e-science 2014
Matsunaga crowdsourcing IEEE e-science 2014Matsunaga crowdsourcing IEEE e-science 2014
Matsunaga crowdsourcing IEEE e-science 2014
Andrea Matsunaga
 
Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challenges
mustafa sarac
 
Lecture on the annotation of transposable elements
Lecture on the annotation of transposable elementsLecture on the annotation of transposable elements
Lecture on the annotation of transposable elements
fmaumus
 
Entity embeddings for categorical data
Entity embeddings for categorical dataEntity embeddings for categorical data
Entity embeddings for categorical data
Paul Skeie
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
c.titus.brown
 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Maninda Edirisooriya
 
Scaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitterScaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitter
lohitvijayarenu
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
Datamining Tools
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
DataminingTools Inc
 
Ad

Recently uploaded (20)

How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?
Process mining Evangelist
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 
Red Hat Openshift Training - openshift (1).pptx
Red Hat Openshift Training - openshift (1).pptxRed Hat Openshift Training - openshift (1).pptx
Red Hat Openshift Training - openshift (1).pptx
ssuserf60686
 
Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2
Dalal2Ali
 
Automated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptxAutomated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptx
handrymaharjan23
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial IntelligenceDr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug
 
national income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptxnational income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptx
j2492618
 
Chapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptxChapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptx
PermissionTafadzwaCh
 
Mining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - MicrosoftMining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - Microsoft
Process mining Evangelist
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
TYPES OF SOFTWARE_ A Visual Guide.pdf CA SUVIDHA CHAPLOT
TYPES OF SOFTWARE_ A Visual Guide.pdf CA SUVIDHA CHAPLOTTYPES OF SOFTWARE_ A Visual Guide.pdf CA SUVIDHA CHAPLOT
TYPES OF SOFTWARE_ A Visual Guide.pdf CA SUVIDHA CHAPLOT
CA Suvidha Chaplot
 
AWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdfAWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdf
philsparkshome
 
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm     mmmmmfftro.pptxlecture_13 tree in mmmmmmmm     mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
sarajafffri058
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Controlling Financial Processes at a Municipality
Controlling Financial Processes at a MunicipalityControlling Financial Processes at a Municipality
Controlling Financial Processes at a Municipality
Process mining Evangelist
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?
Process mining Evangelist
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 
Red Hat Openshift Training - openshift (1).pptx
Red Hat Openshift Training - openshift (1).pptxRed Hat Openshift Training - openshift (1).pptx
Red Hat Openshift Training - openshift (1).pptx
ssuserf60686
 
Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2
Dalal2Ali
 
Automated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptxAutomated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptx
handrymaharjan23
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial IntelligenceDr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug
 
national income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptxnational income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptx
j2492618
 
Chapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptxChapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptx
PermissionTafadzwaCh
 
Mining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - MicrosoftMining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - Microsoft
Process mining Evangelist
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
TYPES OF SOFTWARE_ A Visual Guide.pdf CA SUVIDHA CHAPLOT
TYPES OF SOFTWARE_ A Visual Guide.pdf CA SUVIDHA CHAPLOTTYPES OF SOFTWARE_ A Visual Guide.pdf CA SUVIDHA CHAPLOT
TYPES OF SOFTWARE_ A Visual Guide.pdf CA SUVIDHA CHAPLOT
CA Suvidha Chaplot
 
AWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdfAWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdf
philsparkshome
 
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm     mmmmmfftro.pptxlecture_13 tree in mmmmmmmm     mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
sarajafffri058
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Controlling Financial Processes at a Municipality
Controlling Financial Processes at a MunicipalityControlling Financial Processes at a Municipality
Controlling Financial Processes at a Municipality
Process mining Evangelist
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
Ad

Mining high speed data streams: Hoeffding and VFDT

  • 1. Mining High-Speed Data Streams Davide Gallitelli Politecnico di Torino – TELECOM ParisTech @DGallitelli95 Mining High-Speed Data Streams 1 Pedro Domingos University of Washington Geoff Hulten University of Washington
  • 2. 1. Introduction 2 Huge and Fast data streaming
  • 3. 1. Introduction 3 KDD systems operating continuously and indefinitely Limited by: • Time • Memory • Sample Size SPRINT Tested on up to a few million examples. Less than a day’s worth!
  • 5. Hoeffding Decision Tree 2. Hoeffding Trees 5
  • 6. 2. Hoeffding Trees 6  Classical DT learners are limited by main memory size  Probably, not all examples are needed to find the best attribute at a node  How to decide how many are necessary? Hoeffding Bound! «Suppose we have made 𝑛 independent observations of a variable 𝑟 with domain 𝑅, and computed their mean 𝑟. The Hoeffding bound states that, with probability 1 − 𝛿, the true mean of the variable is at least 𝑟 − 𝜖»
  • 7. 2. Hoeffding Trees 7 How many examples are enough? • Let 𝐺 𝑋𝑖 be the heuristic measure of choice (Information Gain, Gini Index) • 𝑋 𝑎 : the attribute with the highest attribute evaluation value after n examples • 𝑋 𝑏 : the attribute with the second highest split evaluation function value after n examples • We can compute ∆ 𝐺 = 𝐺 𝑋 𝑎 − 𝐺 𝑋 𝑏 > 𝜖 • Thanks to Hoeffding Bound, we can infer that: • ∆𝐺 ≥ ∆ 𝐺 − 𝜖 > 0 with probability 1 − 𝛿, where ∆𝐺 is the true difference in heuristic measure • This means that we can split the tree using 𝑋 𝑎, and the succeeding examples will be passed to the new leaves (incremental approach)
  • 8. 82. Hoeffding Trees • Compute the heuristic measure for the attributes and determine the best two attributes • At each node chack for the condition ∆ 𝐺 = 𝐺 𝑋 𝑎 − 𝐺 𝑋 𝑏 > 𝜖 • If true, create child nodes based on the test at the node; else, get more examples from stream. HT Algorithm
  • 9. 2. Hoeffding Trees 9 In a nutshell • Learning in Hoeffding tree is constant time per example (instance) and this means Hoeffding tree is suitable for data stream mining. • Requires each example to be read at most once (incrementally built). • With high probability, a Hoeffding tree is asymptotically identical to the decision tree built by a batch learner. 𝐸 ∆𝑖 𝐻𝑇𝛿, 𝐷𝑇∗ ≤ 𝛿 𝑝 • Independent of the probability distribution generating the observations • Built incrementally by sequential reading • Make class predictions in parallel • What happens with ties? • Memory used with tree expansion • Number of candidate attributes goo.gl/gBnm9h goo.gl/QvZMC7
  • 11. 113. VFDT System VFDT (Very Fast Decision Tree) • Hoeffding tree algorithm implementation is VFDT • VFDT includes refinements to the HT algorithm: • Tie-braking algorithm • Recompute G after a user-defined #examples • Deactivation of inactive leaves • Drop of unpromising early attributes (if ∆𝐺 > 𝜖) • Bootstrap with traditional learner on a small subset of data • Rescan of previously-seen examples
  • 12. 123. VFDT System Comparison with C4.5 𝛿 = 10−7 𝜏 = 5% 𝑛 𝑚𝑖𝑛 = 200
  • 13. 134. Application A VFDT application : Web Data • Mining the stream of Web page requests emanating from the whole University of Washington main campus. • Useful to improve Web Caching, by predicting which hosts and pages will be requested in the near future.
  • 14. 145. Conclusion Future Work • Test other applications (such as Intrusion detection) • Use of non-discretized numeric attributes • Use of post-pruning • Use of adaptive δ • Compare with other incremental algorithms (ID5R or SLIQ/SPRINT) • Adapt to time-changing domains (concept drift) • Parallelization

Editor's Notes

  • #3: Let’s think about two situations. On the left, the smart city of the future, with thousands of sensors and control systems. On the right, present days banking systems, which generates millions of transactions per day, and are expected to grow even more as e-shopping continues to spread. Thinking about the data produced by those systems, what are its main characteristics? < change > Size and Quantity. No more standard big data analytics, but high-speed data stream mining.
  • #4: Knowledge discovery systems are constrained by three main limited resources: time, memory and sample size. In traditional applications of machine learning and statistics, sample size tends to be the dominant limitation. In contrast, in many (if not most) present-day data mining applications, the bottleneck is time and memory, not examples. The latter are typically in over-supply, in the sense that it is impossible with current KDD systems to make use of all of them within the available computational resources. Currently, the most efficient algorithms available (e.g., SPRINT or BIRCH) concentrate on making it possible to mine databases that do not fit in main memory by only requiring sequential scans of the disk. But even these algorithms have only been tested on up to a few million examples. Ideally, we would like to have KDD systems that operate continuously and indefinitely, incorporating examples as they arrive, and never losing potentially valuable information. Incremental algorithms are out there, but they are either highly sensitive to example ordering, potentially never recovering from an unfavorable set of early examples, or produce results similar to batch classification with undesired overhead in computation time.
  • #5: Introducing: VFDT, a decision-tree learning system that overcomes the shortcomings of incremental algorithms. It is I/O bound, which means it mines examples in less time than it takes to input them from the disk, it’s an anytime algorithm, meaning that the model is ready-to-use at anytime, it does not store any examples and learns by seeing them exactly once.
  • #7: Hoeffding Trees are born from the limitations of classical decision tree learners, which assume all training data can be simultaneously stored in main memory. HT is based on the assumption that, in order to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node. Given a stream of examples, the first ones will be used to choose the root test; once the root attribute is chosen, the succeeding examples will be passed down to the corresponding leaves and used to choose the appropriate attributes there, and so on recursively. We solve the difficult problem of deciding exactly how many examples are necessary at each node by using a statistical result known as the Hoeffding bound.
  • #8: So, how do we decide how many examples are enough?
  • #10: If HTδ is the tree produced by the Hoeffding tree algorithm with desired probability δ given infinite examples (Table 1), DT∗ is the asymptotic batch tree, and p is the leaf probability, then E[∆i(HTδ, DT∗)] ≤ δ/p. The smaller δ/p , the more similar the Hoeffding tree is to a subtree of the asymptotic batch tree.
  • #12: The Hoeffding tree algorithm was implemented into Very Fast Decision Tree learner (VFDT), which includes some enhancements for practical use. In case of ties, potentially many examples will be required to decide between them with some confidence, which is wasteful since they’re basically equivalent. VFDT splits on the current best attribute. Recomputing G is actually pretty expensive. In VFDT it is possible to define a parameter for the minimum number of examples read before recomputing G. Memory was an issue for HT, meaning that the moew the tree grew, the more memory it needed. VFDT deactivates inactive leaves, only keeping track of the probability of x falling into leaf l, times the observed error rate.
  翻译: