SlideShare a Scribd company logo
Machine learning
on Hadoop data lakes
MichałIwanowski
(michal@deepsense.io)
Piotr Niedźwiedź
(piotr@deepsense.io)
About us
Michał Iwanowski
Product Director at DeepSense.io
computer science at Warsaw University of Technology
previously at IBM developing analyticaltoolkit
for machine learning and data mining
Piotr Niedźwiedź
CTO at DeepSense.io
computerscienceandmathematicsatUniversityofWarsaw
PolishAcademic Championships in Team Programming,
Google Code Jam World Finalist
previously at Google and Facebook
Speakers
DeepSense.io
founded in 2011 byACM ICPC World Champions
>110 scientists, engineers, among us winners of:
InternationalOlympics in Informatics,TopCoder Open,
Google Code Jam,ACM ICPC, Facebook Hacker Cup
solving machine learning & big data problems
for companies in the SiliconValley
data science team - ranked in top 10
of kaggle.com competitions
Machine learning
Big data
Challanges
Algorithms Technology
Algorithms
day temperature [F] weather bike rentals
3 76 Cloudy 543
4 72 Raining 173
5 78 Sunny 674
6 68 Raining 124
day temperature [F] cloudy sunny raining bike rentals
3 76 1 0 0 543
4 72 0 0 1 173
5 78 0 1 0 674
6 68 0 0 1 124
One-hot encoding
weather
cloudy sunny raining
1 0 0
0 1 0
0 0 1
id “be” “happy" “be happy” “to” “or” “not” “to be” “be or” “or not” “not to”
123 1 1 1 0 0 0 0 0 0 0
321 1 0 0 1 1 1 1 1 1 1
id message
123 “be happy”
321 “To be or not to be”
NLP:Vectorization
Vectorization obstacles
BIG FEATURE SPACE
over 106
words in the English language 1012
possible bigrams
NEW CATEGORIES
(e.g. weather = “windy”) require changes in the schema
BIGRAM / CATEGORY
HASHING FUNCTION FEATURE NUMBER
Fixed number offeatures: k k < N
Hash trick
1: „to be”
2: „be or”
3: „or nor”
4: „not to”
5: „that is”
...
N: „be happy”
1
2
...
k
Hash trick continued
We gain:
id message
123 “be happy”
321 “To be or not to be”
id Feat. 1 Feat. 2 Feat 3. ... Feat. k
123 1 0 1 ... 0
321 1 0 0 ... 1
feature space reduction
(k is determined upfront)
collisions
usuallydon’t bother us
new categories
n-grams are assigned
to existing bins
Offline learning
Training data Test streamTraining stream
Periodicalupdate
Scores
MODELTRAINING
MACHINE-LEARNING
MODEL
[new version]
MACHINE-LEARNING
MODEL
[lastest version]
PREDICTION
Offline learning problems
in reallife
models can not quickly
adapt to latest trends
not suitable for big data
big computation
infrastructure cost
=
Online learning
Modelupdate after each processed training example
Test stream
Training
stream
Scores
MODELUPDATE PREDICTION
MACHINE-LEARNING MODEL
Online learning continued
virtually unlimited
data size
Vowpal Wabbit
tool worth mentioning
out-of-core processing
hash trick
mostlylinear models
for supervised learning
functional model
obtained quickly,
then gradually improved
Trymanydifferent models with different parametrization
Tryoutvarious feature engineering methods
Use a benchmarking system // The more automated the better
Howto score great results
in a MLproblem?
Data transformations Algorithms
Neutral networks
SVM
Decision Trees
Random forest
Linear regression
Modelparameters
Alot of combinations
to explore ε = 0.05, λ = 1
ε = 0.00005
ε = 0.005, λ = 2
ε = 0.0005
λ = 4 λ = 1
Technology
Move the cost to the infrastructure
Combination search grid
Cross-validation report
+
Best models
Verification distributed
in the cluster
Modelbenchmarking tool
The architecture
WEB UI
Y
A
R
N
Worker Worker Worker
Challanges & limitations
Scikit-learn limits us to single-node models
Onlya couple of data transformations
are automaticallyevaluated
Need for a more flexible data-transforming platform
DS Studio: architecture in a nutshell
Modelbenchmarking toolcontinued,
with some crucialdifferences:
fullBig Data support
(Spark underneath)
rich toolkit of data
transformations
intuitive visual
environment
(code-free)
DS Studio UI
Michal IwanowskiPiotr Niedzwiedz
Thanks forcoming!
Meet us at our booth!
Ad

More Related Content

What's hot (20)

Machine learning the next revolution or just another hype
Machine learning   the next revolution or just another hypeMachine learning   the next revolution or just another hype
Machine learning the next revolution or just another hype
Jorge Ferrer
 
Agile Deep Learning
Agile Deep LearningAgile Deep Learning
Agile Deep Learning
David Murgatroyd
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Aseda Owusua Addai-Deseh
 
Icml2017 overview
Icml2017 overviewIcml2017 overview
Icml2017 overview
Tatsuya Shirakawa
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
Benjamin Bengfort
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities
台灣資料科學年會
 
姜俊宇/從資料到知識:從零開始的資料探勘
姜俊宇/從資料到知識:從零開始的資料探勘姜俊宇/從資料到知識:從零開始的資料探勘
姜俊宇/從資料到知識:從零開始的資料探勘
台灣資料科學年會
 
李育杰/The Growth of a Data Scientist
李育杰/The Growth of a Data Scientist李育杰/The Growth of a Data Scientist
李育杰/The Growth of a Data Scientist
台灣資料科學年會
 
Learning to Compose Domain-Specific Transformations for Data Augmentation
Learning to Compose Domain-Specific Transformations for Data AugmentationLearning to Compose Domain-Specific Transformations for Data Augmentation
Learning to Compose Domain-Specific Transformations for Data Augmentation
Tatsuya Shirakawa
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
DataRobot
 
Le Machine Learning de A à Z
Le Machine Learning de A à ZLe Machine Learning de A à Z
Le Machine Learning de A à Z
Alexia Audevart
 
Google Developer Groups Talk - TensorFlow
Google Developer Groups Talk - TensorFlowGoogle Developer Groups Talk - TensorFlow
Google Developer Groups Talk - TensorFlow
Harini Gunabalan
 
kaggle_meet_up
kaggle_meet_upkaggle_meet_up
kaggle_meet_up
Marios Michailidis
 
Google Big Data Expo
Google Big Data ExpoGoogle Big Data Expo
Google Big Data Expo
BigDataExpo
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
NUS-ISS
 
A few questions about large scale machine learning
A few questions about large scale machine learningA few questions about large scale machine learning
A few questions about large scale machine learning
Theodoros Vasiloudis
 
A friendly introduction to GANs
A friendly introduction to GANsA friendly introduction to GANs
A friendly introduction to GANs
Csongor Barabasi
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
Paco Nathan
 
Staying Shallow & Lean in a Deep Learning World
Staying Shallow & Lean in a Deep Learning WorldStaying Shallow & Lean in a Deep Learning World
Staying Shallow & Lean in a Deep Learning World
Xavier Amatriain
 
An introduction to Machine Learning (and a little bit of Deep Learning)
An introduction to Machine Learning (and a little bit of Deep Learning)An introduction to Machine Learning (and a little bit of Deep Learning)
An introduction to Machine Learning (and a little bit of Deep Learning)
Thomas da Silva Paula
 
Machine learning the next revolution or just another hype
Machine learning   the next revolution or just another hypeMachine learning   the next revolution or just another hype
Machine learning the next revolution or just another hype
Jorge Ferrer
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Aseda Owusua Addai-Deseh
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
Benjamin Bengfort
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities
台灣資料科學年會
 
姜俊宇/從資料到知識:從零開始的資料探勘
姜俊宇/從資料到知識:從零開始的資料探勘姜俊宇/從資料到知識:從零開始的資料探勘
姜俊宇/從資料到知識:從零開始的資料探勘
台灣資料科學年會
 
Learning to Compose Domain-Specific Transformations for Data Augmentation
Learning to Compose Domain-Specific Transformations for Data AugmentationLearning to Compose Domain-Specific Transformations for Data Augmentation
Learning to Compose Domain-Specific Transformations for Data Augmentation
Tatsuya Shirakawa
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
DataRobot
 
Le Machine Learning de A à Z
Le Machine Learning de A à ZLe Machine Learning de A à Z
Le Machine Learning de A à Z
Alexia Audevart
 
Google Developer Groups Talk - TensorFlow
Google Developer Groups Talk - TensorFlowGoogle Developer Groups Talk - TensorFlow
Google Developer Groups Talk - TensorFlow
Harini Gunabalan
 
Google Big Data Expo
Google Big Data ExpoGoogle Big Data Expo
Google Big Data Expo
BigDataExpo
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
NUS-ISS
 
A few questions about large scale machine learning
A few questions about large scale machine learningA few questions about large scale machine learning
A few questions about large scale machine learning
Theodoros Vasiloudis
 
A friendly introduction to GANs
A friendly introduction to GANsA friendly introduction to GANs
A friendly introduction to GANs
Csongor Barabasi
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
Paco Nathan
 
Staying Shallow & Lean in a Deep Learning World
Staying Shallow & Lean in a Deep Learning WorldStaying Shallow & Lean in a Deep Learning World
Staying Shallow & Lean in a Deep Learning World
Xavier Amatriain
 
An introduction to Machine Learning (and a little bit of Deep Learning)
An introduction to Machine Learning (and a little bit of Deep Learning)An introduction to Machine Learning (and a little bit of Deep Learning)
An introduction to Machine Learning (and a little bit of Deep Learning)
Thomas da Silva Paula
 

Viewers also liked (17)

Machine learning and Big Data (lecture in Polish)
Machine learning and Big Data (lecture in Polish)Machine learning and Big Data (lecture in Polish)
Machine learning and Big Data (lecture in Polish)
Michal Iwanowski
 
DataRobot project
DataRobot projectDataRobot project
DataRobot project
Ping Yin
 
Movie posters powerpoint 2
Movie posters powerpoint 2Movie posters powerpoint 2
Movie posters powerpoint 2
grewal31
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
Vivian S. Zhang
 
Effect of humor on project management
Effect of humor on project managementEffect of humor on project management
Effect of humor on project management
Marco Sampietro
 
Funny Pics
Funny PicsFunny Pics
Funny Pics
Marco Belzoni
 
DataRobot R Package
DataRobot R PackageDataRobot R Package
DataRobot R Package
DataRobot
 
Humour in the workplace for system
Humour in the workplace for systemHumour in the workplace for system
Humour in the workplace for system
angelis1
 
(Kpi summer school 2015) theano tutorial part1
(Kpi summer school 2015) theano tutorial part1(Kpi summer school 2015) theano tutorial part1
(Kpi summer school 2015) theano tutorial part1
Serhii Havrylov
 
(Kpi summer school 2015) theano tutorial part2
(Kpi summer school 2015) theano tutorial part2(Kpi summer school 2015) theano tutorial part2
(Kpi summer school 2015) theano tutorial part2
Serhii Havrylov
 
Model selection and tuning at scale
Model selection and tuning at scaleModel selection and tuning at scale
Model selection and tuning at scale
Owen Zhang
 
Make Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature EngineeringMake Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature Engineering
DataRobot
 
Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...
Nicolas Nicolov
 
(Kpi summer school 2015) word embeddings and neural language modeling
(Kpi summer school 2015) word embeddings and neural language modeling(Kpi summer school 2015) word embeddings and neural language modeling
(Kpi summer school 2015) word embeddings and neural language modeling
Serhii Havrylov
 
Featurizing log data before XGBoost
Featurizing log data before XGBoostFeaturizing log data before XGBoost
Featurizing log data before XGBoost
DataRobot
 
Deep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural ZooDeep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural Zoo
Christian Perone
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
Venkata Reddy Konasani
 
Machine learning and Big Data (lecture in Polish)
Machine learning and Big Data (lecture in Polish)Machine learning and Big Data (lecture in Polish)
Machine learning and Big Data (lecture in Polish)
Michal Iwanowski
 
DataRobot project
DataRobot projectDataRobot project
DataRobot project
Ping Yin
 
Movie posters powerpoint 2
Movie posters powerpoint 2Movie posters powerpoint 2
Movie posters powerpoint 2
grewal31
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
Vivian S. Zhang
 
Effect of humor on project management
Effect of humor on project managementEffect of humor on project management
Effect of humor on project management
Marco Sampietro
 
DataRobot R Package
DataRobot R PackageDataRobot R Package
DataRobot R Package
DataRobot
 
Humour in the workplace for system
Humour in the workplace for systemHumour in the workplace for system
Humour in the workplace for system
angelis1
 
(Kpi summer school 2015) theano tutorial part1
(Kpi summer school 2015) theano tutorial part1(Kpi summer school 2015) theano tutorial part1
(Kpi summer school 2015) theano tutorial part1
Serhii Havrylov
 
(Kpi summer school 2015) theano tutorial part2
(Kpi summer school 2015) theano tutorial part2(Kpi summer school 2015) theano tutorial part2
(Kpi summer school 2015) theano tutorial part2
Serhii Havrylov
 
Model selection and tuning at scale
Model selection and tuning at scaleModel selection and tuning at scale
Model selection and tuning at scale
Owen Zhang
 
Make Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature EngineeringMake Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature Engineering
DataRobot
 
Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...
Nicolas Nicolov
 
(Kpi summer school 2015) word embeddings and neural language modeling
(Kpi summer school 2015) word embeddings and neural language modeling(Kpi summer school 2015) word embeddings and neural language modeling
(Kpi summer school 2015) word embeddings and neural language modeling
Serhii Havrylov
 
Featurizing log data before XGBoost
Featurizing log data before XGBoostFeaturizing log data before XGBoost
Featurizing log data before XGBoost
DataRobot
 
Deep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural ZooDeep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural Zoo
Christian Perone
 
Ad

Similar to Machine learning on Hadoop data lakes (20)

Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxBuilding a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
PyData
 
A Developer's Guide To Machine Learning
A Developer's Guide To Machine LearningA Developer's Guide To Machine Learning
A Developer's Guide To Machine Learning
Sefik Ilkin Serengil
 
Scaling Up Presentation
Scaling Up PresentationScaling Up Presentation
Scaling Up Presentation
Jiaqi Xie
 
Deep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesDeep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep Features
Turi, Inc.
 
Big Data at DYNO
Big Data at DYNOBig Data at DYNO
Big Data at DYNO
Tu Pham
 
Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015
Turi, Inc.
 
VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...
VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...
VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...
Keiichiro Ono
 
PyData Berlin 2018: dvc.org
PyData Berlin 2018: dvc.orgPyData Berlin 2018: dvc.org
PyData Berlin 2018: dvc.org
Dmitry Petrov
 
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
Anant Corporation
 
Sacrificing the golden calf of "coding"
Sacrificing the golden calf of "coding"Sacrificing the golden calf of "coding"
Sacrificing the golden calf of "coding"
Christian Heilmann
 
Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)
Karel Dumon
 
Deep Learning with CNTK
Deep Learning with CNTKDeep Learning with CNTK
Deep Learning with CNTK
Ashish Jaiman
 
Yufeng Guo | Coding the 7 steps of machine learning | Codemotion Madrid 2018
Yufeng Guo |  Coding the 7 steps of machine learning | Codemotion Madrid 2018 Yufeng Guo |  Coding the 7 steps of machine learning | Codemotion Madrid 2018
Yufeng Guo | Coding the 7 steps of machine learning | Codemotion Madrid 2018
Codemotion
 
Open source ai_technical_trend
Open source ai_technical_trendOpen source ai_technical_trend
Open source ai_technical_trend
Mario Cho
 
Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Codemotion Tel Aviv
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science Platform
QAware GmbH
 
Next.ml Boston: Data Science Dev Ops
Next.ml Boston: Data Science Dev OpsNext.ml Boston: Data Science Dev Ops
Next.ml Boston: Data Science Dev Ops
Eric Chiang
 
Teaching Machine Learning with Physical Computing - July 2023
Teaching Machine Learning with Physical Computing - July 2023Teaching Machine Learning with Physical Computing - July 2023
Teaching Machine Learning with Physical Computing - July 2023
Hal Speed
 
Java Developers - What Lies Ahead in the AI era
Java Developers - What Lies Ahead in the AI eraJava Developers - What Lies Ahead in the AI era
Java Developers - What Lies Ahead in the AI era
Emily Jiang
 
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakLearn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
PyData
 
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxBuilding a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
PyData
 
A Developer's Guide To Machine Learning
A Developer's Guide To Machine LearningA Developer's Guide To Machine Learning
A Developer's Guide To Machine Learning
Sefik Ilkin Serengil
 
Scaling Up Presentation
Scaling Up PresentationScaling Up Presentation
Scaling Up Presentation
Jiaqi Xie
 
Deep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesDeep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep Features
Turi, Inc.
 
Big Data at DYNO
Big Data at DYNOBig Data at DYNO
Big Data at DYNO
Tu Pham
 
Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015
Turi, Inc.
 
VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...
VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...
VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...
Keiichiro Ono
 
PyData Berlin 2018: dvc.org
PyData Berlin 2018: dvc.orgPyData Berlin 2018: dvc.org
PyData Berlin 2018: dvc.org
Dmitry Petrov
 
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
Anant Corporation
 
Sacrificing the golden calf of "coding"
Sacrificing the golden calf of "coding"Sacrificing the golden calf of "coding"
Sacrificing the golden calf of "coding"
Christian Heilmann
 
Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)
Karel Dumon
 
Deep Learning with CNTK
Deep Learning with CNTKDeep Learning with CNTK
Deep Learning with CNTK
Ashish Jaiman
 
Yufeng Guo | Coding the 7 steps of machine learning | Codemotion Madrid 2018
Yufeng Guo |  Coding the 7 steps of machine learning | Codemotion Madrid 2018 Yufeng Guo |  Coding the 7 steps of machine learning | Codemotion Madrid 2018
Yufeng Guo | Coding the 7 steps of machine learning | Codemotion Madrid 2018
Codemotion
 
Open source ai_technical_trend
Open source ai_technical_trendOpen source ai_technical_trend
Open source ai_technical_trend
Mario Cho
 
Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Codemotion Tel Aviv
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science Platform
QAware GmbH
 
Next.ml Boston: Data Science Dev Ops
Next.ml Boston: Data Science Dev OpsNext.ml Boston: Data Science Dev Ops
Next.ml Boston: Data Science Dev Ops
Eric Chiang
 
Teaching Machine Learning with Physical Computing - July 2023
Teaching Machine Learning with Physical Computing - July 2023Teaching Machine Learning with Physical Computing - July 2023
Teaching Machine Learning with Physical Computing - July 2023
Hal Speed
 
Java Developers - What Lies Ahead in the AI era
Java Developers - What Lies Ahead in the AI eraJava Developers - What Lies Ahead in the AI era
Java Developers - What Lies Ahead in the AI era
Emily Jiang
 
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakLearn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
PyData
 
Ad

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

Recently uploaded (20)

machines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdfmachines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdf
AmirStern2
 
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
João Esperancinha
 
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Safe Software
 
Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)
Kaya Weers
 
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Cyntexa
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Bepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firmBepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firm
Benard76
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 
Agentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community MeetupAgentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community Meetup
Manoj Batra (1600 + Connections)
 
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
James Anderson
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
How to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabberHow to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabber
eGrabber
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
machines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdfmachines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdf
AmirStern2
 
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
João Esperancinha
 
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Safe Software
 
Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)
Kaya Weers
 
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Cyntexa
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Bepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firmBepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firm
Benard76
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
James Anderson
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
How to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabberHow to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabber
eGrabber
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 

Machine learning on Hadoop data lakes

  • 1. Machine learning on Hadoop data lakes MichałIwanowski (michal@deepsense.io) Piotr Niedźwiedź (piotr@deepsense.io)
  • 3. Michał Iwanowski Product Director at DeepSense.io computer science at Warsaw University of Technology previously at IBM developing analyticaltoolkit for machine learning and data mining Piotr Niedźwiedź CTO at DeepSense.io computerscienceandmathematicsatUniversityofWarsaw PolishAcademic Championships in Team Programming, Google Code Jam World Finalist previously at Google and Facebook Speakers
  • 4. DeepSense.io founded in 2011 byACM ICPC World Champions >110 scientists, engineers, among us winners of: InternationalOlympics in Informatics,TopCoder Open, Google Code Jam,ACM ICPC, Facebook Hacker Cup solving machine learning & big data problems for companies in the SiliconValley data science team - ranked in top 10 of kaggle.com competitions
  • 8. day temperature [F] weather bike rentals 3 76 Cloudy 543 4 72 Raining 173 5 78 Sunny 674 6 68 Raining 124 day temperature [F] cloudy sunny raining bike rentals 3 76 1 0 0 543 4 72 0 0 1 173 5 78 0 1 0 674 6 68 0 0 1 124 One-hot encoding weather cloudy sunny raining 1 0 0 0 1 0 0 0 1
  • 9. id “be” “happy" “be happy” “to” “or” “not” “to be” “be or” “or not” “not to” 123 1 1 1 0 0 0 0 0 0 0 321 1 0 0 1 1 1 1 1 1 1 id message 123 “be happy” 321 “To be or not to be” NLP:Vectorization
  • 10. Vectorization obstacles BIG FEATURE SPACE over 106 words in the English language 1012 possible bigrams NEW CATEGORIES (e.g. weather = “windy”) require changes in the schema
  • 11. BIGRAM / CATEGORY HASHING FUNCTION FEATURE NUMBER Fixed number offeatures: k k < N Hash trick 1: „to be” 2: „be or” 3: „or nor” 4: „not to” 5: „that is” ... N: „be happy” 1 2 ... k
  • 12. Hash trick continued We gain: id message 123 “be happy” 321 “To be or not to be” id Feat. 1 Feat. 2 Feat 3. ... Feat. k 123 1 0 1 ... 0 321 1 0 0 ... 1 feature space reduction (k is determined upfront) collisions usuallydon’t bother us new categories n-grams are assigned to existing bins
  • 13. Offline learning Training data Test streamTraining stream Periodicalupdate Scores MODELTRAINING MACHINE-LEARNING MODEL [new version] MACHINE-LEARNING MODEL [lastest version] PREDICTION
  • 14. Offline learning problems in reallife models can not quickly adapt to latest trends not suitable for big data big computation infrastructure cost =
  • 15. Online learning Modelupdate after each processed training example Test stream Training stream Scores MODELUPDATE PREDICTION MACHINE-LEARNING MODEL
  • 16. Online learning continued virtually unlimited data size Vowpal Wabbit tool worth mentioning out-of-core processing hash trick mostlylinear models for supervised learning functional model obtained quickly, then gradually improved
  • 17. Trymanydifferent models with different parametrization Tryoutvarious feature engineering methods Use a benchmarking system // The more automated the better Howto score great results in a MLproblem?
  • 18. Data transformations Algorithms Neutral networks SVM Decision Trees Random forest Linear regression Modelparameters Alot of combinations to explore ε = 0.05, λ = 1 ε = 0.00005 ε = 0.005, λ = 2 ε = 0.0005 λ = 4 λ = 1
  • 20. Move the cost to the infrastructure Combination search grid Cross-validation report + Best models Verification distributed in the cluster
  • 23. Challanges & limitations Scikit-learn limits us to single-node models Onlya couple of data transformations are automaticallyevaluated Need for a more flexible data-transforming platform
  • 24. DS Studio: architecture in a nutshell Modelbenchmarking toolcontinued, with some crucialdifferences: fullBig Data support (Spark underneath) rich toolkit of data transformations intuitive visual environment (code-free)
  • 26. Michal IwanowskiPiotr Niedzwiedz Thanks forcoming! Meet us at our booth!
  翻译: