SlideShare a Scribd company logo
Milan – July 13 2016
Introduction to
Distributed Computing Engines for
Data Processing
Simone Robutti
Machine Learning Engineer at Radicalbit
@SimoneRobutti
What is a Distributed Computing System
It’s the solution to the problem where your
RAM is too small and your data are too big
and/or too CPU-intensive to be processed on
a single machine.
What is a Distributed Computing System
Solution: a huge, monolithic mainframe.
What is a Distributed Computing System
Solution: a huge, monolithic mainframe.
What is a Distributed Computing System
Solution: do your job on a cluster.
Distributed vs Parallel
Parallel: execute identical tasks (with different data
or parameters).
Parallel Distributed: do this on multiple machines.
Distributed: split a big task into smaller tasks and
execute them on multiple machines
What is a Distributed Computing System
Goal: the programmer should write its
programs easily and efficiently without caring
about distribution.
Issues: a cluster is complex and conceptually
very far from a local environment.
Hadoop
What: the first OSS distributed computing engine.
When: 2006 (work began).
Where: Google.
Why: Google had a lot of data (for that time) to process. They built
a solution. Eventually it became a series of papers and got
implemented as OSS.
How: HDFS (distributed file system), MapReduce (computational
abstraction), YARN (resource and cluster manager).
MapReduce - Example
Credits to @sergejusb
Hadoop Today
Common in many enterprise environments.
Still good enough for many batch processing use cases.
HDFS and Yarn widely used by other processing engines.
i.e.:
● Log analysis
● Clickstream analysis
● Text processing
Spark
What: a more generic distributed processing engine for batch and
streaming alike.
When: 2014 (1.0 Release).
Where: Berkeley + Databricks.
Why: They aimed for faster and more general processing with
better abstractions on top.
How: InMemory computing, DAG, RDD, polyglot functional API,
libraries out-of-the-box .
Resilient Distributed Dataset
RDDs hide the underlying distribution of data with a functional API
Directed Acyclic Graph
The graph is defined by the user. The runtime translates it to
operations on distributed data.
create filter
filter join
collect
map
Spark Today
The hot topic everyone talks about.
Just entered the phase of maturity, with an already huge and fast-
growing ecosystem of libraries, integrations and tools.
Widely used as the go-to solution for Big Data (and not-so-Big) use
cases.
I.e.:
● Recommending systems
● Fraud Detection
● Attack-Detection
● Near-Real time decision-heavy solutions
Flink
What: a streaming first (with batch on top), low latency distributed
processing engine.
When: 2016 (1.0 Release).
Where: German Research Foundation + dataArtisans.
Why: a faster and flexible computational model that could
guarantee low latency, high-throughput and fault-tolerance all at the
same time.
How: streaming-first approach, checkpointing, lazy symbolic
computation, powerful optimizations.
Flink Today
Perceived as an alternative to Spark. Gaining traction for specific
use-cases (real-time streaming) but performs well on most generic
uses cases.
Solid runtime and optimization; API and ecosystem still young.
Many big companies already adopted it for fast-data applications.
I.e.:
● Real-time precise analytics (counting)
● Real-time model evaluation
● Online Learning solutions
Alternative solutions
● Apache Storm/Heron
● Apache Samza
● Apache GearPump
● Apache Apex
Following next
"The Barclays Data Science Hackathon: Building
Retail Recommender Systems based on Customer
Shopping Behaviour" by Gianmario Spacagna, Senior Data
Scientist @ Pirelli
"Data intensive applications with Apache Flink" by
Simone Robutti, Machine Learning Engineer @ Radicalbit
Ad

More Related Content

What's hot (20)

Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
C4Media
 
TOGAF 9 Architectural Artifacts
TOGAF 9  Architectural ArtifactsTOGAF 9  Architectural Artifacts
TOGAF 9 Architectural Artifacts
Maganathin Veeraragaloo
 
Data Mesh
Data MeshData Mesh
Data Mesh
Piethein Strengholt
 
Data Governance Best Practices
Data Governance Best PracticesData Governance Best Practices
Data Governance Best Practices
DATAVERSITY
 
Part 3 - Modern Data Warehouse with Azure Synapse
Part 3 - Modern Data Warehouse with Azure SynapsePart 3 - Modern Data Warehouse with Azure Synapse
Part 3 - Modern Data Warehouse with Azure Synapse
Nilesh Gule
 
Modern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemModern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform System
James Serra
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Vivek Aanand Ganesan
 
Build data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelinesBuild data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelines
Mark Kromer
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
James Serra
 
Data modelling 101
Data modelling 101Data modelling 101
Data modelling 101
Christopher Bradley
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
Kujambu Murugesan
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Hadi Fadlallah
 
Five Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceFive Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data Governance
DATAVERSITY
 
Enterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureEnterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data Architecture
DATAVERSITY
 
Data weekender4.2 azure purview erwin de kreuk
Data weekender4.2  azure purview erwin de kreukData weekender4.2  azure purview erwin de kreuk
Data weekender4.2 azure purview erwin de kreuk
Erwin de Kreuk
 
Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)
Denodo
 
Graphs for Enterprise Architects
Graphs for Enterprise ArchitectsGraphs for Enterprise Architects
Graphs for Enterprise Architects
Neo4j
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
Timothy McAliley
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
ETL VS ELT.pdf
ETL VS ELT.pdfETL VS ELT.pdf
ETL VS ELT.pdf
BOSupport
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
C4Media
 
Data Governance Best Practices
Data Governance Best PracticesData Governance Best Practices
Data Governance Best Practices
DATAVERSITY
 
Part 3 - Modern Data Warehouse with Azure Synapse
Part 3 - Modern Data Warehouse with Azure SynapsePart 3 - Modern Data Warehouse with Azure Synapse
Part 3 - Modern Data Warehouse with Azure Synapse
Nilesh Gule
 
Modern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemModern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform System
James Serra
 
Build data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelinesBuild data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelines
Mark Kromer
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
James Serra
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
Kujambu Murugesan
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Hadi Fadlallah
 
Five Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceFive Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data Governance
DATAVERSITY
 
Enterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureEnterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data Architecture
DATAVERSITY
 
Data weekender4.2 azure purview erwin de kreuk
Data weekender4.2  azure purview erwin de kreukData weekender4.2  azure purview erwin de kreuk
Data weekender4.2 azure purview erwin de kreuk
Erwin de Kreuk
 
Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)
Denodo
 
Graphs for Enterprise Architects
Graphs for Enterprise ArchitectsGraphs for Enterprise Architects
Graphs for Enterprise Architects
Neo4j
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
Timothy McAliley
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
ETL VS ELT.pdf
ETL VS ELT.pdfETL VS ELT.pdf
ETL VS ELT.pdf
BOSupport
 

Viewers also liked (20)

Project “Deep Water” (H2O integration with other deep learning libraries - Jo...
Project “Deep Water” (H2O integration with other deep learning libraries - Jo...Project “Deep Water” (H2O integration with other deep learning libraries - Jo...
Project “Deep Water” (H2O integration with other deep learning libraries - Jo...
Data Science Milan
 
Introduction to Machine Learning with H2O - Jo-Fai (Joe) Chow, H2O
Introduction to Machine Learning with H2O - Jo-Fai (Joe) Chow, H2OIntroduction to Machine Learning with H2O - Jo-Fai (Joe) Chow, H2O
Introduction to Machine Learning with H2O - Jo-Fai (Joe) Chow, H2O
Data Science Milan
 
H2O for IoT - Jo-Fai (Joe) Chow, H2O
H2O for IoT - Jo-Fai (Joe) Chow, H2OH2O for IoT - Jo-Fai (Joe) Chow, H2O
H2O for IoT - Jo-Fai (Joe) Chow, H2O
Data Science Milan
 
The Barclays Data Science Hackathon: Building Retail Recommender Systems base...
The Barclays Data Science Hackathon: Building Retail Recommender Systems base...The Barclays Data Science Hackathon: Building Retail Recommender Systems base...
The Barclays Data Science Hackathon: Building Retail Recommender Systems base...
Data Science Milan
 
Inaugural talk Data Science Milan - Gianmario Spacagna
Inaugural talk Data Science Milan - Gianmario SpacagnaInaugural talk Data Science Milan - Gianmario Spacagna
Inaugural talk Data Science Milan - Gianmario Spacagna
Data Science Milan
 
Data intensive applications with Apache Flink - Simone Robutti, Radicalbit
Data intensive applications with Apache Flink - Simone Robutti, RadicalbitData intensive applications with Apache Flink - Simone Robutti, Radicalbit
Data intensive applications with Apache Flink - Simone Robutti, Radicalbit
Data Science Milan
 
mlk-newsletter-april-2013
mlk-newsletter-april-2013mlk-newsletter-april-2013
mlk-newsletter-april-2013
LauraOlivia OCampo
 
La revolución-verde-kappa-cornforth
La revolución-verde-kappa-cornforthLa revolución-verde-kappa-cornforth
La revolución-verde-kappa-cornforth
Sarahí Garcia
 
Planning my blog1
Planning my blog1Planning my blog1
Planning my blog1
Ximena Calle
 
Racial Profiling and Its Effects
Racial Profiling and Its EffectsRacial Profiling and Its Effects
Racial Profiling and Its Effects
Chey Bradley
 
Equipo kappa-1
Equipo kappa-1Equipo kappa-1
Equipo kappa-1
Sarahí Garcia
 
Rajesh
RajeshRajesh
Rajesh
Rajesh Babu
 
505LeePosterPresentation
505LeePosterPresentation505LeePosterPresentation
505LeePosterPresentation
Anita Louise Kariniemi
 
Britton_NoAH World Package Design 10_Page_1-10
Britton_NoAH World Package Design 10_Page_1-10Britton_NoAH World Package Design 10_Page_1-10
Britton_NoAH World Package Design 10_Page_1-10
Patti Britton
 
Instituciones administrativas del trabajo
Instituciones administrativas del trabajoInstituciones administrativas del trabajo
Instituciones administrativas del trabajo
yessihernendez
 
Tecnologias de la comunicación y su infuencia
Tecnologias de la comunicación y su infuenciaTecnologias de la comunicación y su infuencia
Tecnologias de la comunicación y su infuencia
jhon alexander garcia marin
 
Proyecto de-química
Proyecto de-químicaProyecto de-química
Proyecto de-química
Sarahí Garcia
 
Bollards
BollardsBollards
Bollards
mnfsteel
 
SMART Board PowerPoint
SMART Board PowerPointSMART Board PowerPoint
SMART Board PowerPoint
Anita Louise Kariniemi
 
Como a evolucionado la tecnología
Como a evolucionado la tecnologíaComo a evolucionado la tecnología
Como a evolucionado la tecnología
LUIS FERNANDO LEON PINTO
 
Project “Deep Water” (H2O integration with other deep learning libraries - Jo...
Project “Deep Water” (H2O integration with other deep learning libraries - Jo...Project “Deep Water” (H2O integration with other deep learning libraries - Jo...
Project “Deep Water” (H2O integration with other deep learning libraries - Jo...
Data Science Milan
 
Introduction to Machine Learning with H2O - Jo-Fai (Joe) Chow, H2O
Introduction to Machine Learning with H2O - Jo-Fai (Joe) Chow, H2OIntroduction to Machine Learning with H2O - Jo-Fai (Joe) Chow, H2O
Introduction to Machine Learning with H2O - Jo-Fai (Joe) Chow, H2O
Data Science Milan
 
H2O for IoT - Jo-Fai (Joe) Chow, H2O
H2O for IoT - Jo-Fai (Joe) Chow, H2OH2O for IoT - Jo-Fai (Joe) Chow, H2O
H2O for IoT - Jo-Fai (Joe) Chow, H2O
Data Science Milan
 
The Barclays Data Science Hackathon: Building Retail Recommender Systems base...
The Barclays Data Science Hackathon: Building Retail Recommender Systems base...The Barclays Data Science Hackathon: Building Retail Recommender Systems base...
The Barclays Data Science Hackathon: Building Retail Recommender Systems base...
Data Science Milan
 
Inaugural talk Data Science Milan - Gianmario Spacagna
Inaugural talk Data Science Milan - Gianmario SpacagnaInaugural talk Data Science Milan - Gianmario Spacagna
Inaugural talk Data Science Milan - Gianmario Spacagna
Data Science Milan
 
Data intensive applications with Apache Flink - Simone Robutti, Radicalbit
Data intensive applications with Apache Flink - Simone Robutti, RadicalbitData intensive applications with Apache Flink - Simone Robutti, Radicalbit
Data intensive applications with Apache Flink - Simone Robutti, Radicalbit
Data Science Milan
 
La revolución-verde-kappa-cornforth
La revolución-verde-kappa-cornforthLa revolución-verde-kappa-cornforth
La revolución-verde-kappa-cornforth
Sarahí Garcia
 
Racial Profiling and Its Effects
Racial Profiling and Its EffectsRacial Profiling and Its Effects
Racial Profiling and Its Effects
Chey Bradley
 
Britton_NoAH World Package Design 10_Page_1-10
Britton_NoAH World Package Design 10_Page_1-10Britton_NoAH World Package Design 10_Page_1-10
Britton_NoAH World Package Design 10_Page_1-10
Patti Britton
 
Instituciones administrativas del trabajo
Instituciones administrativas del trabajoInstituciones administrativas del trabajo
Instituciones administrativas del trabajo
yessihernendez
 
Ad

Similar to Introduction to Distributed Computing Engines for Data Processing - Simone Robutti, Radicalbit (20)

The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is Distributed
Alluxio, Inc.
 
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Hazelcast
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
Sarvesh Meena
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Mahantesh Angadi
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Ahmed Elsayed
 
Big Data
Big DataBig Data
Big Data
NGDATA
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
ijsrd.com
 
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISONMAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
ijcsit
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
Arvind Kalyan
 
Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and Mapreduce
Krishna Sangeeth KS
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
Varun Narang
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
Krishna Sujeer
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
Arjen de Vries
 
B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
International Journal of Engineering Inventions www.ijeijournal.com
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
Atul Kushwaha
 
Hadoop and MapReduce addDdaDadadDDAD.pptx
Hadoop and MapReduce addDdaDadadDDAD.pptxHadoop and MapReduce addDdaDadadDDAD.pptx
Hadoop and MapReduce addDdaDadadDDAD.pptx
ms236400269
 
B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
International Journal of Engineering Inventions www.ijeijournal.com
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
Evert Lammerts
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
Anusha sweety
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is Distributed
Alluxio, Inc.
 
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Hazelcast
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
Sarvesh Meena
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Mahantesh Angadi
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Ahmed Elsayed
 
Big Data
Big DataBig Data
Big Data
NGDATA
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
ijsrd.com
 
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISONMAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
ijcsit
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
Arvind Kalyan
 
Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and Mapreduce
Krishna Sangeeth KS
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
Varun Narang
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
Arjen de Vries
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
Atul Kushwaha
 
Hadoop and MapReduce addDdaDadadDDAD.pptx
Hadoop and MapReduce addDdaDadadDDAD.pptxHadoop and MapReduce addDdaDadadDDAD.pptx
Hadoop and MapReduce addDdaDadadDDAD.pptx
ms236400269
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
Evert Lammerts
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
Anusha sweety
 
Ad

More from Data Science Milan (20)

ML & Graph algorithms to prevent financial crime in digital payments
ML & Graph  algorithms to prevent  financial crime in  digital paymentsML & Graph  algorithms to prevent  financial crime in  digital payments
ML & Graph algorithms to prevent financial crime in digital payments
Data Science Milan
 
How to use the Economic Complexity Index to guide innovation plans
How to use the Economic Complexity Index to guide innovation plansHow to use the Economic Complexity Index to guide innovation plans
How to use the Economic Complexity Index to guide innovation plans
Data Science Milan
 
Robustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning MethodsRobustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning Methods
Data Science Milan
 
"You don't need a bigger boat": serverless MLOps for reasonable companies
"You don't need a bigger boat": serverless MLOps for reasonable companies"You don't need a bigger boat": serverless MLOps for reasonable companies
"You don't need a bigger boat": serverless MLOps for reasonable companies
Data Science Milan
 
Question generation using Natural Language Processing by QuestGen.AI
Question generation using Natural Language Processing by QuestGen.AIQuestion generation using Natural Language Processing by QuestGen.AI
Question generation using Natural Language Processing by QuestGen.AI
Data Science Milan
 
Speed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWSSpeed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWS
Data Science Milan
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
Data Science Milan
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
Data Science Milan
 
Reinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del PraReinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del Pra
Data Science Milan
 
Time Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraTime Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del Pra
Data Science Milan
 
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AILudwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Data Science Milan
 
Audience projection of target consumers over multiple domains a ner and baye...
Audience projection of target consumers over multiple domains  a ner and baye...Audience projection of target consumers over multiple domains  a ner and baye...
Audience projection of target consumers over multiple domains a ner and baye...
Data Science Milan
 
Weak supervised learning - Kristina Khvatova
Weak supervised learning - Kristina KhvatovaWeak supervised learning - Kristina Khvatova
Weak supervised learning - Kristina Khvatova
Data Science Milan
 
GANs beyond nice pictures: real value of data generation, Alex Honchar
GANs beyond nice pictures: real value of data generation, Alex HoncharGANs beyond nice pictures: real value of data generation, Alex Honchar
GANs beyond nice pictures: real value of data generation, Alex Honchar
Data Science Milan
 
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo LomonacoContinual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Data Science Milan
 
3D Point Cloud analysis using Deep Learning
3D Point Cloud analysis using Deep Learning3D Point Cloud analysis using Deep Learning
3D Point Cloud analysis using Deep Learning
Data Science Milan
 
Deep time-to-failure: predicting failures, churns and customer lifetime with ...
Deep time-to-failure: predicting failures, churns and customer lifetime with ...Deep time-to-failure: predicting failures, churns and customer lifetime with ...
Deep time-to-failure: predicting failures, churns and customer lifetime with ...
Data Science Milan
 
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...
Data Science Milan
 
Pricing Optimization: Close-out, Online and Renewal strategies, Data Reply
Pricing Optimization: Close-out, Online and Renewal strategies, Data ReplyPricing Optimization: Close-out, Online and Renewal strategies, Data Reply
Pricing Optimization: Close-out, Online and Renewal strategies, Data Reply
Data Science Milan
 
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig...
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig..."How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig...
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig...
Data Science Milan
 
ML & Graph algorithms to prevent financial crime in digital payments
ML & Graph  algorithms to prevent  financial crime in  digital paymentsML & Graph  algorithms to prevent  financial crime in  digital payments
ML & Graph algorithms to prevent financial crime in digital payments
Data Science Milan
 
How to use the Economic Complexity Index to guide innovation plans
How to use the Economic Complexity Index to guide innovation plansHow to use the Economic Complexity Index to guide innovation plans
How to use the Economic Complexity Index to guide innovation plans
Data Science Milan
 
Robustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning MethodsRobustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning Methods
Data Science Milan
 
"You don't need a bigger boat": serverless MLOps for reasonable companies
"You don't need a bigger boat": serverless MLOps for reasonable companies"You don't need a bigger boat": serverless MLOps for reasonable companies
"You don't need a bigger boat": serverless MLOps for reasonable companies
Data Science Milan
 
Question generation using Natural Language Processing by QuestGen.AI
Question generation using Natural Language Processing by QuestGen.AIQuestion generation using Natural Language Processing by QuestGen.AI
Question generation using Natural Language Processing by QuestGen.AI
Data Science Milan
 
Speed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWSSpeed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWS
Data Science Milan
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
Data Science Milan
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
Data Science Milan
 
Reinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del PraReinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del Pra
Data Science Milan
 
Time Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraTime Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del Pra
Data Science Milan
 
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AILudwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Data Science Milan
 
Audience projection of target consumers over multiple domains a ner and baye...
Audience projection of target consumers over multiple domains  a ner and baye...Audience projection of target consumers over multiple domains  a ner and baye...
Audience projection of target consumers over multiple domains a ner and baye...
Data Science Milan
 
Weak supervised learning - Kristina Khvatova
Weak supervised learning - Kristina KhvatovaWeak supervised learning - Kristina Khvatova
Weak supervised learning - Kristina Khvatova
Data Science Milan
 
GANs beyond nice pictures: real value of data generation, Alex Honchar
GANs beyond nice pictures: real value of data generation, Alex HoncharGANs beyond nice pictures: real value of data generation, Alex Honchar
GANs beyond nice pictures: real value of data generation, Alex Honchar
Data Science Milan
 
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo LomonacoContinual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Data Science Milan
 
3D Point Cloud analysis using Deep Learning
3D Point Cloud analysis using Deep Learning3D Point Cloud analysis using Deep Learning
3D Point Cloud analysis using Deep Learning
Data Science Milan
 
Deep time-to-failure: predicting failures, churns and customer lifetime with ...
Deep time-to-failure: predicting failures, churns and customer lifetime with ...Deep time-to-failure: predicting failures, churns and customer lifetime with ...
Deep time-to-failure: predicting failures, churns and customer lifetime with ...
Data Science Milan
 
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...
Data Science Milan
 
Pricing Optimization: Close-out, Online and Renewal strategies, Data Reply
Pricing Optimization: Close-out, Online and Renewal strategies, Data ReplyPricing Optimization: Close-out, Online and Renewal strategies, Data Reply
Pricing Optimization: Close-out, Online and Renewal strategies, Data Reply
Data Science Milan
 
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig...
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig..."How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig...
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig...
Data Science Milan
 

Recently uploaded (20)

Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
Understanding Complex Development Processes
Understanding Complex Development ProcessesUnderstanding Complex Development Processes
Understanding Complex Development Processes
Process mining Evangelist
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial IntelligenceDr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug
 
AI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptxAI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptx
AyeshaJalil6
 
Multi-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline OrchestrationMulti-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline Orchestration
Romi Kuntsman
 
real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682
way to join real illuminati Agent In Kampala Call/WhatsApp+256782561496/0756664682
 
Process Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - JourneyProcess Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - Journey
Process mining Evangelist
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
AWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptxAWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptx
bharatkumarbhojwani
 
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm     mmmmmfftro.pptxlecture_13 tree in mmmmmmmm     mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
sarajafffri058
 
Sets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledgeSets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledge
saumyasl2020
 
Ann Naser Nabil- Data Scientist Portfolio.pdf
Ann Naser Nabil- Data Scientist Portfolio.pdfAnn Naser Nabil- Data Scientist Portfolio.pdf
Ann Naser Nabil- Data Scientist Portfolio.pdf
আন্ নাসের নাবিল
 
report (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhsreport (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhs
AngelPinedaTaguinod
 
hersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distributionhersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distribution
hershtara1
 
AWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdfAWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdf
philsparkshome
 
Automated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptxAutomated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptx
handrymaharjan23
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]
globibo
 
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial IntelligenceDr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug
 
AI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptxAI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptx
AyeshaJalil6
 
Multi-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline OrchestrationMulti-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline Orchestration
Romi Kuntsman
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
AWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptxAWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptx
bharatkumarbhojwani
 
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm     mmmmmfftro.pptxlecture_13 tree in mmmmmmmm     mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
sarajafffri058
 
Sets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledgeSets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledge
saumyasl2020
 
report (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhsreport (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhs
AngelPinedaTaguinod
 
hersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distributionhersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distribution
hershtara1
 
AWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdfAWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdf
philsparkshome
 
Automated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptxAutomated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptx
handrymaharjan23
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]
globibo
 

Introduction to Distributed Computing Engines for Data Processing - Simone Robutti, Radicalbit

  • 1. Milan – July 13 2016 Introduction to Distributed Computing Engines for Data Processing Simone Robutti Machine Learning Engineer at Radicalbit @SimoneRobutti
  • 2. What is a Distributed Computing System It’s the solution to the problem where your RAM is too small and your data are too big and/or too CPU-intensive to be processed on a single machine.
  • 3. What is a Distributed Computing System Solution: a huge, monolithic mainframe.
  • 4. What is a Distributed Computing System Solution: a huge, monolithic mainframe.
  • 5. What is a Distributed Computing System Solution: do your job on a cluster.
  • 6. Distributed vs Parallel Parallel: execute identical tasks (with different data or parameters). Parallel Distributed: do this on multiple machines. Distributed: split a big task into smaller tasks and execute them on multiple machines
  • 7. What is a Distributed Computing System Goal: the programmer should write its programs easily and efficiently without caring about distribution. Issues: a cluster is complex and conceptually very far from a local environment.
  • 8. Hadoop What: the first OSS distributed computing engine. When: 2006 (work began). Where: Google. Why: Google had a lot of data (for that time) to process. They built a solution. Eventually it became a series of papers and got implemented as OSS. How: HDFS (distributed file system), MapReduce (computational abstraction), YARN (resource and cluster manager).
  • 10. Hadoop Today Common in many enterprise environments. Still good enough for many batch processing use cases. HDFS and Yarn widely used by other processing engines. i.e.: ● Log analysis ● Clickstream analysis ● Text processing
  • 11. Spark What: a more generic distributed processing engine for batch and streaming alike. When: 2014 (1.0 Release). Where: Berkeley + Databricks. Why: They aimed for faster and more general processing with better abstractions on top. How: InMemory computing, DAG, RDD, polyglot functional API, libraries out-of-the-box .
  • 12. Resilient Distributed Dataset RDDs hide the underlying distribution of data with a functional API
  • 13. Directed Acyclic Graph The graph is defined by the user. The runtime translates it to operations on distributed data. create filter filter join collect map
  • 14. Spark Today The hot topic everyone talks about. Just entered the phase of maturity, with an already huge and fast- growing ecosystem of libraries, integrations and tools. Widely used as the go-to solution for Big Data (and not-so-Big) use cases. I.e.: ● Recommending systems ● Fraud Detection ● Attack-Detection ● Near-Real time decision-heavy solutions
  • 15. Flink What: a streaming first (with batch on top), low latency distributed processing engine. When: 2016 (1.0 Release). Where: German Research Foundation + dataArtisans. Why: a faster and flexible computational model that could guarantee low latency, high-throughput and fault-tolerance all at the same time. How: streaming-first approach, checkpointing, lazy symbolic computation, powerful optimizations.
  • 16. Flink Today Perceived as an alternative to Spark. Gaining traction for specific use-cases (real-time streaming) but performs well on most generic uses cases. Solid runtime and optimization; API and ecosystem still young. Many big companies already adopted it for fast-data applications. I.e.: ● Real-time precise analytics (counting) ● Real-time model evaluation ● Online Learning solutions
  • 17. Alternative solutions ● Apache Storm/Heron ● Apache Samza ● Apache GearPump ● Apache Apex
  • 18. Following next "The Barclays Data Science Hackathon: Building Retail Recommender Systems based on Customer Shopping Behaviour" by Gianmario Spacagna, Senior Data Scientist @ Pirelli "Data intensive applications with Apache Flink" by Simone Robutti, Machine Learning Engineer @ Radicalbit
  翻译: