SlideShare a Scribd company logo
Scaling Machine Learning
with Apache Spark
Holly Smith
Niall Turbitt
About
Holly Smith
Senior Consultant at Databricks
▪ Professional Services and Training
▪ Experience
▪ Credit Risk & Decisioning
▪ Mobile Banking
▪ Forecasting & Optimisation
▪ BSc Mathematics University of Greenwich
About
Niall Turbitt
Senior Data Scientist at Databricks
▪ Professional Services and Training
▪ Experience
▪ e-Commerce
▪ Supply Chain and Logistics
▪ Recommender Systems & Personalisation
▪ MS Statistics University College Dublin
▪ BA Mathematics & Economics Trinity College
Dublin
Outline
▪ Motivation
▪ Spark Architecture Recap
▪ Paradigms of ML on Spark:
■ Training & Tuning
▪ Spark MLlib
▪ Pandas Function APIs
▪ Hyperopt
■ Inference
▪ Pandas UDFs
Motivation
▪ Confusion around where and when to use
Spark for Machine Learning
▪ Spark is powerful: harness the full
potential of Spark for Machine Learning
▪ Fast moving environment
▪ Use new APIs and techniques before they become
mainstream
Motivation
▪ Confusion around where and when to use
Spark for Machine Learning
▪ Spark is powerful: harness the full
potential of Spark for Machine Learning
▪ Fast moving environment
▪ Use new APIs and techniques before they become
mainstream
Refresher: Spark Architecture
Cluster
Driver
Worker
Core Core
Core Core
= a single node
Worker
Core Core
Core Core
Worker
Core Core
Core Core
Worker
Core Core
Core Core
ML on Spark
Training Phase Model
Artefact
Inference
Model
Training Tuning
The ML Lifecycle
Predictions
Training Phase Model
Artefact
Inference
Model
Training Tuning Predictions
The ML Lifecycle
Data Parallel
Training
Distributed
MLLibrary
Training Data Nodes Model(s)
Training a single model across multiple
nodes
Creates one model artefact
ML Training & Tuning on Spark
Data Parallel
Training
Training Data Nodes Model(s)
Training a single model across multiple
nodes
Creates one model artefact
Parallel Model
Training
SingleNode
MLLibrary
Training multiple unique versions of a
model in parallel
Sending different groups of data to
each node, creating a model per group
A
C
B
ML Training & Tuning on Spark
Distributed
MLLibrary
Data Parallel
Training
Training Data Nodes Model(s)
Training a single model across multiple
nodes
Creates one model artefact
Parallel Model
Training
SingleNode
MLLibrary
Training multiple unique versions of a
model in parallel
Sending different groups of data to
each node, creating a model per group
Parallel
Hyperparameter
Optimisation
Evaluating multiple hyperparameter
configurations in parallel
Combination
𝝰
𝝲
𝝱
A
C
B
ML Training & Tuning on Spark
Distributed
MLLibrary
Model New Data Nodes Predictions
...
...
ML Inference on Spark
For both distributed and single node ML libraries:
1. Take trained model
2. Distribute new instances
3. Apply model in parallel
ML Project Considerations
▪ Data Dependent
▪ Compute Resources Available
▪ Single machine vs distributed computing
▪ Inference: Deployment Requirements
Throughput Latency Example
Batch High Hours to days Customer churn prediction
Streaming Medium Seconds to minutes Predictive maintenance
Real-time Low Milliseconds Fraud detection
▪ Spark’s Machine Learning Library
▪ ML algorithms
▪ Featurization
▪ Pipelines
▪ MLlib vs sklearn
▪ A note on terminology:
Parallelising Single-Model Training
Spark MLlib
Distributed ML Library
What is meant by “MLlib”
Spark.mllib RDD based API
Maintenance Mode
Spark.ml Dataframe based
API
Recommended
▪ Spark’s Machine Learning Library
▪ ML algorithms
▪ Featurization
▪ Pipelines
▪ MLlib vs sklearn
▪ A note on terminology:
Parallelising Single-Model Training
from pyspark.ml.regression import
LinearRegression
train_df = spark.read…
test_df = spark.read…
lr = LinearRegression().fit(train_df)
predictions = lr.transform(test_df)
Spark MLlib
Distributed ML Library
What is meant by “MLlib”
Spark.mllib RDD based API
Maintenance Mode
Spark.ml Dataframe based API
Recommended
Pandas Function API - Grouped Map
▪ DataFrame.groupby().applyInPandas()
▪ Directly apply a Python native function against a
Spark DataFrame as if each group is a Pandas
DataFrame
▪ “split-apply-combine” pattern:
▪ Split data into groups
▪ Apply function on each group
▪ Combine results into new Spark DataFrame
NEW
Parallelising training of independent models
A
C
B
Single Node ML Library
Pandas Function API - Grouped Map
▪ DataFrame.groupby().applyInPandas()
▪ Directly apply a Python native function against a
Spark DataFrame as if each group is a Pandas
DataFrame
▪ “split-apply-combine” pattern:
▪ Split data into groups
▪ Apply function on each group
▪ Combine results into new Spark DataFrame
NEW
Parallelising training of independent models
A
C
B
Single Node ML Library
def train_model(pd.Dataframe)->pd.DataFrame:
# fit single-node model
sklearn_model.fit()
…
return pandas_df
spark_df.groupBy("device_id")
.applyInPandas(train_model,
schema=return_schema)
▪ Open source hyperparameter optimisation package
▪ Enables either serial or parallel optimisation over provided search spaces
▪ Can tune both distributed and single node libraries
▪ HOWEVER: distributed training and distributed tuning don’t mix
▪ Bayesian based approach
▪ Adaptively selects new hyperparameter settings to explore based on prior results
▪ Enables exploration of the hyperparameter space in an intelligent way
▪ Allows a wider search space with more hyperparameters
Hyperopt
Parallelising Hyperparameter Optimisation
��
𝝲
𝝱
Combination
▪ Distributed Hyperopt with a single node library
▪ SparkTrials
𝝰
𝝲
𝝱
Hyperopt
Parallelising Hyperparameter Optimisation
��
𝝲
𝝱
Combination
▪ Pandas UDFs can accept an iterator of
pandas.Series or pandas.DataFrame
▪ Spark DataFrame is split into batches and the
function called for each batch
▪ Iterator negates the need to repeatedly load the
same model for every batch in the same Python
worker process
Inference
Pandas Scalar Iterator UDF
Distributing inference
▪ Pandas UDFs can accept an iterator of
pandas.Series or pandas.DataFrame
▪ Spark DataFrame is split into batches and the
function called for each batch
▪ Iterator negates the need to repeatedly load the
same model for every batch in the same Python
worker process
Inference
Pandas Scalar Iterator UDF
Distributing inference
@pandas_udf
def predict_udf(iterator):
# load model
model = ...
for features in iterator:
yield pd.Series(model.predict(features))
spark_df.withColumn("prediction",
predict_udf(*input_cols))
Training Phase Model Artefact Inference
Model Training Tuning Predictions
Conclusion
Spark MLlib
Pandas
Function API
Hyperopt Pandas Scalar
Iterator UDF
Distributing workloads allows you to scale, either by using libraries
that are multi or single node to suit your project.
Notebook: bit.ly/scaling_ml_spark_2020
Pandas UDF Blog post - bit.ly/Pandas_UDF
Docs:
■ MLlib - bit.ly/ML_lib
■ Hyperopt- bit.ly/hyperopt_spark
■ Pandas Grouped Map- bit.ly/grouped_map
Resources
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.
Ad

More Related Content

What's hot (20)

Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in Production
Databricks
 
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AI
Ning Jiang
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
Databricks
 
Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)
Hayim Makabee
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
Databricks
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Databricks
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
jeykottalam
 
MLOps by Sasha Rosenbaum
MLOps by Sasha RosenbaumMLOps by Sasha Rosenbaum
MLOps by Sasha Rosenbaum
Sasha Rosenbaum
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
Databricks
 
What is MLOps
What is MLOpsWhat is MLOps
What is MLOps
Henrik Skogström
 
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Ed Fernandez
 
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfRun Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Anya Bida
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
Khalid Salama
 
Automatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLAutomatic Machine Learning, AutoML
Automatic Machine Learning, AutoML
Himadri Mishra
 
MLOps.pptx
MLOps.pptxMLOps.pptx
MLOps.pptx
AllenPeter7
 
Simplifying Model Management with MLflow
Simplifying Model Management with MLflowSimplifying Model Management with MLflow
Simplifying Model Management with MLflow
Databricks
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
DataWorks Summit
 
Mlflow with databricks
Mlflow with databricksMlflow with databricks
Mlflow with databricks
Liangjun Jiang
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Real World End to End machine Learning Pipeline
Real World End to End machine Learning PipelineReal World End to End machine Learning Pipeline
Real World End to End machine Learning Pipeline
Srivatsan Srinivasan
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in Production
Databricks
 
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AI
Ning Jiang
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
Databricks
 
Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)
Hayim Makabee
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
Databricks
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Databricks
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
jeykottalam
 
MLOps by Sasha Rosenbaum
MLOps by Sasha RosenbaumMLOps by Sasha Rosenbaum
MLOps by Sasha Rosenbaum
Sasha Rosenbaum
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
Databricks
 
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Ed Fernandez
 
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfRun Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Anya Bida
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
Khalid Salama
 
Automatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLAutomatic Machine Learning, AutoML
Automatic Machine Learning, AutoML
Himadri Mishra
 
Simplifying Model Management with MLflow
Simplifying Model Management with MLflowSimplifying Model Management with MLflow
Simplifying Model Management with MLflow
Databricks
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
DataWorks Summit
 
Mlflow with databricks
Mlflow with databricksMlflow with databricks
Mlflow with databricks
Liangjun Jiang
 
Real World End to End machine Learning Pipeline
Real World End to End machine Learning PipelineReal World End to End machine Learning Pipeline
Real World End to End machine Learning Pipeline
Srivatsan Srinivasan
 

Similar to Scaling Machine Learning with Apache Spark (20)

Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache Spark
QuantUniversity
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
DataWorks Summit/Hadoop Summit
 
Spark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef Habdank
Spark Summit
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
Databricks
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
Md. Mahedi Kaysar
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Josef A. Habdank
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache Spark
QuantUniversity
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache Spark
Meeraj Kunnumpurath
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Databricks
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Jason Dai
 
MLeap: Release Spark ML Pipelines
MLeap: Release Spark ML PipelinesMLeap: Release Spark ML Pipelines
MLeap: Release Spark ML Pipelines
DataWorks Summit/Hadoop Summit
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Data platform at Samsung (Big Learning)
Data platform at Samsung (Big Learning)Data platform at Samsung (Big Learning)
Data platform at Samsung (Big Learning)
ZhuanzhuanDing
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
DB Tsai
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Databricks
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
Databricks
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
DESMOND YUEN
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ... MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
Databricks
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
Miklos Christine
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
Databricks
 
Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache Spark
QuantUniversity
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
DataWorks Summit/Hadoop Summit
 
Spark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef Habdank
Spark Summit
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
Databricks
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
Md. Mahedi Kaysar
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Josef A. Habdank
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache Spark
QuantUniversity
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache Spark
Meeraj Kunnumpurath
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Databricks
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Jason Dai
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Data platform at Samsung (Big Learning)
Data platform at Samsung (Big Learning)Data platform at Samsung (Big Learning)
Data platform at Samsung (Big Learning)
ZhuanzhuanDing
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
DB Tsai
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Databricks
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
Databricks
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
DESMOND YUEN
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ... MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
Databricks
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
Miklos Christine
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
Databricks
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

Analysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docxAnalysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
hershtara1
 
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
Taqyea
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
AI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptxAI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptx
AyeshaJalil6
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
muhammed84essa
 
Mining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - MicrosoftMining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - Microsoft
Process mining Evangelist
 
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial IntelligenceDr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug
 
national income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptxnational income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptx
j2492618
 
Process Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce DowntimeProcess Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce Downtime
Process mining Evangelist
 
real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682
way to join real illuminati Agent In Kampala Call/WhatsApp+256782561496/0756664682
 
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf
dominikamizerska1
 
Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]
globibo
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?
Process mining Evangelist
 
Understanding Complex Development Processes
Understanding Complex Development ProcessesUnderstanding Complex Development Processes
Understanding Complex Development Processes
Process mining Evangelist
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docxAnalysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
hershtara1
 
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
Taqyea
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
AI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptxAI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptx
AyeshaJalil6
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
muhammed84essa
 
Mining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - MicrosoftMining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - Microsoft
Process mining Evangelist
 
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial IntelligenceDr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug
 
national income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptxnational income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptx
j2492618
 
Process Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce DowntimeProcess Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce Downtime
Process mining Evangelist
 
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf
dominikamizerska1
 
Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]
globibo
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?
Process mining Evangelist
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 

Scaling Machine Learning with Apache Spark

  • 1. Scaling Machine Learning with Apache Spark Holly Smith Niall Turbitt
  • 2. About Holly Smith Senior Consultant at Databricks ▪ Professional Services and Training ▪ Experience ▪ Credit Risk & Decisioning ▪ Mobile Banking ▪ Forecasting & Optimisation ▪ BSc Mathematics University of Greenwich
  • 3. About Niall Turbitt Senior Data Scientist at Databricks ▪ Professional Services and Training ▪ Experience ▪ e-Commerce ▪ Supply Chain and Logistics ▪ Recommender Systems & Personalisation ▪ MS Statistics University College Dublin ▪ BA Mathematics & Economics Trinity College Dublin
  • 4. Outline ▪ Motivation ▪ Spark Architecture Recap ▪ Paradigms of ML on Spark: ■ Training & Tuning ▪ Spark MLlib ▪ Pandas Function APIs ▪ Hyperopt ■ Inference ▪ Pandas UDFs
  • 5. Motivation ▪ Confusion around where and when to use Spark for Machine Learning ▪ Spark is powerful: harness the full potential of Spark for Machine Learning ▪ Fast moving environment ▪ Use new APIs and techniques before they become mainstream
  • 6. Motivation ▪ Confusion around where and when to use Spark for Machine Learning ▪ Spark is powerful: harness the full potential of Spark for Machine Learning ▪ Fast moving environment ▪ Use new APIs and techniques before they become mainstream
  • 7. Refresher: Spark Architecture Cluster Driver Worker Core Core Core Core = a single node Worker Core Core Core Core Worker Core Core Core Core Worker Core Core Core Core
  • 9. Training Phase Model Artefact Inference Model Training Tuning The ML Lifecycle Predictions
  • 10. Training Phase Model Artefact Inference Model Training Tuning Predictions The ML Lifecycle
  • 11. Data Parallel Training Distributed MLLibrary Training Data Nodes Model(s) Training a single model across multiple nodes Creates one model artefact ML Training & Tuning on Spark
  • 12. Data Parallel Training Training Data Nodes Model(s) Training a single model across multiple nodes Creates one model artefact Parallel Model Training SingleNode MLLibrary Training multiple unique versions of a model in parallel Sending different groups of data to each node, creating a model per group A C B ML Training & Tuning on Spark Distributed MLLibrary
  • 13. Data Parallel Training Training Data Nodes Model(s) Training a single model across multiple nodes Creates one model artefact Parallel Model Training SingleNode MLLibrary Training multiple unique versions of a model in parallel Sending different groups of data to each node, creating a model per group Parallel Hyperparameter Optimisation Evaluating multiple hyperparameter configurations in parallel Combination 𝝰 𝝲 𝝱 A C B ML Training & Tuning on Spark Distributed MLLibrary
  • 14. Model New Data Nodes Predictions ... ... ML Inference on Spark For both distributed and single node ML libraries: 1. Take trained model 2. Distribute new instances 3. Apply model in parallel
  • 15. ML Project Considerations ▪ Data Dependent ▪ Compute Resources Available ▪ Single machine vs distributed computing ▪ Inference: Deployment Requirements Throughput Latency Example Batch High Hours to days Customer churn prediction Streaming Medium Seconds to minutes Predictive maintenance Real-time Low Milliseconds Fraud detection
  • 16. ▪ Spark’s Machine Learning Library ▪ ML algorithms ▪ Featurization ▪ Pipelines ▪ MLlib vs sklearn ▪ A note on terminology: Parallelising Single-Model Training Spark MLlib Distributed ML Library What is meant by “MLlib” Spark.mllib RDD based API Maintenance Mode Spark.ml Dataframe based API Recommended
  • 17. ▪ Spark’s Machine Learning Library ▪ ML algorithms ▪ Featurization ▪ Pipelines ▪ MLlib vs sklearn ▪ A note on terminology: Parallelising Single-Model Training from pyspark.ml.regression import LinearRegression train_df = spark.read… test_df = spark.read… lr = LinearRegression().fit(train_df) predictions = lr.transform(test_df) Spark MLlib Distributed ML Library What is meant by “MLlib” Spark.mllib RDD based API Maintenance Mode Spark.ml Dataframe based API Recommended
  • 18. Pandas Function API - Grouped Map ▪ DataFrame.groupby().applyInPandas() ▪ Directly apply a Python native function against a Spark DataFrame as if each group is a Pandas DataFrame ▪ “split-apply-combine” pattern: ▪ Split data into groups ▪ Apply function on each group ▪ Combine results into new Spark DataFrame NEW Parallelising training of independent models A C B Single Node ML Library
  • 19. Pandas Function API - Grouped Map ▪ DataFrame.groupby().applyInPandas() ▪ Directly apply a Python native function against a Spark DataFrame as if each group is a Pandas DataFrame ▪ “split-apply-combine” pattern: ▪ Split data into groups ▪ Apply function on each group ▪ Combine results into new Spark DataFrame NEW Parallelising training of independent models A C B Single Node ML Library def train_model(pd.Dataframe)->pd.DataFrame: # fit single-node model sklearn_model.fit() … return pandas_df spark_df.groupBy("device_id") .applyInPandas(train_model, schema=return_schema)
  • 20. ▪ Open source hyperparameter optimisation package ▪ Enables either serial or parallel optimisation over provided search spaces ▪ Can tune both distributed and single node libraries ▪ HOWEVER: distributed training and distributed tuning don’t mix ▪ Bayesian based approach ▪ Adaptively selects new hyperparameter settings to explore based on prior results ▪ Enables exploration of the hyperparameter space in an intelligent way ▪ Allows a wider search space with more hyperparameters Hyperopt Parallelising Hyperparameter Optimisation �� 𝝲 𝝱 Combination
  • 21. ▪ Distributed Hyperopt with a single node library ▪ SparkTrials 𝝰 𝝲 𝝱 Hyperopt Parallelising Hyperparameter Optimisation �� 𝝲 𝝱 Combination
  • 22. ▪ Pandas UDFs can accept an iterator of pandas.Series or pandas.DataFrame ▪ Spark DataFrame is split into batches and the function called for each batch ▪ Iterator negates the need to repeatedly load the same model for every batch in the same Python worker process Inference Pandas Scalar Iterator UDF Distributing inference
  • 23. ▪ Pandas UDFs can accept an iterator of pandas.Series or pandas.DataFrame ▪ Spark DataFrame is split into batches and the function called for each batch ▪ Iterator negates the need to repeatedly load the same model for every batch in the same Python worker process Inference Pandas Scalar Iterator UDF Distributing inference @pandas_udf def predict_udf(iterator): # load model model = ... for features in iterator: yield pd.Series(model.predict(features)) spark_df.withColumn("prediction", predict_udf(*input_cols))
  • 24. Training Phase Model Artefact Inference Model Training Tuning Predictions Conclusion Spark MLlib Pandas Function API Hyperopt Pandas Scalar Iterator UDF Distributing workloads allows you to scale, either by using libraries that are multi or single node to suit your project.
  • 25. Notebook: bit.ly/scaling_ml_spark_2020 Pandas UDF Blog post - bit.ly/Pandas_UDF Docs: ■ MLlib - bit.ly/ML_lib ■ Hyperopt- bit.ly/hyperopt_spark ■ Pandas Grouped Map- bit.ly/grouped_map Resources
  • 26. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  翻译: