Scaling Machine Learning with Apache Spark

Holly Smith
Niall Turbitt

About
Holly Smith
Senior Consultant at Databricks
▪ Professional Services and Training
▪ Experience
  ▪ Credit Risk & Decisioning
  ▪ Mobile Banking
  ▪ Forecasting & Optimisation
▪ BSc Mathematics University of Greenwich

About
Niall Turbitt
Senior Data Scientist at Databricks
▪ Professional Services and Training
▪ Experience
  ▪ e-Commerce
  ▪ Supply Chain and Logistics
  ▪ Recommender Systems & Personalisation
▪ MS Statistics University College Dublin
▪ BA Mathematics & Economics Trinity College Dublin

Outline
▪ Motivation
▪ Spark Architecture Recap
▪ Paradigms of ML on Spark:
  ■ Training & Tuning
    ▪ Spark MLlib
    ▪ Pandas Function APIs
    ▪ Hyperopt
  ■ Inference
    ▪ Pandas UDFs

Motivation
▪ Confusion around where and when to use Spark for Machine Learning
▪ Spark is powerful: harness the full potential of Spark for Machine Learning
▪ Fast moving environment
  ▪ Use new APIs and techniques before they become mainstream

Refresher: Spark Architecture
[Diagram: a cluster made up of one driver and four workers. Each worker is a single node with four cores.]

The ML Lifecycle
[Diagram: Training Phase (Training, Tuning) → Model Artefact (Model) → Inference (Predictions)]

ML Training & Tuning on Spark
Data Parallel Training
▪ Library type: Distributed ML Library
▪ Training a single model across multiple nodes
▪ Creates one model artefact
[Diagram: Training Data → Nodes → Model(s)]

Data Parallel Training
▪ Library type: Distributed ML Library
Parallel Model Training
▪ Library type: Single Node ML Library
▪ Training multiple unique versions of a model in parallel
▪ Sending different groups of data to each node, creating a model per group (A, B, C)
Parallel Hyperparameter Optimisation
▪ Library type: Combination
▪ Evaluating multiple hyperparameter configurations in parallel (𝝰, 𝝱, 𝝲)

ML Inference on Spark
[Diagram: Model + New Data → Nodes → Predictions]
For both distributed and single node ML libraries:
1. Take trained model
2. Distribute new instances
3. Apply model in parallel
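The three steps above can be sketched in plain Python, with no Spark involved. Everything here is an assumption made for illustration: a thread pool stands in for the worker nodes, the "trained model" is a hypothetical linear function, and `partition` and `predict_partition` are made-up names rather than Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Step 1: take a trained model. Hypothetical stand-in: y = 2x + 1.
model = {"slope": 2.0, "intercept": 1.0}


def predict_partition(rows):
    """Apply the model to every row of one partition."""
    return [model["slope"] * x + model["intercept"] for x in rows]


def partition(rows, n_parts):
    """Split rows into at most n_parts contiguous chunks (rows must be non-empty)."""
    size = -(-len(rows) // n_parts)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


new_data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Step 2: distribute the new instances across the "workers".
parts = partition(new_data, n_parts=3)

# Step 3: apply the model to each partition in parallel, then
# concatenate the results in partition order.
with ThreadPoolExecutor(max_workers=3) as pool:
    predictions = [y for part in pool.map(predict_partition, parts) for y in part]
```

The model itself never changes during inference, which is why the same recipe works whether it came from a distributed or a single node library.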

ML Project Considerations
▪ Data Dependent
▪ Compute Resources Available
  ▪ Single machine vs distributed computing
▪ Inference: Deployment Requirements
Deployment   Throughput   Latency              Example
Batch        High         Hours to days        Customer churn prediction
Streaming    Medium       Seconds to minutes   Predictive maintenance
Real-time    Low          Milliseconds         Fraud detection

Spark MLlib
Distributed ML Library
Parallelising Single-Model Training
▪ Spark’s Machine Learning Library
  ▪ ML algorithms
  ▪ Featurization
  ▪ Pipelines
▪ MLlib vs sklearn
▪ A note on terminology: what is meant by “MLlib”
  ▪ spark.mllib: RDD-based API, in maintenance mode
  ▪ spark.ml: DataFrame-based API, recommended
    from pyspark.ml.regression import LinearRegression
    train_df = spark.read…
    test_df = spark.read…
    lr = LinearRegression().fit(train_df)
    predictions = lr.transform(test_df)
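The idea of data parallel training (one model, many nodes, each node seeing only its slice of the rows) can be illustrated without Spark. The sketch below is an assumption-heavy toy and not how MLlib's LinearRegression is implemented: it fits a one-feature linear model by gradient descent, a thread pool plays the role of the workers, and `partial_gradient` and `fit_data_parallel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_gradient(args):
    """Sum the squared-error gradients over one partition of (x, y) rows."""
    rows, w, b = args
    grad_w = sum(2 * (w * x + b - y) * x for x, y in rows)
    grad_b = sum(2 * (w * x + b - y) for x, y in rows)
    return grad_w, grad_b, len(rows)


def fit_data_parallel(partitions, lr=0.05, steps=1000):
    """Fit y = w * x + b; returns a single (w, b) pair for all partitions."""
    w, b = 0.0, 0.0
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        for _ in range(steps):
            # Each "worker" computes a partial gradient on its own rows.
            results = list(pool.map(partial_gradient, [(p, w, b) for p in partitions]))
            # The "driver" combines the partial results and updates the one model.
            n = sum(r[2] for r in results)
            w -= lr * sum(r[0] for r in results) / n
            b -= lr * sum(r[1] for r in results) / n
    return w, b


# Toy data generated from y = 3x + 2, split across three partitions.
partitions = [
    [(0.0, 2.0), (1.0, 5.0)],
    [(2.0, 8.0), (3.0, 11.0)],
    [(4.0, 14.0)],
]
w, b = fit_data_parallel(partitions)
```

However many partitions there are, the result is one model artefact, which is the defining property of this paradigm.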

Pandas Function API - Grouped Map (NEW)
Single Node ML Library
Parallelising training of independent models (A, B, C)
▪ DataFrame.groupby().applyInPandas()
▪ Directly apply a Python native function against a Spark DataFrame as if each group is a Pandas DataFrame
▪ “split-apply-combine” pattern:
  ▪ Split data into groups
  ▪ Apply function on each group
  ▪ Combine results into new Spark DataFrame
    import pandas as pd
    def train_model(pdf: pd.DataFrame) -> pd.DataFrame:
        # fit single-node model on this group's rows
        sklearn_model.fit()
        …
        return pandas_df
    # train_model runs once per device_id group, in parallel across the cluster
    spark_df.groupBy("device_id").applyInPandas(train_model, schema=return_schema)
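To make the split-apply-combine pattern concrete without a Spark cluster, here is a minimal plain-Python sketch. It is an illustration under assumptions, not the applyInPandas implementation: rows are tuples rather than DataFrames, the per-group "model" is a closed-form least-squares line, a thread pool stands in for the executors, and the column names and data are made up.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def train_model(group):
    """Fit y = slope * x + intercept to one group's rows by least squares."""
    device_id, rows = group
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return {
        "device_id": device_id,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "n_rows": n,
    }


# Hypothetical (device_id, x, y) rows.
rows = [
    ("A", 0.0, 1.0), ("A", 1.0, 3.0), ("A", 2.0, 5.0),  # y = 2x + 1
    ("B", 0.0, 0.0), ("B", 2.0, -2.0),                  # y = -x
    ("C", 1.0, 4.5), ("C", 3.0, 5.5),                   # y = 0.5x + 4
]

# Split: group the rows by key.
groups = defaultdict(list)
for device_id, x, y in rows:
    groups[device_id].append((x, y))

# Apply: train one independent model per group, in parallel.
with ThreadPoolExecutor() as pool:
    fitted = list(pool.map(train_model, sorted(groups.items())))

# Combine: gather the per-group results into one collection.
models = {m["device_id"]: m for m in fitted}
```

Each group's model is fitted in isolation, so the work parallelises cleanly as long as every group fits in one worker's memory.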

Hyperopt
Parallelising Hyperparameter Optimisation (Combination: 𝝰, 𝝱, 𝝲)
▪ Open source hyperparameter optimisation package
▪ Enables either serial or parallel optimisation over provided search spaces
▪ Can tune both distributed and single node libraries
  ▪ HOWEVER: distributed training and distributed tuning don’t mix. Tune a distributed library with serial trials, and keep parallel trials for single node libraries
▪ Bayesian-based approach
  ▪ Adaptively selects new hyperparameter settings to explore based on prior results
  ▪ Enables exploration of the hyperparameter space in an intelligent way
  ▪ Allows a wider search space with more hyperparameters

Hyperopt
Parallelising Hyperparameter Optimisation (Combination: 𝝰, 𝝱, 𝝲)
▪ Distributed Hyperopt with a single node library
  ▪ SparkTrials
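Evaluating several hyperparameter configurations at once can be sketched in plain Python. Note what this is and is not: it is a random search with a thread pool standing in for the cluster, not Hyperopt's Bayesian search and not SparkTrials. The objective function, the search space, and the names `objective` and `sample` are all hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def objective(params):
    """Hypothetical validation loss with its minimum (0.0) at lr=0.1, depth=6."""
    return (params["lr"] - 0.1) ** 2 + 0.01 * (params["depth"] - 6) ** 2


def sample(rng):
    """Draw one configuration from the (assumed) search space."""
    return {"lr": rng.uniform(0.001, 0.5), "depth": rng.randint(2, 12)}


rng = random.Random(0)  # seeded so the run is repeatable
candidates = [sample(rng) for _ in range(32)]

# Evaluate up to four configurations at a time; each evaluation would be
# one single-node training run in a real tuning job.
with ThreadPoolExecutor(max_workers=4) as pool:
    losses = list(pool.map(objective, candidates))

trials = list(zip(losses, candidates))
best_loss, best_params = min(trials, key=lambda t: t[0])
```

Because every trial here is drawn up front, nothing is learned from earlier results. An adaptive method such as Hyperopt's would instead use completed trials to choose the next configurations.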

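Inference
Pandas Scalar Iterator UDF
Distributing inference
▪ Pandas UDFs can accept an iterator of pandas.Series or pandas.DataFrame
▪ Spark DataFrame is split into batches and the function called for each batch
▪ Iterator removes the need to repeatedly load the same model for every batch in the same Python worker process
    from typing import Iterator, Tuple
    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    @pandas_udf("double")  # return type is an assumption; the slide omits it
    def predict_udf(iterator: Iterator[Tuple[pd.Series, ...]]) -> Iterator[pd.Series]:
        # load model once, before iterating over the batches
        model = ...
        for features in iterator:
            # features is a tuple of pandas.Series, one per input column
            yield pd.Series(model.predict(pd.concat(features, axis=1)))
    spark_df.withColumn("prediction", predict_udf(*input_cols))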
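The benefit of the iterator form can be shown with a small plain-Python comparison that counts model loads. This is only a sketch of the pattern, with no Spark or pandas involved: `load_model` is a hypothetical stand-in for an expensive load, the "model" just doubles its inputs, and batches are plain lists.

```python
stats = {"loads": 0}


def load_model():
    """Hypothetical expensive model load; counts how often it runs."""
    stats["loads"] += 1
    return lambda batch: [2 * x for x in batch]


def predict_per_batch(batch):
    """Per-batch style: the model is loaded again on every call."""
    model = load_model()
    return model(batch)


def predict_iter(batches):
    """Iterator style: load once, then stream over all batches."""
    model = load_model()
    for batch in batches:
        yield model(batch)


batches = [[1, 2], [3, 4], [5]]

per_batch_out = [predict_per_batch(b) for b in batches]
loads_per_batch = stats["loads"]  # one load per batch

stats["loads"] = 0
iter_out = list(predict_iter(iter(batches)))
loads_iter = stats["loads"]  # a single load for the whole stream
```

Both styles return the same predictions. The difference is only in how many times the load cost is paid, which matters when the model is large.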

Conclusion
[Diagram: Training Phase (Training, Tuning) → Model Artefact (Model) → Inference (Predictions)]
▪ Training: Spark MLlib, Pandas Function API
▪ Tuning: Hyperopt
▪ Inference: Pandas Scalar Iterator UDF
Distributing workloads allows you to scale, using either multi-node or single node libraries to suit your project.

Resources
Notebook: bit.ly/scaling_ml_spark_2020
Pandas UDF blog post: bit.ly/Pandas_UDF
Docs:
■ MLlib: bit.ly/ML_lib
■ Hyperopt: bit.ly/hyperopt_spark
■ Pandas Grouped Map: bit.ly/grouped_map
