Tuning ML Models: Scaling, Workflows, and Architecture

Joseph Bradley
Solutions Architect

About me
▪ Solutions Architect at Databricks (½ a year)
▪ Software Engineer at Databricks (5 ½ years)
▪ Apache Spark committer and PMC member

Global company with over 5,000 customers and 450+ partners
Original creators of popular data and machine learning open source projects
A unified data analytics platform for accelerating innovation across
data engineering, data science, and analytics

Hyperparameter
Model 1: “Dress”
Model 2: “Sneaker”
Why the difference?
Model 2 had better hyperparameter settings:
• Learning rate
• Model structure
• ...
What’s a hyperparameter?
• Statistical: assumptions about your model/data
• Practical: inputs your ML library does not learn from data
• Algorithmic: problem-dependent configs

Hyperparameter tuning
Expert knowledge
Not in this talk:
• Statistical best practices
• Overview of methods for tuning
In this talk:
• Data Science workflow best practices
• Tips for the big data and cloud computing space
Black-box tuning Ignore until needed
→ See references at end!

Hyperparameter tuning is tough
[Figure: fraction coverage (0 to 1) vs. # hyperparameters (1 to 10, 7 possible values each)]
Non-convex optimization
Curse of dimensionality
→ Computational cost
Unintuitive hyperparameters
• Regularization
• Neural net structure
• and many more...
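
The curse of dimensionality on this slide can be made concrete with a little arithmetic. The sketch below is one plausible reading of the plot, not a reconstruction of it: it assumes a full grid with 7 values per hyperparameter (from the slide) and a fixed trial budget of 1000 (a hypothetical number, not from the talk), and reports what fraction of the grid that budget can visit.

```python
VALUES_PER_HYPERPARAMETER = 7  # from the slide's plot
TRIAL_BUDGET = 1000            # hypothetical budget, not from the talk


def grid_size(num_hyperparameters: int,
              values_each: int = VALUES_PER_HYPERPARAMETER) -> int:
    """Number of points in a full grid over all hyperparameters."""
    return values_each ** num_hyperparameters


def fraction_covered(num_hyperparameters: int,
                     budget: int = TRIAL_BUDGET) -> float:
    """Fraction of the full grid that a fixed number of trials can visit."""
    return min(1.0, budget / grid_size(num_hyperparameters))


for d in range(1, 11):
    print(f"{d:2d} hyperparameters: grid={grid_size(d):>11,d}  "
          f"coverage={fraction_covered(d):.6f}")
```

Under these assumptions, coverage is complete up to 3 hyperparameters and falls off exponentially after that, which is the computational cost the slide points to.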

In this talk
Architectural patterns
Workflows
Tips and details
Getting started

Single-machine vs. distributed training
• Single-machine training
• Distributed training
• Training 1 model per group (per customer, product, etc.)

Single-machine training
Scale out via distributed hyperparameter tuning
→ Train 1 model per Spark task
[Diagram: driver and three workers, labeled "Tuning"]
Tools for this:
• Hyperopt + SparkTrials
• sklearn + joblibspark
• Pandas UDFs
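
The "1 model per task" pattern can be sketched without a cluster. In the minimal example below, a thread pool stands in for Spark tasks, and `train_and_score` is a hypothetical toy objective rather than a real model fit. The function names, the search space, and the trial counts are all assumptions for illustration. With the tools listed above, the pool would be replaced by Hyperopt with SparkTrials, joblibspark, or a Pandas UDF.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def train_and_score(params: dict) -> dict:
    """Hypothetical stand-in for fitting one single-machine model.

    Returns a toy validation loss that is lowest near learning_rate = 0.1.
    """
    loss = (params["learning_rate"] - 0.1) ** 2 + 0.01 * params["num_layers"]
    return {"params": params, "loss": loss}


def sample_params(rng: random.Random) -> dict:
    """Draw one hyperparameter setting (plain random search)."""
    return {
        "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
        "num_layers": rng.randint(1, 4),
    }


def tune(num_trials: int = 16, parallelism: int = 4, seed: int = 0) -> dict:
    """Evaluate independent trials in parallel and return the best one."""
    rng = random.Random(seed)
    candidates = [sample_params(rng) for _ in range(num_trials)]
    # Each candidate is an independent task, so trials can run side by side.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(train_and_score, candidates))
    return min(results, key=lambda r: r["loss"])


best = tune()
print(best)
```

Random search is used here only because it makes every trial independent. Adaptive methods choose new candidates from earlier results, so they trade some parallelism for better-informed trials.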

Distributed training
[Diagram: driver and three workers, labeled "Training", "Tuning", and "ML"]
Scale out via parallel tuning
→ Train 1 model at a time, or train 2+ models in parallel
Tools for this:
• Apache Spark ML
• Hyperopt

Training one model per group
[Diagram: driver and three workers, labeled "Distribute over groups", with "Tuning" on each worker]
Scale out by distributing over groups
→ Train 1 group’s model per Spark task
Tools for this:
• Apache Spark, Pandas UDFs
• Hyperopt
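
The per-group pattern amounts to "group the rows by key, then fit one small model per key". The sketch below shows that shape in plain Python. The rows, the group keys, and the least-squares `fit_line` helper are hypothetical. On Spark, each per-group fit would run as its own task, for example through a grouped Pandas UDF.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


# Hypothetical rows: (group key, feature, label)
rows = [
    ("customer_a", 1.0, 2.0), ("customer_a", 2.0, 4.0), ("customer_a", 3.0, 6.0),
    ("customer_b", 1.0, 5.0), ("customer_b", 2.0, 4.0), ("customer_b", 3.0, 3.0),
]

groups = defaultdict(list)
for key, x, y in rows:
    groups[key].append((x, y))

# One independent fit per group; nothing is shared between groups.
models = {key: fit_line(points) for key, points in groups.items()}
print(models)
```

Because the groups share nothing, tuning can also happen inside each group's task, which matches the "Tuning" label on each worker in the diagram.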

Getting started
Start small
• Bias towards smaller models, fewer iterations, etc.
• Small is cheaper, and it may suffice.
• Regardless, it gives a baseline.
Think before you tune
• Separate train/val/test sets.
• Use early stopping or smart tuning wherever possible.
• Pick hyperparameters carefully.
• Pick ranges carefully.
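
Two of the points above, separate train/val/test sets and early stopping, are easy to sketch. The helper names, the 60/20/20 split, and the patience value below are assumptions for illustration, and the validation losses are made-up numbers.

```python
import random


def split_train_val_test(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into three disjoint sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; otherwise it runs to the last epoch.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


train, val, test = split_train_val_test(range(100))
print(len(train), len(val), len(test))
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.66, 0.5]))
```

The validation set drives both tuning and early stopping, so the test set should stay untouched until the final evaluation.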

Models vs. pipelines
Best practice: Set up full pipeline before tuning.
• At what point does the pipeline compute the metric you care about?
Tuning models vs. tuning pipelines
• Tuning featurization
Optimizing tuning for pipelines
• Cache intermediate results
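
Caching intermediate results matters when many trials share the same upstream stage. The sketch below assumes a hypothetical two-stage pipeline: a featurization stage with its own hyperparameter (`bucket_width`) and a toy "model" with a `threshold`. `functools.lru_cache` memoizes the featurization, so it runs once per distinct `bucket_width` rather than once per trial.

```python
from functools import lru_cache

RAW_WORDS = ("tune", "models", "carefully", "and", "cache", "features")  # toy data


@lru_cache(maxsize=None)
def featurize(bucket_width: int) -> tuple:
    """Hypothetical expensive featurization stage, keyed by its hyperparameter."""
    return tuple(len(word) // bucket_width for word in RAW_WORDS)


def run_pipeline(bucket_width: int, threshold: float) -> float:
    """Featurize, apply a toy 'model', and return the metric of interest."""
    features = featurize(bucket_width)  # reused across trials with equal bucket_width
    return sum(1 for f in features if f > threshold) / len(features)


trials = [(w, t) for w in (1, 2) for t in (0.5, 1.5, 2.5)]
metrics = {trial: run_pipeline(*trial) for trial in trials}

info = featurize.cache_info()
print(f"{len(trials)} trials, {info.misses} featurization runs, {info.hits} cache hits")
```

The metric is computed at the end of the full pipeline, which is why the slide recommends setting up the whole pipeline before tuning any single stage.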

Evaluating and iterating
Validation data and metrics
• Record many metrics on both training and validation data.
Tuning hyperparameters independently vs. jointly
• Using smarter hyperparameter search algorithms
Tracking and reproducibility
• Data, code, params, metrics, models, metadata
• Tip: Parametrize code to facilitate tracking
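
The parametrization tip can be sketched as a single entry point that receives every knob through one `params` dict and returns everything worth tracking. The function name, the parameter names, and the metric formulas below are made up for illustration. A tracking tool such as MLflow would take the place of the plain list of dicts.

```python
import json


def run_experiment(params: dict) -> dict:
    """Hypothetical training entry point: every knob arrives through `params`."""
    # Toy metrics in place of real training. In this made-up formula, higher
    # regularization narrows the gap between training and validation loss.
    train_loss = 0.10 + 0.5 * params["reg"]
    val_loss = train_loss + 0.05 / (1.0 + 10.0 * params["reg"])
    return {
        "params": params,
        "metrics": {"train_loss": train_loss, "val_loss": val_loss},
        "metadata": {"data_version": params["data_version"], "seed": params["seed"]},
    }


runs = [
    run_experiment({"reg": reg, "seed": 0, "data_version": "v1"})
    for reg in (0.0, 0.01, 0.1)
]
best = min(runs, key=lambda run: run["metrics"]["val_loss"])
print(json.dumps(best, indent=2, sort_keys=True))
```

Because nothing is hard-coded inside the function, the logged `params` are sufficient to rerun any trial, and recording both training and validation metrics makes overfitting visible per run.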

Handling code
Getting code to workers
• Generally simple: Pandas UDFs or integrations (Hyperopt, etc.)
• Debugging code serialization
• Errors are often in worker logs and look like ML library bugs (e.g., “no module named X...”)
• Tip: For Python, import libraries within closures
Passing configs and credentials
• E.g., MLflow active runs and credentials
Helpful resource:
Distributed Hyperopt best practices
and troubleshooting
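
The tip about importing libraries within closures can be illustrated as follows. This is a minimal sketch: `math` stands in for an ML library, and `make_objective` is a hypothetical helper. The idea is that an import statement inside the function body runs wherever the function is executed, which on a cluster would be the worker, not only where the function was defined.

```python
def make_objective(scale: float):
    """Build the function that would be shipped to workers."""

    def objective(x: float) -> float:
        # Imported inside the closure, so the import is resolved at call
        # time in whichever process runs this function.
        import math

        return scale * math.sqrt(abs(x))

    return objective


objective = make_objective(scale=2.0)
print(objective(9.0))
```

The library still has to be installed on the workers. A "no module named X" error in the worker logs usually points to a missing installation or a serialization issue, not a bug in the library itself.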

Moving data
Single-machine ML
• Broadcast
• Load from blob storage
• Caching data
Distributed ML
• Caching data
Blob storage data prep
• Delta Lake format
• Petastorm and TFRecords
Helpful resources:
• Distributed Hyperopt best practices
and troubleshooting
• Prepare data for distributed
training
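
For single-machine ML, "load from blob storage" and "caching data" combine naturally: each worker process loads the dataset once and reuses it across trials. The sketch below assumes a temporary local CSV file as a stand-in for blob storage. The file contents, the `load_dataset` and `evaluate` names, and the toy loss are hypothetical.

```python
import csv
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=1)
def load_dataset(path: str) -> tuple:
    """Read the dataset once per process; later calls reuse the cached copy."""
    with open(path, newline="") as f:
        return tuple((float(x), float(y)) for x, y in csv.reader(f))


def evaluate(path: str, slope: float) -> float:
    """Toy trial: mean squared error of y = slope * x on the cached data."""
    data = load_dataset(path)
    return sum((y - slope * x) ** 2 for x, y in data) / len(data)


# A small temporary file stands in for a dataset in blob storage.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows([(1, 2), (2, 4), (3, 6)])
    losses = {slope: evaluate(path, slope) for slope in (1.0, 2.0, 3.0)}

info = load_dataset.cache_info()
print(losses, "loads:", info.misses, "cache hits:", info.hits)
```

Broadcasting is the alternative for data small enough to ship with the task. Loading from storage inside the worker avoids sending large datasets through the driver.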

Configuring clusters
Single-machine ML
• Sharing machine resources
• Selecting machine types
Distributed ML
• Right-sizing clusters
• Sharing a cluster
Helpful resource:
Scaling Hyperopt to Tune Machine
Learning Models in Python

Tools to know about
• Apache Spark: CrossValidator, TrainValidationSplit, Pandas UDFs
• MLflow: Tracking, Auto-logging
• ML in Python: Hyperopt SparkTrials & more (see last year’s SAIS talk)
• Scikit-learn: sklearn.model_selection, skopt, joblib+Spark
• TensorFlow: HParams, Keras Tuner, Model Optimization Toolkit
• PyTorch: Ax

Resources
Blog posts + example notebooks
• Hyperparameter Tuning with MLflow, Apache Spark MLlib and Hyperopt
Talks
• Best Practices for Hyperparameter Tuning with MLflow (SAIS 2019)
• Advanced Hyperparameter Optimization for Deep Learning with MLflow (SAIS 2019)
Project pages + docs
• Hyperopt docs and Github
• MLflow homepage and Github
Slides: tinyurl.com/sais2020-joseph
Notebook: tinyurl.com/sais2020-joseph-demo

Thanks!
Architectural patterns
Workflows
Tips and details
Getting started
Any questions?
