SlideShare a Scribd company logo
WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Denny Lee, Databricks
Augmenting Machine Learning
with Databricks Labs AutoML
Toolkit
#UnifiedDataAnalytics #SparkAISummit
Agenda
• Discuss traditional ML pipeline problem and all of
its stages
• How AutoML Toolkit solves these problems
• Hyperparameter Optimization
• Choosing Models
• Scaling AutoML Toolkit Best Practices
3#UnifiedDataAnalytics #SparkAISummit
About Speaker
Denny Lee
Developer Advocate, Databricks
• Worked with Apache Spark™ since 0.5
• Former Senior Director Data Science Engineering at
Concur
• On Project Isotope incubation team that built what
is now known as Azure HDInsight
• Former SQLCAT DW BI Lead at Microsoft
4#UnifiedDataAnalytics #SparkAISummit
Let’s start at the end
5
Let’s start at the end
• AutoML’s FeatureImportances automates the discovery
of which feature discovery
• AutoML’s AutomationRunner automates the building,
training, execution, and tuning of a Machine Learning pipeline
to create an optimal ML model.
• Improved AUC from 0.6732 to 0.995!
• Business value: $23.22M to $267.24M saved!
• Less code, faster!
6
7
ML Pipeline Stages
8
ML Pipeline Stages
Traditional ML Pipelines
Identify Important Features
9
10
Exploratory Analysis to
Identify Features
AutoML Toolkit
Identify Important Features
11
12
ML Pipeline with AutoML Toolkit
AutoML | FeatureImportances
// Calculate Feature Importance (fi)
val fiConfig = ConfigurationGenerator.generateConfigFromMap("XGBoost",
"classifier", genericMapOverrides)
// Since we're using XGBoost, set parallelism <= 2x number of nodes
fiConfig.tunerConfig.tunerParallelism = nodeCount * 2
val fiMainConfig =
ConfigurationGenerator.generateFeatureImportanceConfig(fiConfig)
// Generate Feature Importance
val importances = new FeatureImportances(sourceData, fiMainConfig, "count", 20.0)
.generateFeatureImportances()
13
AutoML | FeatureImportances
14
15
ML Pipeline Stages
Traditional Model Building and Tuning
Building and Tuning Models
16
Hand-made Model
17
• Traditionally, when we build a ML pipeline, we will need to a number of tasks including:
• Defining our category (text-based) and numeric columns
• Based on previous analysis, you can determine which features (i.e. which columns to include for your ML model)
• For numeric columns, ensure they are double or float data types
• For category columns, convert them using a stringIndexer and one-hot encoding to create a numeric representation of the category data
• Build and train our ML pipeline to create our ML model (in this case, an XGBoost mode)
• For example, put together imputer, stringIndexer, One-Hot encoding of category data
• Create a vector (e.g. vectorAssembler) to put together these features
• Apply a standard scaler to the values to minimize the impact of outliers
• Execute the model against our dataset
• Review the metrics (e.g., AUC)
• Tune the model using a Cross Validator
• The better you understand the model, the more likely you will provide better hyperparameters for cross validation
• i.e. need to choose a solid set of parameters (e.g. paramGrid)
• Review the metrics again (e.g. AUC)
• Review confusion matrix (in the case of binary classification)
• Review business value
Hand-made Model
18
• Traditionally, when we build a ML pipeline, we will need to a number of tasks including:
• Defining our category (text-based) and numeric columns
• Based on previous analysis, you can determine which features (i.e. which columns to include for your ML model)
• For numeric columns, ensure they are double or float data types
• For category columns, convert them using a stringIndexer and one-hot encoding to create a numeric representation of the category data
• Build and train our ML pipeline to create our ML model (in this case, an XGBoost mode)
• For example, put together imputer, stringIndexer, One-Hot encoding of category data
• Create a vector (e.g. vectorAssembler) to put together these features
• Apply a standard scaler to the values to minimize the impact of outliers
• Execute the model against our dataset
• Review the metrics (e.g., AUC)
• Tune the model using a Cross Validator
• The better you understand the model, the more likely you will provide better hyperparameters for cross validation
• i.e. need to choose a solid set of parameters (e.g. paramGrid)
• Review the metrics again (e.g. AUC)
• Review confusion matrix (in the case of binary classification)
• Review business value
Can we make this easier?
AutoML Model Building and Tuning
Building and Tuning Models
19
20
ML Pipeline with AutoML Toolkit
AutoML | AutomationRunner
val modelingType = "XGBoost"
val conf = ConfigurationGenerator.generateConfigFromMap(modelingType,…)
// Adjust model tuner configuration
conf.tunerConfig.tunerParallelism = nodeCount
// Generate configuration
val XGBConfig = ConfigurationGenerator.generateMainConfig(conf)
// Select on the important features
val runner = new AutomationRunner(sourceData).setMainConfig(XGBConfig)
.runWithConfusionReport()
21
22
Model, Metrics, Configs Saved
AUC from 0.6732 to 0.995!
Clearing up the Confusion
23
Predicted Label Predicted Label
Hand-Made Model AutoML Model
Business Value
24
From $23.22M to $267.24M saved!
Let’s end at the end
• AutoML’s FeatureImportances automates the discovery
of which feature discovery
• AutoML’s AutomationRunner automates the building,
training, execution, and tuning of a Machine Learning pipeline
to create an optimal ML model.
• Improved AUC from 0.6732 to 0.995!
• Business value: $23.22M to $267.24M saved!
• Less code, faster!
25
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
Ad

More Related Content

What's hot (20)

The More the Merrier: Scaling Model Building Infrastructure at Zendesk
The More the Merrier: Scaling Model Building Infrastructure at ZendeskThe More the Merrier: Scaling Model Building Infrastructure at Zendesk
The More the Merrier: Scaling Model Building Infrastructure at Zendesk
Databricks
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Databricks
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
Databricks
 
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Databricks
 
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Databricks
 
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4jTransforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
Fred Madrid
 
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Databricks
 
No REST till Production – Building and Deploying 9 Models to Production in 3 ...
No REST till Production – Building and Deploying 9 Models to Production in 3 ...No REST till Production – Building and Deploying 9 Models to Production in 3 ...
No REST till Production – Building and Deploying 9 Models to Production in 3 ...
Databricks
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
Pavel Hardak
 
Machine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache SparkMachine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache Spark
Databricks
 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine Learning
Databricks
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
Databricks
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with Databricks
Databricks
 
Data Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at ZalandoData Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at Zalando
Databricks
 
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFsAutomating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Databricks
 
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementMLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
Databricks
 
Automated Production Ready ML at Scale
Automated Production Ready ML at ScaleAutomated Production Ready ML at Scale
Automated Production Ready ML at Scale
Databricks
 
Bridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFramesBridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFrames
Databricks
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat Detection
Databricks
 
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Databricks
 
The More the Merrier: Scaling Model Building Infrastructure at Zendesk
The More the Merrier: Scaling Model Building Infrastructure at ZendeskThe More the Merrier: Scaling Model Building Infrastructure at Zendesk
The More the Merrier: Scaling Model Building Infrastructure at Zendesk
Databricks
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Databricks
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
Databricks
 
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Databricks
 
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Databricks
 
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4jTransforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
Fred Madrid
 
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Databricks
 
No REST till Production – Building and Deploying 9 Models to Production in 3 ...
No REST till Production – Building and Deploying 9 Models to Production in 3 ...No REST till Production – Building and Deploying 9 Models to Production in 3 ...
No REST till Production – Building and Deploying 9 Models to Production in 3 ...
Databricks
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
Pavel Hardak
 
Machine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache SparkMachine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache Spark
Databricks
 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine Learning
Databricks
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
Databricks
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with Databricks
Databricks
 
Data Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at ZalandoData Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at Zalando
Databricks
 
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFsAutomating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Databricks
 
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementMLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
Databricks
 
Automated Production Ready ML at Scale
Automated Production Ready ML at ScaleAutomated Production Ready ML at Scale
Automated Production Ready ML at Scale
Databricks
 
Bridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFramesBridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFrames
Databricks
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat Detection
Databricks
 
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Databricks
 

Similar to Augmenting Machine Learning with Databricks Labs AutoML Toolkit (20)

Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Databricks
 
Practical data science
Practical data sciencePractical data science
Practical data science
Ding Li
 
Key projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AIKey projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AI
Vijayananda Mohire
 
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn PipelinesRevolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Philip Goddard
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
Ivo Andreev
 
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Philip Goddard
 
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
Julien SIMON
 
Azure machine learning
Azure machine learningAzure machine learning
Azure machine learning
Simone Caldaro
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...
Databricks
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
Benjamin Bengfort
 
(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning
Rebecca Bilbro
 
Robust and declarative machine learning pipelines for predictive buying at Ba...
Robust and declarative machine learning pipelines for predictive buying at Ba...Robust and declarative machine learning pipelines for predictive buying at Ba...
Robust and declarative machine learning pipelines for predictive buying at Ba...
Gianmario Spacagna
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Databricks
 
When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML Pipelines
Stitch Fix Algorithms
 
Machine learning
Machine learningMachine learning
Machine learning
Saravanan Subburayal
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Chetan Khatri
 
Catalyst optimizer
Catalyst optimizerCatalyst optimizer
Catalyst optimizer
Ayub Mohammad
 
Python for Machine Learning_ A Comprehensive Overview.pptx
Python for Machine Learning_ A Comprehensive Overview.pptxPython for Machine Learning_ A Comprehensive Overview.pptx
Python for Machine Learning_ A Comprehensive Overview.pptx
KuldeepSinghBrar3
 
In memory OLAP engine
In memory OLAP engineIn memory OLAP engine
In memory OLAP engine
WO Community
 
Machine Learning With ML.NET
Machine Learning With ML.NETMachine Learning With ML.NET
Machine Learning With ML.NET
Dev Raj Gautam
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Databricks
 
Practical data science
Practical data sciencePractical data science
Practical data science
Ding Li
 
Key projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AIKey projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AI
Vijayananda Mohire
 
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn PipelinesRevolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Philip Goddard
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
Ivo Andreev
 
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Philip Goddard
 
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
Julien SIMON
 
Azure machine learning
Azure machine learningAzure machine learning
Azure machine learning
Simone Caldaro
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...
Databricks
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
Benjamin Bengfort
 
(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning
Rebecca Bilbro
 
Robust and declarative machine learning pipelines for predictive buying at Ba...
Robust and declarative machine learning pipelines for predictive buying at Ba...Robust and declarative machine learning pipelines for predictive buying at Ba...
Robust and declarative machine learning pipelines for predictive buying at Ba...
Gianmario Spacagna
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Databricks
 
When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML Pipelines
Stitch Fix Algorithms
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Chetan Khatri
 
Python for Machine Learning_ A Comprehensive Overview.pptx
Python for Machine Learning_ A Comprehensive Overview.pptxPython for Machine Learning_ A Comprehensive Overview.pptx
Python for Machine Learning_ A Comprehensive Overview.pptx
KuldeepSinghBrar3
 
In memory OLAP engine
In memory OLAP engineIn memory OLAP engine
In memory OLAP engine
WO Community
 
Machine Learning With ML.NET
Machine Learning With ML.NETMachine Learning With ML.NET
Machine Learning With ML.NET
Dev Raj Gautam
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682
way to join real illuminati Agent In Kampala Call/WhatsApp+256782561496/0756664682
 
Automated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptxAutomated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptx
handrymaharjan23
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
Controlling Financial Processes at a Municipality
Controlling Financial Processes at a MunicipalityControlling Financial Processes at a Municipality
Controlling Financial Processes at a Municipality
Process mining Evangelist
 
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
muhammed84essa
 
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Jayantilal Bhanushali
 
Time series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdfTime series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdf
asmaamahmoudsaeed
 
national income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptxnational income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptx
j2492618
 
hersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distributionhersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distribution
hershtara1
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 
Automation Platforms and Process Mining - success story
Automation Platforms and Process Mining - success storyAutomation Platforms and Process Mining - success story
Automation Platforms and Process Mining - success story
Process mining Evangelist
 
2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf
dominikamizerska1
 
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docxAnalysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
hershtara1
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
Transforming health care with ai powered
Transforming health care with ai poweredTransforming health care with ai powered
Transforming health care with ai powered
gowthamarvj
 
Multi-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline OrchestrationMulti-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline Orchestration
Romi Kuntsman
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
What is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdfWhat is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdf
SaikatBasu37
 
Automated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptxAutomated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptx
handrymaharjan23
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
Controlling Financial Processes at a Municipality
Controlling Financial Processes at a MunicipalityControlling Financial Processes at a Municipality
Controlling Financial Processes at a Municipality
Process mining Evangelist
 
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
muhammed84essa
 
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Jayantilal Bhanushali
 
Time series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdfTime series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdf
asmaamahmoudsaeed
 
national income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptxnational income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptx
j2492618
 
hersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distributionhersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distribution
hershtara1
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 
Automation Platforms and Process Mining - success story
Automation Platforms and Process Mining - success storyAutomation Platforms and Process Mining - success story
Automation Platforms and Process Mining - success story
Process mining Evangelist
 
2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf
dominikamizerska1
 
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docxAnalysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
hershtara1
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
Transforming health care with ai powered
Transforming health care with ai poweredTransforming health care with ai powered
Transforming health care with ai powered
gowthamarvj
 
Multi-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline OrchestrationMulti-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline Orchestration
Romi Kuntsman
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
What is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdfWhat is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdf
SaikatBasu37
 

Augmenting Machine Learning with Databricks Labs AutoML Toolkit

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. Denny Lee, Databricks Augmenting Machine Learning with Databricks Labs AutoML Toolkit #UnifiedDataAnalytics #SparkAISummit
  • 3. Agenda • Discuss traditional ML pipeline problem and all of its stages • How AutoML Toolkit solves these problems • Hyperparameter Optimization • Choosing Models • Scaling AutoML Toolkit Best Practices 3#UnifiedDataAnalytics #SparkAISummit
  • 4. About Speaker Denny Lee Developer Advocate, Databricks • Worked with Apache Spark™ since 0.5 • Former Senior Director Data Science Engineering at Concur • On Project Isotope incubation team that built what is now known as Azure HDInsight • Former SQLCAT DW BI Lead at Microsoft 4#UnifiedDataAnalytics #SparkAISummit
  • 5. Let’s start at the end 5
  • 6. Let’s start at the end • AutoML’s FeatureImportances automates the discovery of which feature discovery • AutoML’s AutomationRunner automates the building, training, execution, and tuning of a Machine Learning pipeline to create an optimal ML model. • Improved AUC from 0.6732 to 0.995! • Business value: $23.22M to $267.24M saved! • Less code, faster! 6
  • 9. Traditional ML Pipelines Identify Important Features 9
  • 12. 12 ML Pipeline with AutoML Toolkit
  • 13. AutoML | FeatureImportances // Calculate Feature Importance (fi) val fiConfig = ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", genericMapOverrides) // Since we're using XGBoost, set parallelism <= 2x number of nodes fiConfig.tunerConfig.tunerParallelism = nodeCount * 2 val fiMainConfig = ConfigurationGenerator.generateFeatureImportanceConfig(fiConfig) // Generate Feature Importance val importances = new FeatureImportances(sourceData, fiMainConfig, "count", 20.0) .generateFeatureImportances() 13
  • 16. Traditional Model Building and Tuning Building and Tuning Models 16
  • 17. Hand-made Model 17 • Traditionally, when we build a ML pipeline, we will need to a number of tasks including: • Defining our category (text-based) and numeric columns • Based on previous analysis, you can determine which features (i.e. which columns to include for your ML model) • For numeric columns, ensure they are double or float data types • For category columns, convert them using a stringIndexer and one-hot encoding to create a numeric representation of the category data • Build and train our ML pipeline to create our ML model (in this case, an XGBoost mode) • For example, put together imputer, stringIndexer, One-Hot encoding of category data • Create a vector (e.g. vectorAssembler) to put together these features • Apply a standard scaler to the values to minimize the impact of outliers • Execute the model against our dataset • Review the metrics (e.g., AUC) • Tune the model using a Cross Validator • The better you understand the model, the more likely you will provide better hyperparameters for cross validation • i.e. need to choose a solid set of parameters (e.g. paramGrid) • Review the metrics again (e.g. AUC) • Review confusion matrix (in the case of binary classification) • Review business value
  • 18. Hand-made Model 18 • Traditionally, when we build a ML pipeline, we will need to a number of tasks including: • Defining our category (text-based) and numeric columns • Based on previous analysis, you can determine which features (i.e. which columns to include for your ML model) • For numeric columns, ensure they are double or float data types • For category columns, convert them using a stringIndexer and one-hot encoding to create a numeric representation of the category data • Build and train our ML pipeline to create our ML model (in this case, an XGBoost mode) • For example, put together imputer, stringIndexer, One-Hot encoding of category data • Create a vector (e.g. vectorAssembler) to put together these features • Apply a standard scaler to the values to minimize the impact of outliers • Execute the model against our dataset • Review the metrics (e.g., AUC) • Tune the model using a Cross Validator • The better you understand the model, the more likely you will provide better hyperparameters for cross validation • i.e. need to choose a solid set of parameters (e.g. paramGrid) • Review the metrics again (e.g. AUC) • Review confusion matrix (in the case of binary classification) • Review business value Can we make this easier?
  • 19. AutoML Model Building and Tuning Building and Tuning Models 19
  • 20. 20 ML Pipeline with AutoML Toolkit
  • 21. AutoML | AutomationRunner val modelingType = "XGBoost" val conf = ConfigurationGenerator.generateConfigFromMap(modelingType,…) // Adjust model tuner configuration conf.tunerConfig.tunerParallelism = nodeCount // Generate configuration val XGBConfig = ConfigurationGenerator.generateMainConfig(conf) // Select on the important features val runner = new AutomationRunner(sourceData).setMainConfig(XGBConfig) .runWithConfusionReport() 21
  • 22. 22 Model, Metrics, Configs Saved AUC from 0.6732 to 0.995!
  • 23. Clearing up the Confusion 23 Predicted Label Predicted Label Hand-Made Model AutoML Model
  • 24. Business Value 24 From $23.22M to $267.24M saved!
  • 25. Let’s end at the end • AutoML’s FeatureImportances automates the discovery of which feature discovery • AutoML’s AutomationRunner automates the building, training, execution, and tuning of a Machine Learning pipeline to create an optimal ML model. • Improved AUC from 0.6732 to 0.995! • Business value: $23.22M to $267.24M saved! • Less code, faster! 25
  • 26. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT
  翻译: