SlideShare a Scribd company logo
Optimization of
Recommendation
Pipelines using
Apache Spark
Hua Jiang and DB Tsai
Spark Summit SF - June 6, 2017
Netflix Scale
▪ Started streaming videos
10 years ago
▪ > 100M members
▪ > 190 countries
▪ > 1000 device types
▪ A third of peak US
downstream traffic
Turn on Netflix, and the
absolute best content for you
would automatically start playing
Recommendation System: Ideal State
Title Ranking
Everything is a RecommendationRowSelection&Ordering
Recommendations are
driven by machine
learning algorithms
Over 80% of what
members watch comes
from our
recommendations
Image
• Try an idea offline using historical data to see if it
would have made better recommendations
• If it would, deploy a live A/B test to see if it performs
well in production
Running Experiments
Design Experiment
Collect Label Dataset
Offline Feature
Generation
Model Training
Compute
Validation Metrics
Model Testing
Design a New Experiment to Test Out Different Ideas
Offline
Experiment
Online
System
Online
AB Testing
Running Experiments
Feature Generation: Feature Computation
S3
Snapshot
Model Training
Labeled Features
Label Data
Feature Model
Feature Encoders
Required
Features Data
Features
Fact Data
Required Data
1
3
42
5
6
Version 1: RDD-Based Feature Generation
• RDD: Resilient Distributed Dataset
• Our first version was written when only RDD operations
were available
• Opacity
▪ Data are opaque
▪ Computation is opaque
Version 1: RDD-Based Feature Generation
S3
Snapshot
Model Training
Labeled Features
RDD of POJO’s
Feature Model
Required
Feature
Maps of Data
POJO
Feature Encoders
Features
Required Data
Label Data
Version 1: RDD-Based Feature Generation
RDD operations are at low level.
You are responsible for performance
optimization.
RDD operations are on whole objects,
even if only one field is required.
Version 2: Using DataFrame
• DataFrame: Structured Data Organized into Named
Columns
• Transparency
▪ Data are structured
▪ Computations are planned based on common patterns
Version 2: Using DataFrame
Spark SQL optimizer, Catalyst, optimizes
DataFrame operation
Version 2: Using DataFrame
S3
Snapshot
Model Training
Labeled Features
RDD of POJO’s
Feature Model
Required
Feature
Feature Encoders
Maps of Data
POJO
Features
Required Data
Label Data
Version 2: Using DataFrame
S3
Snapshot
Model Training
Structured Labeled Features
Structured Data in
DataFrame
Feature Model
Feature Encoders
Required
Feature
Maps of Data
POJO
Features
Required Data
Label Data
Version 2: Using DataFrame
~3x run time gain in feature generation
▪ 50 ~ 80 executors
▪ ~3 cores per executor
▪ ~24GB per executor
Version 2: Using DataFrame
Let’s take a look at the physical plan of
the DataFrame taken from snapshot...
Version 2: Using DataFrame (with RDD[Row])
We use RDD[Row] from data frame and
create a new data frame by manipulating
the Row object.
S3
Snapshot
Structured Data in
DataFrame
Deduping Logic with
Row Manipulation
Version 2: Using DataFrame (with RDD[Row])
Even the new DataFrame, created from
RDD[Row], has columns with the same
names, they are different to Spark
Version 2: Using DataFrame (with RDD[Row])
Manipulations on row
objects are completely
opaque, blocking
optimizer from moving
operations around.
Version 3: Column Operations To The Rescue
Most of the operations are essentially
column(s) to column(s)
Version 3: Column Operations To The Rescue
Possible Replacement for row manipulations:
▪ Spark SQL Functions
▪ User-Defined Functions
▪ Catalyst Expression
Most of the operations are essentially
column(s) to column(s)
Version 3: Column Operations To The Rescue
Spark SQL Functions
(org.apache.spark.sql.functions)
▪ Built-in
▪ Highly efficient
▪ Internal data structure
▪ Code generation
▪ Supports rule-based optimization
▪ A variety of categories
▪ Aggregation
▪ Collection
▪ Math
▪ String
Version 3: Column Operations To The Rescue
User-Defined Functions (UDFs)
▪ Scala functions with certain types
▪ Highly flexible
▪ Data encoding/decoding required
Version 3: Column Operations To The Rescue
User-Defined Catalyst Expressions
▪ Flexible
▪ User defines the operations
▪ Efficient
▪ Internal data structure
▪ Code generation possible
Version 3: Column Operations To The Rescue
S3
Snapshot
Model Training
Structured Labeled Features
Feature Model
Structured Data in
DataFrame
Feature Encoders
Required
Feature
Maps of Data
POJO
Features
Required Data
Label Data
Row
Manipulation
Version 3: Column Operations To The Rescue
S3
Snapshot
Model Training
Structured Labeled Features
Feature Model
Structured Data in
DataFrame
Feature Encoders
Required
Feature
Maps of Data
POJO
Features
Required Data
Label Data
Catalyst
Expressions
Version 3: Column Operations To The Rescue
We replaced row manipulation with
Catalyst expression
S3
Snapshot
Structured Data in
DataFrame
case class RemoveDuplications(child: Expression) extends
UnaryExpression {
...
}
Catalyst Expression
Version 3: Column Operations To The Rescue
Version 3: Column Operations To The Rescue
Physical Plan with Column Operations
Version 3: Column Operations To The Rescue
~2x run time gain compared to version 2
▪ 50 ~ 80 executors
▪ ~3 cores per executor
▪ ~24GB per executor
Conclusions
▪ Time Travel in Offline Training
▪ Fact logging + offline feature generation
▪ Optimization
▪ Remove “black boxes”
▪ Prefer high-level DataFrame APIs
▪ Prefer column operations over row manipulations
(We are hiring...)
Questions?
Ad

More Related Content

What's hot (20)

Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script Transformation
Databricks
 
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceSpark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Wei Di
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...
Databricks
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Spark Summit
 
Using Production Profiles to Guide Optimizations
Using Production Profiles to Guide OptimizationsUsing Production Profiles to Guide Optimizations
Using Production Profiles to Guide Optimizations
Databricks
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Databricks
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
Databricks
 
Video Games at Scale: Improving the gaming experience with Apache Spark
Video Games at Scale: Improving the gaming experience with Apache SparkVideo Games at Scale: Improving the gaming experience with Apache Spark
Video Games at Scale: Improving the gaming experience with Apache Spark
Spark Summit
 
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger KingContext-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Databricks
 
Vertica And Spark: Connecting Computation And Data
Vertica And Spark: Connecting Computation And DataVertica And Spark: Connecting Computation And Data
Vertica And Spark: Connecting Computation And Data
Spark Summit
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
Databricks
 
Clipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemClipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving System
Databricks
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Databricks
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Spark Summit
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
Spark Summit
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
Databricks
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Databricks
 
Databricks with R: Deep Dive
Databricks with R: Deep DiveDatabricks with R: Deep Dive
Databricks with R: Deep Dive
Databricks
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script Transformation
Databricks
 
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceSpark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Wei Di
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...
Databricks
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Spark Summit
 
Using Production Profiles to Guide Optimizations
Using Production Profiles to Guide OptimizationsUsing Production Profiles to Guide Optimizations
Using Production Profiles to Guide Optimizations
Databricks
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Databricks
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
Databricks
 
Video Games at Scale: Improving the gaming experience with Apache Spark
Video Games at Scale: Improving the gaming experience with Apache SparkVideo Games at Scale: Improving the gaming experience with Apache Spark
Video Games at Scale: Improving the gaming experience with Apache Spark
Spark Summit
 
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger KingContext-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Databricks
 
Vertica And Spark: Connecting Computation And Data
Vertica And Spark: Connecting Computation And DataVertica And Spark: Connecting Computation And Data
Vertica And Spark: Connecting Computation And Data
Spark Summit
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
Databricks
 
Clipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemClipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving System
Databricks
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Databricks
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Spark Summit
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
Databricks
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Databricks
 
Databricks with R: Deep Dive
Databricks with R: Deep DiveDatabricks with R: Deep Dive
Databricks with R: Deep Dive
Databricks
 

Similar to Performance Optimization of Recommendation Training Pipeline at Netflix DB Tsai and Hua Jiangx) (20)

Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
Peyman Mohajerian
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
Rajesh Muppalla
 
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
SQUADEX
 
Data analytics master class: predict hotel revenue
Data analytics master class: predict hotel revenueData analytics master class: predict hotel revenue
Data analytics master class: predict hotel revenue
Kris Peeters
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
Databricks
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Spark Summit
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
Faisal Siddiqi
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
Karthik Murugesan
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
Syed Hadoop
 
The Adventure: BlackRay as a Storage Engine
The Adventure: BlackRay as a Storage EngineThe Adventure: BlackRay as a Storage Engine
The Adventure: BlackRay as a Storage Engine
fschupp
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
PAPIs.io
 
Day 1 - Technical Bootcamp azure synapse analytics
Day 1 - Technical Bootcamp azure synapse analyticsDay 1 - Technical Bootcamp azure synapse analytics
Day 1 - Technical Bootcamp azure synapse analytics
Armand272
 
GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGr...
GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGr...GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGr...
GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGr...
Turi, Inc.
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in Spark
Digital Vidya
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas Geerdink
Spark Summit
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official Documentation
Databricks
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
Vijayananda Mohire
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
Vijayananda Mohire
 
Big Data Science in Scala V2
Big Data Science in Scala V2 Big Data Science in Scala V2
Big Data Science in Scala V2
Anastasia Bobyreva
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
Peyman Mohajerian
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
Rajesh Muppalla
 
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
SQUADEX
 
Data analytics master class: predict hotel revenue
Data analytics master class: predict hotel revenueData analytics master class: predict hotel revenue
Data analytics master class: predict hotel revenue
Kris Peeters
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
Databricks
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Spark Summit
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
Faisal Siddiqi
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
Karthik Murugesan
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
Syed Hadoop
 
The Adventure: BlackRay as a Storage Engine
The Adventure: BlackRay as a Storage EngineThe Adventure: BlackRay as a Storage Engine
The Adventure: BlackRay as a Storage Engine
fschupp
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
PAPIs.io
 
Day 1 - Technical Bootcamp azure synapse analytics
Day 1 - Technical Bootcamp azure synapse analyticsDay 1 - Technical Bootcamp azure synapse analytics
Day 1 - Technical Bootcamp azure synapse analytics
Armand272
 
GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGr...
GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGr...GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGr...
GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGr...
Turi, Inc.
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in Spark
Digital Vidya
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas Geerdink
Spark Summit
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official Documentation
Databricks
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
Vijayananda Mohire
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
Vijayananda Mohire
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf
dominikamizerska1
 
Database administration and management chapter 12
Database administration and management chapter 12Database administration and management chapter 12
Database administration and management chapter 12
saniaafzalf1f2f3
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial IntelligenceDr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug
 
Transforming health care with ai powered
Transforming health care with ai poweredTransforming health care with ai powered
Transforming health care with ai powered
gowthamarvj
 
AI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptxAI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptx
AyeshaJalil6
 
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual FormStorage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Professional Content Writing's
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
Urban models for professional practice 03
Urban models for professional practice 03Urban models for professional practice 03
Urban models for professional practice 03
DanisseLoiDapdap
 
Multi-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline OrchestrationMulti-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline Orchestration
Romi Kuntsman
 
Controlling Financial Processes at a Municipality
Controlling Financial Processes at a MunicipalityControlling Financial Processes at a Municipality
Controlling Financial Processes at a Municipality
Process mining Evangelist
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
report (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhsreport (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhs
AngelPinedaTaguinod
 
Chapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptxChapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptx
PermissionTafadzwaCh
 
MLOps_with_SageMaker_Template_EN idioma inglés
MLOps_with_SageMaker_Template_EN idioma inglésMLOps_with_SageMaker_Template_EN idioma inglés
MLOps_with_SageMaker_Template_EN idioma inglés
FabianPierrePeaJacob
 
Sets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledgeSets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledge
saumyasl2020
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2
Dalal2Ali
 
2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf
dominikamizerska1
 
Database administration and management chapter 12
Database administration and management chapter 12Database administration and management chapter 12
Database administration and management chapter 12
saniaafzalf1f2f3
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial IntelligenceDr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug
 
Transforming health care with ai powered
Transforming health care with ai poweredTransforming health care with ai powered
Transforming health care with ai powered
gowthamarvj
 
AI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptxAI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptx
AyeshaJalil6
 
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual FormStorage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Professional Content Writing's
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
Urban models for professional practice 03
Urban models for professional practice 03Urban models for professional practice 03
Urban models for professional practice 03
DanisseLoiDapdap
 
Multi-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline OrchestrationMulti-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline Orchestration
Romi Kuntsman
 
Controlling Financial Processes at a Municipality
Controlling Financial Processes at a MunicipalityControlling Financial Processes at a Municipality
Controlling Financial Processes at a Municipality
Process mining Evangelist
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
report (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhsreport (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhs
AngelPinedaTaguinod
 
Chapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptxChapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptx
PermissionTafadzwaCh
 
MLOps_with_SageMaker_Template_EN idioma inglés
MLOps_with_SageMaker_Template_EN idioma inglésMLOps_with_SageMaker_Template_EN idioma inglés
MLOps_with_SageMaker_Template_EN idioma inglés
FabianPierrePeaJacob
 
Sets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledgeSets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledge
saumyasl2020
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2
Dalal2Ali
 

Performance Optimization of Recommendation Training Pipeline at Netflix DB Tsai and Hua Jiangx)

  • 1. Optimization of Recommendation Pipelines using Apache Spark Hua Jiang and DB Tsai Spark Summit SF - June 6, 2017
  • 2. Netflix Scale ▪ Started streaming videos 10 years ago ▪ > 100M members ▪ > 190 countries ▪ > 1000 device types ▪ A third of peak US downstream traffic
  • 3. Turn on Netflix, and the absolute best content for you would automatically start playing Recommendation System: Ideal State
  • 4. Title Ranking Everything is a RecommendationRowSelection&Ordering Recommendations are driven by machine learning algorithms Over 80% of what members watch comes from our recommendations Image
  • 5. • Try an idea offline using historical data to see if it would have made better recommendations • If it would, deploy a live A/B test to see if it performs well in production Running Experiments
  • 6. Design Experiment Collect Label Dataset Offline Feature Generation Model Training Compute Validation Metrics Model Testing Design a New Experiment to Test Out Different Ideas Offline Experiment Online System Online AB Testing Running Experiments
  • 7. Feature Generation: Feature Computation S3 Snapshot Model Training Labeled Features Label Data Feature Model Feature Encoders Required Features Data Features Fact Data Required Data 1 3 42 5 6
  • 8. Version 1: RDD-Based Feature Generation • RDD: Resilient Distributed Dataset • Our first version was written when only RDD operations were available • Opacity ▪ Data are opaque ▪ Computation is opaque
  • 9. Version 1: RDD-Based Feature Generation S3 Snapshot Model Training Labeled Features RDD of POJO’s Feature Model Required Feature Maps of Data POJO Feature Encoders Features Required Data Label Data
  • 10. Version 1: RDD-Based Feature Generation RDD operations are at low level. You are responsible for performance optimization. RDD operations are on whole objects, even if only one field is required.
  • 11. Version 2: Using DataFrame • DataFrame: Structured Data Organized into Named Columns • Transparency ▪ Data are structured ▪ Computations are planned based on common patterns
  • 12. Version 2: Using DataFrame Spark SQL optimizer, Catalyst, optimizes DataFrame operation
  • 13. Version 2: Using DataFrame S3 Snapshot Model Training Labeled Features RDD of POJO’s Feature Model Required Feature Feature Encoders Maps of Data POJO Features Required Data Label Data
  • 14. Version 2: Using DataFrame S3 Snapshot Model Training Structured Labeled Features Structured Data in DataFrame Feature Model Feature Encoders Required Feature Maps of Data POJO Features Required Data Label Data
  • 15. Version 2: Using DataFrame ~3x run time gain in feature generation ▪ 50 ~ 80 executors ▪ ~3 cores per executor ▪ ~24GB per executor
  • 16. Version 2: Using DataFrame Let’s take a look at the physical plan of the DataFrame taken from snapshot...
  • 17. Version 2: Using DataFrame (with RDD[Row]) We use RDD[Row] from data frame and create a new data frame by manipulating the Row object. S3 Snapshot Structured Data in DataFrame Deduping Logic with Row Manipulation
  • 18. Version 2: Using DataFrame (with RDD[Row]) Even the new DataFrame, created from RDD[Row], has columns with the same names, they are different to Spark
  • 19. Version 2: Using DataFrame (with RDD[Row]) Manipulations on row objects are completely opaque, blocking optimizer from moving operations around.
  • 20. Version 3: Column Operations To The Rescue Most of the operations are essentially column(s) to column(s)
  • 21. Version 3: Column Operations To The Rescue Possible Replacement for row manipulations: ▪ Spark SQL Functions ▪ User-Defined Functions ▪ Catalyst Expression Most of the operations are essentially column(s) to column(s)
  • 22. Version 3: Column Operations To The Rescue Spark SQL Functions (org.apache.spark.sql.functions) ▪ Built-in ▪ Highly efficient ▪ Internal data structure ▪ Code generation ▪ Supports rule-based optimization ▪ A variety of categories ▪ Aggregation ▪ Collection ▪ Math ▪ String
  • 23. Version 3: Column Operations To The Rescue User-Defined Functions (UDFs) ▪ Scala functions with certain types ▪ Highly flexible ▪ Data encoding/decoding required
  • 24. Version 3: Column Operations To The Rescue User-Defined Catalyst Expressions ▪ Flexible ▪ User defines the operations ▪ Efficient ▪ Internal data structure ▪ Code generation possible
  • 25. Version 3: Column Operations To The Rescue S3 Snapshot Model Training Structured Labeled Features Feature Model Structured Data in DataFrame Feature Encoders Required Feature Maps of Data POJO Features Required Data Label Data Row Manipulation
  • 26. Version 3: Column Operations To The Rescue S3 Snapshot Model Training Structured Labeled Features Feature Model Structured Data in DataFrame Feature Encoders Required Feature Maps of Data POJO Features Required Data Label Data Catalyst Expressions
  • 27. Version 3: Column Operations To The Rescue We replaced row manipulation with Catalyst expression S3 Snapshot Structured Data in DataFrame case class RemoveDuplications(child: Expression) extends UnaryExpression { ... } Catalyst Expression
  • 28. Version 3: Column Operations To The Rescue
  • 29. Version 3: Column Operations To The Rescue Physical Plan with Column Operations
  • 30. Version 3: Column Operations To The Rescue ~2x run time gain compared to version 2 ▪ 50 ~ 80 executors ▪ ~3 cores per executor ▪ ~24GB per executor
  • 31. Conclusions ▪ Time Travel in Offline Training ▪ Fact logging + offline feature generation ▪ Optimization ▪ Remove “black boxes” ▪ Prefer high-level DataFrame APIs ▪ Prefer column operations over row manipulations
  翻译: