Performance Optimization of Recommendation Training Pipeline at Netflix (DB Tsai and Hua Jiang)

Optimization of
Recommendation
Pipelines using
Apache Spark
Hua Jiang and DB Tsai
Spark Summit SF - June 6, 2017

Netflix Scale
▪ Started streaming videos
10 years ago
▪ > 100M members
▪ > 190 countries
▪ > 1000 device types
▪ A third of peak US
downstream traffic

Recommendation System: Ideal State
Turn on Netflix, and the
absolute best content for you
would automatically start playing

Title Ranking
Everything is a Recommendation
Row Selection & Ordering
Recommendations are
driven by machine
learning algorithms
Over 80% of what
members watch comes
from our
recommendations
Image

Running Experiments
• Try an idea offline using historical data to see if it
would have made better recommendations
• If it would, deploy a live A/B test to see if it performs
well in production

Design Experiment
Collect Label Dataset
Offline Feature
Generation
Model Training
Compute
Validation Metrics
Model Testing
Design a New Experiment to Test Out Different Ideas
Offline
Experiment
Online
System
Online
AB Testing
Running Experiments

Feature Generation: Feature Computation
S3
Snapshot
Model Training
Labeled Features
Label Data
Feature Model
Feature Encoders
Required
Features Data
Features
Fact Data
Required Data

Version 1: RDD-Based Feature Generation
• RDD: Resilient Distributed Dataset
• Our first version was written when only RDD operations
were available
• Opacity
▪ Data are opaque
▪ Computation is opaque

S3
Snapshot
Model Training
Labeled Features
RDD of POJO’s
Feature Model
Required
Feature
Maps of Data
POJO
Feature Encoders
Features
Required Data
Label Data

RDD operations are low-level.
You are responsible for performance
optimization.
RDD operations act on whole objects,
even if only one field is required.

Version 2: Using DataFrame
• DataFrame: Structured Data Organized into Named
Columns
• Transparency
▪ Data are structured
▪ Computations are planned based on common patterns

The Spark SQL optimizer, Catalyst, optimizes
DataFrame operations
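
To make the contrast with the RDD version concrete, here is a minimal sketch of the same query written both ways. The `PlayEvent` record, its fields, and the sample values are hypothetical and only stand in for the fact data; they are not from the talk.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type standing in for the fact data (assumption, not from the talk).
case class PlayEvent(memberId: Long, titleId: Long, country: String, durationSec: Int)

object RddVsDataFrame {
  // Made-up sample rows, used only for illustration.
  def sampleEvents: Seq[PlayEvent] = Seq(
    PlayEvent(1L, 10L, "US", 1200),
    PlayEvent(2L, 11L, "BR", 300),
    PlayEvent(3L, 10L, "US", 45)
  )

  // RDD style: the lambdas are opaque to Spark and operate on whole
  // PlayEvent objects, even though only two fields are actually used.
  def titlesViaRdd(spark: SparkSession): Seq[Long] =
    spark.sparkContext.parallelize(sampleEvents)
      .filter(_.country == "US")
      .map(_.titleId)
      .collect().toSeq.sorted

  // DataFrame style: the filter and projection are column expressions,
  // so Catalyst can see which columns are needed and plan accordingly.
  def titlesViaDataFrame(spark: SparkSession): DataFrame = {
    import spark.implicits._
    sampleEvents.toDF()
      .filter(col("country") === "US")
      .select(col("titleId"))
  }
}
```

Calling `explain(true)` on the DataFrame returned by `titlesViaDataFrame` prints the logical and physical plans. With a columnar source such as Parquet, the same filter and projection can typically be pushed toward the scan, which is not possible when the logic is hidden inside RDD lambdas.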

S3
Snapshot
Model Training
Labeled Features
RDD of POJO’s
Feature Model
Required
Feature
Feature Encoders
Maps of Data
POJO
Features
Required Data
Label Data

S3
Snapshot
Model Training
Structured Labeled Features
Structured Data in
DataFrame
Feature Model
Feature Encoders
Required
Feature
Maps of Data
POJO
Features
Required Data
Label Data

~3x run time gain in feature generation
▪ 50 ~ 80 executors
▪ ~3 cores per executor
▪ ~24GB per executor

Let’s take a look at the physical plan of
the DataFrame taken from snapshot...

Version 2: Using DataFrame (with RDD[Row])
We take the RDD[Row] from the DataFrame and
create a new DataFrame by manipulating
the Row objects.
S3
Snapshot
Structured Data in
DataFrame
Deduping Logic with
Row Manipulation

Even though the new DataFrame, created from
RDD[Row], has columns with the same
names, Spark treats them as different columns

Manipulations on Row
objects are completely
opaque, blocking the
optimizer from moving
operations around.
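
The round trip described above can be sketched as follows. This is only an illustration of the pattern, not the talk's actual deduping logic: the column names `memberId` and `titles`, the sample data, and the choice of `distinct` as the dedup rule are all assumptions.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object RowRoundTrip {
  // Made-up input: one array column that may contain repeated entries.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1L, Seq("a", "b", "a")), (2L, Seq("c", "c"))).toDF("memberId", "titles")
  }

  // Row manipulation: drop to RDD[Row], rebuild each Row by hand, then
  // wrap the result in a new DataFrame that reuses the original schema.
  // Spark cannot look inside the map function, so the new DataFrame is
  // planned as a scan over an arbitrary RDD rather than over the source.
  def dedupViaRows(spark: SparkSession, df: DataFrame): DataFrame = {
    val rows = df.rdd.map { row =>
      Row(row.getLong(0), row.getSeq[String](1).distinct)
    }
    spark.createDataFrame(rows, df.schema)
  }
}
```

A filter applied after `dedupViaRows`, for example on `memberId`, cannot be moved in front of the map step, because the optimizer has no way to know that the function leaves that column unchanged.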

Version 3: Column Operations To The Rescue
Most of the operations are essentially
column(s) to column(s)

Possible Replacement for row manipulations:
▪ Spark SQL Functions
▪ User-Defined Functions
▪ Catalyst Expression

Spark SQL Functions
(org.apache.spark.sql.functions)
▪ Built-in
▪ Highly efficient
▪ Internal data structure
▪ Code generation
▪ Supports rule-based optimization
▪ A variety of categories
▪ Aggregation
▪ Collection
▪ Math
▪ String
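
As a rough illustration of these categories, the sketch below combines an aggregation (`sum`, `collect_set`), a collection function (`size`), and a string function (`upper`) from `org.apache.spark.sql.functions`. The column names and sample rows are made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_set, size, sum, upper}

object BuiltInFunctions {
  // Made-up viewing data: (memberId, country, titleId, minutes).
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1L, "us", 10L, 30), (1L, "us", 10L, 45), (1L, "us", 12L, 5), (2L, "br", 11L, 20))
      .toDF("memberId", "country", "titleId", "minutes")
  }

  // Every step is a column expression, so the whole computation stays
  // visible to Catalyst and is eligible for code generation.
  def perMember(df: DataFrame): DataFrame =
    df.groupBy(col("memberId"), upper(col("country")).as("country"))
      .agg(
        sum(col("minutes")).as("totalMinutes"),      // aggregation
        collect_set(col("titleId")).as("titles"))    // aggregation that also dedups
      .withColumn("distinctTitles", size(col("titles"))) // collection function
}
```

`BuiltInFunctions.perMember(BuiltInFunctions.sample(spark))` yields one row per member with the total minutes and the number of distinct titles watched.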

User-Defined Functions (UDFs)
▪ Scala functions with certain types
▪ Highly flexible
▪ Data encoding/decoding required

User-Defined Catalyst Expressions
▪ Flexible
▪ User defines the operations
▪ Efficient
▪ Internal data structure
▪ Code generation possible

S3
Snapshot
Model Training
Feature Model
Structured Data in
DataFrame
Feature Encoders
Required
Feature
Maps of Data
POJO
Features
Required Data
Label Data
Row
Manipulation

S3
Snapshot
Model Training
Feature Model
Structured Data in
DataFrame
Feature Encoders
Required
Feature
Maps of Data
POJO
Features
Required Data
Label Data
Catalyst
Expressions

We replaced row manipulation with
a Catalyst expression
S3
Snapshot
Structured Data in
DataFrame
case class RemoveDuplications(child: Expression) extends
UnaryExpression {
...
}
Catalyst Expression

Physical Plan with Column Operations

~2x run time gain compared to version 2
▪ 50 ~ 80 executors
▪ ~3 cores per executor
▪ ~24GB per executor

Conclusions
▪ Time Travel in Offline Training
▪ Fact logging + offline feature generation
▪ Optimization
▪ Remove “black boxes”
▪ Prefer high-level DataFrame APIs
▪ Prefer column operations over row manipulations
