Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer

Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Sachin Aggarwal
June 13, 2016


Agenda
• RDD recap
• Spark SQL library
– Architecture of Spark SQL
– Comparison with Pig and Hive Pipeline
• DataFrames
– Definition of the DataFrame API
– DataFrame operations
– DataFrame features
– Data cleansing
– Diagram for logical plan container
• Plan Optimization & Execution
– Catalyst Analyzer
– Catalyst Optimizer
– Generating Physical Plan
– Code Generation
– Extensions

RDD Overview
– Immutable
– Distributed
– Partitioned
– Fault tolerant
– Operations are applied to all rows in the dataset
– Lazily evaluated
– Can be persisted
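
The properties above can be illustrated with a toy model in plain Python. This is only a sketch: `ToyRDD` and its methods are hypothetical names, not Spark classes, and it runs on one machine. It shows that transformations only record lineage, that an action triggers evaluation, and that a partition can be recomputed from its lineage.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable, partitioned, lazily evaluated."""

    def __init__(self, partitions, lineage=()):
        self._partitions = tuple(tuple(p) for p in partitions)  # source data, never mutated
        self._lineage = tuple(lineage)  # recorded transformations, not yet applied

    def map(self, fn):
        # A transformation returns a new ToyRDD; nothing is computed here.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def compute_partition(self, index):
        # Replaying the lineage is also how a lost partition could be rebuilt.
        rows = iter(self._partitions[index])
        for kind, fn in self._lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def collect(self):
        # An action: triggers evaluation of every partition.
        out = []
        for i in range(len(self._partitions)):
            out.extend(self.compute_partition(i))
        return out


calls = []

def double(x):
    calls.append(x)  # records when the function really runs
    return x * 2

rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])
doubled = rdd.map(double).filter(lambda x: x > 6)
assert calls == []          # nothing has run yet: evaluation is lazy
result = doubled.collect()  # the action evaluates the recorded lineage
```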

Types of RDD
(Diagram: lineage of RDD types)
• HDFS File Input → .hadoopFile() → HadoopRDD → .filter → FilteredRDD → .map → MappedRDD
• JSON File Input → JSONRDD → .map → MappedRDD
• Both branches → .join → ShuffledRDD → .saveAsHadoopFile() → HDFS File Output

Spark SQL library
• Data source API
– Universal API for loading and saving structured data
• DataFrame API
– Higher level representation for structured data
• SQL interpreter and optimizer
– Express data transformation in SQL
• SQL service
– Thrift Server

Architecture of Spark SQL
(Diagram: layers, bottom to top)
• Data sources: CSV, JSON, Parquet, JDBC, any external source
• Data Source API
• Datasets/DataFrames API
• DSL and Spark SQL

Pig and Hive pipeline
• Pig: Pig Latin script → Pig parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan
• Hive: Hive queries (HiveQL) → Hive parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan

Issue with Pig and Hive flow
• Pig and Hive share many similar steps but are independent of each other
• Each project implements its own optimizer and executor, which prevents them from benefiting from each other’s work
• There is no common data structure on which we can build both Pig and Hive dialects
• The optimizer is not flexible enough to accommodate multiple DSLs
• A lot of duplicated effort and poor interoperability

Need for new abstraction
• Single abstraction for structured data
– Ability to combine data from multiple sources
– Uniform access from all the different language APIs
– Ability to support multiple DSLs
• Familiar interface for data scientists
– API similar to R and pandas data frames
– Easy to convert, e.g. from a local R data frame to a Spark DataFrame

Spark SQL pipeline
• Hive queries (HiveQL) → Hive parser → DataFrame
• Spark SQL queries (SparkQL) → SparkSQL Parser → DataFrame
• DataFrame DSL → DataFrame
• DataFrame → Catalyst → Spark RDD code

Definition of a DataFrame API
• Single abstraction to manipulate RDDs
• Distributed collection of data organized into named columns
• RDD + Schema (evolved from SchemaRDD)
• Cross-language support (comparable performance from every language API)
• A DataFrame is a container for a logical plan
– The logical plan is a tree which represents data and schema
– Every transformation is represented as a tree manipulation
– These trees are manipulated and optimized by Catalyst rules
– The logical plan is converted to a physical plan for execution
• Introduced in Spark 1.3
• Inspired by R data frames and Python pandas
• Robust and feature-rich DSL
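
The idea of a DataFrame as a container for a logical plan can be sketched with a toy model. Everything here (`ToyDataFrame`, `Relation`, `Filter`, `Project`, `tree_string`) is a hypothetical stand-in rather than Spark's API; the point is only that each transformation wraps the existing tree in a new node instead of touching any data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:   # leaf node: a named table
    name: str


@dataclass(frozen=True)
class Filter:
    condition: str
    child: object


@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object


class ToyDataFrame:
    """Hypothetical stand-in for a DataFrame: it only holds a logical plan."""

    def __init__(self, plan):
        self.plan = plan

    def where(self, condition):
        # A transformation is a tree manipulation: add a Filter node on top.
        return ToyDataFrame(Filter(condition, self.plan))

    def select(self, *columns):
        return ToyDataFrame(Project(columns, self.plan))


def tree_string(plan, depth=0):
    """Render the plan as indented lines, root first."""
    if isinstance(plan, Relation):
        line, child = f"Relation {plan.name}", None
    elif isinstance(plan, Filter):
        line, child = f"Filter {plan.condition}", plan.child
    else:
        line, child = f"Project {', '.join(plan.columns)}", plan.child
    lines = ["  " * depth + line]
    if child is not None:
        lines.extend(tree_string(child, depth + 1))
    return lines


df = ToyDataFrame(Relation("sales")).where("amountPaid = 500.0").select("customerId")
print("\n".join(tree_string(df.plan)))
```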

Cross language support (Faster Implementation)

DataFrame Operations
• Relational operations (select, where, join, groupBy) via a DSL
• Operators take expression objects
• Operators build up an abstract syntax tree (AST), which is then
optimized by Catalyst.
• Alternatively, register the DataFrame as a temporary SQL table and run traditional SQL query strings against it
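
How operators can take expression objects and build up a tree is sketched below using Python operator overloading. The names `Expr`, `Column`, `Literal`, `BinaryOp`, `col` and `lit` are defined here for illustration only; they mimic the shape of a DataFrame DSL but are not Spark's classes.

```python
class Expr:
    """Base class: operators build tree nodes instead of computing values."""

    def __eq__(self, other):
        return BinaryOp("=", self, lit(other))

    def __gt__(self, other):
        return BinaryOp(">", self, lit(other))

    def __and__(self, other):
        return BinaryOp("AND", self, lit(other))


class Column(Expr):
    def __init__(self, name):
        self.name = name

    def sql(self):
        return self.name


class Literal(Expr):
    def __init__(self, value):
        self.value = value

    def sql(self):
        return repr(self.value)


class BinaryOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def sql(self):
        return f"({self.left.sql()} {self.op} {self.right.sql()})"


def lit(value):
    # Wrap plain Python values so that every operand is an expression object.
    return value if isinstance(value, Expr) else Literal(value)


def col(name):
    return Column(name)


# No data is touched here; the result is an abstract syntax tree.
cond = (col("amountPaid") == 500.0) & (col("customerId") > 10)
print(cond.sql())
```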

DataFrame features
• Supports creation from various sources
– Native: JSON, JDBC, Parquet
– Third-party packages: CSV, Cassandra, etc.
– Custom DataSource API
– RDD
• Schema
– Explicitly provided (case class or StructType)
– Inferred automatically via sampling
• Feature-rich DSL
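
Schema inference via sampling can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not Spark's inference algorithm: the function names, the type names and the merge rules (long widens to double, anything else falls back to string) are all choices made for this example.

```python
import random


def type_name(value):
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def merge(a, b):
    """Pick a type that can hold values of both a and b (assumed rules)."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"long", "double"}:
        return "double"
    return "string"


def infer_schema(records, sample_size=100, seed=0):
    """Infer {column: type name} from a random sample of dict records."""
    rng = random.Random(seed)
    sample = records if len(records) <= sample_size else rng.sample(records, sample_size)
    schema = {}
    for record in sample:
        for name, value in record.items():
            seen = "null" if value is None else type_name(value)
            schema[name] = merge(schema.get(name, "null"), seen)
    return schema


records = [
    {"customerId": 1, "amountPaid": 100},
    {"customerId": 2, "amountPaid": 500.0},
    {"customerId": 3, "amountPaid": None, "note": "x"},
]
schema = infer_schema(records)
```

Because only a sample is inspected, a column whose odd values fall outside the sample would get a narrower type than the full data needs; that is the trade-off of inferring instead of providing the schema explicitly.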

Data cleansing
Detecting and correcting (or removing) corrupt or inaccurate records
DataFrame APIs
• DataFrameStatFunctions
– cov
– corr
• DataFrameNaFunctions
– fill
– drop
– replace
• Parsing
– Rules in DS API
• Misc
– describe
– Aggregate functions
– dropDuplicates
– distinct
– count
• DataType
– cast
– date formatting in v1.5

Explain Command
• df.explain(true)
• The explain command on a DataFrame lets us look at its plans
• There are three types of logical plans
– Parsed logical plan
– Analyzed logical plan
– Optimized logical plan
• Explain also shows the physical plan

Diagram for logical plan container
• DF analyzed:
– df.queryExecution.analyzed.numberedTreeString
• DF optimizedPlan:
– df.queryExecution.optimizedPlan.numberedTreeString
• DF sparkPlan:
– df.queryExecution.sparkPlan.numberedTreeString

Plan Optimization & Execution
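• SQL AST / DataFrame → Unresolved Logical Plan
• Analysis (uses the Catalog): Unresolved Logical Plan → Logical Plan
• Logical Optimization: Logical Plan → Optimized Logical Plan
• Physical Planning: Optimized Logical Plan → Physical Plans
• Cost Model: Physical Plans → Selected Physical Plan
• Code Generation: Selected Physical Plan → RDDs
DataFrames and SQL share the same optimization/execution pipeline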

Optimization happens as late as possible, therefore Spark SQL can optimize even across functions

Example Query
select a.customerId from
(
select customerId , amountPaid as amount
from sales where 1 = '1'
) a
where amount=500.0

Catalyst Analyzer
Unresolved Logical Plan → Analysis (uses the Catalog) → Logical Plan

Parsed Plan
• This is the plan generated after parsing the DSL
• Normally these plans are generated by the specific parsers, such as the HiveQL parser or the DataFrame DSL parser
• Usually they recognize the different transformations and represent them as tree nodes
• It’s a straightforward translation without much tweaking
• This plan is fed to the analyzer to generate the analyzed plan

Analyzed Plan
• We use sqlContext.analyzer to access the rules that generate the analyzed plan
• These rules have to be run in sequence to resolve the different entities in the logical plan
• The entities to be resolved are
– Relations (aka tables)
– References, e.g. subqueries, aliases
– Data type casting

ResolveRelations Rule
• This rule resolves all the relations (tables) specified in the plan
• Whenever it finds a new unresolved relation, it consults the catalog (Catalyst's catalog)
• Once it finds the relation, it replaces the unresolved node with the actual relation
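
A toy version of this lookup, assuming the catalog is a plain dictionary from table name to column names. The class and function names are made up for the sketch and are not Spark's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnresolvedRelation:
    name: str


@dataclass(frozen=True)
class ResolvedRelation:
    name: str
    columns: tuple   # schema taken from the catalog


@dataclass(frozen=True)
class Filter:
    condition: str
    child: object


def resolve_relations(plan, catalog):
    """Replace every UnresolvedRelation with the entry found in the catalog."""
    if isinstance(plan, UnresolvedRelation):
        if plan.name not in catalog:
            raise LookupError(f"Table not found: {plan.name}")
        return ResolvedRelation(plan.name, catalog[plan.name])
    if isinstance(plan, Filter):
        return Filter(plan.condition, resolve_relations(plan.child, catalog))
    return plan


catalog = {"sales": ("customerId", "amountPaid")}   # hypothetical catalog contents
parsed = Filter("1 = '1'", UnresolvedRelation("sales"))
analyzed = resolve_relations(parsed, catalog)
```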

ResolveReferences
• This rule resolves all the references in the plan
• All aliases and column names get a unique number, which allows the parser to locate them irrespective of their position
• This unique numbering allows subqueries to be removed for better optimization
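
The unique numbering can be sketched as below. This is a deliberately small model, not Spark's rule: `Attribute`, `resolve_relation` and `resolve_project` are hypothetical names, and a shared counter hands out the ids. A plain column reference keeps the id of the column it refers to, while an alias gets a fresh id.

```python
import itertools
from dataclasses import dataclass

ids = itertools.count(1)   # shared source of unique ids


@dataclass(frozen=True)
class Attribute:
    name: str
    expr_id: int


def resolve_relation(columns):
    """Give each column of a table its own unique id."""
    return {name: Attribute(name, next(ids)) for name in columns}


def resolve_project(select_list, scope):
    """select_list items are 'col' or ('col', 'alias'); returns the output scope."""
    output = {}
    for item in select_list:
        if isinstance(item, tuple):
            source, alias = item
            scope[source]                                 # KeyError if the column is unknown
            output[alias] = Attribute(alias, next(ids))   # an alias gets a fresh id
        else:
            output[item] = scope[item]                    # a plain reference keeps its id
    return output


sales = resolve_relation(["customerId", "amountPaid"])
inner = resolve_project(["customerId", ("amountPaid", "amount")], sales)
outer = resolve_project(["customerId"], inner)
```

Because `outer["customerId"]` carries the same id as the column in `sales`, the reference stays valid even if the subquery layer between them is later removed.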

Promote String
• This rule allows the analyzer to promote strings to the right data types
• In our query, Filter(1 = '1') compares a numeric value with a string
• This rule puts in a cast from string to double to get the right semantics
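
A minimal sketch of such a coercion rule, assuming a tiny expression model with only literals, casts and equality. The names and the set of numeric types are assumptions for illustration; Spark's own rule covers many more cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: object
    data_type: str   # "string", "int" or "double" in this sketch


@dataclass(frozen=True)
class Cast:
    child: object
    data_type: str   # type of the value after the cast


@dataclass(frozen=True)
class EqualTo:
    left: object
    right: object


NUMERIC = {"int", "double"}


def promote_strings(expr):
    """In a string-versus-number comparison, cast the string side to double."""
    if isinstance(expr, EqualTo):
        left, right = expr.left, expr.right
        if left.data_type == "string" and right.data_type in NUMERIC:
            left = Cast(left, "double")
        elif right.data_type == "string" and left.data_type in NUMERIC:
            right = Cast(right, "double")
        return EqualTo(left, right)
    return expr


condition = EqualTo(Literal(1, "int"), Literal("1", "string"))   # 1 = '1'
coerced = promote_strings(condition)
```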

Catalyst Optimizer
Logical Plan → Logical Optimization → Optimized Logical Plan

Eliminate Subqueries
• This rule allows the analyzer to eliminate superfluous subqueries
• This is possible because we have a unique identifier for each of the references
• Removing subqueries allows us to do advanced optimization in subsequent steps
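
As a rough illustration, removing subquery wrappers is a simple tree rewrite. The node classes below are hypothetical and conditions are kept as plain strings; the sketch only shows the wrapper being dropped while its child stays in place.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Subquery:
    alias: str
    child: object


@dataclass(frozen=True)
class Filter:
    condition: str
    child: object


@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object


def eliminate_subqueries(plan):
    if isinstance(plan, Subquery):
        return eliminate_subqueries(plan.child)   # drop the wrapper, keep its child
    if isinstance(plan, (Filter, Project)):
        return replace(plan, child=eliminate_subqueries(plan.child))
    return plan


plan = Project(
    ("customerId",),
    Filter(
        "amount = 500.0",
        Subquery("a", Project(("customerId", "amountPaid AS amount"), Relation("sales"))),
    ),
)
flattened = eliminate_subqueries(plan)
```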

Constant Folding
• Simplifies expressions which result in constant values
• In our plan, Filter(1=1) always results in true
• So constant folding replaces it with true
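
Constant folding can be sketched as a bottom-up rewrite over a small expression tree. The node types here are invented for the example and cover only equality and AND.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class EqualTo:
    left: object
    right: object


@dataclass(frozen=True)
class And:
    left: object
    right: object


def fold_constants(expr):
    """Evaluate sub-expressions whose operands are all literals."""
    if isinstance(expr, (EqualTo, And)):
        left, right = fold_constants(expr.left), fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            if isinstance(expr, EqualTo):
                return Literal(left.value == right.value)
            return Literal(bool(left.value and right.value))
        return type(expr)(left, right)   # not constant: keep the node
    return expr


always_true = fold_constants(EqualTo(Literal(1.0), Literal(1.0)))
mixed = fold_constants(
    And(EqualTo(Literal(1.0), Literal(1.0)), EqualTo(Attribute("amount"), Literal(500.0)))
)
```

Only the constant half of `mixed` is folded; the comparison that involves a column is left for evaluation at run time.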

Simplify Filters
• This rule simplifies filters by
– Removing filters that are always true
– Removing the entire plan subtree if the filter is always false
• In our query, the true Filter will be removed
• By simplifying filters, we can avoid multiple iterations over the data
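
A minimal sketch of both cases, assuming conditions have already been folded to the Python constants `True` or `False` where possible. `EmptyRelation` is a hypothetical placeholder for "a plan that produces no rows".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class EmptyRelation:
    pass


@dataclass(frozen=True)
class Filter:
    condition: object   # True, False, or a condition string in this sketch
    child: object


def simplify_filters(plan):
    if isinstance(plan, Filter):
        child = simplify_filters(plan.child)
        if plan.condition is True:
            return child               # always-true filter: drop the filter node
        if plan.condition is False:
            return EmptyRelation()     # always-false filter: drop the whole subtree
        return Filter(plan.condition, child)
    return plan


kept = simplify_filters(Filter("amount = 500.0", Filter(True, Relation("sales"))))
```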

Push Predicate Through Filter
• It’s always good to have filters near the data source for better optimization
• This rule pushes the filters close to the JsonRelation
• When we rearrange the tree nodes, we need to rewrite the filter condition so that it matches the aliases
• In our example, the filter condition is rewritten to use the underlying column amountPaid rather than the alias amount
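
The rewrite can be sketched as swapping a Filter with the Project beneath it while substituting aliases. The node shapes are assumptions for this example: a filter holds a single `column op value` comparison and a projection lists plain column names or `Alias` objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Alias:
    source: str   # underlying column
    name: str     # name exposed by the projection


@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object


@dataclass(frozen=True)
class Filter:
    column: str
    op: str
    value: object
    child: object


def push_filter_through_project(plan):
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        project = plan.child
        aliases = {c.name: c.source for c in project.columns if isinstance(c, Alias)}
        column = aliases.get(plan.column, plan.column)   # amount -> amountPaid
        pushed = Filter(column, plan.op, plan.value, project.child)
        return Project(project.columns, pushed)
    return plan


before = Filter(
    "amount", "=", 500.0,
    Project(("customerId", Alias("amountPaid", "amount")), Relation("sales")),
)
after = push_filter_through_project(before)
```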

Project Collapsing
• Removes unnecessary projections from the plan
• In our plan, we don’t need the second projection (customerId, amountPaid) as we only require one projection (customerId)
• So we can get rid of the second projection
• This gives us the most optimized plan
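
Collapsing two adjacent projections amounts to rewriting the outer column list in terms of the inner one. The sketch below assumes projections hold plain column names or `Alias` objects; all names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Alias:
    source: str
    name: str


@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object


def output_name(col):
    return col.name if isinstance(col, Alias) else col


def source_name(col):
    return col.source if isinstance(col, Alias) else col


def collapse_projects(plan):
    if not isinstance(plan, Project):
        return plan
    child = collapse_projects(plan.child)
    if not isinstance(child, Project):
        return Project(plan.columns, child)
    inner = {output_name(c): c for c in child.columns}
    merged = []
    for c in plan.columns:
        src = source_name(inner[source_name(c)])   # underlying column in the base relation
        out = output_name(c)
        merged.append(src if src == out else Alias(src, out))
    return Project(tuple(merged), child.child)


plan = Project(
    ("customerId",),
    Project(("customerId", Alias("amountPaid", "amount")), Relation("sales")),
)
collapsed = collapse_projects(plan)
```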

Generating Physical Plan
• Catalyst can take a logical plan and turn it into a physical plan, or Spark plan
• On queryExecution, we have a plan called executedPlan which gives us the physical plan
• On the physical plan, we can call executeCollect or executeTake to start evaluating the plan

Code Generation
• Relies on Scala’s quasiquotes to simplify code generation
• Catalyst transforms a SQL expression tree into an abstract syntax tree (AST) for Scala code that evaluates the expression, then compiles and runs the generated code
• About 700 lines of code
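
The general idea of generating code from an expression tree can be shown with a Python analogue. Spark's generator builds Scala ASTs with quasiquotes; this sketch instead emits Python source text and compiles it with the built-in `compile`, so it illustrates the technique only. The node classes and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class EqualTo:
    left: object
    right: object


def gen_source(expr):
    """Translate an expression tree into Python source text."""
    if isinstance(expr, Literal):
        return repr(expr.value)
    if isinstance(expr, Attribute):
        return f"row[{expr.name!r}]"
    if isinstance(expr, Add):
        return f"({gen_source(expr.left)} + {gen_source(expr.right)})"
    if isinstance(expr, EqualTo):
        return f"({gen_source(expr.left)} == {gen_source(expr.right)})"
    raise TypeError(f"unsupported expression: {expr!r}")


def compile_expr(expr):
    """Compile the tree once into a function that evaluates it for a row."""
    source = f"lambda row: {gen_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


predicate = compile_expr(EqualTo(Attribute("amountPaid"), Literal(500.0)))
matches = predicate({"customerId": 1, "amountPaid": 500.0})
```

The tree is walked once at compile time; evaluating each row afterwards runs the generated function directly instead of interpreting the tree node by node.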

Extensions
• Data Sources
– must implement a createRelation function that takes a set of
key-value params and returns a BaseRelation object.
– E.g. CSV, Avro, Parquet, JDBC
• User-Defined Types (UDTs)
– Map user-defined types to structures composed of Catalyst’s
built-in types.
