SlideShare a Scribd company logo
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Deep Dive : Spark Data Frames, SQL
and Catalyst Optimizer
Sachin Aggarwal
june13, 2016
Deep Dive : Spark Data Frames, SQL and
Catalyst Optimizer
2
Agenda
• RDD recap
• Spark SQL library
– Architecture of Spark SQL
– Comparison with Pig and Hive Pipeline
• DataFrames
– Definition of a DataFrames API
– DataFrames Operations
– DataFrames features
– Data cleansing
– Diagram for logical plan container
• Plan Optimization & Execution
– Catalyst Analyzer
– Catalyst Optimizer
– Generating Physical Plan
– Code Generation
– Extensions
3
RDD Overview
– Immutable
– distributed
– Partitioned
– Fault tolerant
– Operations applied to all Rows in dataset
– Lazily evaluated
– Can be persisted 4
Types of RDD
HDFS File
Input
FilteredRDD
MappedRDD
ShuffledRDD
MappedRDD
JSON File
Input
.filter
.map
.join
HadoopRDD
JSONRDD
.map
HDFS File
Output
.saveAsHadoopFile()
.HadoopFile()
5
Spark SQL library
• Data source API
– Universal API for Loading/ Saving structured data
• DataFrame API
– Higher level representation for structured data
• SQL interpreter and optimizer
– Express data transformation in SQL
• SQL service
– Thrift Server
Architecture of Spark SQL
JSON
Any
External
Source
PARQUET JDBC
DATASOURCE API
DATASETS/DATAFRAMES API
DSL SPARK SQL
CSV
Pig and Hive pipeline
Pig latin
Executor
Optimizer
Pig parser
HiveQL
Hive parser
Optimizer
Executor
Hive queries
Logical Plan
Optimized Logical
Plan(M/R plan)
Physical Plan
Pig latin script
Logical Plan
Optimized Logical
Plan(M/R plan)
Physical Plan
Issue with Pig and Hive flow
• Pig and hive shares a lot similar steps but
independent of each other
• Each project implements it’s own optimizer and
executor which prevents benefiting from each
other’s work
• There is no common data structure on which we
can build both Pig and Hive dialects
• Optimizer is not flexible to accommodate
multiple DSL’s
• Lot of duplicate effort and poor interoperability
Need for new abstraction
• Single abstraction for structured data
– Ability to combine data from multiple sources
– Uniform access from all different language API’s
– Ability to support multiple DSL’s
• Familiar interface to Data scientists
– Same API as R/ Panda
– Easy to convert like, from R local data frame to
Spark
Spark SQL pipeline
HiveQL
Hive parser
SparkQL
SparkSQL Parser
DataFrame
DSL
DataFrame
Catalyst
Hive queries Spark SQL
queries
Spark RDD code
Definition of a DataFrame API
• Single abstraction to manipulate RDDs
• Distributed collection of data organized into named columns
• RDD + Schema (evolved from SchemaRDD)
• Cross language support (Levels performance for all language)
• Data frame is a container for Logical Plan
– Logical Plan is a tree which represents data and schema
– Every transformation is represented as tree manipulation
– These trees are manipulated and optimized by catalyst rules
– Logical plan will be converted to physical plan for execution
• Introduced in 1.3
• Inspired from R and Python panda
• Robust & feature rich DSL
12
Cross language support (Faster
Implementation)
13
DataFrame Operations
• Relational operations (select, where, join, groupBy) via a DSL
• Operators take expression objects
• Operators build up an abstract syntax tree (AST), which is then
optimized by Catalyst.
• Alternatively, register as temp SQL table and perform traditional SQL
query strings
14
DataFrame features
• Support creation from various sources
– Native - JSON, JDBC, parquet
– 3rd party packages – csv, Cassandra etc
– Custom DataSource API
– RDD
• Schema
– Explicitly provided
• Case class
• StructType
– Inferred automatically via sampling
• Feature Rich DSL
15
DataFrame APIs
• DataFrameStatFunctions
• cov
• corr
• DataFrameNaFunctions
• fill
• drop
• replace
• Parsing
• Rules in DS API
Data cleansing
Detecting and correcting (or removing) corrupt or inaccurate records
DataFrame APIs….
• Misc
• describe
• Aggregate functions
• dropduplicates
• distinct
• count
• DataType
• cast
• date formatting in v1.5
16
Explain Command
• df.explain(true)
• Explain command on DataFrame allows us
look at these plans
• There are three types of logical plans
– Parsed logical plan
– Analysed Logical Plan
– Optimized logical Plan
• Explain also shows Physical plan
17
Diagram for logical plan container
• DF analyzed:
– df.queryExecution.analyzed.numberedTreeString)
• DF optimizedPlan:
– df.queryExecution.optimizedPlan.numberedTreeString)
• DF sparkPlan:
– df.queryExecution.sparkPlan.numberedTreeString)
18
Plan Optimization & Execution
19
SQL AST
DataFrame
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan
RDDs
Selected
Physical
Plan
Analysis
Logical
Optimization
Physical
Planning
CostModel
Physical
Plans
Code
Generation
Catalog
DataFrames and SQL share the same optimization/execution
pipeline
Optimization happens as late as
possible, therefore Spark SQL can
optimize even across functions
20
Example Query
select a.customerId from
(
select customerId , amountPaid as amount
from sales where 1 = '1’
) a
where amount=500.0
21
Catalyst Analyzer
22
Unresolved
Logical Plan
Logical Plan
Analysis
Catalog
Parsed Plan
• This is plan generated after parsing the DSL
• Normally these plans generated by the specific
parsers like HiveQL parser, Dataframe DSL parser
etc
• Usually they recognize the different
transformations and represent them in the tree
nodes
• It’s a straightforward translation without much
tweaking
• This will be fed to analyser to generate analysed
23
Parsed Logical Plan
24
Analyzed Plan
• We use sqlContext.analyser access the rules to
generate analyzed plan
• These rules has to be run in sequence to
resolve different entities in the logical plan
• Different entities to be resolved is
– Relations ( aka Table)
– References Ex : Subquery, aliases etc
– Data type casting
25
ResolveRelations Rule
• This rule resolves all the relations ( tables)
specified in the plan
• Whenever it finds a new unresolved relation,
it consults catalyst aka catalog of catalyst.
• Once it finds the relation, it resolves that with
actual
26
27
ResolveReferences
• This rule resolves all the references in the Plan
•
• All aliases and column names get a unique
number which allows parser to locate them
irrespective of their position
• This unique numbering allows subqueries to
removed for better optimization
28
29
Promote String
• This rule allows analyser to promote string to
right data types
• In our query, Filter( 1=’1’) we are comparing a
double with string
• This rule puts a cast from string to double to
have the right semantics.
30
31
Catalyst Optimizer
32
Logical Plan
Optimized
Logical Plan
Logical
Optimization
Eliminate Subqueries
• This rule allows analyser to eliminate
superfluous sub queries
• This is possible as we have unique identifier
for each of the references
• Removal of sub queries allows us to do
advanced optimization in subsequent steps
33
34
Constant Folding
• Simplifies expressions which result in constant
values
• In our plan, Filter(1=1) always results in true
• So constant folding replaces it in true
35
36
Simplify Filters
• This rule simplifies filters by
– Removes always true filters
– Removes entire plan subtree if filter is false
• In our query, the true Filter will be removed
• By simplifying filters, we can avoid multiple
iterations on data
37
38
Push Predicate Through Filter
• It’s always good to have filters near to the
data source for better optimizations
• This rules pushes the filters near to the
JsonRelation
• When we rearrange the tree nodes, we need
to make sure we rewrite the rule match the
aliases
• In our example, the filter rule is rewritten to
use alias amountPaid rather than amount
39
40
Project Collapsing
• Removes unnecessary projects from the plan
• In our plan , we don’t need second projection, i.e
customerId, amount Paid as we only require one
projection i.e customerId
• So we can get rid of the second projection
• This gives us most optimized Plan
41
42
Generating Physical Plan
• Catalyser can take a logical plan and turn into
a physical plan or Spark plan
• On queryExecutor, we have a plan called
executedPlan which gives us physical plan
• On physical plan, we can call executeCollect or
executeTake to start evaluating the Plan
43
Code Generation
• Relies on Scala’s quasiquotes to simplify code
gen.
• Catalyst transforms a SQL tree into an abstract
syntax tree (AST) for Scala code to eval expr and
generate code
• 700LOC
Set Footer from Insert Dropdown Menu 44
Extensions
• Data Sources
– must implement a createRelation function that takes a set of
key-value params and returns a BaseRelation object.
– E.g. CSV, Avro, Parquet, JDBC
• User-Defined Types (UDTs)
– Map user-defined types to structures composed of Catalyst’s
built-in types.
45
Ad

More Related Content

What's hot (20)

Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Vadim Y. Bichutskiy
 
Spark
SparkSpark
Spark
Amir Payberah
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
Spark Summit
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
Databricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
On Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLOn Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQL
Databricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)
Ferran Galí Reniu
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Databricks
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
Spark Summit
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
Databricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
On Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLOn Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQL
Databricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)
Ferran Galí Reniu
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Databricks
 

Viewers also liked (20)

Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Chris Fregly
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
datamantra
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Databricks
 
Data Science with Spark by Saeed Aghabozorgi
Data Science with Spark by Saeed Aghabozorgi Data Science with Spark by Saeed Aghabozorgi
Data Science with Spark by Saeed Aghabozorgi
Sachin Aggarwal
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
datamantra
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
datamantra
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Databricks
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Databricks
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
Databricks
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Josef A. Habdank
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
Demi Ben-Ari
 
Ayasdi strata
Ayasdi strataAyasdi strata
Ayasdi strata
Alpine Data
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Josef A. Habdank
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
Girish Khanzode
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
Wes McKinney
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Chris Fregly
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
datamantra
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Databricks
 
Data Science with Spark by Saeed Aghabozorgi
Data Science with Spark by Saeed Aghabozorgi Data Science with Spark by Saeed Aghabozorgi
Data Science with Spark by Saeed Aghabozorgi
Sachin Aggarwal
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
datamantra
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
datamantra
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Databricks
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Databricks
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
Databricks
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Josef A. Habdank
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
Demi Ben-Ari
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Josef A. Habdank
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
Wes McKinney
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Ad

Similar to Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer (20)

Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
Syed Hadoop
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
Databricks
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
aftab alam
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
Md. Mahedi Kaysar
 
4Introduction+to+Spark.pptx sdfsdfsdfsdfsdf
4Introduction+to+Spark.pptx sdfsdfsdfsdfsdf4Introduction+to+Spark.pptx sdfsdfsdfsdfsdf
4Introduction+to+Spark.pptx sdfsdfsdfsdfsdf
yafora8192
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & python
Maloy Manna, PMP®
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Julian Hyde
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
Databricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
Spark Summit
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
Databricks
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
Gal Marder
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
prajods
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
Anyscale
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
Gal Marder
 
Apache Spark
Apache SparkApache Spark
Apache Spark
masifqadri
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100xOscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
shradha ambekar
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
Syed Hadoop
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
Databricks
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
aftab alam
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
Md. Mahedi Kaysar
 
4Introduction+to+Spark.pptx sdfsdfsdfsdfsdf
4Introduction+to+Spark.pptx sdfsdfsdfsdfsdf4Introduction+to+Spark.pptx sdfsdfsdfsdfsdf
4Introduction+to+Spark.pptx sdfsdfsdfsdfsdf
yafora8192
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & python
Maloy Manna, PMP®
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Julian Hyde
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
Databricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
Spark Summit
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
Databricks
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
Gal Marder
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
prajods
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
Anyscale
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
Gal Marder
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100xOscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
shradha ambekar
 
Ad

Recently uploaded (20)

stackconf 2025 | From SBOM to Software Architecture Documentation by Philip A...
stackconf 2025 | From SBOM to Software Architecture Documentation by Philip A...stackconf 2025 | From SBOM to Software Architecture Documentation by Philip A...
stackconf 2025 | From SBOM to Software Architecture Documentation by Philip A...
NETWAYS
 
The history of Human Rights powerpoint Andrea Giuliano Nacuzi.pdf
The history of Human Rights powerpoint Andrea Giuliano Nacuzi.pdfThe history of Human Rights powerpoint Andrea Giuliano Nacuzi.pdf
The history of Human Rights powerpoint Andrea Giuliano Nacuzi.pdf
wolfryx99
 
A Brief Introduction About John Smith
A Brief Introduction About John SmithA Brief Introduction About John Smith
A Brief Introduction About John Smith
John Smith
 
stackconf 2025 | Snakes are my new favourite by Felix Frank.pdf
stackconf 2025 | Snakes are my new favourite by Felix Frank.pdfstackconf 2025 | Snakes are my new favourite by Felix Frank.pdf
stackconf 2025 | Snakes are my new favourite by Felix Frank.pdf
NETWAYS
 
We Are The World-USA for Africa : Written By Lionel Richie And Michael Jackso...
We Are The World-USA for Africa : Written By Lionel Richie And Michael Jackso...We Are The World-USA for Africa : Written By Lionel Richie And Michael Jackso...
We Are The World-USA for Africa : Written By Lionel Richie And Michael Jackso...
hershtara1
 
stackconf 2025 | Building high-performance apps & controlling costs with CNCF...
stackconf 2025 | Building high-performance apps & controlling costs with CNCF...stackconf 2025 | Building high-performance apps & controlling costs with CNCF...
stackconf 2025 | Building high-performance apps & controlling costs with CNCF...
NETWAYS
 
stackconf 2025 | How Open Source Communities are Defining the Next Generation...
stackconf 2025 | How Open Source Communities are Defining the Next Generation...stackconf 2025 | How Open Source Communities are Defining the Next Generation...
stackconf 2025 | How Open Source Communities are Defining the Next Generation...
NETWAYS
 
The Mettle of Honor 05.11.2025.pptx
The  Mettle  of  Honor   05.11.2025.pptxThe  Mettle  of  Honor   05.11.2025.pptx
The Mettle of Honor 05.11.2025.pptx
FamilyWorshipCenterD
 
Guiding the Behavior of Young Children.ppt
Guiding the Behavior of Young Children.pptGuiding the Behavior of Young Children.ppt
Guiding the Behavior of Young Children.ppt
FelixOlalekanBabalol
 
stackconf 2025 | 2025: I Don’t Know K8S and at This Point, I’m Too Afraid To ...
stackconf 2025 | 2025: I Don’t Know K8S and at This Point, I’m Too Afraid To ...stackconf 2025 | 2025: I Don’t Know K8S and at This Point, I’m Too Afraid To ...
stackconf 2025 | 2025: I Don’t Know K8S and at This Point, I’m Too Afraid To ...
NETWAYS
 
stackconf 2025 | Building a Hyperconverged Proxmox VE Cluster with Ceph by Jo...
stackconf 2025 | Building a Hyperconverged Proxmox VE Cluster with Ceph by Jo...stackconf 2025 | Building a Hyperconverged Proxmox VE Cluster with Ceph by Jo...
stackconf 2025 | Building a Hyperconverged Proxmox VE Cluster with Ceph by Jo...
NETWAYS
 
Seasonality_Mediterranean_Cuisine.pptx. Seasonality and Popularity of Medite...
Seasonality_Mediterranean_Cuisine.pptx.  Seasonality and Popularity of Medite...Seasonality_Mediterranean_Cuisine.pptx.  Seasonality and Popularity of Medite...
Seasonality_Mediterranean_Cuisine.pptx. Seasonality and Popularity of Medite...
graycil350
 
stackconf 2025 | Operator All the (stateful) Things by Jannik Clausen.pdf
stackconf 2025 | Operator All the (stateful) Things by Jannik Clausen.pdfstackconf 2025 | Operator All the (stateful) Things by Jannik Clausen.pdf
stackconf 2025 | Operator All the (stateful) Things by Jannik Clausen.pdf
NETWAYS
 
NL-based Software Engineering (NLBSE) '25
NL-based Software Engineering (NLBSE) '25NL-based Software Engineering (NLBSE) '25
NL-based Software Engineering (NLBSE) '25
Sebastiano Panichella
 
Navigating the Digital Asset Landscape-From Blockchain Foundations to Future ...
Navigating the Digital Asset Landscape-From Blockchain Foundations to Future ...Navigating the Digital Asset Landscape-From Blockchain Foundations to Future ...
Navigating the Digital Asset Landscape-From Blockchain Foundations to Future ...
BobPesakovic
 
Mastering Public Speaking: Key Skills for Confident Communication
Mastering Public Speaking: Key Skills for Confident CommunicationMastering Public Speaking: Key Skills for Confident Communication
Mastering Public Speaking: Key Skills for Confident Communication
karthikeyans20012004
 
Cross-Cultural-Communication-and-Adaptation.pdf
Cross-Cultural-Communication-and-Adaptation.pdfCross-Cultural-Communication-and-Adaptation.pdf
Cross-Cultural-Communication-and-Adaptation.pdf
rash64487
 
Hurricane Milton powerpoint Andrea Giuliano Nacuzi.pdf
Hurricane Milton powerpoint Andrea Giuliano Nacuzi.pdfHurricane Milton powerpoint Andrea Giuliano Nacuzi.pdf
Hurricane Milton powerpoint Andrea Giuliano Nacuzi.pdf
wolfryx99
 
All_India_Situation_Presentation. by Dr Jesmina Khatun
All_India_Situation_Presentation. by Dr Jesmina KhatunAll_India_Situation_Presentation. by Dr Jesmina Khatun
All_India_Situation_Presentation. by Dr Jesmina Khatun
DRJESMINAKHATUN
 
Modernization of Parliaments: The Way Forward
Modernization of Parliaments: The Way ForwardModernization of Parliaments: The Way Forward
Modernization of Parliaments: The Way Forward
Dr. Fotios Fitsilis
 
stackconf 2025 | From SBOM to Software Architecture Documentation by Philip A...
stackconf 2025 | From SBOM to Software Architecture Documentation by Philip A...stackconf 2025 | From SBOM to Software Architecture Documentation by Philip A...
stackconf 2025 | From SBOM to Software Architecture Documentation by Philip A...
NETWAYS
 
The history of Human Rights powerpoint Andrea Giuliano Nacuzi.pdf
The history of Human Rights powerpoint Andrea Giuliano Nacuzi.pdfThe history of Human Rights powerpoint Andrea Giuliano Nacuzi.pdf
The history of Human Rights powerpoint Andrea Giuliano Nacuzi.pdf
wolfryx99
 
A Brief Introduction About John Smith
A Brief Introduction About John SmithA Brief Introduction About John Smith
A Brief Introduction About John Smith
John Smith
 
stackconf 2025 | Snakes are my new favourite by Felix Frank.pdf
stackconf 2025 | Snakes are my new favourite by Felix Frank.pdfstackconf 2025 | Snakes are my new favourite by Felix Frank.pdf
stackconf 2025 | Snakes are my new favourite by Felix Frank.pdf
NETWAYS
 
We Are The World-USA for Africa : Written By Lionel Richie And Michael Jackso...
We Are The World-USA for Africa : Written By Lionel Richie And Michael Jackso...We Are The World-USA for Africa : Written By Lionel Richie And Michael Jackso...
We Are The World-USA for Africa : Written By Lionel Richie And Michael Jackso...
hershtara1
 
stackconf 2025 | Building high-performance apps & controlling costs with CNCF...
stackconf 2025 | Building high-performance apps & controlling costs with CNCF...stackconf 2025 | Building high-performance apps & controlling costs with CNCF...
stackconf 2025 | Building high-performance apps & controlling costs with CNCF...
NETWAYS
 
stackconf 2025 | How Open Source Communities are Defining the Next Generation...
stackconf 2025 | How Open Source Communities are Defining the Next Generation...stackconf 2025 | How Open Source Communities are Defining the Next Generation...
stackconf 2025 | How Open Source Communities are Defining the Next Generation...
NETWAYS
 
The Mettle of Honor 05.11.2025.pptx
The  Mettle  of  Honor   05.11.2025.pptxThe  Mettle  of  Honor   05.11.2025.pptx
The Mettle of Honor 05.11.2025.pptx
FamilyWorshipCenterD
 
Guiding the Behavior of Young Children.ppt
Guiding the Behavior of Young Children.pptGuiding the Behavior of Young Children.ppt
Guiding the Behavior of Young Children.ppt
FelixOlalekanBabalol
 
stackconf 2025 | 2025: I Don’t Know K8S and at This Point, I’m Too Afraid To ...
stackconf 2025 | 2025: I Don’t Know K8S and at This Point, I’m Too Afraid To ...stackconf 2025 | 2025: I Don’t Know K8S and at This Point, I’m Too Afraid To ...
stackconf 2025 | 2025: I Don’t Know K8S and at This Point, I’m Too Afraid To ...
NETWAYS
 
stackconf 2025 | Building a Hyperconverged Proxmox VE Cluster with Ceph by Jo...
stackconf 2025 | Building a Hyperconverged Proxmox VE Cluster with Ceph by Jo...stackconf 2025 | Building a Hyperconverged Proxmox VE Cluster with Ceph by Jo...
stackconf 2025 | Building a Hyperconverged Proxmox VE Cluster with Ceph by Jo...
NETWAYS
 
Seasonality_Mediterranean_Cuisine.pptx. Seasonality and Popularity of Medite...
Seasonality_Mediterranean_Cuisine.pptx.  Seasonality and Popularity of Medite...Seasonality_Mediterranean_Cuisine.pptx.  Seasonality and Popularity of Medite...
Seasonality_Mediterranean_Cuisine.pptx. Seasonality and Popularity of Medite...
graycil350
 
stackconf 2025 | Operator All the (stateful) Things by Jannik Clausen.pdf
stackconf 2025 | Operator All the (stateful) Things by Jannik Clausen.pdfstackconf 2025 | Operator All the (stateful) Things by Jannik Clausen.pdf
stackconf 2025 | Operator All the (stateful) Things by Jannik Clausen.pdf
NETWAYS
 
NL-based Software Engineering (NLBSE) '25
NL-based Software Engineering (NLBSE) '25NL-based Software Engineering (NLBSE) '25
NL-based Software Engineering (NLBSE) '25
Sebastiano Panichella
 
Navigating the Digital Asset Landscape-From Blockchain Foundations to Future ...
Navigating the Digital Asset Landscape-From Blockchain Foundations to Future ...Navigating the Digital Asset Landscape-From Blockchain Foundations to Future ...
Navigating the Digital Asset Landscape-From Blockchain Foundations to Future ...
BobPesakovic
 
Mastering Public Speaking: Key Skills for Confident Communication
Mastering Public Speaking: Key Skills for Confident CommunicationMastering Public Speaking: Key Skills for Confident Communication
Mastering Public Speaking: Key Skills for Confident Communication
karthikeyans20012004
 
Cross-Cultural-Communication-and-Adaptation.pdf
Cross-Cultural-Communication-and-Adaptation.pdfCross-Cultural-Communication-and-Adaptation.pdf
Cross-Cultural-Communication-and-Adaptation.pdf
rash64487
 
Hurricane Milton powerpoint Andrea Giuliano Nacuzi.pdf
Hurricane Milton powerpoint Andrea Giuliano Nacuzi.pdfHurricane Milton powerpoint Andrea Giuliano Nacuzi.pdf
Hurricane Milton powerpoint Andrea Giuliano Nacuzi.pdf
wolfryx99
 
All_India_Situation_Presentation. by Dr Jesmina Khatun
All_India_Situation_Presentation. by Dr Jesmina KhatunAll_India_Situation_Presentation. by Dr Jesmina Khatun
All_India_Situation_Presentation. by Dr Jesmina Khatun
DRJESMINAKHATUN
 
Modernization of Parliaments: The Way Forward
Modernization of Parliaments: The Way ForwardModernization of Parliaments: The Way Forward
Modernization of Parliaments: The Way Forward
Dr. Fotios Fitsilis
 

Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer

  • 1. Power of data. Simplicity of design. Speed of innovation. IBM Spark Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer Sachin Aggarwal june13, 2016
  • 2. Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer 2
  • 3. Agenda • RDD recap • Spark SQL library – Architecture of Spark SQL – Comparison with Pig and Hive Pipeline • DataFrames – Definition of a DataFrames API – DataFrames Operations – DataFrames features – Data cleansing – Diagram for logical plan container • Plan Optimization & Execution – Catalyst Analyzer – Catalyst Optimizer – Generating Physical Plan – Code Generation – Extensions 3
  • 4. RDD Overview – Immutable – distributed – Partitioned – Fault tolerant – Operations applied to all Rows in dataset – Lazily evaluated – Can be persisted 4
  • 5. Types of RDD HDFS File Input FilteredRDD MappedRDD ShuffledRDD MappedRDD JSON File Input .filter .map .join HadoopRDD JSONRDD .map HDFS File Output .saveAsHadoopFile() .HadoopFile() 5
  • 6. Spark SQL library • Data source API – Universal API for Loading/ Saving structured data • DataFrame API – Higher level representation for structured data • SQL interpreter and optimizer – Express data transformation in SQL • SQL service – Thrift Server
  • 7. Architecture of Spark SQL JSON Any External Source PARQUET JDBC DATASOURCE API DATASETS/DATAFRAMES API DSL SPARK SQL CSV
  • 8. Pig and Hive pipeline Pig latin Executor Optimizer Pig parser HiveQL Hive parser Optimizer Executor Hive queries Logical Plan Optimized Logical Plan(M/R plan) Physical Plan Pig latin script Logical Plan Optimized Logical Plan(M/R plan) Physical Plan
  • 9. Issue with Pig and Hive flow • Pig and hive shares a lot similar steps but independent of each other • Each project implements it’s own optimizer and executor which prevents benefiting from each other’s work • There is no common data structure on which we can build both Pig and Hive dialects • Optimizer is not flexible to accommodate multiple DSL’s • Lot of duplicate effort and poor interoperability
  • 10. Need for new abstraction • Single abstraction for structured data – Ability to combine data from multiple sources – Uniform access from all different language API’s – Ability to support multiple DSL’s • Familiar interface to Data scientists – Same API as R/ Panda – Easy to convert like, from R local data frame to Spark
  • 11. Spark SQL pipeline HiveQL Hive parser SparkQL SparkSQL Parser DataFrame DSL DataFrame Catalyst Hive queries Spark SQL queries Spark RDD code
  • 12. Definition of a DataFrame API • Single abstraction to manipulate RDDs • Distributed collection of data organized into named columns • RDD + Schema (evolved from SchemaRDD) • Cross language support (Levels performance for all language) • Data frame is a container for Logical Plan – Logical Plan is a tree which represents data and schema – Every transformation is represented as tree manipulation – These trees are manipulated and optimized by catalyst rules – Logical plan will be converted to physical plan for execution • Introduced in 1.3 • Inspired from R and Python panda • Robust & feature rich DSL 12
  • 13. Cross language support (Faster Implementation) 13
  • 14. DataFrame Operations • Relational operations (select, where, join, groupBy) via a DSL • Operators take expression objects • Operators build up an abstract syntax tree (AST), which is then optimized by Catalyst. • Alternatively, register as temp SQL table and perform traditional SQL query strings 14
  • 15. DataFrame features • Support creation from various sources – Native - JSON, JDBC, parquet – 3rd party packages – csv, Cassandra etc – Custom DataSource API – RDD • Schema – Explicitly provided • Case class • StructType – Inferred automatically via sampling • Feature Rich DSL 15
  • 16. DataFrame APIs • DataFrameStatFunctions • cov • corr • DataFrameNaFunctions • fill • drop • replace • Parsing • Rules in DS API Data cleansing Detecting and correcting (or removing) corrupt or inaccurate records DataFrame APIs…. • Misc • describe • Aggregate functions • dropduplicates • distinct • count • DataType • cast • date formatting in v1.5 16
  • 17. Explain Command • df.explain(true) • Explain command on DataFrame allows us look at these plans • There are three types of logical plans – Parsed logical plan – Analysed Logical Plan – Optimized logical Plan • Explain also shows Physical plan 17
  • 18. Diagram for logical plan container • DF analyzed: – df.queryExecution.analyzed.numberedTreeString) • DF optimizedPlan: – df.queryExecution.optimizedPlan.numberedTreeString) • DF sparkPlan: – df.queryExecution.sparkPlan.numberedTreeString) 18
  • 19. Plan Optimization & Execution 19 SQL AST DataFrame Unresolved Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan Analysis Logical Optimization Physical Planning CostModel Physical Plans Code Generation Catalog DataFrames and SQL share the same optimization/execution pipeline
  • 20. Optimization happens as late as possible, therefore Spark SQL can optimize even across functions 20
  • 21. Example Query select a.customerId from ( select customerId , amountPaid as amount from sales where 1 = '1’ ) a where amount=500.0 21
  • 23. Parsed Plan • This is plan generated after parsing the DSL • Normally these plans generated by the specific parsers like HiveQL parser, Dataframe DSL parser etc • Usually they recognize the different transformations and represent them in the tree nodes • It’s a straightforward translation without much tweaking • This will be fed to analyser to generate analysed 23
  • 25. Analyzed Plan • We use sqlContext.analyser access the rules to generate analyzed plan • These rules has to be run in sequence to resolve different entities in the logical plan • Different entities to be resolved is – Relations ( aka Table) – References Ex : Subquery, aliases etc – Data type casting 25
  • 26. ResolveRelations Rule • This rule resolves all the relations ( tables) specified in the plan • Whenever it finds a new unresolved relation, it consults catalyst aka catalog of catalyst. • Once it finds the relation, it resolves that with actual 26
  • 27. 27
  • 28. ResolveReferences • This rule resolves all the references in the Plan • • All aliases and column names get a unique number which allows parser to locate them irrespective of their position • This unique numbering allows subqueries to removed for better optimization 28
  • 29. 29
  • 30. Promote String • This rule allows analyser to promote string to right data types • In our query, Filter( 1=’1’) we are comparing a double with string • This rule puts a cast from string to double to have the right semantics. 30
  • 31. 31
  • 33. Eliminate Subqueries • This rule allows analyser to eliminate superfluous sub queries • This is possible as we have unique identifier for each of the references • Removal of sub queries allows us to do advanced optimization in subsequent steps 33
  • 34. 34
  • 35. Constant Folding • Simplifies expressions which result in constant values • In our plan, Filter(1=1) always results in true • So constant folding replaces it in true 35
  • 36. 36
  • 37. Simplify Filters • This rule simplifies filters by – Removes always true filters – Removes entire plan subtree if filter is false • In our query, the true Filter will be removed • By simplifying filters, we can avoid multiple iterations on data 37
  • 38. 38
  • 39. Push Predicate Through Filter • It’s always good to have filters near to the data source for better optimizations • This rules pushes the filters near to the JsonRelation • When we rearrange the tree nodes, we need to make sure we rewrite the rule match the aliases • In our example, the filter rule is rewritten to use alias amountPaid rather than amount 39
  • 40. 40
  • 41. Project Collapsing • Removes unnecessary projects from the plan • In our plan , we don’t need second projection, i.e customerId, amount Paid as we only require one projection i.e customerId • So we can get rid of the second projection • This gives us most optimized Plan 41
  • 42. 42
  • 43. Generating Physical Plan • Catalyser can take a logical plan and turn into a physical plan or Spark plan • On queryExecutor, we have a plan called executedPlan which gives us physical plan • On physical plan, we can call executeCollect or executeTake to start evaluating the Plan 43
  • 44. Code Generation • Relies on Scala’s quasiquotes to simplify code gen. • Catalyst transforms a SQL tree into an abstract syntax tree (AST) for Scala code to eval expr and generate code • 700LOC Set Footer from Insert Dropdown Menu 44
  • 45. Extensions • Data Sources – must implement a createRelation function that takes a set of key-value params and returns a BaseRelation object. – E.g. CSV, Avro, Parquet, JDBC • User-Defined Types (UDTs) – Map user-defined types to structures composed of Catalyst’s built-in types. 45

Editor's Notes

  • #5: Immutability and partitioning RDDs composed of collection of records which are partitioned. Partition is basic unit of parallelism in a RDD, and each partition is one logical division of data which is immutable and created through some transformations on existing partitions.Immutability helps to achieve consistency in computations. Users can define their own criteria for partitioning based on keys on which they want to join multiple datasets if needed. Coarse grained operations Coarse grained operations are operations which are applied to all elements in datasets. For example – a map, or filter or groupBy operation which will be performed on all elements in a partition of RDD. Transformations and actions RDDs can only be created by reading data from a stable storage such as HDFS or by transformations on existing RDDs. All computations on RDDs are either transformations or actions. Fault Tolerance Since RDDs are created over a set of transformations , it logs those transformations, rather than actual data.Graph of these transformations to produce one RDD is called as Lineage Graph. For example – firstRDD=spark.textFile("hdfs://...") secondRDD=firstRDD.filter(someFunction); thirdRDD = secondRDD.map(someFunction); result = thirdRDD.count() In case of we lose some partition of RDD , we can replay the transformation on that partition  in lineage to achieve the same computation, rather than doing data replication across multiple nodes.This characteristic is biggest benefit of RDD , because it saves a lot of efforts in data management and replication and thus achieves faster computations. Lazy evaluations Spark computes RDDs lazily the first time they are used in an action, so that it can pipeline transformations. So , in above example RDD will be evaluated only when count() action is invoked. Persistence Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage or on Disk etc.) These properties of RDDs make them useful for fast computations.
  • #15: traditional sql convenient for computing multiple things at the same time
  翻译: