Radical Speed for SQL
Queries on Databricks:
Photon Under the Hood
Alex Behm
Tech Lead, Databricks
Greg Rahn
Staff Product Manager, Databricks
Agenda
▪ Intro to Photon
▪ Recent Developments
▪ Up Next
▪ Summary
Introduction to Photon
Observed Workload Trends
Businesses are moving faster, and as a result
organizations spend less time on data modeling, leading
to worse performance.
▪ Most columns don’t have "NOT NULL" constraints defined
▪ Strings are convenient but slower than specific types
▪ Data lifecycle: Raw → Bronze → Silver → Gold
Can we get both agility and performance?
-- Data [Analysts | Engineers | Scientists] everywhere
Just one more ask:
SQL as a first-class citizen on
Databricks
What is Photon?
Photon is a new 100% Apache Spark compatible query engine
designed for speed and flexibility.
It’s built from the ground up to deliver the fastest performance
on modern cloud hardware for all data use cases across
data engineering, data science, machine learning, and data analytics.
• Re-architected for the fastest performance on real-world applications
  • Native C++ engine for faster queries
  • Custom built memory management to avoid JVM bottlenecks
  • Vectorized: memory, instruction, and data parallelism (SIMD)
• Works with your existing code and avoids vendor lock-in
  • 100% compatible with open source Spark DataFrame APIs and Spark SQL
  • Transparent operation to users - no need to invoke something new, it just works
• Optimizing for all data use cases and workloads
  • Today, supporting SQL and DataFrame workloads
  • Coming soon, Streaming, Data Science, and more
Building the next generation query engine
Why build a new execution engine?
[Diagram]
Client: Submit SQL Query
Spark Driver (JVM)
● Parsing
● Catalyst: Analysis/Planning/Optimization
● Scheduling
Spark Executors (Mixed JVM/Native)
● Execute Task (×4)
Photon in the Databricks Lakehouse Platform
Delta Lake
Key Photon Characteristics
• Hybrid Photon/Spark Plans
  • Use Photon when possible, fall back to Spark for unsupported operations
  • Completely transparent to users
• Native code using off-heap memory
  • Natural access to memory and intrinsics (no fiddling with Java Unsafe)
  • No JVM GC, large heaps ok
  • No JVM JIT performance cliffs / limitations
• Fully integrated with Spark’s memory manager
• Prefers hash join over sort-merge join
• Rich per-operator performance metrics
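The hybrid-plan fallback on this slide can be illustrated with a small sketch. It is a toy model, not Photon's actual planner: the `Node` class, the `SUPPORTED` set, and the `to_hybrid` function are hypothetical names, and the rule assumed here (convert operators bottom-up from the scan and stop at the first unsupported one) is a simplification.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of operators the native engine can run (illustration only).
SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate"}


@dataclass
class Node:
    op: str
    child: Optional["Node"] = None
    engine: str = "spark"  # every node starts as a regular Spark node


def to_hybrid(node: Node) -> bool:
    """Mark nodes bottom-up; return True if `node` runs in the native engine.

    A node is converted only if its child was converted and its operator is
    supported, so the native part is a contiguous run starting at the scan.
    """
    child_native = to_hybrid(node.child) if node.child else True
    if child_native and node.op in SUPPORTED:
        node.engine = "photon"
        return True
    return False


def describe(node: Optional[Node]) -> list:
    """Return the plan top-down as 'Operator[engine]' strings."""
    out = []
    while node:
        out.append(f"{node.op}[{node.engine}]")
        node = node.child
    return out


# Sort is treated as unsupported here, so it stays on the Spark side.
plan = Node("Sort", Node("HashAggregate", Node("Project", Node("Filter", Node("Scan")))))
to_hybrid(plan)
print(describe(plan))
```

In this model the user-facing query does not change at all; only the engine label on each node does, which is the sense in which the fallback is transparent.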
Recent Developments in Photon
Development Focus Areas
1. Production Readiness
a. Goal: Resilience comparable to DBR → spilling support
b. Testing and hardening, real customer workloads
2. Query Coverage
a. Today: Basics like joins/aggregations/shuffle, common types and functions
b. In development: Nested types, built-in functions
c. Coming soon: Sort/Window
3. Performance
a. Analyze and optimize common usage patterns
Disclaimer: Microbenchmarks
Microbenchmarks do not necessarily reflect
real-world end-to-end performance
During Photon development we analyze and optimize
performance with extensive microbenchmarks
In the following slides, we share benchmark results that
were run in controlled and narrowly scoped scenarios
Resilience with Very Large Inputs
• Spilling for very large inputs
• Write intermediate state to external storage to process
inputs exceeding available memory
✅ Hash Shuffle
✅ Hash Aggregation
✅ Hash Join
2-5x Speedup
Example: Spilling Hash Join [1 of 4]
Partitioned Hash Table
• Hash join has two phases
• build and probe
• Build phase: insert records
from one join input into the
hash table
• Hash table has a fixed
number of partitions
Example: Spilling Hash Join [2 of 4]
• When memory runs out, spill
one partition to disk
• New records go to
in-memory partitions or
straight to disk
• Repeat until build is done
Partitioned Hash Table
Example: Spilling Hash Join [3 of 4]
• Probe phase: process
rows from other join input
• Emit results for probe
rows matching in-memory
build partitions
• Spill probe rows matching
a spilled build partition
Partitioned Hash Table
Build
Probe
Example: Spilling Hash Join [4 of 4]
• For each spilled partition,
repeat the same
build/probe process
• Might spill again! Apply
same algorithm recursively
Build
Probe
⨝
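The four steps above can be sketched as a small recursive function. This is a minimal model under stated assumptions, not Photon's implementation: Python lists stand in for spill files, `mem_limit` counts build rows instead of bytes, the victim is always the largest in-memory partition, and `hash_join`, `_part`, `NUM_PARTITIONS`, and `max_level` are names invented for the example. Role reversal is not modeled.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # fixed number of hash-table partitions (assumed value)


def _part(key, level):
    # Salt the hash with the recursion level so spilled rows get re-split.
    return hash((key, level)) % NUM_PARTITIONS


def hash_join(build, probe, mem_limit, level=0, max_level=8):
    """Inner-join two lists of (key, value) rows on key.

    Returns (key, build_value, probe_value) tuples. `mem_limit` is the
    maximum number of build rows held in memory at once.
    """
    if len(build) <= mem_limit or level >= max_level:
        # Build side fits (or we stop recursing): plain in-memory hash join.
        table = defaultdict(list)
        for k, v in build:
            table[k].append(v)
        return [(k, bv, pv) for k, pv in probe for bv in table.get(k, [])]

    in_mem = [defaultdict(list) for _ in range(NUM_PARTITIONS)]
    sizes = [0] * NUM_PARTITIONS
    spilled_build = {}  # partition id -> build rows "on disk"

    # Build phase: insert rows; spill one partition when memory runs out.
    for k, v in build:
        p = _part(k, level)
        if p in spilled_build:
            spilled_build[p].append((k, v))  # straight to disk
            continue
        in_mem[p][k].append(v)
        sizes[p] += 1
        while sum(sizes) > mem_limit:
            victim = max(range(NUM_PARTITIONS), key=lambda i: sizes[i])
            spilled_build[victim] = [
                (bk, bv) for bk, vs in in_mem[victim].items() for bv in vs
            ]
            in_mem[victim].clear()
            sizes[victim] = 0

    # Probe phase: emit matches for in-memory partitions, spill the rest.
    results = []
    spilled_probe = defaultdict(list)
    for k, pv in probe:
        p = _part(k, level)
        if p in spilled_build:
            spilled_probe[p].append((k, pv))
        else:
            results.extend((k, bv, pv) for bv in in_mem[p].get(k, []))

    # Each spilled partition is joined the same way, possibly spilling again.
    for p, b_rows in spilled_build.items():
        results.extend(
            hash_join(b_rows, spilled_probe[p], mem_limit, level + 1, max_level)
        )
    return results


build_rows = [(i % 50, f"b{i}") for i in range(200)]
probe_rows = [(i % 70, f"p{i}") for i in range(300)]
print(len(hash_join(build_rows, probe_rows, mem_limit=16)))
```

The `max_level` cutoff is a safety valve added for the sketch: if many rows share one key, re-partitioning cannot shrink the partition, so the function eventually falls back to the in-memory path.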
Spilling Hash Join vs. Spilling Sort-Merge Join
• Photon converts Sort-Merge Joins to Hash Joins
• Sort-Merge Join
  • Buffer + sort both join inputs, increasing memory pressure
  • Spilling sort → write entire input to sorted runs
• Hash Join
  • Only buffer build input (typically the smaller input) in a hash table
  • Graceful degradation: Spill both inputs at the build-partition granularity
  • Role reversal: Swap build/probe when processing spilled partitions
Up to 5x Speedup
Hardening: How we test Photon
• Random queries and data
• Using new open-source Spark random query generator
• Failure injection
• Randomly trip error paths to ensure graceful query failure
• Spill injection
• Randomly trigger spill events to simulate memory pressure
• Clang/LLVM C++ tools
• Address Sanitizer
• Undefined Behavior Sanitizer
• Combinations of the above
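Failure injection and spill injection can be illustrated with a toy harness. Everything here is hypothetical and only mirrors the idea on the slide: `Injector`, `InjectedFailure`, `sum_by_key`, and `run_trials` are invented names, the probabilities are arbitrary, and a Python list stands in for spilled state.

```python
import random


class InjectedFailure(Exception):
    """Raised at a hypothetical injection point to exercise an error path."""


class Injector:
    def __init__(self, seed, fail_prob, spill_prob):
        self.rng = random.Random(seed)  # seeded so a failing run can be replayed
        self.fail_prob = fail_prob
        self.spill_prob = spill_prob

    def maybe_fail(self, point):
        if self.rng.random() < self.fail_prob:
            raise InjectedFailure(point)

    def should_spill(self):
        return self.rng.random() < self.spill_prob


def sum_by_key(rows, inj):
    """Toy aggregation with injection points; returns (result, spill_count)."""
    acc, spilled, spills = {}, [], 0
    for k, v in rows:
        inj.maybe_fail("agg.insert")
        acc[k] = acc.get(k, 0) + v
        if inj.should_spill():
            # Simulated memory pressure: move the partial state to "disk".
            spilled.append(acc)
            acc, spills = {}, spills + 1
    result = {}
    for part in spilled + [acc]:
        inj.maybe_fail("agg.merge")
        for k, v in part.items():
            result[k] = result.get(k, 0) + v
    return result, spills


def run_trials(rows, expected, trials=200):
    """Run the query under many seeds; count clean successes and failures."""
    ok = failed = 0
    for seed in range(trials):
        inj = Injector(seed, fail_prob=0.01, spill_prob=0.2)
        try:
            result, _ = sum_by_key(rows, inj)
            assert result == expected  # spilling must not change the answer
            ok += 1
        except InjectedFailure:
            failed += 1  # the query failed, but in a controlled way
    return ok, failed


rows = [(i % 5, i) for i in range(40)]
expected = {}
for k, v in rows:
    expected[k] = expected.get(k, 0) + v
print(run_trials(rows, expected))
```

The property being checked is the one the slide names: a run either produces the correct answer despite random spills, or fails gracefully through a known error path.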
Query Coverage
Overview of Query Coverage
Data Types
✅ Byte/Short/Int/Long
✅ Boolean
✅ String/Binary
✅ Decimal
✅ Float/Double
✅ Date/Timestamp
✅ Struct
Coming soon: Array, Map
Operators
✅ Scan, Filter, Project
✅ Hash Aggregate/Join/Shuffle
✅ Nested-Loop Join
✅ Null-Aware Anti Join
✅ Union, Expand, ScalarSubquery
Coming soon: Sort, Window
Expressions
✅ Comparison / Logic
✅ Arithmetic / Math (most)
✅ Conditional (IF, CASE, etc.)
✅ String (common ones)
✅ Casts
✅ Aggregates (most common ones)
✅ Date/Timestamp (in progress)
Coming soon: UDFs, long tail
Expression Coverage for DATE/TIMESTAMP
• Many queries contain date/timestamp logic
• As of today: 95% coverage (100% very soon)
• Fast path for UTC timezone (default)
• Some expressions are very complicated to implement
• Those individual functions run in Spark, while the rest of the operator/plan still runs in Photon
Microbenchmarks do not necessarily reflect speedups on end-to-end queries; functions are optimized for the UTC timezone; your mileage may vary
Nested/Complex Type Support
• ✅ Struct
• Array / Map, in active development
• Reading data and basic usage/functions work
• In progress: collect_list() / collect_set()
• Long tail of array expressions
Microbenchmarks do not necessarily reflect speedups on end-to-end queries; your mileage may vary
Writing Delta/Parquet Data
• Currently supports all scalar types and Struct
• Array/Map in active development
• Can be turned on/off independently of Photon
  • spark.databricks.photon.parquetWriter.enabled = true
• Typical speedups: 2-4x
• Wider (>100 columns) tables can see even more gains
DML Support [DELETE / UPDATE / MERGE]
• The bulk of the work, such as joins/aggregations, runs in Photon
• Benefits from Photon Delta/Parquet writing capability
• Typical speedups: 2-3x
ANSI SQL Support
• Development in tandem with open-source Spark
• Fail queries on overflow or similar errors
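The difference between failing on overflow and silently continuing can be shown with 32-bit integer addition. This is a generic illustration rather than Photon or Spark code: `add_int32_wrapping` models an engine that wraps around, `add_int32_checked` models the ANSI-style behavior of raising an error, and both function names are made up for the example.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def add_int32_wrapping(a, b):
    """Wrap around on 32-bit overflow instead of reporting it."""
    return (a + b - INT_MIN) % 2**32 + INT_MIN


def add_int32_checked(a, b):
    """ANSI-style: raise instead of returning a wrapped value."""
    result = a + b
    if not INT_MIN <= result <= INT_MAX:
        raise ArithmeticError(f"integer overflow: {a} + {b}")
    return result


print(add_int32_wrapping(INT_MAX, 1))  # wraps to INT_MIN
try:
    add_int32_checked(INT_MAX, 1)
except ArithmeticError as err:
    print("query fails:", err)
```

The checked variant costs one extra comparison per value, but it keeps a wrapped result from flowing unnoticed into downstream aggregates.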
Photon: What's Next
Current/Up Next Efforts in Photon
• Finishing nested type support, including writes
• Outstanding ANSI SQL behaviors
• Sort and Window operators
• Support for bucketed tables
How to use Photon today
Photon: Key Use Cases for Preview
SQL Data Engineering / ELT / ETL (June)
● Enable Photon via Workspace cluster
● Notebook or JAR
● Available on: AWS
● Not supported yet
○ UDFs
○ Streaming
Interactive SQL Analytics (June)
● Photon via Databricks SQL
● Redash
● Tableau
● Microsoft Power BI
● BYO Tool via ODBC / JDBC
● Available on: AWS, Azure
● Not supported yet
○ Sort
○ Window
SELECT
vendor_id,
SUM(trip_distance) as SumTripDistance,
AVG(trip_distance) as AvgTripDistance
FROM abehm.nyc_yellow
WHERE passenger_count IN (1, 2, 4)
GROUP BY vendor_id
ORDER BY vendor_id
Without Photon:
Sort
+- Exchange rangepartitioning
   +- HashAggregate
      +- Exchange hashpartitioning
         +- HashAggregate
            +- Project
               +- Filter
                  +- ColumnarToRow
                     +- FileScan
With Photon:
Sort
+- Exchange
   +- ColumnarToRow
      +- PhotonResultStage
         +- PhotonGroupingAgg
            +- PhotonShuffleExchangeSource
               +- PhotonShuffleMapStage
                  +- PhotonShuffleExchangeSink
                     +- PhotonGroupingAgg
                        +- PhotonProject
                           +- PhotonFilter
                              +- PhotonAdapter
                                 +- FileScan
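One quick way to see how much of such a plan runs in Photon is to split its nodes by name. The sketch below copies the second plan from this slide and assumes, as a heuristic only, that Photon operators are exactly those whose names start with `Photon`; `split_by_engine` is a hypothetical helper, not a Databricks API.

```python
PLAN = """\
Sort
+- Exchange
+- ColumnarToRow
+- PhotonResultStage
+- PhotonGroupingAgg
+- PhotonShuffleExchangeSource
+- PhotonShuffleMapStage
+- PhotonShuffleExchangeSink
+- PhotonGroupingAgg
+- PhotonProject
+- PhotonFilter
+- PhotonAdapter
+- FileScan
"""


def split_by_engine(plan_text):
    """Return (photon_nodes, spark_nodes) in top-down plan order."""
    photon, spark = [], []
    for line in plan_text.splitlines():
        # Drop indentation and the "+- " tree marker, keep the operator name.
        token = line.strip().lstrip("+- ")
        if not token:
            continue
        name = token.split()[0]
        (photon if name.startswith("Photon") else spark).append(name)
    return photon, spark


photon_nodes, spark_nodes = split_by_engine(PLAN)
print(len(photon_nodes), "Photon nodes;", "Spark nodes:", spark_nodes)
```

For this query the remaining Spark nodes are the final sort with its exchange, the columnar-to-row transition, and the file scan, which matches Sort being listed as not yet supported.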
Spark UI
● Yellow → Photon Nodes
● Blue → Spark Nodes
Metrics
● Photon nodes have rich metrics to help
understand behavior and performance
● Easier to read than in Spark, where several nodes
are squashed together
Performance observations
Customer Feedback
Test Date                Average Query Response Time (seconds)   Reduction from Previous
June '20 (DBR v6.6)      7.8
December '20 (Photon)    6.2                                     21%
May '21 (Photon)         4.4                                     29%
Overall: 44% reduction
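The percentages on this slide follow directly from the response times. A short check, using only the three timings shown (the helper name `pct_reduction` is made up for the example):

```python
def pct_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


timings = [("June '20", 7.8), ("December '20", 6.2), ("May '21", 4.4)]

# Step-by-step reductions between consecutive tests.
for (prev_label, prev), (label, cur) in zip(timings, timings[1:]):
    print(f"{prev_label} -> {label}: {pct_reduction(prev, cur):.1f}%")

# Overall reduction from the first test to the last.
overall = pct_reduction(timings[0][1], timings[-1][1])
print(f"overall: {overall:.1f}%")
```

Note that the two step reductions do not add up to the overall figure, because each percentage is taken relative to a different baseline.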
Avg query speedup: 2.5x
Power Test speedup: 3.7x
DEMO
"Demo" - just a walkthrough showing where users
can turn on Photon in Databricks?
Note: From getting started to executing existing
code/queries and monitoring Photon (Spark UI +
Query execution on SQLA)
Logo slide with generalized perf observations
brought down merge latency by 2-3x
Summary
Related Talks
WEDNESDAY
• 03:50 PM (PT): Databricks SQL Analytics Deep Dive for the Data Analyst - Doug Bateman, Databricks
• 04:25 PM (PT): Radical Speed for SQL Queries on Databricks: Photon Under the Hood - Greg Rahn & Alex Behm,
Databricks
• 04:25 PM (PT): Delivering Insights from 20M+ Smart Homes with 500M+ devices - Sameer Vaidya, Plume
THURSDAY
• 11:00 AM (PT): Getting Started with Databricks SQL Analytics - Simon Whiteley, Advancing Analytics
• 03:15 PM (PT): Building Lakehouses on Delta Lake and SQL Analytics - A Primer - Franco Patano, Databricks
FRIDAY
• 10:30 AM (PT): SQL Analytics Powering Telemetry Analysis at Comcast - Suraj Nesamani, Comcast
& Molly Nagamuthu, Databricks
How to get started
In June
databricks.com/try
SQL> SELECT questions FROM audience;
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood

  • 1. Radical Speed for SQL Queries on Databricks: Photon Under the Hood Alex Behm Tech Lead, Databricks Greg Rahn Staff Product Manager, Databricks
  • 2. Agenda ▪ Intro to Photon ▪ Recent Developments ▪ Up Next ▪ Summary
  • 4. Observed Workload Trends Businesses are moving faster, and as a result organizations spend less time in data modeling, leading to worse performance. ▪ Most columns don’t have "NOT NULL" constraints defined ▪ Strings are convenient but slower than specific types ▪ Data lifecycle: Raw → Bronze → Silver → Gold Can we get both agility and performance?
  • 5. -- Data [Analysts | Engineers | Scientists] everywhere Just one more ask: SQL as a first-class citizen on Databricks
  • 6. What is Photon? Photon is a new 100% Apache Spark compatible query engine designed for speed and flexibility. It’s built from the ground up to deliver the fastest performance on modern cloud hardware for all data use cases across data engineering, data science, machine learning, and data analytics.
  • 7. Building the next generation query engine • Re-architected for the fastest performance on real-world applications • Native C++ engine for faster queries • Custom-built memory management to avoid JVM bottlenecks • Vectorized: memory, instruction, and data parallelism (SIMD) • Works with your existing code and avoids vendor lock-in • 100% compatible with open source Spark DataFrame APIs and Spark SQL • Transparent operation to users: no need to invoke something new, it just works • Optimizing for all data use cases and workloads • Today: supporting SQL and DataFrame workloads • Coming soon: Streaming, Data Science, and more
  • 8. Why build a new execution engine?
  • 9. Photon in the Databricks Lakehouse Platform [Diagram: Client submits SQL query → Spark Driver (JVM): ● Parsing ● Catalyst: Analysis/Planning/Optimization ● Scheduling → Spark Executors (mixed JVM/native), each executing tasks → Delta Lake]
  • 10. Key Photon Characteristics • Hybrid Photon/Spark Plans • Use Photon when possible, fall back to Spark for unsupported operations • Completely transparent to users • Native code using off-heap memory • Natural access to memory and intrinsics (no fiddling with Java Unsafe) • No JVM GC, large heaps ok • No JVM JIT performance cliffs / limitations • Fully integrated with Spark’s memory manager • Prefers hash join over sort-merge join • Rich per-operator performance metrics
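The "use Photon when possible, fall back to Spark" rule can be illustrated with a small sketch. It is not Photon's planner: the operator names, the `SUPPORTED` set, and the assumption that a plan switches engines only once (everything above the first unsupported operator stays in Spark) are all illustrative choices.

```python
from typing import List, Set, Tuple

# Hypothetical set of operators the native engine supports (illustrative only).
SUPPORTED: Set[str] = {"FileScan", "Filter", "Project", "HashAggregate", "Exchange"}


def assign_engines(ops_bottom_up: List[str], supported: Set[str]) -> List[Tuple[str, str]]:
    """Tag each operator of a linear plan (leaf first) with the engine that runs it.

    Assumption: once an unsupported operator is reached, it and everything
    above it run in Spark, so the plan changes engines at most once.
    """
    tagged = []
    in_photon = True
    for op in ops_bottom_up:
        if in_photon and op not in supported:
            in_photon = False  # fall back here and stay in Spark
        tagged.append((op, "photon" if in_photon else "spark"))
    return tagged


plan = ["FileScan", "Filter", "Project", "HashAggregate", "Exchange", "HashAggregate", "Sort"]
for op, engine in assign_engines(plan, SUPPORTED):
    print(f"{op:14s} -> {engine}")
```

With `Sort` outside the supported set, only the top of this plan is tagged for Spark, while the scan, filter, and aggregation below it stay native.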
  • 12. Development Focus Areas 1. Production Readiness a. Goal: Resilience comparable to DBR → spilling support b. Testing and hardening, real customer workloads 2. Query Coverage a. Today: Basics like joins/aggregations/shuffle, common types and functions b. In development: Nested types, built-in functions c. Coming soon: Sort/Window 3. Performance a. Analyze and optimize common usage patterns
  • 13. Disclaimer: Microbenchmarks Microbenchmarks do not necessarily reflect real-world end-to-end performance During Photon development we analyze and optimize performance with extensive microbenchmarks In the following slides, we share benchmark results that were run in controlled and narrowly scoped scenarios
  • 14. Resilience with Very Large Inputs • Spilling for very large inputs • Write intermediate state to external storage to process inputs exceeding available memory ✅ Hash Shuffle ✅ Hash Aggregation ✅ Hash Join 2-5x Speedup
  • 15. Example: Spilling Hash Join [1 of 4] • Hash join has two phases: build and probe • Build phase: insert records from one join input into the hash table • Hash table has a fixed number of partitions [Figure: partitioned hash table]
  • 16. Example: Spilling Hash Join [2 of 4] • When memory runs out, spill one partition to disk • New records go to in-memory partitions or straight to disk • Repeat until build is done [Figure: partitioned hash table]
  • 17. Example: Spilling Hash Join [3 of 4] • Probe phase: process rows from other join input • Emit results for probe rows matching in-memory build partitions • Spill probe rows matching a spilled build partition [Figure: build and probe inputs against the partitioned hash table]
  • 18. Example: Spilling Hash Join [4 of 4] • For each spilled partition, repeat the same build/probe process • Might spill again! Apply the same algorithm recursively [Figure: build ⨝ probe]
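The four steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Photon's C++ implementation: "disk" is a Python list, the memory budget is a row count (`max_rows`), the spill victim is the largest resident partition, and a `max_depth` guard (which simply stops spilling) is added so heavily skewed keys cannot recurse forever. All names are hypothetical.

```python
from collections import defaultdict


def spilling_hash_join(build, probe, num_partitions=4, max_rows=4, depth=0, max_depth=8):
    """Inner equi-join of (key, payload) rows with simulated spilling."""

    def part(key):
        # Mix in the recursion depth so spilled partitions are split differently.
        return hash((key, depth)) % num_partitions

    resident = {p: defaultdict(list) for p in range(num_partitions)}
    sizes = [0] * num_partitions
    spilled_build = {}  # partition -> rows written to "disk"
    in_memory = 0

    # Build phase: insert rows, spilling whole partitions when over budget.
    for key, payload in build:
        p = part(key)
        if p in spilled_build:
            spilled_build[p].append((key, payload))  # straight to disk
            continue
        resident[p][key].append(payload)
        sizes[p] += 1
        in_memory += 1
        while in_memory > max_rows and depth < max_depth:
            victim = max(resident, key=lambda q: sizes[q])
            table = resident.pop(victim)
            spilled_build[victim] = [(k, v) for k, vs in table.items() for v in vs]
            in_memory -= sizes[victim]
            sizes[victim] = 0

    # Probe phase: join against resident partitions, spill the rest.
    out = []
    spilled_probe = defaultdict(list)
    for key, payload in probe:
        p = part(key)
        if p in spilled_build:
            spilled_probe[p].append((key, payload))
        else:
            for build_payload in resident[p].get(key, ()):
                out.append((key, build_payload, payload))

    # Repeat the same build/probe process for every spilled partition.
    for p, build_rows in spilled_build.items():
        out.extend(spilling_hash_join(build_rows, spilled_probe[p],
                                      num_partitions, max_rows, depth + 1, max_depth))
    return out


build = [(i % 7, f"b{i}") for i in range(20)]
probe = [(i % 10, f"p{i}") for i in range(15)]
joined = spilling_hash_join(build, probe)
print(len(joined), "joined rows")
```

The sketch keeps build and probe roles fixed during recursion; the role reversal mentioned on the next slide is left out for brevity.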
  • 19. Spilling Hash Join vs. Spilling Sort-Merge Join • Photon converts Sort-Merge Joins to Hash Joins • Sort Merge Join • Buffer + sort both join inputs, increasing memory pressure • Spilling sort → write entire input to sorted runs • Hash Join • Only buffer build input (typically the smaller input) in a hash table • Graceful degradation: Spill both inputs at the build-partition granularity • Role reversal: Swap build/probe when processing spilled partitions Up to 5x Speedup
  • 20. Hardening: How we test Photon • Random queries and data • Using new open-source Spark random query generator • Failure injection • Randomly trip error paths to ensure graceful query failure • Spill injection • Randomly trigger spill events to simulate memory pressure • Clang/LLVM C++ tools • Address Sanitizer • Undefined Behavior Sanitizer • Combinations of the above
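Spill injection combined with random data can be illustrated with a toy group-by SUM. The sketch below is an assumption-laden stand-in for the real test harness: `sum_by_key`, its `should_spill` hook, and the 20% injection rate are hypothetical, and spill files are simulated with in-memory partial aggregates.

```python
import random
from collections import Counter


def sum_by_key(rows, should_spill=lambda: False):
    """Group-by SUM that flushes its in-memory state whenever the hook says so."""
    in_memory = Counter()
    spilled_runs = []  # stands in for spill files
    for key, value in rows:
        in_memory[key] += value
        if should_spill():
            spilled_runs.append(in_memory)
            in_memory = Counter()
    # Merge spilled partial aggregates with whatever is still in memory.
    total = Counter()
    for partial in spilled_runs + [in_memory]:
        total.update(partial)  # Counter.update adds values per key
    return dict(total)


def reference_sum(rows):
    """Straightforward implementation used as the oracle."""
    expected = {}
    for key, value in rows:
        expected[key] = expected.get(key, 0) + value
    return expected


def fuzz(seed, trials=50):
    """Random data plus randomly injected spills; results must match the oracle."""
    rng = random.Random(seed)
    for _ in range(trials):
        rows = [(rng.randrange(5), rng.randrange(100)) for _ in range(rng.randrange(50))]
        actual = sum_by_key(rows, should_spill=lambda: rng.random() < 0.2)
        assert actual == reference_sum(rows), (seed, rows)
    return trials


print(fuzz(seed=1), "trials passed")
```

Seeding the generator keeps any failing case reproducible, which matters when the injected events are random.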
  • 22. Overview of Query Coverage • Data Types: ✅ Byte/Short/Int/Long ✅ Boolean ✅ String/Binary ✅ Decimal ✅ Float/Double ✅ Date/Timestamp ✅ Struct; coming soon: Array, Map • Operators: ✅ Scan, Filter, Project ✅ Hash Aggregate/Join/Shuffle ✅ Nested-Loop Join ✅ Null-Aware Anti Join ✅ Union, Expand, ScalarSubquery; coming soon: Sort, Window • Expressions: ✅ Comparison / Logic ✅ Arithmetic / Math (most) ✅ Conditional (IF, CASE, etc.) ✅ String (common ones) ✅ Casts ✅ Aggregates (most common ones) ✅ Date/Timestamp (in progress); coming soon: UDFs, long tail
  • 23. Expression Coverage for DATE/TIMESTAMP • Many queries contain date/timestamp logic • As of today: 95% coverage (100% very soon) • Fast path for UTC timezone (default) • Some expressions are very complicated to implement • Individual functions run in Spark, but still run the operator/plan in Photon
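Why a UTC fast path helps can be seen in a small sketch. It assumes Spark's usual encodings (timestamps as microseconds since the Unix epoch, dates as days since the epoch); the function names are hypothetical and the code is not taken from Photon.

```python
from datetime import datetime, timedelta, timezone, tzinfo

MICROS_PER_DAY = 86_400_000_000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_date_utc_fast(ts_micros: int) -> int:
    """UTC fast path: one integer floor division, no calendar or offset lookup."""
    return ts_micros // MICROS_PER_DAY


def to_date_general(ts_micros: int, tz: tzinfo) -> int:
    """General path: shift into the target zone, then take the local calendar date."""
    local = (EPOCH + timedelta(microseconds=ts_micros)).astimezone(tz)
    return (local.date() - EPOCH.date()).days


def to_date(ts_micros: int, tz: tzinfo = timezone.utc) -> int:
    """Dispatch to the fast path when the session time zone is UTC."""
    if tz is timezone.utc:
        return to_date_utc_fast(ts_micros)
    return to_date_general(ts_micros, tz)


ts = 1_622_505_600_000_000  # midnight UTC on some day
print(to_date(ts), to_date(ts, timezone(timedelta(hours=-7))))
```

The general path has to consult the zone's offset for every value, which is the kind of per-row work the fast path avoids.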
  • 24. Microbenchmarks do not necessarily reflect speedups on end-to-end queries, functions optimized for UTC timezone, your mileage may vary
  • 25. Nested/Complex Type Support • ✅ Struct • Array / Map, in active development • Reading data and basic usage/functions work • In progress: collect_list() / collect_set() • Long tail of array expressions
  • 26. Microbenchmarks do not necessarily reflect speedups on end-to-end queries, your mileage may vary
  • 27. Writing Delta/Parquet Data • Currently supports all scalar types and Struct • Array/Map in active development • Can be turned on/off independently of Photon • spark.databricks.photon.parquetWriter.enabled = true • Typical speedups: 2-4x • Wider (>100 columns) tables can see even more gains
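A minimal sketch of toggling the writer flag named on this slide. It assumes a session object that exposes `conf.set(key, value)` the way a PySpark `SparkSession` does, and that the flag can be changed at session level rather than only in cluster configuration; neither is confirmed by the slides. The helper name and the stand-in session are hypothetical.

```python
from types import SimpleNamespace

# Flag name taken from the slide above.
PARQUET_WRITER_FLAG = "spark.databricks.photon.parquetWriter.enabled"


def set_photon_parquet_writer(spark, enabled: bool) -> None:
    """Toggle the Photon Parquet writer on a session-like object.

    Assumption: `spark` exposes `conf.set(key, value)` as a PySpark
    SparkSession does, and the flag is honored at session level.
    """
    spark.conf.set(PARQUET_WRITER_FLAG, "true" if enabled else "false")


# Stand-in for a real session so the sketch runs without a cluster.
settings = {}
fake_spark = SimpleNamespace(conf=SimpleNamespace(set=settings.__setitem__))
set_photon_parquet_writer(fake_spark, True)
print(settings)
```

In a notebook, the predefined `spark` session would take the place of `fake_spark`.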
  • 28. DML Support [DELETE / UPDATE / MERGE] • Bulk of work like joins/aggregations run in Photon • Benefits from Photon Delta/Parquet writing capability • Typical speedups: 2-3x ANSI SQL Support • Development in tandem with open-source Spark • Fail queries on overflow or similar errors
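"Fail queries on overflow" can be made concrete with 32-bit integer addition. The sketch below contrasts wrap-around arithmetic with an ANSI-style check that raises instead; `add_int32` is a hypothetical helper, not an engine API.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def add_int32(a: int, b: int, ansi: bool) -> int:
    """Add two 32-bit integers, either failing or wrapping on overflow."""
    result = a + b
    if INT_MIN <= result <= INT_MAX:
        return result
    if ansi:
        # ANSI-style behavior: surface the error instead of a wrong answer.
        raise ArithmeticError(f"integer overflow: {a} + {b}")
    # Non-ANSI behavior: wrap around in two's complement.
    return (result - INT_MIN) % 2**32 + INT_MIN


print(add_int32(INT_MAX, 1, ansi=False))  # wraps silently
try:
    add_int32(INT_MAX, 1, ansi=True)
except ArithmeticError as err:
    print("query fails:", err)
```

The design trade-off is correctness versus leniency: wrapping never fails a query but can return silently wrong totals.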
  • 30. Current/Up Next Efforts in Photon • Finishing nested type support, including writes • Outstanding ANSI SQL behaviors • Sort and Window operators • Support for bucketed tables
  • 31. How to use Photon today
  • 32. Photon: Key Use Cases for Preview • SQL Data Engineering / ELT / ETL (June) ● Enable Photon via Workspace cluster ● Notebook or JAR ● Available on: AWS ● Not supported yet ○ UDFs ○ Streaming • Interactive SQL Analytics (June) ● Photon via Databricks SQL ● Redash ● Tableau ● Microsoft Power BI ● BYO Tool via ODBC / JDBC ● Available on: AWS, Azure ● Not supported yet ○ Sort ○ Window
  • 34. Example query:
        SELECT vendor_id,
               SUM(trip_distance) AS SumTripDistance,
               AVG(trip_distance) AS AvgTripDistance
        FROM abehm.nyc_yellow
        WHERE passenger_count IN (1, 2, 4)
        GROUP BY vendor_id
        ORDER BY vendor_id
    Spark plan:
        Sort
        +- Exchange rangepartitioning
           +- HashAggregate
              +- Exchange hashpartitioning
                 +- HashAggregate
                    +- Project
                       +- Filter
                          +- ColumnarToRow
                             +- FileScan
    Photon plan:
        Sort
        +- Exchange
           +- ColumnarToRow
              +- PhotonResultStage
                 +- PhotonGroupingAgg
                    +- PhotonShuffleExchangeSource
                       +- PhotonShuffleMapStage
                          +- PhotonShuffleExchangeSink
                             +- PhotonGroupingAgg
                                +- PhotonProject
                                   +- PhotonFilter
                                      +- PhotonAdapter
                                         +- FileScan
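One way to read such plans is to count how many operators carry the `Photon` prefix. The sketch below works on the two plan strings from this slide; the helper `photon_share` and the idea of using a name prefix as the signal are illustrative assumptions, not a Databricks API.

```python
SPARK_PLAN = (
    "Sort +- Exchange rangepartitioning +- HashAggregate +- Exchange hashpartitioning "
    "+- HashAggregate +- Project +- Filter +- ColumnarToRow +- FileScan"
)
PHOTON_PLAN = (
    "Sort +- Exchange +- ColumnarToRow +- PhotonResultStage +- PhotonGroupingAgg "
    "+- PhotonShuffleExchangeSource +- PhotonShuffleMapStage +- PhotonShuffleExchangeSink "
    "+- PhotonGroupingAgg +- PhotonProject +- PhotonFilter +- PhotonAdapter +- FileScan"
)


def photon_share(plan: str):
    """Return (photon_nodes, total_nodes) for a '+-' separated plan string."""
    # Keep only the operator name, dropping arguments such as 'rangepartitioning'.
    nodes = [chunk.split()[0] for chunk in plan.split("+-") if chunk.strip()]
    photon = [name for name in nodes if name.startswith("Photon")]
    return len(photon), len(nodes)


for label, plan in (("Spark", SPARK_PLAN), ("Photon", PHOTON_PLAN)):
    photon, total = photon_share(plan)
    print(f"{label} plan: {photon} of {total} operators are Photon nodes")
```

In the second plan the non-Photon operators sit at the top (`Sort`, `Exchange`, `ColumnarToRow`), consistent with Sort being listed as not yet supported.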
  • 35. Spark UI ● Yellow → Photon Nodes ● Blue → Spark Nodes Metrics ● Photon nodes have rich metrics to help understand behavior and performance ● Easier than Spark where several nodes are squashed together
  • 38. Customer Feedback: average query response time (seconds) by test date • June '20, DBR v6.6: 7.8 • December '20, Photon: 6.2 (21% reduction from previous) • May '21, Photon: 4.4 (29% reduction from previous) • Overall: 44% reduction
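The percentages on this slide follow from the response times by simple arithmetic, as the short check below shows. The helper name is hypothetical; the numbers are the ones from the slide.

```python
def reduction_pct(before: float, after: float) -> int:
    """Percentage reduction from `before` to `after`, rounded to a whole percent."""
    return round((before - after) / before * 100)


dbr_june_20, photon_dec_20, photon_may_21 = 7.8, 6.2, 4.4  # seconds, from the slide

print(reduction_pct(dbr_june_20, photon_dec_20))    # June '20 to December '20
print(reduction_pct(photon_dec_20, photon_may_21))  # December '20 to May '21
print(reduction_pct(dbr_june_20, photon_may_21))    # overall
```

Note that the overall figure is computed from the first and last measurements; it is not the sum of the two step-wise reductions.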
  • 40. Demo: a walkthrough showing where users can turn on Photon in Databricks, from getting started to executing existing code/queries and monitoring Photon (Spark UI and query execution in SQL Analytics)
  • 41. Logo slide with generalized performance observations, e.g. "brought down merge latency by 2-3x"
  • 43. Related Talks WEDNESDAY • 03:50 PM (PT): Databricks SQL Analytics Deep Dive for the Data Analyst - Doug Bateman, Databricks • 04:25 PM (PT): Radical Speed for SQL Queries on Databricks: Photon Under the Hood - Greg Rahn & Alex Behm, Databricks • 04:25 PM (PT): Delivering Insights from 20M+ Smart Homes with 500M+ devices - Sameer Vaidya, Plume THURSDAY • 11:00 AM (PT): Getting Started with Databricks SQL Analytics - Simon Whiteley, Advancing Analytics • 03:15 PM (PT): Building Lakehouses on Delta Lake and SQL Analytics - A Primer - Franco Patano, Databricks FRIDAY • 10:30 AM (PT): SQL Analytics Powering Telemetry Analysis at Comcast - Suraj Nesamani, Comcast & Molly Nagamuthu, Databricks
  • 44. How to get started In June databricks.com/try
  • 45. SQL> SELECT questions FROM audience;
  • 46. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.