SlideShare a Scribd company logo
WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Rui Jian, Hao Lin, Facebook Inc.
rjian@fb.com, hlin@fb.com
Tangram: Distributed Scheduling
Framework for Apache Spark at
Facebook
#UnifiedAnalytics #SparkAISummit
About Us
• Rui Jian
– Software Engineer at Facebook (Data Warehouse & Graph Indexing)
– Master of Computer Science (Shanghai Jiao Tong university)
• Hao Lin
– Research scientist at Facebook (Data Warehouse Batch Scheduling)
– PhD in Parallel Computing (Purdue ECE)
3#UnifiedAnalytics #SparkAISummit
Agenda
• Overview
• Tangram Architecture
• Scheduling Policies & Resource Allocation
• Future work
4#UnifiedAnalytics #SparkAISummit
What is Tangram?
The scheduling platform for
• reliably running various batch workloads
• with efficient heterogenous resource management
• at scale
5#UnifiedAnalytics #SparkAISummit
Tangram Scheduling Targets
• Single jobs: adhoc/periodic
• Batch jobs: adhoc/periodic, malleable
• Gang jobs: adhoc/periodic, rigid
• Long-running jobs: steady and regular; e.g. online training
6#UnifiedAnalytics #SparkAISummit
Why Tangram?
• Various workload characteristics
– ML
– Apache Spark
– Apache Giraph
– Single jobs
• Customized scheduling policies
• Scalability
– Fleet size: hundreds of thousands worker nodes
– Job scheduling throughput: hundreds of millions jobs per day
7#UnifiedAnalytics #SparkAISummit
Overview
• What is Tangram?
8#UnifiedAnalytics #SparkAISummit
Admin
Job Manager
DB
ML
Resource
Manager
Master
Agent AgentAgent
Single
Job
Gang Job
ML Elastic
Scheduler
1
2
3
4
5
6
SQL query
Giraph
Spark
Client Library
9#UnifiedAnalytics #SparkAISummit
• Job management
• Request/Release resources
• Resource grant
• Preemption notification
• Launch containers
• Container status change event
Tangram
client
Resource
Manager
Agent
Application
1
2
3
4
5
6
Agent
• Report schedulable resources and runtime usage
• Health check reports
• Detect labels
• Launch/Kill Containers
• Container recovery
• Resource isolation with cgroup v2
10#UnifiedAnalytics #SparkAISummit
Failure Recovery
• Agent failure
– Scan the recovery directory and recover the running containers
• RM failure
– Both agent and client hold off communication to the RM until the new
master shows up
– Client sync session info to the new master to help it build the states
– Agents add them to the new master
11#UnifiedAnalytics #SparkAISummit
Scheduling Policies
• Hierarchical queue structure
• Jobs to be queued on leaves
• Queue configs:
– min/max resources
– Policy:
• FIFO
• Dominant Resource Fairness (DRF)
• User fairness
• Global
• …
12#UnifiedAnalytics #SparkAISummit
/
ads feed
pipelines interactive
Job
DRF
DRF DRF
User FairnessFIFO
20%80%
50% 50%
user1 user2
50% 50%
FIFO FIFO
Job
Job Job
Scheduling Policies
• Jobs ordered by priority, submission time within queue
• Gang job as first class in scheduling and resource allocation
• Lookahead scheduling for better throughput and utilization
• Job starvation prevention
13#UnifiedAnalytics #SparkAISummit
Gang 200 Gang 20 Single Gang 4 Single
Resource Allocation
• Fine-grained resource specification:
– {cpuMilliCores: 3000, memoryBytes: 200GB}
• Constraints:
– “dataCenter = dc1 & type in [1,2] & kernelVersion > 4.10”
• Job Affinity:
– inSameDatacenter
14#UnifiedAnalytics #SparkAISummit
Resource Allocation
15#UnifiedAnalytics #SparkAISummit
Prefetched
Host Cache
• Bypass the
steps of
host
filtering
and
scoring
• Speedup
allocation
process
Host Filtering
• Hard &
Soft
constraints
• Resource
constraint
• Label
constraint
• Job affinity
Host Scoring
and Ordering
• Packing
efficiency
• Host
healthiness
• Data
locality
Commit
Allocation
• Book
keeping
resources
• Update
cluster &
queue
parameters
Constraint-based Scheduling
• Machine type
• Datacenter
• Region
• CPU architecture
• Host prefix
• …
16#UnifiedAnalytics #SparkAISummit
Merged host pool - type 1 & 2
Job
Job
Job
Host 1
Host 2
Host 3
Host 4
Host 5
Labeled with
{”type”:”2”}
Labeled with
{”type”:”1”}
Job Job
Job constraint:
type=2
Job constraint:
type=1
Queue
Preemption
• Guarantee resource availability SLO within and across queues
• Identify the starving jobs and overallocated jobs
• Minimize preemption cost: two-phase protocol
– Only candidates appearing in both phases will be preempted
– Resource Manager notifies client with preemption intent s.t. necessary action can
be taken, e.g. checkpointing
17#UnifiedAnalytics #SparkAISummit
Cross Datacenter Scheduling
• The growing demand of computation and storage for Hive tables
spans across data centers
• Stranded capacity with imbalanced load
• Poor data locality and waste of network bandwidth
• Slow reaction to recover from crisis and disaster
18#UnifiedAnalytics #SparkAISummit
Cross Datacenter Scheduling
• Dispatcher Proxy
– Monitors resource consumption
across data centers
– Decides the Resource Manager
for scheduling jobs
– Provides location hints to the
Resource Manager for
enforcement
• Planner
– Decides where the data will be
replaced based on utilization and
available resources
19#UnifiedAnalytics #SparkAISummit
Datacenter 1 Datacenter 2 Datacenter 3
Resource Manager
1
Resource Manager
2
Dispatcher
Job
Cross Datacenter Scheduling
• Dispatcher Proxy
– Monitors resource consumption
across data centers
– Decides the Resource Manager
for scheduling jobs
– Provides location hints to the
Resource Manager for
enforcement
• Planner
– Decides where the data will be
replaced based on utilization and
available resources
20#UnifiedAnalytics #SparkAISummit
Datacenter 1 Datacenter 2 Datacenter 3
Resource Manager
1
Resource Manager
2
Dispatcher
Job
Job constraint:
datacenter=1
Cross Datacenter Scheduling
• Dispatcher Proxy
– Monitors resource consumption
across data centers
– Decides the Resource Manager
for scheduling jobs
– Provides location hints to the
Resource Manager for
enforcement
• Planner
– Decides where the data will be
replaced based on utilization and
available resources
21#UnifiedAnalytics #SparkAISummit
Datacenter 1 Datacenter 2 Datacenter 3
Resource Manager
1
Resource Manager
2
Dispatcher
Job
Job constraint:
datacenter=1
Table DataTable Data
Future Work
• Mix workloads managed by one resource manager
• Run batch workloads with off-peak resources from online services
• Automatic resource tuning for high utilization
• We’re hiring! Contact: rjian@fb.com
22#UnifiedAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
Ad

More Related Content

What's hot (20)

Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
Spark Summit
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
Andrew Lamb
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
Flink Forward
 
Redis at LINE
Redis at LINERedis at LINE
Redis at LINE
LINE Corporation
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Cloudera, Inc.
 
Kafka At Scale in the Cloud
Kafka At Scale in the CloudKafka At Scale in the Cloud
Kafka At Scale in the Cloud
confluent
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
Hortonworks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Redis
RedisRedis
Redis
imalik8088
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
datamantra
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Extreme Replication - Performance Tuning Oracle GoldenGate
Extreme Replication - Performance Tuning Oracle GoldenGateExtreme Replication - Performance Tuning Oracle GoldenGate
Extreme Replication - Performance Tuning Oracle GoldenGate
Bobby Curtis
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
IBM
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
Ververica
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
Andrew Lamb
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
Flink Forward
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Cloudera, Inc.
 
Kafka At Scale in the Cloud
Kafka At Scale in the CloudKafka At Scale in the Cloud
Kafka At Scale in the Cloud
confluent
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
Hortonworks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
datamantra
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Extreme Replication - Performance Tuning Oracle GoldenGate
Extreme Replication - Performance Tuning Oracle GoldenGateExtreme Replication - Performance Tuning Oracle GoldenGate
Extreme Replication - Performance Tuning Oracle GoldenGate
Bobby Curtis
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
IBM
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
Ververica
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 

Similar to Tangram: Distributed Scheduling Framework for Apache Spark at Facebook (20)

Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay RaiConquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Databricks
 
High Performance Deep learning with Apache Spark
High Performance Deep learning with Apache SparkHigh Performance Deep learning with Apache Spark
High Performance Deep learning with Apache Spark
Rui Liu
 
ngs07.data-center.ssadasdasdasdlides.pptx
ngs07.data-center.ssadasdasdasdlides.pptxngs07.data-center.ssadasdasdasdlides.pptx
ngs07.data-center.ssadasdasdasdlides.pptx
dusanheliant
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
WorDS of Data Science in the Presence of Heterogenous Computing Architectures
WorDS of Data Science in the Presence of Heterogenous Computing ArchitecturesWorDS of Data Science in the Presence of Heterogenous Computing Architectures
WorDS of Data Science in the Presence of Heterogenous Computing Architectures
Ilkay Altintas, Ph.D.
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
MohammedShahid562503
 
Webinar: Capacity Planning
Webinar: Capacity PlanningWebinar: Capacity Planning
Webinar: Capacity Planning
MongoDB
 
MongoDB Capacity Planning
MongoDB Capacity PlanningMongoDB Capacity Planning
MongoDB Capacity Planning
Norberto Leite
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
DATAVERSITY
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Wee Hyong Tok
 
Unit ii sem-v-hadoop
Unit ii  sem-v-hadoopUnit ii  sem-v-hadoop
Unit ii sem-v-hadoop
DrChitraDhawale
 
DoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics PlatformDoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics Platform
martinbpeters
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Perficient, Inc.
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
aftab alam
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected Ways
Databricks
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
Databricks
 
Infrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache SparkInfrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache Spark
Databricks
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Precisely
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
Tao Feng
 
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay RaiConquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Databricks
 
High Performance Deep learning with Apache Spark
High Performance Deep learning with Apache SparkHigh Performance Deep learning with Apache Spark
High Performance Deep learning with Apache Spark
Rui Liu
 
ngs07.data-center.ssadasdasdasdlides.pptx
ngs07.data-center.ssadasdasdasdlides.pptxngs07.data-center.ssadasdasdasdlides.pptx
ngs07.data-center.ssadasdasdasdlides.pptx
dusanheliant
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
WorDS of Data Science in the Presence of Heterogenous Computing Architectures
WorDS of Data Science in the Presence of Heterogenous Computing ArchitecturesWorDS of Data Science in the Presence of Heterogenous Computing Architectures
WorDS of Data Science in the Presence of Heterogenous Computing Architectures
Ilkay Altintas, Ph.D.
 
Webinar: Capacity Planning
Webinar: Capacity PlanningWebinar: Capacity Planning
Webinar: Capacity Planning
MongoDB
 
MongoDB Capacity Planning
MongoDB Capacity PlanningMongoDB Capacity Planning
MongoDB Capacity Planning
Norberto Leite
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
DATAVERSITY
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Wee Hyong Tok
 
DoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics PlatformDoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics Platform
martinbpeters
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Perficient, Inc.
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
aftab alam
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected Ways
Databricks
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
Databricks
 
Infrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache SparkInfrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache Spark
Databricks
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Precisely
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
Tao Feng
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

Process Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - JourneyProcess Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - Journey
Process mining Evangelist
 
Transforming health care with ai powered
Transforming health care with ai poweredTransforming health care with ai powered
Transforming health care with ai powered
gowthamarvj
 
Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]
globibo
 
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
Taqyea
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
problem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursingproblem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursing
vishnudathas123
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Process Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce DowntimeProcess Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce Downtime
Process mining Evangelist
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
Sets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledgeSets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledge
saumyasl2020
 
AWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdfAWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdf
philsparkshome
 
Multi-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline OrchestrationMulti-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline Orchestration
Romi Kuntsman
 
report (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhsreport (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhs
AngelPinedaTaguinod
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
Transforming health care with ai powered
Transforming health care with ai poweredTransforming health care with ai powered
Transforming health care with ai powered
gowthamarvj
 
Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]
globibo
 
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
Taqyea
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
problem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursingproblem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursing
vishnudathas123
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Process Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce DowntimeProcess Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce Downtime
Process mining Evangelist
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
Sets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledgeSets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledge
saumyasl2020
 
AWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdfAWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdf
philsparkshome
 
Multi-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline OrchestrationMulti-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline Orchestration
Romi Kuntsman
 
report (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhsreport (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhs
AngelPinedaTaguinod
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 

Tangram: Distributed Scheduling Framework for Apache Spark at Facebook

  • 1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
  • 2. Rui Jian, Hao Lin, Facebook Inc. rjian@fb.com, hlin@fb.com Tangram: Distributed Scheduling Framework for Apache Spark at Facebook #UnifiedAnalytics #SparkAISummit
  • 3. About Us • Rui Jian – Software Engineer at Facebook (Data Warehouse & Graph Indexing) – Master of Computer Science (Shanghai Jiao Tong university) • Hao Lin – Research scientist at Facebook (Data Warehouse Batch Scheduling) – PhD in Parallel Computing (Purdue ECE) 3#UnifiedAnalytics #SparkAISummit
  • 4. Agenda • Overview • Tangram Architecture • Scheduling Policies & Resource Allocation • Future work 4#UnifiedAnalytics #SparkAISummit
  • 5. What is Tangram? The scheduling platform for • reliably running various batch workloads • with efficient heterogenous resource management • at scale 5#UnifiedAnalytics #SparkAISummit
  • 6. Tangram Scheduling Targets • Single jobs: adhoc/periodic • Batch jobs: adhoc/periodic, malleable • Gang jobs: adhoc/periodic, rigid • Long-running jobs: steady and regular; e.g. online training 6#UnifiedAnalytics #SparkAISummit
  • 7. Why Tangram? • Various workload characteristics – ML – Apache Spark – Apache Giraph – Single jobs • Customized scheduling policies • Scalability – Fleet size: hundreds of thousands worker nodes – Job scheduling throughput: hundreds of millions jobs per day 7#UnifiedAnalytics #SparkAISummit
  • 8. Overview • What is Tangram? 8#UnifiedAnalytics #SparkAISummit Admin Job Manager DB ML Resource Manager Master Agent AgentAgent Single Job Gang Job ML Elastic Scheduler 1 2 3 4 5 6 SQL query Giraph Spark
  • 9. Client Library 9#UnifiedAnalytics #SparkAISummit • Job management • Request/Release resources • Resource grant • Preemption notification • Launch containers • Container status change event Tangram client Resource Manager Agent Application 1 2 3 4 5 6
  • 10. Agent • Report schedulable resources and runtime usage • Health check reports • Detect labels • Launch/Kill Containers • Container recovery • Resource isolation with cgroup v2 10#UnifiedAnalytics #SparkAISummit
  • 11. Failure Recovery • Agent failure – Scan the recovery directory and recover the running containers • RM failure – Both agent and client hold off communication to the RM until the new master shows up – Client sync session info to the new master to help it build the states – Agents add them to the new master 11#UnifiedAnalytics #SparkAISummit
  • 12. Scheduling Policies • Hierarchical queue structure • Jobs to be queued on leaves • Queue configs: – min/max resources – Policy: • FIFO • Dominant Resource Fairness (DRF) • User fairness • Global • … 12#UnifiedAnalytics #SparkAISummit / ads feed pipelines interactive Job DRF DRF DRF User FairnessFIFO 20%80% 50% 50% user1 user2 50% 50% FIFO FIFO Job Job Job
  • 13. Scheduling Policies • Jobs ordered by priority, submission time within queue • Gang job as first class in scheduling and resource allocation • Lookahead scheduling for better throughput and utilization • Job starvation prevention 13#UnifiedAnalytics #SparkAISummit Gang 200 Gang 20 Single Gang 4 Single
  • 14. Resource Allocation • Fine-grained resource specification: – {cpuMilliCores: 3000, memoryBytes: 200GB} • Constraints: – “dataCenter = dc1 & type in [1,2] & kernelVersion > 4.10” • Job Affinity: – inSameDatacenter 14#UnifiedAnalytics #SparkAISummit
  • 15. Resource Allocation 15#UnifiedAnalytics #SparkAISummit Prefetched Host Cache • Bypass the steps of host filtering and scoring • Speedup allocation process Host Filtering • Hard & Soft constraints • Resource constraint • Label constraint • Job affinity Host Scoring and Ordering • Packing efficiency • Host healthiness • Data locality Commit Allocation • Book keeping resources • Update cluster & queue parameters
  • 16. Constraint-based Scheduling • Machine type • Datacenter • Region • CPU architecture • Host prefix • … 16#UnifiedAnalytics #SparkAISummit Merged host pool - type 1 & 2 Job Job Job Host 1 Host 2 Host 3 Host 4 Host 5 Labeled with {”type”:”2”} Labeled with {”type”:”1”} Job Job Job constraint: type=2 Job constraint: type=1 Queue
  • 17. Preemption • Guarantee resource availability SLO within and across queues • Identify the starving jobs and overallocated jobs • Minimize preemption cost: two-phase protocol – Only candidates appearing in both phases will be preempted – Resource Manager notifies client with preemption intent s.t. necessary action can be taken, e.g. checkpointing 17#UnifiedAnalytics #SparkAISummit
  • 18. Cross Datacenter Scheduling • The growing demand of computation and storage for Hive tables spans across data centers • Stranded capacity with imbalanced load • Poor data locality and waste of network bandwidth • Slow reaction to recover from crisis and disaster 18#UnifiedAnalytics #SparkAISummit
  • 19. Cross Datacenter Scheduling • Dispatcher Proxy – Monitors resource consumption across data centers – Decides the Resource Manager for scheduling jobs – Provides location hints to the Resource Manager for enforcement • Planner – Decides where the data will be replaced based on utilization and available resources 19#UnifiedAnalytics #SparkAISummit Datacenter 1 Datacenter 2 Datacenter 3 Resource Manager 1 Resource Manager 2 Dispatcher Job
  • 20. Cross Datacenter Scheduling • Dispatcher Proxy – Monitors resource consumption across data centers – Decides the Resource Manager for scheduling jobs – Provides location hints to the Resource Manager for enforcement • Planner – Decides where the data will be replaced based on utilization and available resources 20#UnifiedAnalytics #SparkAISummit Datacenter 1 Datacenter 2 Datacenter 3 Resource Manager 1 Resource Manager 2 Dispatcher Job Job constraint: datacenter=1
  • 21. Cross Datacenter Scheduling • Dispatcher Proxy – Monitors resource consumption across data centers – Decides the Resource Manager for scheduling jobs – Provides location hints to the Resource Manager for enforcement • Planner – Decides where the data will be replaced based on utilization and available resources 21#UnifiedAnalytics #SparkAISummit Datacenter 1 Datacenter 2 Datacenter 3 Resource Manager 1 Resource Manager 2 Dispatcher Job Job constraint: datacenter=1 Table DataTable Data
  • 22. Future Work • Mix workloads managed by one resource manager • Run batch workloads with off-peak resources from online services • Automatic resource tuning for high utilization • We’re hiring! Contact: rjian@fb.com 22#UnifiedAnalytics #SparkAISummit
  • 23. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT
  翻译: