SlideShare a Scribd company logo
Handling	data	skew
adaptively	in Spark using
Dynamic	Repartitioning
Zoltán	Zvara
zoltan.zvara@sztaki.hu
Introduction
• Hungarian	Academy	of	Sciences,	Institute	for	Computer	Science	and	
Control (MTA	SZTAKI)
• Research	institute	with	strong	industry	ties
• Big	Data	projects	using	Spark,	Flink,	Cassandra,	Hadoop	etc.
• Multiple	telco	use	cases	lately,	with	challenging	data	volume	and	
distribution
Agenda
• Our	data-skew	story
• Problem	definitions	&	aims
• Dynamic	Repartitioning
• Architecture
• Component	breakdown
• Repartitioning	 mechanism
• Benchmark	results
• Visualization
• Conclusion
Motivation
• We	have	developed	an	application	aggregating	telco	data	that	tested	
well	on	toy	data
• When	deploying	it	against	the	real	dataset	the	application	seemed	
healthy
• However	it	could	become	surprisingly	slow or	even	crash
• What	did	go	wrong?
Our	data-skew	story
• We	have	use-cases	when	map-side	combine	is	not	an	option:
groupBy, join
• 80%	of	the	traffic	generated by 20%	of	the	communication towers
• Most	of	our	data	is	80–20	rule
The	problem
• Using	default	hashing	is	not	going	to	distribute	the	data	uniformly
• Some	unknown	partition(s)	to	contain	a	lot	of	records	on	the	reducer	
side
• Slow	tasks	will	appear
• Data	distribution	is	not	known
in	advance
• „Concept	drifts”	are	common
stage
boundary	
&	shuffle
even	partitioning skewed	data
slow	task
slow	task
Aim
Generally,	make	Spark	a	data-aware	distributed	data-processing	framework
• Collect	information	about	the	data-characteristics	on-the-fly
• Partition	the	data	as	uniformly	as	possible	on-the-fly
• Handle	arbitrary	data	distributions
• Mitigate	the	problem	of	slow	tasks
• Be	lightweight	and	efficient
• Should	not	require	user-guidance
spark.repartitioning = true
Architecture
Slave Slave Slave
Master
approximate
local	key
distribution &	
statistics
1
collect to master 2
global key
distribution &	
statistics
3
4
redistribute new
hash
Driver	perspective
• RepartitioningTrackerMaster part	of	SparkEnv
• Listens	to	job	&	stage	submissions
• Holds	a	variety	of	repartitioning	strategies	for	each	job	&	stage
• Decides when &	how to (re)partition
Executor	perspective
• RepartitioningTrackerWorker part	of	SparkEnv
• Duties:
• Stores	ScannerStrategies (Scanner included)	received	from	the	RTM
• Instantiates	 and	binds	Scanners	 to	TaskMetrics (where	data-
characteristics	 is	collected)
• Defines	an	interface	for	Scanners	 to	send	DataCharacteristics back	to	
the	RTM
Scalable sampling
• Key-distributions are approximated with a	strategy,	that is
• not sensitive to early or late concept drifts,
• lightweight and	efficient,
• scalable by using a	backoff strategy
keys
frequency
counters
keys
frequency
sampling	 rate of	𝑠" sampling	 rate of	𝑠" /𝑏
keys
frequency
sampling	 rate of	𝑠"%&/𝑏
truncateincrease by 𝑠"
𝑇𝐵
Complexity-sensitive sampling
• Reducer	run-time	can	highly	correlate	with	the	computational	
complexity of	the	values	for	a	given	key
• Calculating	object	size	is	costly	in	JVM	and	in	some	cases	shows	little	
correlation	to	computational	complexity	(function	on	the	next	stage)
• Solution:
• If	the	user	object	is	Weightable,	 use	complexity-sensitive	 sampling
• When	increasing	the	counter	for	a	specific	value,	consider	its	complexity
Scalable sampling in numbers
• In	Spark,	it	has	been	implemented	with	Accumulators
• Not	so	efficient,	but	we	wanted	to	implement	it	with	minimal	impact	on	the	existing	
code
• Optimized	with	micro-benchmarking
• Main	factors:
• Sampling	strategy	- aggressiveness (initial	sampling	ratio,	back-off	factor,	etc…)
• Complexity	of	the	current	stage	(mapper)
• Current	stage’s	runtime:
• When	used	throughout	the	execution	of	the	whole	stage	it	adds	5-15%	to	runtime
• After	repartitioning,	we	cut	out	the	sampler’s	code-path; in	practice,
it	adds	0.2-1.5% to	runtime
Main	new task-level metrics
• TaskMetrics
• RepartitioningInfo – new	hash	function,	repartition	strategy
• ShuffleReadMetrics
• (Weightable)DataCharacteristics – used	for	testing	the	correctness	of	the	
repartitioning	&	for	execution	visualization	tools
• ShuffleWriteMetrics
• (Weightable)DataCharacteristics – scanned	periodically	by	Scanners	based	on	
ScannerStrategy
• insertionTime
• repartitioningTime
• inMemory
• ofSpills
Scanner
• Instantiated	for	each	task	before	the	executor	starts	them
• Different	implementations:	Throughput, Timed, Histogram
• ScannerStrategy defines:
• when	to	send	to	the	RTM,
• histogram-compaction	 level.
Decision	to	repartition
• RepartitioningTrackerMaster can	use	different	decision	
strategies:
• number	of	local	histograms	needed,
• global	histogram’s	distance	from	uniform	distribution,
• preferences	in	the	construction	of	the	new	hash	function.
Construction of	the new hash function
𝑐𝑢𝑡, - single-key cut
hash a	key here	with the probability
proportional to the remaining space to reach 𝑙
𝑐𝑢𝑡. - uniform	 distribution
top	prominent keys hashed
𝑙
𝑐𝑢𝑡/ - probabilistic cut
New	hash function in numbers
• More	complex	than	a	hashCode
• We	need	to	evaluate	it	for	every	record
• Micro-benchmark	(for	example	String):
• Number	of	partitions:	 512
• HashPartitioner:	 AVG	time to hash a	record is	90.033 ns
• KeyIsolatorPartitioner:	 AVG	time to hash a	record is	121.933	ns
• In	practice	it	adds	negligible	overhead,	under	1%
Repartitioning
• Reorganize	previous	naive	hashing
• Usually	happens	in-memory
• In	practice,	adds	additional	1-8%	overhead	(usually	the	lower	end),	
based	on:
• complexity	of	the	mapper,
• length	of	the	scanning	process.
More	numbers (groupBy)
MusicTimeseries – groupBy on tags from a	listenings-stream
1710 200 400
83M
134M
92M
sizeof	thebiggestpartition
number of	partitions
over-partitioning overhead
&	blind scheduling
64%	reduction
observed 3%	map-side overhead
naive
Dynamic Repartitioning
38%	reduction
More	numbers (join)
• Joining	tracks	with	tags
• Tracks	dataset	is	skewed
• Number	of	partitions	is	set	to	33
• Naive:
• size	of	the	biggest	partition	=	4.14M
• reducer’s	stage	runtime	=	251	seconds
• Dynamic	Repartitioning
• size	of	the	biggest	partition	=	2.01M
• reducer’s	stage	runtime	=	124 seconds
• heavy	map,	only	0.9%	overhead
Spark REST	API
• New	metrics	are	available	through	the	REST	API
• Added	new	queries	to	the	REST	API,	for	example:
„what	happened	in	the	last	3	second?”
• BlockFetches are	collected	to	ShuffleReadMetrics
Execution visualization of	Spark jobs
Repartitioning in the visualization
heavy keys create
heavy partitions (slow tasks)
heavy is	alone,
size of	the biggest partition
is	minimized
Conclusion
• Our	Dynamic	Repartitioning can	handle	data	skew	dynamically,	on-
the-fly	on	any	workload	and	arbitrary	key-distributions
• With	very	little	overhead,	data	skew	can	be	handled	in	a	natural &	
general way
• Visualizations	can	aid	developers	to	better	understand	issues	and	
bottlenecks	of	certain	workloads
• Making	Spark	data-aware pays	off
Thank	you	for	your	attention
Zoltán	Zvara
zoltan.zvara@sztaki.hu
Ad

More Related Content

What's hot (20)

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 
Vectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Vectorized UDF: Scalable Analysis with Python and PySpark with Li JinVectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Vectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Databricks
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
Knoldus Inc.
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
Julien Le Dem
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
kbajda
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
Amit Raj
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
spark-project
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 
Vectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Vectorized UDF: Scalable Analysis with Python and PySpark with Li JinVectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Vectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Databricks
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
Knoldus Inc.
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
kbajda
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
Amit Raj
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
spark-project
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Databricks
 

Viewers also liked (20)

Automatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian ApproachAutomatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
Spark Summit
 
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
Spark Summit
 
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Spark Summit
 
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangOptimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Databricks
 
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
Spark Summit
 
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Spark Summit
 
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai YuAn Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
Databricks
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Databricks
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Databricks
 
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc KaminskiBuilding Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
Spark Summit
 
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Spark Summit
 
Building Machine Learning Algorithms on Apache Spark with William Benton
Building Machine Learning Algorithms on Apache Spark with William BentonBuilding Machine Learning Algorithms on Apache Spark with William Benton
Building Machine Learning Algorithms on Apache Spark with William Benton
Spark Summit
 
Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)
Spark Summit
 
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Spark Summit
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
Databricks
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles Baker
Databricks
 
Art of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel Sarwar
Spark Summit
 
A Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen FanA Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen Fan
Databricks
 
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Databricks
 
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian ApproachAutomatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
Spark Summit
 
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
Spark Summit
 
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Spark Summit
 
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangOptimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Databricks
 
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
Spark Summit
 
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Spark Summit
 
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai YuAn Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
Databricks
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Databricks
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Databricks
 
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc KaminskiBuilding Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
Spark Summit
 
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Spark Summit
 
Building Machine Learning Algorithms on Apache Spark with William Benton
Building Machine Learning Algorithms on Apache Spark with William BentonBuilding Machine Learning Algorithms on Apache Spark with William Benton
Building Machine Learning Algorithms on Apache Spark with William Benton
Spark Summit
 
Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)
Spark Summit
 
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Spark Summit
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
Databricks
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles Baker
Databricks
 
Art of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel Sarwar
Spark Summit
 
A Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen FanA Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen Fan
Databricks
 
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Databricks
 
Ad

Similar to Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning (20)

Spark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan Zvara
Spark Summit
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
Tung Nguyen
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
Streamlio
 
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
MLconf
 
Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big Graphs
Petr Novotný
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Spark Summit
 
How to build a data stack from scratch
How to build a data stack from scratchHow to build a data stack from scratch
How to build a data stack from scratch
Vinayak Hegde
 
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Flink Forward
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
Keshav Tripathy
 
20160000 Cloud Discovery Event - Cloud Access Security Brokers
20160000 Cloud Discovery Event - Cloud Access Security Brokers20160000 Cloud Discovery Event - Cloud Access Security Brokers
20160000 Cloud Discovery Event - Cloud Access Security Brokers
Robin Vermeirsch
 
Big data analytics and machine intelligence v5.0
Big data analytics and machine intelligence   v5.0Big data analytics and machine intelligence   v5.0
Big data analytics and machine intelligence v5.0
Amr Kamel Deklel
 
Solving Cybersecurity at Scale
Solving Cybersecurity at ScaleSolving Cybersecurity at Scale
Solving Cybersecurity at Scale
DataWorks Summit
 
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
Edge AI and Vision Alliance
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Precisely
 
Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Web
maria.grineva
 
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Databricks
 
A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization ...
A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization ...A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization ...
A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization ...
JPINFOTECH JAYAPRAKASH
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
DataWorks Summit
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big Data
Debajani Mohanty
 
Spark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan Zvara
Spark Summit
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
Tung Nguyen
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
Streamlio
 
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
MLconf
 
Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big Graphs
Petr Novotný
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Spark Summit
 
How to build a data stack from scratch
How to build a data stack from scratchHow to build a data stack from scratch
How to build a data stack from scratch
Vinayak Hegde
 
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Flink Forward
 
20160000 Cloud Discovery Event - Cloud Access Security Brokers
20160000 Cloud Discovery Event - Cloud Access Security Brokers20160000 Cloud Discovery Event - Cloud Access Security Brokers
20160000 Cloud Discovery Event - Cloud Access Security Brokers
Robin Vermeirsch
 
Big data analytics and machine intelligence v5.0
Big data analytics and machine intelligence   v5.0Big data analytics and machine intelligence   v5.0
Big data analytics and machine intelligence v5.0
Amr Kamel Deklel
 
Solving Cybersecurity at Scale
Solving Cybersecurity at ScaleSolving Cybersecurity at Scale
Solving Cybersecurity at Scale
DataWorks Summit
 
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
Edge AI and Vision Alliance
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Precisely
 
Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Web
maria.grineva
 
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Databricks
 
A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization ...
A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization ...A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization ...
A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization ...
JPINFOTECH JAYAPRAKASH
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
DataWorks Summit
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big Data
Debajani Mohanty
 
Ad

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

Recently uploaded (20)

Understanding Complex Development Processes
Understanding Complex Development ProcessesUnderstanding Complex Development Processes
Understanding Complex Development Processes
Process mining Evangelist
 
Chapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptxChapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptx
PermissionTafadzwaCh
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
AWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptxAWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptx
bharatkumarbhojwani
 
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docxAnalysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
hershtara1
 
Ann Naser Nabil- Data Scientist Portfolio.pdf
Ann Naser Nabil- Data Scientist Portfolio.pdfAnn Naser Nabil- Data Scientist Portfolio.pdf
Ann Naser Nabil- Data Scientist Portfolio.pdf
আন্ নাসের নাবিল
 
report (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhsreport (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhs
AngelPinedaTaguinod
 
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual FormStorage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Professional Content Writing's
 
Process Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - JourneyProcess Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - Journey
Process mining Evangelist
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial IntelligenceDr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
Sets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledgeSets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledge
saumyasl2020
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
Mining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - MicrosoftMining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - Microsoft
Process mining Evangelist
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
Process Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce DowntimeProcess Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce Downtime
Process mining Evangelist
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
Chapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptxChapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptx
PermissionTafadzwaCh
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
AWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptxAWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptx
bharatkumarbhojwani
 
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docxAnalysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
hershtara1
 
report (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhsreport (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhs
AngelPinedaTaguinod
 
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual FormStorage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Professional Content Writing's
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial IntelligenceDr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
Sets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledgeSets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledge
saumyasl2020
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
Mining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - MicrosoftMining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - Microsoft
Process mining Evangelist
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
Process Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce DowntimeProcess Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce Downtime
Process mining Evangelist
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 

Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning

  翻译: