SlideShare a Scribd company logo
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Dancing Elephants:
Working with Object Storage in
Apache Spark and Hive
Sanjay Radia
Steve Loughran
March 2017
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
About Speakers
Sanjay Radia
Chief Architect, Founder, Hortonworks
Part of the original Hadoop team at Yahoo! since 2007
– Chief Architect of Hadoop Core at Yahoo!
–Apache Hadoop PMC and Committer
Prior
Data center automation, virtualization, Java, HA, OSs, File Systems
Startup, Sun Microsystems, Inria …
Ph.D., University of Waterloo
Page 2
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Why Cloud?
• No upfront hardware costs – Pay as you use
• Elasticity
• Often lower TCO
• Natural for Data ingress for IoT, mobile apps, ..
• Business agility
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Shared Data and Cloud Storage
⬢ Cloud Storage is the Shared Data Lake
–For both Hadoop and Cloud-native (non-Hadoop) Apps
–Lower Cost
•HDFS via EBS can get expensive
•HDFS’s role changes
–Built-in geo-distribution and DR
⬢ Challenges
–Cloud storage designed for low cost and geo-distribution
–Performance is slower – was not designed for Data intensive apps
–Cloud storage segregated from compute
–API and semantics not like a FS – especially wrt. consistency
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Making Object Stores work for Big Data Apps
⬢ Focus Areas
–Address cloud storage consistency
–Performance (changes in connectors and frameworks)
–Caching in memory and local storage
⬢ Other issues not covered in this talk
–Shared Metastore, Common Governance, Security
across clusters
Columnar access control to Tabular data
See Hortonworks cloud offering
Shared Data
& Storage
10101
10101010101
01010101010101
0101010101010101010
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Cloud Storage Integration: Evolution for Agility
HDFS
Application
HDFS
Application
GoalEvolution towards cloud storage as the persistent Data Lake
Input Output
Backup Restore
Input
Output
Upload
HDFS
Application
Input
Output
tmp
AzureAWS –today ->>>>>
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Some Use Cases
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
ORC
datasets
inbound
Elastic ETL
external
HDFS (temp)
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
datasets
external
Notebooks
library HDFS (temp)
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Streaming
Sam, Storm,
Flink, Spark,
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Danger:
Object stores are not filesystems
Cost & Geo-distribution over
Consistency and Performance
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
A Filesystem: Directories, Files  Data
/
work
pending
part-00
part-01
00
00
00
01
01
01
complete
part-01
rename("/work/pending/part-01", "/work/complete")
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Object Store: hash(name)⇒data
00
00
00
01
01
s01 s02
s03 s04
hash("/work/pending/part-01")
["s02", "s03", "s04"]
copy("/work/pending/part-01",
"/work/complete/part01")
01
01
01
01
delete("/work/pending/part-01")
hash("/work/pending/part-00")
["s01", "s02", "s04"]
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Often: Eventually Consistent
00
00
00
01
01
s01 s02
s03 s04
01
DELETE /work/pending/part-00
GET /work/pending/part-00
GET /work/pending/part-00
200
200
200
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
The dangers of Eventual Consistency and Atomicity
⬢ Temp Data leftovers
–Annoying Garbage or Worse if direct output committer is used
⬢ List inconsistency means new data may not be visible
⬢ Lack of atomic rename() can leave output directories inconsistent
You can get bad or missing data and not even notice
Especially if only a portion of your large data is missing
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
org.apache.hadoop.fs.FileSystem
hdfs s3awasb adlswift gs
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
s3:// —“inode on S3”
s3n://
“Native” S3
s3a:// Replaces s3n
swift://
OpenStack
wasb://
Azure WASB
Phase I: Stabilize S3A
oss://
Aliyun
gs://
Google Cloud
Phase II: speed & scale
adl://
Azure Data Lake
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
s3://
Amazon EMR S3
History of Object Storage Support
Phase III: scale & consistency
(proprietary)
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Cloud Storage Connectors
Azure WASB ● Strongly consistent
● Good performance
● Well-tested on applications (incl. HBase)
ADL ● Strongly consistent
● Tuned for big data analytics workloads
Amazon Web Services S3A ● Eventually consistent - consistency work in
progress by Hortonworks
● Performance improvements recently and
in progress
● Active development in Apache
EMRFS ● Proprietary connector used in EMR
● Optional strong consistency for a cost
Google Cloud Platform GCS ● Multiple configurable consistency policies
● Currently Google open source
● Good performance
● Could improve test coverage
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Make Apache Hadoop at home
in the cloud
Step 1: Hadoop runs great on Azure
Step 2: Beat EMR on EC2
✔ ✔
✔
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Problem: S3 Analytics
is too slow/broken
1. Analyze benchmarks and bug-reports
2. Optimize the non-io metadataops (very cheap on HDFS)
3. Fix Read path for Columnar Data
4. Fix Write path
5. Improve query partitioning
6. The Commitment Problem
getFileStatus()
read()
LLAP (single node) on AWS
TPC-DS queries at 200 GB scale
readFully(pos)
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Hadoop 2.8/HDP 2.6 transforms I/O performance!
// forward seek by skipping stream
fs.s3a.readahead.range=256K
// faster backward seek for Columnar Storage
fs.s3a.experimental.input.fadvise=random
// Write-IO - enhanced data upload (parallel background uploads)
// Additional flags for mem vs disk
fs.s3a.fast.output.enabled=true
fs.s3a.multipart.size=32M
fs.s3a.fast.upload.active.blocks=8
// Additional per-bucket flags
—see HADOOP-11694 for lots more!
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Every HTTP request is precious
⬢ HADOOP-13162: Reduce number of getFileStatus calls in mkdirs()
⬢ HADOOP-13164: Optimize deleteUnnecessaryFakeDirectories()
⬢ HADOOP-13406: Consider reusing filestatus in delete() and mkdirs()
⬢ HADOOP-13145: DistCp to skip getFileStatus when not preserving metadata
⬢ HADOOP-13208: listFiles(recursive=true) to do a bulk listObjects
see HADOOP-11694
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Caching in Memory or Local Disk (ssd)
even more relevant for slow cloud storage
benchmarks != your queries, your data, your VMs, …
…but we think we've made a good start
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
S3 Data Source 1TB TPCDS LLAP- vs Hive 1.x:
0
500
1,000
1,500
2,000
2,500
LLAP-1TB-TPCDS
Hive-1-1TB-TPCDS
1 TB TPC-DS ORC DataSet
3 x i2x4x Large (16 CPU x 122 GB RAM x 4 SSD)
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Rename Problem and
Direct Output
Committer
"
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
The S3 Commitment Problem
rename() used for atomic commitment transaction
⬢ Additional Time: copy() + delete() proportional to data * files
– Server side copy is used to make this faster, but still a copy
– Non-Atomic!!
⬢ Alternate: Direct output committer can solve the performance problem
⬢ BOTH can give wrong results
–Intermediate data may be visible
–Failures (task or job) leave storage in unknown state
–Speculative execution makes it worse
⬢ BTW Compared to Azure Storage, S3 is slow (6-10+ MB/s)
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Spark's Direct Output Committer? Risk of Corruption of data
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
S3guard
Fast, consistent S3 metadata
HADOOP-13445
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
DynamoDB as fast, consistent metadata store
00
00
00
01
01
s01 s02
s03 s04
01
DELETE part-00
200
HEAD part-00
200
HEAD part-00
404
PUT part-00
200
00
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Netflix Staging Committer
1. Saves output to file://
2. Task commit: upload to S3A as multipart PUT —but does not commit the PUT, just saves the
information about it to hdfs://
3. Normal commit protocol manages task and job data promotion in HDFS
4. Final Job committer reads pending information and generates final PUT
—possibly from a different host
Outcome:
⬢ No visible overwrite until final job commit: resilience and speculation
⬢ Task commit time = data/bandwidth
⬢ Job commit time = POST * #files
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Availability
 Read + Write in HDP 2.6 and Apache Hadoop 2.8
 S3Guard: preview of DDB caching soon
 Zero-rename commit: work in progress
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Summary
⬢ Cloud Storage is the Data Lake on the Cloud
– HDFS plays a different role
⬢ Challenges: Performance, Consistency, Correctness
– Output committer – non-atomicity should not be ignored
⬢ We have made significant improvements
– Object store connectors
– Upper layers, such as Hive and ORC
– Up-coming S3Guard
⬢ LLAP as the cache for tabular data
⬢ Other considerations
– Shared Metadata, Security and Governance (See HDP Cloud offerings)
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Big thanks to:
Rajesh Balamohan
Mingliang Liu
Chris Nauroth
Dominik Bialek
Ram Venkatesh
Everyone in QE, RE
+ everyone who reviewed/tested, patches and
added their own, filed bug reports and measured
performance
© Hortonworks Inc. 2011 – 2017 All Rights Reserved3
5
© Hortonworks Inc. 2011 – 2017. All Rights Reserved
Questions?
stevel@hortonworks.com @steveloughran
sanjay@hortonworks.com @srr
Ad

More Related Content

What's hot (20)

Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
DataWorks Summit/Hadoop Summit
 
Empower Data-Driven Organizations
Empower Data-Driven OrganizationsEmpower Data-Driven Organizations
Empower Data-Driven Organizations
DataWorks Summit/Hadoop Summit
 
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the fly
DataWorks Summit
 
Cloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsCloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerations
DataWorks Summit
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the Experts
DataWorks Summit/Hadoop Summit
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
DataWorks Summit
 
To The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid AnalyticsTo The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid Analytics
DataWorks Summit/Hadoop Summit
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the Cloud
DataWorks Summit
 
Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the experts
DataWorks Summit/Hadoop Summit
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
DataWorks Summit/Hadoop Summit
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
DataWorks Summit/Hadoop Summit
 
Deep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profitDeep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profit
DataWorks Summit/Hadoop Summit
 
Leveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioningLeveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioning
Evans Ye
 
Empower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and HadoopEmpower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and Hadoop
DataWorks Summit/Hadoop Summit
 
IoT:what about data storage?
IoT:what about data storage?IoT:what about data storage?
IoT:what about data storage?
DataWorks Summit/Hadoop Summit
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
DataWorks Summit/Hadoop Summit
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
DataWorks Summit
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
DataWorks Summit
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
DataWorks Summit/Hadoop Summit
 
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the fly
DataWorks Summit
 
Cloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsCloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerations
DataWorks Summit
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the Experts
DataWorks Summit/Hadoop Summit
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
DataWorks Summit
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the Cloud
DataWorks Summit
 
Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the experts
DataWorks Summit/Hadoop Summit
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
DataWorks Summit/Hadoop Summit
 
Leveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioningLeveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioning
Evans Ye
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
DataWorks Summit/Hadoop Summit
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
DataWorks Summit
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
DataWorks Summit
 

Similar to Dancing elephants - efficiently working with object stores from Apache Spark and Apache Hive (20)

Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
DataWorks Summit
 
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
DataWorks Summit/Hadoop Summit
 
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and HiveDancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
Steve Loughran
 
Apache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User GroupApache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User Group
Steve Loughran
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
DataWorks Summit/Hadoop Summit
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark Summit
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise Hadoop
Yifeng Jiang
 
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San JoseCloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Mingliang Liu
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
DataWorks Summit/Hadoop Summit
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
alanfgates
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
DataWorks Summit/Hadoop Summit
 
Ansible + Hadoop
Ansible + HadoopAnsible + Hadoop
Ansible + Hadoop
Michael Young
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
Big Data Spain
 
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
DataWorks Summit
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against Disasters
DataWorks Summit
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution
DataWorks Summit
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
DataWorks Summit
 
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
DataWorks Summit/Hadoop Summit
 
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and HiveDancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
Steve Loughran
 
Apache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User GroupApache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User Group
Steve Loughran
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
DataWorks Summit/Hadoop Summit
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark Summit
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise Hadoop
Yifeng Jiang
 
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San JoseCloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Mingliang Liu
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
DataWorks Summit/Hadoop Summit
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
alanfgates
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
Big Data Spain
 
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
DataWorks Summit
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against Disasters
DataWorks Summit
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution
DataWorks Summit
 
Ad

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Ad

Recently uploaded (20)

Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz
 
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
SOFTTECHHUB
 
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent LasterAI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
All Things Open
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
Config 2025 presentation recap covering both days
Config 2025 presentation recap covering both daysConfig 2025 presentation recap covering both days
Config 2025 presentation recap covering both days
TrishAntoni1
 
How to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabberHow to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabber
eGrabber
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptxSmart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Seasia Infotech
 
The Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI IntegrationThe Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI Integration
Re-solution Data Ltd
 
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier VroomAI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
UXPA Boston
 
Build With AI - In Person Session Slides.pdf
Build With AI - In Person Session Slides.pdfBuild With AI - In Person Session Slides.pdf
Build With AI - In Person Session Slides.pdf
Google Developer Group - Harare
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
IT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information TechnologyIT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information Technology
SHEHABALYAMANI
 
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz
 
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
SOFTTECHHUB
 
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent LasterAI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
All Things Open
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
Config 2025 presentation recap covering both days
Config 2025 presentation recap covering both daysConfig 2025 presentation recap covering both days
Config 2025 presentation recap covering both days
TrishAntoni1
 
How to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabberHow to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabber
eGrabber
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptxSmart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Seasia Infotech
 
The Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI IntegrationThe Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI Integration
Re-solution Data Ltd
 
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier VroomAI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
UXPA Boston
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
IT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information TechnologyIT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information Technology
SHEHABALYAMANI
 

Dancing elephants - efficiently working with object stores from Apache Spark and Apache Hive

  • 1. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Dancing Elephants: Working with Object Storage in Apache Spark and Hive Sanjay Radia Steve Loughran March 2017
  • 2. © Hortonworks Inc. 2011 – 2017 All Rights Reserved About Speakers Sanjay Radia Chief Architect, Founder, Hortonworks Part of the original Hadoop team at Yahoo! since 2007 – Chief Architect of Hadoop Core at Yahoo! –Apache Hadoop PMC and Committer Prior Data center automation, virtualization, Java, HA, OSs, File Systems Startup, Sun Microsystems, Inria … Ph.D., University of Waterloo Page 2
  • 3. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Why Cloud? • No upfront hardware costs – Pay as you use • Elasticity • Often lower TCO • Natural for Data ingress for IoT, mobile apps, .. • Business agility
  • 4. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Shared Data and Cloud Storage ⬢ Cloud Storage is the Shared Data Lake –For both Hadoop and Cloud-native (non-Hadoop) Apps –Lower Cost •HDFS via EBS can get expensive •HDFS’s role changes –Built-in geo-distribution and DR ⬢ Challenges –Cloud storage designed for low cost and geo-distribution –Performance is slower – was not designed for Data intensive apps –Cloud storage segregated from compute –API and semantics not like a FS – especially wrt. consistency
  • 5. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Making Object Stores work for Big Data Apps ⬢ Focus Areas –Address cloud storage consistency –Performance (changes in connectors and frameworks) –Caching in memory and local storage ⬢ Other issues not covered in this talk –Shared Metastore, Common Governance, Security across clusters Columnar access control to Tabular data See Hortonworks cloud offering Shared Data & Storage 10101 10101010101 01010101010101 0101010101010101010
  • 6. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Cloud Storage Integration: Evolution for Agility HDFS Application HDFS Application GoalEvolution towards cloud storage as the persistent Data Lake Input Output Backup Restore Input Output Upload HDFS Application Input Output tmp AzureAWS –today ->>>>>
  • 7. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Some Use Cases
  • 8. © Hortonworks Inc. 2011 – 2017 All Rights Reserved ORC datasets inbound Elastic ETL external HDFS (temp)
  • 9. © Hortonworks Inc. 2011 – 2017 All Rights Reserved datasets external Notebooks library HDFS (temp)
  • 10. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Streaming Sam, Storm, Flink, Spark,
  • 11. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Danger: Object stores are not filesystems Cost & Geo-distribution over Consistency and Performance
  • 12. © Hortonworks Inc. 2011 – 2017 All Rights Reserved A Filesystem: Directories, Files  Data / work pending part-00 part-01 00 00 00 01 01 01 complete part-01 rename("/work/pending/part-01", "/work/complete")
  • 13. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Object Store: hash(name)⇒data 00 00 00 01 01 s01 s02 s03 s04 hash("/work/pending/part-01") ["s02", "s03", "s04"] copy("/work/pending/part-01", "/work/complete/part01") 01 01 01 01 delete("/work/pending/part-01") hash("/work/pending/part-00") ["s01", "s02", "s04"]
  • 14. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Often: Eventually Consistent 00 00 00 01 01 s01 s02 s03 s04 01 DELETE /work/pending/part-00 GET /work/pending/part-00 GET /work/pending/part-00 200 200 200
  • 15. © Hortonworks Inc. 2011 – 2017 All Rights Reserved The dangers of Eventual Consistency and Atomicity ⬢ Temp Data leftovers –Annoying Garbage or Worse if direct output committer is used ⬢ List inconsistency means new data may not be visible ⬢ Lack of atomic rename() can leave output directories inconsistent You can get bad or missing data and not even notice Especially if only a portion of your large data is missing
  • 16. © Hortonworks Inc. 2011 – 2017 All Rights Reserved org.apache.hadoop.fs.FileSystem hdfs s3awasb adlswift gs
  • 17. © Hortonworks Inc. 2011 – 2017 All Rights Reserved s3:// —“inode on S3” s3n:// “Native” S3 s3a:// Replaces s3n swift:// OpenStack wasb:// Azure WASB Phase I: Stabilize S3A oss:// Aliyun gs:// Google Cloud Phase II: speed & scale adl:// Azure Data Lake 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 s3:// Amazon EMR S3 History of Object Storage Support Phase III: scale & consistency (proprietary)
  • 18. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Cloud Storage Connectors Azure WASB ● Strongly consistent ● Good performance ● Well-tested on applications (incl. HBase) ADL ● Strongly consistent ● Tuned for big data analytics workloads Amazon Web Services S3A ● Eventually consistent - consistency work in progress by Hortonworks ● Performance improvements recently and in progress ● Active development in Apache EMRFS ● Proprietary connector used in EMR ● Optional strong consistency for a cost Google Cloud Platform GCS ● Multiple configurable consistency policies ● Currently Google open source ● Good performance ● Could improve test coverage
  • 19. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Make Apache Hadoop at home in the cloud Step 1: Hadoop runs great on Azure Step 2: Beat EMR on EC2 ✔ ✔ ✔
  • 20. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Problem: S3 Analytics is too slow/broken 1. Analyze benchmarks and bug-reports 2. Optimize the non-io metadataops (very cheap on HDFS) 3. Fix Read path for Columnar Data 4. Fix Write path 5. Improve query partitioning 6. The Commitment Problem
  • 21. getFileStatus() read() LLAP (single node) on AWS TPC-DS queries at 200 GB scale readFully(pos)
  • 22. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Hadoop 2.8/HDP 2.6 transforms I/O performance! // forward seek by skipping stream fs.s3a.readahead.range=256K // faster backward seek for Columnar Storage fs.s3a.experimental.input.fadvise=random // Write-IO - enhanced data upload (parallel background uploads) // Additional flags for mem vs disk fs.s3a.fast.output.enabled=true fs.s3a.multipart.size=32M fs.s3a.fast.upload.active.blocks=8 // Additional per-bucket flags —see HADOOP-11694 for lots more!
  • 23. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Every HTTP request is precious ⬢ HADOOP-13162: Reduce number of getFileStatus calls in mkdirs() ⬢ HADOOP-13164: Optimize deleteUnnecessaryFakeDirectories() ⬢ HADOOP-13406: Consider reusing filestatus in delete() and mkdirs() ⬢ HADOOP-13145: DistCp to skip getFileStatus when not preserving metadata ⬢ HADOOP-13208: listFiles(recursive=true) to do a bulk listObjects see HADOOP-11694
  • 24. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Caching in Memory or Local Disk (ssd) even more relevant for slow cloud storage benchmarks != your queries, your data, your VMs, … …but we think we've made a good start
  • 25. © Hortonworks Inc. 2011 – 2017 All Rights Reserved S3 Data Source 1TB TPCDS LLAP- vs Hive 1.x: 0 500 1,000 1,500 2,000 2,500 LLAP-1TB-TPCDS Hive-1-1TB-TPCDS 1 TB TPC-DS ORC DataSet 3 x i2x4x Large (16 CPU x 122 GB RAM x 4 SSD)
  • 26. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Rename Problem and Direct Output Committer "
  • 27. © Hortonworks Inc. 2011 – 2017 All Rights Reserved The S3 Commitment Problem rename() used for atomic commitment transaction ⬢ Additional Time: copy() + delete() proportional to data * files – Server side copy is used to make this faster, but still a copy – Non-Atomic!! ⬢ Alternate: Direct output committer can solve the performance problem ⬢ BOTH can give wrong results –Intermediate data may be visible –Failures (task or job) leave storage in unknown state –Speculative execution makes it worse ⬢ BTW Compared to Azure Storage, S3 is slow (6-10+ MB/s)
  • 28. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Spark's Direct Output Committer? Risk of Corruption of data
  • 29. © Hortonworks Inc. 2011 – 2017 All Rights Reserved S3guard Fast, consistent S3 metadata HADOOP-13445
  • 30. © Hortonworks Inc. 2011 – 2017 All Rights Reserved DynamoDB as fast, consistent metadata store 00 00 00 01 01 s01 s02 s03 s04 01 DELETE part-00 200 HEAD part-00 200 HEAD part-00 404 PUT part-00 200 00
  • 31. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Netflix Staging Committer 1. Saves output to file:// 2. Task commit: upload to S3A as multipart PUT —but does not commit the PUT, just saves the information about it to hdfs:// 3. Normal commit protocol manages task and job data promotion in HDFS 4. Final Job committer reads pending information and generates final PUT —possibly from a different host Outcome: ⬢ No visible overwrite until final job commit: resilience and speculation ⬢ Task commit time = data/bandwidth ⬢ Job commit time = POST * #files
  • 32. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Availability  Read + Write in HDP 2.6 and Apache Hadoop 2.8  S3Guard: preview of DDB caching soon  Zero-rename commit: work in progress
  • 33. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Summary ⬢ Cloud Storage is the Data Lake on the Cloud – HDFS plays a different role ⬢ Challenges: Performance, Consistency, Correctness – Output committer – non-atomicity should not be ignored ⬢ We have made significant improvements – Object store connectors – Upper layers, such as Hive and ORC – Up-coming S3Guard ⬢ LLAP as the cache for tabular data ⬢ Other considerations – Shared Metadata, Security and Governance (See HDP Cloud offerings)
  • 34. © Hortonworks Inc. 2011 – 2017 All Rights Reserved Big thanks to: Rajesh Balamohan Mingliang Liu Chris Nauroth Dominik Bialek Ram Venkatesh Everyone in QE, RE + everyone who reviewed/tested, patches and added their own, filed bug reports and measured performance
  • 35. © Hortonworks Inc. 2011 – 2017 All Rights Reserved3 5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Questions? stevel@hortonworks.com @steveloughran sanjay@hortonworks.com @srr

Editor's Notes

  • #7: on prem = 1 azure @ 3 AWS @ 2
  • #8: In all the examples, object stores take a role which replaces HDFS. But this is dangerous, because...
  • #9: This is one of the simplest deployments in cloud: scheduled/dynamic ETL. Incoming data sources saving to an object store; spark cluster brought up for ETL. Either direct cleanup/filter or multistep operations, but either way: an ETL pipeline. HDFS on the VMs for transient storage, the object store used as the destination for data —now in a more efficient format such as ORC or Parquet
  • #10: Notebooks on demand. ; it talks to spark in cloud which then does the work against external and internal data; Your notebook itself can be saved to the object store, for persistence and sharing.
  • #11: Example: streaming on Azure + on LHS add streaming
  • #12: In all the examples, object stores take a role which replaces HDFS. But this is dangerous, because...
  • #13: This pictureis classic HDFS – 3 replicas and metadata First off all there are no real directories in S3 – the directory names are part of the file name – Cloud object store does not treat “/” as a special separator. Here in the rename all you expect is that the data remains and simply some metadata change – that is what happens in HDFS but no so in S3
  • #14: There is no rename operation – a copy and delete has to be done The copied data is new (Yellow) The deleted data (Orange)
  • #15: You might say – big deal the data in pending is deleted later – but imaging if you did direct output connector and was deleting failed task output Also same problem can occur for new data you added.
  • #16: We will come back to the inconsistency problem and the new S3Guard
  • #17: Everything uses the Hadoop APIs to talk to both HDFS, Hadoop Compatible Filesystems and object stores; the Hadoop FS API. There's actually two: the one with a clean split between client side and "driver side", and the older one which is a direct connect. Most use the latter and actually, in terms of opportunities for object store integration tweaking, this is actually the one where can innovate with the most easily. HDFS is "real" filesystem; WASB/Azure close enough. What is "real?". Best test: can support HBase.
  • #18: This is the history
  • #20: Simple goal. Make ASF hadoop at home in cloud infra. It's always been a bit of a mixed bag, and there's a lot with agility we need to address: things fail differently. Step 1: Azure. That's the work with Microsoft on wasb://; you can use Azure as a drop-in replacement for HDFS in Azure Step 2: EMR. More specifically, have the ASF Hadoop codebase get higher numbers than EMR
  • #22: Here's a flamegraph of LLAP (single node) with AWS+HDC for a set of TPC-DS queries at 200 GB scale; we should stick this up online only about 2% of time (optimised code) is doing S3 IO. Something at start partitioning data
  • #23: without going into the details, here are things you will want for Hadoop 2.8. They are in HDP 2.5, possible in the next CDH release. The first two boost input by reducing the cost of seeking, which is expensive as it breaks then re-opens the HTTPS connection. Readahead means that hundreds of KB can be skipped before that connect (yes, it can take that long to reconnect). The experimental fadvise random feature speeds up backward reads at the expense of pure-forward file reads. It is significantly faster for reading in optimized binary formats like ORC and Parquet The last one is a successor to fast upload in Hadoop 2.7. That buffers on heap and needs careful tuning; its memory needs conflict with RDD caching. The new version defaults to buffering as files on local disk, so won't run out of memory. Offers the potential of significantly more effective use of bandwidth; the resulting partitioned files may also offer higher read perf. (No data there, just hearsay).
  • #24: If you look at what we've done, much of it (credit to Rajesh & Chris) is just minimising HTTP requests. Each one can take hundreds of millis, sometimes even seconds due to load balancer issues (tip: reduce DNS TTL on your clients to <30s). A lot of the work internal to S3A was culling those getFileStatus() calls by (a) caching results, (b) not bothering to see if they are needed. Example: cheaper to issue a DELETE listing all parent paths than actually looking to see if they exist, wait for the response, and then delete them. The one at the end, HADOOP-13208, replaces a slow recursive tree walk (many status, many list) with a flat listing of all objects in a tree. This works only for the listStatus(path, true) call —to be benefit you need to use that API call, not do your own treewalk.
  • #25: Don't run off saying "hey, 2x speedup". I'm confident we got HDP faster, EMR is still something we'd need to look at more. Data layout is still a major problem here; I think we are still understanding the implications of sharding and throttling. What we do know is that deep/shallow trees are pathological for recursive treewalks, and they end up storing data in the same s3 nodes, so throttling adjacent requests.
  • #26: And the result. Yes, currently we are faster in these benchmarks. Does that match to the outside world? If you use ORC & HIve, you will gain from the work we've done. There are still things which are pathologically bad, especially deep directory trees with few files 26x faster!!!
  • #29: This invariably ends up reaching us on JIRA, to the extent I've got a document somewhere explaining the problem in detail. It was taken away because it can corrupt your data, without you noticiing. This is generally considered harmful.
  • #30: Recap – S3 is eventually cinsistent
  翻译: