Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhang and Jane Wang

Jun 12, 2018

At Spark Summit 2017, we described our framework for migrating production Hive workloads to Spark with minimal user intervention. After a year of migration, Spark now powers an important part of our batch processing workload. The migration framework supports syntax compatibility analysis, offline/online shadowing, and data validation. In this session, we first introduce new features and improvements in the migration framework that support bucketed tables and increase automation. Next, we deep dive into the top technical challenges we encountered and how we addressed them. We improved the syntax compatibility between Hive and Spark from around 51% to 85% by identifying and developing the top missing features, fixing incompatible UDFs, and implementing a UDF testing framework. In addition, we developed reliable join operators to improve Spark stability in production when leveraging optimizations such as ShuffledHashJoin. Finally, we share an update on our overall migration effort and examples of migration wins. For example, we were able to migrate one of the most complicated workloads at Facebook from Hive to Spark with a performance gain of more than 2.5X.

Zhan Zhang, Jane Wang, Facebook
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap

Overview
• Hive to Spark Migration Effort
• Narrowing Down Feature Gaps
– Regex Column Specification Support
– Local Writes Support
– UDFs
• Performance and Reliability
– Dynamic Join
– Bucket Join
• Advanced Optimization for Extremely Large Jobs
– Secondary Partitioning
– Run-time Optimization

Hive to Spark Migration
• Why do we migrate workloads from Hive to Spark?
– Performance
– Identify and narrow down the feature gap.

Regex Column Specification Support
• One of the most frequent failures in our syntax analysis.
• Support regex column specification.
– SELECT `(a)?+.+` FROM data table
– SELECT t.`(a)?+.+` FROM data table
• SPARK-12139
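
A minimal sketch of how a backquoted regex column specification can be expanded against a table schema. It assumes the pattern must match the whole column name; the function name `expand_regex_columns` and the schema are hypothetical. The slide's pattern `(a)?+.+` uses a Java possessive quantifier and selects every column except `a`. Python's `re` module only accepts possessive quantifiers from version 3.11, so the sketch uses the negative lookahead `(?!a$).+` as a stand-in with the same effect.

```python
import re


def expand_regex_columns(pattern, columns):
    """Return the columns whose full name matches the quoted regex."""
    compiled = re.compile(pattern)
    return [c for c in columns if compiled.fullmatch(c)]


columns = ["a", "ab", "b", "ds"]  # hypothetical table schema

# Stand-in for the slide's `(a)?+.+`: every name except exactly "a".
selected = expand_regex_columns(r"(?!a$).+", columns)
print(selected)  # ['ab', 'b', 'ds']
```

Note that `ab` is still selected: only the column named exactly `a` is excluded.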

Local Filesystem Writes
• Support writing data into the filesystem from queries …
– INSERT OVERWRITE LOCAL? DIRECTORY path=STRING rowFormat? createFileFormat?
– INSERT OVERWRITE LOCAL? DIRECTORY (path=STRING)? tableProvider (OPTIONS options=tablePropertyList)?

UDF Support
• UDAF_JAVA_F/UDTF_JAVA_F/UDF_JAVA_F
• UDF_Bind
• UDF_EVAL_F
• Non-deterministic Expression
• …

Narrowing Down Feature Gaps - Syntax
• Regex Column Specification
• Syntax parser improvement
• UDF compatibility
– Enum value
– User defined class type
– Lambda function

3X Workload Growth in 6 Months
[Figure: chart of Reserved CPU Days and CPU Days]

Joins
• Broadcast Join
• ShuffleHash Join
• SortMerge Join

Dynamic Join
[Flowchart: Start → Build Hash Table → OOM? → No: Hash Join → End; Yes: Reconstruct Iterator → Sort Merge Join → End]
• More aggressively leverage HashJoin
• Provide a reliable fallback mechanism
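
The fallback idea can be sketched in plain Python, outside Spark. This is an illustration under stated assumptions, not the operator described in the talk: `max_build_rows` is a hypothetical stand-in for an out-of-memory signal, rows are `(key, value)` pairs, and the rows already read are simply kept in a list so the build-side iterator can be reconstructed.

```python
from itertools import chain


def sort_merge_join(left_rows, right_rows):
    """Inner join of (key, value) rows by sorting both sides on the key."""
    left = sorted(left_rows, key=lambda r: r[0])
    right = sorted(right_rows, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, emit the cross product.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


def dynamic_join(build_rows, probe_rows, max_build_rows):
    """Try a hash join; fall back to sort-merge if the build side is too big.

    Returns (strategy_name, sorted list of (key, build_value, probe_value)).
    """
    build_iter = iter(build_rows)
    table, consumed = {}, []
    for key, value in build_iter:
        consumed.append((key, value))
        if len(consumed) > max_build_rows:
            # Simulated OOM: reconstruct the build-side iterator from the
            # rows already consumed plus the rows not yet read.
            rebuilt = chain(consumed, build_iter)
            return "sort_merge", sorted(sort_merge_join(rebuilt, probe_rows))
        table.setdefault(key, []).append(value)
    joined = [(k, bv, pv) for k, pv in probe_rows for bv in table.get(k, [])]
    return "hash", sorted(joined)


build = [(1, "a"), (2, "b"), (2, "c")]
probe = [(2, "x"), (3, "y"), (1, "z")]

small = dynamic_join(build, probe, max_build_rows=10)
large = dynamic_join(build, probe, max_build_rows=2)
print(small[0], large[0])    # hash sort_merge
print(small[1] == large[1])  # True
```

Both paths return the same rows, so the planner can choose the hash join more often without risking a failed task when the build side turns out larger than expected.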

Bucket Join
[Figure: bucket-to-split mapping for the two sides of the join (Splits 1-4 and Splits 1-2)]
• Support a different number of buckets (a multiplier) on the left and right sides.
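
A sketch of why bucket counts that differ by an integer multiplier can still be joined bucket by bucket. It assumes non-negative integer keys and `key % num_buckets` as the bucketing function; a real engine uses a hash function, and the helper names here are hypothetical. If the left side has `n_left` buckets and `n_left` is a multiple of `n_right`, every key in left bucket `i` lands in right bucket `i % n_right`.

```python
def bucketize(rows, num_buckets):
    """Assign (key, value) rows to buckets by key modulo num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets


def bucket_join(left_buckets, right_buckets):
    """Inner join of bucketed sides whose bucket counts differ by a multiplier."""
    n_left, n_right = len(left_buckets), len(right_buckets)
    if n_left % n_right != 0:
        raise ValueError("left bucket count must be a multiple of the right")
    out = []
    for i, left in enumerate(left_buckets):
        # key % n_left == i implies key % n_right == i % n_right,
        # so only one right bucket can contain matching keys.
        right = right_buckets[i % n_right]
        lookup = {}
        for key, value in right:
            lookup.setdefault(key, []).append(value)
        for key, lv in left:
            for rv in lookup.get(key, []):
                out.append((key, lv, rv))
    return sorted(out)


left = [(k, f"L{k}") for k in range(8)]         # 4 buckets on the left
right = [(k, f"R{k}") for k in range(0, 8, 2)]  # 2 buckets on the right
result = bucket_join(bucketize(left, 4), bucketize(right, 2))
print(result)
# [(0, 'L0', 'R0'), (2, 'L2', 'R2'), (4, 'L4', 'R4'), (6, 'L6', 'R6')]
```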

Bucket Join Validation
• Verify that Spark bucket join generates results consistent with Hive bucket join:
– Read the Spark and Hive tables.
– Zip the corresponding splits from the Spark-generated and Hive-generated tables.
– Compare the sorted column in the two splits sequentially.
– Sort the bucket column in each split and compare rows in the two splits sequentially.
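
The validation steps above can be sketched as follows. This is a simplified illustration, not the validation tool itself: splits are modeled as lists of row tuples already read into memory, and `splits_match`, `bucket_col`, and `sorted_col` are hypothetical names for column positions in those tuples.

```python
def splits_match(spark_splits, hive_splits, bucket_col, sorted_col):
    """Compare two bucketed tables split by split.

    Each table is a list of splits; each split is a list of row tuples.
    """
    if len(spark_splits) != len(hive_splits):
        return False
    # Zip the corresponding splits of the two tables.
    for spark_rows, hive_rows in zip(spark_splits, hive_splits):
        # Compare the sorted column sequentially, in stored order.
        if [r[sorted_col] for r in spark_rows] != [r[sorted_col] for r in hive_rows]:
            return False

        # Sort each split on the bucket column (ties broken by the full row),
        # then compare the rows sequentially.
        def by_bucket(row):
            return (row[bucket_col],) + row

        if sorted(spark_rows, key=by_bucket) != sorted(hive_rows, key=by_bucket):
            return False
    return True


# Rows are (bucket_key, sort_key); values are made up for illustration.
spark_splits = [[(4, 10), (8, 10)], [(1, 5), (5, 7)]]
hive_splits = [[(8, 10), (4, 10)], [(1, 5), (5, 7)]]
misplaced = [[(8, 10)], [(4, 10), (1, 5), (5, 7)]]

print(splits_match(spark_splits, hive_splits, bucket_col=0, sorted_col=1))  # True
print(splits_match(spark_splits, misplaced, bucket_col=0, sorted_col=1))    # False
```

The first pair passes even though rows with equal sort keys are stored in a different order; the second fails because a row sits in the wrong split.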

Challenges in Large Jobs
• A large job with 10,000 mappers × 10,000 reducers
– IOPS: 100,000,000
– HDFS: 10,000 result files
– Scheduling Overhead: 20,000 tasks
– Manual Tuning
– Data skewness
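
The figures above follow from simple arithmetic, sketched here under the assumption that every reducer fetches one shuffle block from every mapper and writes one result file.

```python
mappers = 10_000
reducers = 10_000

shuffle_fetches = mappers * reducers  # one block per (mapper, reducer) pair
result_files = reducers               # one output file per reducer
tasks = mappers + reducers            # tasks the scheduler has to launch

print(f"{shuffle_fetches:,} {result_files:,} {tasks:,}")
# 100,000,000 10,000 20,000
```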

Pros and Cons
• Reduce IOPS
• Reduce the number of HDFS files
• Runtime Optimization
• Backward Compatibility
– Exactly the same behavior with split number = 1
• Auto-Configuration
– 503 partitions and 13 buckets to achieve good performance.
BUT
• Reduced Parallelism
• Need to fetch all data before computation.

JIRA
• SPARK-12139
– REGEX Column Specification for Hive Queries
• SPARK-4131
– Support "Writing data into the filesystem from queries"
• SPARK-23306
– Race condition in TaskMemoryManager
• SPARK-19326
– Speculated task attempts do not get launched in few scenarios
• SPARK-19839
– Fix memory leak in BytesToBytesMap

ACKNOWLEDGEMENTS
The presentation includes work from the Spark team at Facebook. Thanks for their contributions, especially Lin Wang and Tejas Patil.


Spark real world use cases and optimizationsGal Marder

This document provides an overview of Spark, its core abstraction of resilient distributed datasets (RDDs), and common transformations and actions. It discusses how Spark partitions and distributes data across a cluster, its lazy evaluation model, and the concept of dependencies between RDDs. Common use cases like word counting, bucketing user data, finding top results, and analytics reporting are demonstrated. Key topics covered include avoiding expensive shuffle operations, choosing optimal aggregation methods, and potentially caching data in memory.

Apache hivepradipbajpai68

Dive into spark2Gal Marder

Abstract – Spark 2 is here, while Spark has been the leading cluster computation framework for severl years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 to create better scalable big data applications. Target Audience Architects, Java/Scala developers, Big Data engineers, team leaders Prerequisites Java/Scala knowledge and SQL knowledge Contents: - Spark internals - Architecture - RDD - Shuffle explained - Dataset API - Spark SQL - Spark Streaming

Yet another intro to Apache SparkSimon Lia-Jonassen

Spark is a framework for efficient parallel data processing. It uses resilient distributed datasets (RDDs) that can be operated on in parallel, cached in memory, and recomputed when needed. The core of Spark provides functions for data sharing and basic operations like filtering, mapping, and reducing RDDs. Additional Spark modules provide capabilities for SQL, streaming, machine learning, and graph processing.


Transforming health care with ai poweredgowthamarvj

Z14_IBM__APL_by_Christian_Demmer_IBM.pdfFariborz Seyedloo

Process Mining at Dimension Data - Jan vermeulenProcess mining Evangelist

Dimension Data has over 30,000 employees in nine operating regions spread over all continents. They provide services from infrastructure sales to IT outsourcing for multinationals. As the Global Process Owner at Dimension Data, Jan Vermeulen is responsible for the standardization of the global IT services processes. Jan shares his journey of establishing process mining as a methodology to improve process performance and compliance, to grow their business, and to increase the value in their operations. These three pillars form the foundation of Dimension Data's business case for process mining. Jan shows examples from each of the three pillars and shares what he learned on the way. The growth pillar is particularly new and interesting, because Dimension Data was able to compete in a RfP process for a new customer by providing a customized offer after analyzing the customer's data with process mining.

CS-404 COA COURSE FILE JAN JUN 2025.docxnidarizvitit

Process Mining Machine Recoveries to Reduce DowntimeProcess mining Evangelist

ASML provides chip makers with everything they need to mass-produce patterns on silicon, helping to increase the value and lower the cost of a chip. The key technology is the lithography system, which brings together high-tech hardware and advanced software to control the chip manufacturing process down to the nanometer. All of the world’s top chipmakers like Samsung, Intel and TSMC use ASML’s technology, enabling the waves of innovation that help tackle the world’s toughest challenges. The machines are developed and assembled in Veldhoven in the Netherlands and shipped to customers all over the world. Freerk Jilderda is a project manager running structural improvement projects in the Development & Engineering sector. Availability of the machines is crucial and, therefore, Freerk started a project to reduce the recovery time. A recovery is a procedure of tests and calibrations to get the machine back up and running after repairs or maintenance. The ideal recovery is described by a procedure containing a sequence of 140 steps. After Freerk’s team identified the recoveries from the machine logging, they used process mining to compare the recoveries with the procedure to identify the key deviations. In this way they were able to find steps that are not part of the expected recovery procedure and improve the process.

What is ETL? Difference between ETL and ELT?.pdfSaikatBasu37

2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstrybastakwyry

CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™muhammed84essa

新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办Taqyea

快速办理新西兰成绩单奥克兰理工大学毕业证【q微1954292140】办理奥克兰理工大学毕业证(AUT毕业证书)diploma学位认证【q微1954292140】新西兰文凭购买，新西兰文凭定制，新西兰文凭补办。专业在线定制新西兰大学文凭，定做新西兰本科文凭，【q微1954292140】复制新西兰Auckland University of Technology completion letter。在线快速补办新西兰本科毕业证、硕士文凭证书，购买新西兰学位证、奥克兰理工大学Offer，新西兰大学文凭在线购买。主营项目： 1、真实教育部国外学历学位认证《新西兰毕业文凭证书快速办理奥克兰理工大学毕业证的方法是什么？》【q微1954292140】《论文没过奥克兰理工大学正式成绩单》，教育部存档，教育部留服网站100%可查. 2、办理AUT毕业证，改成绩单《AUT毕业证明办理奥克兰理工大学展示成绩单模板》【Q/WeChat：1954292140】Buy Auckland University of Technology Certificates《正式成绩单论文没过》，奥克兰理工大学Offer、在读证明、学生卡、信封、证明信等全套材料，从防伪到印刷，从水印到钢印烫金，高精仿度跟学校原版100%相同. 3、真实使馆认证（即留学人员回国证明），使馆存档可通过大使馆查询确认. 4、留信网认证，国家专业人才认证中心颁发入库证书，留信网存档可查. 《奥克兰理工大学毕业证定制新西兰毕业证书办理AUT在线制作本科文凭》【q微1954292140】学位证1:1完美还原海外各大学毕业材料上的工艺：水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠。文字图案浮雕、激光镭射、紫外荧光、温感、复印防伪等防伪工艺。高仿真还原新西兰文凭证书和外壳，定制新西兰奥克兰理工大学成绩单和信封。专业定制国外毕业证书AUT毕业证【q微1954292140】办理新西兰奥克兰理工大学毕业证(AUT毕业证书)【q微1954292140】学历认证复核奥克兰理工大学offer/学位证成绩单定制、留信官方学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作。帮你解决奥克兰理工大学学历学位认证难题。新西兰文凭奥克兰理工大学成绩单，AUT毕业证【q微1954292140】办理新西兰奥克兰理工大学毕业证(AUT毕业证书)【q微1954292140】学位认证要多久奥克兰理工大学offer/学位证在线制作硕士成绩单、留信官方学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作。帮你解决奥克兰理工大学学历学位认证难题。奥克兰理工大学offer/学位证、留信官方学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作【q微1954292140】Buy Auckland University of Technology Diploma购买美国毕业证，购买英国毕业证，购买澳洲毕业证，购买加拿大毕业证，以及德国毕业证，购买法国毕业证（q微1954292140）购买荷兰毕业证、购买瑞士毕业证、购买日本毕业证、购买韩国毕业证、购买新西兰毕业证、购买新加坡毕业证、购买西班牙毕业证、购买马来西亚毕业证等。包括了本科毕业证，硕士毕业证。特殊原因导致无法毕业，也可以联系我们帮您办理相关材料：１：在奥克兰理工大学挂科了，不想读了，成绩不理想怎么办？？？ 2：打算回国了，找工作的时候，需要提供认证《AUT成绩单购买办理奥克兰理工大学毕业证书范本》【Q/WeChat：1954292140】Buy Auckland University of Technology Diploma《正式成绩单论文没过》有文凭却得不到认证。又该怎么办？？？新西兰毕业证购买，新西兰文凭购买，【q微1954292140】帮您解决在新西兰奥克兰理工大学未毕业难题（Auckland University of 
Technology）文凭购买、毕业证购买、大学文凭购买、大学毕业证购买、买文凭、日韩文凭、英国大学文凭、美国大学文凭、澳洲大学文凭、加拿大大学文凭（q微1954292140）新加坡大学文凭、新西兰大学文凭、爱尔兰文凭、西班牙文凭、德国文凭、教育部认证，买毕业证，毕业证购买，买大学文凭，购买日韩毕业证、英国大学毕业证、美国大学毕业证、澳洲大学毕业证、加拿大大学毕业证（q微1954292140）新加坡大学毕业证、新西兰大学毕业证、爱尔兰毕业证、西班牙毕业证、德国毕业证，回国证明，留信网认证，留信认证办理，学历认证。从而完成就业。奥克兰理工大学毕业证办理，奥克兰理工大学文凭办理，奥克兰理工大学成绩单办理和真实留信认证、留服认证、奥克兰理工大学学历认证。学院文凭定制，奥克兰理工大学原版文凭补办，扫描件文凭定做，100%文凭复刻。

Voice Control robotic arm hggyghghgjgjhgjg4mg22ec401

Agricultural_regionalisation_in_India(Final).pptxmostafaahammed38

录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单Taqyea

Feature Engineering for Electronic Health Record SystemsProcess mining Evangelist

hersh's midterm project.pdf music retail and distributionhershtara1

Adopting Process Mining at the Rabobank - use caseProcess mining Evangelist

How to regulate and control your it-outsourcing provider with process miningProcess mining Evangelist

文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询Taqyea

Sets theories and applications that can used to imporve knowledgesaumyasl2020

Improving Product Manufacturing ProcessesProcess mining Evangelist

AI ------------------------------ W1L2.pptxAyeshaJalil6

Transforming health care with ai poweredgowthamarvj

Z14_IBM__APL_by_Christian_Demmer_IBM.pdfFariborz Seyedloo

Process Mining at Dimension Data - Jan vermeulenProcess mining Evangelist

CS-404 COA COURSE FILE JAN JUN 2025.docxnidarizvitit

Process Mining Machine Recoveries to Reduce DowntimeProcess mining Evangelist

What is ETL? Difference between ETL and ELT?.pdfSaikatBasu37

2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstrybastakwyry

CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™muhammed84essa

新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办Taqyea

1. Migrating Apache Hive Workload to Apache Spark: Bridge the Gap • Zhan Zhang, Jane Wang, Facebook

2. Overview • Hive to Spark Migration Effort • Narrowing Down Feature Gaps – Regex Column Specification Support. – Local Writes support. – UDFs • Performance and Reliability – Dynamic Join – Bucket Join • Advanced Optimization for Extremely Large Jobs – Secondary Partitioning – Run-time Optimization.

3. Hive to Spark Migration • Why do we migrate workloads from Hive to Spark? – Performance – Identify and narrow down the feature gap.

4. Regex Column Specification Support • One of the most common failures in our syntax analysis. • Support regex column specification. – SELECT `(a)?+.+` FROM data table – SELECT t.`(a)?+.+` FROM data table • SPARK-12139
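To illustrate what the quoted pattern on this slide selects, here is a minimal Python sketch. The pattern `(a)?+.+` uses a possessive optional group, so it matches every column name except exactly `a`. The helper `select_columns` and the sample schema are hypothetical, and because Python's `re` module only supports possessive quantifiers from version 3.11, the sketch uses an equivalent lookahead-plus-backreference form instead of the literal pattern.

```python
import re

# Java/Hive pattern from the slide: (a)?+.+
# `(a)?+` is possessive: once it has consumed "a" it never gives it back,
# so the trailing `.+` has nothing left to match when the name is exactly "a".
# Equivalent form that also works on Python versions before 3.11:
EXCLUDE_A = re.compile(r"(?=((?:a)?))\1.+")


def select_columns(columns, pattern=EXCLUDE_A):
    """Return the column names that the pattern matches in full (hypothetical helper)."""
    return [name for name in columns if pattern.fullmatch(name)]


columns = ["a", "b", "ab", "value"]  # hypothetical schema
selected = select_columns(columns)
print(selected)
```

In this sketch the column `a` is dropped while `ab` is kept, which is why the idiom is commonly used as "select everything except column `a`".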

5. Local Filesystem Writes • Support writing data into the filesystem from queries … – INSERT OVERWRITE LOCAL? DIRECTORY path=STRING rowFormat? createFileFormat? – INSERT OVERWRITE LOCAL? DIRECTORY (path=STRING)? tableProvider (OPTIONS options=tablePropertyList)?

6. UDF Support • UDAF_JAVA_F/UDTF_JAVA_F/UDF_JAVA_F • UDF_Bind • UDF_EVAL_F • Non-deterministic Expression • …

7. Narrowing Down Feature Gaps - Syntax • Regex Column Specification • Syntax parser improvement • UDF compatibility – Enum value – User defined class type – Lambda function

8. 3X Workload Growth in 6 Months (chart: Reserved CPU Days vs. CPU Days)

9. Joins • Broadcast Join • ShuffleHash Join • SortMerge Join

10. Dynamic Join (flowchart: Start → Build Hash Table → OOM? → No: Hash Join → End; Yes: Reconstruct Iterator → Sort Merge Join → End) • More aggressively leverage HashJoin • Provide a reliable fallback mechanism
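The fallback idea on this slide can be sketched in plain Python. This is an illustration under stated assumptions, not the Spark operator: the row budget `max_rows` stands in for an out-of-memory signal, and every name here (`HashTableTooLarge`, `build_hash_table`, `dynamic_join`) is hypothetical. On overflow, the rows already consumed are chained back in front of the unread ones ("reconstruct iterator") and the join falls back to sort-merge.

```python
from itertools import chain


class HashTableTooLarge(Exception):
    """Stands in for an OOM signal raised while building the hash table."""

    def __init__(self, consumed):
        super().__init__("hash table exceeded its row budget")
        self.consumed = consumed  # rows already pulled from the build side


def build_hash_table(rows, max_rows):
    """Build key -> [values]; raise once more than max_rows rows were read."""
    table, consumed = {}, []
    for key, value in rows:
        consumed.append((key, value))
        if len(consumed) > max_rows:
            raise HashTableTooLarge(consumed)
        table.setdefault(key, []).append(value)
    return table


def hash_join(stream_rows, table):
    for key, left in stream_rows:
        for right in table.get(key, []):
            yield key, left, right


def sort_merge_join(left_rows, right_rows):
    left = sorted(left_rows, key=lambda r: r[0])
    right = sorted(right_rows, key=lambda r: r[0])
    i = j = 0
    while i < len(left) and j < len(right):
        key = left[i][0]
        if key < right[j][0]:
            i += 1
        elif key > right[j][0]:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left row that carries the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            while i < len(left) and left[i][0] == key:
                for _, right_value in right[j:j_end]:
                    yield key, left[i][1], right_value
                i += 1
            j = j_end


def dynamic_join(stream_rows, build_rows, max_rows):
    """Try a hash join first; fall back to sort-merge if the build overflows."""
    build_iter = iter(build_rows)
    try:
        table = build_hash_table(build_iter, max_rows)
    except HashTableTooLarge as exc:
        # Reconstruct the build-side iterator: consumed rows + unread rows.
        rebuilt = chain(exc.consumed, build_iter)
        return list(sort_merge_join(stream_rows, rebuilt)), "sort_merge"
    return list(hash_join(stream_rows, table)), "hash"


left = [(1, "l1"), (2, "l2"), (2, "l2b")]
right = [(2, "r2"), (1, "r1"), (3, "r3")]
fast_rows, fast_plan = dynamic_join(left, right, max_rows=10)
slow_rows, slow_plan = dynamic_join(left, right, max_rows=1)
```

Both paths return the same joined rows; only the strategy differs, which is the property that lets the hash join be attempted more aggressively.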

11. Dynamic Join – Physical Plan

12. Bucket Join (diagram: buckets assigned to Split 1 through Split 4 on one side and to Split 1 and Split 2 on the other) • Support different number (multiplier) of buckets on left/right side.
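One way to see why a multiplier relationship between bucket counts is enough for a bucket join is the following sketch. It assumes, purely for illustration, that a row's bucket is its non-negative integer key modulo the bucket count; the real Hive and Spark bucketing hash functions are not shown on the slide, and the names `bucket_id` and `bucket_mapping` are hypothetical.

```python
def bucket_id(key, num_buckets):
    """Assumed bucketing rule: non-negative integer key modulo bucket count."""
    return key % num_buckets


def bucket_mapping(large_num, small_num):
    """Map each bucket of the larger table to its partner in the smaller table."""
    if large_num % small_num != 0:
        raise ValueError("bucket counts must be multiples of each other")
    return {bucket: bucket % small_num for bucket in range(large_num)}


mapping = bucket_mapping(4, 2)
print(mapping)

# Every key lands in partner buckets, so each bucket pair can be joined alone.
consistent = all(
    bucket_id(key, 2) == mapping[bucket_id(key, 4)] for key in range(1000)
)
```

Under this assumption, two buckets of the 4-bucket side read the same bucket of the 2-bucket side, so no shuffle is needed to co-locate matching keys.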

13. Bucket Join Validation • To verify that Spark bucket join generates results consistent with Hive bucket join – Read Spark/Hive table. – Zip the corresponding splits from the Spark- and Hive-generated tables. – Compare the sorted column in the two splits sequentially. – Sort the bucket column in each split and compare rows in the two splits sequentially.
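The validation steps above can be sketched as follows. This is a minimal illustration, not the validation tool from the talk: splits are modeled as in-memory lists of tuples, `first_mismatch` is a hypothetical helper, and sorting by the whole row as a tie-breaker after the bucket column is an assumption made so the comparison is deterministic.

```python
from itertools import zip_longest


def first_mismatch(spark_splits, hive_splits, bucket_col=0):
    """Return None if all splits agree, else (split_index, spark_row, hive_row)."""
    if len(spark_splits) != len(hive_splits):
        raise ValueError("tables have different numbers of splits")

    def sort_key(row):
        # Sort on the bucket column first, then the full row as a tie-breaker.
        return (row[bucket_col], row)

    # Zip corresponding splits and compare their sorted rows one by one.
    for index, (spark_rows, hive_rows) in enumerate(zip(spark_splits, hive_splits)):
        spark_sorted = sorted(spark_rows, key=sort_key)
        hive_sorted = sorted(hive_rows, key=sort_key)
        for spark_row, hive_row in zip_longest(spark_sorted, hive_sorted):
            if spark_row != hive_row:
                return (index, spark_row, hive_row)
    return None


spark_table = [[(1, "x"), (3, "y")], [(2, "z")]]
hive_table = [[(3, "y"), (1, "x")], [(2, "z")]]
bad_hive_table = [[(3, "y"), (1, "x")], [(2, "q")]]

ok = first_mismatch(spark_table, hive_table)
bad = first_mismatch(spark_table, bad_hive_table)
```

A missing row shows up as a mismatch against `None`, because `zip_longest` pads the shorter split.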

14. Challenges in Large Jobs • A large job with 10,000 mappers × 10,000 reducers – IOPS: 100,000,000 – HDFS: 10,000 result files – Scheduling overhead: 20,000 tasks – Manual tuning – Data skewness
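The figures on this slide follow from simple arithmetic, shown here as a short worked example. The interpretation is an assumption: the IOPS number is read as one shuffle read per (mapper, reducer) pair, and the result-file count as one output file per reducer.

```python
mappers = 10_000
reducers = 10_000

# Assumed: every reducer fetches one shuffle block from every mapper.
shuffle_reads = mappers * reducers

# Each mapper and each reducer is scheduled as its own task.
total_tasks = mappers + reducers

# Assumed: each reducer writes exactly one result file to HDFS.
result_files = reducers

print(f"{shuffle_reads:,} shuffle reads, {total_tasks:,} tasks, {result_files:,} files")
```

The quadratic growth of shuffle reads is the term that dominates as jobs get larger, which is what the IOPS reduction in the later slides targets.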

15. Advanced - Secondary Partitioning

16. Pros and Cons • Pros – Reduce IOPS – Reduce the number of HDFS files – Runtime optimization – Backward compatibility: exactly the same behavior with split number = 1 – Auto-configuration: 503 partitions and 13 buckets to achieve good performance • Cons – Reduced parallelism – Need to fetch all data before computation.

17. Runtime Join Optimization

18. JIRA • SPARK-12139 – REGEX Column Specification for Hive Queries • SPARK-4131 – Support "Writing data into the filesystem from queries" • SPARK-23306 – Race condition in TaskMemoryManager • SPARK-19326 – Speculated task attempts do not get launched in few scenarios • SPARK-19839 – Fix memory leak in BytesToBytesMap

19. ACKNOWLEDGEMENTS The presentation includes work from the Spark team at Facebook. Thanks for their contributions, especially Lin Wang and Tejas Patil.

20. Questions?