SlideShare a Scribd company logo
SQL in Hadoop

Munich, 21 January 2014
Sven Bayer
QAware GmbH

■45 employees
■Software Engineering
■Quality
■Agility

■Projects
■Software diagnosis
■Individual Software solutions

■Customers

■Automotive, Energy, Retail,
Telecommunications, and others
21 January 2014

QAware

3
Agenda

1. Motivation
2. Big Data
3. MapReduce
4. Hadoop

5. Hive
6. Hive Query Compiler
7. Discussion

21 January 2014

QAware

4
Motivation

Masterthesis:
Software-metrics + time tracking

■Hadoop processes
huge data on clusters

■Hive provides

SQL for Hadoop

■Hive generates complex
MapReduce jobs from SQL

■How does Hive convert
SQL to MapReduce?

21 January 2014

QAware

5
Big Data

■Defined by 4 V‘s
■Volume
■Velocity
■Variety
■Veracity

21 January 2014

QAware

6
MapReduce

■In 2004 published by Google
■MapReduce is highly scalable on clusters
■Big Data can be processed with MapReduce
■Consists mainly of a Map and Reduce function

21 January 2014

QAware

7
MapReduce

Example: Word-Count-Algorithm
Input
Hadoop uses
MapReduce.

Map

Sort, Shuffle

{Hadoop,1},
{uses,1},
{MapReduce,1}

{Hadoop,1}
{uses,1}
{MapReduce,1}

There is
a Map phase.

There is a
Reduce phase,
a Map phase.

There is
a Map phase.

21 January 2014

{There,1}, {is,1},
{a,1},
{Map,1}, {phase,1}

{There,1}, {is,1},
{a,1}, {Reduce,1},
{phase,1}, {a,1},
{Map,1}, {phase,1}

{There,1}, {is,1},
{a,1}, {Map,1},
{phase,1}

QAware

{There,1}, {There,1},
{There, 1}
{is,1}, {is,1},
{is, 1},
{a,1}, {a,1},
{a,1}, {a,1}

Reduce

Output

{Hadoop,[1]},
{uses,[1]},
{MapReduce,[1]}

Hadoop
1
uses
1
MapReduce 1

{There,[1,1,1]},
{is,[1,1,1]},
{a,[1,1,1,1]}

There
is
a

3
3
4

Map
phase
Reduce

3
4
1

{Map,1}, {Map,1},
{Map, 1}
{phase,1}, {phase,1},
{phase, 1}, {phase,1}

{Map,[1,1,1]},
{phase,[1,1,1,1]},
{Reduce,[1]}

{Reduce,1}

8
MapReduce – In practice

■ Get the users with the products that they watched
■ Get these products with their numbers, makers, price
and filter the products on „audi“
Map

Input

Sort, Shuffle

Reduce

Output

Join on product_no
+
{pNo1,[user1,user2,{audi,30€}]},
{pNo2,[user3,{audi,50€}]}

pNo1 user1,audi,30€,
pNo1 user2,audi,30€,
pNo2 user3,audi,50€

access_log
id1

user1

pNo1

id2

user2

pNo1

id3

user3

{pNo1,user1},
{pNo1,user2},
{pNo2,user3}

pNo2

product
pNo1

audi

30€

pNo2

audi

50€

pNo3

bmw

60€

21 January 2014

{pNo1,{audi,30€}},
{pNo2,{audi,50€}},
{pNo3,{bmw,60€}}
+
Filtering on „audi“

QAware

{pNo1,user1},
{pNo1,user2},
{pNo1,{audi,30€}}

{pNo2,user3},
{pNo2,{audi,50€}}

9
Hadoop

■In 2006 initiated by Yahoo
■Hadoop cluster
■Highly scalalbe for Big Data

■Hadoop architecture
YARN (MapReduce)
HDFS
Hadoop Common

21 January 2014

QAware

10
Hive

■Built on top of Hadoop
■MapReduce
■HDFS

■Provides HiveQL queries for Hadoop
■Compiles HiveQL to MapReduce

21 January 2014

QAware

11
Hive

Hive architecture

JDBC/
ODBC

CLI

Web-UI

Legend
Thrift
Server

Framework

Call of a component

Component
Driver
Query Compiler

Parser
Semantic Analyzer
Metastore

Logical Plan Generator
Logical Optimizer

Component

Physical Plan Generator

Framework

Execution
Engine

Physical Optimizer

21 January 2014

QAware

12
Hive

Hive Query Compiler
Start

HiveQL

Parser

AST

Semantic
Analyzer

QB

Logical
Plan
Generator

QB
Tree

Logical
Optimizer

QB
Tree

Physical
Plan
Generator

Phys.
Plan

Physical
Optimizer

Phys.
Plan

Execution
Engine
21 January 2014

QAware

End
Hive Query Compiler

Parser

Parser

HiveQL

SELECT a.user, a.product_no, p.maker, p.price
FROM access_log a JOIN product p
ON (a.product_no = p.product_no)
WHERE p.maker = `audi`;

AST

access_log
id1

user1

pNo1

id2

user2

pNo1

id3

user3

pNo2

product
pNo1

21 January 2014

Logical Plan
Generator
QAware

Logical
Optimizer

Physical Plan
Generator

Physical
Optimizer

audi

50€

pNo3

Semantic
Analyzer

30€

pNo2

Parser

audi
bmw

60€

Execution
Engine
15
Hive Query Compiler

Parser

HiveQL

Parser

AST

■… WHERE p.maker = `audi`;

Parser
21 January 2014

Semantic
Analyzer

Logical Plan
Generator
QAware

Logical
Optimizer

Physical Plan
Generator

Physical
Optimizer

Execution
Engine
16
Hive Query Compiler

Semantic Analyzer

AST

Semantic
Analyzer

QB

Query Block FROM-Clause
MetaData

ParseInfo

Alias to Table Info
“a”=Table Info(“access_log”)
“p”=Table Info(“product”)

Parser
21 January 2014

Semantic
Analyzer

Logical Plan
Generator
QAware

AST of Join-Expression

Logical
Optimizer

Physical Plan
Generator

Physical
Optimizer

Execution
Engine
17
Hive Query Compiler

Logical Plan Generator

QB

Logical Plan
Generator

TableScanOperator
TS_1

TableScanOperator
TS_0

ReduceSinkOperator
RS_2

QB
Tree

ReduceSinkOperator
RS_3

JoinOperator
JOIN_4

FilterOperator
FIL_5
(maker = ‘audi’)
SelectOperator
SEL_6

FileSinkOperator
FS_7
Parser
21 January 2014

Semantic
Analyzer

Logical Plan
Generator
QAware

Logical
Optimizer

Physical Plan
Generator

Physical
Optimizer

Execution
Engine
18
Hive Query Compiler

Logical Optimizer

QB
Tree

Logical
Optimizer

QB
Tree

TableScanOperator
TS_0

TableScanOperator
TS_1

FilterOperator
FIL_8
(maker = ‘audi’)
ReduceSinkOperator
RS_2

ReduceSinkOperator
RS_3

JoinOperator
JOIN_4

SelectOperator
SEL_6

FileSinkOperator
FS_7
Parser
21 January 2014

Semantic
Analyzer

Logical Plan
Generator
QAware

Logical
Optimizer

Physical Plan
Generator

Physical
Optimizer

Execution
Engine
19
Hive Query Compiler

Phyiscal Plan Generator

QB
Tree

Physical Plan
Generator

Phys.
Plan

MapRedTask (Stage-1/root)
Mapper

Mapper

TableScanOperator
TS_1

TableScanOperator
TS_0
FilterOperator
FIL_8
(maker= ‘audi’)

ReduceSinkOperator
RS_2

Reducer

ReduceSinkOperator
RS_3

JoinOperator
JOIN_4
SelectOperator
SEL_6
FileSinkOperator
FS_7

MoveTask (Stage-0)
StatsTask (Stage-2)
Parser
21 January 2014

Semantic
Analyzer

Logical Plan
Generator
QAware

Logical
Optimizer

Physical Plan
Generator

Physical
Optimizer

Execution
Engine
20
HiveQL-Verarbeitung

Physical Optimizer

Phys.
Plan

Physical
Optimizer

Phys.
Plan

■Optimizes the Physical Plan
■Transforms a plan with Joins to multiple MapReduce jobs
■Converts tasks including a Join to a MapJoin

Parser
21 January 2014

Semantic
Analyzer

Logical Plan
Generator
QAware

Logical
Optimizer

Physical Plan
Generator

Physical
Optimizer

Execution
Engine
21
HiveQL-Verarbeitung

Execution Engine

Phys.
Plan

Execution
Engine

■MapReduce job is serialized as plan.xml
■Returns the result
■Temporary place
■Table

Parser
21 January 2014

Semantic
Analyzer

Logical Plan
Generator
QAware

Logical
Optimizer

Physical Plan
Generator

Physical
Optimizer

Execution
Engine
22
Discussion

■Hive brings SQL to Hadoop
■Advantages of Hive
■Reduces developer workload
■No need for manual coding of MapReduce jobs
■Easy migration for systems interacting with SQL

■Disadvantages of Hive
■High latency

■Outlook for Hive
■Apache Tez with container reusage, Mapper reduction in DAG

■Alternatives for Hive

■Impala, Shark, Presto, Lingual
21 January 2014

QAware

23

More Related Content

What's hot (20)

Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Sumeet Singh
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
royans
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive Comparison
Caserta
 
SQLBits XI - ETL with Hadoop
SQLBits XI - ETL with HadoopSQLBits XI - ETL with Hadoop
SQLBits XI - ETL with Hadoop
Jan Pieter Posthuma
 
Keynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at ContinuentKeynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at Continuent
Continuent
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Jonathan Seidman
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
Jonathan Seidman
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
Yahoo Developer Network
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Uwe Printz
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Uwe Printz
 
Hadoop
HadoopHadoop
Hadoop
Nishant Gandhi
 
An Introduction to the World of Hadoop
An Introduction to the World of HadoopAn Introduction to the World of Hadoop
An Introduction to the World of Hadoop
University College Cork
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
Praveen Sripati
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Uwe Printz
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
DataWorks Summit
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
tcloudcomputing-tw
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
Danairat Thanabodithammachari
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
Sandip Darwade
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
sravya raju
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
Cloudera, Inc.
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Sumeet Singh
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
royans
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive Comparison
Caserta
 
Keynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at ContinuentKeynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at Continuent
Continuent
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Jonathan Seidman
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
Jonathan Seidman
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Uwe Printz
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Uwe Printz
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
Praveen Sripati
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Uwe Printz
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
DataWorks Summit
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
Danairat Thanabodithammachari
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
sravya raju
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
Cloudera, Inc.
 

Viewers also liked (8)

Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hive
David Kaiser
 
Hive
HiveHive
Hive
Min Zhou
 
State of Big Data Adoption
State of Big Data AdoptionState of Big Data Adoption
State of Big Data Adoption
Qubole
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
nzhang
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
Recruit Technologies
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slides
Qubole
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
Joydeep Sen Sarma
 
Qubole - Big data in cloud
Qubole - Big data in cloudQubole - Big data in cloud
Qubole - Big data in cloud
Dmitry Tolpeko
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hive
David Kaiser
 
State of Big Data Adoption
State of Big Data AdoptionState of Big Data Adoption
State of Big Data Adoption
Qubole
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
nzhang
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slides
Qubole
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
Joydeep Sen Sarma
 
Qubole - Big data in cloud
Qubole - Big data in cloudQubole - Big data in cloud
Qubole - Big data in cloud
Dmitry Tolpeko
 

Similar to SQL in Hadoop (20)

Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Simplilearn
 
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Jen Stirrup
 
Hive_Pig.pptx
Hive_Pig.pptxHive_Pig.pptx
Hive_Pig.pptx
PAVANKUMARNOOKALA
 
Session 14 - Hive
Session 14 - HiveSession 14 - Hive
Session 14 - Hive
AnandMHadoop
 
Hive
HiveHive
Hive
Srinath Reddy
 
hive hadoop sql
hive hadoop sqlhive hadoop sql
hive hadoop sql
ssuserf8f9b2
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
kristinferrier
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Yahoo Developer Network
 
Hive and data analysis using pandas
 Hive  and  data analysis  using pandas Hive  and  data analysis  using pandas
Hive and data analysis using pandas
Purna Chander K
 
Hive and data analysis using pandas
Hive  and  data analysis  using pandasHive  and  data analysis  using pandas
Hive and data analysis using pandas
Purna Chander
 
Apache hive1
Apache hive1Apache hive1
Apache hive1
sheetal sharma
 
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
Puneet Kansal
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
nzhang
 
Hive @ Bucharest Java User Group
Hive @ Bucharest Java User GroupHive @ Bucharest Java User Group
Hive @ Bucharest Java User Group
Remus Rusanu
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
Luis Marques
 
SQL In/On/Around Hadoop
SQL In/On/Around Hadoop SQL In/On/Around Hadoop
SQL In/On/Around Hadoop
DataWorks Summit
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetup
Remus Rusanu
 
Hive.pptx
Hive.pptxHive.pptx
Hive.pptx
MahakSingh12
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
Apache Hive
Apache HiveApache Hive
Apache Hive
tusharsinghal58
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Simplilearn
 
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Jen Stirrup
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
kristinferrier
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Yahoo Developer Network
 
Hive and data analysis using pandas
 Hive  and  data analysis  using pandas Hive  and  data analysis  using pandas
Hive and data analysis using pandas
Purna Chander K
 
Hive and data analysis using pandas
Hive  and  data analysis  using pandasHive  and  data analysis  using pandas
Hive and data analysis using pandas
Purna Chander
 
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
Puneet Kansal
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
nzhang
 
Hive @ Bucharest Java User Group
Hive @ Bucharest Java User GroupHive @ Bucharest Java User Group
Hive @ Bucharest Java User Group
Remus Rusanu
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
Luis Marques
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetup
Remus Rusanu
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 

Recently uploaded (20)

Managing Geospatial Open Data Serverlessly [AWS Community Day CH 2025]
Managing Geospatial Open Data Serverlessly [AWS Community Day CH 2025]Managing Geospatial Open Data Serverlessly [AWS Community Day CH 2025]
Managing Geospatial Open Data Serverlessly [AWS Community Day CH 2025]
Chris Bingham
 
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCPMCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
Sambhav Kothari
 
PSEP - Salesforce Power of the Platform.pdf
PSEP - Salesforce Power of the Platform.pdfPSEP - Salesforce Power of the Platform.pdf
PSEP - Salesforce Power of the Platform.pdf
ssuser3d62c6
 
Artificial Intelligence (Kecerdasan Buatan).pdf
Artificial Intelligence (Kecerdasan Buatan).pdfArtificial Intelligence (Kecerdasan Buatan).pdf
Artificial Intelligence (Kecerdasan Buatan).pdf
NufiEriKusumawati
 
TAFs on WebDriver API - By - Pallavi Sharma.pdf
TAFs on WebDriver API - By - Pallavi Sharma.pdfTAFs on WebDriver API - By - Pallavi Sharma.pdf
TAFs on WebDriver API - By - Pallavi Sharma.pdf
Pallavi Sharma
 
Is Your QA Team Still Working in Silos? Here's What to Do.
Is Your QA Team Still Working in Silos? Here's What to Do.Is Your QA Team Still Working in Silos? Here's What to Do.
Is Your QA Team Still Working in Silos? Here's What to Do.
marketing943205
 
Storage Setup for LINSTOR/DRBD/CloudStack
Storage Setup for LINSTOR/DRBD/CloudStackStorage Setup for LINSTOR/DRBD/CloudStack
Storage Setup for LINSTOR/DRBD/CloudStack
ShapeBlue
 
Apache CloudStack 101 - Introduction, What’s New and What’s Coming
Apache CloudStack 101 - Introduction, What’s New and What’s ComingApache CloudStack 101 - Introduction, What’s New and What’s Coming
Apache CloudStack 101 - Introduction, What’s New and What’s Coming
ShapeBlue
 
Pushing the Limits: CloudStack at 25K Hosts
Pushing the Limits: CloudStack at 25K HostsPushing the Limits: CloudStack at 25K Hosts
Pushing the Limits: CloudStack at 25K Hosts
ShapeBlue
 
GraphSummit Singapore Master Deck - May 20, 2025
GraphSummit Singapore Master Deck - May 20, 2025GraphSummit Singapore Master Deck - May 20, 2025
GraphSummit Singapore Master Deck - May 20, 2025
Neo4j
 
Agentic AI, A Business Overview - May 2025
Agentic AI, A Business Overview - May 2025Agentic AI, A Business Overview - May 2025
Agentic AI, A Business Overview - May 2025
Peter Morgan
 
Building Agents with LangGraph & Gemini
Building Agents with LangGraph &  GeminiBuilding Agents with LangGraph &  Gemini
Building Agents with LangGraph & Gemini
HusseinMalikMammadli
 
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStackProposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
ShapeBlue
 
Build your own NES Emulator... with Kotlin
Build your own NES Emulator... with KotlinBuild your own NES Emulator... with Kotlin
Build your own NES Emulator... with Kotlin
Artur Skowroński
 
Building Connected Agents: An Overview of Google's ADK and A2A Protocol
Building Connected Agents:  An Overview of Google's ADK and A2A ProtocolBuilding Connected Agents:  An Overview of Google's ADK and A2A Protocol
Building Connected Agents: An Overview of Google's ADK and A2A Protocol
Suresh Peiris
 
TrustArc Webinar: Cross-Border Data Transfers in 2025
TrustArc Webinar: Cross-Border Data Transfers in 2025TrustArc Webinar: Cross-Border Data Transfers in 2025
TrustArc Webinar: Cross-Border Data Transfers in 2025
TrustArc
 
Stretching CloudStack over multiple datacenters
Stretching CloudStack over multiple datacentersStretching CloudStack over multiple datacenters
Stretching CloudStack over multiple datacenters
ShapeBlue
 
Partner Tableau Next Product First Call Deck.pdf
Partner Tableau Next Product First Call Deck.pdfPartner Tableau Next Product First Call Deck.pdf
Partner Tableau Next Product First Call Deck.pdf
ssuser3d62c6
 
Outdated Tech, Invisible Expenses – How Data Silos Undermine Operational Effi...
Outdated Tech, Invisible Expenses – How Data Silos Undermine Operational Effi...Outdated Tech, Invisible Expenses – How Data Silos Undermine Operational Effi...
Outdated Tech, Invisible Expenses – How Data Silos Undermine Operational Effi...
Precisely
 
Breaking it Down: Microservices Architecture for PHP Developers
Breaking it Down: Microservices Architecture for PHP DevelopersBreaking it Down: Microservices Architecture for PHP Developers
Breaking it Down: Microservices Architecture for PHP Developers
pmeth1
 
Managing Geospatial Open Data Serverlessly [AWS Community Day CH 2025]
Managing Geospatial Open Data Serverlessly [AWS Community Day CH 2025]Managing Geospatial Open Data Serverlessly [AWS Community Day CH 2025]
Managing Geospatial Open Data Serverlessly [AWS Community Day CH 2025]
Chris Bingham
 
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCPMCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
Sambhav Kothari
 
PSEP - Salesforce Power of the Platform.pdf
PSEP - Salesforce Power of the Platform.pdfPSEP - Salesforce Power of the Platform.pdf
PSEP - Salesforce Power of the Platform.pdf
ssuser3d62c6
 
Artificial Intelligence (Kecerdasan Buatan).pdf
Artificial Intelligence (Kecerdasan Buatan).pdfArtificial Intelligence (Kecerdasan Buatan).pdf
Artificial Intelligence (Kecerdasan Buatan).pdf
NufiEriKusumawati
 
TAFs on WebDriver API - By - Pallavi Sharma.pdf
TAFs on WebDriver API - By - Pallavi Sharma.pdfTAFs on WebDriver API - By - Pallavi Sharma.pdf
TAFs on WebDriver API - By - Pallavi Sharma.pdf
Pallavi Sharma
 
Is Your QA Team Still Working in Silos? Here's What to Do.
Is Your QA Team Still Working in Silos? Here's What to Do.Is Your QA Team Still Working in Silos? Here's What to Do.
Is Your QA Team Still Working in Silos? Here's What to Do.
marketing943205
 
Storage Setup for LINSTOR/DRBD/CloudStack
Storage Setup for LINSTOR/DRBD/CloudStackStorage Setup for LINSTOR/DRBD/CloudStack
Storage Setup for LINSTOR/DRBD/CloudStack
ShapeBlue
 
Apache CloudStack 101 - Introduction, What’s New and What’s Coming
Apache CloudStack 101 - Introduction, What’s New and What’s ComingApache CloudStack 101 - Introduction, What’s New and What’s Coming
Apache CloudStack 101 - Introduction, What’s New and What’s Coming
ShapeBlue
 
Pushing the Limits: CloudStack at 25K Hosts
Pushing the Limits: CloudStack at 25K HostsPushing the Limits: CloudStack at 25K Hosts
Pushing the Limits: CloudStack at 25K Hosts
ShapeBlue
 
GraphSummit Singapore Master Deck - May 20, 2025
GraphSummit Singapore Master Deck - May 20, 2025GraphSummit Singapore Master Deck - May 20, 2025
GraphSummit Singapore Master Deck - May 20, 2025
Neo4j
 
Agentic AI, A Business Overview - May 2025
Agentic AI, A Business Overview - May 2025Agentic AI, A Business Overview - May 2025
Agentic AI, A Business Overview - May 2025
Peter Morgan
 
Building Agents with LangGraph & Gemini
Building Agents with LangGraph &  GeminiBuilding Agents with LangGraph &  Gemini
Building Agents with LangGraph & Gemini
HusseinMalikMammadli
 
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStackProposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
ShapeBlue
 
Build your own NES Emulator... with Kotlin
Build your own NES Emulator... with KotlinBuild your own NES Emulator... with Kotlin
Build your own NES Emulator... with Kotlin
Artur Skowroński
 
Building Connected Agents: An Overview of Google's ADK and A2A Protocol
Building Connected Agents:  An Overview of Google's ADK and A2A ProtocolBuilding Connected Agents:  An Overview of Google's ADK and A2A Protocol
Building Connected Agents: An Overview of Google's ADK and A2A Protocol
Suresh Peiris
 
TrustArc Webinar: Cross-Border Data Transfers in 2025
TrustArc Webinar: Cross-Border Data Transfers in 2025TrustArc Webinar: Cross-Border Data Transfers in 2025
TrustArc Webinar: Cross-Border Data Transfers in 2025
TrustArc
 
Stretching CloudStack over multiple datacenters
Stretching CloudStack over multiple datacentersStretching CloudStack over multiple datacenters
Stretching CloudStack over multiple datacenters
ShapeBlue
 
Partner Tableau Next Product First Call Deck.pdf
Partner Tableau Next Product First Call Deck.pdfPartner Tableau Next Product First Call Deck.pdf
Partner Tableau Next Product First Call Deck.pdf
ssuser3d62c6
 
Outdated Tech, Invisible Expenses – How Data Silos Undermine Operational Effi...
Outdated Tech, Invisible Expenses – How Data Silos Undermine Operational Effi...Outdated Tech, Invisible Expenses – How Data Silos Undermine Operational Effi...
Outdated Tech, Invisible Expenses – How Data Silos Undermine Operational Effi...
Precisely
 
Breaking it Down: Microservices Architecture for PHP Developers
Breaking it Down: Microservices Architecture for PHP DevelopersBreaking it Down: Microservices Architecture for PHP Developers
Breaking it Down: Microservices Architecture for PHP Developers
pmeth1
 

SQL in Hadoop

  • 1. SQL in Hadoop Munich, 21 January 2014 Sven Bayer
  • 2. QAware GmbH ■45 employees ■Software Engineering ■Quality ■Agility ■Projects ■Software diagnosis ■Individual Software solutions ■Customers ■Automotive, Energy, Retail, Telecommunications, and others 21 January 2014 QAware 3
  • 3. Agenda 1. Motivation 2. Big Data 3. MapReduce 4. Hadoop 5. Hive 6. Hive Query Compiler 7. Discussion 21 January 2014 QAware 4
  • 4. Motivation Masterthesis: Software-metrics + time tracking ■Hadoop processes huge data on clusters ■Hive provides SQL for Hadoop ■Hive generates complex MapReduce jobs from SQL ■How does Hive convert SQL to MapReduce? 21 January 2014 QAware 5
  • 5. Big Data ■Defined by 4 V‘s ■Volume ■Velocity ■Variety ■Veracity 21 January 2014 QAware 6
  • 6. MapReduce ■In 2004 published by Google ■MapReduce is highly scalable on clusters ■Big Data can be processed with MapReduce ■Consists mainly of a Map and Reduce function 21 January 2014 QAware 7
  • 7. MapReduce Example: Word-Count-Algorithm Input Hadoop uses MapReduce. Map Sort, Shuffle {Hadoop,1}, {uses,1}, {MapReduce,1} {Hadoop,1} {uses,1} {MapReduce,1} There is a Map phase. There is a Reduce phase, a Map phase. There is a Map phase. 21 January 2014 {There,1}, {is,1}, {a,1}, {Map,1}, {phase,1} {There,1}, {is,1}, {a,1}, {Reduce,1}, {phase,1}, {a,1}, {Map,1}, {phase,1} {There,1}, {is,1}, {a,1}, {Map,1}, {phase,1} QAware {There,1}, {There,1}, {There, 1} {is,1}, {is,1}, {is, 1}, {a,1}, {a,1}, {a,1}, {a,1} Reduce Output {Hadoop,[1]}, {uses,[1]}, {MapReduce,[1]} Hadoop 1 uses 1 MapReduce 1 {There,[1,1,1]}, {is,[1,1,1]}, {a,[1,1,1,1]} There is a 3 3 4 Map phase Reduce 3 4 1 {Map,1}, {Map,1}, {Map, 1} {phase,1}, {phase,1}, {phase, 1}, {phase,1} {Map,[1,1,1]}, {phase,[1,1,1,1]}, {Reduce,[1]} {Reduce,1} 8
  • 8. MapReduce – In practice ■ Get the users with the products that they watched ■ Get these products with their numbers, makers, price and filter the products on „audi“ Map Input Sort, Shuffle Reduce Output Join on product_no + {pNo1,[user1,user2,{audi,30€}]}, {pNo2,[user3,{audi,50€}]} pNo1 user1,audi,30€, pNo1 user2,audi,30€, pNo2 user3,audi,50€ access_log id1 user1 pNo1 id2 user2 pNo1 id3 user3 {pNo1,user1}, {pNo1,user2}, {pNo2,user3} pNo2 product pNo1 audi 30€ pNo2 audi 50€ pNo3 bmw 60€ 21 January 2014 {pNo1,{audi,30€}}, {pNo2,{audi,50€}}, {pNo3,{bmw,60€}} + Filtering on „audi“ QAware {pNo1,user1}, {pNo1,user2}, {pNo1,{audi,30€}} {pNo2,user3}, {pNo2,{audi,50€}} 9
  • 9. Hadoop ■In 2006 initiated by Yahoo ■Hadoop cluster ■Highly scalalbe for Big Data ■Hadoop architecture YARN (MapReduce) HDFS Hadoop Common 21 January 2014 QAware 10
  • 10. Hive ■Built on top of Hadoop ■MapReduce ■HDFS ■Provides HiveQL queries for Hadoop ■Compiles HiveQL to MapReduce 21 January 2014 QAware 11
  • 11. Hive Hive architecture JDBC/ ODBC CLI Web-UI Legend Thrift Server Framework Call of a component Component Driver Query Compiler Parser Semantic Analyzer Metastore Logical Plan Generator Logical Optimizer Component Physical Plan Generator Framework Execution Engine Physical Optimizer 21 January 2014 QAware 12
  • 13. Hive Query Compiler Parser Parser HiveQL SELECT a.user, a.product_no, p.maker, p.price FROM access_log a JOIN product p ON (a.product_no = p.product_no) WHERE p.maker = `audi`; AST access_log id1 user1 pNo1 id2 user2 pNo1 id3 user3 pNo2 product pNo1 21 January 2014 Logical Plan Generator QAware Logical Optimizer Physical Plan Generator Physical Optimizer audi 50€ pNo3 Semantic Analyzer 30€ pNo2 Parser audi bmw 60€ Execution Engine 15
  • 14. Hive Query Compiler Parser HiveQL Parser AST ■… WHERE p.maker = `audi`; Parser 21 January 2014 Semantic Analyzer Logical Plan Generator QAware Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine 16
  • 15. Hive Query Compiler Semantic Analyzer AST Semantic Analyzer QB Query Block FROM-Clause MetaData ParseInfo Alias to Table Info “a”=Table Info(“access_log”) “p”=Table Info(“product”) Parser 21 January 2014 Semantic Analyzer Logical Plan Generator QAware AST of Join-Expression Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine 17
  • 16. Hive Query Compiler Logical Plan Generator QB Logical Plan Generator TableScanOperator TS_1 TableScanOperator TS_0 ReduceSinkOperator RS_2 QB Tree ReduceSinkOperator RS_3 JoinOperator JOIN_4 FilterOperator FIL_5 (maker = ‘audi’) SelectOperator SEL_6 FileSinkOperator FS_7 Parser 21 January 2014 Semantic Analyzer Logical Plan Generator QAware Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine 18
  • 17. Hive Query Compiler Logical Optimizer QB Tree Logical Optimizer QB Tree TableScanOperator TS_0 TableScanOperator TS_1 FilterOperator FIL_8 (maker = ‘audi’) ReduceSinkOperator RS_2 ReduceSinkOperator RS_3 JoinOperator JOIN_4 SelectOperator SEL_6 FileSinkOperator FS_7 Parser 21 January 2014 Semantic Analyzer Logical Plan Generator QAware Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine 19
  • 18. Hive Query Compiler Phyiscal Plan Generator QB Tree Physical Plan Generator Phys. Plan MapRedTask (Stage-1/root) Mapper Mapper TableScanOperator TS_1 TableScanOperator TS_0 FilterOperator FIL_8 (maker= ‘audi’) ReduceSinkOperator RS_2 Reducer ReduceSinkOperator RS_3 JoinOperator JOIN_4 SelectOperator SEL_6 FileSinkOperator FS_7 MoveTask (Stage-0) StatsTask (Stage-2) Parser 21 January 2014 Semantic Analyzer Logical Plan Generator QAware Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine 20
  • 19. HiveQL-Verarbeitung Physical Optimizer Phys. Plan Physical Optimizer Phys. Plan ■Optimizes the Physical Plan ■Transforms a plan with Joins to multiple MapReduce jobs ■Converts tasks including a Join to a MapJoin Parser 21 January 2014 Semantic Analyzer Logical Plan Generator QAware Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine 21
  • 20. HiveQL-Verarbeitung Execution Engine Phys. Plan Execution Engine ■MapReduce job is serialized as plan.xml ■Returns the result ■Temporary place ■Table Parser 21 January 2014 Semantic Analyzer Logical Plan Generator QAware Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine 22
  • 21. Discussion ■Hive brings SQL to Hadoop ■Advantages of Hive ■Reduces developer workload ■No need for manual coding of MapReduce jobs ■Easy migration for systems interacting with SQL ■Disadvantages of Hive ■High latency ■Outlook for Hive ■Apache Tez with container reusage, Mapper reduction in DAG ■Alternatives for Hive ■Impala, Shark, Presto, Lingual 21 January 2014 QAware 23
  翻译: