Cloudera Impala: A modern SQL Query Engine for Hadoop

Cloudera
Impala:
A
Modern
SQL
USE
PUBLICLY

DO
NOT

Query
Engine
for
Hadoop
PRIOR
TO
10/23/12

Headline
Goes
Here

JusJn
Erickson
|

Product
Manager

Speaker
Name
or
Subhead
Goes
Here

January
2013

Agenda

•  Intro
to
Impala

•  Impala’s
Architecture

•  Comparisons

Confidential. ©2013 Cloudera, Inc. All
2
Rights Reserved.

Why
Hadoop?

•  Scalability

•  Simply
scales
just
by
adding
nodes

•  Local
processing
to
avoid
network
boTlenecks

•  Flexibility

•  All
kinds
of
data
(blobs,
documents,
records,
etc)

•  In
all
forms
(structured,
semi-‐structured,
unstructured)

•  Store
anything
then
later
analyze
what
you
need

•  Efficiency

•  Cost
efficiency
(<$1k/TB)
on
commodity
hardware

•  Unified
storage,
metadata,
security
(no
duplicaJon
or
synchronizaJon)

Rights Reserved.

What’s
Impala?

•  Interac<ve
SQL

•  Typically
4-‐35x
faster
than
Hive
(observed
up
to
100x
faster)

•  Responses
in
seconds
instead
of
minutes
(someJmes
sub-‐second)

•  Nearly
ANSI-‐92
standard
SQL
queries
with
HiveQL

•  CompaJble
SQL
interface
for
exisJng
Hadoop/CDH
applicaJons

•  Based
on
industry
standard
SQL

•  Na<vely
on
Hadoop/HBase
storage
and
metadata

•  Flexibility,
scale,
and
cost
advantages
of
Hadoop

•  No
duplicaJon/synchronizaJon
of
data
and
metadata

•  Local
processing
to
avoid
network
boTlenecks

•  Separate
run<me
from
MapReduce

•  MapReduce
is
designed
and
great
for
batch

•  Impala
is
purpose-‐built
for
low-‐latency
SQL
queries
on
Hadoop

4
Rights Reserved.

So
what?

•  Interac<ve
BI/analy<cs

•  BI
tools
impracJcal
on
Hadoop
before
Impala

•  Move
from
10s
of
Hadoop
users
per
cluster
to
100s
of
SQL
users

•  More
and
faster
value
from
“big
data”

•  ELT/data
processing
with
<ght
SLAs

•  Sub-‐minute
SLAs
now
possible

•  Cost
eﬃciency

•  Fewer
nodes
to
meet
response
Jme
SLAs

5
Rights Reserved.

Impala
Architecture

•  Two
binaries:
impalad
and
statestored

•  Impala
daemon
(impalad)

•  one
Impala
daemon
on
each
node
with
data

•  handles
external
client
requests
and
all
internal
requests

related
to
query
execuJon

•  State
store
daemon
(statestored)

•  provides
name
service
and
metadata
distribuJon

Rights Reserved.

Impala
Architecture:
Query
ExecuJon
Phases

•  Request
arrives
via
ODBC/JDBC/Beeswax/Shell

•  Planner
turns
request
into
collecJons
of
plan
fragments

•  Coordinator
iniJates
execuJon
on
impalad's
local
to
data

•  During
execuJon:

•  intermediate
results
are
streamed
between
executors

•  query
results
are
streamed
back
to
client

•  subject
to
limitaJons
imposed
to
blocking
operators
(top-‐n,

aggregaJon)

Rights Reserved.

Impala
Architecture:
Planner

•  Example:
query
with
join
and
aggregaJon

SELECT
state,
SUM(revenue)

FROM
HdfsTbl
h
JOIN
HbaseTbl
b
ON
(...)

GROUP
BY
1
ORDER
BY
2
desc
LIMIT
10

TopN
Agg

TopN

Agg
Hash

Hash
Agg
Join

Join
Hdfs
Hbase

Exch
Exch

Hdfs
Hbase
Scan
Scan

at
coordinator
at
DataNodes
at
region
servers

Scan
Scan

Rights Reserved.

Impala
Architecture:
Query
ExecuJon

•  Request
arrives
via
ODBC/JDBC/Beeswax/Shell

SQL
App
Hive

Metastore

HDFS
NN
Statestore

ODBC

SQL

request

Query
Planner
Query
Planner
Query
Planner

Query
Coordinator
Query
Coordinator
Query
Coordinator

Query
Executor
Query
Executor
Query
Executor

HDFS
DN
HBase
HDFS
DN
HBase
HDFS
DN
HBase

Rights Reserved.

Impala
Architecture:
Query
ExecuJon

•  Planner
turns
request
into
collecJons
of
plan
fragments

•  Coordinator
iniJates
execuJon
on
impalad's
local
to
data

SQL
App
Hive

Metastore

HDFS
NN
Statestore

ODBC

Query
Planner
Query
Planner
Query
Planner

Query
Coordinator
Query
Coordinator
Query
Coordinator

Query
Executor
Query
Executor
Query
Executor

HDFS
DN
HBase
HDFS
DN
HBase
HDFS
DN
HBase

Rights Reserved.

Impala
Architecture:
Query
ExecuJon

•  Intermediate
results
are
streamed
between
impalad’s

•  Query
results
are
streamed
back
to
client

SQL
App
Hive

Metastore

HDFS
NN
Statestore

ODBC

query

results

Query
Planner
Query
Planner
Query
Planner

Query
Coordinator
Query
Coordinator
Query
Coordinator

Query
Executor
Query
Executor
Query
Executor

HDFS
DN
HBase
HDFS
DN
HBase
HDFS
DN
HBase

Rights Reserved.

Impala
and
Hive

•  Shared
with
Hive:

•  Metadata
(table
deﬁniJons)

•  ODBC/JDBC
drivers

•  Hue
Beeswax

•  SQL
syntax
(HiveQL)

•  Flexible
ﬁle
formats

•  Machine
pool

•  Improvements:

•  Purpose-‐built
query
engine
direct
on
HDFS
and
HBase

•  No
JVM
startup
and
no
MapReduce

•  In-‐memory
data
transfers

•  NaJve
distributed
relaJonal
query
engine

Rights Reserved.

What
about
an
EDW/RDBMS?

•  “Right
tool
for
the
right
job”

•  EDW/RDBMS
great
for:

•  OLTP’s
complex
transacJons

•  Highly
planned
and
opJmized
known
workloads

•  Opera4onal
reports
and
drill
into
repeated
known
queries

•  Impala’s
great
for:

•  Exploratory
analy4cs
with
new
previously-‐unknown
queries

•  Queries
on
big
and
growing
data
sets

•  EDW/RDBMS
can’t:

•  Dump
in
raw
data
then
later
deﬁne
schema
and
query
what
you
want

•  Evolve
schemas
without
an
expensive
schema
upgrade
planning
process

•  Simply
scales
just
by
adding
nodes

•  Store
at
<
$1k/TB
instead
of
$10-‐150k/TB

13
Rights Reserved.

AlternaJve
Hadoop
Query
Approaches

MapReduce
Remote
Query
Side
Storage

Query
Query
Query
Query

Node
Node
Node
Node
Query
MR

Hive
Engine

MR
OR
MR
DN

NN

DN
HDFS

DN
DN
DN

High-‐latency
MR
Network
boTleneck
Query
subset
of
data

Separate
nodes
for
SQL/MR
Separate
nodes
for
SQL/MR
RDBMS
rigid
schema

Duplicate
metadata,
Duplicate
metadata,
Duplicate
storage,

security,
SQL,
MR,
etc.
security,
SQL,
MR,
etc.
metadata,
security,
SQL,

etc.

Rights Reserved.

Comparing
Impala
to
Dremel

•  What
is
Dremel:

•  columnar
storage
for
data
with
nested
structures

•  distributed
scalable
aggregaJon
on
top
of
that

•  Columnar
storage
in
Hadoop:
joint
project
between
Cloudera
and
TwiTer

•  new
columnar
format,
derived
from
Doug
Culng's
Trevni

•  stores
data
in
appropriate
naJve/binary
types

•  can
also
store
nested
structures
similar
to
Dremel's
ColumnIO

•  Distributed
aggregaJon:
Impala

•  Impala
plus
columnar
format:
a
superset
of
the
published
version
of
Dremel

(which
didn't
support
joins
and
mulJple
ﬁle
formats)

Rights Reserved.

Impala
Roadmap

•  GA
(target
April
2013)

•  All
CDH4
OSes:
RHEL/CentOS,
Ubuntu,
Debian,
SLES

•  JDBC
driver

•  More
formats:
Avro,
LZO-‐compressed

•  Columnar
format

•  MR/Impala
resource
isolaJon

•  Perf
(joins,
aggregaJons,
SQL
features)

•  AutomaJc
metadata
distribuJon

•  Post-‐GA
top
requests:

•  UDFs

•  Memory
caching

•  Nested
data

•  Window
funcJons

16
Rights Reserved.

Validated
Beta
Partners

POWERED BY

IMPALA

Rights Reserved.

Cloudera Impala: A modern SQL Query Engine for Hadoop

Recommended

More Related Content

What's hot (20)

Viewers also liked (8)

Similar to Cloudera Impala: A modern SQL Query Engine for Hadoop (20)

More from Cloudera, Inc. (20)

Recently uploaded (20)

Cloudera Impala: A modern SQL Query Engine for Hadoop