002 Introduction to hadoop v3

Introduction to Apache Hadoop
Instructor
Dendej Sawarnkatat
dendej@gmail.com

Agenda
• Big Data Computation
• Introduction to Hadoop
• Hadoop Architecture
• MapReduce
• Hadoop Ecosystems

Traditional Approach
• Enterprise Computation
[Figure: large data is processed by a powerful computer]

Traditional Approach
[Figure: big data hits the processing limit of a powerful computer; only so much data can be processed]

Breaking Down the Data
[Figure: big data is broken into pieces]

Moving Computation to Data
• Concurrent Computation of Smaller Data
[Figure: big data is split into pieces, computation runs on each piece, and the partial results are combined into one result]
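The idea on this slide can be sketched in a few lines of Python. This is only an illustration on a single machine: threads stand in for cluster nodes, a plain sum stands in for the real computation, and the names `split_into_pieces`, `compute_on_piece`, and `process` are hypothetical helpers, not part of Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_pieces(data, piece_size):
    """Break the data set into consecutive pieces of at most piece_size items."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def compute_on_piece(piece):
    """The per-piece computation; a plain sum stands in for real work."""
    return sum(piece)


def process(data, piece_size=1000, workers=4):
    """Run the computation on every piece concurrently, then combine."""
    pieces = split_into_pieces(data, piece_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(compute_on_piece, pieces))
    return sum(partial_results)  # the "combined result" step


big_data = list(range(10_000))
total = process(big_data)
```

In a real cluster the pieces already live on different machines, so the computation is sent to where each piece is stored instead of copying the data to one place.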

Fault Tolerance is a “MUST”

Distributed Computing
• The key issues involved in this solution:
  • Hardware failure
  • Combining the data after analysis
  • Network-associated problems

CAP Theorem
• The CAP theorem (or Brewer’s theorem) concerns
three basic requirements of a distributed
system
• Consistency: all the servers in the system have
the same data
• Availability: all the servers in the system are
available and return whatever data they have
(even if that data is not consistent across the
system)
• Partition tolerance: the system continues to
operate as a whole despite arbitrary message loss
or failure of a part of the system

CAP Theorem (2)
According to the theorem, a distributed system CANNOT satisfy all three
requirements at the SAME time (the “two out of three” concept).

Problems in Distributed Computing
1. Hardware Failure:
• As soon as we start using many pieces of
hardware, the chance that one will fail is
fairly high.
2. Combine the data after analysis:
• Most analysis tasks need to be able to
combine the data in some way; data read
from one disk may need to be combined
with the data from any of the other 99
disks.
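The first problem can be made concrete with a little arithmetic. The sketch below assumes that devices fail independently and uses an invented 3% yearly failure rate per disk; both are illustrative assumptions, not figures from these slides.

```python
def prob_at_least_one_failure(n_devices, p_single):
    """Probability that at least one of n independent devices fails."""
    return 1.0 - (1.0 - p_single) ** n_devices


# Assumed, illustrative figure: each disk has a 3% chance of failing in a year.
P_SINGLE = 0.03

for n in (1, 10, 100, 1000):
    print(f"{n:>5} disks: {prob_at_least_one_failure(n, P_SINGLE):.3f}")
```

Under these assumptions a failure somewhere among 100 disks is far more likely than not, which is why the system must be designed to expect failures.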

What’s Hadoop?
• “An open source software platform for
distributed storage and distributed
processing of very large data sets on
computer clusters built on commodity
hardware” - Hortonworks.
• Hadoop solves the first problem by avoiding data
loss through replication
• Redundant copies of the data are kept by
the system so that, in the event of failure,
another copy is available

What’s Hadoop? (cont’d)
• The second problem is solved by a simple
programming model called MapReduce.
• Hadoop is also a highly popular open
source implementation of MapReduce
• a powerful tool designed for deep analysis
and transformation of very large data sets.
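To give a feel for the programming model, here is a minimal word-count sketch in plain Python. It is not Hadoop code: the function names `map_fn`, `shuffle`, and `reduce_fn` are hypothetical, and the grouping step that a MapReduce framework performs between the two phases is simulated with a dictionary.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word, counts):
    """Reduce: combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_fn(line))
    grouped = shuffle(mapped)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


counts = word_count(["big data big cluster", "data node"])
```

The developer supplies only the map and reduce logic; splitting the input, running the functions on many machines, and grouping by key are the framework's job.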

Hadoop…
• Where does the name come from?
• The “legend” says that the name comes from
a toy elephant belonging to the son of Doug
Cutting (one of the founders of the project).
• That is also why the logo is a yellow smiling
elephant.

History
• [2002] The project that becomes Hadoop,
started by Doug Cutting (as part of the Lucene
project), begins as an open source search
engine for the Web.
• It has its origins in Apache Nutch, part of
the Lucene project (a full-text search engine).
• [2003] Google publishes a paper
describing its own distributed file system,
the Google File System (GFS).

History (1)
• [2004] The first version of NDFS, the Nutch
Distributed FS, implements the design in
Google’s paper.
• [2004] Google publishes another paper,
introducing the MapReduce programming model
• [2005] The first version of MapReduce is
implemented in Nutch

History (2)
• [2005 (end)] Nutch’s MapReduce is
running on NDFS
• [2006 (Feb)] Nutch’s MapReduce and
NDFS become the core of a new Lucene
subproject, Hadoop.
• [2008] Yahoo launches the world’s largest
Hadoop PRODUCTION site

Key Features
• Automatic parallelization and distribution
• Fault-tolerance
• Data Locality
• Developers write only the Map and Reduce
functions
• Single-threaded model

Why Hadoop ?
• Cheaper – scales to petabytes/zettabytes
and more with commodity hardware
• Faster – Parallel Processing
• Better – Suitable for particular types of
‘Big Data’ applications

Right Data
• LOB (Line of Business) – not suitable
  • Transactional Data
• Behavioral Data – suitable
  • Web usage
  • Shopping behavior
  • etc.

Hadoop Applications
• Risk Modeling
• Customer Churn Analysis
• Recommendation Engine
• Ad Targeting
• Transaction Analysis
• Threat Analysis
• Search Quality

Who uses Hadoop?
• Facebook
• Yahoo
• Amazon
• eBay
• IBM
• New York Times
• Etc

HDFS
• The Hadoop Distributed File System
• From a developer’s point of view it looks like a
standard file system
• Runs on top of the OS file system (ext3, …)
• Designed to store a very large amount of
data (petabytes and so on) and to solve
some problems that come with DFS or NFS
• Provides fast and scalable access to the
data
• Stores data reliably

HDFS under the hood
• All the files loaded into Hadoop are split into
chunks, called blocks.
• Each block has a fixed maximum size: 64 MB by
default in early versions (newer versions
default to 128 MB). The last block of a file
may be smaller.
[Figure: MyData (~150 MB) stored in HDFS as three blocks: Blk_01 (64 MB), Blk_02 (64 MB), Blk_03 (22 MB)]
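The arithmetic behind the 150 MB example can be sketched as follows. The helper `split_into_blocks` is a hypothetical illustration, not an HDFS API; it only computes block sizes.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file of the given size occupies."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left
    return sizes


blocks = split_into_blocks(150)
for number, size in enumerate(blocks, start=1):
    print(f"Blk_{number:02d}: {size} MB")
# Blk_01: 64 MB
# Blk_02: 64 MB
# Blk_03: 22 MB
```

With a larger block size the same file needs fewer blocks, for example `split_into_blocks(150, 128)` gives two blocks of 128 MB and 22 MB.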

Hadoop cluster
• A Hadoop cluster consists mainly of two
modules:
• A way to store distributed data, the HDFS or
Hadoop Distributed File System (storage
layer)
• A way to process data, the MapReduce
(compute layer).

Name Node (Master)
• A dedicated node where the metadata
of all the files (blocks) in the system
is stored.
• It’s the directory manager of HDFS

Data Node (Slave)
• A daemon (a service, in Windows
terms) running on each cluster node.
• Responsible for storing the blocks

Accessing Data
• To access a file, a client contacts the
Namenode to retrieve the list of locations
for the blocks.
• With the locations, the client contacts the
Datanodes to read the data (possibly in
parallel).
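The two-step read described above can be sketched with a toy in-memory model. The classes `NameNode` and `DataNode` and the function `read_file` are hypothetical stand-ins written for this illustration; they are not the Hadoop client API, and the blocks are read one after another instead of in parallel.

```python
class NameNode:
    """Holds metadata only: which blocks make up a file, and where they live."""

    def __init__(self):
        self.file_blocks = {}      # file name -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def get_block_locations(self, filename):
        return [(b, self.block_locations[b]) for b in self.file_blocks[filename]]


class DataNode:
    """Stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(filename, namenode, datanodes):
    """Client side: ask the NameNode for locations, then read from DataNodes."""
    data = b""
    for block_id, locations in namenode.get_block_locations(filename):
        data += datanodes[locations[0]].read_block(block_id)  # use first replica
    return data


# Tiny hypothetical cluster: two DataNodes, one file made of two blocks.
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_01"] = b"hello "
dn2.blocks["blk_02"] = b"hadoop"
nn = NameNode()
nn.file_blocks["/mydata.txt"] = ["blk_01", "blk_02"]
nn.block_locations = {"blk_01": ["dn1"], "blk_02": ["dn2"]}

content = read_file("/mydata.txt", nn, {"dn1": dn1, "dn2": dn2})
```

Note that the file contents never pass through the Namenode; it only answers the question of where the blocks are.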

Data Redundancy
• Hadoop keeps THREE replicas of each block
(by default) as it’s stored in HDFS.
• The location of every block is managed by
the Namenode
• If a block is under-replicated (due to a
failure on a node), the Namenode is smart
enough to create another replica, until each
block again has three replicas inside the cluster
• E.g. if we have 100 TB of data to store in
Hadoop, we will need 300 TB of storage
space.
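Both points, the storage cost and the repair of under-replicated blocks, can be sketched in plain Python. The function names and the node names (`n1` … `n5`) are invented for this illustration, and the choice of target nodes is deliberately naive; real placement policies are more involved.

```python
REPLICATION_FACTOR = 3


def raw_storage_needed(data_tb, replication=REPLICATION_FACTOR):
    """Raw disk space (TB) needed to hold data_tb of user data."""
    return data_tb * replication


def re_replicate(block_locations, live_nodes, replication=REPLICATION_FACTOR):
    """Drop replicas on dead nodes and add copies on live nodes up to the target."""
    repaired = {}
    for block, nodes in block_locations.items():
        holders = [n for n in nodes if n in live_nodes]
        if not holders:
            repaired[block] = []  # every replica was lost; nothing to copy from
            continue
        candidates = [n for n in sorted(live_nodes) if n not in holders]
        missing = max(0, replication - len(holders))
        repaired[block] = holders + candidates[:missing]
    return repaired


# Node n3 has failed; both blocks lose one replica and get a new one elsewhere.
locations = {"blk_01": ["n1", "n2", "n3"], "blk_02": ["n2", "n3", "n4"]}
after_failure = re_replicate(locations, live_nodes={"n1", "n2", "n4", "n5"})
```

Here `raw_storage_needed(100)` returns 300, matching the 100 TB example, and every block in `after_failure` is back to three replicas on live nodes.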

Hadoop 2.0: Next-gen platform

Hadoop 2.0
• Store all data in one place
• Interact with data in multiple ways

Hadoop 2.x
• The new Hadoop now has FOUR modules
(instead of two)
  • Hadoop Common: common utilities
  supporting all the other modules
  • HDFS: an evolution of the previous
  distributed FS
  • Hadoop YARN: a framework for job scheduling
  and cluster resource management
  • Hadoop MapReduce: a YARN-based system
  for parallel processing of large data sets

Hadoop 2.x
• Hadoop v2, leveraging YARN, aims to
become the new OS for data processing

Hadoop and real time
• Hadoop v2, using YARN together with Storm (a
free and open source distributed real-time
computation system), can process your
data in real time
• Some Hadoop distributions (like
Hortonworks) are working on an effortless
integration

Namenode availability
• If the Namenode fails, the WHOLE cluster
becomes inaccessible
• In the early versions the Namenode was a
single point of failure
• A couple of solutions are now available:
  • the Namenode stores its metadata on the
  network through NFS
  • most production sites have two Namenodes:
  Active and Standby

Hadoop 3.x Features
• Support for Erasure Coding in HDFS
• YARN Timeline Service v.2
• Shell Script Rewrite
• Shaded Client Jars
• Support for Opportunistic Containers
• MapReduce Task-Level Native
Optimization
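The practical appeal of erasure coding is lower storage overhead than 3x replication. The sketch below compares the two for 100 TB of data; the Reed-Solomon layout of 6 data units plus 3 parity units is an assumed configuration chosen for illustration, and the function names are hypothetical.

```python
def replication_overhead(replicas=3):
    """Raw bytes stored per byte of user data with plain replication."""
    return float(replicas)


def erasure_coding_overhead(data_units, parity_units):
    """Raw bytes stored per byte of user data with a (data, parity) code."""
    return (data_units + parity_units) / data_units


DATA_TB = 100
replicated_tb = DATA_TB * replication_overhead(3)
# Assumed layout: 6 data units + 3 parity units per stripe.
erasure_coded_tb = DATA_TB * erasure_coding_overhead(6, 3)

print(f"3x replication: {replicated_tb:.0f} TB raw")
print(f"RS(6,3) coding: {erasure_coded_tb:.0f} TB raw")
```

Under this assumption the same data needs half the raw space, at the price of extra computation when data has to be reconstructed.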

Hadoop 3.x Features
• Support for More than 2 NameNodes
• Default Ports of Multiple Services have
been Changed
• Support for Filesystem Connector
• Intra-DataNode Balancer
• Reworked Daemon and Task Heap
Management

Hadoop Distributions
Open Source      Commercial     Cloud
Apache Hadoop    Cloudera       AWS
                 Hortonworks    Microsoft Azure HDInsight
                 MapR           DataProc

Hadoop related projects
• PIG: a high-level language for analyzing large
data sets. It works as a compiler that
produces MapReduce (M/R) jobs
• HIVE: data warehouse software that facilitates
querying and managing large data sets with a
SQL-like language
• HBase: a scalable, distributed database that
supports structured data storage for large
tables
• Cassandra: a scalable multi-master database
