Hadoop Developer Training
Session 01 – Introduction to Hadoop & Big Data
Agenda
• What is Big Data?
• What is Hadoop?
• Overview of Hadoop Ecosystem
• Hadoop Distributed File System or HDFS
• Hadoop Cluster Modes
• YARN
• MapReduce
• Hive
• Pig
• Zookeeper
• Flume
• Sqoop
What is Big Data?

Big data can be characterized by the 3 Vs:
• The extreme volume of data.
• The velocity at which the data must be processed.
• The wide variety of types of data.

 Volume: the size, amount, or quantity of data.
 Velocity: the speed of data.
  The speed at which data must be stored.
  The speed at which data must be processed.
 Variety: the type of data to be stored or processed.
  Structured data
  Unstructured data
  Semi-structured data
Characterization of Big Data

Volume, Velocity, Variety (the 3 Vs)
What Is Hadoop?

 A framework for storing and processing data using commodity hardware and storage.

We need a system that supports:
• Distributed, parallel processing
• Built-in backup and fail-over mechanisms
• Easy, economical scalability
• Efficient and reliable operation

So we need Hadoop.
Overview of the Hadoop Ecosystem

Hadoop ecosystem components (diagram)
The Hadoop Distributed File System (HDFS)

• HDFS is the storage system for a Hadoop cluster.
• When data arrives at the cluster, HDFS breaks it into pieces and distributes those pieces among the servers participating in the cluster.
• Each server stores just a small fragment of the complete data set.
• Each piece of data is replicated on more than one server.
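As a minimal illustration of this behaviour, the sketch below uses the HDFS Java API to write a small file, print its replication factor and block size, and read it back. It assumes a reachable cluster configured through core-site.xml on the classpath; the path /user/training/demo.txt is purely illustrative.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (and other settings) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/training/demo.txt");   // illustrative path

        // Write: the client streams data; HDFS splits it into blocks and replicates them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Inspect: block size and replication factor are stored per file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication = " + status.getReplication()
                + ", block size = " + status.getBlockSize());

        // Read the file back.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}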
Hadoop Cluster Modes

Hadoop can run in three different modes:
• Standalone mode
• Pseudo-distributed mode (single-node cluster)
• Fully distributed mode (multi-node cluster)

Standalone Mode
 The default mode of Hadoop.
 HDFS is not used in this mode; the local file system is used for input and output.
 No custom configuration is required in the three Hadoop configuration files:
  mapred-site.xml
  core-site.xml
  hdfs-site.xml
 Standalone mode is much faster than pseudo-distributed mode.
Hadoop Cluster Modes

Pseudo-Distributed Mode (Single-Node Cluster)
 Configuration is required in the three files listed above; the HDFS replication factor is one.
 A single node acts as Master Node, Data Node, Job Tracker, and Task Tracker.
 Used to test real code against HDFS.
 A pseudo-distributed cluster is one where all daemons run on a single node.

Fully Distributed Mode (Multi-Node Cluster)
 This is the production setup.
 Data is distributed across many nodes.
 Different nodes act as Master Node, Data Node, Job Tracker, and Task Tracker.
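A quick way to check which mode a client installation is configured for is to read the relevant properties from the Hadoop configuration, as in the rough sketch below. The defaults shown (file:/// and a replication factor of 3) are Hadoop's own defaults; anything else comes from the XML files found on the classpath.

import org.apache.hadoop.conf.Configuration;

public class ShowClusterMode {
    public static void main(String[] args) {
        // Loads core-site.xml / hdfs-site.xml / mapred-site.xml found on the classpath.
        Configuration conf = new Configuration();

        // Standalone mode leaves the default "file:///"; pseudo- and fully
        // distributed modes point this at HDFS, e.g. "hdfs://namenode:9000".
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS", "file:///"));

        // Pseudo-distributed clusters typically set this to 1; the HDFS default is 3.
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
    }
}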
Hadoop Cluster – Core Components

A Hadoop cluster has three kinds of components:
 Client
 Master
 Slave
The role of each component is described on the following slides.
Hadoop Cluster – Core Components

Client:
The client is neither master nor slave. It loads data into the cluster, submits MapReduce jobs describing how the data should be processed, and then retrieves the results after job completion.
Hadoop Cluster – Core Components

Masters:
The masters consist of three components:
 NameNode
 Secondary NameNode
 JobTracker
Hadoop Cluster – Core Components

Slaves:
Slave nodes make up the majority of machines in a Hadoop cluster and are responsible for
 storing the data
 processing the computation
Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their masters: the TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a slave to the NameNode.
Hadoop Cluster – Core Components

The block replication factor is controlled by the dfs.replication parameter in the file hdfs-site.xml.

Equip the NameNode with a highly redundant, enterprise-class server configuration: dual power supplies, hot-swappable fans, redundant NIC connections, etc.
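As a sketch of how dfs.replication interacts with individual files: the value in hdfs-site.xml only sets the default for newly created files, while existing files can be re-replicated through the FileSystem API. The path below is illustrative (for example, the file created in the earlier HDFS sketch).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/training/demo.txt");   // illustrative existing file

        // dfs.replication is only the default for new files;
        // existing files can be re-replicated per file.
        boolean accepted = fs.setReplication(path, (short) 3);
        System.out.println("request accepted = " + accepted
                + ", replication now = " + fs.getFileStatus(path).getReplication());

        fs.close();
    }
}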
YARN

YARN stands for Yet Another Resource Negotiator; it is also called MapReduce 2 (MRv2). The two major responsibilities of the JobTracker in MRv1, resource management and job scheduling/monitoring, are split into separate daemons:
 ResourceManager
 NodeManager
 ApplicationMaster

Features:
• Better resource management
• Scalability
• Dynamic allocation of cluster resources
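For illustration, the ResourceManager can be queried through the YARN client API. The sketch below simply lists the applications the ResourceManager knows about; it assumes a running cluster whose addresses are provided by yarn-site.xml on the classpath.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        // Reads the ResourceManager address etc. from yarn-site.xml on the classpath.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for every application it knows about.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.printf("%s  %s  %s%n",
                    app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}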
MapReduce

• Parallel job-processing framework
• Written in Java
• Close integration with HDFS
• Provides:
– Automatic partitioning of a job into sub-tasks
– Automatic retry on failures
– Locality of task execution
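The canonical example of these ideas is word count: the map phase emits a (word, 1) pair for every word, and the reduce phase sums the counts per word. The sketch below uses the standard org.apache.hadoop.mapreduce API; input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, the job would typically be submitted with: hadoop jar wordcount.jar WordCount <input dir> <output dir>.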
Hive

• Apache Hive in a few words: "a data warehouse infrastructure built on top of Apache Hadoop."
• Used for:
– Ad-hoc querying and analysis of large data sets without having to learn MapReduce
• Main features:
– An SQL-like query language called HQL
– Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data types
– Support for different storage types such as plain text, HBase, and others
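As a sketch of how HQL is issued from an application, the snippet below talks to HiveServer2 over JDBC. The host, port, database, user, and the pageviews table are all assumptions for illustration; in practice the same query could be typed directly into the Hive shell.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; assumes hive-jdbc and its dependencies are on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed connection details: HiveServer2 on localhost:10000, database "default".
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "training", "");
             Statement stmt = con.createStatement()) {

            // HQL looks like SQL; "pageviews" is a hypothetical table.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM pageviews GROUP BY page LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}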
PIG

Data access: Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the usual data manipulation operations in Hadoop using Apache Pig.

To write data analysis programs, Pig provides a high-level language known as Pig Latin. The language provides various operators with which programmers can develop their own functions for reading, writing, and processing data.

To analyze data with Apache Pig, programmers write scripts in the Pig Latin language. All of these scripts are internally converted into Map and Reduce tasks. Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.

Salient features of Pig:
• Ease of programming
• Optimization opportunities
• Extensibility

Note: Pig scripts are internally converted into MapReduce programs.
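Pig Latin can also be embedded in a Java program through the PigServer class. The sketch below runs in local mode against a hypothetical tab-separated file access_log.tsv; switching the constructor argument to "mapreduce" would submit the same data flow to the cluster as MapReduce jobs.

import java.util.Iterator;

import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // "local" runs against the local file system; "mapreduce" submits to the cluster.
        PigServer pig = new PigServer("local");

        // Hypothetical input: lines of "user<TAB>bytes".
        pig.registerQuery("logs = LOAD 'access_log.tsv' AS (user:chararray, bytes:long);");
        pig.registerQuery("grouped = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");

        // Pull the result of the data flow back into the Java program.
        Iterator<Tuple> it = pig.openIterator("totals");
        while (it.hasNext()) {
            System.out.println(it.next());
        }

        pig.shutdown();
    }
}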
"ZooKeeper allows distributed processes to coordinate with each other
through a shared hierarchical name space of data registers"
• Configuration management - machines
• config from a centralized source,
• facilitates simpler deployment/provisioning
• Leader election - a common problem in distributed coordination
• Centralized and highly reliable (simple) data registry
ZOOKEEPER
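A minimal sketch of that shared name space using the ZooKeeper Java client is shown below: one process publishes a configuration value into a znode and any other process can read it. The ensemble address localhost:2181 and the znode name are assumptions.

import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper ensemble at localhost:2181; 3-second session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        String path = "/demo-batch-size";   // hypothetical top-level config register (znode)

        // Publish a value into the shared hierarchical name space.
        if (zk.exists(path, false) == null) {
            zk.create(path, "500".getBytes(StandardCharsets.UTF_8),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any process in the cluster can now read the same register.
        byte[] data = zk.getData(path, false, null);
        System.out.println(path + " = " + new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}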
FLUME

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Features:
• Robust
• Fault tolerant
• Simple and flexible architecture based on streaming data flows
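Applications can also hand events to a Flume agent directly through the Flume client SDK. The sketch below assumes an agent with an Avro source listening on localhost:41414; both the host and port are illustrative.

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws Exception {
        // Assumed: a Flume agent with an Avro source bound to localhost:41414.
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // Wrap a log line in a Flume event and send it to the agent.
            Event event = EventBuilder.withBody("application started", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}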
SQOOP

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. Key features of Sqoop include:
 Bulk import: Sqoop can import individual tables or entire databases into HDFS. The data is stored in native directories and files in the HDFS file system.
 Data export: Sqoop can export data directly from HDFS into a relational database, using a target table definition based on the specifics of the target database.
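Sqoop itself is driven from the command line. To stay in the same language as the other sketches, the example below simply assembles a typical import command and launches it from Java; the connection string, credentials file, table, and target directory are all illustrative, and the sqoop binary is assumed to be on the PATH.

import java.util.Arrays;
import java.util.List;

public class SqoopImportLauncher {
    public static void main(String[] args) throws Exception {
        // Assumed: the "sqoop" CLI is installed and a MySQL database "shop"
        // with a "customers" table is reachable; all names here are illustrative.
        List<String> command = Arrays.asList(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost/shop",
                "--username", "retail",
                "--password-file", "/user/training/.db-password",
                "--table", "customers",
                "--target-dir", "/user/training/customers",
                "--num-mappers", "1");

        Process process = new ProcessBuilder(command)
                .inheritIO()          // stream Sqoop's output to this console
                .start();
        System.exit(process.waitFor());
    }
}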
Any Questions?
Thank You!