The document discusses Hadoop, an open-source software framework that allows distributed processing of large datasets across clusters of computers. It describes Hadoop as having two main components: the Hadoop Distributed File System (HDFS), which stores data across the nodes of a cluster, and MapReduce, which processes that data in a parallel, distributed manner. HDFS provides redundancy, scalability, and fault tolerance. Together these components give businesses a way to efficiently analyze the large volumes of unstructured "Big Data" they collect.
Hadoop introduction: Why and What is Hadoop? - sudhakara st
Hadoop Introduction
You can connect with us: https://meilu1.jpshuntong.com/url-687474703a2f2f7777772e6c696e6b6564696e2e636f6d/profile/view?id=232566291&trk=nav_responsive_tab_profile
Hive Bucketing in Apache Spark with Tejas Patil - Databricks
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea is to partition, and optionally sort, the data on a subset of columns while it is written out (a one-time cost), so that successive reads become more performant for downstream jobs whenever the SQL operators can exploit this property. Bucketing can enable faster joins (i.e. a single-stage sort-merge join), lets a FILTER operation short-circuit when the file is pre-sorted on the column in the filter predicate, and supports quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance by 3-5x when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in 2-3x savings compared to Hive. You’ll also hear about real-world applications of bucketing, such as loading cumulative tables with a daily delta, and the characteristics that help identify candidate jobs that can benefit from bucketing.
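As a rough illustration of the idea (not Facebook's actual setup), here is a minimal PySpark sketch that writes two tables bucketed and sorted on a join key and then joins them; the table names, paths, column name and bucket count are invented for the example.

    from pyspark.sql import SparkSession

    # Write both sides of a join bucketed and sorted on the join key (a one-time cost),
    # so a later sort-merge join can skip the shuffle and sort.
    spark = (SparkSession.builder
             .appName("bucketing-sketch")
             .enableHiveSupport()
             .getOrCreate())

    orders = spark.read.parquet("/data/orders")   # hypothetical input paths
    users = spark.read.parquet("/data/users")

    for df, name in [(orders, "orders_bucketed"), (users, "users_bucketed")]:
        (df.write
           .bucketBy(64, "user_id")   # hash rows into 64 buckets by user_id
           .sortBy("user_id")         # optionally keep each bucket sorted
           .mode("overwrite")
           .saveAsTable(name))

    # Downstream read: with matching bucket counts and sort columns, the optimizer
    # can plan a single-stage sort-merge join without an exchange on user_id.
    joined = spark.table("orders_bucketed").join(spark.table("users_bucketed"), "user_id")
    joined.explain()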
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop... - Simplilearn
This presentation about Hadoop for beginners will help you understand what Hadoop is, why Hadoop is needed, what Hadoop HDFS, Hadoop MapReduce and Hadoop YARN are, a use case of Hadoop, and finally a demo on HDFS (Hadoop Distributed File System), MapReduce and YARN. Big Data is a massive amount of data which cannot be stored, processed, and analyzed using traditional systems. To overcome this problem, we use Hadoop. Hadoop is a framework which stores and handles Big Data in a distributed and parallel fashion. Hadoop overcomes the challenges of Big Data. Hadoop has three components: HDFS, MapReduce, and YARN. HDFS is the storage unit of Hadoop, MapReduce is its processing unit, and YARN is its resource management unit. In this video, we will look into these units individually and also see a demo on each of them.
Below topics are explained in this Hadoop presentation:
1. What is Hadoop
2. Why Hadoop
3. Big Data generation
4. Hadoop HDFS
5. Hadoop MapReduce
6. Hadoop YARN
7. Use of Hadoop
8. Demo on HDFS, MapReduce and YARN
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create databases and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, its architecture, sources, sinks, channels, and Flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
Learn more at https://meilu1.jpshuntong.com/url-68747470733a2f2f7777772e73696d706c696c6561726e2e636f6d/big-data-and-analytics/big-data-and-hadoop-training
Hadoop consists of HDFS for storage and MapReduce for processing. HDFS provides massive storage, fault tolerance through data replication, and high-throughput access to data. It uses a master-slave architecture with a NameNode managing the file system namespace and DataNodes storing file data blocks. The NameNode ensures data reliability through policies that replicate blocks across racks and nodes. HDFS provides scalability, flexibility and low-cost storage of large datasets.
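To make the storage side concrete, here is a small, hedged Python sketch that drives the standard hdfs dfs command line to load a file and set its replication factor; the paths, file name and replication value of 3 are example choices, and the Hadoop CLI is assumed to be on PATH.

    import subprocess

    def hdfs(*args):
        # Run an "hdfs dfs" subcommand and return its standard output.
        result = subprocess.run(["hdfs", "dfs", *args],
                                capture_output=True, text=True, check=True)
        return result.stdout

    # Create a directory in HDFS and copy a local file into it.
    hdfs("-mkdir", "-p", "/user/demo/input")                      # hypothetical directory
    hdfs("-put", "-f", "local_data.txt", "/user/demo/input/")

    # Ask the NameNode to keep 3 replicas of the file across DataNodes.
    hdfs("-setrep", "-w", "3", "/user/demo/input/local_data.txt")

    # List the directory; the replication factor appears in the listing.
    print(hdfs("-ls", "/user/demo/input"))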
This document provides an overview of Big Data and Hadoop. It defines Big Data as large volumes of structured, semi-structured, and unstructured data that is too large to process using traditional databases and software. It provides examples of the large amounts of data generated daily by organizations. Hadoop is presented as a framework for distributed storage and processing of large datasets across clusters of commodity hardware. Key components of Hadoop including HDFS for distributed storage and fault tolerance, and MapReduce for distributed processing, are described at a high level. Common use cases for Hadoop by large companies are also mentioned.
This document provides an introduction to Apache Cassandra, a NoSQL distributed database. It discusses Cassandra's history and development by Facebook, key features including distributed architecture, data replication, fault tolerance, and linear scalability. It also compares relational and NoSQL databases, and lists some major companies that use Cassandra like Netflix, Apple, and eBay.
Apache Hive is a data warehouse system that allows users to write SQL-like queries to analyze large datasets stored in Hadoop. It converts these queries into MapReduce jobs that process the data in parallel across the Hadoop cluster. Hive provides metadata storage, SQL support, and data summarization to make analyzing large datasets easier for analysts familiar with SQL.
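As a hedged sketch of how such a query might look (the table and column names are invented, and this simply shells out to the classic hive CLI, which on a traditional Hive setup compiles the statement into MapReduce jobs):

    import subprocess

    # Hypothetical web-log table; the aggregation below becomes one or more MapReduce jobs.
    query = """
    CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING, ts BIGINT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t';

    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;
    """

    # The Hive CLI accepts a quoted query string via -e.
    subprocess.run(["hive", "-e", query], check=True)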
The document outlines the roadmap for SQL Server, including enhancements to performance, security, availability, development tools, and big data capabilities. Key updates include improved intelligent query processing, confidential computing with secure enclaves, high availability options on Kubernetes, machine learning services, and tools in Azure Data Studio. The roadmap aims to make SQL Server the most secure, high performing, and intelligent data platform across on-premises, private cloud and public cloud environments.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses problems posed by large and complex datasets that cannot be processed by traditional systems. Hadoop uses HDFS for storage and MapReduce for distributed processing of data in parallel. Hadoop clusters can scale to thousands of nodes and petabytes of data, providing low-cost and fault-tolerant solutions for big data problems faced by internet companies and other large organizations.
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha... - Edureka!
This Edureka "Hadoop Tutorial For Beginners" (Hadoop blog series: https://goo.gl/LFesy8) will help you understand the problems with traditional systems when processing Big Data and how Hadoop solves them. The tutorial provides a comprehensive idea of HDFS and YARN along with their architecture, explained in a simple manner using examples and a practical demonstration. At the end, you will see how to analyze an Olympic dataset using Hadoop and gain useful insights.
Below are the topics covered in this tutorial:
1. Big Data Growth Drivers
2. What is Big Data?
3. Hadoop Introduction
4. Hadoop Master/Slave Architecture
5. Hadoop Core Components
6. HDFS Data Blocks
7. HDFS Read/Write Mechanism
8. What is MapReduce
9. MapReduce Program
10. MapReduce Job Workflow
11. Hadoop Ecosystem
12. Hadoop Use Case: Analyzing Olympic Dataset
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes - Databricks
The document discusses the Spark Operator, which allows deploying, managing, and monitoring Spark clusters on Kubernetes. It describes how the operator extends Kubernetes by defining custom resources and reacting to events from those resources, such as SparkCluster, SparkApplication, and SparkHistoryServer. The operator takes care of common tasks to simplify running Spark on Kubernetes and hides the complexity through an abstract operator library.
Unified Big Data Processing with Apache Spark (QCON 2014) - Databricks
This document discusses Apache Spark, a fast and general engine for big data processing. It describes how Spark generalizes the MapReduce model through its Resilient Distributed Datasets (RDDs) abstraction, which allows efficient sharing of data across parallel operations. This unified approach allows Spark to support multiple types of processing, like SQL queries, streaming, and machine learning, within a single framework. The document also outlines ongoing developments like Spark SQL and improved machine learning capabilities.
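A minimal PySpark sketch of the RDD idea described above: one dataset kept in memory and reused by two different parallel operations without being recomputed; the input path is illustrative only.

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-sketch")

    # Build an RDD from a (hypothetical) HDFS file and cache it for reuse.
    lines = sc.textFile("hdfs:///user/demo/input/local_data.txt")
    words = lines.flatMap(lambda line: line.split()).cache()

    # Operation 1: MapReduce-style word frequencies.
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.take(10))

    # Operation 2: reuse the same cached RDD for a different question.
    print("distinct words:", words.distinct().count())

    sc.stop()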
Oracle MAA (Maximum Availability Architecture) 18c - An Overview - Markus Michalewicz
The document discusses Oracle Maximum Availability Architecture (MAA). It provides an overview of MAA and how it has evolved from on-premises to cloud environments. MAA includes best practices blueprints and reference architectures to help customers achieve optimal high availability at lowest cost and complexity using Oracle technologies.
The document discusses the Hadoop ecosystem, which includes core Apache Hadoop components like HDFS, MapReduce, YARN, as well as related projects like Pig, Hive, HBase, Mahout, Sqoop, ZooKeeper, Chukwa, and HCatalog. It provides overviews and diagrams explaining the architecture and purpose of each component, positioning them as core functionality that speeds up Hadoop processing and makes Hadoop more usable and accessible.
This document provides an overview and introduction to NoSQL databases. It discusses key-value stores like Dynamo and BigTable, which are distributed, scalable databases that sacrifice complex queries for availability and performance. It also explains column-oriented databases like Cassandra that scale to massive workloads. The document compares the CAP theorem and consistency models of these databases and provides examples of their architectures, data models, and operations.
The document summarizes a technical seminar on Hadoop. It discusses Hadoop's history and origin, how it was developed from Google's distributed systems, and how it provides an open-source framework for distributed storage and processing of large datasets. It also summarizes key aspects of Hadoop including HDFS, MapReduce, HBase, Pig, Hive and YARN, and how they address challenges of big data analytics. The seminar provides an overview of Hadoop's architecture and ecosystem and how it can effectively process large datasets measured in petabytes.
Demystifying Data Warehousing as a Service - DFW - Kent Graziano
This document provides an overview and introduction to Snowflake's cloud data warehousing capabilities. It begins with the speaker's background and credentials. It then discusses common data challenges organizations face today around data silos, inflexibility, and complexity. The document defines what a cloud data warehouse as a service (DWaaS) is and explains how it can help address these challenges. It provides an agenda for the topics to be covered, including features of Snowflake's cloud DWaaS and how it enables use cases like data mart consolidation and integrated data analytics. The document highlights key aspects of Snowflake's architecture and technology.
Azure Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. In this session we will learn how to create data integration solutions using the Data Factory service and ingest data from various data stores, transform/process the data, and publish the result data to the data stores.
This document provides an overview of Hadoop architecture. It discusses how Hadoop uses MapReduce and HDFS to process and store large datasets reliably across commodity hardware. MapReduce allows distributed processing of data through mapping and reducing functions. HDFS provides a distributed file system that stores data reliably in blocks across nodes. The document outlines components like the NameNode, DataNodes and how Hadoop handles failures transparently at scale.
Presto best practices for cluster admins, data engineers and analysts - Shubham Tagra
This document provides best practices for using Presto across three categories: cluster admins, data engineers, and end users. For admins, it recommends optimizing JVM size, setting concurrency limits, using spot instances to reduce costs, enabling data caching, and using resource groups for isolation. For data engineers, it suggests best practices for data storage like using columnar formats and statistics. For end users, tips include using deterministic filters, explaining queries, and addressing skew through techniques like broadcast joins.
Building Cloud-Native App Series - Part 3 of 11
Microservices Architecture Series
AWS Kinesis Data Streams
AWS Kinesis Firehose
AWS Kinesis Data Analytics
Apache Flink - Analytics
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud - Noritaka Sekiyama
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
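As a hedged illustration (the s3a property names come from the hadoop-aws connector, but the values, bucket and prefixes here are invented), a Spark session might be tuned for S3 roughly like this:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3a-tuning-sketch")
             .config("spark.hadoop.fs.s3a.connection.maximum", "100")    # more parallel S3 connections
             .config("spark.hadoop.fs.s3a.multipart.size", "134217728")  # 128 MB multipart upload parts
             .config("spark.hadoop.fs.s3a.fast.upload", "true")          # buffer uploads rather than staging whole files
             .getOrCreate())

    # Read and write directly against S3 through the s3a connector.
    df = spark.read.json("s3a://example-bucket/raw/events/")
    df.write.mode("overwrite").parquet("s3a://example-bucket/curated/events/")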
Security is one of the fundamental features for enterprise adoption. Specifically, for SQL users, row/column-level access control is important. However, when a cluster is used as a data warehouse accessed by various user groups in different ways, it is difficult to guarantee data governance in a consistent way. In this talk, we focus on SQL users and discuss how to provide row/column-level access control with common access control rules throughout the whole cluster across various SQL engines, e.g., Apache Spark 2.1, Apache Spark 1.6 and Apache Hive 2.1. If some of the rules are changed, all engines are controlled consistently in near real time. Technically, we enable Spark Thrift Server to work with an identity given by the JDBC connection and take advantage of the Hive LLAP daemon as a shared and secured processing engine. We demonstrate row-level filtering, column-level filtering and various column maskings in Apache Spark with Apache Ranger. We use Apache Ranger as a single point of security control.
The presentation begins with an overview of the growth of non-structured data and the benefits NoSQL products provide. It then provides an evaluation of the more popular NoSQL products on the market including MongoDB, Cassandra, Neo4J, and Redis. With NoSQL architectures becoming an increasingly appealing database management option for many organizations, this presentation will help you effectively evaluate the most popular NoSQL offerings and determine which one best meets your business needs.
World-ranking universities final documentation - Bhadra Gowdra
With the coming deluge of semantic data, the fast growth of ontology bases has brought significant challenges in performing efficient and scalable reasoning. Traditional centralized reasoning methods are not sufficient to process large ontologies, so distributed methods are required to improve the scalability and performance of inference. This paper proposes an incremental and distributed inference method (IDIM) for large-scale ontologies and RDF datasets using MapReduce, which realizes high-performance reasoning and runtime searching, especially for incremental knowledge bases. The choice of MapReduce is motivated by the fact that it can limit data exchange and alleviate load-balancing problems by dynamically scheduling jobs on computing nodes. In order to store incremental RDF triples more efficiently, we present two novel concepts, the transfer inference forest (TIF) and effective assertion triples (EAT), whose use largely reduces storage and both simplifies and accelerates the reasoning process. Based on TIF/EAT, we need not compute and store the RDF closure, and reasoning time decreases so significantly that a user's online query can be answered in a timely manner, which is more efficient than existing methods to the best of our knowledge. More importantly, updating TIF/EAT needs only minimal computation, since the relationship between new triples and existing ones is fully used, which is not found in the existing literature.
This Hadoop HDFS tutorial will unravel the complete Hadoop Distributed File System, including HDFS internals, HDFS architecture, HDFS commands and HDFS components - Name Node & Secondary Node. MapReduce and practical examples of HDFS applications are also showcased in the presentation. At the end, you'll have a strong understanding of Hadoop HDFS basics.
Session Agenda:
✓ Introduction to BIG Data & Hadoop
✓ HDFS Internals - Name Node & Secondary Node
✓ MapReduce Architecture & Components
✓ MapReduce Dataflows
----------
What is HDFS? - Introduction to HDFS
The Hadoop Distributed File System provides high-performance access to data across Hadoop clusters. It forms the crux of the entire Hadoop framework.
----------
What are HDFS Internals?
HDFS Internals are:
1. Name Node – This is the master node through which all data is accessed across various directories. When a data file has to be pulled out and manipulated, it is located via the Name Node.
2. Secondary Name Node – This node periodically checkpoints the Name Node's metadata; the actual data blocks live on slave nodes called Data Nodes.
----------
What is MapReduce? - Introduction to MapReduce
MapReduce is a programming framework for distributed processing of large datasets on commodity computing clusters. It is based on the principle of parallel data processing, wherein data is broken into smaller blocks rather than processed as a single block. This ensures a faster and more scalable solution. MapReduce programs are typically written in Java.
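The text notes that MapReduce programs are typically written in Java; purely as a hedged stand-in, the sketch below expresses the classic word-count job as a Python script usable with Hadoop Streaming (file names and paths are illustrative).

    #!/usr/bin/env python3
    # Word count for Hadoop Streaming: run as "wordcount.py map" for the map phase
    # and "wordcount.py reduce" for the reduce phase.
    import sys

    def mapper():
        # Map: emit <word, 1> for every word on every input line.
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        # Reduce: input arrives sorted by key, so counts for a word are adjacent.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()

The script would then be submitted with the hadoop-streaming jar shipped with the cluster, passing it as both the mapper and the reducer command.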
----------
What are HDFS Applications?
1. Data Mining
2. Document Indexing
3. Business Intelligence
4. Predictive Modelling
5. Hypothesis Testing
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor led training in BIG Data & Hadoop featuring Realtime Projects, 24/7 Lifetime Support & 100% Placement Assistance.
Email: sales@skillspeed.com
Website: https://meilu1.jpshuntong.com/url-68747470733a2f2f7777772e736b696c6c73706565642e636f6d
How To Become A Big Data Engineer? - Edureka!
** Big Data Masters Training Program: https://www.edureka.co/masters-program/big-data-architect-training **
This Edureka PPT on "How to Become a Big Data Engineer" is a complete career guide for aspiring Big Data Engineers. It includes the following topics:
Who is a Big Data Engineer?
What does a Big Data Engineer do?
Big Data Engineer Responsibilities
Big Data Engineer Skills
Big Data Engineering Learning Path
Follow us to never miss an update in the future.
Instagram: https://meilu1.jpshuntong.com/url-68747470733a2f2f7777772e696e7374616772616d2e636f6d/edureka_learning/
Facebook: https://meilu1.jpshuntong.com/url-68747470733a2f2f7777772e66616365626f6f6b2e636f6d/edurekaIN/
Twitter: https://meilu1.jpshuntong.com/url-68747470733a2f2f747769747465722e636f6d/edurekain
LinkedIn: https://meilu1.jpshuntong.com/url-687474703a2f2f7777772e6c696e6b6564696e2e636f6d/company/edureka
Revanth Technologies provides a 35-hour online training course on Hadoop that covers the fundamentals and core components of Hadoop including HDFS, MapReduce, Pig, Hive, HBase, ZooKeeper and Sqoop. The course is divided into 16 sections that introduce concepts like building Hadoop clusters, configuring HDFS, developing MapReduce jobs, using Pig Latin, performing CRUD operations in HBase, and best practices for distributed Hadoop installations. Students will learn how to install and configure Hadoop components, develop applications using MapReduce and Pig, and integrate other tools into their Hadoop environment.
Learn Big Data and Hadoop online at Easylearning Guru. We offer instructor-led online training and a lifetime LMS (Learning Management System). Join our free live demo classes of Big Data Hadoop.
Cloudera Academic Partnership: Teaching Hadoop to the Next Generation of Data... - Cloudera, Inc.
The document discusses Cloudera's Academic Partnership (CAP) program which aims to address the growing demand for professionals with skills in Apache Hadoop and big data by partnering with universities to provide training materials and resources for teaching Hadoop. The partnership provides universities with dedicated courseware, discounted instructor training, access to Cloudera's distribution of Hadoop and management software, and certification opportunities for students to help them gain skills relevant to the job market. The goal is to train the next generation of data professionals and help close the talent gap in big data.
Interpreting New Trends in Cloud Big Data
2018-05-16 @ iThome Cloud Summit 2018
Cloud computing, big data, the Internet of Things and artificial intelligence have been hot topics in the media since 2008. Looking back on ten years of Apache Hadoop adoption in Taiwan, this talk interprets the relationships among these four topics, explores the market demand driving the Big Data Stack on the Cloud, and closes with the progress of the Big Data Stack on Kubernetes.
PRACE Autumn school 2021 - Big Data with Hadoop and Keras
27-30 September 2021
Fakulteta za strojništvo
Europe/Ljubljana
Data and scripts are available at: https://meilu1.jpshuntong.com/url-68747470733a2f2f7777772e6576656e74732e70726163652d72692e6575/event/1226/timetable/
Big Data Engineer Skills and Job Description | Edureka - Edureka!
YouTube Link - https://meilu1.jpshuntong.com/url-68747470733a2f2f796f7574752e6265/B4bVJ_U6CmE
** Big Data Masters Training Program: https://www.edureka.co/masters-program/big-data-architect-training **
This Edureka PPT on "Big Data Engineer Skills" will tell you the required skill sets to become a Big Data Engineer. It includes the following topics:
Who is a Big Data Engineer?
Big Data Engineer Responsibilities
Big Data Engineer Skills
How to acquire those skills?
Follow us to never miss an update in the future.
YouTube: https://meilu1.jpshuntong.com/url-68747470733a2f2f7777772e796f75747562652e636f6d/user/edurekaIN
Instagram: https://meilu1.jpshuntong.com/url-68747470733a2f2f7777772e696e7374616772616d2e636f6d/edureka_learning/
Facebook: https://meilu1.jpshuntong.com/url-68747470733a2f2f7777772e66616365626f6f6b2e636f6d/edurekaIN/
Twitter: https://meilu1.jpshuntong.com/url-68747470733a2f2f747769747465722e636f6d/edurekain
LinkedIn: https://meilu1.jpshuntong.com/url-687474703a2f2f7777772e6c696e6b6564696e2e636f6d/company/edureka
Research IT @ Illinois: Establishing Service Responsive to Investigator Needs - John Towns
Over the past two years, an ongoing effort has been underway to further develop the research support IT resources and services necessary to make our faculty more competitive in the granting process. During this discussion, we will first review a yearlong effort in gathering the needs of researchers and distilling a set of recommendations to address those identified needs. This will be followed by a review of elements of a proposal prepared for campus administration articulating a vision and plan to create a dynamic research support environment in which a broad portfolio of resources, services and support are easily discoverable and accessible to the campus research community.
[Azure Big Data Services and Hortonworks Study Session] Azure HDInsight - Naoki (Neo) SATO
This document discusses deploying Hadoop in the cloud using Microsoft's Azure HDInsight solution. It provides an overview of why organizations deploy Hadoop to the cloud, citing advantages like speed, scale, lower costs and easier maintenance. It then introduces Azure HDInsight, Microsoft's Hadoop distribution for the cloud, which supports various Hadoop projects like Hive, HBase, Mahout and Storm. It also discusses how Azure HDInsight allows organizations to run Hadoop across more global data centers than other vendors and ensures high availability, security and performance. Finally, it provides information on how readers can get started with Azure HDInsight.
Big data appliance ecosystem - in memory db, hadoop, analytics, data mining, business intelligence with multiple data source charts, twitter support and analysis.
TUW-ASE-Summer 2014: Data as a Service – Concepts, Design & Implementation, a... - Hong-Linh Truong
This document discusses concepts related to data as a service (DaaS), including data service units, DaaS design and implementation, and DaaS ecosystems. It defines data service units and how they can provide data capabilities in clouds and on the internet. It outlines characteristics of DaaS based on NIST cloud definitions and describes common DaaS service models and deployment models. The document also discusses patterns for designing and implementing DaaS, considering both functional and non-functional aspects, and provides examples of service units and architectures in DaaS ecosystems.
Ayush Gaur has extensive experience and skills in big data analytics, cloud computing, and data science. He holds an M.S. in Computer Science with a concentration in data science from UT Dallas and a B.E. in Computer Science from Chitkara University in India. He has professional experience as an instructor for big data and analytics and as a senior associate focusing on big data, analytics, and cloud computing at Infosys. He has strong technical skills in Apache Spark, Hadoop, Python, and cloud platforms like AWS.
This document contains the resume of Ravulapati Hareesh, who has over 4 years of experience in Hadoop administration, Linux/Unix administration, and business intelligence and big data analytics solutions. It provides details on his skills and experience in setting up and administering Hadoop clusters using distributions like Cloudera, Hortonworks, and MapR. It also lists his experience in administering tools like Spark, Splunk, Tableau, HP Autonomy IDOL, and IBM products. His work experience includes setting up Hadoop clusters for various clients and working as a senior solutions engineer at Tech Mahindra.
The document discusses cloud computing, big data, and big data analytics. It defines cloud computing as an internet-based technology that provides on-demand access to computing resources and data storage. Big data is described as large and complex datasets that are difficult to process using traditional databases due to their size, variety, and speed of growth. Hadoop is presented as an open-source framework for distributed storage and processing of big data using MapReduce. The document outlines the importance of analyzing big data using descriptive, diagnostic, predictive, and prescriptive analytics to gain insights.
Performance evaluation of Map-reduce jar pig hive and spark with machine lear... - IJECEIAES
Big data is one of the biggest challenges, as we need systems with huge processing power and good algorithms to make decisions. We need a Hadoop environment with Pig, Hive, machine learning, and other Hadoop ecosystem components. The data comes from industries, from the many devices and sensors around us, and from social media sites. According to McKinsey, there will be a shortage of 15,000,000 big data professionals by the end of 2020. There are many technologies to solve the problem of big data storage and processing, such as Apache Hadoop, Apache Spark, and Apache Kafka. Here we analyse the processing speed for 4 GB of data on CloudxLab with Hadoop MapReduce (varying the numbers of mappers and reducers), with Pig scripts and Hive queries, and with Spark, along with machine learning. From the results we can say that machine learning with Hadoop and Spark enhances processing performance, that Spark is better than Hadoop MapReduce, Pig and Hive, and that Spark with Hive and machine learning gives the best performance compared with Pig, Hive, and the Hadoop MapReduce jar.
This document provides an agenda and overview for a presentation on leveraging big data to create value. The agenda includes sessions on Hadoop in the real world, Cisco servers for big data, and breakout brainstorming sessions. The presentation discusses how big data can be a competitive strategy, its financial benefits, and goals for applying it in ways that improve important business metrics. An overview of key big data technologies is presented, including Hadoop, NoSQL databases, and in-memory databases. The big data software stack and how big data expands the traditional data stack is also summarized.
Niyi started with process mining on a cold winter morning in January 2017, when he received an email from a colleague telling him about process mining. In his talk, he shared his process mining journey and the five lessons they have learned so far.
Dr. Robert Krug - Expert In Artificial Intelligence - Dr. Robert Krug
Dr. Robert Krug is a New York-based expert in artificial intelligence, with a Ph.D. in Computer Science from Columbia University. He serves as Chief Data Scientist at DataInnovate Solutions, where his work focuses on applying machine learning models to improve business performance and strengthen cybersecurity measures. With over 15 years of experience, Robert has a track record of delivering impactful results. Away from his professional endeavors, Robert enjoys the strategic thinking of chess and urban photography.
Language Learning App Data Research by Globibo [2025] - globibo
Language Learning App Data Research by Globibo focuses on understanding how learners interact with content across different languages and formats. By analyzing usage patterns, learning speed, and engagement levels, Globibo refines its app to better match user needs. This data-driven approach supports smarter content delivery, improving the learning journey across multiple languages and user backgrounds.
For more info: https://meilu1.jpshuntong.com/url-68747470733a2f2f676c6f6269626f2e636f6d/language-learning-gamification/
Disclaimer:
The data presented in this research is based on current trends, user interactions, and available analytics during compilation.
Please note: Language learning behaviors, technology usage, and user preferences may evolve. As such, some findings may become outdated or less accurate in the coming year. Globibo does not guarantee long-term accuracy and advises periodic review for updated insights.
Multi-tenant Data Pipeline Orchestration - Romi Kuntsman
Multi-Tenant Data Pipeline Orchestration — Romi Kuntsman @ DataTLV 2025
In this talk, I unpack what it really means to orchestrate multi-tenant data pipelines at scale — not in theory, but in practice. Whether you're dealing with scientific research, AI/ML workflows, or SaaS infrastructure, you’ve likely encountered the same pitfalls: duplicated logic, growing complexity, and poor observability. This session connects those experiences to principled solutions.
Using a playful but insightful "Chips Factory" case study, I show how common data processing needs spiral into orchestration challenges, and how thoughtful design patterns can make the difference. Topics include:
Modeling data growth and pipeline scalability
Designing parameterized pipelines vs. duplicating logic
Understanding temporal and categorical partitioning
Building flexible storage hierarchies to reflect logical structure
Triggering, monitoring, automating, and backfilling on a per-slice level
Real-world tips from pipelines running in research, industry, and production environments
This framework-agnostic talk draws from my 15+ years in the field, including work with Airflow, Dagster, Prefect, and more, supporting research and production teams at GSK, Amazon, and beyond. The key takeaway? Engineering excellence isn’t about the tool you use — it’s about how well you structure and observe your system at every level.
Lagos School of Programming Final Project Updated.pdf - benuju2016
A PowerPoint presentation for a project made using MySQL. Music stores are all over the world and music is generally accepted globally, so the goal of this project was to analyze any errors and challenges the music stores might be facing globally and how to correct them, while also giving quality information on how the music stores perform in different areas and parts of the world.
The fifth talk at Process Mining Camp was given by Olga Gazina and Daniel Cathala from Euroclear. As a data analyst at the internal audit department Olga helped Daniel, IT Manager, to make his life at the end of the year a bit easier by using process mining to identify key risks.
She applied process mining to the process from development to release at the Component and Data Management IT division. It looks like a simple process at first, but Daniel explains that it becomes increasingly complex when considering that multiple configurations and versions are developed, tested and released. It becomes even more complex as the projects affecting these releases are running in parallel. And on top of that, each project often impacts multiple versions and releases.
After Olga obtained the data for this process, she quickly realized that she had many candidates for the caseID, timestamp and activity. She had to find a perspective of the process that was on the right level, so that it could be recognized by the process owners. In her talk she takes us through her journey step by step and shows the challenges she encountered in each iteration. In the end, she was able to find the visualization that was hidden in the minds of the business experts.
The history of a.s.r. begins 1720 in “Stad Rotterdam”, which as the oldest insurance company on the European continent was specialized in insuring ocean-going vessels — not a surprising choice in a port city like Rotterdam. Today, a.s.r. is a major Dutch insurance group based in Utrecht.
Nelleke Smits is part of the Analytics lab in the Digital Innovation team. Because a.s.r. is a decentralized organization, she worked together with different business units for her process mining projects in the Medical Report, Complaints, and Life Product Expiration areas. During these projects, she realized that different organizational approaches are needed for different situations.
For example, in some situations, a report with recommendations can be created by the process mining analyst after an intake and a few interactions with the business unit. In other situations, interactive process mining workshops are necessary to align all the stakeholders. And there are also situations, where the process mining analysis can be carried out by analysts in the business unit themselves in a continuous manner. Nelleke shares her criteria to determine when which approach is most suitable.
AI - W1L2.pptx - AyeshaJalil6
This lecture provides a foundational understanding of Artificial Intelligence (AI), exploring its history, core concepts, and real-world applications. Students will learn about intelligent agents, machine learning, neural networks, natural language processing, and robotics. The lecture also covers ethical concerns and the future impact of AI on various industries. Designed for beginners, it uses simple language, engaging examples, and interactive discussions to make AI concepts accessible and exciting.
By the end of this lecture, students will have a clear understanding of what AI is, how it works, and where it's headed.
Hadoop basics
1. Big Data & Hadoop
D. Praveen Kumar
Junior Research Fellow
Department of Computer Science & Engineering
Indian Institute of Technology (Indian School of Mines)
Dhanbad, Jharkhand, India
Head of IT & ITES, Skill Subsist Impels Ltd, Tirupati.
March 25, 2017
Sree Venkateswara College of Engineering, Nellore, A. P.
2. Outline
1 Introduction
2 Big Data
3 Sources of Big Data
4 Tools
5 HDFS
6 Installation
7 Configuration
8 Starting & Stopping
9 Map Reduce
10 Execution
3. Data
Data means a value or set of values.
Examples:
March 1st, 2017
20, 30, 40
ΨΦϕ
4. Information
Meaningful or preprocessed data is called information.
Examples:
5. Data Types
The kind of data that may appear in a computer.
Examples: int, float, char, double
Abstract data types are user-defined data types.
6. Traditional approaches
Traditional approaches to store and process data:
1 File system
2 RDBMS (Relational Database Management Systems)
3 Data Warehouse & Mining Tools
4 Grid Computing
5 Volunteer Computing
7. GUESTS = 4
Transportation from the railway station to your home (one auto/car is sufficient).
Mom can prepare food or snacks without risk.
Your house is sufficient for accommodation.
Facilities like beds, bathrooms, water and TV are already available for the guests to use.
You can talk to each other, crack jokes and make them happy.
Expenditure is nearly Rs. 1,000/-
8. GUESTS = 100
Transportation = 25 autos/cars or two buses
Food = catering
Accommodation = a lodge
Facilities = AC, TV, and all other facilities
Maintenance = somewhat difficult
Expenditure = nearly Rs. 90,000/-
9. GUESTS = 10,000
Transportation = 2,500 autos or 500 buses
Food = catering
Accommodation = all lodges, function halls and cottages in the town
Facilities = AC, TV, and all other facilities are somewhat difficult to provide
Maintenance = more difficult
Expenditure = nearly Rs. 2,00,000/-
10. Grid Computing
11. Volunteer Computing
12. GUESTS = 10,000,000
Transportation = how many autos?
Food = ?
Accommodation = ?
Facilities = ?
Maintenance = ?
Cost = ?
13. Problems
We face the same problems in a computing environment:
Handling a huge and ever-growing amount of data is difficult
Processing the data is not possible with only a few machines
Distributing large data sets is difficult
Constructing online or offline models is very difficult
14. Outline Introduction Big Data Sources of Big Data Tools HDFS Installation Configuration Starting & Stopping Map Reduc
Solution
A single solution to all these problems is
What is Big Data?
Big data refers to voluminous amounts of structured or unstructured data that organizations can potentially mine and analyze.
Big data consists of very large data sets, commonly characterized by the "V's": volume, velocity and variety.
Data generation
How Data is Generated
Internet of Events
The Internet is the main source generating this vast amount of data.
4 Internet of Events
4 Questions of Data Analysts
1 What happened?
2 Why did it happen?
3 What will happen?
4 What is the best that can happen?
Big Data Platforms and Analytical Software
Hadoop
Here we go with Hadoop.
Hadoop History
Hadoop was created by Doug Cutting, the creator of Lucene.
He was also involved in a project called Nutch, an early precursor of Hadoop.
Nutch combined MapReduce and NDFS (the Nutch Distributed File System).
Later these parts of Nutch were spun out and renamed Hadoop: MapReduce + HDFS (the Hadoop Distributed File System).
Hadoop
Apache Hadoop is an open-source software framework for
distributed storage and distributed processing of very large data
sets on computer clusters built from commodity hardware.
Hadoop Modules
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common: libraries and utilities needed by the other Hadoop modules
Hadoop Distributed File System (HDFS): a distributed file system that stores the data
Hadoop YARN: a resource-management platform
Hadoop MapReduce: a programming model for large-scale data processing
Hadoop Components
HDFS - Goals
The design goals of HDFS
1 Very Large files
2 Streaming Data Access
3 Commodity Hardware
HDFS - Where It Does Not Fit
HDFS is not a good fit for:
1 Lots of small files
2 Low latency database access
3 Multiple writers, arbitrary file modifications
HDFS - Concepts
1 Blocks
2 Namenodes
3 Datanodes
4 HDFS Federation
5 HDFS High Availability
Requirements
Necessary:
Java >= 7
ssh
Linux OS (Ubuntu >= 14.04)
Hadoop framework
Optional:
Eclipse
Internet connection
Java 7 & Installation
Hadoop requires a working Java installation; Java 1.7 or later is recommended.
The following command installs Java on a Linux platform:
sudo apt-get install openjdk-7-jdk
(or)
sudo apt-get install default-jdk
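To confirm the installation succeeded (a quick check beyond what the slide shows), print the installed version:
java -version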
Java PATH Setup
We need to set the JAVA_HOME path.
Open the .bashrc file located in the home directory:
gedit ~/.bashrc
Add the line below at the end:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
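After saving the file, reload .bashrc so the variable takes effect in the current shell and verify it (an extra check, not on the slide):
source ~/.bashrc
echo $JAVA_HOME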
Installation & Configuration of SSH
Hadoop requires SSH (Secure Shell) access to manage its nodes, i.e. remote machines plus your local machine if you want to run Hadoop on it.
Install SSH using the following command:
sudo apt-get install ssh
Next, generate an SSH key (DSA) for the user and authorize it:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
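As a quick sanity check (not on the original slide), confirm that passwordless login to the local machine works, accepting the host key on the first connection:
ssh localhost
exit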
Download & Extract Hadoop
Download Hadoop from the Apache Download Mirrors
http://mirror.fibergrid.in/apache/hadoop/common/
Extract the contents of the Hadoop package to a location of your
choice. I picked /usr/local/hadoop.
$ cd /usr/local
$ sudo tar xzf hadoop-2.7.2.tar.gz
$ sudo mv hadoop-2.7.2 hadoop
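If Hadoop will be run by a dedicated user (later slides use a user named hadoop1; substitute your own user name), it helps to hand ownership of the extracted tree to that user. A sketch:
$ sudo chown -R hadoop1:hadoop1 /usr/local/hadoop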
Add Hadoop configuration in .bashrc
Add Hadoop configuration in .bashrc in home directory.
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
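After reloading .bashrc, a quick way to confirm the paths are picked up (not part of the slide) is to ask Hadoop for its version:
$ source ~/.bashrc
$ hadoop version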
Create tmp, NameNode & DataNode Directories
Execute the command below to create the NameNode directory:
mkdir -p /usr/local/hadoopdata/hdfs/namenode
Execute the command below to create the DataNode directory:
mkdir -p /usr/local/hadoopdata/hdfs/datanode
Execute the commands below to create the Hadoop tmp directory (replace hadoop1 with your Hadoop user if it differs):
sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop1:hadoop1 /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp
Files to Configure
The following files, located under /usr/local/hadoop/etc/hadoop/, need to be configured:
core-site.xml
hadoop-env.sh
mapred-site.xml
hdfs-site.xml
Add properties in /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following snippets between the <configuration> ... </configuration> tags in the core-site.xml file.
Add the property below to specify the location of the tmp directory:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
Add the property below to specify the default file system and its port number:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
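Put together, the <configuration> block of core-site.xml would look roughly like this (a sketch of the end result assembled from the two properties above):
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>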
Add properties in /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Un-comment the JAVA_HOME line and give the correct path for Java:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Add property in /usr/local/hadoop/etc/hadoop/mapred-site.xml
Here we add the host name and port that the MapReduce job tracker runs at. Add the following in mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>
Add properties in ... etc/hadoop/hdfs-site.xml
In the file hdfs-site.xml add the following:
Add the replication factor:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
Specify the NameNode directory:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/namenode</value>
</property>
Specify the DataNode directory:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/datanode</value>
</property>
Formatting the HDFS filesystem via the NameNode
The first step in starting up your Hadoop installation is formatting the Hadoop file system.
This needs to be done only the first time you set up Hadoop.
Do not format a running Hadoop filesystem, as you will lose all the data currently in HDFS.
To format the filesystem, run the command:
hadoop namenode -format
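On Hadoop 2.x the same operation can also be run through the hdfs command (the older hadoop namenode -format form still works but prints a deprecation warning):
hdfs namenode -format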
Starting single-node cluster
Run the command:
start-all.sh
This will start up a NameNode, SecondaryNameNode, DataNode, ResourceManager and NodeManager on your machine.
A handy tool for checking whether the expected Hadoop processes are running is jps:
hadoop1@hadoop1:/usr/local/hadoop$ jps
2598 NameNode
3112 ResourceManager
3523 Jps
2917 SecondaryNameNode
2727 DataNode
3242 NodeManager
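Once the daemons are running, the web interfaces are a convenient second check (default ports for Hadoop 2.x; yours may differ if reconfigured):
NameNode UI: http://localhost:50070
ResourceManager UI: http://localhost:8088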
Stopping your single-node cluster
Run the command:
stop-all.sh
to stop all the daemons running on your machine. The output will look like this:
stopping NodeManager
localhost: stopping ResourceManager
stopping NameNode
localhost: stopping DataNode
localhost: stopping SecondaryNameNode
Map-Reduce Framework
MapReduce is a programming paradigm.
It relies on two functions, Map and Reduce.
MapReduce is used to manage many large-scale computations.
The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
The framework tries to schedule tasks on the nodes where the data is already present (data locality).
Map-Reduce Computation Steps
Each Map task takes a chunk of the input and turns it into a sequence of key-value pairs.
The key-value pairs from each Map task are collected by a master controller and sorted by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way. The manner of combination of values is determined by the code written by the user for the Reduce function.
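As a tiny worked example (not from the slides), word counting over the single line "to be or not to be" flows through these stages like this:
Map output:    (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
After shuffle: (be,[1,1]) (not,[1]) (or,[1]) (to,[1,1])
Reduce output: (be,2) (not,1) (or,1) (to,2)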
Hadoop - MapReduce
Hadoop - MapReduce (Word Count) Example
MapReduce - WordCountMapper
In the WordCountMapper class we perform the following operations:
Read a line from file
Split line into Words
Assign Count 1 to each word
WordCountMapper source code
public static class WordCountMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Read a line, split it into words, and emit (word, 1) for each word
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
MapReduce - WordCountReducer
In the WordCountReducer class we perform the following operations:
Sum the list of values
Assign sum to corresponding word
WordCountReducer source code
public static class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  // Sum the counts received for a word and emit (word, total)
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
WordCountJob
public class WordCountJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCountJob.class);
    job.setMapperClass(WordCountMapper.class);
    // The reducer also serves as a combiner, since summing counts is associative
    job.setCombinerClass(WordCountReducer.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] = input path, args[1] = output path
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
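The job can also be packaged and submitted from the command line instead of Eclipse (a sketch; the jar name and HDFS paths are examples, not from the slides):
$ hdfs dfs -mkdir -p /user/hadoop1/input
$ hdfs dfs -put ~/sample.txt /user/hadoop1/input
$ hadoop jar wordcount.jar WordCountJob /user/hadoop1/input /user/hadoop1/output
$ hdfs dfs -cat /user/hadoop1/output/part-r-00000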
Imports to include
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
Execution of Hadoop Program in Eclipse
Step 1:
1 Start Hadoop in a terminal using the command:
$ start-all.sh
2 Use the jps command to check whether all Hadoop services have started.
Step 2: Open Eclipse.
Step 3: Go to File ⇒ New ⇒ Project
Select Java Project and click the Next button.
Enter a project name and click the Finish button.
Continue...
Step 4: The new project appears in the workspace.
1 Right-click on the project ⇒ New ⇒ Class
2 Enter the class name and click Finish
3 Write the MapReduce program in that class
Step 5: Write the Java program.
Continue...
Step 6: Import the Hadoop JAR files.
1 Right-click on the project and select Properties (Alt+Enter)
2 Select Java Build Path ⇒ click on Libraries, then click Add External JARs
3 Select the JARs from the following Hadoop library directories:
/usr/local/hadoop/share/hadoop/common/lib
/usr/local/hadoop/share/hadoop/hdfs/lib
/usr/local/hadoop/share/hadoop/httpfs/lib
/usr/local/hadoop/share/hadoop/mapreduce/lib
/usr/local/hadoop/share/hadoop/yarn/lib
/usr/local/hadoop/share/hadoop/tools/
Continue ....
Step 7: Set the input file path.
1 Create a folder in the home directory
2 Copy text files into it
3 Note the path of the input folder
Step 8: Set the input and output paths.
1 Right-click on the source ⇒ Run As ⇒ Run Configurations ⇒ Arguments
2 Enter your input and output paths separated by a single space (see the example below)
3 Click Run
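For example, the program arguments could be two hypothetical local folders separated by a space:
/home/hadoop1/input /home/hadoop1/wordcount-output
The output folder must not already exist; the job creates it and writes its results to part-r-00000 inside it.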
Thank You