Hadoop Developer Training
Session 01 – Introduction to Hadoop & Big Data
Agenda
• What is Big Data?
• What is Hadoop?
• Overview of Hadoop Ecosystem
• Hadoop Distributed File System or HDFS
• Hadoop Cluster Modes
• YARN
• MapReduce
• Hive
• Pig
• Zookeeper
• Flume
• Sqoop
What is Big Data?

Big data can be characterized by the 3 Vs:
• The extreme volume of data.
• The velocity at which the data must be processed.
• The wide variety of types of data.

 Volume: the size, amount, or quantity of data.
 Velocity: the speed of data.
  The speed at which data must be stored.
  The speed at which data must be processed.
 Variety: the type of data to be stored or processed.
  Structured data
  Unstructured data
  Semi-structured data
Characterization of Big Data

Volume, Velocity, Variety (the 3 Vs)
What Is Hadoop?

 A framework for storing and processing data using commodity hardware and storage.

We need a system that supports:
• Distributed, parallel processing
• Built-in backup and fail-over mechanisms
• Easy, economical scalability
• Efficient and reliable operation

So we need Hadoop.
Overview of the Hadoop Ecosystem

Hadoop ecosystem components (diagram)
The Hadoop Distributed File System (HDFS)

• HDFS is the storage system for a Hadoop cluster.
• When data arrives at the cluster, HDFS breaks it into pieces and distributes those pieces among the servers participating in the cluster.
• Each server stores just a small fragment of the complete data set.
• Each piece of data is replicated on more than one server.
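As a minimal illustration of this behaviour, the sketch below uses the HDFS Java API to write a small file, print its replication factor and block size, and read it back. It assumes a reachable cluster configured through core-site.xml on the classpath; the path /user/training/demo.txt is purely illustrative.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (and other settings) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/training/demo.txt");   // illustrative path

        // Write: the client streams data; HDFS splits it into blocks and replicates them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Inspect: block size and replication factor are stored per file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication = " + status.getReplication()
                + ", block size = " + status.getBlockSize());

        // Read the file back.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}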
Hadoop Cluster Modes

Hadoop can run in three different modes:
• Standalone mode
• Pseudo-distributed mode (single-node cluster)
• Fully distributed mode (multi-node cluster)

Standalone Mode
 The default mode of Hadoop.
 HDFS is not used in this mode; the local file system is used for input and output.
 No custom configuration is required in the three Hadoop configuration files:
  mapred-site.xml
  core-site.xml
  hdfs-site.xml
 Standalone mode is much faster than pseudo-distributed mode.
Hadoop Cluster Modes

Pseudo-Distributed Mode (Single-Node Cluster)
 Configuration is required in the three files listed above; the HDFS replication factor is one.
 A single node acts as Master Node, Data Node, Job Tracker, and Task Tracker.
 Used to test real code against HDFS.
 A pseudo-distributed cluster is one where all daemons run on a single node.

Fully Distributed Mode (Multi-Node Cluster)
 This is the production setup.
 Data is distributed across many nodes.
 Different nodes act as Master Node, Data Node, Job Tracker, and Task Tracker.
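A quick way to check which mode a client installation is configured for is to read the relevant properties from the Hadoop configuration, as in the rough sketch below. The defaults shown (file:/// and a replication factor of 3) are Hadoop's own defaults; anything else comes from the XML files found on the classpath.

import org.apache.hadoop.conf.Configuration;

public class ShowClusterMode {
    public static void main(String[] args) {
        // Loads core-site.xml / hdfs-site.xml / mapred-site.xml found on the classpath.
        Configuration conf = new Configuration();

        // Standalone mode leaves the default "file:///"; pseudo- and fully
        // distributed modes point this at HDFS, e.g. "hdfs://namenode:9000".
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS", "file:///"));

        // Pseudo-distributed clusters typically set this to 1; the HDFS default is 3.
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
    }
}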
Hadoop Cluster – Core Components

A Hadoop cluster has three kinds of components:
 Client
 Master
 Slave
The role of each component is described on the following slides.
Hadoop Cluster – Core Components

Client:
The client is neither master nor slave. It loads data into the cluster, submits MapReduce jobs describing how the data should be processed, and then retrieves the results after job completion.
Hadoop Cluster – Core Components

Masters:
The masters consist of three components:
 NameNode
 Secondary NameNode
 JobTracker
Hadoop Cluster – Core Components

Slaves:
Slave nodes make up the majority of machines in a Hadoop cluster and are responsible for
 storing the data
 processing the computation
Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their masters: the TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a slave to the NameNode.
Hadoop Cluster – Core Components

The block replication factor is controlled by the dfs.replication parameter in the file hdfs-site.xml.

Equip the NameNode with a highly redundant, enterprise-class server configuration: dual power supplies, hot-swappable fans, redundant NIC connections, etc.
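As a sketch of how dfs.replication interacts with individual files: the value in hdfs-site.xml only sets the default for newly created files, while existing files can be re-replicated through the FileSystem API. The path below is illustrative (for example, the file created in the earlier HDFS sketch).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/training/demo.txt");   // illustrative existing file

        // dfs.replication is only the default for new files;
        // existing files can be re-replicated per file.
        boolean accepted = fs.setReplication(path, (short) 3);
        System.out.println("request accepted = " + accepted
                + ", replication now = " + fs.getFileStatus(path).getReplication());

        fs.close();
    }
}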
YARN

YARN stands for Yet Another Resource Negotiator; it is also called MapReduce 2 (MRv2). The two major responsibilities of the JobTracker in MRv1, resource management and job scheduling/monitoring, are split into separate daemons:
 ResourceManager
 NodeManager
 ApplicationMaster

Features:
• Better resource management
• Scalability
• Dynamic allocation of cluster resources
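For illustration, the ResourceManager can be queried through the YARN client API. The sketch below simply lists the applications the ResourceManager knows about; it assumes a running cluster whose addresses are provided by yarn-site.xml on the classpath.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        // Reads the ResourceManager address etc. from yarn-site.xml on the classpath.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for every application it knows about.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.printf("%s  %s  %s%n",
                    app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}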
MapReduce

• Parallel job-processing framework
• Written in Java
• Close integration with HDFS
• Provides:
– Automatic partitioning of a job into sub-tasks
– Automatic retry on failures
– Locality of task execution
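The canonical example of these ideas is word count: the map phase emits a (word, 1) pair for every word, and the reduce phase sums the counts per word. The sketch below uses the standard org.apache.hadoop.mapreduce API; input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, the job would typically be submitted with: hadoop jar wordcount.jar WordCount <input dir> <output dir>.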
Hive

• Apache Hive in a few words: "a data warehouse infrastructure built on top of Apache Hadoop."
• Used for:
– Ad-hoc querying and analysis of large data sets without having to learn MapReduce
• Main features:
– An SQL-like query language called HQL
– Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data types
– Support for different storage types such as plain text, HBase, and others
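As a sketch of how HQL is issued from an application, the snippet below talks to HiveServer2 over JDBC. The host, port, database, user, and the pageviews table are all assumptions for illustration; in practice the same query could be typed directly into the Hive shell.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; assumes hive-jdbc and its dependencies are on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed connection details: HiveServer2 on localhost:10000, database "default".
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "training", "");
             Statement stmt = con.createStatement()) {

            // HQL looks like SQL; "pageviews" is a hypothetical table.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM pageviews GROUP BY page LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}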
PIG

Data access: Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the usual data manipulation operations in Hadoop using Apache Pig.

To write data analysis programs, Pig provides a high-level language known as Pig Latin. The language provides various operators with which programmers can develop their own functions for reading, writing, and processing data.

To analyze data with Apache Pig, programmers write scripts in the Pig Latin language. All of these scripts are internally converted into Map and Reduce tasks. Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.

Salient features of Pig:
• Ease of programming
• Optimization opportunities
• Extensibility

Note: Pig scripts are internally converted into MapReduce programs.
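Pig Latin can also be embedded in a Java program through the PigServer class. The sketch below runs in local mode against a hypothetical tab-separated file access_log.tsv; switching the constructor argument to "mapreduce" would submit the same data flow to the cluster as MapReduce jobs.

import java.util.Iterator;

import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // "local" runs against the local file system; "mapreduce" submits to the cluster.
        PigServer pig = new PigServer("local");

        // Hypothetical input: lines of "user<TAB>bytes".
        pig.registerQuery("logs = LOAD 'access_log.tsv' AS (user:chararray, bytes:long);");
        pig.registerQuery("grouped = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");

        // Pull the result of the data flow back into the Java program.
        Iterator<Tuple> it = pig.openIterator("totals");
        while (it.hasNext()) {
            System.out.println(it.next());
        }

        pig.shutdown();
    }
}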
"ZooKeeper allows distributed processes to coordinate with each other
through a shared hierarchical name space of data registers"
• Configuration management - machines
• config from a centralized source,
• facilitates simpler deployment/provisioning
• Leader election - a common problem in distributed coordination
• Centralized and highly reliable (simple) data registry
ZOOKEEPER
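A minimal sketch of that shared name space using the ZooKeeper Java client is shown below: one process publishes a configuration value into a znode and any other process can read it. The ensemble address localhost:2181 and the znode name are assumptions.

import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper ensemble at localhost:2181; 3-second session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        String path = "/demo-batch-size";   // hypothetical top-level config register (znode)

        // Publish a value into the shared hierarchical name space.
        if (zk.exists(path, false) == null) {
            zk.create(path, "500".getBytes(StandardCharsets.UTF_8),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any process in the cluster can now read the same register.
        byte[] data = zk.getData(path, false, null);
        System.out.println(path + " = " + new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}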
FLUME

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Features:
• Robust
• Fault tolerant
• Simple and flexible architecture based on streaming data flows
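Applications can also hand events to a Flume agent directly through the Flume client SDK. The sketch below assumes an agent with an Avro source listening on localhost:41414; both the host and port are illustrative.

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws Exception {
        // Assumed: a Flume agent with an Avro source bound to localhost:41414.
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // Wrap a log line in a Flume event and send it to the agent.
            Event event = EventBuilder.withBody("application started", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}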
SQOOP

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. Key features of Sqoop include:
 Bulk import: Sqoop can import individual tables or entire databases into HDFS. The data is stored in native directories and files in the HDFS file system.
 Data export: Sqoop can export data directly from HDFS into a relational database, using a target table definition based on the specifics of the target database.
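Sqoop itself is driven from the command line. To stay in the same language as the other sketches, the example below simply assembles a typical import command and launches it from Java; the connection string, credentials file, table, and target directory are all illustrative, and the sqoop binary is assumed to be on the PATH.

import java.util.Arrays;
import java.util.List;

public class SqoopImportLauncher {
    public static void main(String[] args) throws Exception {
        // Assumed: the "sqoop" CLI is installed and a MySQL database "shop"
        // with a "customers" table is reachable; all names here are illustrative.
        List<String> command = Arrays.asList(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost/shop",
                "--username", "retail",
                "--password-file", "/user/training/.db-password",
                "--table", "customers",
                "--target-dir", "/user/training/customers",
                "--num-mappers", "1");

        Process process = new ProcessBuilder(command)
                .inheritIO()          // stream Sqoop's output to this console
                .start();
        System.exit(process.waitFor());
    }
}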
Any Questions?
Thank You!