Performance Optimization tips
Compression 
• Parameter: mapred.compress.map.output: compress map output 
• Default: false 
• Pros: faster disk writes, lower disk space usage, and less time spent transferring data from mappers to reducers. 
• Cons: overhead of compression at mappers and decompression at reducers. 
• Suggestions: for large clusters and large jobs, this property should be set to true. The compression codec can also be set through the property mapred.map.output.compression.codec (default is org.apache.hadoop.io.compress.DefaultCodec).
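As a minimal sketch, the two properties above might be set in mapred-site.xml as follows; the values are illustrative (GzipCodec is just one of the codecs Hadoop ships), not a recommendation for every cluster:

```xml
<!-- mapred-site.xml: enable map output compression (illustrative values) -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```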
Speculative Execution 
• Parameter: mapred.map/reduce.tasks.speculative.execution: enable/disable speculative execution of (map/reduce) tasks 
• Default: true 
• Pros: reduces the job time if a task progresses slowly due to memory unavailability or hardware degradation. 
• Cons: increases the job time if a task progresses slowly due to complex and large calculations. On a busy cluster, speculative execution can reduce overall throughput, since redundant tasks are executed in an attempt to bring down the execution time of a single job. 
• Suggestions: in large jobs where the average task completion time is significant (> 1 hr) due to complex and large calculations, and high throughput is required, speculative execution should be set to false.
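A sketch of disabling speculative execution for both task types in mapred-site.xml (the parameter expands to one property per task type):

```xml
<!-- mapred-site.xml: disable speculative execution (illustrative) -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```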
Number of Maps/Reducers 
• Parameter: mapred.tasktracker.map/reduce.tasks.maximum: maximum number of (map/reduce) tasks per tasktracker 
• Default: 2 
• Suggestions: recommended range is (cores_per_node)/2 to 2x(cores_per_node), especially for large clusters. This value should be set according to the hardware specification of the cluster nodes and the resource requirements of the (map/reduce) tasks.
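For example, on an 8-core node a setting inside the recommended range (4 to 16 slots total) might look like the following; the split between map and reduce slots is an assumption to be tuned per workload:

```xml
<!-- mapred-site.xml: task slots for an 8-core node (illustrative) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```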
File block size 
• Parameter: dfs.block.size: File system block size 
• Default: 67108864 bytes (64 MB) 
• Suggestions: 
• Small cluster and large data set: default block size will create a large number 
of map tasks. 
– e.g. Input data size = 160 GB and dfs.block.size = 64 MB then the minimum no. of 
maps= (160*1024)/64 = 2560 maps. 
– If dfs.block.size = 128 MB minimum no. of maps= (160*1024)/128 = 1280 maps. 
– If dfs.block.size = 256 MB minimum no. of maps= (160*1024)/256 = 640 maps. 
• In a small cluster (6-10 nodes) the map task creation overhead is 
considerable. 
• So dfs.block.size should be large in this case but small enough to utilize all 
the cluster resources. 
• The block size should be set according to size of the cluster, map task 
complexity, map task capacity of cluster and average size of input files.
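Continuing the example above, moving from the 64 MB default to 128 MB (halving the map count for the 160 GB input) would be a one-property change in hdfs-site.xml; the value is in bytes:

```xml
<!-- hdfs-site.xml: 128 MB block size (illustrative; applies to newly written files) -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
```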
Sort size 
• Parameter: io.sort.mb: buffer size (MB) for sorting 
• Default: 100 
• Suggestions: 
• For large jobs (jobs in which the map output is very large), this value should be increased, keeping in mind that it increases the memory required by each map task. 
• So any increase in this value should be in line with the memory available on the node. 
• The greater the value of io.sort.mb, the fewer the spills to disk, saving disk writes.
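A sketch of raising the sort buffer; 200 MB is an illustrative value, and note it must fit within each map task's JVM heap (mapred.child.java.opts):

```xml
<!-- mapred-site.xml: larger sort buffer for large map outputs (illustrative) -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>
```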
Sort factor 
• Parameter: io.sort.factor: stream merge factor 
• Default: 10 
• Suggestions: for large jobs (jobs in which the map output is very large and the number of maps is also large) that produce a large number of spills to disk, the value of this property should be increased. 
• The number of input streams (files) merged at once in the map/reduce tasks, as specified by io.sort.factor, should be set to a sufficiently large value (for example, 100) to minimize disk accesses. 
• Increasing io.sort.factor also benefits merging at the reducers, since the last batch of streams (up to io.sort.factor of them) is passed to the reduce function without merging, saving merge time.
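The suggested value from the text, as a config fragment:

```xml
<!-- mapred-site.xml: merge up to 100 streams at once (per the suggestion above) -->
<property>
  <name>io.sort.factor</name>
  <value>100</value>
</property>
```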
JVM reuse 
• Parameter: mapred.job.reuse.jvm.num.tasks: number of tasks that may reuse a single JVM 
• Default: 1 
• Suggestions: the minimum overhead of JVM creation for each task is around 1 second. So for tasks that live for seconds or a few minutes and have lengthy initialization, this value can be increased to gain performance.
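A sketch of enabling JVM reuse; a value of -1 means no limit on the number of tasks a JVM may run, while a fixed positive value bounds the reuse:

```xml
<!-- mapred-site.xml: unlimited JVM reuse for short tasks (illustrative) -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```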
Reduce parallel copies 
• Parameter: mapred.reduce.parallel.copies: the number of threads used to copy map outputs to the reducer in parallel 
• Default: 5 
• Suggestions: for large jobs (jobs in which the map output is very large), the value of this property can be increased, keeping in mind that it increases total CPU usage.
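An illustrative bump from the default of 5; the right value depends on cluster size and network capacity:

```xml
<!-- mapred-site.xml: more parallel shuffle copy threads (illustrative) -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>
</property>
```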
The Other Threads 
• dfs.namenode.handler.count / mapred.job.tracker.handler.count: server threads that handle remote procedure calls (RPCs) 
– Default: 10 
– Suggestions: this can be increased for larger clusters (50-64). 
• dfs.datanode.handler.count: server threads that handle remote procedure calls (RPCs) 
– Default: 3 
– Suggestions: this can be increased for a larger number of HDFS clients (6-8). 
• tasktracker.http.threads: number of worker threads on the HTTP server on each TaskTracker 
– Default: 40 
– Suggestions: this can be increased for larger clusters (50).
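Pulling the suggested ranges together (values illustrative); the first two properties belong in hdfs-site.xml, the last two in mapred-site.xml:

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>50</value>
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>8</value>
</property>
<!-- mapred-site.xml -->
<property>
  <name>mapred.job.tracker.handler.count</name>
  <value>50</value>
</property>
<property>
  <name>tasktracker.http.threads</name>
  <value>50</value>
</property>
```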
Revelation – Temporary space 
• Temporary space allocation: 
– Jobs that generate large intermediate data (map output) should have enough temporary space, controlled by the property mapred.local.dir. 
– This property specifies the list of directories where MapReduce stores intermediate data for jobs. 
– The data is cleaned up after the job completes. 
– By default, the replication factor for file storage on HDFS is 3, which means that every file has three replicas. 
– As a rule of thumb, at least 25% of the total hard disk should be allocated for intermediate temporary output. 
– So, effectively, only ¼ of the hard disk space is available for business use. 
– The default value of mapred.local.dir is ${hadoop.tmp.dir}/mapred/local. 
– So if mapred.local.dir is not set, hadoop.tmp.dir must have enough space to hold the job's intermediate data. 
– If the node doesn't have enough temporary space, the task attempt will fail and a new attempt will start, impacting performance.
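mapred.local.dir takes a comma-separated list, so intermediate data can be spread across multiple physical disks; the paths below are hypothetical:

```xml
<!-- mapred-site.xml: spread intermediate data over two disks (hypothetical paths) -->
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local</value>
</property>
```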
Java – JVM 
• JVM tuning: 
– Besides normal Java code optimizations, the JVM settings for each child task also affect the processing time. 
– On the slave node, the tasktracker and datanode use 1 GB RAM each. 
– Effective use of the remaining RAM, as well as choosing the right GC mechanism for each map or reduce task, is very important for maximum utilization of hardware resources. 
– The default maximum RAM for child tasks is 200 MB, which might be insufficient for many production-grade jobs. 
– The JVM settings for child tasks are governed by the mapred.child.java.opts property. 
– Use JDK 1.6 64-bit: 
• -XX:+UseCompressedOops is helpful in dealing with OOM errors 
– Remember to raise the Linux open file descriptor limit: 
• Check: more /proc/sys/fs/file-max 
• Change: vi /etc/sysctl.conf -> fs.file-max = 331287 
• Apply: sysctl -p 
– Set java.net.preferIPv4Stack to true, to avoid timeouts in cases where the OS/JVM picks up an IPv6 address and must resolve the hostname.
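The JVM settings above can be combined into a single mapred.child.java.opts value; the 512 MB heap is an illustrative figure, not a recommendation:

```xml
<!-- mapred-site.xml: child task JVM options (illustrative heap size) -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -XX:+UseCompressedOops -Djava.net.preferIPv4Stack=true</value>
</property>
```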
Logging: a friend to developers, a foe in production 
• Default – INFO level 
– dfs.namenode.logging.level 
– hadoop.job.history 
– hadoop.logfile.size/count
Static Data strategies 
• Available approaches 
– JobConf.set("key","value") 
– Distributed cache 
– HDFS shared file 
• Suggested approaches if the above are not efficient 
– Memcached 
– Tokyo Cabinet / Tokyo Tyrant 
– Berkeley DB 
– HBase 
– MongoDB
Tuning as suggested by Arun C Murthy 
• Tell HDFS and Map-Reduce about your network! – rack locality script: topology.script.file.name 
• Number of maps – data locality 
• Number of reduces – you don't need a single output file! 
• Amount of data processed per map – consider fatter maps, custom input format 
• Combiner – multi-level combiners at both map and reduce 
• Check to ensure the combiner is useful! 
• Map-side sort – io.sort.mb, io.sort.factor, io.sort.record.percent, io.sort.spill.percent 
• Shuffle 
– Compression for map outputs – mapred.compress.map.output, mapred.map.output.compression.codec, lzo via libhadoop.so, tasktracker.http.threads 
– mapred.reduce.parallel.copies, mapred.reduce.copy.backoff, mapred.job.shuffle.input.buffer.percent, mapred.job.shuffle.merge.percent, mapred.inmem.merge.threshold, mapred.job.reduce.input.buffer.percent 
• Compress the job output 
• Miscellaneous – speculative execution, heap size for the child, JVM reuse for maps/reduces, raw comparators
Anti-Patterns 
• Applications not using a higher-level interface such as Pig unless really 
necessary. 
• Processing thousands of small files (sized less than 1 HDFS block, typically 
128MB) with one map processing a single small file. 
• Processing very large data-sets with small HDFS block size, that is, 128MB, 
resulting in tens of thousands of maps. 
• Applications with a large number (thousands) of maps with a very small 
runtime (e.g., 5s). 
• Straightforward aggregations without the use of the Combiner. 
• Applications with greater than 60,000-70,000 maps. 
• Applications processing large data-sets with very few reduces (e.g., 1).
Anti-Patterns 
• Applications using a single reduce for a total order among the output 
records. 
• Applications processing data with very large number of reduces, such that 
each reduce processes less than 1-2GB of data. 
• Applications writing out multiple, small, output files from each reduce. 
• Applications using the DistributedCache to distribute a large number of 
artifacts and/or very large artifacts (hundreds of MBs each). 
• Applications using tens or hundreds of counters per task. 
• Applications doing screen scraping of JobTracker web-ui for status of 
queues/jobs or worse, job-history of completed jobs. 
• Workflows comprising hundreds or thousands of small jobs processing small 
amounts of data.
End of session 
Day – 3: Performance Optimization tips
Buy vs. Build: Unlocking the right path for your training tech
Rustici Software
 
Wilcom Embroidery Studio Crack 2025 For Windows
Wilcom Embroidery Studio Crack 2025 For WindowsWilcom Embroidery Studio Crack 2025 For Windows
Wilcom Embroidery Studio Crack 2025 For Windows
Google
 

Hadoop performance optimization tips

• Suggestions:
– In a small cluster with a large data set, the default block size will create a large number of map tasks.
– e.g. Input data size = 160 GB and dfs.block.size = 64 MB: minimum no. of maps = (160*1024)/64 = 2560 maps.
– If dfs.block.size = 128 MB: minimum no. of maps = (160*1024)/128 = 1280 maps.
– If dfs.block.size = 256 MB: minimum no. of maps = (160*1024)/256 = 640 maps.
• In a small cluster (6-10 nodes) the map task creation overhead is considerable, so dfs.block.size should be large in this case, but small enough to utilize all the cluster resources.
• The block size should be set according to the size of the cluster, the map task complexity, the map task capacity of the cluster and the average size of the input files.
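The block-size arithmetic above can be checked with a short sketch. The helper name `min_maps` is illustrative, not a Hadoop API; this is simple arithmetic and the real map count also depends on the input format and file boundaries.

```python
# Minimum number of map tasks implied by dfs.block.size, per the
# examples above: one map per HDFS block of input.
def min_maps(input_gb, block_mb):
    # Convert input size to MB, then divide by the block size.
    return (input_gb * 1024) // block_mb

print(min_maps(160, 64))   # 2560
print(min_maps(160, 128))  # 1280
print(min_maps(160, 256))  # 640
```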
Sort size
• Parameter: io.sort.mb: Buffer size (in MB) for sorting
• Default: 100
• Suggestions:
– For large jobs (jobs in which the map output is very large), this value should be increased, keeping in mind that it will increase the memory required by each map task.
– So the increment in this value should be according to the available memory on the node.
– The greater the value of io.sort.mb, the fewer the spills to disk, saving disk writes.
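The spill/buffer relationship above can be sketched with a toy model. The helper `estimated_spills` is hypothetical and deliberately simplified: it ignores io.sort.spill.percent and per-record metadata overhead, so it only illustrates the trend that a larger io.sort.mb means fewer spills.

```python
import math

# Toy model of map-side spills: the in-memory sort buffer is flushed
# to disk each time it fills, so a map producing map_output_mb of
# data with an io_sort_mb buffer spills roughly this many times.
def estimated_spills(map_output_mb, io_sort_mb):
    return max(1, math.ceil(map_output_mb / io_sort_mb))

print(estimated_spills(1000, 100))  # 10
print(estimated_spills(1000, 250))  # 4
```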
Sort factor
• Parameter: io.sort.factor: Stream merge factor
• Default: 10
• Suggestions:
– For large jobs (jobs in which the map output is very large and the number of maps is also large) which have a large number of spills to disk, the value of this property should be increased.
– The number of input streams (files) to be merged at once in the map/reduce tasks, as specified by io.sort.factor, should be set to a sufficiently large value (for example, 100) to minimize disk accesses.
– An increased io.sort.factor also benefits merging at the reducers, since the last batch of streams (up to io.sort.factor of them) is sent to the reduce function without merging, saving merge time.
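Why a larger io.sort.factor means fewer disk passes can be illustrated with a toy model. The helper `merge_rounds` is hypothetical and simplified (real Hadoop merging is smarter about batch sizes and feeds the final batch straight to reduce), but it shows the multi-pass structure: each round merges up to F streams into one.

```python
# Toy model of multi-pass merging: with F = io.sort.factor, each
# round combines up to F spill files into one, so the number of
# rounds shrinks roughly logarithmically as F grows.
def merge_rounds(num_spills, sort_factor):
    rounds = 0
    while num_spills > 1:
        num_spills = -(-num_spills // sort_factor)  # ceil division
        rounds += 1
    return rounds

print(merge_rounds(100, 10))   # 2
print(merge_rounds(100, 100))  # 1
```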
JVM reuse
• Parameter: mapred.job.reuse.jvm.num.tasks: Reuse a single JVM for multiple tasks
• Default: 1
• Suggestions: The minimum overhead of JVM creation for each task is around 1 second. So for tasks which live for only seconds or a few minutes and have lengthy initialization, this value can be increased to gain performance.
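As a rough illustration of the saving described above, assuming the ~1 second JVM creation overhead quoted in the slide (the helper name is illustrative, not a Hadoop API):

```python
# Rough estimate of total JVM start-up overhead for a job, assuming
# ~1 s creation cost per JVM launched. Reusing each JVM for `reuse`
# tasks divides the number of JVMs actually created.
def jvm_overhead_seconds(num_tasks, reuse, startup_s=1.0):
    jvms_launched = -(-num_tasks // reuse)  # ceil division
    return jvms_launched * startup_s

print(jvm_overhead_seconds(1000, 1))   # 1000.0
print(jvm_overhead_seconds(1000, 10))  # 100.0
```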
Reduce parallel copies
• Parameter: mapred.reduce.parallel.copies: The number of threads used to copy map outputs to the reducer in parallel
• Default: 5
• Suggestions: For large jobs (jobs in which the map output is very large), the value of this property can be increased, keeping in mind that it will increase the total CPU usage.
The Other Threads
• dfs.namenode.handler.count / mapred.job.tracker.handler.count: server threads that handle remote procedure calls (RPCs)
– Default: 10
– Suggestions: This can be increased on larger servers (50-64).
• dfs.datanode.handler.count: server threads that handle remote procedure calls (RPCs)
– Default: 3
– Suggestions: This can be increased for a larger number of HDFS clients (6-8).
• tasktracker.http.threads: number of worker threads on the HTTP server on each TaskTracker
– Default: 40
– Suggestions: This can be increased for larger clusters (50).
Revelation - Temporary space
• Temporary space allocation:
– Jobs which generate large intermediate data (map output) should have enough temporary space, controlled by the property mapred.local.dir.
– This property specifies the list of directories where MapReduce stores intermediate data for jobs.
– The data is cleaned up after the job completes.
– By default, the replication factor for file storage on HDFS is 3, which means that every file has three replicas.
– As a rule of thumb, at least 25% of the total hard disk should be allocated for intermediate temporary output.
– So effectively, only ¼ of the hard disk space is available for business use.
– The default value of mapred.local.dir is ${hadoop.tmp.dir}/mapred/local. So if mapred.local.dir is not set, hadoop.tmp.dir must have enough space to hold the job's intermediate data.
– If the node doesn't have enough temporary space, the task attempt will fail and a new attempt starts, impacting performance.
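The disk budgeting implied by the rule of thumb above can be worked through as arithmetic. The helper name is illustrative; it assumes the slide's figures (25% reserved for temporary output, replication factor 3).

```python
# Usable HDFS capacity after disk budgeting: reserve temp_fraction
# of raw disk for mapred.local.dir intermediate data, then divide
# the remainder by the replication factor, since HDFS stores that
# many copies of every block.
def effective_hdfs_capacity(raw_tb, temp_fraction=0.25, replication=3):
    return raw_tb * (1 - temp_fraction) / replication

# 100 TB raw disk -> 25 TB usable for one logical copy of the data,
# matching the "only ¼ available for business use" observation.
print(effective_hdfs_capacity(100))  # 25.0
```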
Java - JVM
• JVM tuning:
– Besides normal Java code optimizations, the JVM settings for each child task also affect the processing time.
– On the slave node, the TaskTracker and DataNode use 1 GB RAM each.
– Effective use of the remaining RAM, as well as choosing the right GC mechanism for each map or reduce task, is very important for maximum utilization of hardware resources.
– The default max RAM for child tasks is 200 MB, which might be insufficient for many production-grade jobs.
– The JVM settings for child tasks are governed by the mapred.child.java.opts property.
– Use JDK 1.6 64-bit:
• -XX:+UseCompressedOops is helpful in dealing with OOM errors.
– Do remember to change the Linux open file descriptor limit:
• Check: more /proc/sys/fs/file-max
• Change: vi /etc/sysctl.conf -> fs.file-max = 331287
• Set: sysctl -p
– Set java.net.preferIPv4Stack to true, to avoid timeouts in cases where the OS/JVM picks up an IPv6 address and must resolve the hostname.
Logging
• Logging is a friend to developers, a foe in production.
• Default - INFO level
– dfs.namenode.logging.level
– hadoop.job.history
– hadoop.logfile.size/count
Static Data strategies
• Available approaches:
– JobConf.set("key","value")
– Distributed cache
– HDFS shared file
• Suggested approaches if the above are not efficient:
– Memcached
– Tokyo Cabinet/Tokyo Tyrant
– Berkeley DB
– HBase
– MongoDB
Tuning as suggested by Arun C Murthy
• Tell HDFS and Map-Reduce about your network!
– Rack locality script: topology.script.file.name
• Number of maps - data locality
• Number of reduces - you don't need a single output file!
• Amount of data processed per map - consider fatter maps, a custom input format
• Combiner - multi-level combiners at both map and reduce
– Check to ensure the combiner is useful!
• Map-side sort - io.sort.mb, io.sort.factor, io.sort.record.percent, io.sort.spill.percent
• Shuffle
– Compression for map outputs - mapred.compress.map.output, mapred.map.output.compression.codec, lzo via libhadoop.so, tasktracker.http.threads
– mapred.reduce.parallel.copies, mapred.reduce.copy.backoff, mapred.job.shuffle.input.buffer.percent, mapred.job.shuffle.merge.percent, mapred.inmem.merge.threshold, mapred.job.reduce.input.buffer.percent
• Compress the job output
• Miscellaneous - speculative execution, heap size for the child, re-use JVM for maps/reduces, raw comparators
Anti-Patterns
• Applications not using a higher-level interface such as Pig unless really necessary.
• Processing thousands of small files (sized less than 1 HDFS block, typically 128 MB) with one map processing a single small file.
• Processing very large data sets with a small HDFS block size (that is, 128 MB), resulting in tens of thousands of maps.
• Applications with a large number (thousands) of maps with a very small runtime (e.g., 5 s).
• Straightforward aggregations without the use of the combiner.
• Applications with more than 60,000-70,000 maps.
• Applications processing large data sets with very few reduces (e.g., 1).
Anti-Patterns
• Applications using a single reduce for a total order among the output records.
• Applications processing data with a very large number of reduces, such that each reduce processes less than 1-2 GB of data.
• Applications writing out multiple small output files from each reduce.
• Applications using the DistributedCache to distribute a large number of artifacts and/or very large artifacts (hundreds of MBs each).
• Applications using tens or hundreds of counters per task.
• Applications doing screen-scraping of the JobTracker web UI for the status of queues/jobs or, worse, the job history of completed jobs.
• Workflows comprising hundreds or thousands of small jobs processing small amounts of data.
End of session - Day 3: Performance Optimization tips