HDFS: Optimization, Stabilization and Supportability
April 13, 2016
Chris Nauroth
email: cnauroth@hortonworks.com
twitter: @cnauroth
About Me
Chris Nauroth
• Member of Technical Staff, Hortonworks
– Apache Hadoop committer, PMC member, and Apache Software Foundation member
– Major contributor to HDFS ACLs, Windows compatibility, and operability improvements
• Hadoop user since 2010
– Prior employment experience deploying, maintaining and using Hadoop clusters
Motivation
• HDFS engineers are on the front line for operational support of Hadoop.
– HDFS is the foundational storage layer for typical Hadoop deployments.
– Therefore, challenges in HDFS have the potential to impact the entire Hadoop ecosystem.
– Conversely, application problems can become visible at the layer of HDFS operations.
• Analysis of Hadoop Support Cases
– Support case trends reveal common patterns for HDFS operational challenges.
– Those challenges inform what needs to improve in the software.
• Software Improvements
– Optimization: Identify bottlenecks and make them faster.
– Stabilization: Prevent unusual circumstances from harming cluster uptime.
– Supportability: When something goes wrong, provide visibility and tools to fix it.
Thank you to the entire community of Apache contributors.
Logging
• Logging requires a careful balance.
– Too little logging hides valuable operational information.
– Too much logging causes information overload, increased load and greater garbage collection overhead.
• Logging APIs
– Hadoop codebase currently uses a mix of logging APIs.
– Commons Logging and Log4J 1 require additional guard logic to prevent execution of expensive messages.
if (LOG.isDebugEnabled()) {
  LOG.debug("Processing block: " + block); // expensive toString() implementation!
}
– SLF4J simplifies this.
LOG.debug("Processing block: {}", block); // calls toString() only if debug enabled
• Pitfalls
– Forgotten guard logic.
– Logging in a tight loop.
– Logging while holding a shared resource, such as a mutually exclusive lock.
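To make the pitfalls concrete, here is a minimal sketch (illustrative only, not HDFS code) of parameterized SLF4J logging and of moving a log statement out of a lock's critical section:

import java.util.concurrent.locks.ReentrantLock;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingPitfalls {
  private static final Logger LOG = LoggerFactory.getLogger(LoggingPitfalls.class);
  private final ReentrantLock lock = new ReentrantLock();

  public void process(Object block) {
    // Parameterized logging: block.toString() runs only if debug is enabled,
    // so no explicit isDebugEnabled() guard is needed for simple arguments.
    LOG.debug("Processing block: {}", block);

    String summary;
    lock.lock();
    try {
      // Do only the work that needs the lock; capture what to log.
      summary = "state=" + block;
    } finally {
      lock.unlock();
    }
    // Log after releasing the lock so a slow appender cannot stall other threads.
    LOG.info("Processed block: {}", summary);
  }
}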
HADOOP-12318: better logging of LDAP exceptions
• Failure to log full details of an authentication failure.
– Very simple patch, huge payoff.
– Include exception details when logging failure.
• Before:
throw new SaslException("PLAIN auth failed: " + e.getMessage());
• After:
throw new SaslException("PLAIN auth failed: " + e.getMessage(), e);
HDFS-9434: Recommission a datanode with 500k blocks may pause NN for 30 seconds
• Logging is too verbose
– Summary of patch: don’t log too much!
– Move detailed logging to trace level.
– It’s still accessible for edge case troubleshooting, but it doesn’t impact base operations.
• Before:
LOG.info("BLOCK* processOverReplicatedBlock: " +
"Postponing processing of over-replicated " +
block + " since storage + " + storage
+ "datanode " + cur + " does not yet have up-to-date " +
"block information.");
• After:
if (LOG.isTraceEnabled()) {
LOG.trace("BLOCK* processOverReplicatedBlock: Postponing " + block
+ " since storage " + storage
+ " does not yet have up-to-date information.");
}
Troubleshooting
• Kerberos is hard.
– Many moving parts: KDC, DNS, principals, keytabs and Hadoop configuration.
– Management tools like Apache Ambari automate initial provisioning of principals, keytabs and configuration.
– When it doesn’t work, finding root cause is challenging.
• Metrics are vital for diagnosis of most operational problems.
– Metrics must be capable of showing that there is a problem. (e.g. RPC call volume spike)
– Metrics also must be capable of identifying the source of that problem. (e.g. user issuing RPC calls)
HADOOP-12426: kdiag
• Kerberos misconfiguration diagnosis.
– Attempts to diagnose multiple sources of potential Kerberos misconfiguration problems.
– DNS
– Hadoop configuration files
– KDC configuration
• kdiag: a command-line tool for diagnosis of Kerberos problems
– Automatically trigger Java diagnostics, such as -Dsun.security.krb5.debug.
– Prints various environment variables, Java system properties and Hadoop configuration options related to
security.
– Attempt a login.
– If keytab used, print principal information from keytab.
– Print krb5.conf.
– Validate kinit executable (used for ticket renewals).
HDFS-6982: nntop
• Find activity trends of HDFS operations.
– HDFS audit log contains a record of each file system operation to the NameNode.
– NameNode metrics contain raw counts of operations.
– Identifying load trends from particular users or particular operations has always required ad-hoc scripting to
analyze the above sources of information.
• nntop: HDFS operation counts aggregated per operation and per user within time windows.
– curl 'http://127.0.0.1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'
– Look for the “TopUserOpCounts” section in the returned JSON.
"ops": [
{
"totalCount": 1,
"opType": "delete",
"topUsers": [
{
"count": 1,
"user": "chris"
}
HDFS-7182: JMX metrics aren't accessible when NN is busy
• Lock contention while attempting to query NameNode JMX metrics.
– JMX metrics are often queried in response to operational problems.
– Some metrics data required acquisition of a lock inside the NameNode. If another thread held this lock, then
metrics could not be accessed.
– During times of high load, the lock is likely to be held by another thread.
– At a time when the metrics are most likely to be needed, they were inaccessible.
– This patch addressed the problem by acquiring the metrics data without requiring the lock to be held.
Managing Load
• RPC call load.
– It’s too easy for a single inefficient job to overwhelm a cluster with too much RPC load.
– RPC servers accept calls into a single shared queue.
– Overflowing that queue causes increased latency and rejection of calls for all callers, not just the single inefficient
job that caused the problem.
– Load problems can be mitigated with enhanced admission control, client back-off and throttling policies
tailored to real-world usage patterns.
HADOOP-10282: FairCallQueue
• Hadoop RPC Architecture
– Traditionally, Hadoop RPC internally admits incoming RPC calls into a single shared queue.
– Worker threads consume the incoming calls from that shared queue and process them.
– In an overloaded situation, calls spend more time waiting in the queue for a worker thread to become available.
– At the extreme, the queue overflows, which then requires rejecting the calls.
– This tends to punish all callers, not just the caller that triggered the unusually high load.
• RPC Congestion Control with FairCallQueue
– Replace single shared queue with multiple prioritized queues.
– Call is placed into a queue with priority selected based on the calling user’s current history.
– Calls are dequeued and processed with greater frequency from higher-priority queues.
– Under normal operations, when the RPC server can keep up with load, this is not noticeably different from the
original architecture.
– Under high load, this tends to deprioritize users triggering unusually high load, thus allowing room for other
processes to make progress. There is less risk of a single runaway job overwhelming a cluster.
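For illustration only, a hedged configuration sketch: FairCallQueue is selected per RPC server port, so the property names below assume the NameNode RPC port is 8020 (the keys follow the FairCallQueue documentation, but exact names can vary by release, and equivalent entries would normally go in core-site.xml rather than code):

import org.apache.hadoop.conf.Configuration;

public class FairCallQueueConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Assumption: the NameNode RPC server listens on port 8020.
    // Replace the single shared FIFO call queue with FairCallQueue.
    conf.set("ipc.8020.callqueue.impl", "org.apache.hadoop.ipc.FairCallQueue");
    // DecayRpcScheduler tracks each user's recent call volume and picks a priority queue.
    conf.set("ipc.8020.scheduler.impl", "org.apache.hadoop.ipc.DecayRpcScheduler");
    System.out.println(conf.get("ipc.8020.callqueue.impl"));
  }
}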
HADOOP-10597: RPC Server signals backoff to clients when all request queues are full
• Client-side backoff from overloaded RPC servers.
– Builds upon work of the RPC FairCallQueue.
– If an RPC server’s queue is full, then optionally send a signal to additional incoming clients to request backoff.
– Clients are aware of the signal and react by performing exponential backoff before sending additional calls (sketched below).
– Improves quality of service for clients when server is under heavy load. RPC calls that would have failed will
instead succeed, but with longer latency.
– Improves likelihood of server recovering, because client backoff will give it more opportunity to catch up.
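The client-side reaction can be sketched generically as randomized exponential backoff (an illustrative sketch, not the actual Hadoop RPC client code):

import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

public class ExponentialBackoffSketch {

  // Retries a call with randomized exponential backoff, as a client would after
  // receiving a backoff signal from an overloaded RPC server.
  public static <T> T callWithBackoff(Callable<T> call, int maxAttempts) throws Exception {
    Random random = new Random();
    long baseDelayMs = 100;
    for (int attempt = 1; ; attempt++) {
      try {
        return call.call();
      } catch (Exception serverBusy) {
        if (attempt >= maxAttempts) {
          throw serverBusy; // give up after the final attempt
        }
        // The delay doubles on each attempt, with jitter to avoid synchronized retries.
        long delayMs = (baseDelayMs << (attempt - 1)) + random.nextInt(100);
        TimeUnit.MILLISECONDS.sleep(delayMs);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical call standing in for an HDFS RPC such as getFileStatus.
    System.out.println(callWithBackoff(() -> "ok", 5));
  }
}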
HADOOP-12916: Allow RPC scheduler/callqueue backoff using response times
• More flexibility in back-off policies.
– Triggering backoff when the queue is full is in some sense too late. The problem has already grown too severe.
– Instead, track call response time, and trigger backoff when response time exceeds bounds.
– Any amount of queueing increases RPC response latency. Reacting to unusually high RPC response time can
prevent the problem from becoming so severe that the queue overflows.
Performance
• Garbage Collection
– NameNode heap must scale up in relation to the number of file system objects (files, directories, blocks, etc.).
– Recent hardware trends can cause larger DataNode heaps too. (Nodes have more disks and those disks are
larger, therefore the memory footprint has increased for tracking block state.)
– Much has been written about garbage collection tuning for large heap JVM processes.
– In addition to recommending configuration best practices, we can optimize the codebase to reduce garbage
collection pressure.
• Block Reporting
– The process by which DataNodes report information about their stored blocks to the NameNode.
– Full Block Report: a complete catalog of all of the node’s blocks, sent infrequently.
– Incremental Block Report: partial information about recently added or deleted blocks, sent more frequently.
– All block reporting occurs asynchronously with respect to user-facing operations, so it does not impact end-user latency directly.
– However, inefficiencies in block reporting can overwhelm a cluster to the point that it can no longer serve end user
operations sufficiently.
HDFS-7097: Allow block reports to be processed during checkpointing on standby name node
• Coarse-grained locking impedes block report processing.
– NameNode has a global lock required to enforce mutual exclusion for some operations.
– One such operation is checkpointing performed at the HA standby NameNode: process of creating a new fsimage
representing the full metadata state and beginning a new edit log. This can take a long time in large clusters.
– Block report processing also required holding the lock, and therefore could not proceed during a checkpoint.
• Coarse-grained lock contention can lead to cascading failure and downtime.
– Checkpointing holds lock.
– Frequent incremental block reports from DataNodes block while waiting to acquire the lock.
– Eventually consumes all available RPC handler threads, all waiting to acquire lock.
– In extreme case, blocks HA NameNode failover, because there is no RPC handler thread available to handle the
failover request.
– Even if HA failover can succeed, may still leave cluster in a state where it appears many nodes have gone dead,
because their blocked heartbeats couldn’t be processed.
• Solution: allow block report processing without holding global lock.
– Block reports now can be processed concurrently with a checkpoint in progress.
– Like most multi-threading and locking logic, required careful reasoning to ensure change was safe.
HDFS-7435: PB encoding of block reports is very inefficient
• Block report RPC message encoding can cause memory allocation inefficiency and garbage
collection churn.
– HDFS RPC messages are encoded using Protocol Buffers.
– Block reports encoded each block ID, length and generation stamp in a Protocol Buffers repeated long field.
– Behind the scenes, this becomes an ArrayList with a default capacity of 10.
– DataNodes in large clusters almost always send a larger block report than this, so ArrayList reallocation churn is almost
guaranteed.
– Data type contained in the ArrayList is Long (note capitalization: the boxed type, not primitive long).
– Boxing and unboxing causes additional allocation requirements.
• Solution: a more GC-friendly encoding of block reports.
– Within the Protocol Buffers RPC message, take over serialization directly.
– Manually encode number of longs, followed by list of primitive longs.
– Eliminates ArrayList reallocation costs.
– Eliminates boxing and unboxing costs by deserializing straight to primitive long.
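A simplified illustration of the allocation difference (not the actual Protocol Buffers code): buffering block fields as boxed Longs in a growing ArrayList versus writing primitive longs into an exactly sized array.

import java.util.ArrayList;
import java.util.List;

public class BlockReportEncodingSketch {

  // Boxed encoding: each add() may reallocate the backing array (default
  // capacity 10) and wraps every primitive value in a java.lang.Long object.
  static List<Long> encodeBoxed(long[] blockIds, long[] lengths, long[] genStamps) {
    List<Long> out = new ArrayList<>();            // capacity 10 by default
    for (int i = 0; i < blockIds.length; i++) {
      out.add(blockIds[i]);                        // autoboxing allocates a Long
      out.add(lengths[i]);
      out.add(genStamps[i]);
    }
    return out;
  }

  // Primitive encoding: one exact-sized array, no boxing, no reallocation.
  static long[] encodePrimitive(long[] blockIds, long[] lengths, long[] genStamps) {
    long[] out = new long[3 * blockIds.length];
    for (int i = 0; i < blockIds.length; i++) {
      out[3 * i] = blockIds[i];
      out[3 * i + 1] = lengths[i];
      out[3 * i + 2] = genStamps[i];
    }
    return out;
  }

  public static void main(String[] args) {
    long[] ids = {1L, 2L}, lens = {128L, 256L}, stamps = {1000L, 1001L};
    System.out.println(encodeBoxed(ids, lens, stamps).size());     // 6 boxed Longs
    System.out.println(encodePrimitive(ids, lens, stamps).length); // 6 primitives
  }
}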
HDFS-7609: Avoid retry cache collision when Standby NameNode loading edits
• Idempotence and at-most-once delivery of HDFS RPC messages.
– Some RPC message processing is inherently idempotent: can be applied multiple times, and the final result is still
the same. Example: setPermission.
– Other messages are not inherently idempotent, but the NameNode can still provide an “at-most-once” processing
guarantee by temporarily tracking recently executed operations by a unique call ID. Example: rename.
– The data structure that does this is called the RetryCache (see the sketch below).
– This is important in failure modes, such as an HA failover or a network partition, which may cause a client to send
the same message more than once.
• Erroneous multiple RetryCache entries for same operation.
– Duplicate entries caused slowdown.
– Particularly noticeable during an HA transition.
– Bug fix to prevent duplicate entries.
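Conceptually, the RetryCache behaves like a map keyed by client ID and call ID; the sketch below is a simplified illustration, not the NameNode implementation (which also expires entries and tracks in-progress calls):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class RetryCacheSketch {
  // Cached outcome of a non-idempotent operation, keyed by "clientId:callId".
  private final Map<String, Boolean> completed = new ConcurrentHashMap<>();

  // Executes the operation at most once per (clientId, callId); a retried
  // RPC with the same IDs returns the cached result instead of re-running.
  public boolean execute(String clientId, long callId, Supplier<Boolean> operation) {
    String key = clientId + ":" + callId;
    return completed.computeIfAbsent(key, k -> operation.get());
  }

  public static void main(String[] args) {
    RetryCacheSketch cache = new RetryCacheSketch();
    // First delivery performs the rename; a duplicate delivery (e.g. after an
    // HA failover) reuses the cached result rather than renaming twice.
    boolean first = cache.execute("client-1", 42L, () -> true);
    boolean retry = cache.execute("client-1", 42L, () -> { throw new IllegalStateException(); });
    System.out.println(first + " " + retry);
  }
}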
HDFS-9710: Change DN to send block receipt IBRs in batches
• Incremental block reports trigger multiple RPC calls.
– When a DataNode receives a block, it sends an incremental block report RPC to the NameNode immediately.
– Even when multiple blocks are received close together, each receipt translates to its own incremental block report RPC.
– With consideration of all DataNodes in a large cluster, this can become a huge number of RPC messages for the
NameNode to process.
• Solution: batch multiple block receipt events into a single RPC message.
– Reduces RPC overhead of sending multiple messages.
– Scales better with respect to number of nodes and number of blocks in a cluster.
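A conceptual sketch of the batching idea (illustrative only, not the DataNode's actual code): buffer block-received events and flush them to the NameNode as one report.

import java.util.ArrayList;
import java.util.List;

public class IncrementalBlockReportBatcher {
  private final List<Long> pendingBlockIds = new ArrayList<>();

  // Called when a block finishes being written to this DataNode.
  public synchronized void blockReceived(long blockId) {
    pendingBlockIds.add(blockId);
  }

  // Invoked periodically (e.g. by a heartbeat-style thread); sends one RPC
  // covering every block received since the previous flush.
  public synchronized void flush() {
    if (pendingBlockIds.isEmpty()) {
      return;
    }
    sendIncrementalBlockReport(new ArrayList<>(pendingBlockIds)); // one RPC, many receipts
    pendingBlockIds.clear();
  }

  private void sendIncrementalBlockReport(List<Long> blockIds) {
    // Placeholder for the RPC to the NameNode.
    System.out.println("IBR with " + blockIds.size() + " block receipts");
  }

  public static void main(String[] args) {
    IncrementalBlockReportBatcher batcher = new IncrementalBlockReportBatcher();
    batcher.blockReceived(1L);
    batcher.blockReceived(2L);
    batcher.blockReceived(3L);
    batcher.flush(); // previously: three RPCs; batched: a single RPC
  }
}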
Liveness
• "...make progress despite the fact that its concurrently executing components ("processes") may
have to "take turns" in critical sections, parts of the program that cannot be simultaneously run
by multiple processes." -Wikipedia
• DataNode Heartbeats
– Responsible for reporting health of a DataNode to the NameNode.
– Operational problems of managing load and performance can block timely heartbeat processing.
– Heartbeat processing at the NameNode can be surprisingly costly due to contention on a global lock and
asynchronous dispatch of commands (e.g. delete block).
• Blocked heartbeat processing can cause cascading failure and downtime.
– Blocked heartbeat processing can make the NameNode think DataNodes are not heartbeating at all, and
therefore are not running.
– DataNodes that stop running are flagged by the NameNode as dead.
– Too many dead DataNodes make the cluster inoperable as a whole.
– Dead DataNodes must have their replicas copied to other DataNodes to satisfy replication requirements.
– Erroneously flagging DataNodes as dead can cause a storm of wasteful re-replication activity.
HDFS-9239: DataNode Lifeline Protocol: an alternative protocol for reporting DataNode health
• The lifeline keeps the DataNode alive, despite conditions of unusually high load.
– Optionally run a separate RPC server within the NameNode dedicated to processing of lifeline messages sent by
DataNodes.
– Lifeline messages are a simplified form of heartbeat messages, but do not have the same costly requirements for
asynchronous command dispatch, and therefore do not need to contend on a shared lock.
– Even if the main NameNode RPC queue is overwhelmed, the lifeline still keeps the DataNode alive.
– Prevents erroneous and costly re-replication activity.
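For illustration, the lifeline server is enabled by giving it its own RPC address in the NameNode configuration. A hedged sketch: the property name below is the one introduced by HDFS-9239, while the host and port values are assumptions, and the entry would normally live in hdfs-site.xml rather than code:

import org.apache.hadoop.conf.Configuration;

public class LifelineConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Configuring a lifeline RPC address causes the NameNode to start the
    // dedicated lifeline server; DataNodes send lifeline messages to it.
    conf.set("dfs.namenode.lifeline.rpc-address", "nn1.example.com:8050");
    System.out.println(conf.get("dfs.namenode.lifeline.rpc-address"));
  }
}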
HDFS-9311: Support optional offload of NameNode HA service health checks to a separate RPC server
• RPC offload of HA health check and failover messages.
– Similar to problem of timely heartbeat message delivery.
– NameNode HA requires messages sent from the ZKFC (ZooKeeper Failover Controller) process to the
NameNode.
– Messages are related to handling periodic health checks and initiating shutdown and failover if necessary.
– A NameNode overwhelmed with unusually high load cannot process these messages.
– Delayed processing of these messages slows down NameNode failover, and thus creates a visibly prolonged
outage period.
– The lifeline RPC server can be used to offload HA messages, and similarly keep processing them even in the
case of unusually high load.
Optimizing Applications
• HDFS Utilization Patterns
– Sometimes it’s helpful to look a layer higher and assess what applications are doing with HDFS.
– FileSystem API unfortunately can make it too easy to implement inefficient call patterns.
HIVE-10223: Consolidate several redundant FileSystem API calls
• Hadoop FileSystem API can cause applications to make redundant RPC calls.
• Before:
if (fs.isFile(file)) { // RPC #1
...
} else if (fs.isDirectory(file)) { // RPC #2
...
}
• After:
FileStatus fileStatus = fs.getFileStatus(file); // Just 1 RPC
if (fileStatus.isFile()) { // Local, no RPC
...
} else if (fileStatus.isDirectory()) { // Local, no RPC
...
}
• Good for Hive, because it reduces latency associated with NameNode RPCs.
• Good for the whole ecosystem, because it reduces load on the NameNode, a shared service.
PIG-4442: Eliminate redundant RPC call to get file information in HPath
• A similar story of redundant RPC within Pig code.
• Before:
long blockSize = fs.getHFS().getFileStatus(path).getBlockSize(); // RPC #1
short replication = fs.getHFS().getFileStatus(path).getReplication(); // RPC #2
• After:
FileStatus fileStatus = fs.getHFS().getFileStatus(path); // Just 1 RPC
long blockSize = fileStatus.getBlockSize(); // Local, no RPC
short replication = fileStatus.getReplication(); // Local, no RPC
• Revealed from inspection of HDFS audit log.
– HDFS audit log shows a record of each file system operation executed against the NameNode.
– This continues to be one of the most significant sources of HDFS troubleshooting information.
– In this case, manual inspection revealed a suspicious pattern of multiple getfileinfo calls for the same path from a
Pig job submission.
HDFS-9924: Asynchronous HDFS Access
• Current Hadoop FileSystem API is inherently synchronous.
– Issue a single synchronous file system call.
– In the case of HDFS, that call is implemented with a synchronous RPC.
– Block waiting for the result.
– Then, client application may proceed.
• Some application usage patterns would benefit from asynchronous access.
– Some applications regularly issue a large sequence of multiple file system calls, with no data dependencies
between the results of those calls.
– For example, Hive partition logic can involve hundreds or thousands of rename operations, where each rename
can execute independently, with no data dependencies on the results of other renames.
public Future<Boolean> rename(Path src, Path dst) throws IOException;
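A usage sketch of how such an interface could be consumed, assuming a hypothetical asyncFs object that exposes the rename signature above (illustrative only; the eventual HDFS-9924 API may differ):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import org.apache.hadoop.fs.Path;

public class AsyncRenameSketch {

  // Hypothetical asynchronous interface mirroring the signature above.
  interface AsyncRenamer {
    Future<Boolean> rename(Path src, Path dst) throws IOException;
  }

  public static void main(String[] args) throws Exception {
    // Stand-in implementation; a real one would issue non-blocking RPCs.
    AsyncRenamer asyncFs = (src, dst) -> CompletableFuture.completedFuture(true);

    // Issue many independent renames without waiting on each result...
    List<Future<Boolean>> pending = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      pending.add(asyncFs.rename(new Path("/staging/part-" + i),
                                 new Path("/warehouse/part-" + i)));
    }
    // ...then collect the results once, overlapping the RPC round trips.
    int succeeded = 0;
    for (Future<Boolean> f : pending) {
      if (f.get()) {
        succeeded++;
      }
    }
    System.out.println("Renamed " + succeeded + " paths");
  }
}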
Summary
• A variety of recent enhancements have improved the ability of HDFS to serve as the foundational
storage layer of the Hadoop ecosystem.
• Optimization
– Performance
– Optimizing Applications
• Stabilization
– Liveness
– Managing Load
• Supportability
– Logging
– Troubleshooting
Thank you!
Q&A
Ad

More Related Content

What's hot (20)

Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
DataWorks Summit
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
DataWorks Summit
 
HBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and CompactionHBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and Compaction
DataWorks Summit/Hadoop Summit
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Databricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
DataWorks Summit
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
DataWorks Summit/Hadoop Summit
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
Inderaj (Raj) Bains
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
Best practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultBest practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at Renault
DataWorks Summit
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
Rommel Garcia
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
DataWorks Summit
 
Leveraging Docker for Hadoop build automation and Big Data stack provisioning
Leveraging Docker for Hadoop build automation and Big Data stack provisioningLeveraging Docker for Hadoop build automation and Big Data stack provisioning
Leveraging Docker for Hadoop build automation and Big Data stack provisioning
DataWorks Summit
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Distributed Locking in Kubernetes
Distributed Locking in KubernetesDistributed Locking in Kubernetes
Distributed Locking in Kubernetes
Rafał Leszko
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
Owen O'Malley
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
Jurriaan Persyn
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
DataWorks Summit
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
DataWorks Summit
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Databricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
DataWorks Summit
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
Inderaj (Raj) Bains
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
Best practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultBest practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at Renault
DataWorks Summit
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
DataWorks Summit
 
Leveraging Docker for Hadoop build automation and Big Data stack provisioning
Leveraging Docker for Hadoop build automation and Big Data stack provisioningLeveraging Docker for Hadoop build automation and Big Data stack provisioning
Leveraging Docker for Hadoop build automation and Big Data stack provisioning
DataWorks Summit
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Distributed Locking in Kubernetes
Distributed Locking in KubernetesDistributed Locking in Kubernetes
Distributed Locking in Kubernetes
Rafał Leszko
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
Owen O'Malley
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
Jurriaan Persyn
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
DataWorks Summit
 

Viewers also liked (20)

Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Impetus Technologies
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
DataWorks Summit/Hadoop Summit
 
Making It To Veteren Cassandra Status
Making It To Veteren Cassandra StatusMaking It To Veteren Cassandra Status
Making It To Veteren Cassandra Status
Eric Lubow
 
It's not the size of your cluster, it's how you use it
It's not the size of your cluster, it's how you use itIt's not the size of your cluster, it's how you use it
It's not the size of your cluster, it's how you use it
DataWorks Summit/Hadoop Summit
 
Tame that Beast
Tame that BeastTame that Beast
Tame that Beast
DataWorks Summit/Hadoop Summit
 
Presentation from physical to virtual to cloud emc
Presentation   from physical to virtual to cloud emcPresentation   from physical to virtual to cloud emc
Presentation from physical to virtual to cloud emc
xKinAnx
 
Contributing to Open Source - A Beginners Guide
Contributing to Open Source - A Beginners GuideContributing to Open Source - A Beginners Guide
Contributing to Open Source - A Beginners Guide
DataWorks Summit/Hadoop Summit
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
DataWorks Summit/Hadoop Summit
 
Rocking the World of Big Data at Centrica
Rocking the World of Big Data at CentricaRocking the World of Big Data at Centrica
Rocking the World of Big Data at Centrica
DataWorks Summit/Hadoop Summit
 
HDFS Deep Dive
HDFS Deep DiveHDFS Deep Dive
HDFS Deep Dive
Yifeng Jiang
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
DataWorks Summit/Hadoop Summit
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
DataWorks Summit/Hadoop Summit
 
Powering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big DataPowering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big Data
DataWorks Summit/Hadoop Summit
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
DataWorks Summit/Hadoop Summit
 
HDFS新機能総まとめin 2015 (日本Hadoopユーザー会 ライトニングトーク@Cloudera World Tokyo 2015 講演資料)
HDFS新機能総まとめin 2015 (日本Hadoopユーザー会 ライトニングトーク@Cloudera World Tokyo 2015 講演資料)HDFS新機能総まとめin 2015 (日本Hadoopユーザー会 ライトニングトーク@Cloudera World Tokyo 2015 講演資料)
HDFS新機能総まとめin 2015 (日本Hadoopユーザー会 ライトニングトーク@Cloudera World Tokyo 2015 講演資料)
NTT DATA OSS Professional Services
 
Java 9: The (G1) GC Awakens!
Java 9: The (G1) GC Awakens!Java 9: The (G1) GC Awakens!
Java 9: The (G1) GC Awakens!
Monica Beckwith
 
Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACID
DataWorks Summit/Hadoop Summit
 
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan GatesApache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Big Data Spain
 
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Spark Summit
 
On Demand HDP Clusters using Cloudbreak and Ambari
On Demand HDP Clusters using Cloudbreak and AmbariOn Demand HDP Clusters using Cloudbreak and Ambari
On Demand HDP Clusters using Cloudbreak and Ambari
DataWorks Summit/Hadoop Summit
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Impetus Technologies
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
DataWorks Summit/Hadoop Summit
 
Making It To Veteren Cassandra Status
Making It To Veteren Cassandra StatusMaking It To Veteren Cassandra Status
Making It To Veteren Cassandra Status
Eric Lubow
 
It's not the size of your cluster, it's how you use it
It's not the size of your cluster, it's how you use itIt's not the size of your cluster, it's how you use it
It's not the size of your cluster, it's how you use it
DataWorks Summit/Hadoop Summit
 
Presentation from physical to virtual to cloud emc
Presentation   from physical to virtual to cloud emcPresentation   from physical to virtual to cloud emc
Presentation from physical to virtual to cloud emc
xKinAnx
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
DataWorks Summit/Hadoop Summit
 
HDFS新機能総まとめin 2015 (日本Hadoopユーザー会 ライトニングトーク@Cloudera World Tokyo 2015 講演資料)
HDFS新機能総まとめin 2015 (日本Hadoopユーザー会 ライトニングトーク@Cloudera World Tokyo 2015 講演資料)HDFS新機能総まとめin 2015 (日本Hadoopユーザー会 ライトニングトーク@Cloudera World Tokyo 2015 講演資料)
HDFS新機能総まとめin 2015 (日本Hadoopユーザー会 ライトニングトーク@Cloudera World Tokyo 2015 講演資料)
NTT DATA OSS Professional Services
 
Java 9: The (G1) GC Awakens!
Java 9: The (G1) GC Awakens!Java 9: The (G1) GC Awakens!
Java 9: The (G1) GC Awakens!
Monica Beckwith
 
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan GatesApache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Big Data Spain
 
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Spark Summit
 
Ad

Similar to HDFS: Optimization, Stabilization and Supportability (20)

Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
Chris Nauroth
 
Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
Chris Nauroth
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
DataWorks Summit
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
Chris Nauroth
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Etu Solution
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Hortonworks
 
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDiscoSD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
Big Data Joe™ Rossi
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Community
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
hdhappy001
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
DataWorks Summit
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
MongoDB
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
Alluxio, Inc.
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
Alfresco Software
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
pbelko82
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
DataWorks Summit/Hadoop Summit
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
Owen O'Malley
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
Chris Nauroth
 
Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
Chris Nauroth
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
DataWorks Summit
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
Chris Nauroth
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Etu Solution
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Hortonworks
 
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDiscoSD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
Big Data Joe™ Rossi
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Community
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
hdhappy001
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
DataWorks Summit
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
MongoDB
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
Alluxio, Inc.
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
Alfresco Software
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
pbelko82
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
Owen O'Malley
 
Ad

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
DataWorks Summit/Hadoop Summit
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit/Hadoop Summit
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
DataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
DataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 

Recently uploaded (20)

How to Build an AI-Powered App: Tools, Techniques, and Trends
How to Build an AI-Powered App: Tools, Techniques, and TrendsHow to Build an AI-Powered App: Tools, Techniques, and Trends
How to Build an AI-Powered App: Tools, Techniques, and Trends
Nascenture
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier VroomAI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
UXPA Boston
 
Config 2025 presentation recap covering both days
Config 2025 presentation recap covering both daysConfig 2025 presentation recap covering both days
Config 2025 presentation recap covering both days
TrishAntoni1
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
May Patch Tuesday
May Patch TuesdayMay Patch Tuesday
May Patch Tuesday
Ivanti
 
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Alan Dix
 
Slack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teamsSlack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teams
Nacho Cougil
 
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Gary Arora
 
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Cyntexa
 
IT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information TechnologyIT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information Technology
SHEHABALYAMANI
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
ACE Aarhus - Team'25 wrap-up presentation
ACE Aarhus - Team'25 wrap-up presentationACE Aarhus - Team'25 wrap-up presentation
ACE Aarhus - Team'25 wrap-up presentation
DanielEriksen5
 

HDFS: Optimization, Stabilization and Supportability

  • 7. © Hortonworks Inc. 2011 Troubleshooting
    • Kerberos is hard.
      – Many moving parts: KDC, DNS, principals, keytabs and Hadoop configuration.
      – Management tools like Apache Ambari automate initial provisioning of principals, keytabs and configuration.
      – When it doesn’t work, finding root cause is challenging.
    • Metrics are vital for diagnosis of most operational problems.
      – Metrics must be capable of showing that there is a problem. (e.g. RPC call volume spike)
      – Metrics also must be capable of identifying the source of that problem. (e.g. user issuing RPC calls)
    Page 7 Architecting the Future of Big Data
  • 8. © Hortonworks Inc. 2011 HADOOP-12426: kdiag
    • Kerberos misconfiguration diagnosis.
      – Attempts to diagnose multiple sources of potential Kerberos misconfiguration problems.
      – DNS
      – Hadoop configuration files
      – KDC configuration
    • kdiag: a command-line tool for diagnosis of Kerberos problems
      – Automatically trigger Java diagnostics, such as -Dsun.security.krb5.debug.
      – Prints various environment variables, Java system properties and Hadoop configuration options related to security.
      – Attempt a login.
      – If keytab used, print principal information from keytab.
      – Print krb5.conf.
      – Validate kinit executable (used for ticket renewals).
    Page 8 Architecting the Future of Big Data
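    For example, an illustrative kdiag invocation might look like the following; the exact flags vary by Hadoop release, so check the usage output of hadoop kdiag on your cluster, and the keytab path and principal shown here are placeholders.

      hadoop kdiag \
        --keytab /etc/security/keytabs/nn.service.keytab \
        --principal nn/nn1.example.com@EXAMPLE.COM \
        --out kdiag.txt

    Running it as the service user on the affected host and attaching the output file to a support case captures most of the information listed above in one step.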
  • 9. © Hortonworks Inc. 2011 HDFS-6982: nntop
    • Find activity trends of HDFS operations.
      – HDFS audit log contains a record of each file system operation to the NameNode.
      – NameNode metrics contain raw counts of operations.
      – Identifying load trends from particular users or particular operations has always required ad-hoc scripting to analyze the above sources of information.
    • nntop: HDFS operation counts aggregated per operation and per user within time windows.
      – curl 'http://127.0.0.1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'
      – Look for the “TopUserOpCounts” section in the returned JSON.
        "ops": [ {
          "totalCount": 1,
          "opType": "delete",
          "topUsers": [ {
            "count": 1,
            "user": "chris"
          }
    Page 9 Architecting the Future of Big Data
  • 10. © Hortonworks Inc. 2011 HDFS-7182: JMX metrics aren't accessible when NN is busy
    • Lock contention while attempting to query NameNode JMX metrics.
      – JMX metrics are often queried in response to operational problems.
      – Some metrics data required acquisition of a lock inside the NameNode. If another thread held this lock, then metrics could not be accessed.
      – During times of high load, the lock is likely to be held by another thread.
      – At a time when the metrics are most likely to be needed, they were inaccessible.
      – This patch addressed the problem by acquiring the metrics data without requiring the lock to be held.
    Page 10 Architecting the Future of Big Data
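    For illustration, a minimal sketch of the general pattern the fix relies on (hypothetical class, not the actual HDFS-7182 change): keep the counters behind the metrics in atomic accumulators that writers update while doing their normal work, so the JMX read path never has to acquire the service's lock.

      import java.util.concurrent.atomic.LongAdder;

      // Hypothetical example: metrics readable without taking the service's global lock.
      public class LockFreeMetrics {
        private final LongAdder blocksProcessed = new LongAdder();
        private final LongAdder pendingDeletions = new LongAdder();

        // Called from worker threads, which may hold the service lock for other reasons.
        public void incrBlocksProcessed() { blocksProcessed.increment(); }
        public void incrPendingDeletions() { pendingDeletions.increment(); }
        public void decrPendingDeletions() { pendingDeletions.decrement(); }

        // Called from the JMX/metrics thread; never blocks on the service lock.
        public long getBlocksProcessed() { return blocksProcessed.sum(); }
        public long getPendingDeletions() { return pendingDeletions.sum(); }
      }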
  • 11. © Hortonworks Inc. 2011 Managing Load
    • RPC call load.
      – It’s too easy for a single inefficient job to overwhelm a cluster with too much RPC load.
      – RPC servers accept calls into a single shared queue.
      – Overflowing that queue causes increased latency and rejection of calls for all callers, not just the single inefficient job that caused the problem.
      – Load problems can be mitigated with enhanced admission control, client back-off and throttling policies tailored to real-world usage patterns.
    Page 11 Architecting the Future of Big Data
  • 12. © Hortonworks Inc. 2011 HADOOP-10282: FairCallQueue
    • Hadoop RPC Architecture
      – Traditionally, Hadoop RPC internally admits incoming RPC calls into a single shared queue.
      – Worker threads consume the incoming calls from that shared queue and process them.
      – In an overloaded situation, calls spend more time waiting in the queue for a worker thread to become available.
      – At the extreme, the queue overflows, which then requires rejecting the calls.
      – This tends to punish all callers, not just the caller that triggered the unusually high load.
    • RPC Congestion Control with FairCallQueue
      – Replace single shared queue with multiple prioritized queues.
      – Call is placed into a queue with priority selected based on the calling user’s current history.
      – Calls are dequeued and processed with greater frequency from higher-priority queues.
      – Under normal operations, when the RPC server can keep up with load, this is not noticeably different from the original architecture.
      – Under high load, this tends to deprioritize users triggering unusually high load, thus allowing room for other processes to make progress. There is less risk of a single runaway job overwhelming a cluster.
    Page 12 Architecting the Future of Big Data
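    As a configuration sketch only (the exact property names and the RPC port, 8020 here, should be verified against the documentation for your Hadoop release), FairCallQueue is enabled per RPC port in core-site.xml on the NameNode, along these lines:

      <!-- core-site.xml: illustrative sketch; verify keys for your release -->
      <property>
        <name>ipc.8020.callqueue.impl</name>
        <value>org.apache.hadoop.ipc.FairCallQueue</value>
      </property>
      <property>
        <name>ipc.8020.scheduler.impl</name>
        <value>org.apache.hadoop.ipc.DecayRpcScheduler</value>
      </property>

    The scheduler setting is what tracks each caller’s recent history and maps calls to priority levels; FairCallQueue then services the higher-priority queues more frequently.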
  • 13. © Hortonworks Inc. 2011 HADOOP-10597: RPC Server signals backoff to clients when all request queues are full
    • Client-side backoff from overloaded RPC servers.
      – Builds upon work of the RPC FairCallQueue.
      – If an RPC server’s queue is full, then optionally send a signal to additional incoming clients to request backoff.
      – Clients are aware of the signal, and react by performing exponential backoff before sending additional calls.
      – Improves quality of service for clients when server is under heavy load. RPC calls that would have failed will instead succeed, but with longer latency.
      – Improves likelihood of server recovering, because client backoff will give it more opportunity to catch up.
    Page 13 Architecting the Future of Big Data
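    Continuing the configuration sketch above (again, the property name follows HADOOP-10597 and should be verified for your release), the backoff signal is an opt-in per RPC port:

      <!-- core-site.xml: illustrative sketch; enables server-to-client backoff signaling -->
      <property>
        <name>ipc.8020.backoff.enable</name>
        <value>true</value>
      </property>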
  • 14. © Hortonworks Inc. 2011 HADOOP-12916: Allow RPC scheduler/callqueue backoff using response times
    • More flexibility in back-off policies.
      – Triggering backoff when the queue is full is in some sense too late. The problem has already grown too severe.
      – Instead, track call response time, and trigger backoff when response time exceeds bounds.
      – Any amount of queueing increases RPC response latency. Reacting to unusually high RPC response time can prevent the problem from becoming so severe that the queue overflows.
    Page 14 Architecting the Future of Big Data
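    A hedged sketch of how response-time-based backoff might be configured (the property names below are assumptions based on HADOOP-12916 and the DecayRpcScheduler; confirm them against your release before use):

      <!-- core-site.xml: illustrative sketch; back off when per-priority response times exceed thresholds -->
      <property>
        <name>ipc.8020.decay-scheduler.backoff.responsetime.enable</name>
        <value>true</value>
      </property>
      <property>
        <!-- one threshold per priority level, highest priority first -->
        <name>ipc.8020.decay-scheduler.backoff.responsetime.thresholds</name>
        <value>10s,20s,30s,40s</value>
      </property>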
  • 15. © Hortonworks Inc. 2011 Performance
    • Garbage Collection
      – NameNode heap must scale up in relation to the number of file system objects (files, directories, blocks, etc.).
      – Recent hardware trends can cause larger DataNode heaps too. (Nodes have more disks and those disks are larger, therefore the memory footprint has increased for tracking block state.)
      – Much has been written about garbage collection tuning for large heap JVM processes.
      – In addition to recommending configuration best practices, we can optimize the codebase to reduce garbage collection pressure.
    • Block Reporting
      – The process by which DataNodes report information about their stored blocks to the NameNode.
      – Full Block Report: a complete catalog of all of the node’s blocks, sent infrequently.
      – Incremental Block Report: partial information about recently added or deleted blocks, sent more frequently.
      – All block reporting occurs asynchronously with respect to user-facing operations, so it does not impact end user latency directly.
      – However, inefficiencies in block reporting can overwhelm a cluster to the point that it can no longer serve end user operations sufficiently.
    Page 15 Architecting the Future of Big Data
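    For the garbage collection bullets above, a minimal hadoop-env.sh sketch of the kind of NameNode JVM settings operators commonly apply (the heap size, collector choice and log path here are placeholders, not recommendations for any specific cluster):

      # hadoop-env.sh: illustrative sketch only; size the heap for your object counts
      export HADOOP_NAMENODE_OPTS="-Xms32g -Xmx32g \
        -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled \
        -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/hadoop/hdfs/nn-gc.log \
        ${HADOOP_NAMENODE_OPTS}"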
  • 16. © Hortonworks Inc. 2011 HDFS-7097: Allow block reports to be processed during checkpointing on standby name node
    • Coarse-grained locking impedes block report processing.
      – NameNode has a global lock required to enforce mutual exclusion for some operations.
      – One such operation is checkpointing performed at the HA standby NameNode: process of creating a new fsimage representing the full metadata state and beginning a new edit log. This can take a long time in large clusters.
      – Block report processing also required holding the lock, and therefore could not proceed during a checkpoint.
    • Coarse-grained lock contention can lead to cascading failure and downtime.
      – Checkpointing holds lock.
      – Frequent incremental block reports from DataNodes block waiting to acquire lock.
      – Eventually consumes all available RPC handler threads, all waiting to acquire lock.
      – In extreme case, blocks HA NameNode failover, because there is no RPC handler thread available to handle the failover request.
      – Even if HA failover can succeed, may still leave cluster in a state where it appears many nodes have gone dead, because their blocked heartbeats couldn’t be processed.
    • Solution: allow block report processing without holding global lock.
      – Block reports now can be processed concurrently with a checkpoint in progress.
      – Like most multi-threading and locking logic, required careful reasoning to ensure change was safe.
    Page 16 Architecting the Future of Big Data
  • 17. © Hortonworks Inc. 2011 HDFS-7435: PB encoding of block reports is very inefficient
    • Block report RPC message encoding can cause memory allocation inefficiency and garbage collection churn.
      – HDFS RPC messages are encoded using Protocol Buffers.
      – Block reports encoded each block ID, length and generation stamp in a Protocol Buffers repeated long field.
      – Behind the scenes, this becomes an ArrayList with a default capacity of 10.
      – DataNodes in large clusters almost always send a larger block report than this, so ArrayList reallocation churn is almost guaranteed.
      – Data type contained in the ArrayList is Long (note capitalization, not primitive long).
      – Boxing and unboxing causes additional allocation requirements.
    • Solution: a more GC-friendly encoding of block reports.
      – Within the Protocol Buffers RPC message, take over serialization directly.
      – Manually encode number of longs, followed by list of primitive longs.
      – Eliminates ArrayList reallocation costs.
      – Eliminates boxing and unboxing costs by deserializing straight to primitive long.
    Page 17 Architecting the Future of Big Data
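    A simplified Java sketch of the encoding idea (not the actual HDFS wire format, which lives inside a Protocol Buffers message and uses variable-length encoding; the class and method names here are hypothetical): write a count followed by the primitive longs, so neither side allocates Long objects or resizes an ArrayList.

      import java.io.ByteArrayOutputStream;
      import java.io.DataOutputStream;
      import java.io.IOException;

      // Hypothetical example: manual encoding of block fields as primitive longs.
      public class BlockListEncoder {
        // Each block contributes three longs: ID, length and generation stamp.
        public static byte[] encode(long[] blockFields) throws IOException {
          ByteArrayOutputStream bytes = new ByteArrayOutputStream();
          DataOutputStream out = new DataOutputStream(bytes);
          out.writeInt(blockFields.length);   // number of longs that follow
          for (long field : blockFields) {
            out.writeLong(field);             // primitive long, no boxing
          }
          out.flush();
          return bytes.toByteArray();
        }
      }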
  • 18. © Hortonworks Inc. 2011 HDFS-7609: Avoid retry cache collision when Standby NameNode loading edits
    • Idempotence and at-most-once delivery of HDFS RPC messages.
      – Some RPC message processing is inherently idempotent: can be applied multiple times, and the final result is still the same. Example: setPermission.
      – Other messages are not inherently idempotent, but the NameNode can still provide an “at-most-once” processing guarantee by temporarily tracking recently executed operations by a unique call ID. Example: rename.
      – The data structure that does this is called the RetryCache.
      – This is important in failure modes, such as an HA failover or a network partition, which may cause a client to send the same message more than once.
    • Erroneous multiple RetryCache entries for same operation.
      – Duplicate entries caused slowdown.
      – Particularly noticeable during an HA transition.
      – Bug fix to prevent duplicate entries.
    Page 18 Architecting the Future of Big Data
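    For illustration, a minimal sketch of an at-most-once guard (a hypothetical class, far simpler than the NameNode's actual RetryCache, which also expires entries and stores them compactly): results are cached under a key that identifies one logical call, so a retried delivery replays the cached result instead of re-executing.

      import java.util.concurrent.Callable;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.ConcurrentMap;

      // Hypothetical example: at-most-once execution for non-idempotent operations.
      public class SimpleRetryCache<R> {
        private final ConcurrentMap<String, R> completed = new ConcurrentHashMap<>();

        // Key should uniquely identify one logical call, e.g. clientId + ":" + callId.
        public R executeOnce(String key, Callable<R> operation) throws Exception {
          R cached = completed.get(key);
          if (cached != null) {
            return cached;               // duplicate delivery: replay the earlier result
          }
          R result = operation.call();   // first delivery: execute the operation
          completed.put(key, result);    // a real cache also handles concurrent duplicates and expiry
          return result;
        }
      }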
  • 19. © Hortonworks Inc. 2011 HDFS-9710: Change DN to send block receipt IBRs in batches
    • Incremental block reports trigger multiple RPC calls.
      – When a DataNode receives a block, it sends an incremental block report RPC to the NameNode immediately.
      – Even multiple block receipts translate to multiple individual incremental block report RPCs.
      – With consideration of all DataNodes in a large cluster, this can become a huge number of RPC messages for the NameNode to process.
    • Solution: batch multiple block receipt events into a single RPC message.
      – Reduces RPC overhead of sending multiple messages.
      – Scales better with respect to number of nodes and number of blocks in a cluster.
    Page 19 Architecting the Future of Big Data
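    A minimal sketch of the batching pattern (hypothetical names, not the actual DataNode code): receipts are buffered locally and drained into one report, so a single RPC carries many block receipt events.

      import java.util.ArrayList;
      import java.util.List;

      // Hypothetical example: batching block receipt events into a single report.
      public class ReceiptBatcher {
        private final List<Long> receivedBlockIds = new ArrayList<>();

        // Called whenever a block is received; no RPC happens here.
        public synchronized void blockReceived(long blockId) {
          receivedBlockIds.add(blockId);
        }

        // Called periodically, or once the batch reaches a size threshold.
        public synchronized List<Long> drainBatch() {
          List<Long> batch = new ArrayList<>(receivedBlockIds);
          receivedBlockIds.clear();
          return batch;                   // the caller sends this batch in one RPC
        }
      }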
  • 20. © Hortonworks Inc. 2011 Liveness
    • "...make progress despite the fact that its concurrently executing components ("processes") may have to "take turns" in critical sections, parts of the program that cannot be simultaneously run by multiple processes." -Wikipedia
    • DataNode Heartbeats
      – Responsible for reporting health of a DataNode to the NameNode.
      – Operational problems of managing load and performance can block timely heartbeat processing.
      – Heartbeat processing at the NameNode can be surprisingly costly due to contention on a global lock and asynchronous dispatch of commands (e.g. delete block).
    • Blocked heartbeat processing can cause cascading failure and downtime.
      – Blocked heartbeat processing can make the NameNode think DataNodes are not heartbeating at all, and therefore are not running.
      – DataNodes that stop running are flagged by the NameNode as dead.
      – Too many dead DataNodes make the cluster inoperable as a whole.
      – Dead DataNodes must have their replicas copied to other DataNodes to satisfy replication requirements.
      – Erroneously flagging DataNodes as dead can cause a storm of wasteful re-replication activity.
    Page 20 Architecting the Future of Big Data
  • 21. © Hortonworks Inc. 2011 HDFS-9239: DataNode Lifeline Protocol: an alternative protocol for reporting DataNode health
    • The lifeline keeps the DataNode alive, despite conditions of unusually high load.
      – Optionally run a separate RPC server within the NameNode dedicated to processing of lifeline messages sent by DataNodes.
      – Lifeline messages are a simplified form of heartbeat messages, but do not have the same costly requirements for asynchronous command dispatch, and therefore do not need to contend on a shared lock.
      – Even if the main NameNode RPC queue is overwhelmed, the lifeline still keeps the DataNode alive.
      – Prevents erroneous and costly re-replication activity.
    Page 21 Architecting the Future of Big Data
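    A configuration sketch of how the lifeline server is turned on (the property comes from HDFS-9239; the hostname and port are placeholders, and the key should be checked against hdfs-default.xml for your release):

      <!-- hdfs-site.xml: illustrative sketch; giving the lifeline RPC server its own address enables it -->
      <property>
        <name>dfs.namenode.lifeline.rpc-address</name>
        <value>nn1.example.com:8050</value>
      </property>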
  • 22. © Hortonworks Inc. 2011 HDFS-9311: Support optional offload of NameNode HA service health checks to a separate RPC server.
    • RPC offload of HA health check and failover messages.
      – Similar to problem of timely heartbeat message delivery.
      – NameNode HA requires messages sent from the ZKFC (ZooKeeper Failover Controller) process to the NameNode.
      – Messages are related to handling periodic health checks and initiating shutdown and failover if necessary.
      – A NameNode overwhelmed with unusually high load cannot process these messages.
      – Delayed processing of these messages slows down NameNode failover, and thus creates a visibly prolonged outage period.
      – The lifeline RPC server can be used to offload HA messages, and similarly keep processing them even in the case of unusually high load.
    Page 22 Architecting the Future of Big Data
  • 23. © Hortonworks Inc. 2011 Optimizing Applications
    • HDFS Utilization Patterns
      – Sometimes it’s helpful to look a layer higher and assess what applications are doing with HDFS.
      – FileSystem API unfortunately can make it too easy to implement inefficient call patterns.
    Page 23 Architecting the Future of Big Data
  • 24. © Hortonworks Inc. 2011 HIVE-10223: Consolidate several redundant FileSystem API calls.
    • Hadoop FileSystem API can cause applications to make redundant RPC calls.
    • Before:
      if (fs.isFile(file)) { // RPC #1
        ...
      } else if (fs.isDirectory(file)) { // RPC #2
        ...
      }
    • After:
      FileStatus fileStatus = fs.getFileStatus(file); // Just 1 RPC
      if (fileStatus.isFile()) { // Local, no RPC
        ...
      } else if (fileStatus.isDirectory()) { // Local, no RPC
        ...
      }
    • Good for Hive, because it reduces latency associated with NameNode RPCs.
    • Good for the whole ecosystem, because it reduces load on the NameNode, a shared service.
    Page 24 Architecting the Future of Big Data
  • 25. © Hortonworks Inc. 2011 PIG-4442: Eliminate redundant RPC call to get file information in HPath.
    • A similar story of redundant RPC within Pig code.
    • Before:
      long blockSize = fs.getHFS().getFileStatus(path).getBlockSize(); // RPC #1
      short replication = fs.getHFS().getFileStatus(path).getReplication(); // RPC #2
    • After:
      FileStatus fileStatus = fs.getHFS().getFileStatus(path); // Just 1 RPC
      long blockSize = fileStatus.getBlockSize(); // Local, no RPC
      short replication = fileStatus.getReplication(); // Local, no RPC
    • Revealed from inspection of HDFS audit log.
      – HDFS audit log shows a record of each file system operation executed against the NameNode.
      – This continues to be one of the most significant sources of HDFS troubleshooting information.
      – In this case, manual inspection revealed a suspicious pattern of multiple getfileinfo calls for the same path from a Pig job submission.
    Page 25 Architecting the Future of Big Data
  • 26. © Hortonworks Inc. 2011 HDFS-9924: Asynchronous HDFS Access
    • Current Hadoop FileSystem API is inherently synchronous.
      – Issue a single synchronous file system call.
      – In the case of HDFS, that call is implemented with a synchronous RPC.
      – Block waiting for the result.
      – Then, client application may proceed.
    • Some application usage patterns would benefit from asynchronous access.
      – Some applications regularly issue a large sequence of multiple file system calls, with no data dependencies between the results of those calls.
      – For example, Hive partition logic can involve hundreds or thousands of rename operations, where each rename can execute independently, with no data dependencies on the results of other renames.
      public Future<Boolean> rename(Path src, Path dst) throws IOException;
    Page 26 Architecting the Future of Big Data
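    Until an asynchronous FileSystem API is generally available, applications can approximate the benefit themselves. The sketch below is a hypothetical helper, not the HDFS-9924 API: it overlaps many independent renames by issuing the existing synchronous calls from a thread pool.

      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.Callable;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;

      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      // Hypothetical example: overlapping independent rename calls with a thread pool.
      public class ParallelRenamer {
        public static void renameAll(FileSystem fs, List<Path[]> srcDstPairs) throws Exception {
          ExecutorService pool = Executors.newFixedThreadPool(16);
          try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (Path[] pair : srcDstPairs) {
              final Path src = pair[0];
              final Path dst = pair[1];
              // Each rename is still a synchronous RPC, but the RPCs overlap in time.
              Callable<Boolean> rename = () -> fs.rename(src, dst);
              results.add(pool.submit(rename));
            }
            for (Future<Boolean> result : results) {
              result.get();               // surface any failure from the renames
            }
          } finally {
            pool.shutdown();
          }
        }
      }

    This only hides client-side latency; each rename still costs the NameNode one RPC, so a true asynchronous API remains attractive.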
  • 27. © Hortonworks Inc. 2011 Summary
    • A variety of recent enhancements have improved the ability of HDFS to serve as the foundational storage layer of the Hadoop ecosystem.
    • Optimization
      – Performance
      – Optimizing Applications
    • Stabilization
      – Liveness
      – Managing Load
    • Supportability
      – Logging
      – Troubleshooting
    Page 27 Architecting the Future of Big Data
  • 28. © Hortonworks Inc. 2011 Thank you! Q&A

Editor's Notes

  • #3: Thank Arpit.
  • #4: We’ll look at specific Apache JIRA issues, some not yet shipped, some still in progress. Small patches often yield big wins. Sometimes those patches are even small enough to fit on a PowerPoint slide, as you’re about to see. Some are larger.
  • #5: These are common challenges for any large Java codebase, not just specific to Hadoop.
  • #6: Too little logging. Size of code change: 3 characters. Without this extra logging information, diagnosis is very challenging.
  • #7: Too much logging.
  • #8: Kerberos is notorious for obtuse error messages that don’t directly point out root cause.
  • #9: These are often steps we need to follow in any case that requires Kerberos troubleshooting. Codifying these steps into a standard tool makes gathering this information easier and more consistent.
  • #10: Helps find the naughty user who is overwhelming your cluster.
  • #15: “smoothing”
  • #16: In contrast to managing an overloaded situation, how can we more effectively handle more load?
  • #18: Garbage collection friendly data structures are particularly relevant to the NameNode, which has a large heap size requirement.
  • #19: Data structure not efficient for duplicate entries. (Not the use case.)
  • #24: We’ve talked about how HDFS can better react to overloaded conditions, and we’ve talked about improving HDFS to handle more total load. What is the source of that load? Is it legitimate?
  • #26: I encourage you to explore and analyze the HDFS audit log in your clusters.
  • #27: Improving the API to encourage more efficient applications.
  • #28: Performance of HDFS itself and also optimizing applications.