HDFS in a nutshell

Main characteristics and features of HDFS

The original article is available on my personal website.

Introduction

In this article, I am going to introduce you to the main characteristics and features of HDFS. If you haven’t heard about HDFS, you should know that it is a highly available, scalable, distributed filesystem capable of storing a huge quantity of data. Its name stands for Hadoop Distributed File System (it is, in fact, officially part of Hadoop), and it is an optimal choice for applications that access and process “big data” data sets.

The main characteristics of HDFS are:

  • Data Reliability
  • High availability (Fault tolerance)
  • High throughput and scalability

It is not that common to find all these characteristics combined in a single storage system, though. So, how can this filesystem achieve such ambitious and seemingly competing goals? I will give you an overview of the main concepts of HDFS to show how this fascinating filesystem achieves them.

Overview of HDFS


First of all, let’s define how HDFS can be classified among filesystem categories. HDFS is an object store: this means that it is used to store objects that do not change over time. A file stored on HDFS cannot, in fact, be edited in place. If a modification is required, a new version of the object must be stored, and the old version is entirely replaced. This paradigm is especially useful for applications that follow the WORM pattern (Write Once, Read Many). These applications are designed to write a huge amount of data once and then read it many times without modifying the original objects.
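
As a concrete illustration, here is a minimal sketch using the Hadoop Java API (the path and file content are hypothetical): a new version of an object is written by recreating the file with the overwrite flag, rather than editing it in place.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OverwriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
            FileSystem fs = FileSystem.get(conf);        // for HDFS this is a DistributedFileSystem
            Path path = new Path("/data/report.csv");    // hypothetical path

            // There is no "edit in place": to change the content, the file is
            // rewritten as a whole (overwrite = true replaces the old version).
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeBytes("id,value\n1,42\n");
            }
        }
    }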

Components of HDFS


The organisation of HDFS relies on two main separate components:

  • Namenodes
  • Datanodes

These are two distinct services that make up an HDFS implementation. Each has its own specific responsibilities, and together they provide the full capabilities of the filesystem.

In particular, the two components have the following functions in HDFS:

  • Namenode services run on the master nodes of the cluster. They store the metadata of the filesystem: in other words, they keep track of information about the data itself and about the nodes that make up the HDFS cluster. The metadata is usually kept both in memory, for quick access, and on persistent storage, for reliability in case a Namenode restarts.
  • Datanode services, in turn, run on the slave nodes and are responsible for storing the actual data of the cluster. This is where the real storage of your objects takes place.

You should think of the master nodes, those hosting the Namenode services, as the orchestrators of an HDFS cluster. They maintain and manage the slave nodes and assign tasks to them. Master nodes also execute filesystem namespace operations such as opening, closing, and renaming files and directories. Every master node should be deployed on reliable hardware, given its critical role in an HDFS cluster.
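
To make the role of the Namenode more tangible, the sketch below (Hadoop Java API, hypothetical paths) performs purely namespace operations; these are answered by the Namenode’s metadata without any Datanode block being read or written.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOpsExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Pure namespace operations, served by the Namenode (hypothetical paths).
            fs.mkdirs(new Path("/projects/2023"));
            fs.rename(new Path("/projects/2023"), new Path("/projects/archive"));
            for (FileStatus status : fs.listStatus(new Path("/projects"))) {
                System.out.println(status.getPath() + " len=" + status.getLen());
            }
        }
    }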

As for the slave nodes, an HDFS cluster can contain a very large number of them to handle the actual data storage. The Datanode services running on slave nodes, in fact, serve the read and write requests coming from the clients. They also perform block creation, deletion, and replication upon instruction from the Namenode. Blocks are the chunks into which every file is divided; each block is written to one Datanode and then replicated to other Datanodes, according to the number of replicas defined for the cluster. Thanks to this replication, which provides redundancy, Datanodes can be deployed on commodity hardware that can easily be replaced without affecting cluster operations.
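
Blocks and replicas also surface in the client API. As a sketch (hypothetical file name, assuming the file already exists), you can ask the Namenode which Datanodes hold each block of a file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/big-file.bin")); // hypothetical file

            // One BlockLocation per block; each lists the Datanodes holding a replica.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }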

Characteristics of HDFS


So far, we have described some of the main features of HDFS:

  • The separation between master and slave nodes
  • The division of every single file into blocks of predefined, fixed length
  • The replication of every single block across a set number of Datanodes

These features are what make the goals of HDFS achievable. Let’s analyse the characteristics of HDFS one by one to see how they all contribute to making this distributed filesystem suitable for the purpose it was designed for.

Data Reliability

The fact that every block composing a file is replicated across different Datanodes ensures that the failure of a certain number of slave nodes does not affect the availability of the data stored in the cluster. For instance, with the default replication factor of 3, two of the nodes hosting a given block can fail without affecting the availability of the file itself: the affected block can still be retrieved from the surviving node.

When a node fails, the Namenode has the task of rebalancing the cluster: the replication process is triggered for the blocks that have fallen below the replication factor, until the desired threshold is restored.

Similarly, if a failed node rejoins the cluster, the replication factor of the blocks it hosts will exceed the desired value. In this case, the Namenode will take care of deleting the excess replicas, thus rebalancing the cluster.
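
The replication factor is not only a cluster-wide default; it can also be changed per file. A minimal sketch (Hadoop Java API, hypothetical path) is shown below; the Namenode then creates or deletes replicas in the background until the new target is met.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/data/important.log");   // hypothetical path

            // Raise the target number of replicas for this file to 5.
            // The Namenode schedules the extra copies asynchronously.
            boolean accepted = fs.setReplication(path, (short) 5);
            System.out.println("replication change accepted: " + accepted);
        }
    }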

High availability (Fault tolerance)

The high availability of an HDFS cluster relies on the redundancy of the nodes that compose it. We have just seen that there are several Datanodes in the cluster and that the blocks stored on them are replicated to ensure fault tolerance. If a Datanode fails, the client can read the data from another one that stores the same information. This is possible even when a failure occurs in the middle of a read: the client simply connects to another Datanode and resumes the reading process from the affected block.

Another potential point of failure in a cluster is the Namenode. Namenodes are extremely important for the functioning of HDFS, both because they have a fundamental role in keeping the cluster in a good state and because they store the filesystem metadata.

Starting from Hadoop 2.0.0, the reliability of the master nodes has been improved by moving to an active/passive multi-master scenario. An HDFS cluster can, in fact, run two Namenodes: an active one that handles all client operations, and a passive (standby) one that also stores the metadata and is kept in sync with the active Namenode. The passive Namenode maintains an up-to-date state in order to provide a fast failover if the active Namenode fails. The cluster ensures that only one Namenode is active at any given time, since having more than one active Namenode would corrupt the data. HDFS provides fencing strategies to avoid a “split-brain” scenario, where more than one Namenode acts as if it were active.
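
From the client’s side, the two Namenodes are addressed through a single logical name. The sketch below shows, with purely illustrative values (“mycluster”, “nn1”, “nn2” and the hostnames are assumptions), the kind of client configuration that enables transparent failover between the active and the standby Namenode; in practice these keys usually live in hdfs-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HaClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Logical nameservice instead of a single Namenode host (illustrative values).
            conf.set("fs.defaultFS", "hdfs://mycluster");
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
            // Proxy provider that retries against the other Namenode on failover.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                     "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

            FileSystem fs = FileSystem.get(conf);   // talks to whichever Namenode is active
            System.out.println(fs.getUri());
        }
    }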

High throughput and scalability

The distributed and replicated structure of HDFS clusters is capable of providing high throughput even when a large number of requests are served by the filesystem. The nature of HDFS, with its organisation into blocks stored on different nodes, makes it possible to parallelise multiple concurrent I/O operations. It is easy, in fact, for a client to open multiple data streams towards different Datanodes, fetch several blocks at once and rebuild the correct structure of the file once all the blocks have been received.

The distributed nature of HDFS is also helpful to reduce network traffic to a minimum. The strategy is to bring the processing activities as close as possible to the data, not the other way round.

An HDFS cluster can also scale very easily when more space or higher throughput is required: it is just a matter of adding Datanodes to the cluster. The new slave nodes announce themselves to the Namenode and become immediately usable, with no further effort.

These characteristics contribute to making HDFS an optimal choice for WORM applications: software designed to write a huge amount of data that will not be modified, but will be read many times by several different processes.

Read and write operations in detail


As explained above, the data that makes up every file in HDFS is split into chunks of equal size, called blocks. Read and write operations therefore work at block level: the client first agrees with the Namenode on where to read or write blocks, and then opens a data stream with the Datanodes to which the storage of those blocks is delegated. At that point, it is the responsibility of the Datanodes to carry out the block replication process within the cluster.

The details of how every I/O process works are explained below.

Read operations

These are the phases of a read operation in HDFS (a minimal client-side example follows the list):

  1. A client initiates a read request by calling the ‘open()’ method of the FileSystem object; for HDFS, this object is an instance of DistributedFileSystem, a subclass of FileSystem.
  2. This object connects to the Namenode using RPC and retrieves metadata information, including the locations of the first few blocks of the file.
  3. For each block, the addresses of the Datanodes holding a copy are returned.
  4. An object of type FSDataInputStream is returned to the client. The FSDataInputStream object wraps a DFSInputStream, which takes care of the interactions with the Datanodes and the Namenode. The client invokes the ‘read()’ method, which causes the DFSInputStream object to establish a connection with the first Datanode.
  5. As the client invokes the ‘read()’ method repeatedly, the data is read in the form of a stream. This process continues until the end of the block is reached.
  6. The DFSInputStream object closes the connection and moves on to locate the next Datanode for the next block.
  7. The client calls the ‘close()’ method.
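
From the application’s point of view, all of the steps above are hidden behind a few calls. A minimal client-side sketch (hypothetical path) could look like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());   // DistributedFileSystem for HDFS

            // open() contacts the Namenode for block locations (steps 1-4);
            // read() streams the blocks from the Datanodes (steps 5-6).
            try (FSDataInputStream in = fs.open(new Path("/data/report.csv"))) {  // hypothetical path
                byte[] buffer = new byte[4096];
                int bytesRead;
                while ((bytesRead = in.read(buffer)) > 0) {
                    System.out.write(buffer, 0, bytesRead);
                }
            }   // close() is step 7
            System.out.flush();
        }
    }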

Write operations

This is, instead, how a write operation takes place in HDFS (again, see the sketch after the list):

  1. A client initiates a write operation by calling the ‘create()’ method of the DistributedFileSystem object.
  2. The DistributedFileSystem object connects to the Namenode using an RPC call and initiates the creation of the new file. The Namenode verifies that the file does not exist already and that the client has the correct permissions to create the new file. If the operation succeeds, a new metadata record is created by the Namenode.
  3. An object of type FSDataOutputStream is returned to the client.
  4. The FSDataOutputStream object wraps a DFSOutputStream, which handles the communication with the Datanodes and the Namenode. The data written by the client is split into packets, which are enqueued into a DataQueue.
  5. A DataStreamer object consumes this DataQueue. The DataStreamer object also asks the Namenode for the allocation of new blocks.
  6. The replication process creates a pipeline across the Datanodes that will store the blocks.
  7. The DataStreamer object writes packets into the first Datanode in the pipeline.
  8. Every Datanode in the pipeline stores the packets it receives and forwards them to the next Datanode in the pipeline. The process is repeated in cascade until the number of Datanodes specified by the replication factor is reached.
  9. The DFSOutputStream object also maintains an ‘Ack Queue’, which holds the packets that are waiting for acknowledgement from the Datanodes.
  10. Once all the Datanodes in the pipeline have acknowledged a packet, it is removed from the ‘Ack Queue’. In the event of any failure, the packets are requeued.
  11. After a client has written the data, it calls the ‘close()’ method. Remaining data packets are flushed and acknowledged.
  12. The Namenode is contacted to inform it that the file write operation has been completed.
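
Again, from the client’s perspective, this entire pipeline is driven by a handful of calls. A minimal sketch (hypothetical path and content):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());   // DistributedFileSystem for HDFS

            // create() asks the Namenode to register the new file (steps 1-3);
            // writes are packetised and pushed down the Datanode pipeline (steps 4-10).
            try (FSDataOutputStream out = fs.create(new Path("/data/new-report.csv"))) {  // hypothetical path
                out.writeBytes("id,value\n1,42\n2,43\n");
            }   // close() flushes the remaining packets and completes the file (steps 11-12)
        }
    }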

Security in I/O operations

Please note that, for the sake of simplicity, the lists above do not include the security checks needed to determine whether the client is authorised to perform the requested operation.

It is the Namenode that checks the client’s authorisation. If access is granted, the Namenode itself gives a security token to the client. This token is used throughout the I/O process and is sent to both the Namenode and the Datanodes with every communication, to prove that the client can access the requested filesystem resources.
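
On a secured (Kerberos-enabled) cluster, the client typically authenticates before obtaining the FileSystem handle, so that the Namenode can authorise it and hand out the tokens mentioned above. A minimal sketch, assuming a hypothetical principal and keytab path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SecureClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("hadoop.security.authentication", "kerberos");

            // Authenticate with a keytab (hypothetical principal and file).
            UserGroupInformation.setConfiguration(conf);
            UserGroupInformation.loginUserFromKeytab("analyst@EXAMPLE.COM",
                                                     "/etc/security/keytabs/analyst.keytab");

            // Subsequent Namenode/Datanode calls carry the tokens transparently.
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Authenticated as: " + UserGroupInformation.getCurrentUser());
        }
    }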

Conclusions

This article introduced you to the foundations of HDFS: an object-based filesystem that is highly available, scalable, distributed and capable of storing a huge quantity of data.

You have seen that the main characteristics of HDFS are data reliability, high availability (fault tolerance), and high throughput and scalability.

Along the way, you have encountered the main features of HDFS: the separation between master and slave nodes, the division of every file into blocks, and the replication of every block across a set number of Datanodes.

Taken together, these notions allow you to list and explain the major features of the HDFS filesystem.

At the end of the article, you also went through the process that lies behind read and write operations. This last part showed how the basic filesystem functions fit into the overall structure and organisation of HDFS.
