Consistent Hashing

When dealing with distributed systems, we often need data partitioning, which refers to distributing the data across different nodes.

Data partitioning comes with its own challenges.

  1. How do we keep track of which node a particular piece of data is stored on?
  2. If a node is added or removed, how do we track which data has to move between the existing nodes and the new ones?

Can simple hashing solve the problem?

Whenever we come across a problem where we need to look up information that belongs to a particular key, the first thing that comes to mind is hashing. To find which node a particular piece of data belongs to, we can simply hash the data key (using a suitable hash function like MD5) to get a number, and then take that number modulo the number of nodes. This addresses the first challenge. The problem arises when we add or remove a node. When the total number of nodes changes, the node value obtained from the modulo changes for almost every key, which implies remapping all the keys and moving their data between nodes. This is a big issue in systems where the number of nodes keeps changing depending on the load.

Consider the data key to be an Employee ID and the hash function to be 'h':

Total Number of Nodes = N

h(Employee ID) -> A number

Node Value = h(Employee ID) % N        
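
As a rough illustration (not part of the original write-up), here is a minimal Python sketch of this modulo approach, using MD5 as the hash function; the employee-ID keys and the node counts 4 and 5 are made-up example values. It shows how many keys get remapped when a single node is added:

import hashlib

def node_for(key, num_nodes):
    # Hash the key with MD5 and map the result to a node with modulo
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes

keys = [f"EMP-{i}" for i in range(1000)]
before = {k: node_for(k, 4) for k in keys}   # cluster with 4 nodes
after = {k: node_for(k, 5) for k in keys}    # one node added
moved = sum(1 for k in keys if before[k] != after[k])
print(moved, "of", len(keys), "keys had to move")  # typically around 80%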

Consistent Hashing to the rescue!!

The main motivation behind this technique is to minimize the amount of data movement when the number of nodes changes. In addition to hashing the data keys, we also hash the node names; note that both hashes must produce values in the same range. Suppose the hash values are 32-bit numbers. We glue the first and last values of the range together to form a circle, so that every value in the range is a point on the circle. The N nodes then divide the circle into N segments, and every data key falls into exactly one of them. Once all the hash values of data keys and nodes are placed on the circle, a key is assigned to a node as follows: start at the key's hash value and move clockwise until you reach a node's hash value; the key belongs to that node. Equivalently, all the keys encountered before reaching a node belong to that node. In the ideal situation where the data keys are distributed uniformly across the N segments, by symmetry each node holds a 1/N fraction of the total keys.
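
A minimal Python sketch of this ring lookup, assuming a 32-bit hash space derived from MD5 and illustrative node names (none of this is from the original article), could look like this:

import bisect
import hashlib

RING_SIZE = 2 ** 32  # 32-bit hash space, with the ends glued into a circle

def ring_hash(value):
    # Map any string (a data key or a node name) to a point on the circle
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

class ConsistentHashRing:
    def __init__(self, nodes):
        # Keep node positions sorted so "moving clockwise" becomes a binary search
        pairs = sorted((ring_hash(n), n) for n in nodes)
        self._hashes = [h for h, _ in pairs]
        self._nodes = [n for _, n in pairs]

    def node_for(self, key):
        # The key belongs to the first node at or after its hash, wrapping around
        h = ring_hash(key)
        idx = bisect.bisect_left(self._hashes, h) % len(self._hashes)
        return self._nodes[idx]

ring = ConsistentHashRing(["node-1", "node-2", "node-3"])
print(ring.node_for("EMP-42"))  # one of node-1 / node-2 / node-3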


For example, if we add another Node 4 on the ring between Node 1 and Node 2, data movement happens only for the keys that fall between Node 1 and Node 4; those keys move from Node 2 to Node 4, and every other key stays where it was.
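
Continuing the ConsistentHashRing sketch above (same assumptions, made-up key count), we can check this experimentally:

keys = [f"EMP-{i}" for i in range(1000)]
old_ring = ConsistentHashRing(["node-1", "node-2", "node-3"])
new_ring = ConsistentHashRing(["node-1", "node-2", "node-3", "node-4"])

before = {k: old_ring.node_for(k) for k in keys}
after = {k: new_ring.node_for(k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(moved, "of", len(keys), "keys had to move")  # roughly 1/4, not ~80%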


Concept of Virtual Nodes

The approach above works well only in the ideal scenario where the data keys fall uniformly across the segments. In the real world this is unlikely. If the data is not evenly distributed, some nodes can become hotspots that hold far more keys than others. To overcome this, we create 'k' virtual copies of each node, for example by using k different hash functions to get k different positions for the same node, and then proceed exactly as before. Spreading virtual nodes around the ring makes hotspots far less likely, because the load is spread more evenly. And if the cluster has nodes of varying capacity, we can give a particular node more or fewer virtual nodes in proportion to its capacity.
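
Here is a self-contained sketch of the virtual-node variant. Instead of k separate hash functions, it hashes the node name with a replica suffix, which is a common stand-in for the same idea; the replica count of 100 and the per-node weights are made-up values:

import bisect
import hashlib

RING_SIZE = 2 ** 32

def ring_hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

class VirtualNodeRing:
    def __init__(self, node_weights, replicas=100):
        # node_weights maps a node name to a relative capacity; a bigger node
        # gets proportionally more virtual copies placed on the circle.
        # Hashing "name#i" plays the role of the k different hash functions.
        pairs = sorted(
            (ring_hash(f"{name}#{i}"), name)
            for name, weight in node_weights.items()
            for i in range(replicas * weight)
        )
        self._hashes = [h for h, _ in pairs]
        self._nodes = [n for _, n in pairs]

    def node_for(self, key):
        h = ring_hash(key)
        idx = bisect.bisect_left(self._hashes, h) % len(self._hashes)
        return self._nodes[idx]

ring = VirtualNodeRing({"node-1": 1, "node-2": 1, "node-3": 2})  # node-3 has double capacity
print(ring.node_for("EMP-42"))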

Usage of Consistent Hashing

Akamai Technologies, Teradata, BitTorrent, Amazon DynamoDB, and Apache Cassandra all make use of consistent hashing to distribute load and data across nodes.
