Consistent hashing is a method of distributing requests among a dynamic set of servers. It allows for more efficient use of resources and improved performance, and makes it easy to add or remove servers without disrupting the system. Before we discuss consistent hashing, it is important to understand what hashing is.
Hashing is a technique for converting data into a fixed-size output called a hash or a digest. A simple example is modulo hashing in a distributed caching system with 4 servers (s0, s1, s2, s3). The hash function takes the IP address of the incoming request (192.168.1.100) as the key, treats it as an integer, and divides it by the total number of servers (4). The remainder is the index of the server to which the request is routed.
hash(192.168.1.100) = (192.168.1.100 % 4) = 0
So, in the above example, the incoming request is assigned to s0. Given below is an example of how the mapping to different servers looks when multiple requests are sent.
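As a rough illustration of the modulo scheme described above, here is a minimal Python sketch. The function name `pick_server` and the extra sample IPs are assumptions made for this example. The IP is converted to its 32-bit integer form before taking the remainder.

```python
import ipaddress

SERVERS = ["s0", "s1", "s2", "s3"]

def pick_server(ip: str, servers=SERVERS) -> str:
    """Map an IPv4 address to a server with modulo hashing."""
    key = int(ipaddress.ip_address(ip))  # 192.168.1.100 -> 3232235876
    return servers[key % len(servers)]

for ip in ["192.168.1.100", "192.168.1.101", "192.168.1.102", "192.168.1.103"]:
    print(ip, "->", pick_server(ip))
# 192.168.1.100 -> s0
# 192.168.1.101 -> s1
# 192.168.1.102 -> s2
# 192.168.1.103 -> s3
```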
Limitations of the Modulo Operator as a Hash Function
In practice, using the modulo operator as a simple hash function is not recommended, as it can lead to poor key distribution and poor performance. Suppose a server is removed in the above example. The keys would then be distributed like this.
Here most requests have to be remapped to different servers, so clients look for data on servers that do not hold it, which causes cache misses and uneven load distribution. Consistent hashing was introduced to address these issues.
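To get a feel for how disruptive this is, the sketch below counts how many keys change servers when the pool shrinks from 4 to 3. It assumes the removed server is the last one (s3), so the remaining indices keep their names. The 200 sample IPs are made up for the example.

```python
import ipaddress

def pick_index(ip: str, server_count: int) -> int:
    """Modulo hashing: integer form of the IP, modulo the number of servers."""
    return int(ipaddress.ip_address(ip)) % server_count

# Hypothetical sample of 200 client IPs.
ips = [f"192.168.1.{i}" for i in range(1, 201)]

# A key "moves" if its server index differs between the 4-server
# and the 3-server layout.
moved = sum(1 for ip in ips if pick_index(ip, 4) != pick_index(ip, 3))
print(f"{moved} of {len(ips)} keys changed servers")
```

A key keeps its server only when its remainders modulo 4 and modulo 3 agree, which happens for 3 out of every 12 consecutive integers. So about three quarters of the keys move, even though only one quarter of them lived on the removed server.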
How Consistent Hashing Works
A virtual circle, also known as a hash ring, is created, where each server is assigned a position on the ring. A hash function, such as SHA-1 or MD5, is used to map each server and each data item (an IP address or any other kind of data, such as the key in a key-value system) to a position on the virtual circle. The hash function also defines the hash space, which is the range of values that the hash function can output. SHA-1 produces 160-bit digests, so its hash space contains 2¹⁶⁰, or about 1.46 x 10⁴⁸, possible hash values.
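As a small sketch of this mapping, the snippet below uses Python's `hashlib` to turn any string into a position in the SHA-1 hash space. The helper name `ring_position` is an assumption for this example. Reading the digest as a big-endian integer is one reasonable convention, not the only one.

```python
import hashlib

HASH_SPACE = 2 ** 160  # number of distinct 160-bit SHA-1 outputs

def ring_position(key: str) -> int:
    """Interpret the SHA-1 digest of `key` as an integer position on the ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()  # 20 bytes
    return int.from_bytes(digest, "big")

print(f"hash space size: {HASH_SPACE:.2e}")  # hash space size: 1.46e+48
for name in ["s0", "s1", "192.168.1.100"]:
    print(name, "->", ring_position(name))
```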
When a request for a data item is received, the hash function is used to determine the position of the data item on the ring.
The request is then directed to the first server encountered when moving clockwise from the position of the data item on the ring.
When a server is added or removed, the positions of the other servers on the ring do not change. Only the data items that fall between the added or removed server and its predecessor on the ring are reassigned: to the new server when one is added, or to the next server clockwise when one is removed.
This way, only a small subset of data items needs to be remapped when a server is added or removed, minimizing the disruption to the system.
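The steps above can be sketched as a small ring class. This is a minimal illustration rather than a production implementation. The class name `HashRing`, its methods, and the sample keys are all assumptions for this example. The sorted list plus `bisect` stands in for "walk clockwise until you meet a server".

```python
import bisect
import hashlib

def ring_position(key: str) -> int:
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")

class HashRing:
    def __init__(self, servers=()):
        self._positions = []  # sorted server positions on the ring
        self._owner = {}      # position -> server name
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        pos = ring_position(server)
        bisect.insort(self._positions, pos)
        self._owner[pos] = server

    def remove(self, server: str) -> None:
        pos = ring_position(server)
        self._positions.remove(pos)
        del self._owner[pos]

    def lookup(self, key: str) -> str:
        if not self._positions:
            raise LookupError("ring is empty")
        # First server position at or after the key, wrapping past the end.
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]

ring = HashRing(["s0", "s1", "s2", "s3"])
keys = [f"192.168.1.{i}" for i in range(1, 201)]
before = {k: ring.lookup(k) for k in keys}

ring.remove("s3")
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
print("all moved keys came from s3:", all(before[k] == "s3" for k in moved))
```

Unlike the modulo scheme, the only keys that change owner here are the ones that were stored on the removed server.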
Hash Space and Hash Ring Limitations
Hash space limitations: The size of the hash space, which is the range of values that the hash function can output, can be a limiting factor in consistent hashing. If the hash space is too small, distinct servers or data items are more likely to hash to the same position, resulting in a high number of collisions. On the other hand, a very large hash space means longer hash values to compute, store, and compare.
Hash ring limitations: The use of a hash ring can also have limitations. The number of positions on the hash ring is fixed by the hash space, and if the number of servers or data items exceeds the number of positions, collisions are unavoidable.
Scalability: Consistent hashing can become less efficient as the number of servers or data items increases, as the number of data items that must be remapped on each membership change grows with the total number of items.
Load balancing: Consistent hashing does not take into account the load on each server, so it may not distribute the load evenly among the servers.
Performance: Consistent hashing can have poor performance when the number of collisions increases.
Complexity: Consistent hashing can be computationally expensive and time-consuming when a server is added or removed from the system, as it requires remapping of data items or requests.
Limited fault tolerance: Consistent hashing by itself does not provide fault tolerance. If a node goes down, its keys are redistributed, which can cause a temporary disruption in the system.
To overcome some of the limitations of the hash ring and the hash space, virtual nodes were introduced.
Virtual Nodes
In consistent hashing, virtual nodes are used to improve the distribution of data or requests among the servers. Each physical server in the system is assigned multiple virtual nodes or "tokens" and these tokens are used to determine the positions on the hash ring. This allows for a more even distribution of data or requests among the servers and can improve the performance of the system.
In the above diagram, we can clearly see that each server appears at several positions on the hash ring, which provides more even data distribution and better load balancing across the servers. Virtual nodes are labelled s0_0, s0_1, s0_2, where s0 is the server and 0, 1, 2 is the index of the virtual node.
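A minimal sketch of a ring with virtual nodes might look like the following. It reuses the `s0_0`, `s0_1`, ... labelling from above as the strings that get hashed. The class name `VNodeRing`, the default of 100 virtual nodes per server, and the sample keys are assumptions for illustration only.

```python
import bisect
import hashlib
from collections import Counter

def ring_position(key: str) -> int:
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")

class VNodeRing:
    def __init__(self, servers, vnodes: int = 100):
        # One ring entry per virtual node: (position of "s0_0", "s0"), ...
        entries = sorted(
            (ring_position(f"{server}_{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in entries]
        self._servers = [server for _, server in entries]

    def lookup(self, key: str) -> str:
        # First virtual node at or after the key, wrapping past the end.
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._servers[idx % len(self._servers)]

servers = ["s0", "s1", "s2", "s3"]
keys = [f"key-{i}" for i in range(2000)]

ring = VNodeRing(servers, vnodes=100)
counts = Counter(ring.lookup(k) for k in keys)
for server in servers:
    print(server, counts[server])
```

Rerunning this with `vnodes=1` and comparing the per-server counts is a quick way to see how much the spread depends on the number of virtual nodes.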
Benefits of Virtual Nodes
Hash space limitations: Virtual nodes do not enlarge the hash space itself, but they make better use of it. For example, a physical server assigned 4 virtual nodes (s0_0, s0_1, s0_2, s0_3) occupies 4 times as many positions on the hash ring as a physical server with 1 virtual node (s0_0), with the hash function unchanged. Spreading each server over more positions evens out the share of the ring that each server owns, improving the distribution of data or requests among the servers.
Hash ring limitations: The use of virtual nodes places more points on the hash ring per server, so ownership of the ring is divided more finely among the servers.
Scalability: Virtual nodes can improve the scalability of consistent hashing, as they can accommodate a large number of servers or data items without a significant increase in the number of remappings required.
Load balancing: Virtual nodes can improve load balancing by distributing the load more evenly among the servers.
Performance: Using virtual nodes can improve the performance of consistent hashing, as it reduces the number of collisions and remappings required.
Complexity: Using virtual nodes can reduce the computational complexity of consistent hashing, as it reduces the number of remappings required when a server is added or removed.
Fault tolerance: Virtual nodes can increase the fault tolerance of the system, because if a node goes down, the load is distributed among the other virtual nodes and physical nodes.
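To illustrate the fault tolerance point, the sketch below removes one server from a ring with virtual nodes and checks where its keys end up. The helpers `build_ring` and `lookup`, the choice of 100 virtual nodes, and the sample keys are assumptions made for this example.

```python
import bisect
import hashlib
from collections import Counter

def ring_position(key: str) -> int:
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")

def build_ring(servers, vnodes):
    """Sorted list of (position, server) pairs, one per virtual node."""
    return sorted(
        (ring_position(f"{s}_{i}"), s) for s in servers for i in range(vnodes)
    )

def lookup(ring, key):
    # (pos,) sorts just before any (pos, server) entry, so bisect_left
    # returns the first virtual node at or after the key's position.
    idx = bisect.bisect_left(ring, (ring_position(key),))
    return ring[idx % len(ring)][1]

keys = [f"key-{i}" for i in range(2000)]
full = build_ring(["s0", "s1", "s2", "s3"], vnodes=100)
degraded = build_ring(["s1", "s2", "s3"], vnodes=100)  # s0 has gone down

before = {k: lookup(full, k) for k in keys}
after = {k: lookup(degraded, k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
takeover = Counter(after[k] for k in moved)
print("keys that moved:", len(moved))
print("servers that took over s0's keys:", dict(takeover))
```

Because the virtual nodes of s0 are scattered around the ring, its keys are picked up by several of the remaining servers instead of all landing on a single neighbour.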
Conclusion
In conclusion, consistent hashing is a method of distributing requests among a dynamic set of servers. It uses a hash function to map each server and data item to a unique position on a virtual circle, or "hash ring," and then uses the hash of the request to determine which server the request should be directed to. The key benefit of consistent hashing is that it allows for the addition or removal of servers with minimal remapping of requests to servers.
Consistent hashing is used in several distributed systems, such as distributed caching, distributed storage, and distributed databases, to distribute the load of data and requests among the available servers. This allows for a more efficient use of resources and improved performance, as well as the ability to easily add or remove servers without disrupting the system.
However, consistent hashing has some limitations, such as hash space limitations, hash ring limitations, scalability, load balancing, performance, and complexity. To overcome these limitations, virtual nodes, dynamic consistent hashing, rendezvous hashing, and Maglev hashing have been proposed as alternative methods.
Overall, consistent hashing is a useful technique for distributing data or requests among servers or resources in a dynamic and efficient manner, but it is important to consider the trade-offs involved and to choose the best method for the specific use case.