go-zero's distributed caching implementation makes heavy use of the consistent hash algorithm. In this article, we will talk about consistent hashing and its implementation details in go-zero.
Take storage as an example: in a whole microservice system, our storage cannot be just a single node.
- First, to improve stability: if a single node goes down, the entire storage service becomes unavailable.
- Second, for data fault tolerance: with a single node, losing that node means losing the data, while with multiple nodes each node has a backup, and data is lost only if the nodes backing each other up are destroyed at the same time.
So the question arises: in the multi-node case, which node should the data be written to?
Hash
So essentially we need a way to **"compress" an input value and convert it into a smaller value, usually unique and in an extremely compact format, like uint64**, and this value must be:
- idempotent: hashing the same input must always produce the same result
This is exactly what a hash algorithm does.
But take the ordinary hash algorithm used for routing, e.g. `key % N`: if a node drops out of the cluster because of a failure or a heartbeat problem, this hash routing causes a large amount of data to be redistributed to different nodes. When a node receives a request for data it does not hold yet, it has to re-run the logic to fetch that data; if the node is a cache, this can easily cause a cache avalanche.
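To make the rehash cost concrete, here is a minimal sketch (not go-zero code): when a four-node cluster shrinks to three nodes, roughly three quarters of the keys routed by `key % N` end up on a different node.

```go
package main

import "fmt"

func main() {
	const before, after = 4, 3 // the cluster shrinks from 4 nodes to 3
	const total = 1000

	moved := 0
	for key := 0; key < total; key++ {
		// with modulo routing, the target node changes whenever key%4 != key%3
		if key%before != key%after {
			moved++
		}
	}
	fmt.Printf("%d of %d keys would move to a different node\n", moved, total)
}
```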
This is where the consistent hash algorithm needs to be introduced.
Consistent hash
Let's see how consistent hashing solves these problems.
Rehash
Let's start with the massive rehash problem.
As shown above, when a new node is added, the only key affected is `key31`. When a node is added (or removed), only the data near that node is affected; the data of other nodes is untouched, which solves the problem of node changes.
This is exactly the property of monotonicity, and it is also the reason why the ordinary hash algorithm cannot satisfy distributed scenarios.
Data skewing
In fact, the figure above shows that most of the keys are concentrated on node 1. When the number of nodes is small, most keys can easily end up concentrated on one particular node, and the problem you find in monitoring is uneven load between nodes.
To solve this problem, consistent hashing introduces the concept of the virtual node.
Since the load is uneven, we artificially construct a balanced scenario. There are only so many actual nodes, so we use virtual nodes to divide the value range, while the actual nodes that serve requests remain the same as before.
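As a rough illustration (the `#i` naming scheme below is made up for this example, not necessarily what go-zero uses), each actual node is hashed several times so that it owns multiple virtual points on the ring:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

func main() {
	// hypothetical example: each actual node gets `replicas` virtual points
	replicas := 3
	for _, node := range []string{"node1", "node2"} {
		for i := 0; i < replicas; i++ {
			virtual := fmt.Sprintf("%s#%d", node, i) // e.g. "node1#0"
			fmt.Printf("%s -> %d\n", virtual, crc32.ChecksumIEEE([]byte(virtual)))
		}
	}
}
```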
Concrete implementation
Let's start with `Get()`.
Get
First, the principle of the implementation:

- calculate the hash of the `key`
- find the index of the first matching `virtual node` and fetch the corresponding `h.keys[index]`: the virtual node's hash value
- look this hash up in the `ring` to find the `actual node` that matches it
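Below is a simplified sketch of that lookup. The field names follow the `ConsistentHash` struct shown in the Add Node section; the real go-zero code additionally handles locking and general `interface{}` keys, so treat this as an outline rather than the actual implementation.

```go
// Simplified sketch; assumes the hash function has the signature
// func(data []byte) uint64 and that h.keys is kept sorted.
func (h *ConsistentHash) Get(key string) (interface{}, bool) {
	if len(h.keys) == 0 {
		return nil, false
	}

	// 1. calculate the hash of the key
	hash := h.hashFunc([]byte(key))

	// 2. binary-search the first virtual node hash >= the key's hash,
	//    wrapping around to index 0 so the ring stays circular
	index := sort.Search(len(h.keys), func(i int) bool {
		return h.keys[i] >= hash
	}) % len(h.keys)

	// 3. map that virtual node hash back to the actual node(s); when a
	//    hash collision placed several actual nodes here, the real
	//    implementation picks one of them deterministically
	nodes := h.ring[h.keys[index]]
	if len(nodes) == 0 {
		return nil, false
	}
	return nodes[0], true
}
```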
In fact, we can see that the `ring` lookup returns a `[]node`. This is because, when computing virtual node hashes, hash conflicts may occur: virtual nodes that belong to different actual nodes can hash to the same value, so a single virtual node hash may correspond to more than one actual node.
This also means that `node` and `virtual node` are in a one-to-many relationship. The `ring` inside has roughly the following design.
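Purely as an illustration (these hash values and node names are made up), the `ring` field can end up holding something like this:

```go
// Illustrative values only: most virtual node hashes map to exactly one
// actual node, but on a hash collision several actual nodes share one entry.
ring := map[uint64][]interface{}{
	100: {"node1"},
	211: {"node2"},
	322: {"node1"},          // another virtual point of node1
	433: {"node2", "node3"}, // two nodes' virtual points collided on this hash
}
```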
This actually shows the allocation strategy of consistent hashing:

- `virtual node`s divide the value range; a `key` finds its `node` through the `virtual node` whose range it falls into.
- `virtual node`s ensure, via the `hash`, that the keys assigned to different nodes are roughly evenly distributed. In other words: split, then bind.
- When a new node is added, multiple `virtual node`s are assigned to it, so the new node can take over load from several existing nodes, which makes it easier to reach a balanced load globally when scaling out.
Add Node
After reading `Get`, you already know roughly how the whole consistent hash is designed.
```go
type ConsistentHash struct {
	hashFunc Func                            // hash function
	replicas int                             // virtual node amplification factor
	keys     []uint64                        // stores the virtual node hashes
	ring     map[uint64][]interface{}        // virtual node to actual node mapping
	nodes    map[string]lang.PlaceholderType // actual nodes (kept in a map for fast lookup)
	lock     sync.RWMutex
}
```
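And a simplified sketch of adding a node, following the struct above. The real implementation in the linked source also handles duplicate adds, weighted replicas, and node removal; `sort`, `strconv`, and go-zero's `core/lang` package are assumed to be imported.

```go
// Simplified sketch of Add: create `replicas` virtual nodes for one actual node.
func (h *ConsistentHash) Add(node string) {
	h.lock.Lock()
	defer h.lock.Unlock()

	// remember the actual node
	h.nodes[node] = lang.Placeholder

	for i := 0; i < h.replicas; i++ {
		// one hash per virtual node: node name plus a replica index
		hash := h.hashFunc([]byte(node + strconv.Itoa(i)))
		h.keys = append(h.keys, hash)
		h.ring[hash] = append(h.ring[hash], node)
	}

	// keep the virtual node hashes sorted so Get can binary-search them
	sort.Slice(h.keys, func(i, j int) bool {
		return h.keys[i] < h.keys[j]
	})
}
```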
With that, a basic consistent hash is implemented completely.
code: https://github.com/zeromicro/go-zero/blob/master/core/hash/consistenthash.go
Usage scenarios
As mentioned at the beginning, consistent hashing is widely used in distributed systems, for example:
- distributed caching: you can build a `cache proxy` on a storage system like a `redis cluster` and control the routing freely, using the consistent hash algorithm as the routing rule
- service discovery
- distributed scheduling of tasks
In all of the above distributed systems, consistent hashing can be used in the load balancing module.
Project address
https://github.com/zeromicro/go-zero
Welcome to use go-zero and give it a star to support us!