Load balancing is a big topic. We could talk about:
- slow start: newly added nodes are given lower weights at first to avoid overloading them
- priority: different availability zones (AZs) have different priorities; nodes outside the current AZ are only used as backups when the nodes in the current AZ are unavailable
- subset: grouped load balancing; the algorithm first selects a group, then selects a specific node within that group
- retry: when load balancing meets retries, there are additional cases to consider. On retry we usually need to select a node other than the current one, and once every node has been tried we usually do not retry for more than one round
This article will focus on the cornerstone of each of these features - the load balancing algorithm.
Random
Random load balancing means selecting a node at random. Since this approach is stateless, it is the easiest load balancing algorithm to implement. But that is an advantage for developers, not for users. Random load balancing only guarantees balance in mathematical expectation, not at the micro level. It is entirely possible for several requests in a row to hit the same node, just as bad luck always strikes. The black swan event is a shadow that random load balancing cannot erase. The only scenario in which I recommend random load balancing is when it is the only option.
RoundRobin
RoundRobin means that each node is selected in turn. For scenarios where all nodes have the same weight, RoundRobin is not difficult to implement: just record the index of the currently selected node and pick the next one on the next request.
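Here is a minimal sketch of equal-weight RoundRobin (not thread safe; a real implementation would guard the index with a lock or an atomic):

```python
# Minimal sketch of equal-weight RoundRobin: remember an index, advance it each pick.
class RoundRobin:
    def __init__(self, nodes):
        self.nodes = nodes
        self.index = 0

    def pick(self):
        node = self.nodes[self.index % len(self.nodes)]
        self.index += 1
        return node

rr = RoundRobin(["A", "B", "C"])
print([rr.pick() for _ in range(6)])  # ['A', 'B', 'C', 'A', 'B', 'C']
```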
For scenarios where the weights are not the same, we need to consider how to keep the selections sufficiently balanced. Suppose there are two nodes A and B with weights 5 and 2. If we naively extend the equal-weight RoundRobin implementation, we get the following result:
A B
A B
A
A
A
Node A will be selected at most 4 times in a row (3 times at the end of the current round plus 1 time in the next round). Considering that the weight ratio of nodes A and B is 2.5:1, this behavior of selecting node A 4 times in a row is not commensurate with the weight ratio of the two nodes.
Therefore, for this scenario, we need to go beyond simple node-by-node polling and make the nodes with different weights as balanced as possible at the micro level. Using the previous example of nodes A and B, a micro-level equilibrium distribution should look like this:
A
A B
A
A
A B
Node A is selected at most 3 times in a row, which is not very different from the weight ratio of 2.5:1.
When implementing RoundRobin with weights, please do not invent a new algorithm if you can avoid it. Weighted RoundRobin implementations are error-prone: one may work fine in local testing and run OK in production for a while, until a user configures a particular set of weights and the imbalance appears. Consult the mainstream implementations, and if you need to deviate from them, it is best to provide a mathematical proof.
Next, let's look at how the mainstream implementations -- Nginx and Envoy -- do it.
The Nginx implementation looks roughly like this.
- each node has its own current score; each time a node is to be selected, iterate over the nodes and add to each score a value related to that node's weight.
- the node with the highest score is selected.
- the sum of all weights is subtracted from the selected node's score.
The higher a node's weight, the faster its score recovers after the subtraction, and the more likely it is to be selected again. Because this recovery takes time, the same node is unlikely to be selected immediately afterward, which keeps the selections spread out.
The real code is complicated by the fact that it is coupled with the passive health check (there are multiple weights, and effective_weight needs to be adjusted according to max_fails). Since the specific implementation of Nginx is not the focus of this article, interested readers can check it out for themselves.
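The three steps above can be captured in a short sketch. This is a simplified illustration of the smooth weighted round-robin idea, not Nginx's actual code (it ignores effective_weight and the health-check coupling):

```python
# Simplified smooth weighted round robin: add each weight to its score,
# pick the highest score, subtract the total weight from the winner.
class SmoothWeightedRR:
    def __init__(self, weights):               # e.g. {"A": 5, "B": 2}
        self.weights = weights
        self.current = {node: 0 for node in weights}
        self.total = sum(weights.values())

    def pick(self):
        for node, weight in self.weights.items():
            self.current[node] += weight        # step 1: raise every score
        best = max(self.current, key=self.current.get)  # step 2: highest score wins
        self.current[best] -= self.total        # step 3: pay back the total weight
        return best

lb = SmoothWeightedRR({"A": 5, "B": 2})
print([lb.pick() for _ in range(7)])  # ['A', 'B', 'A', 'A', 'A', 'B', 'A']
```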
Envoy's implementation is a bit clearer. It is based on a simplified version of the EDF (Earliest Deadline First) algorithm for node selection. In short, it uses a priority queue to pick the current best node. For each node, we record two values:
- deadline: the next time at which the node should be taken out
- last_popped_time: the last time this node was taken out

(Envoy's actual code differs slightly from this; we use last_popped_time here instead of Envoy's offset_order for ease of understanding.)
Again, take our A and B nodes as an example.
Nodes A and B are given 1/weight as their respective scores. The algorithm runs as follows.
- a priority queue is constructed, ordered by deadline first and by last_popped_time when the deadlines are equal; each node's initial deadline is its score.
- each time a node is selected, the node at the front of the queue (the one with the earliest deadline) is popped.
- the selected node's last_popped_time is set to its deadline at the time of selection, its score is added to its deadline, and it is reinserted into the queue.
Each selection is as follows:
| round | A deadline | B deadline | A last_popped_time | B last_popped_time | Selected |
|---|---|---|---|---|---|
| 1 | 1/5 | 1/2 | 0 | 0 | A |
| 2 | 2/5 | 1/2 | 1/5 | 0 | A |
| 3 | 3/5 | 1/2 | 2/5 | 0 | B |
| 4 | 3/5 | 1 | 2/5 | 1/2 | A |
| 5 | 4/5 | 1 | 3/5 | 1/2 | A |
| 6 | 1 | 1 | 4/5 | 1/2 | B |
| 7 | 1 | 3/2 | 4/5 | 1 | A |
As you can see, with the EDF algorithm node A is selected at most 3 times in a row (1 at the end of the current loop plus 2 at the start of the next), which is not far from the weight ratio of 2.5:1. In addition, compared to Nginx's algorithm, the time complexity of node selection under EDF is dominated by the O(log n) reinsertion, which is faster than comparing scores node by node when there are a large number of nodes.
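Here is a rough sketch of this EDF-style selection using Python's heapq as the priority queue. It follows the description above rather than Envoy's actual code:

```python
import heapq

# EDF-style weighted selection: the queue is ordered by (deadline, last_popped_time).
class EDFPicker:
    def __init__(self, weights):                # e.g. {"A": 5, "B": 2}
        self.queue = []
        for node, weight in weights.items():
            score = 1.0 / weight
            # entry layout: (deadline, last_popped_time, score, node)
            heapq.heappush(self.queue, (score, 0.0, score, node))

    def pick(self):
        deadline, _, score, node = heapq.heappop(self.queue)
        # last_popped_time becomes the deadline at selection time;
        # the node is reinserted with deadline + score.
        heapq.heappush(self.queue, (deadline + score, deadline, score, node))
        return node

lb = EDFPicker({"A": 5, "B": 2})
print([lb.pick() for _ in range(7)])  # ['A', 'A', 'B', 'A', 'A', 'B', 'A']
```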
Least Request
The Least Request algorithm, also known as the Least Connection algorithm, dates back to the days when each request typically corresponded to one connection, and it is often used to balance long-lived connections. If the load on the service closely tracks the number of current requests, for example in a push service where the number of connections managed by each node should stay balanced, then the least request algorithm is an ideal choice. Alternatively, if requests take a long time and vary greatly in length, the least request algorithm also keeps the number of outstanding requests on each node balanced and avoids long queues. For that case, the EWMA algorithm discussed later is also a good fit.
To implement the least request algorithm, we need to keep track of the current number of requests on each node: the count is incremented when a request arrives and decremented when it finishes. When all nodes have the same weight, an O(n) traversal finds the least requested node. We can optimize this further: with the P2C (power of two choices) algorithm, we randomly sample two nodes each time and achieve an effect close to the full O(n) traversal with only O(1) time complexity. In fact, the P2C algorithm can be used to optimize the time complexity for any case that satisfies the following conditions:
- each node has a score
- all nodes have the same weight
So some frameworks directly abstract a p2c middleware as a generic capability.
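A sketch of P2C applied to least request might look like this; the in_flight counter is assumed to be maintained by the caller when requests start and finish:

```python
import random

# P2C least request: sample two distinct nodes, keep the one with fewer in-flight requests.
def p2c_least_request(nodes, in_flight):
    a, b = random.sample(nodes, 2)
    return a if in_flight[a] <= in_flight[b] else b

in_flight = {"A": 3, "B": 0, "C": 7}
chosen = p2c_least_request(list(in_flight), in_flight)
in_flight[chosen] += 1   # incremented when the request starts, decremented when it ends
```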
There is no way to use the P2C algorithm when nodes have different weights. In that case we can adjust the weight by the current number of requests, using weight / (1 + number of requests). The more requests a node has, the more its current weight is reduced. For example, if a node with weight 2 has 3 requests, its adjusted weight is 1/2. If a new request comes in, the weight becomes 2/5. By dynamically adjusting the weight, we turn weighted least request into weighted RoundRobin, and can then use a traversal or a priority queue to handle it.
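Putting numbers to it, a tiny sketch of the adjustment:

```python
# Dynamically adjusted weight: weight / (1 + number of in-flight requests).
def adjusted_weight(weight, in_flight):
    return weight / (1 + in_flight)

print(adjusted_weight(2, 3))  # 0.5 -> a node with weight 2 and 3 in-flight requests
print(adjusted_weight(2, 4))  # 0.4 -> after one more request arrives, i.e. 2/5
```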
Hash
There are times when a client needs guaranteed access to a fixed server: for example, requests from the same session must be proxied to the same node, or requests must be routed to a fixed node based on the client IP. In this case we need a Hash algorithm to map a feature of the client to a node. However, simple hashing has a problem: if the number of nodes changes, the number of affected requests is amplified far beyond the change itself.
Suppose this simple Hash takes the hash value modulo the number of nodes, and ten requests hash to the values 0 through 9. The number of nodes starts at 4 and then becomes 3. The result is:
4 nodes: 0 1 2 3 0 1 2 3 0 1
3 nodes: 0 1 2 0 1 2 0 1 2 0
We can see that 70% of the requests now map to a different node, far more than the 25% change in the number of nodes.
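The example can be reproduced in a few lines:

```python
# Ten requests whose hash values are 0..9, mapped onto 4 nodes and then onto 3.
hashes = list(range(10))
with_4_nodes = [h % 4 for h in hashes]
with_3_nodes = [h % 3 for h in hashes]
moved = sum(a != b for a, b in zip(with_4_nodes, with_3_nodes))
print(with_4_nodes)   # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
print(with_3_nodes)   # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
print(moved / 10)     # 0.7 -> 70% of the requests change node
```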
So in practice, we use Consistent Hash more often, and only consider the general Hash algorithm if it is not available.
Consistent Hash
Consistent Hash is an algorithm designed to minimize how many results change when re-hashing. With the previous Hash algorithm, the result is strongly tied to the number of nodes, so once the number of nodes changes, the Hash results change drastically. Can we make the Hash result independent of the number of nodes? Consistent Hash provides a new way of thinking about this.
The most common consistent Hash algorithm is ring hash: the entire Hash space is treated as a ring, each node is mapped to a point on the ring by the Hash algorithm, a Hash value is computed for each request, and the nearest node clockwise from that value is selected. This way, the request's Hash value has no relationship to the number of nodes: when the nodes on the ring change, the request's Hash value stays the same, and only the nearest node may differ.
The reader may ask: if a node's position depends on its Hash value, how can we ensure the distribution is balanced? Hash algorithms are designed to reduce collisions, and a high quality algorithm should spread its outputs as evenly as possible. Of course, if we hash only a handful of nodes, the results will inevitably not be spread out enough. Therefore Consistent Hash introduces the concept of virtual nodes: each real node corresponds to N virtual nodes, say 100, and each virtual node's Hash value is computed by something like Hash(node + "_" + virtual_node_id). A real node thus corresponds to N points on the Hash ring. Statistically, as long as N is large enough, the standard deviation of the distances between points becomes smaller and the distribution of nodes on the ring becomes more balanced.
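A small sketch of a hash ring with virtual nodes; the hash function (MD5) and the virtual node count here are illustrative choices, not any particular proxy's defaults:

```python
import bisect
import hashlib

# Ring hash: every real node contributes N virtual points on the ring;
# a request is routed to the first point clockwise from its own hash.
class RingHash:
    def __init__(self, nodes, virtual_nodes=100):
        self.ring = []                            # sorted list of (point, node)
        for node in nodes:
            for i in range(virtual_nodes):
                point = self._hash(f"{node}_{i}")
                self.ring.append((point, node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def pick(self, key):
        point = self._hash(key)
        idx = bisect.bisect(self.ring, (point,)) % len(self.ring)  # wrap around the ring
        return self.ring[idx][1]

ring = RingHash(["A", "B", "C"])
print(ring.pick("client-10.0.0.1"))   # the same key always maps to the same node
```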
However, N cannot grow without bound: even virtual nodes need real memory to record their positions on the ring. The larger N is, the more balanced the nodes are, but the more memory is consumed. The Maglev algorithm is another consistent Hash algorithm, designed to optimize memory consumption. Thanks to a different data structure, it uses less memory while guaranteeing the same balance (or provides better balance with the same memory, depending on which side you hold constant).
EWMA
EWMA (Exponential Weighted Moving Average) is an algorithm that balances load based on response time. As the name suggests, the score is an exponentially weighted moving average of the response time.
Suppose the current response time is R, the time since the last access is delta_time, and the score at the last access was S1. Then the current score S2 is:

S2 = S1 * weight + R * (1.0 - weight), where weight = e ^ (-delta_time / k)

k is a constant fixed in advance by the algorithm.
It is Exponential Weighted: the longer the last visit is from the present, the less it affects the current score.
It is Moving: the current score is adjusted from the last score.
It is Average: if delta_time is large enough, weight is small enough and the score is close to the current response time; if delta_time is small enough, weight is large enough and the score is close to the last score. Overall, the score is the result of adjusting the response time over time.
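As a sketch, the score update could be written like this; the decay constant k here is an arbitrary illustrative value:

```python
import math

K = 10.0  # decay constant k, an illustrative value in the same unit as delta_time

def ewma_update(prev_score, response_time, delta_time, k=K):
    weight = math.exp(-delta_time / k)             # weight = e ^ (-delta_time / k)
    return prev_score * weight + response_time * (1.0 - weight)

score = 100.0                                      # last score, e.g. 100 ms
score = ewma_update(score, response_time=20.0, delta_time=5.0)
print(round(score, 1))  # ~68.5: pulled toward the new response time, but still close to the old score
```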
Careful readers will ask: since the weight is calculated from delta_time, where does the weight the user specifies in the configuration come in? EWMA is an adaptive algorithm that dynamically adjusts to the state of the upstream. If you find that you need to configure weights, your scenario is not a good fit for EWMA. In fact, since the EWMA algorithm does not concern itself with weights, many people treat it as a replacement for the slow start feature.
But EWMA is not a panacea. Since EWMA is a response time based algorithm, it does not work if the upstream response time has little to do with the upstream node, such as the push scenario mentioned earlier in the introduction of the least request algorithm, where the response time depends on the push policy, which is not a good match for EWMA.
In addition, the EWMA algorithm has an inherent flaw: response time does not necessarily reflect the full picture. Imagine a scenario in which an upstream node keeps failing fast with 500 status codes. In the eyes of the EWMA algorithm this is an excellent node; after all, it has an unparalleled response time. As a result, the majority of traffic will be sent to it. So when you use EWMA, be sure to also enable health checks and take problematic nodes out of rotation in time. Sometimes a 4xx status code can cause traffic imbalance too. For example, during a grayscale upgrade, the new version adds an incorrect check that rejects some valid requests in production (returning a 400 status code). Since EWMA favors nodes that respond faster, more requests will land on this faulty version.