Kubernetes: Pods and WorkerNodes — control the placement of the Pods on the Nodes
Kubernetes allows very flexible control over how its Pods will be located on servers, i.e. WorkerNodes.
This can be useful if you need to run a pod on a specific node configuration, for example — a WorkerNode must have a GPU, or an SSD instead of an HDD. Another example is when you need to place individual Pods next to each other to reduce their communication latency, or to reduce cross Availability-zone traffic (see AWS: Grafana Loki, InterZone traffic in AWS, and Kubernetes nodeAffinity).
And, of course, this is important for building a High Availability and Fault Tolerance architecture, when you need to spread Pods across separate Nodes or Availability Zones.
We have four main approaches to control how Kubernetes Pods are hosted on WorkerNodes:
- configure Nodes so that they accept only Pods that meet the criteria specified on the Node:
  - taints and tolerations: we set a taint on the Node, and a Pod must have a matching toleration to run on it
- configure the Pod itself so that it selects only Nodes that meet the criteria specified in the Pod:
  - nodeName: only the Node with the specified name is selected
  - nodeSelector: selects Nodes with matching labels and values
  - nodeAffinity and nodeAntiAffinity: rules by which the Kubernetes Scheduler chooses a Node for the Pod depending on the parameters of the Node
- configure the Pod itself so that it selects a Node based on which other Pods are running there:
  - podAffinity and podAntiAffinity: rules by which the Kubernetes Scheduler chooses a Node for the Pod depending on the other Pods on that Node
- and a separate topic, Pod Topology Spread Constraints: rules for placing Pods across failure domains such as regions, Availability Zones, or Nodes
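As a rough orientation, the sketch below shows where each of these settings lives in a Pod manifest. The Pod name is hypothetical and the values are left empty on purpose. Taints are the one exception: they are not part of the Pod at all and are set on the Node (in its spec.taints).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: placement-overview         # hypothetical name, for illustration only
spec:
  containers:
  - name: my-container
    image: nginx:latest
  # nodeName: <node-name>          # bind the Pod directly to one Node
  nodeSelector: {}                 # label/value pairs the Node must carry
  tolerations: []                  # Node taints this Pod is allowed to tolerate
  affinity:
    nodeAffinity: {}               # rules based on Node labels
    podAffinity: {}                # rules based on Pods already running
    podAntiAffinity: {}
  topologySpreadConstraints: []    # spreading across failure domains
```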
kubectl explain
Just a tip: you can always read the relevant documentation for any parameter or resource using kubectl explain:
$ kubectl explain pod
KIND: Pod
VERSION: v1
DESCRIPTION:
Pod is a collection of containers that can run on a host. This resource is
created by clients and scheduled onto hosts.
…
Or:
$ kubectl explain Pod.spec.nodeName
KIND: Pod
VERSION: v1
FIELD: nodeName <string>
DESCRIPTION:
NodeName is a request to schedule this pod onto a specific node. If it is
non-empty, the scheduler simply schedules this pod onto that node, assuming
that it fits resource requirements.
Node Taints and Pods Tolerations
So, the first option is to use Taints and Tolerations to restrict, on the Node side, which Pods can run on it.
Here a taint “repels” Pods that do not have a matching toleration from the Node, while a toleration allows a Pod to be scheduled onto a Node that has the matching taint. Note that a toleration only permits the placement; by itself it does not attract the Pod to that Node.
For example, we can create a Node on which only Pods with some critical services such as controllers will be launched.
To do so, specify a taint with the effect NoSchedule, which prohibits scheduling new Pods on this Node:
$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoSchedule
node/ip-10-0-3-133.ec2.internal tainted
Next, create a Pod with a toleration for the key "critical-addons":
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  tolerations:
  - key: "critical-addons"
    operator: "Exists"
    effect: "NoSchedule"
Deploy, and check Pods on that Node:
$ kubectl get pod --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-3-133.ec2.internal
NAMESPACE           NAME                                    READY   STATUS    RESTARTS   AGE     IP          NODE                         NOMINATED NODE   READINESS GATES
default             my-pod                                  1/1     Running   0          2m11s   10.0.3.39   ip-10-0-3-133.ec2.internal   <none>           <none>
dev-monitoring-ns   atlas-victoriametrics-loki-logs-zxd9m   2/2     Running   0          10m     10.0.3.8    ip-10-0-3-133.ec2.internal   <none>           <none>
…
But where does the Loki Pod come from? It was placed on this Node before the taint took effect, and NoSchedule only blocks new Pods; it does not touch Pods that are already running.
To remove such Pods, add a taint with the NoExecute effect: Kubernetes will then evict already running Pods that do not tolerate it from this Node:
$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoExecute
Check the taints now:
$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq '.spec.taints'
[
  {
    "effect": "NoExecute",
    "key": "critical-addons",
    "value": "true"
  },
  {
    "effect": "NoSchedule",
    "key": "critical-addons",
    "value": "true"
  }
]
Add a second toleration to our Pod, otherwise it will be evicted from this Node too:
...
  tolerations:
  - key: "critical-addons"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "critical-addons"
    operator: "Exists"
    effect: "NoExecute"
Deploy and check Pods on this Node again:
$ kubectl get pod --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-3-133.ec2.internal
NAMESPACE     NAME                                               READY   STATUS    RESTARTS   AGE   IP           NODE                         NOMINATED NODE   READINESS GATES
default       my-pod                                             1/1     Running   0          3s    10.0.3.246   ip-10-0-3-133.ec2.internal   <none>           <none>
kube-system   aws-node-jrsjz                                     1/1     Running   0          16m   10.0.3.133   ip-10-0-3-133.ec2.internal   <none>           <none>
kube-system   csi-secrets-store-secrets-store-csi-driver-cctbj   3/3     Running   0          16m   10.0.3.144   ip-10-0-3-133.ec2.internal   <none>           <none>
kube-system   ebs-csi-node-46fts                                 3/3     Running   0          16m   10.0.3.187   ip-10-0-3-133.ec2.internal   <none>           <none>
kube-system   kube-proxy-6ztqs                                   1/1     Running   0          16m   10.0.3.133   ip-10-0-3-133.ec2.internal   <none>           <none>
Now this Node runs only our Pod and the Pods from DaemonSets, which by default must run on all Nodes and have the corresponding tolerations; see How Daemon Pods are scheduled.
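Two toleration shorthands are worth knowing at this point. The fragments below are only a sketch based on the matching rules in the Kubernetes documentation: a toleration with no effect matches every effect of its key, and a toleration with operator Exists and no key matches every taint.

```yaml
# Variant 1: no "effect" field, so this single entry tolerates both the
# NoSchedule and the NoExecute taint with the key critical-addons.
tolerations:
- key: "critical-addons"
  operator: "Exists"
---
# Variant 2: no "key" field, so this entry tolerates every taint on every
# Node. Use with care: it effectively switches taints off for the Pod.
tolerations:
- operator: "Exists"
```

Variant 1 could replace the two separate tolerations used for my-pod above; whether the shorter form is preferable to listing each effect explicitly is a matter of taste.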
In addition to Exists, which only checks that a taint with the specified key is present, you can also match the value of the taint.
To do so, set the operator to Equal and add the required value:
...
  tolerations:
  - key: "critical-addons"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  - key: "critical-addons"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
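In practice Pods are usually created by a controller rather than by hand. As a minimal sketch, the same tolerations in a Deployment go into the Pod template, under spec.template.spec; the Deployment name and the app label here are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-addon            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: critical-addon         # hypothetical label
  template:
    metadata:
      labels:
        app: critical-addon
    spec:
      containers:
      - name: my-container
        image: nginx:latest
      # Tolerations belong to the Pod template, not to the Deployment spec
      tolerations:
      - key: "critical-addons"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      - key: "critical-addons"
        operator: "Equal"
        value: "true"
        effect: "NoExecute"
```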
To delete a taint, add a minus sign at the end:
$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoSchedule-
node/ip-10-0-3-133.ec2.internal untainted
$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoExecute-
node/ip-10-0-3-133.ec2.internal untainted
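A toleration only allows a Pod onto a tainted Node; it does not keep the Pod away from other, untainted Nodes. If the Node is meant to be dedicated to such Pods, the toleration is usually paired with a Node selection rule. The sketch below assumes the critical-addons taints are still in place and selects the Node by its kubernetes.io/hostname label; any other Node label would work the same way.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  # Select the Node: without this, the Pod may land on any untainted Node
  nodeSelector:
    kubernetes.io/hostname: ip-10-0-3-133.ec2.internal
  # Permit the Pod on that Node despite its taints (no effect = all effects)
  tolerations:
  - key: "critical-addons"
    operator: "Exists"
```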
Choosing a Node by a Pod: nodeName, nodeSelector, and nodeAffinity
Another approach is to configure a Pod so that “it” chooses which Node to run on.
For this we have nodeName, nodeSelector, nodeAffinity, and nodeAntiAffinity. See Assign Pods to Nodes.
nodeName
The most straightforward way. It takes precedence over nodeSelector and the affinity rules:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  nodeName: ip-10-0-3-133.ec2.internal
nodeSelector
With nodeSelector we can choose Nodes that have the corresponding labels.
Add a label to the Node:
$ kubectl label nodes ip-10-0-3-133.ec2.internal service=monitoring
node/ip-10-0-3-133.ec2.internal labeled
Check it:
$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq '.metadata.labels'
{
…
  "kubernetes.io/hostname": "ip-10-0-3-133.ec2.internal",
  "kubernetes.io/os": "linux",
  "node.kubernetes.io/instance-type": "t3.medium",
  "service": "monitoring",
…
In the Pod’s manifest, set the nodeSelector:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  nodeSelector:
    service: monitoring
If several labels are set in the Pod’s nodeSelector, the Node must have all of these labels for the Pod to run on it.
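As an illustration of the multi-label case, the manifest below requires both the service=monitoring label added above and the kubernetes.io/os=linux label that appears in the Node's label list. It is a sketch only; a Node that carries just one of the two labels would not be selected.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  # Both labels must be present on the Node (logical AND)
  nodeSelector:
    service: monitoring
    kubernetes.io/os: linux
```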
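nodeAffinity and nodeAntiAffinity
nodeAffinity and nodeAntiAffinity work in the same way as nodeSelector but are more flexible. Note that the Pod spec has no separate nodeAntiAffinity field: node anti-affinity is expressed inside nodeAffinity with the NotIn and DoesNotExist operators.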
For example, you can set hard or soft launch limits — for a soft limit, the scheduler will try to launch a Pod on the corresponding Node, and if it cannot, it will launch it on another. Accordingly, if you set a hard limit and the scheduler cannot start the Pod on the selected Node, the Pod will remain in Pending status.
The hard limit is set in the field .spec.affinity.nodeAffinity with requiredDuringSchedulingIgnoredDuringExecution, and the soft limit with preferredDuringSchedulingIgnoredDuringExecution.
For example, we can launch a Pod in AvailabilityZone us-east-1a or us-east-1b using the Node label topology.kubernetes.io/zone:
$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq '.metadata.labels'
{
…
  "topology.kubernetes.io/region": "us-east-1",
  "topology.kubernetes.io/zone": "us-east-1b"
}
Set a hard-limit:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b
Or a soft limit. For example, with a non-existent label:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: non-exist-node-label
            operator: In
            values:
            - non-exist-value
In this case, the Pod will still be launched, on whichever Node the scheduler considers the best fit.
You can also combine conditions:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: non-exist-node-label
            operator: In
            values:
            - non-exist-value
When several terms are listed in requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms, they are ORed: a Node is eligible if it satisfies any one of them.
When several conditions are listed in a single matchExpressions field, they are ANDed: all of them must match.
In the operator field you can use In, NotIn, Exists, DoesNotExist, Gt (greater than), and Lt (less than).
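To make the OR/AND rules concrete, here is a sketch of a Pod spec fragment with two terms. It reuses the zone label and the service=monitoring label from the earlier examples; the combination itself is made up for illustration. The NotIn expression also shows how node anti-affinity is written.

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      # Term 1: both expressions must match (AND):
      # the Node is in us-east-1a and is not labelled service=monitoring
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-east-1a
        - key: service
          operator: NotIn
          values:
          - monitoring
      # Term 2: an alternative to term 1 (OR):
      # any Node in us-east-1b is acceptable
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-east-1b
```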
soft-limit and the weight
In preferredDuringSchedulingIgnoredDuringExecution you can give each condition a weight from 1 to 100.
For every Node that passes the hard requirements, the scheduler sums the weights of the preferences that the Node matches and adds that sum to the Node's score, so, if all other conditions coincide, the Node with the largest total weight is selected:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
      - weight: 100
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1b
This Pod will be launched on a Node in the us-east-1b zone:
$ kubectl get pod my-pod -o wide
NAME     READY   STATUS    RESTARTS   AGE   IP           NODE                         NOMINATED NODE   READINESS GATES
my-pod   1/1     Running   0          3s    10.0.3.245   ip-10-0-3-133.ec2.internal   <none>           <none>
And the zone of this Node:
$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1b
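Weights matter most when one Node can match several preferences at once. The fragment below is a sketch with made-up weights of 80 and 20, using the zone and instance-type labels seen in the Node's label list; the comments work through the sums that the scheduler would add to each Node's score.

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80
      preference:
        matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-east-1b
    - weight: 20
      preference:
        matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - t3.medium
# Weight sums per Node:
#   us-east-1b + t3.medium      -> 80 + 20 = 100
#   us-east-1b, other type      -> 80
#   us-east-1a + t3.medium      -> 20
#   us-east-1a, other type      -> 0
```

The sum is only one input to the final Node score, so a Node with a lower sum can still win if other scoring factors, such as free resources, favour it.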
podAffinity and podAntiAffinity
Similar to selecting a Node using hard and soft limits, you can set Pod affinity rules based on the labels of the Pods that are already running on a Node. See Inter-pod affinity and anti-affinity.
For example, Grafana Loki has three Pods — Read, Write, and Backend.
We want to run the Read and Backend in the same AvailabilityZone to avoid cross-AZ traffic, but at the same time, we want them not to run on those Nodes where there are Write Pods.
Loki Pods have labels corresponding to a component: app.kubernetes.io/component=read, app.kubernetes.io/component=backend, and app.kubernetes.io/component=write.
So, for the Read Pod, we can set a podAffinity to Pods with the label app.kubernetes.io/component=backend, and a podAntiAffinity to Pods with the label app.kubernetes.io/component=write:
...
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values:
            - backend
        topologyKey: "topology.kubernetes.io/zone"
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values:
            - write
        topologyKey: "kubernetes.io/hostname"
...
Here, in podAffinity.topologyKey, we say that Pods are placed by the topology.kubernetes.io/zone domain: the topology.kubernetes.io/zone of the Read Pod's Node must match that of a Node running Backend Pods.
And in podAntiAffinity.topologyKey we set kubernetes.io/hostname, that is, do not place the Pod on WorkerNodes that already run Pods with the label app.kubernetes.io/component=write.
Let’s deploy and check where there is a Write Pod:
$ kubectl -n dev-monitoring-ns get pod loki-write-0 -o json | jq '.spec.nodeName'
"ip-10-0-3-53.ec2.internal"
And AvailabilityZone of this Node:
$ kubectl -n dev-monitoring-ns get node ip-10-0-3-53.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1b
Check where the Backend Pod is placed:
$ kubectl -n dev-monitoring-ns get pod loki-backend-0 -o json | jq '.spec.nodeName'
"ip-10-0-2-220.ec2.internal"
And its zone:
$ kubectl -n dev-monitoring-ns get node ip-10-0-2-220.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1a
And now, a Read Pod:
$ kubectl -n dev-monitoring-ns get pod loki-read-698567cdb-wxgj5 -o json | jq '.spec.nodeName'
"ip-10-0-2-173.ec2.internal"
The Node is different from the Write or Backend Nodes, but:
$ kubectl -n dev-monitoring-ns get node ip-10-0-2-173.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1a
The same AvailabilityZone as in the Backend Pod.
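The rules above are hard limits: if no Node satisfies them, the Read Pod stays in Pending. A softer variant is sketched below, under the assumption that keeping Read away from Write is a preference rather than a requirement. For Pod affinity, the preferred form wraps each rule in a podAffinityTerm and gives it a weight; the weight of 100 is an arbitrary choice.

```yaml
affinity:
  podAntiAffinity:
    # Soft limit: try to avoid Nodes with Write Pods, but schedule anyway
    # if no other Node is available
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values:
            - write
        topologyKey: "kubernetes.io/hostname"
```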
Pod Topology Spread Constraints
We can configure the Kubernetes Scheduler so that it distributes Pods by “domains”, that is, by Nodes, regions, or Availability Zones. See Pod Topology Spread Constraints.
For this, we set the necessary config in the field spec.topologySpreadConstraints, which describes exactly how Pods will be distributed.
For example, we have 5 WorkerNodes in two AvailabilityZones.
We want to run 5 Pods and for fault tolerance we want each Pod to be on a separate Node.
Then our config for a Deployment can look like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-container
        image: nginx:latest
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app
Here:
- maxSkew: the maximum allowed difference in the number of matching Pods between domains (as defined by topologyKey)
  - it is a hard limit only when whenUnsatisfiable=DoNotSchedule; with whenUnsatisfiable=ScheduleAnyway the Pod is created regardless, and the scheduler only prefers placements that reduce the skew
- whenUnsatisfiable: either DoNotSchedule (do not schedule the Pod, it stays in Pending) or ScheduleAnyway
- topologyKey: a WorkerNode label that defines the domain, that is, the label by which we group the Nodes when the placement of Pods is calculated
- labelSelector: which Pods to take into account when placing new Pods (for example, if Pods come from different Deployments but should be spread together, configure topologySpreadConstraints in both Deployments with labelSelectors that match each other's Pods)
In addition, you can set the nodeAffinityPolicy and/or nodeTaintsPolicy parameters to Honor or Ignore to control whether the Pod's nodeAffinity and nodeSelector, and the taints of the Nodes, are taken into account when the spread is calculated.
Let’s deploy and check the Nodes of these Pods:
$ kubectl get pod -o json | jq '.items[].spec.nodeName'
"ip-10-0-3-53.ec2.internal"
"ip-10-0-3-22.ec2.internal"
"ip-10-0-2-220.ec2.internal"
"ip-10-0-2-173.ec2.internal"
"ip-10-0-3-133.ec2.internal"
All are placed on separate Nodes.
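Several constraints can be combined, and all of them are evaluated together. The sketch below extends the same Deployment with a second, soft constraint by Availability Zone, and sets nodeTaintsPolicy on the first one. It assumes a cluster version in which nodeTaintsPolicy is available; the choice of values is illustrative only.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-container
        image: nginx:latest
      topologySpreadConstraints:
      # Hard rule: Pod counts per Node may differ by at most 1.
      # nodeTaintsPolicy: Honor leaves out tainted Nodes that the Pod
      # does not tolerate when the skew is calculated.
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        nodeTaintsPolicy: Honor
        labelSelector:
          matchLabels:
            app: my-app
      # Soft rule: prefer an even split between Availability Zones.
      # With 5 Pods in 2 zones the best possible split is 3/2 (skew 1).
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: my-app
```

The placement shown above, with three Pods on Nodes in us-east-1b and two in us-east-1a, would satisfy both constraints.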
Originally published at RTFM: Linux, DevOps, and system administration.