Introduction
There are two common scaling methods: vertical scaling and horizontal scaling.
Vertical scaling means adding more hardware, such as RAM or CPU, to an existing node. Horizontal scaling, on the other hand, means adding more nodes or running more instances of an app to spread the load across them.
However, vertical scaling has its limits. Once a node's hardware is maxed out, horizontal scaling becomes necessary. This article will focus on horizontal scaling using the Kubernetes Horizontal Pod Autoscaler (HPA), which automatically scales the number of Pods up or down based on system demand.
Implementation Process
1. Build a Docker image for your application.
2. Deploy the image using a Deployment and LoadBalancer service.
3. Configure HPA to automatically scale resources.
To use HPA for auto-scaling based on CPU/Memory, Kubernetes must have the metrics-server installed. If you’re using a cloud provider, the metrics-server is usually installed by default. For local Kubernetes setups, you need to manually install the metrics-server.
If you’re using Kind for a local Kubernetes setup, follow these steps to install the metrics-server after successfully creating the cluster:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Or install it with Helm:
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm upgrade --install metrics-server metrics-server/metrics-server
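Either way, you can verify that the metrics-server is up before moving on. (Note: on Kind, the metrics-server often fails to start unless you add the `--kubelet-insecure-tls` flag to its container args, since Kind's kubelet certificates are self-signed.)

```shell
# Wait for the metrics-server Deployment to become ready
kubectl -n kube-system rollout status deployment/metrics-server

# Should print CPU/memory usage once metrics are being collected
kubectl top nodes
```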
1. Build a Docker Image for the Application
Use the following code block to create a Node.js Express server:
import express from 'express'

const port = 3000
const app = express()

app
  .get('/', (_, res) => {
    res.send('This is a Node.js TypeScript application! Current time is ' + Date.now())
  })
  .get('/sum', (req, res) => {
    // CPU-intensive endpoint: sums the integers 0..value-1
    const value = Number(req.query.value)
    const start = Date.now()
    const result = Array(value)
      .fill(0)
      .map((_, i) => i)
      .reduce((a, b) => a + b, 0) // initial value 0 avoids a throw on an empty array
    const now = Date.now()
    const duration = now - start
    res.json({ duration, now, result })
  })
  .listen(port, () => {
    console.log(`Server is running at http://localhost:${port}`)
  })
Next, let's build the Docker image and push it to Google Artifact Registry or Docker Hub. You can refer to my guide on how to do this here.
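In short, the build-and-push step looks roughly like this (the image name and registry path are placeholders — substitute your own):

```shell
docker build -t express-ts .
docker tag express-ts <your-registry>/express-ts:latest
docker push <your-registry>/express-ts:latest
```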
2. Deploy the image using a Deployment and a LoadBalancer service
Create a deployment.yml file that includes the configuration for the Deployment to deploy the image you built, along with a LoadBalancer Service, as shown below:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-name
  labels:
    name: label-name
spec:
  selector:
    matchLabels:
      app: label-name
  template:
    metadata:
      labels:
        app: label-name
    spec:
      restartPolicy: Always
      containers:
        - name: express-ts
          image: express-ts
          resources:
            requests: # minimum guaranteed resources
              memory: "100Mi"
              cpu: "100m"
            limits: # maximum allowed resources
              memory: "300Mi"
              cpu: "300m"
---
apiVersion: v1
kind: Service
metadata:
  name: service-name
  labels:
    service: label-name
spec:
  selector:
    app: label-name
  type: LoadBalancer
  ports:
    - protocol: TCP
      port: 80 # Service port
      targetPort: 3000 # container port
I've explained the details about deployment and the LoadBalancer service in this article.
Here, we also configure resources. You can set CPU, memory, or both, depending on which metric you want to scale on.
- If you define resource requests, the HPA can scale based on a percentage (Utilization) of the requested resources.
- If you don't define resource requests, you must give the HPA absolute target values (AverageValue) instead.
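For illustration, an HPA metric target using an absolute value instead of a percentage would look like this (a hypothetical snippet, not used in this article's setup):

```yaml
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: AverageValue
        averageValue: 200m # scale up when average CPU per Pod exceeds 200 millicores
```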
3. Configuring HPA for Auto-scaling Resources
You can include the HPA configuration either in your deployment.yml file or in a separate file with the following content:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deployment-name # targets the Deployment above
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu # scale based on CPU
        target:
          type: Utilization
          averageUtilization: 80 # target 80%
  behavior:
    scaleDown:
      policies:
        - type: Pods
          value: 3
          periodSeconds: 30
      stabilizationWindowSeconds: 120
- minReplicas, maxReplicas: the minimum and maximum number of replicas the HPA may scale to.
- metrics: defines the metric to scale on; in this case, CPU.
- averageUtilization: a percentage of the requested CPU. When the average utilization across Pods exceeds this value, the HPA scales up.
- behavior: optional. Here it defines the scale-down behavior, allowing at most 3 Pods to be removed every 30 seconds.
- stabilizationWindowSeconds: the HPA only scales down after the load has stayed below the target for this duration (the default is 300 seconds, i.e. 5 minutes).
Next, apply the file to create the resources:
kubectl apply -f deployment.yml
Note: I've defined all resources in a single file for simplicity, but in practice, you should separate each resource into individual YAML files for better management.
Once created, you can inspect the resources with kubectl.
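For example, with the resource names used in this article:

```shell
kubectl get deployment deployment-name
kubectl get service service-name   # note the EXTERNAL-IP column
kubectl get hpa hpa-name
```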
Please ensure that this API is working so we can continue testing the HPA.
Testing HPA
You can test the API using any method you know. Here, I provide a code block to send 10 requests every second. Replace the URL with the EXTERNAL-IP of the LoadBalancer service.
const numOfRequest = 10
const url = 'http://172.23.0.3/sum?value=10000000'
let idx = 0

setInterval(() => {
  Promise.all(
    Array(numOfRequest)
      .fill(0)
      .map(() =>
        fetch(url)
          .then(res => res.json())
          .then(data => console.log('Completed', ++idx, data.duration))
          .catch(console.error)
      )
  )
}, 1000)
After running it, CPU usage on the Pods will gradually increase, triggering the auto-scaling process.
You can check if HPA has performed the auto-scaling as follows:
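For instance, you can watch the HPA and the Pods while the load test runs (again assuming the resource names used above):

```shell
kubectl get hpa hpa-name --watch   # shows current vs. target utilization and replica count
kubectl top pods                   # per-Pod CPU/memory usage
```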
You will notice that the number of replicas gradually increases when CPU usage exceeds the 80% target (the averageUtilization field) and gradually decreases after the system has remained stable for a while (the stabilizationWindowSeconds field).
See you in the next articles!
If you found this content helpful, please visit the original article on my blog to support the author and explore more interesting content.