Recently, I was working on a migration from Azure Function Apps to Azure Kubernetes Service. While building a CI/CD pipeline, I decided to try a self-healing approach: what if the pipeline could detect failures early and recover from them automatically?
(If you just need a full template, scroll to the end.)
How can problems be detected and reverted in a Kubernetes deployment?
In Kubernetes we use a Deployment to manage workloads, and the power of a Deployment is that we can check its rollout status and roll it back.
After applying, we can check the status of the deployment to detect problems.
If we just run this command:
kubectl rollout status deployment/some-deployment
It will exit with code 0 if the deployment is ready. But if the deployment is not ready and fails for some reason, the command will block and wait indefinitely for the rollout to complete. To escape this behaviour and exit with code 1 when the deployment is not ready in time, we need to add a timeout argument:
kubectl rollout status deployment/some-deployment --timeout=1m
With this argument the status check fails if the rollout does not succeed within the given timeframe.
Reverting changes.
To revert changes in Kubernetes we can use this command:
kubectl rollout undo deployment/some-deployment
Rollback Concept & Prerequisites.
The rollback mechanism in Kubernetes is designed to revert a deployment to a previous state if the current deployment is unhealthy or fails to meet the desired criteria.
Be aware that any change that is not reversible can lead to a situation where a rollback will not help.
A good example is a database schema change. Suppose during a deployment you apply schema changes together with the new code that supports them, but something goes wrong: the new code has an error and the app cannot start.
If you roll back to the previous version, the app will still not work as intended, because the schema has already changed.
To make automatic rollbacks work, your team should always apply the concept of Rolling Migrations.
Example Scenario.
Consider an application that requires a database schema update. The rolling migration process would involve the following steps:
- Deploy Code Supporting New Schema (Disabled): First, deploy the new application code that supports the new schema, but keep the new features disabled.
- Deploy Infrastructure Changes: Apply the database schema changes.
- Health Check: Introduce health checks to validate that the application can connect to the updated database schema.
- Enable New Features: Once the health checks pass, enable the new features in the application.
- Monitor Deployment: Continuously monitor the deployment to ensure everything is functioning correctly.
- Rollback if Necessary: If any issues are detected, rollback to the previous version of the application.
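The "deploy disabled, then enable" steps above are often implemented with a feature flag in configuration. A minimal sketch using a ConfigMap — the names `app-config` and `NEW_SCHEMA_FEATURES` are illustrative placeholders, not from the original setup:

```yaml
# Illustrative ConfigMap-based feature flag; all names are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  # Step 1: ship the new code with the feature switched off.
  # Step 4: flip to "true" once the schema change and health checks pass.
  NEW_SCHEMA_FEATURES: "false"
```

Because the flag flip is a separate, reversible step, a rollback of the code never races against a half-enabled feature.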
In this example the code is compatible with the previous schema, so even in the case of a rollback the app will work as intended.
Better Deployment Status.
A deployment can get stuck and never complete for reasons such as:
- Insufficient quota
- Readiness probe failures
- Image pull errors
- Insufficient permissions
- Limit ranges
- Application runtime misconfiguration
Most of these problems come from a misconfigured deployment or cluster, but one of them — the readiness probe — can be customised to make our check even more robust.
If we point the readiness probe at our own health-check endpoint and extend that endpoint to validate more scenarios, such as whether the application can connect to external services like databases with the required permissions, our deployment status check becomes far more meaningful.
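As a sketch, a readiness probe pointing at such an endpoint might look like the fragment below. The `/healthz/ready` path, port, and container name are placeholders for your own application:

```yaml
# Fragment of a Deployment's pod template; names and paths are illustrative.
spec:
  template:
    spec:
      containers:
        - name: some-app
          readinessProbe:
            httpGet:
              path: /healthz/ready  # endpoint extended to check DB connectivity, permissions, etc.
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
```

If the extended endpoint keeps failing, the pod never becomes ready, `kubectl rollout status` times out, and the rollback path of the pipeline kicks in.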
Template Itself.
Azure DevOps yml pipeline template to validate AKS deployment and roll back if needed.
parameters:
- name: environment
type: string
displayName: Required. Represents the stage of deployment. Usually it is Development, Test, Acceptance, Production. It is needed so you can have multiple stages in the same pipeline.
- name: azureServiceConnection
displayName: Required. Azure service connection that will provide access to ARM.
- name: aksName
displayName: Required. Name of the AKS.
- name: aksRg
displayName: Required. Name of the resource group where AKS is.
- name: aksSubscription
displayName: Required. Azure subscription ID where AKS is.
- name: namespace
displayName: Required. Namespace of the k8s deployment to check.
- name: deploymentName
displayName: Required. Name of the k8s deployment to check.
- name: timeout
displayName: Optional. Timeout for the deployment check. The default is 1m. Examples: 1m, 2m, 10m.
default: 1m
- name: dependsOn
type: object
displayName: Optional. Pass list of previous stages to depend on them.
default: []
stages:
- stage: AksDeploymentHealthAndRollback${{ parameters.environment}}
displayName: ${{ parameters.environment}} — K8S Deployment Health and Rollback
dependsOn: ${{ parameters.dependsOn }}
jobs:
- job: AksDeploymentHealthCheck
displayName: ${{ parameters.environment}} — K8S Deployment Health
steps:
- checkout: self
displayName: Checkout
fetchTags: false
- task: AzureCLI@2
displayName: K8S Deployment Health
inputs:
azureSubscription: ${{ parameters.azureServiceConnection}}
scriptType: bash
scriptLocation: inlineScript
inlineScript: |
set -e # exit immediately if any command fails
sudo az aks install-cli
az aks get-credentials --name ${{ parameters.aksName }} \
--resource-group ${{ parameters.aksRg }} \
--subscription ${{ parameters.aksSubscription }} \
--overwrite-existing \
--file .kubeconfig-${{ parameters.aksName }}
export KUBECONFIG=$(pwd)/.kubeconfig-${{ parameters.aksName }}
# Set default namespace
kubectl config set-context ${{ parameters.aksName }} --namespace=${{ parameters.namespace }}
kubectl config get-contexts
# Pass kubeconfig to kubelogin to access k8s API
kubelogin convert-kubeconfig -l azurecli
# Check the rollout status of the deployment
kubectl rollout status deployment/${{ parameters.deploymentName }} --timeout=${{ parameters.timeout}}
- job: ManualApprovalOfRollBack
displayName: ${{ parameters.environment}} — Manual Approval Of Rollback
dependsOn: AksDeploymentHealthCheck
condition: failed()
pool: server
steps:
- task: ManualValidation@0
displayName: Approve Rollback
timeoutInMinutes: 1440 # task times out in 1 day
- job: Rollback
displayName: ${{ parameters.environment}} — K8S Deployment Rollback
dependsOn: ManualApprovalOfRollBack
condition: eq(dependencies.ManualApprovalOfRollBack.result, 'Succeeded')
steps:
- checkout: self
displayName: Checkout
fetchTags: false
- task: AzureCLI@2
displayName: K8S Rollback
inputs:
azureSubscription: ${{ parameters.azureServiceConnection}}
scriptType: bash
scriptLocation: inlineScript
inlineScript: |
set -e # exit immediately if any command fails
sudo az aks install-cli
az aks get-credentials --name ${{ parameters.aksName }} \
--resource-group ${{ parameters.aksRg }} \
--subscription ${{ parameters.aksSubscription }} \
--overwrite-existing \
--file .kubeconfig-${{ parameters.aksName }}
export KUBECONFIG=$(pwd)/.kubeconfig-${{ parameters.aksName }}
# Set default namespace
kubectl config set-context ${{ parameters.aksName }} --namespace=${{ parameters.namespace }}
kubectl config get-contexts
# Pass kubeconfig to kubelogin to access k8s API
kubelogin convert-kubeconfig -l azurecli
# Rollback the deployment
kubectl rollout undo deployment/${{ parameters.deploymentName }}
- job: AksRollbackHealthCheck
displayName: ${{ parameters.environment}} — K8S Rollback Health
dependsOn: Rollback
condition: eq(dependencies.Rollback.result, 'Succeeded')
steps:
- checkout: self
displayName: Checkout
fetchTags: false
- task: AzureCLI@2
displayName: K8S Rollback Health
inputs:
azureSubscription: ${{ parameters.azureServiceConnection}}
scriptType: bash
scriptLocation: inlineScript
inlineScript: |
set -e # exit immediately if any command fails
sudo az aks install-cli
az aks get-credentials --name ${{ parameters.aksName }} \
--resource-group ${{ parameters.aksRg }} \
--subscription ${{ parameters.aksSubscription }} \
--overwrite-existing \
--file .kubeconfig-${{ parameters.aksName }}
export KUBECONFIG=$(pwd)/.kubeconfig-${{ parameters.aksName }}
# Set default namespace
kubectl config set-context ${{ parameters.aksName }} --namespace=${{ parameters.namespace }}
kubectl config get-contexts
# Pass kubeconfig to kubelogin to access k8s API
kubelogin convert-kubeconfig -l azurecli
# Check the rollout status of the deployment
kubectl rollout status deployment/${{ parameters.deploymentName }} --timeout=${{ parameters.timeout}}
- job: Clean
displayName: Clean Up
dependsOn: [ AksDeploymentHealthCheck, ManualApprovalOfRollBack, Rollback, AksRollbackHealthCheck ]
condition: always()
steps:
- checkout: none
- script: |
rm -rf ~/.kube/config
displayName: Remove kube config
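To consume the template, reference it from the stage list of your pipeline. A minimal sketch, assuming the template is saved as `aks-health-rollback.yml` next to your pipeline file (the file name and all parameter values below are placeholders):

```yaml
# azure-pipelines.yml — hypothetical consumer of the template above
stages:
  - stage: Deploy
    jobs: []
      # ... your deployment jobs go here ...

  - template: aks-health-rollback.yml  # placeholder path to the template
    parameters:
      environment: Test
      azureServiceConnection: my-service-connection  # placeholder
      aksName: my-aks-cluster                        # placeholder
      aksRg: my-aks-rg                               # placeholder
      aksSubscription: 00000000-0000-0000-0000-000000000000  # placeholder
      namespace: my-namespace
      deploymentName: some-deployment
      timeout: 2m
      dependsOn: [ Deploy ]
```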
Template behaviour: on a healthy deployment only the health-check job runs; on failure, the pipeline waits for manual approval, rolls back, and then verifies the rollback the same way.