Andrei Kniazev

Azure DevOps CD Self-Reverting AKS Failed Deployment

Recently, I was working on a migration from Azure Function Apps to Azure Kubernetes Service. While building the CI/CD pipeline, I decided to try a self-healing approach. The idea: what if we could detect failed deployments early and recover from them automatically?

(If you just need a full template, scroll to the end.)

How can problems be detected and reverted in a Kubernetes deployment?

We use a Deployment to manage workloads in Kubernetes. The power of a Deployment is that we can check its rollout status and roll it back.

After applying, we can check the status of the deployment to detect problems.

If we just run this command:

kubectl rollout status deployment/some-deployment

It exits with 0 if the deployment is ready. But if the deployment is not ready, for whatever reason, the command blocks and keeps waiting for the rollout to complete. To avoid this and exit with 1 when the deployment is not ready in time, we need to add a timeout argument:

kubectl rollout status deployment/some-deployment --timeout=1m

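With this argument, the status check gives up and returns a non-zero exit code if the rollout does not succeed within the given time. Note that it only stops the check; the rollout itself is not cancelled.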
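
As a minimal sketch of how a pipeline step might act on that exit code: the helper name rollout_ok and the optional KUBECTL variable are assumptions of this example (they are not part of kubectl); KUBECTL only exists so the binary can be swapped for a stub when trying the function out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 when the rollout completes within the
# timeout, and kubectl's non-zero exit code otherwise.
# KUBECTL is an assumption of this sketch; it defaults to the real kubectl.
rollout_ok() {
  local deployment="$1"
  local timeout="${2:-1m}"
  "${KUBECTL:-kubectl}" rollout status "deployment/${deployment}" --timeout="${timeout}"
}

# Usage (not executed here):
#   if rollout_ok some-deployment 1m; then
#     echo "healthy"
#   else
#     echo "not ready in time"
#   fi
```

Wrapping the call in a function keeps the exit code available to an if statement, so a script running with set -e does not abort before it can react to the failure.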

Reverting changes.

To revert changes in Kubernetes we can use this command:

kubectl rollout undo deployment/some-deployment
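
The two commands can be combined into one step. The following is only a sketch under assumptions: check_or_undo and the KUBECTL override are hypothetical names introduced here, and the return codes are a convention chosen for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait for the rollout and undo it if it does not
# become ready in time.
# Returns 0 if the new version is healthy, 1 if a rollback was triggered,
# 2 if the rollback command itself failed.
check_or_undo() {
  local deployment="$1"
  local timeout="${2:-1m}"
  local kubectl="${KUBECTL:-kubectl}"   # assumption: lets a stub replace kubectl

  if "$kubectl" rollout status "deployment/${deployment}" --timeout="${timeout}"; then
    return 0
  fi

  echo "Rollout of ${deployment} not ready within ${timeout}, reverting" >&2
  "$kubectl" rollout undo "deployment/${deployment}" || return 2
  return 1
}
```

Returning 1 after a successful rollback keeps the pipeline step red, so the failed release stays visible even though the cluster has been restored.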

Rollback Concept & Prerequisites.

The rollback mechanism in Kubernetes is designed to revert a deployment to a previous state if the current deployment is unhealthy or fails to meet the desired criteria.

Be aware that changes which are not reversible can lead to a situation where a rollback will not help.

A good example is a database schema change. Say a release ships schema changes together with new code that relies on them, but the new code has an error and the app cannot start.

If you then roll back to the previous version, the app will not work as intended, because the old code now runs against the new schema.

For automatic rollbacks to work, your team should always apply the concept of rolling migrations.

Example Scenario.

Consider an application that requires a database schema update. The rolling migration process would involve the following steps:

  • Deploy Code Supporting New Schema (Disabled): First, deploy the new application code that supports the new schema, but keep the new features disabled.
  • Deploy Infrastructure Changes: Apply the database schema changes.
  • Health Check: Introduce health checks to validate that the application can connect to the updated database schema.
  • Enable New Features: Once the health checks pass, enable the new features in the application.
  • Monitor Deployment: Continuously monitor the deployment to ensure everything is functioning correctly.
  • Rollback if Necessary: If any issues are detected, roll back to the previous version of the application.

In this example the code stays compatible with the previous schema, so the app works as intended even after a rollback.
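
The last three steps of the scenario (enable, monitor, roll back) could be sketched like this. Everything named here is an assumption made for illustration: the function enable_feature_or_revert, the flag FEATURE_NEW_SCHEMA, and the KUBECTL override are not from any real project, and a real application may toggle features in a different way.

```shell
#!/usr/bin/env bash
# Hypothetical helper for the "enable, monitor, roll back" steps.
# FEATURE_NEW_SCHEMA is an invented flag name; KUBECTL defaults to kubectl.
# Returns 0 if the feature is enabled and healthy, 1 if it was reverted,
# 2 if a kubectl command failed unexpectedly.
enable_feature_or_revert() {
  local deployment="$1"
  local timeout="${2:-1m}"
  local kubectl="${KUBECTL:-kubectl}"

  # Changing the pod template environment starts a new rollout.
  "$kubectl" set env "deployment/${deployment}" FEATURE_NEW_SCHEMA=true || return 2

  if "$kubectl" rollout status "deployment/${deployment}" --timeout="${timeout}"; then
    return 0
  fi

  # The previous revision is the same code with the flag off, which the
  # scenario assumes still works with the updated schema.
  "$kubectl" rollout undo "deployment/${deployment}" || return 2
  return 1
}
```

Because the schema change and the feature switch are separate rollouts, undoing the second one never has to touch the database.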

Better Deployment Status.

A deployment can get stuck and never complete for any of the following reasons:

  • Insufficient quota
  • Readiness probe failures
  • Image pull errors
  • Insufficient permissions
  • Limit ranges
  • Application runtime misconfiguration
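
To find out whether a rollout has stalled, one option is to read the Deployment's Progressing condition. This is a small sketch, assuming a hypothetical helper name (rollout_progress_reason) and the same optional KUBECTL override used only for stubbing; the jsonpath expression itself is standard kubectl syntax.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the reason of the Deployment's "Progressing"
# condition, for example ProgressDeadlineExceeded when a rollout has stalled.
rollout_progress_reason() {
  local deployment="$1"
  "${KUBECTL:-kubectl}" get deployment "${deployment}" \
    -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}'
}

# Usage (not executed here):
#   rollout_progress_reason some-deployment
```

The reason alone does not say which item in the list is responsible; kubectl describe deployment and the pod events usually give the detail.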

Most of the problems in this list come from misconfigured deployments or clusters. Readiness probe failures are different: the probe is something we define ourselves, so we can use it to make the status check cover more sophisticated failure scenarios.

If we point the readiness probe at our health check URL and extend that health check to validate more scenarios, such as whether the application can connect to external services like databases with the required permissions, our deployment status check becomes much more meaningful.
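
As an illustration of wiring such a health check into a Deployment, the sketch below prints a strategic merge patch that adds an HTTP readiness probe. The helper name readiness_patch, the default path /health/ready, and the probe timings are placeholders chosen for this example, not values from the template in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a strategic merge patch that adds an HTTP
# readiness probe to one container. Container name, port and path are
# placeholders supplied by the caller.
readiness_patch() {
  local container="$1"
  local port="$2"
  local path="${3:-/health/ready}"
  cat <<EOF
spec:
  template:
    spec:
      containers:
      - name: ${container}
        readinessProbe:
          httpGet:
            path: ${path}
            port: ${port}
          periodSeconds: 10
          failureThreshold: 3
EOF
}

# Usage (not executed here):
#   readiness_patch app 8080 /health/ready > probe-patch.yaml
#   kubectl patch deployment some-deployment --patch-file probe-patch.yaml
```

With a probe like this in place, pods of a broken release never become ready, so the rollout status check times out and the rollback path is taken.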

Template Itself.

Azure DevOps yml pipeline template to validate AKS deployment and roll back if needed.

parameters:
- name: environment
  type: string
  displayName: Required. Represents the stage of deployment. Usually it is Development, Test, Acceptance, or Production. It is needed so you can have multiple stages in the same pipeline.
- name: azureServiceConnection
  displayName: Required. Azure service connection that will provide access to ARM.
- name: aksName
  displayName: Required. Name of the AKS.
- name: aksRg
  displayName: Required. Name of the resource group where AKS is.
- name: aksSubscription
  displayName: Required. Azure subscription ID where AKS is.
- name: namespace
  displayName: Required. Namespace of the k8s deployment to check.
- name: deploymentName
  displayName: Required. Name of the k8s deployment to check.
- name: timeout
  displayName: Optional. Timeout for the deployment check. The default is 1m. Example 1m, 2m, 10m
  default: 1m
- name: dependsOn
  type: object
  displayName: Optional. Pass list of previous stages to depend on them.
  default: []

stages:
  - stage: AksDeploymentHealthAndRollback${{ parameters.environment}}
    displayName: ${{ parameters.environment}} — K8S Deployment Health and Rollback
    dependsOn: ${{ parameters.dependsOn }}
    jobs:
      - job: AksDeploymentHealthCheck
        displayName: ${{ parameters.environment}} — K8S Deployment Health
        steps:
        - checkout: self
          displayName: Checkout
          fetchTags: false
        - task: AzureCLI@2
          displayName: K8S Deployment Health
          inputs:
            azureSubscription: ${{ parameters.azureServiceConnection}}
            scriptType: bash
            scriptLocation: inlineScript
            inlineScript: |
              set -e # fail when script is failing
              sudo az aks install-cli

              az aks get-credentials --name ${{ parameters.aksName }} \
                --resource-group ${{ parameters.aksRg }} \
                --subscription ${{ parameters.aksSubscription }} \
                --overwrite-existing \
                --file .kubeconfig-${{ parameters.aksName }}

              export KUBECONFIG=$(pwd)/.kubeconfig-${{ parameters.aksName }}

              # Set default namespace
              kubectl config set-context ${{ parameters.aksName }} --namespace=${{ parameters.namespace }}
              kubectl config get-contexts

              # Pass kubeconfig to kubelogin to access k8s API
              kubelogin convert-kubeconfig -l azurecli

              # Check the rollout status of the deployment
              kubectl rollout status deployment/${{ parameters.deploymentName }} --timeout=${{ parameters.timeout}}

      - job: ManualApprovalOfRollBack
        displayName: ${{ parameters.environment}} — Manual Approval Of Rollback
        dependsOn: AksDeploymentHealthCheck
        condition: failed()
        pool: server
        steps:
        - task: ManualValidation@0
          displayName: Approve Rollback
          timeoutInMinutes: 1440 # task times out in 1 day

      - job: Rollback
        displayName: ${{ parameters.environment}} — K8S Deployment Rollback
        dependsOn: ManualApprovalOfRollBack
        condition: eq(dependencies.ManualApprovalOfRollBack.result, 'Succeeded')
        steps:
        - checkout: self
          displayName: Checkout
          fetchTags: false
        - task: AzureCLI@2
          displayName: K8S Rollback
          inputs:
            azureSubscription: ${{ parameters.azureServiceConnection}}
            scriptType: bash
            scriptLocation: inlineScript
            inlineScript: |
              set -e # fail when script is failing
              sudo az aks install-cli

              az aks get-credentials --name ${{ parameters.aksName }} \
                --resource-group ${{ parameters.aksRg }} \
                --subscription ${{ parameters.aksSubscription }} \
                --overwrite-existing \
                --file .kubeconfig-${{ parameters.aksName }}

              export KUBECONFIG=$(pwd)/.kubeconfig-${{ parameters.aksName }}

              # Set default namespace
              kubectl config set-context ${{ parameters.aksName }} --namespace=${{ parameters.namespace }}
              kubectl config get-contexts

              # Pass kubeconfig to kubelogin to access k8s API
              kubelogin convert-kubeconfig -l azurecli

              # Rollback the deployment
              kubectl rollout undo deployment/${{ parameters.deploymentName }}

      - job: AksRollbackHealthCheck
        displayName: ${{ parameters.environment}} — K8S Rollback Health
        dependsOn: Rollback
        condition: eq(dependencies.Rollback.result, 'Succeeded')
        steps:
        - checkout: self
          displayName: Checkout
          fetchTags: false
        - task: AzureCLI@2
          displayName: K8S Rollback Health
          inputs:
            azureSubscription: ${{ parameters.azureServiceConnection}}
            scriptType: bash
            scriptLocation: inlineScript
            inlineScript: |
              set -e # fail when script is failing
              sudo az aks install-cli

              az aks get-credentials --name ${{ parameters.aksName }} \
                --resource-group ${{ parameters.aksRg }} \
                --subscription ${{ parameters.aksSubscription }} \
                --overwrite-existing \
                --file .kubeconfig-${{ parameters.aksName }}

              export KUBECONFIG=$(pwd)/.kubeconfig-${{ parameters.aksName }}

              # Set default namespace
              kubectl config set-context ${{ parameters.aksName }} --namespace=${{ parameters.namespace }}
              kubectl config get-contexts

              # Pass kubeconfig to kubelogin to access k8s API
              kubelogin convert-kubeconfig -l azurecli

              # Check the rollout status of the deployment
              kubectl rollout status deployment/${{ parameters.deploymentName }} --timeout=${{ parameters.timeout}}

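      - job: Clean
        displayName: Clean Up
        dependsOn: [ AksDeploymentHealthCheck, ManualApprovalOfRollBack, Rollback, AksRollbackHealthCheck ]
        condition: always()
        steps:
        - checkout: none
        - script: |
            # The jobs above write credentials to .kubeconfig-<aksName> in their
            # working directory, not to ~/.kube/config, so remove that file as well
            # (assumes this step runs in the same default working directory).
            rm -f ".kubeconfig-${{ parameters.aksName }}"
            rm -rf ~/.kube/config
          displayName: Remove kube config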

