Ab Initio Automation: How We Reduced 80% of Incidents Due to Connection Failures

Introduction

In my current role on a data integration team, we encountered frequent job failures caused by connection timeouts while processing data across different servers, databases, teams, and Amazon S3 buckets using Ab Initio. These failures not only disrupted our workflows but also required manual intervention, reducing the overall efficiency of the process.

In this blog, I’ll explain how I implemented an automated retry mechanism that resolved these issues, reduced manual interventions, and stabilized our processes.

The Problem: Connection Timeouts and Job Failures

Our daily tasks involved extracting data from various databases, performing transformations, and loading it back into different servers. This workflow required seamless communication across multiple systems, but we consistently faced:

  • Connection timeouts: Due to network issues, some jobs failed to complete within the allotted time, causing interruptions in data processing.
  • Partial loads: When a job failed midway due to a connection issue, it would leave data partially loaded into tables, requiring the entire process to be restarted manually.
  • Manual interventions: Every time a job failed, the team had to manually re-trigger it and either restart it from the beginning or clean up any partially loaded data.

The Solution: Automation with Retry Scripts

To address the frequent connection issues, I proposed the use of a retry script that automatically retries failed jobs a specified number of times until they successfully complete. This approach helped us avoid manual interventions, reducing downtime and improving the stability of the team’s workflow.


#!/bin/bash
# Arguments: sandbox path, pset name, maximum number of retries
sandbox=$1
pset=$2
MAX_RETRIES=$3
RETRY_DELAY=30   # seconds to wait between attempts
attempt=0

while [ $attempt -lt $MAX_RETRIES ]; do
  echo "Running Ab Initio job..."
  air sandbox run "$sandbox/$pset"

  if [ $? -eq 0 ]; then
    echo "Job completed successfully"
    exit 0
  else
    attempt=$((attempt + 1))
    echo "Job failed, attempt $attempt of $MAX_RETRIES"

    if [ $attempt -lt $MAX_RETRIES ]; then
      sleep $RETRY_DELAY
    fi
  fi
done

echo "Max retries reached. Job failed."
exit 1

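For reference, this is how such a wrapper could be invoked, assuming the script above is saved as retry_pset.sh and made executable; the script name, sandbox path, and pset name below are placeholders for whatever your environment uses:

./retry_pset.sh /path/to/sandbox my_job.pset 3

# Illustrative cron entry: schedule the wrapped job daily at 2 AM
0 2 * * * /path/to/retry_pset.sh /path/to/sandbox my_job.pset 3 >> /path/to/retry_pset.log 2>&1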

Key Points:

  • The script retries the job up to MAX_RETRIES times.
  • If the job fails, it waits for RETRY_DELAY seconds before retrying.
  • Upon success, the script exits. If all retries fail, the script stops and reports failure.

Key Benefits of the Retry Mechanism

  • 80% Reduction in On-Call Incidents: The automated retry mechanism drastically reduced the number of on-call incidents related to job failures caused by connection issues. The team no longer had to manually re-trigger jobs or deal with partial loads.
  • Process Stability: By automatically retrying jobs, our workflow became much more stable. The script handled intermittent connection problems seamlessly, allowing jobs to resume without intervention.
  • Improved Efficiency: With the retry logic and the recovery mechanism in place, we avoided the inefficiency of reloading entire files from the beginning; jobs could resume from the point of failure, improving overall performance.
  • Automation: Automation reduced the manual burden on the team, freeing up valuable time that could be spent on more strategic tasks. The need for urgent intervention at all hours was virtually eliminated.
  • Scalable Solution: This retry approach is not only effective for Ab Initio jobs but can be applied to any ETL or data processing scenario where connection-related failures occur, showing how a small piece of automation can drastically improve process reliability (see the generic sketch after this list).
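To illustrate how the pattern generalizes beyond air sandbox run, here is a minimal sketch of a shell function that retries an arbitrary command; the function name, the example command, and the retry/delay values are placeholders, not part of our production setup:

retry_command() {
  local max_retries=$1
  local delay=$2
  shift 2                      # remaining arguments form the command to run
  local attempt=0

  while [ $attempt -lt $max_retries ]; do
    "$@" && return 0           # run the command; stop as soon as it succeeds
    attempt=$((attempt + 1))
    echo "Command failed, attempt $attempt of $max_retries"
    [ $attempt -lt $max_retries ] && sleep $delay
  done
  return 1
}

# Example (placeholder script name): retry an extract/load step up to 3 times, 30 seconds apart
retry_command 3 30 ./extract_and_load.sh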

If you're facing similar issues in your ETL pipelines or workflows, consider implementing retry scripts tailored to your environment to overcome job failures caused by transient connection issues. Let me know, based on your experience, how we could have handled these issues better.

Python Version of the Automation Script

import time
import subprocess
import sys

# Constants
RETRY_DELAY = 30

def run_job(sandbox_path, pset_name):
    # Construct the command
    command = f"air sandbox run {sandbox_path}/{pset_name}"

    try:
        # Run the command using subprocess
        result = subprocess.run(command, shell=True, check=True)
        return result.returncode
    except subprocess.CalledProcessError as e:
        return e.returncode

def retry_job(sandbox_path, pset_name, max_retries):
    attempt = 0

    while attempt < max_retries:
        print(f"Running Ab Initio job... Attempt {attempt + 1} of {max_retries}")

        # Run the job
        return_code = run_job(sandbox_path, pset_name)

        if return_code == 0:
            print("Job completed successfully")
            return True
        else:
            attempt += 1
            print(f"Job failed, attempt {attempt} of {max_retries}")

            if attempt < max_retries:
                print(f"Retrying job after {RETRY_DELAY} seconds...")
                time.sleep(RETRY_DELAY)
            else:
                print("Max retries reached. Job failed.")
                return False

if __name__ == "__main__":
    # Accept parameters from the command line
    sandbox_path = sys.argv[1]
    pset_name = sys.argv[2]
    max_retries = int(sys.argv[3])

    # Start the retry process and surface the result as the exit code,
    # so schedulers and wrapper scripts can detect a final failure
    success = retry_job(sandbox_path, pset_name, max_retries)
    sys.exit(0 if success else 1)


Calling the Python script:

python retrypset.py sandbox_path pset_name max_retries

