Profiling is an essential part of developing genomics workflows. By identifying and eliminating expensive bottlenecks, users can make workflows performant and cost efficient. A common way to profile a workflow is to generate a timing chart of all tasks executed. The Amazon Genomics CLI (AGC) provides a unified experience for running workflows across multiple workflow engines. In doing so, it also enables a common way to generate timing charts and profile workflows. In this blog I’ll show you how this can be done with AGC and a little Python coding.
As a prerequisite, you’ll need to install AGC and activate your AWS account region for running workflows. This is covered in AGC’s documentation.
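If you've already installed AGC, account activation is a single command (shown here for reference; see the AGC documentation for the authoritative setup steps):

# activate your AWS account/region for use with AGC
agc account activate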
To run a workflow with AGC you use commands like the following - in this case I’m running the GATK data processing workflow that is included as an example when you install AGC:
cd ~/amazon-genomics-cli/examples/gatk-best-practices-project
agc context deploy onDemandCtx
agc workflow run -c onDemandCtx gatk4-data-processing
# returns a workflow run-id like:
# 20ceffe1-9664-4f41-a5e9-66ab5eaf57ac
Once the workflow completes, you can get task execution information from the workflow logs using:
agc logs workflow gatk4-data-processing -r 20ceffe1-9664-4f41-a5e9-66ab5eaf57ac
which looks like:
RunId: 20ceffe1-9664-4f41-a5e9-66ab5eaf57ac
State: COMPLETE
Tasks:
Name JobId StartTime StopTime ExitCode
HaplotypeCallerGvcf_GATK4.MergeGVCFs 12d27ad2-df4e-44f7-90ba-6140bde14cf0 2022-03-24 17:50:06.575 +0000 UTC 2022-03-24 18:00:34.439 +0000 UTC 0
HaplotypeCallerGvcf_GATK4.HaplotypeCaller 39e61999-079a-4671-9625-9a1fada62c6b 2022-03-24 17:29:12.955 +0000 UTC 2022-03-24 17:48:22.475 +0000 UTC 0
HaplotypeCallerGvcf_GATK4.HaplotypeCaller ce9a6f75-9c9d-4e55-b4f9-1f46c516a3b5 2022-03-24 17:29:12.961 +0000 UTC 2022-03-24 17:47:25.444 +0000 UTC 0
...
The output from agc logs workflow provides start and stop times for each task run by the workflow. We can use this to generate a simple Gantt chart of the workflow tasks. After dropping the above output into a file, you can parse out the tasks with the following Python code:
#!/bin/env python3
from datetime import datetime
import re

DT_FMTS = [
    '%Y-%m-%dT%H:%M:%SZ', '%Y-%m-%dT%H:%M:%S.%fZ',
    '%Y-%m-%dT%H:%M:%S+00:00', '%Y-%m-%dT%H:%M:%S.%f+00:00',
    '%Y-%m-%d %H:%M:%S.%f %z %Z', '%Y-%m-%d %H:%M:%S %z %Z',
]

def parse_datetime(s, fmts=DT_FMTS):
    """parse a datetime string using the first format in fmts that matches"""
    for fmt in fmts:
        try:
            dt = datetime.strptime(s, fmt)
            return dt
        except ValueError:
            pass
    else:
        raise ValueError(f"unable to parse datetime from '{s}'")

def parse_agc_log(log_file):
    """parse the text output of `agc logs workflow` into run metadata and task timings"""
    with open(log_file, 'r') as f:
        lines = f.readlines()
    lines = list(filter(lambda x: x.strip(), lines))

    # the first two non-blank lines are "RunId: ..." and "State: ..."
    data = {
        k: lines[i].split(":")[1].strip()
        for i, k in enumerate(['runid', 'state'])
    }

    # task rows start after the "Tasks:" line and the column header line
    data['tasks'] = [
        dict(zip(
            ['name', 'job_id', 'start_time', 'end_time', 'exit_code'],
            re.split(r'\t+', line.strip())
        ))
        for line in lines[4:]
    ]

    for i, task in enumerate(data['tasks']):
        for k in ('start_time', 'end_time'):
            task[k] = parse_datetime(task[k])
        data['tasks'][i] = task

    data['tasks'] = sorted(data['tasks'], key=lambda x: (x['start_time'], x['name']))
    return data
Using the parse_agc_log() function above looks like:
parse_agc_log('logs/gatk4-data-processing__20ceffe1-9664-4f41-a5e9-66ab5eaf57ac.agc.log')
which will return something like:
{'runid': '20ceffe1-9664-4f41-a5e9-66ab5eaf57ac',
'state': 'COMPLETE',
'tasks': [{'name': 'HaplotypeCallerGvcf_GATK4.MergeGVCFs',
'job_id': '26fcfea4-cfbc-4f04-95ef-115eb309f3df',
'start_time': datetime.datetime(2022, 4, 11, 1, 55, 53, 734000, tzinfo=datetime.timezone(datetime.timedelta(0), 'UTC')),
'end_time': datetime.datetime(2022, 4, 11, 2, 6, 40, 418000, tzinfo=datetime.timezone(datetime.timedelta(0), 'UTC')),
'exit_code': '0'},
{'name': 'HaplotypeCallerGvcf_GATK4.HaplotypeCaller',
'job_id': 'a9b3e5f1-cbd8-482b-87ba-df34e0252571',
'start_time': datetime.datetime(2022, 4, 11, 1, 36, 5, 426000, tzinfo=datetime.timezone(datetime.timedelta(0), 'UTC')),
'end_time': datetime.datetime(2022, 4, 11, 1, 52, 42, 457000, tzinfo=datetime.timezone(datetime.timedelta(0), 'UTC')),
'exit_code': '0'},
...
This works for parsing workflow tasks in most cases. However, when a task name is long it can swallow the separator between the task name and the task job-id. For example:
# ok line
PreProcessingForVariantDiscovery_GATK4.MergeBamAlignment 7e244b7b-e02d-4bb9-bee7-9005204382f3 2022-03-24 10:17:20.673 +0000 UTC 2022-03-24 10:23:00.67 +0000 UTC 0
# bad line
PreProcessingForVariantDiscovery_GATK4.CreateSequenceGroupingTSV85c5a61b-867e-49ed-a11d-951910444c03 2022-03-24 10:05:53.583 +0000 UTC 2022-03-24 10:11:55.023 +0000 UTC 0
This makes later steps that rely on the job-id difficult. The JSON formatter for agc is not currently used by the agc logs workflow command. I created an issue to address this to help the team prioritize it. I could probably solve this in the short term with some regular expressions, but I'd rather not. There is another way to consistently get machine-parseable output.
Under the hood, agc uses the GA4GH Workflow Execution Service (WES) API to communicate with AWS resources to run workflows. Each AGC context deploys a GA4GH WES endpoint which you can send requests to directly. The endpoint requires AWS SigV4 signed requests, so the easiest way to talk to it is with a tool like awscurl, which does the request signing for you.
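If you don't already have awscurl, it's a Python tool published on PyPI, so one way to install it (assuming a working Python environment) is:

pip install awscurl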
To get the WES endpoint for a specific context you can use the agc context describe command:
agc context describe onDemandCtx --format json
which returns something like:
{
"Name": "onDemandCtx",
"Status": "STARTED",
"StatusReason": "",
"MaxVCpus": 256,
"RequestSpotInstances": false,
"InstanceTypes": null,
"Output": {
"Url": "s3://agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtx"
},
"WesEndpoint": {
"Url": "https://yrtxeqnhkb.execute-api.us-east-1.amazonaws.com/prod/"
}
}
Note the WES endpoint Url needs to have “ga4gh/wes/v1” appended to it. We can extract this value using jq:
wes=$(agc context describe onDemandCtx --format json | jq -r '.WesEndpoint.Url + "ga4gh/wes/v1"')
We can now use the WesEndpoint to send a GA4GH WES GetRunLog request for the workflow:
awscurl ${wes}/runs/20ceffe1-9664-4f41-a5e9-66ab5eaf57ac
which will return something like the following:
{
"outputs": {
"id": "20ceffe1-9664-4f41-a5e9-66ab5eaf57ac",
"outputs": {
"HaplotypeCallerGvcf_GATK4.output_vcf": "s3://agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-MergeGVCFs/NA12878.g.vcf.gz",
"HaplotypeCallerGvcf_GATK4.output_vcf_index": "s3://agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-MergeGVCFs/NA12878.g.vcf.gz.tbi"
}
},
"request": {
"workflow_params": {},
"workflow_type": "WDL",
"workflow_type_version": "1.0"
},
"run_id": "0612738f-06cd-4d96-9438-a69c432dc59e",
"state": "COMPLETE",
"task_logs": [
{
"cmd": [
"set -e\n\n/gatk/gatk --java-options \"-Xmx2G\" \\\nMergeVcfs \\\n--INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-0/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-1/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-2/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-3/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-4/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-5/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-6/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-7/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-733263974272-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-8/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-9/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-10/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-11/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-12/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-13/NA12878.g.vcf.gz --INPUT 
/cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-14/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-15/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-16/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-17/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-18/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-19/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-20/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-21/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-22/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-23/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-733263974272-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-24/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-25/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-26/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-27/NA12878.g.vcf.gz --INPUT 
/cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-28/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-29/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-30/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-31/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-32/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-33/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-34/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-35/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-36/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-37/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-38/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-39/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-40/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-41/NA12878.g.vcf.gz --INPUT 
/cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-42/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-43/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-44/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-45/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-46/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-47/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-48/NA12878.g.vcf.gz --INPUT /cromwell_root/agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-49/NA12878.g.vcf.gz \\\n--OUTPUT NA12878.g.vcf.gz"
],
"end_time": "2022-04-11T02:06:40.418Z",
"exit_code": "0",
"name": "HaplotypeCallerGvcf_GATK4.MergeGVCFs|26fcfea4-cfbc-4f04-95ef-115eb309f3df",
"start_time": "2022-04-11T01:55:53.734Z",
"stderr": "s3://agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-MergeGVCFs/MergeGVCFs-stderr.log",
"stdout": "s3://agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-MergeGVCFs/MergeGVCFs-stdout.log"
},
{
"cmd": [
"set -e\n\n/gatk/gatk --java-options \"-Xmx6G -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10\" \\\nHaplotypeCaller \\\n-R /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \\\n-I s3://agc-733263974272-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-CramToBamTask/NA12878.bam \\\n-L /cromwell_root/gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals/temp_0001_of_50/scattered.interval_list \\\n-O NA12878.g.vcf.gz \\\n-contamination 0 \\\n-G StandardAnnotation -G StandardHCAnnotation -G AS_StandardAnnotation \\\n-GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \\\n-ERC GVCF \\\n \\\n\n\n# Cromwell doesn't like optional task outputs, so we have to touch this file.\ntouch NA12878.g.vcf.gz.bamout.bam"
],
"end_time": "2022-04-11T01:52:42.457Z",
"exit_code": "0",
"name": "HaplotypeCallerGvcf_GATK4.HaplotypeCaller|a9b3e5f1-cbd8-482b-87ba-df34e0252571",
"start_time": "2022-04-11T01:36:05.426Z",
"stderr": "s3://agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-0/HaplotypeCaller-0-stderr.log",
"stdout": "s3://agc-111122223333-us-east-1/project/GATKwdl/userid/user33IfmN/context/onDemandCtxCromwell1/cromwell-execution/HaplotypeCallerGvcf_GATK4/0612738f-06cd-4d96-9438-a69c432dc59e/call-HaplotypeCaller/shard-0/HaplotypeCaller-0-stdout.log"
},
...
],
...
}
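To capture this output, you can redirect the awscurl response to a local file (the filename here is just illustrative):

awscurl ${wes}/runs/20ceffe1-9664-4f41-a5e9-66ab5eaf57ac > gatk4-runlog.json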
With the GetRunLog response saved to a file, we can process it with the following Python code:
#!/bin/env python3
import json

def parse_wes_runlog(log_file):
    """parse a GA4GH WES GetRunLog response (JSON) into run and task timing data"""
    with open(log_file, 'r') as f:
        runlog = json.load(f)

    tasks = runlog['task_logs']

    if runlog.get('run_log'):
        for prop in ('start_time', 'end_time'):
            runlog["run_log"][prop] = parse_datetime(runlog["run_log"][prop])

    for i, task in enumerate(tasks):
        for k in ('start_time', 'end_time'):
            tasks[i][k] = parse_datetime(task[k])
        # task names are reported as "<task name>|<AWS Batch job-id>"
        name, job_id = task['name'].split("|")
        tasks[i]['job_name'] = name
        tasks[i]['job_id'] = job_id

    runlog['task_logs'] = sorted(tasks, key=lambda x: (x['start_time'], x['name']))
    return runlog
Now that we can get task timing information in a consistent manner, let’s do some plotting. For this, I’m going to use Bokeh which generates nice interactive plots.
#!/bin/env python3
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, output_notebook, show

TIME_SCALE_FACTORS = {
    "sec": 1, "min": 1/60, "hr": 1/3600
}

def plot_tasks_bokeh(tasks, time_units='min'):
    """plots task timing using engine reported timing (wes runlog)"""
    time_scale_factor = TIME_SCALE_FACTORS[time_units]
    tare = min([task['start_time'] for task in tasks])
    wall_time = (max([task['end_time'] for task in tasks]) - tare).total_seconds() * time_scale_factor

    stats = {
        'num_tasks': len(tasks),
        'timing': {
            'wallclock': wall_time,
            'units': time_units
        }
    }
    stats_text = (
        f"tasks: {stats['num_tasks']}, "
        f"wall_time: {stats['timing']['wallclock']:.2f} {stats['timing']['units']}"
    )

    source = ColumnDataSource(data=dict(
        y=range(len(tasks)),
        names=[task['job_name'] for task in tasks],
        running_left=[(task['start_time'] - tare).total_seconds() * time_scale_factor for task in tasks],
        running_right=[(task['end_time'] - tare).total_seconds() * time_scale_factor for task in tasks],
        text_x=[((task['end_time'] - tare).total_seconds() + 30) * time_scale_factor for task in tasks],
    ))

    output_notebook()
    p = figure(width=1280, height=800)
    p.hbar(y='y', left='running_left', right='running_right', height=0.8, source=source)
    p.text(x='text_x', y='y', text='names', alpha=0.4, text_baseline='middle', text_font_size='1.5ex', source=source)
    p.y_range.flipped = True
    p.xaxis.axis_label = f"task execution time ({time_units})"
    p.title.text = stats_text

    show(p)
    return p
Using this plotting function:
#!/bin/env python3
runlog = parse_wes_runlog('gatk4-runlog.json')  # the captured GetRunLog response (illustrative filename)
p = plot_tasks_bokeh(runlog['task_logs'])
we get a plot that looks like this:
This gives us a great view of what order tasks are run in, how long each task takes, how many shards a parallel task creates, and how each task contributes to the overall workflow run time. This is a good start, but we can go deeper.
AGC schedules workflow tasks to AWS Batch. Looking at each task name, we see that AWS Batch job-ids are encoded, which we've already split out in parse_wes_runlog() as job_name and job_id. Using these job-ids, we can query AWS Batch for additional job details. To do that we'll add the following Python functions to our script:
#!/bin/env python3
import boto3
from concurrent.futures import ThreadPoolExecutor, as_completed

def get_job_descriptions(batch, job_ids):
    jobs = []
    job_ids_sets = chunks(job_ids, 100)
    with ThreadPoolExecutor(max_workers=10) as executor:
        future_jobs = {
            executor.submit(batch.describe_jobs, jobs=job_ids_set): job_ids_set
            for job_ids_set in job_ids_sets
        }
        for future in as_completed(future_jobs):
            try:
                response = future.result()
                jobs += response["jobs"]
            except Exception as e:
                print(f"error retrieving job descriptions: {e}")
    descriptions = {job['jobId']: job for job in jobs}
    return descriptions

def chunks(l, n):
    """split list l into chunks of size n"""
    for i in range(0, len(l), n):
        yield l[i : i + n]
These functions gather AWS Batch job descriptions concurrently, in groups of 100, to accommodate the limits of the AWS Batch DescribeJobs API call.
We can now update our parse_wes_runlog() function to also retrieve the corresponding AWS Batch job description for each task by adding:
def parse_wes_runlog(log_file, batch=boto3.client('batch')):
    # ...
    descriptions = get_job_descriptions(batch, [task['job_id'] for task in tasks])
    for i, task in enumerate(tasks):
        tasks[i]['job_description'] = descriptions.get(task['job_id'])
    # ...
This gives us the following additional metrics for each task:
- The time interval from when the job was created in AWS Batch to when it actually started running. This is the amount of time the job spent waiting in a job queue.
- The completion status of each task - i.e. “SUCCEEDED” or “FAILED”.
- The type and amount of compute resources the job requested, specifically the number of vCPUs and GiB of memory.
Note: AWS Batch only stores job information for 48 hours after a job has reached a “completed” state (e.g. SUCCEEDED or FAILED). It is a good idea to retrieve this information as soon as you can and cache it somewhere.
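For example, a minimal caching sketch (the helper names and file name here are illustrative, not part of AGC) could dump the descriptions returned by get_job_descriptions() to a local JSON file and reload them later:

#!/bin/env python3
import json

def cache_job_descriptions(descriptions, cache_file='job_descriptions.json'):
    """save AWS Batch job descriptions locally before Batch expires them"""
    with open(cache_file, 'w') as f:
        # default=str is a safety net; describe_jobs responses are plain JSON-serializable dicts
        json.dump(descriptions, f, default=str)

def load_job_descriptions(cache_file='job_descriptions.json'):
    """reload previously cached job descriptions"""
    with open(cache_file, 'r') as f:
        return json.load(f)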
Combining task start and stop times with vCPU and memory details, you can get a rough estimate of how much each task costs. From the AWS EC2 console, you can export a table of instance types and their Linux On-Demand per-hour cost. Doing this for non-(a, d, n) C, M, and R instance types with x86_64 architecture, and fitting a multilinear model to the data, I get the following formula, which I've encapsulated in a Python function:
#!/bin/env python3
def approx_instance_cost_per_hr(vcpu, mem_gib):
    # rough OLS fit model based on c, m, r instance pricing filtered for x86_64 and non-(a,d,n)
    # returns linux on-demand per hour cost in USD
    usd = 0.03550716*vcpu + 0.003633091*mem_gib + 0.00718333
    return usd
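As a quick sanity check of the fitted model, here is what the estimate looks like for a hypothetical task requesting 4 vCPUs and 8 GiB of memory (roughly a c5.xlarge-shaped request):

# estimated hourly rate for a 4 vCPU / 8 GiB request, ~0.18 USD/hr
rate_per_hr = approx_instance_cost_per_hr(4, 8)
# estimated cost if a task with that footprint runs for 30 minutes, ~0.09 USD
est_cost = rate_per_hr * 0.5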
Note: AWS prices vary by region and are updated regularly, so your model constants may differ.
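The snippets below also call a small helper, get_job_resources(), which isn't shown in this post (the full code is in the GitHub repo linked at the end). A minimal sketch of what it might look like, based on the shape of the AWS Batch DescribeJobs response:

#!/bin/env python3
def get_job_resources(job):
    """extract requested vCPUs and memory (MiB) from an AWS Batch job description"""
    container = job.get('container', {})
    # older job definitions report resources directly on the container
    resources = {'vcpus': container.get('vcpus'), 'memory': container.get('memory')}
    # newer job definitions report them via resourceRequirements
    for req in container.get('resourceRequirements', []):
        if req['type'] == 'VCPU':
            resources['vcpus'] = float(req['value'])
        elif req['type'] == 'MEMORY':
            resources['memory'] = float(req['value'])  # MiB
    return resources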
We can now update our plotting function in the following ways. First we'll add more data to our ColumnDataSource:
millis_to_sec = 1/1000
mebi_to_gebi = 1/1024

jobs = [task['job_description'] for task in tasks]
tare = min([job['createdAt'] for job in jobs])
colors = {'SUCCEEDED': 'cornflowerblue', 'FAILED': 'orangered'}

source = ColumnDataSource(data=dict(
    y=range(len(jobs)),
    names=[job['jobName'] for job in jobs],
    color=[colors[job['status']] for job in jobs],
    cost=[
        approx_instance_cost_per_hr(
            get_job_resources(job)['vcpus'],
            get_job_resources(job)['memory'] * mebi_to_gebi
        ) * ((job['stoppedAt'] - job['startedAt']) * millis_to_sec / 3600)
        for job in jobs],
    cpus=[get_job_resources(job)['vcpus'] for job in jobs],
    memory=[get_job_resources(job)['memory'] * mebi_to_gebi for job in jobs],
    queued_left=[(job['createdAt'] - tare) * millis_to_sec * time_scale_factor for job in jobs],
    queued_right=[(job['startedAt'] - tare) * millis_to_sec * time_scale_factor for job in jobs],
    running_left=[(job['startedAt'] - tare) * millis_to_sec * time_scale_factor for job in jobs],
    running_right=[(job['stoppedAt'] - tare) * millis_to_sec * time_scale_factor for job in jobs],
    text_x=[((job['stoppedAt'] - tare) * millis_to_sec + 30) * time_scale_factor for job in jobs],
))
A few things to note. AWS Batch reports times in millisecond timestamps and memory in mebibytes, hence the need for the millis_to_sec and mebi_to_gebi conversion factors. I've added data columns to associate with task completion state (color) as well as data to show task time spent in queue (queued_left and queued_right).
Our main task timing plot is the same as before with the addition of an extra set of bars to show how long tasks were in queue:
p.hbar(y='y', left='queued_left', right='queued_right', height=0.8, color='lightgrey', source=source, legend_label="queued")
We can also add linked plots showing the vCPUs requested per task:
from bokeh.models import Range1d

p_cpu = figure(width=160, y_range=p.y_range, sizing_mode="stretch_height")
p_cpu.hbar(y='y', right='cpus', height=0.8, color="darkgrey", source=source)
p_cpu.x_range = Range1d(-1, 32)
p_cpu.xaxis.axis_label = "vcpus"
p_cpu.title.text = f"max cpus: {max(source.data['cpus'])}"
p_cpu.yaxis.visible = False
the memory requested per task:
p_mem = figure(width=160, y_range=p.y_range, sizing_mode="stretch_height")
p_mem.hbar(y='y', right='memory', height=0.8, color="darkgrey", source=source)
p_mem.x_range = Range1d(-1, 32)
p_mem.xaxis.axis_label = "memory (GiB)"
p_mem.title.text = f"max mem: {max(source.data['memory']):.2f} GiB"
p_mem.yaxis.visible = False
and the estimated cost per task:
p_usd = figure(width=160, y_range=p.y_range, sizing_mode="stretch_height")
p_usd.hbar(y='y', right='cost', height=0.8, color='limegreen', source=source)
x_max = 0.1
x_min = -(x_max * 0.05)
p_usd.x_range = Range1d(x_min, x_max*1.05)
p_usd.xaxis.axis_label = "task cost (USD)"
p_usd.title.text = f"total: {total_cost:.2f} USD"
We can also compute some overall workflow summary statistics like total_cpu_time and, importantly, total_cost:
cpus = [get_job_resources(job)['vcpus'] for job in jobs]
total_cpu_time = sum([(job['stoppedAt'] - job['startedAt']) * cpu for job, cpu in zip(jobs, cpus)]) * millis_to_sec * time_scale_factor
total_cost = sum(source.data['cost'])
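The full layout code is in the GitHub repo mentioned at the end of this post; a minimal sketch of the grid assembly (assuming the p, p_cpu, p_mem, and p_usd figures created above) might look like:

from bokeh.layouts import gridplot

# place the main timing chart alongside the linked vCPU, memory, and cost panels
grid = gridplot([[p, p_cpu, p_mem, p_usd]], toolbar_location='right')
show(grid)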
Pulling the above together using a Bokeh gridplot layout, we get a plot that looks like the following:
Note: The input for the above workflow is a small dataset used for testing and demo purposes. The overall cost shown is not representative of a workflow processing a full 30x human whole genome sample.
With the plot above, you can see at a glance which task(s) impact the workflow the most in terms of execution time and cost. You can also see where tasks might be spending too much time in queue waiting for available resources. In both cases, you now have some insight into how the workflow can be optimized - e.g. by tuning individual task resource requirements, or by adjusting overall compute availability through the Max vCPUs allowed in the AWS Batch compute environments created by AGC's context configuration.
Importantly, the method above works for any workflow that AGC can run, so you don't have to remember engine-specific details of how to get this profiling information. This opens up the possibility of comparing the performance of a workflow across workflow engines, which is something I'll explore further in a future blog post. In the meantime, I recommend generating some plots of your own! All the Python code I used above is available on GitHub.