If you leverage the cloud for your bioinformatics compute, Nextflow is a great tool: it lets you develop and test a workflow locally, then seamlessly scale up your analysis when local resources become limiting. Once your cloud environment is set up, Nextflow does a great job of hiding the nitty-gritty details of coordinating the execution of complex workflows. Unfortunately, there are a non-trivial number of things to configure before you can run a workflow on AWS Batch, which can make the cloud seem unapproachable for first-time users. If you are a Seqera customer you can use their Tower Forge service to stamp out Nextflow-compatible compute infrastructure in your cloud environment. If you do not use Tower Forge, or you need more control over the compute environment than it provides, it is possible to set up a Nextflow-compatible compute environment using standard tools for managing cloud infrastructure.

One of the most fiddly pieces of setting up an AWS Batch compute environment is providing an AMI that is compatible with both AWS Batch and Nextflow. To be compatible with AWS Batch, the AMI must have Docker and the AWS Batch (ECS) system packages installed. For Nextflow, it is very convenient to also have the AWS CLI installed on the host AMI, so that you can use Docker containers that were not built to target AWS Batch. The most annoying part of the process is that it is time consuming: launching an EC2 instance, SSHing in, and installing system packages and software can easily take 30 minutes, and the snapshot step often takes another 10-15 minutes. If devops is just one of your many hats, this context switching is disruptive and unpleasant. Unpleasant processes tend to drag on or are rarely prioritized, and workflows end up running on infrastructure that is not a good fit. This can lead to poor performance and cost efficiency, and greatly inflate cloud spend.

Thankfully, the good folks at HashiCorp have created freely available tools that make the process of building exactly the AMI a workflow needs much less painful. By leveraging Packer, the whole process of building an AMI to use with your Nextflow pipeline can be completely automated. Making it easy to spin up new infrastructure promotes both agility and stability. Lowering the bar to tweak workflow infrastructure facilitates iterative improvement, creating test environments, and trying out different configurations. This makes it easier to run side-by-side comparisons of the cost efficiency and performance characteristics of different configurations, and to identify cost-effective ways to run a workflow. At the same time, automating the infrastructure creation process and embracing infrastructure-as-code makes it practical to deploy workloads into isolated compute environments. This isolation promotes workflow stability, because infrastructure changes made to optimize the performance of one workflow do not ripple through other production workflows.

Building an AMI with Packer

The idea behind Packer is to take the steps you would perform when manually building an AMI and specify them as code instead. When the packer command is run, the application will (using your credentials) launch the desired EC2 instance, connect to it via SSH, run through the AMI configuration steps, and snapshot the result.

The top-level piece of this specification is the “.pkr.hcl” configuration file. This file specifies the details Packer needs to build the AMI: the source image to start from, the instance type to use for the build, the security group the instance should be associated with, and so on. The template is straightforward: the “source” block identifies the AMI you want to base the new image on and how the build EC2 instance will be run (the resource IDs shown below are placeholders).

# nextflow_ami.pkr.hcl

source "amazon-ebs" "aws-batch-nextflow-ami" {
  ami_name                    = "nextflow_ami"
  source_ami                  = "ami-12345678909876543"  # placeholder ID for the base image
  associate_public_ip_address = "true"
  instance_type               = "r4.large"
  region                      = "us-east-1"
  security_group_ids          = ["sg-98765432"]  # must allow inbound SSH from where Packer runs
  ssh_timeout                 = "5m"
  ssh_username                = "ec2-user"
  subnet_id                   = "subnet-abcdefgh"
  vpc_id                      = "vpc-hgfedcab"
}

build {
  sources = ["source.amazon-ebs.aws-batch-nextflow-ami"]

  provisioner "shell" {
    script = "install-awsbatch-and-nextflow-dependencies.sh"
    expect_disconnect = true
  }

}

The “build” section of the “.pkr.hcl” file points to a shell script containing the commands that prepare your AMI from the base image.

The provisioner script includes the commands to install Docker, the ECS system packages, Conda, and the AWS CLI:

#!/bin/sh
# install-awsbatch-and-nextflow-dependencies.sh

# Install latest updates
sudo yum update -y

# Install dependencies
sudo yum install -y bzip2 wget

# Set up Docker environment for AWS Batch
sudo amazon-linux-extras install -y docker
sudo service docker start
sudo usermod -a -G docker ec2-user

# Set up ECS environment for AWS Batch
sudo amazon-linux-extras disable docker
sudo amazon-linux-extras install -y ecs
sudo systemctl enable --now ecs

# Install Conda into a dedicated prefix that Nextflow can mount into containers
sudo mkdir -p /nextflow_awscli/bin

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
sudo bash /home/ec2-user/Miniconda3-latest-Linux-x86_64.sh -b -f -p /nextflow_awscli

# Install the AWS CLI via Conda
sudo /nextflow_awscli/bin/conda install -c conda-forge -y awscli

rm Miniconda3-latest-Linux-x86_64.sh

# Reboot so the usermod change takes effect before the AMI snapshot
sudo shutdown -r now

The latest Amazon Linux 2 image works well with Nextflow and AWS Batch, and that is what is used for this example, but the installation script can be adapted if a different Linux distribution is used. As described in Nextflow’s documentation, one critical detail when preparing a custom AMI for use on AWS Batch with Nextflow is that the AWS CLI must be installed using Conda. When Nextflow uses an AWS CLI installed on the host AMI, the executable is mounted into the container environment. If the AWS CLI is installed following the instructions in the AWS documentation, the aws executable will not function correctly from inside a Docker container, because it is linked against libraries that are not mounted into the container via Nextflow’s aws.batch.cliPath configuration. The Conda packaging installs an aws binary that is more portable and self-contained, and can be used within a Docker container.
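If you want to confirm this portability after building the image, a quick sanity check is to launch an instance from the new AMI, bind-mount the Conda installation into a container that has no AWS CLI of its own, and run the binary. The ubuntu:22.04 image below is an arbitrary choice for illustration:

# Run on an instance launched from the new AMI; ubuntu:22.04 is just an
# arbitrary container image that does not ship an AWS CLI.
docker run --rm -v /nextflow_awscli:/nextflow_awscli ubuntu:22.04 \
    /nextflow_awscli/bin/aws --version

If the version string prints, the binary is not relying on host libraries outside the mounted directory, which mirrors how Nextflow exposes the cliPath executable to task containers.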

Building the AMI

Once the details of the AMI build have been specified, performing the build is as simple as issuing the command:

packer build nextflow_ami.pkr.hcl
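Note that recent Packer releases (1.7 and later) distribute the Amazon builder as a separate plugin, so you may first need to declare it in the template and download it with packer init; the version constraint below is only an example:

# Added to nextflow_ami.pkr.hcl; the version constraint is only an example
packer {
  required_plugins {
    amazon = {
      version = ">= 1.0.0"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

After adding this block, running “packer init .” once in the template directory downloads the plugin.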

For the build to run successfully, your shell must be authorized to launch, connect to, and snapshot an EC2 instance in the target account. This can be achieved by defining an IAM role with the appropriate permissions and installing credentials to access that role in a shared AWS credentials file.
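For example, if your shared credentials file contains a profile with those permissions, pointing the standard AWS environment variables at it before running the build is sufficient; “packer-build” is a hypothetical profile name:

# "packer-build" is a hypothetical profile defined in ~/.aws/credentials
export AWS_PROFILE=packer-build
export AWS_DEFAULT_REGION=us-east-1
packer build nextflow_ami.pkr.hcl

Packer picks these up through the standard AWS credential chain, so an instance profile or temporary STS credentials work just as well.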

Using AWS CLI from Nextflow

In the AMI creation script described above, the AWS CLI is installed into the directory “/nextflow_awscli” on the host AMI. With this installation path, Nextflow can be instructed to use the host’s AWS CLI by adding the following to your Nextflow configuration:

// nextflow.config

aws {
    batch {
        cliPath = '/nextflow_awscli/bin/aws'
    }
}
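For context, this setting usually sits alongside the rest of the AWS Batch configuration. A fuller, illustrative configuration might look like the following; the queue name, region, and S3 bucket are placeholders rather than values from this article:

// nextflow.config (queue, region, and bucket are placeholders)

process {
    executor = 'awsbatch'
    queue    = 'my-batch-queue'
}

workDir = 's3://my-bucket/nextflow-work'

aws {
    region = 'us-east-1'
    batch {
        cliPath = '/nextflow_awscli/bin/aws'
    }
}

With this in place, Nextflow submits each process as an AWS Batch job and uses the mounted aws executable to stage task inputs and outputs against the S3 work directory.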

Summary

If you are running Nextflow workflows on AWS Batch, having an AMI with the AWS CLI installed is essential to simplify the management of containerized workflows. While Nextflow processes can technically run on AWS Batch if the Docker container used to run the process provides its own AWS CLI installation, most community-managed images (e.g. BioContainers) do not include it. By pushing responsibility for the AWS CLI installation onto the AMI, community Docker images can be deployed to AWS Batch as-is. This separation also helps keep custom Docker images for in-house tools lean: the Dockerfile can focus on encapsulating the dependencies of the bioinformatics tool being run, without being contaminated with concerns about how the container will run with Nextflow on AWS Batch.

Cloud bioinformatics workflows can be quite data-intensive. If you are moving gigabytes of data between AWS Batch and S3, you should consider leveraging high-performance instance storage to avoid an often-underappreciated killer of cloud compute efficiency.

References

Nextflow’s AWS Batch Documentation

AWS ECS Documentation
