AMI for Nextflow on AWS Batch using Packer
If you leverage the cloud for your bioinformatics compute, Nextflow is a great tool that allows you to develop and test your workflow locally, and then seamlessly scale up your analysis when local resources become limiting. Once your cloud environment is set up, Nextflow does a great job of hiding the nitty-gritty details of coordinating the execution of complex workflows. Unfortunately, there are a non-trivial number of things to configure before you can run a workflow on AWS Batch, which can make leveraging the cloud seem unapproachable for first-time users. If you are a customer of Seqera you can use their Tower Forge service to stamp out Nextflow-compatible compute infrastructure in your cloud environment. If you do not use Tower Forge, or you need more control over the compute environment than Tower Forge provides, it is possible to set up a Nextflow-compatible compute environment using standard tools for managing cloud infrastructure.
One of the most fiddly pieces of setting up an AWS Batch compute environment is providing an AMI that is compatible with both AWS Batch and Nextflow. To be compatible with AWS Batch, the AMI must have Docker and the special AWS Batch (ECS) system packages installed. For Nextflow, it is very convenient to have the AWS CLI installed on the host AMI so that you can use Docker containers that were not built to target AWS Batch. The most annoying part of the process is that it is time consuming: launching an EC2 instance, SSH'ing in, and installing system packages and software can easily take 30 minutes, and the snapshot step often takes another 10-15 minutes. If devops is just one of your many hats, this context switching is disruptive and unpleasant. Unpleasant processes tend to drag on or are rarely prioritized, and workflows end up being run on infrastructure that is not a good fit. This can lead to poor performance and cost efficiency, and greatly inflate cloud spend.
Thankfully, the good folks at HashiCorp have created freely available tools that make the process of building the exact AMI needed for a workflow much less painful. By leveraging Packer, the whole process of building an AMI to use with your Nextflow pipeline can be completely automated. Making it easy to spin up new infrastructure promotes both agility and stability. Lowering the bar to tweak workflow infrastructure facilitates iterative improvement, creating test environments, and trying out different configurations. This makes it easier to run side-by-side comparisons of the cost efficiency and performance characteristics of different configurations, and to identify cost-effective ways to run a workflow. At the same time, automating infrastructure creation and embracing infrastructure-as-code makes it practical to deploy workloads into isolated compute environments. This isolation promotes workflow stability, because infrastructure changes made to optimize the performance of one workflow do not ripple into other production workflows.
Building an AMI with Packer
The idea behind Packer is to take the steps that you would perform when manually building an AMI, and instead specify them as code. When the packer command is run, the application will (using your authorization) launch the desired EC2 instance, connect to it via SSH, and run through the AMI configuration and snapshot process.
The top-level piece of this specification is the “.pkr.hcl” configuration. This file specifies the details Packer needs to build the AMI, for example the source image it should start from, the type of instance it should use for the build, the security group the instance should be associated with, etc. The Packer script is straightforward: the “source” section identifies the AMI you want to base the new image on and specifies how the build EC2 instance will be run.
# nextflow_ami.pkr.hcl
source "amazon-ebs" "aws-batch-nextflow-ami" {
  ami_name                    = "nextflow_ami"
  source_ami                  = "ami-12345678909876543"
  associate_public_ip_address = "true"
  instance_type               = "r4.large"
  region                      = "us-east-1"
  security_group_ids          = ["sg-98765432"]
  ssh_timeout                 = "5m"
  ssh_username                = "ec2-user"
  subnet_id                   = "subnet-abcdefgh"
  vpc_id                      = "vpc-hgfedcab"
}

build {
  sources = ["source.amazon-ebs.aws-batch-nextflow-ami"]

  provisioner "shell" {
    script            = "install-awsbatch-and-nextflow-dependencies.sh"
    expect_disconnect = true
  }
}
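Note that recent versions of Packer (1.7 and later) manage builders as plugins, so the template may also need a packer block declaring the Amazon plugin before the build will work. A minimal sketch is shown below; the version constraint is an assumption you may want to adjust:
# nextflow_ami.pkr.hcl (top of file, if your Packer version requires plugin declarations)
packer {
  required_plugins {
    amazon = {
      version = ">= 1.0.0"
      source  = "github.com/hashicorp/amazon"
    }
  }
}
With this in place, run "packer init ." once in the directory containing the template to download the plugin before the first build.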
The “build” section of the “.pkr.hcl” file points to a shell script containing the commands used to prepare your AMI from the base image.
The provisioner script includes the commands to install Docker, ECS system packages, Conda, and AWS CLI:
#!/bin/sh
# install-awsbatch-and-nextflow-dependencies.sh

# Install latest updates
sudo yum update -y

# Install dependencies
sudo yum install -y bzip2 wget

# Set up Docker environment for AWS Batch
sudo amazon-linux-extras install -y docker
sudo yum install -y docker
sudo service docker start
sudo usermod -a -G docker ec2-user

# Set up ECS environment for AWS Batch
sudo amazon-linux-extras disable docker
sudo amazon-linux-extras install -y ecs
sudo systemctl enable --now ecs

# Install Conda into the directory Nextflow will be pointed at
sudo mkdir -p /nextflow_awscli/bin
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
sudo bash /home/ec2-user/Miniconda3-latest-Linux-x86_64.sh -b -f -p /nextflow_awscli

# Install the AWS CLI via Conda
sudo /nextflow_awscli/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh

# Reboot so the usermod/group changes take effect before the AMI snapshot
sudo shutdown -r now
The latest Amazon Linux image works well with Nextflow and AWS Batch, and that is what is used for this example, but the installation script can be adapted if a different Linux distribution is used. As described in the Nextflow documentation, one critical detail when preparing a custom AMI to use on AWS Batch with Nextflow is that the AWS CLI tools must be installed using Conda. When Nextflow uses the AWS CLI installed on the host AMI, the executable is “mounted” into the container environment. If the AWS CLI is installed following the instructions in the AWS documentation, the aws executable will not function correctly when it is used from inside a Docker container, because it is linked to libraries that are not mounted into the container by Nextflow’s aws.batch.cliPath configuration. The Conda packaging installs an aws binary that is more portable/self-contained, and can be used within a Docker container.
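To make the mechanism concrete, the effect is roughly equivalent to bind-mounting the host installation directory into the task container and invoking the binary from there. The snippet below is a simplified illustration (not Nextflow's actual implementation), using a generic image that itself contains no AWS CLI:
# Simplified illustration of why the host-installed CLI must be self-contained:
# the /nextflow_awscli directory is bind-mounted into the task container, so the
# aws binary cannot rely on host libraries that live outside that directory.
docker run --rm \
  -v /nextflow_awscli:/nextflow_awscli \
  ubuntu:22.04 \
  /nextflow_awscli/bin/aws --version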
Building the AMI
Once the details of the AMI build task have been specified, performing the build is as simple as issuing the command:
packer build nextflow_ami.pkr.hcl
For this command to run successfully, your shell must be authorized to launch EC2 instances, connect to them, and snapshot the result into an AMI in the target account. This can be achieved by defining an IAM role with the appropriate permissions, and installing credentials to access that role in a shared AWS credentials file.
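For example, if the credentials live in a named profile in the shared credentials file (the profile name and key values below are placeholders), you can point Packer at them via the AWS_PROFILE environment variable:
# ~/.aws/credentials (placeholder values)
# [packer-build]
# aws_access_key_id     = AKIAXXXXXXXXXXXXXXXX
# aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

AWS_PROFILE=packer-build packer build nextflow_ami.pkr.hcl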
Using AWS CLI from Nextflow
In the AMI creation script described above, the AWS CLI tools are installed to the directory “/nextflow_awscli” on the host AMI. With this installation path, Nextflow can be instructed to use the host-installed AWS CLI by adding the following to your Nextflow configuration:
// nextflow.config
aws {
    batch {
        cliPath = '/nextflow_awscli/bin/aws'
    }
}
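In practice this setting usually sits alongside the rest of the AWS Batch executor configuration; the sketch below shows one way it might fit together, with the queue name, region, and work bucket being placeholders for your own values:
// nextflow.config (queue, region, and bucket names are placeholders)
process {
    executor = 'awsbatch'
    queue    = 'my-batch-queue'
}

aws {
    region = 'us-east-1'
    batch {
        cliPath = '/nextflow_awscli/bin/aws'
    }
}

// Intermediate files are staged through an S3 work directory
workDir = 's3://my-nextflow-work-bucket/work'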
Summary
If you are running Nextflow workflows on AWS Batch, having an AMI with the AWS CLI installed is essential for simplifying the management of containerized workflows. While Nextflow processes can technically run on AWS Batch if the Docker container used to run the process provides an AWS CLI installation, most community-managed images (e.g. BioContainers) do not include the AWS CLI. By pushing the responsibility for the AWS CLI installation to the AMI, community Docker images can be more readily deployed to AWS Batch. This separation also helps keep custom Docker images for in-house tools lean: the Dockerfile can focus on encapsulating the dependencies of the bioinformatics tool being run, without being contaminated with concerns about how the container will run with Nextflow on AWS Batch.
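As a concrete example, a process can pull its tool from a community image and rely on the host-installed CLI for staging data to and from S3, without the image containing the AWS CLI itself (the image tag and command below are illustrative):
// Illustrative process: the BioContainers image provides only samtools;
// staging of inputs/outputs to S3 is handled by the host AWS CLI via aws.batch.cliPath.
process INDEX_BAM {
    container 'quay.io/biocontainers/samtools:1.17--h00cdaf9_0'

    input:
    path bam

    output:
    path "${bam}.bai"

    script:
    """
    samtools index ${bam}
    """
}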
Cloud bioinformatics workflows can be quite data-intensive. If you are moving gigabytes of data between AWS Batch and S3, you should consider leveraging high-performance instance storage to avoid a frequently underappreciated killer of cloud compute efficiency.