When deploying a genomics pipeline on AWS, one of the biggest corners to stub your toes on is that Amazon meters EBS disk I/O. This can lead to strange situations where a pipeline that runs in a matter of minutes on local infrastructure takes 10-100x longer in the cloud. The reason is that each general purpose (gp2) EBS volume has an “allowance” of read/write burst credits, and as your pipeline runs (copying files back and forth from S3, or reading large genome index files) those credits get exhausted. Credits are only replenished slowly while the volume sits below its baseline I/O rate, so when deploying a pipeline on EBS volumes the challenge is to avoid over-provisioning (and paying for unused disk space) while also avoiding the situation where I/O credits run out and jobs become I/O-bottlenecked and hang.
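
If you suspect a job is stalling because a volume has burned through its credits, one place to look is the volume's BurstBalance metric in CloudWatch, which reports the percentage of burst credits remaining. Below is a minimal sketch using the AWS CLI (the volume ID is a placeholder, and the date invocation assumes GNU date as found on Amazon Linux):

# report the average remaining burst-credit balance (percent) for a gp2 volume
# over the past hour; vol-0123456789abcdef0 is a placeholder volume ID
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average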

While AWS has been kind enough to spell out the EBS price vs performance tradeoff in excruciating detail, a nice trick for avoiding the complexities of EBS storage and I/O limits is to leverage the more cost-effective and performant instance storage. Instance storage volumes are built into certain instance types (as opposed to being elastically provisioned), and the size, number, and type of volumes associated with each instance type are described here. Instance storage tends to be a great match for genomics workflows because it is fast (hundreds to thousands of megabytes per second) and does not have an AWS-imposed throttle on I/O throughput. However, there is one gotcha: instance volumes are not formatted or mounted when you provision an EC2 instance, so it is not possible to leverage instance storage without additional configuration.
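
If you want to check what instance storage a given instance type provides without leaving the terminal, the AWS CLI exposes the same information (a sketch; m5d.4xlarge is just an example instance type):

# list the instance-store disks (count, size, and type) for an example instance type
aws ec2 describe-instance-types \
  --instance-types m5d.4xlarge \
  --query "InstanceTypes[].InstanceStorageInfo"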

Conveniently, AWS launch templates provide a straightforward way to prepare instance volumes and make them available to your pipeline’s compute environment. Below is a script for Amazon Linux that uses LVM to combine two NVMe instance volumes into a single logical volume, formatted and mounted at “/instance_volume”.

#!/bin/bash

# register the instance-storage devices with LVM
pvcreate /dev/nvme1n1
pvcreate /dev/nvme2n1

# create a volume-group to expose the two devices as one drive
vgcreate ephemeral /dev/nvme1n1 /dev/nvme2n1
# allocate all of the space to one logical volume
lvcreate -n ephemeral_volume -l 100%FREE ephemeral

# format and mount the instance storage
mkfs -t ext4 /dev/ephemeral/ephemeral_volume
mkdir /instance_volume/
mount /dev/ephemeral/ephemeral_volume /instance_volume
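
To confirm the volume was assembled and mounted as expected, a quick sanity check (not part of the setup script itself) might be:

# list the logical volumes in the "ephemeral" volume group and confirm the mount
lvs ephemeral
df -h /instance_volume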

The device IDs to use will vary with the instance type, but the guide here describes the commands for determining block device IDs.
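
On NVMe-based (Nitro) instance types, a quick way to tell instance-store devices apart from the EBS root volume is to list the block devices along with their reported model, for example:

# instance-store devices report a model of "Amazon EC2 NVMe Instance Storage",
# while EBS volumes report "Amazon Elastic Block Store"
lsblk -o NAME,SIZE,MODEL,MOUNTPOINT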

If your intention is to use instance volumes to run a workflow in a Docker-containerized environment (e.g. using Nextflow), you will want to instruct Docker to store its data on the instance storage volume by setting the contents of /etc/docker/daemon.json to:

{
  "data-root": "/instance_volume/docker_root"
}

and then initializing this location and restarting the Docker daemon:

mkdir /instance_volume/docker_root

systemctl daemon-reload
systemctl restart docker
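
To sanity-check that Docker picked up the new data root after the restart, something like the following should print the new location:

# should print /instance_volume/docker_root
docker info --format '{{ .DockerRootDir }}'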

To instruct AWS to mount your instance volumes when launching an EC2 instance, this script can be provided as user data. I like to use Terraform to describe how I want my infrastructure provisioned because it is a concise and readable way to specify cloud resources. Focusing only on adding a launch template to an existing specification for an AWS Batch queue, a minimal Terraform project layout would look like this:

.
├── nextflow_aws_batch.tf
├── etc_docker_daemon.json
└── user_data_script.sh

In the following Terraform, details that are specific to your account (AMI, VPC, and security group IDs) are omitted with “…” and should be filled in with identifiers for resources in your account. The cloudinit_config data source assembles the cloud-config section and the shell script into the MIME multi-part user data that launch templates used with AWS Batch expect. The user_data_script.sh script used to configure the instance storage is written for an Amazon Linux base AMI and may need to be adjusted if an alternative Linux distribution is used. The script also assumes that the instance type used in the compute environment provides two NVMe storage volumes, and the commands used to format the drives may need to be adjusted depending on your target instance type.

# nextflow_aws_batch.tf

resource "aws_batch_job_queue" "this" {
  compute_environments = [
    aws_batch_compute_environment.this.arn,
  ]
}

resource "aws_batch_compute_environment" "this" {

  service_role = ...
  type = "MANAGED"

  compute_resources {

    instance_role = ...

    subnets = [ ... ]

    type = "EC2"

    instance_type = ...

    launch_template {
      launch_template_id = aws_launch_template.this.id
      version = aws_launch_template.this.latest_version
    }

  }

}

resource "aws_launch_template" "this" {
  image_id = ...

  vpc_security_group_ids = [ ... ]

  user_data = data.cloudinit_config.this.rendered
}

data "cloudinit_config" "this" {
  gzip          = false
  base64_encode = true

  part {
    content_type = "text/cloud-config"
    filename     = "cloudconfig.conf"
    content      = <<-EOF
      #cloud-config
      write_files:
        - path: "/etc/docker/daemon.json"
          permissions: "0600"
          owner: "root:root"
          encoding: "b64"
          content: ${filebase64("${path.module}/etc_docker_daemon.json")}
    EOF
  }

  part {
    content_type = "text/x-shellscript"
    filename     = "user_data_script.sh"
    content      = file("${path.module}/user_data_script.sh")
  }
}

Where the user_data_script.sh contains:

#!/bin/bash

# register the instance-storage devices with LVM
pvcreate /dev/nvme1n1
pvcreate /dev/nvme2n1

# create a volume-group to expose the two devices as one drive
vgcreate ephemeral /dev/nvme1n1 /dev/nvme2n1
# allocate all of the space to one logical volume
lvcreate -n ephemeral_volume -l 100%FREE ephemeral

# format and mount the instance storage
mkfs -t ext4 /dev/ephemeral/ephemeral_volume
mkdir /instance_volume/
mount /dev/ephemeral/ephemeral_volume /instance_volume

# prepare location where docker data will be stored
mkdir /instance_volume/docker_root

# reload docker to apply changes from /etc/docker/daemon.json
systemctl daemon-reload
systemctl restart docker

and etc_docker_daemon.json contains:

{
  "data-root": "/instance_volume/docker_root"
}
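
With those three files in place, provisioning follows the standard Terraform workflow (assuming your AWS credentials and provider configuration are already set up):

# initialize providers, preview the changes, then apply them
terraform init
terraform plan
terraform apply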

If you are not ready to make the jump to Terraform or CloudFormation, the user data script can also be configured manually through the AWS console.
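
For completeness, a launch template with user data can also be created from the AWS CLI by base64-encoding the script yourself. This is a sketch for a standalone EC2 launch (the template name is a placeholder); note that launch templates used with AWS Batch expect user data in the MIME multi-part format that the cloudinit_config block above produces:

# create a launch template whose user data runs the setup script at boot
aws ec2 create-launch-template \
  --launch-template-name genomics-instance-storage \
  --launch-template-data "{\"UserData\": \"$(base64 -w0 user_data_script.sh)\"}"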
