Visualizing BAM files hosted on S3

While the end product of many bioinformatics analyses is a set of tables and graphs, being able to visually inspect the alignment of sequencing data is valuable during method development and when diagnosing issues with a sequencing library. Two excellent tools for visualizing sequencing data are IGV and JBrowse 2. When these tools are run locally, sequence alignments stored in BAM files can be readily visualized. However, because sequencing data is large, it is often stored on a remote cloud file system. While this is practical, the inconvenience of downloading alignment files and installing visualization software can be great enough that opportunities to gain insight through data visualization are easily missed.

Luckily, when hosting sequence alignments on S3 it is easy to make this data more accessible to users in your organization through their web browser.

Accessing Data via HTTP

While objects stored on S3 are typically referred to by their S3 URI (e.g., s3://example/file.bam), every S3 object can also be referred to by an HTTP URI. For example, a file stored on S3 at s3://example/file.bam can be retrieved over HTTP using its Object URL, https://s3.amazonaws.com/example/file.bam. If bucket permissions allow public access, the file can be readily downloaded using a web browser or command-line tools like curl and wget. However, objects stored in an S3 bucket are private by default. Private objects can also be retrieved over HTTP using presigned URLs. Since the process of loading data into IGV via HTTP is the same for both presigned and unsigned object URLs, my example below uses objects stored in a publicly accessible bucket for simplicity.
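The mapping from an S3 URI to the Object URL form used above can be sketched in a few lines of JavaScript. The helper name s3UriToHttpUrl is hypothetical (it is not part of any AWS SDK), and the sketch assumes the path-style https://s3.amazonaws.com/bucket/key form shown in this post; buckets in some regions may need a region-specific endpoint instead.

```javascript
// Sketch: convert an S3 URI into a path-style HTTPS Object URL.
// `s3UriToHttpUrl` is a hypothetical helper, not an AWS SDK function.
function s3UriToHttpUrl(s3Uri) {
    var match = /^s3:\/\/([^\/]+)\/(.+)$/.exec(s3Uri);
    if (!match) {
        throw new Error("Not a valid S3 object URI: " + s3Uri);
    }
    var bucket = match[1];
    // Encode each key segment separately so the "/" separators are preserved.
    var key = match[2].split("/").map(encodeURIComponent).join("/");
    return "https://s3.amazonaws.com/" + bucket + "/" + key;
}

console.log(s3UriToHttpUrl("s3://example/file.bam"));
// prints "https://s3.amazonaws.com/example/file.bam"
```

The resulting URL is only retrievable if the object is publicly readable; a private object would need a presigned URL generated with AWS tooling instead.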

CORS headers

For the igv.js genome browser application that runs in the user’s browser to be able to retrieve sequencing data from S3, the server where the data is hosted has to be configured to allow “cross-origin” requests. By default, web browsers prevent scripts on a web page from reading content served from a different origin. The cross-origin resource sharing (CORS) protocol enables servers to grant exceptions to this default behavior. In the case of igv.js we will use CORS to allow the JavaScript genome browser application to be hosted independently of the data that is displayed.

The igv.js wiki provides guidance on how the CORS headers of an S3 bucket can be configured to make its objects accessible to igv.js.
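To give a feel for what such a configuration looks like, here is a minimal sketch that builds a CORS rule set in the JSON shape accepted by S3 bucket CORS settings. The helper name buildCorsRules, the origin https://shiny.example.org, and the specific header choices are assumptions for illustration; the igv.js wiki remains the authoritative source for the recommended values.

```javascript
// Sketch: build an S3 bucket CORS configuration for a browser-based viewer.
// `buildCorsRules` is a hypothetical helper; the header lists are assumptions.
function buildCorsRules(allowedOrigins) {
    return [
        {
            // Origins (scheme + host) of the pages that embed the genome browser.
            AllowedOrigins: allowedOrigins,
            // Read-only access is enough to display alignments.
            AllowedMethods: ["GET", "HEAD"],
            // Permissive on request headers so that Range requests pass preflight.
            AllowedHeaders: ["*"],
            // Response headers the browser script may read (assumed useful for
            // partial reads of large BAM/CRAM files).
            ExposeHeaders: ["Content-Length", "Content-Range", "Accept-Ranges"],
            MaxAgeSeconds: 3000
        }
    ];
}

// The origin below is a placeholder for wherever the app is hosted.
console.log(JSON.stringify(buildCorsRules(["https://shiny.example.org"]), null, 2));
```

Restricting AllowedOrigins to the host that serves the application, rather than using a wildcard, keeps other sites from reading the data through a visitor's browser.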

Basic IGV.js configuration

The configuration of igv.js is described in the Quickstart guide. A minimal setup to embed IGV in a web page, for a genome browser session with data assets hosted on S3, would be:

index.html:

<html>
    <head>
        <!--1-->
        <script src="https://cdn.jsdelivr.net/npm/igv@2.15.8/dist/igv.min.js"></script>
    </head>

    <body>

        <h1>IGV</h1>

        <!--2-->
        <div id="igv-browser"></div>

        <script>

            // 3
            var myTrack = {
                "name": "my track",
                "url": "https://s3.amazonaws.com/1000genomes/file.bam",
                "indexURL": "https://s3.amazonaws.com/1000genomes/file.bam.bai",
                "format": "bam"
            };

            var options = {
                genome: "hg38",
                locus: "chr1:1-100",
                tracks: [ myTrack ]
            };

            // 4
            var igvDiv = document.getElementById("igv-browser");
            igv.createBrowser(igvDiv, options);

        </script>
    </body>
</html>

                                

To walk through this file:

1. The source for igv.js is loaded from the jsDelivr content delivery network (CDN) distribution of igv.js.
2. A placeholder is created in the markup so that the IGV browser can be injected into the page (see step 4).
3. IGV is configured by specifying the reference genome and the tracks to be displayed.
4. The browser is initialized and injected into the page.
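The track configuration in step 3 follows a regular pattern, so it can be generated rather than typed by hand. The sketch below uses a hypothetical helper, makeAlignmentTrack, and assumes that the index file sits next to the data file with the conventional suffix (.bai for BAM, .crai for CRAM), as in the examples in this post. It would not work as written for presigned URLs, which carry a query string.

```javascript
// Sketch: build an igv.js track configuration from a name and a data URL.
// `makeAlignmentTrack` is a hypothetical helper; it assumes the index file is
// stored alongside the data file with the conventional suffix.
function makeAlignmentTrack(name, url) {
    var indexSuffix = { bam: ".bai", cram: ".crai" };
    // Use the file extension to decide the format.
    var format = url.split(".").pop().toLowerCase();
    if (!Object.prototype.hasOwnProperty.call(indexSuffix, format)) {
        throw new Error("Unsupported alignment format: " + format);
    }
    return {
        name: name,
        url: url,
        indexURL: url + indexSuffix[format],
        format: format
    };
}

var track = makeAlignmentTrack("my track", "https://s3.amazonaws.com/1000genomes/file.bam");
console.log(track.indexURL);
// prints "https://s3.amazonaws.com/1000genomes/file.bam.bai"
```

The returned object can be placed directly in the tracks array of the options passed to igv.createBrowser.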

Combining Shiny with IGV-js

While configuring a genome browser to visualize data is a good first step, serving data off of S3 can really shine when the browser is embedded as a component in a data-rich web application. To illustrate how genomics data hosted on S3 can be visualized, I will create a small Shiny app that allows a user to choose what to view in the genome browser by selecting a row from a table.

Shiny is a framework for R that makes it easy to build web applications that provide a no-code interface for people to interact with data and take advantage of the language’s powerful statistics, machine learning, and data visualization capabilities. Being able to cross-reference raw sequencing data from a Shiny application can streamline a process that would otherwise be cumbersome (locating the BAM file corresponding to the sample, copying it from remote storage to the local computer, opening IGV, and configuring it to visualize the region of interest).

The process of wiring IGV-js into a Shiny app is analogous to the approach used in the first example above. In the UI block the IGV-js library is loaded, a <div> is created for the browser, and a script tag specifies that the browser should be created when the page loads. In the server function a table of genomic regions is created, along with an observer, so that when a row of the table is selected the IGV-js API is used to set the coordinates displayed by the genome browser.

To provide a high level overview, the app is composed of an R script and a little JavaScript:

├── app.R
└── www
    └── setup_igv.js

                                

With UI and server components defined:

# file: 'app.R'
library(shiny)

ui <- fluidPage(
    tags$head(
        tags$script(src='setup_igv.js'),
        tags$script(src='https://cdn.jsdelivr.net/npm/igv@2.15.8/dist/igv.min.js')
    ),

    titlePanel("S3/IGV Demo"),

    fluidRow(
        column(5,
               tags$p('Select a row in the table to set IGV coordinates'),
               DT::DTOutput('regions')),
        column(7,
               div(id='igv-browser'))
    )
)

server <- function(input, output, session) {

    regions_table <-  tibble::tibble( region = c( 'chr17:43073832-43094782', 'chr13:32347043-32368733') )

    output$regions <- DT::renderDT( DT::datatable( regions_table, selection = 'single' ) )

    observeEvent( input$regions_rows_selected,{
        selected_region <- regions_table[input$regions_rows_selected,]$region
        session$sendCustomMessage('igv-search', selected_region)
    })
}

shinyApp(ui, server)


                                

Where the snippet used to configure the IGV browser is:

// file: 'www/setup_igv.js'
document.addEventListener("DOMContentLoaded",function(){


    var s3_track = {
        name: "NA21144",
        url: "https://s3.amazonaws.com/1000genomes/data/NA21144/alignment/NA21144.alt_bwamem_GRCh38DH.20150718.GIH.low_coverage.cram",
        indexURL: "https://s3.amazonaws.com/1000genomes/data/NA21144/alignment/NA21144.alt_bwamem_GRCh38DH.20150718.GIH.low_coverage.cram.crai",
        format: "cram"
    }

    var options = {
        genome: "hg38",
        tracks: [s3_track]
    };

    var igvDiv = document.getElementById("igv-browser");
    // createBrowser returns a Promise that resolves to the browser object
    var browser = igv.createBrowser(igvDiv, options);

    Shiny.addCustomMessageHandler('igv-search',function(txt){
        browser.then(function(igvb){ igvb.search(txt) });
    });

})


                                

And the app can be run by calling:

Rscript -e 'shiny::runApp(".")'

Shiny / IGV-js communication

To use the IGV-js library from Shiny, the Shiny UI needs to be able to control the behavior of the IGV browser. IGV-js provides an API to drive browser functions like setting the genome version, navigating, and configuring tracks. To call these API methods we need to be able to “send messages” from the Shiny framework to JavaScript running in the client’s web browser. Shiny provides a mechanism to do this via the session$sendCustomMessage() function. By defining a new message type, ‘igv-search’, we can send a request from Shiny to JavaScript running on the client with the coordinates of the region to be displayed:

session$sendCustomMessage('igv-search', selected_region)

                                

This message is received by a message handler that listens for events with the specified type, and calls the appropriate IGV API method when the event occurs.

Shiny.addCustomMessageHandler('igv-search',function(txt){
    browser.then(function(igvb){ igvb.search(txt) });
});
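The same pattern extends to other IGV-js API calls. As a minimal sketch, the function below registers the 'igv-search' handler from this post together with a second, hypothetical message type, 'igv-load-track', that passes a track configuration to the browser's loadTrack method. The function name registerIgvHandlers and the 'igv-load-track' message type are assumptions for illustration, not part of Shiny or igv.js. The Shiny object and the browser promise are passed in as arguments so the wiring can be exercised outside a web page.

```javascript
// Sketch: register several Shiny message handlers that drive an igv.js browser.
// `registerIgvHandlers` and the 'igv-load-track' message type are hypothetical.
// `shiny` is the global Shiny object; `browserPromise` is the value returned
// by igv.createBrowser(...).
function registerIgvHandlers(shiny, browserPromise) {
    // Navigate to a locus, as in the 'igv-search' handler above.
    shiny.addCustomMessageHandler("igv-search", function (locus) {
        browserPromise.then(function (igvb) { igvb.search(locus); });
    });

    // Add a track described by a config object (name, url, indexURL, format).
    shiny.addCustomMessageHandler("igv-load-track", function (trackConfig) {
        browserPromise.then(function (igvb) { igvb.loadTrack(trackConfig); });
    });
}

// In www/setup_igv.js this might be called as:
//   registerIgvHandlers(Shiny, igv.createBrowser(igvDiv, options));
```

On the R side, a track could then be added with a call such as session$sendCustomMessage('igv-load-track', list(name = ..., url = ..., indexURL = ..., format = ...)), since Shiny serializes the list to a JavaScript object before it reaches the handler.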

                                

Summary

Visualizing data in a genome browser can provide unique insights that are otherwise obscured by high-level summary statistics. Saying “it’s important to visualize your data” isn’t enough. Put visualizations so front-and-center that the team has no choice but to ogle them a little as they review the higher-level analysis. Doing so will encourage serendipitous discoveries and help to avoid being misled by summary statistics that mask weird edge cases.

If you are storing your sequencing data on S3, this pattern is great because you can build rich data applications without added server costs: no additional server needs to be running 24/7 just to serve the files.

References

James T Robinson, Helga Thorvaldsdottir, Douglass Turner, Jill P Mesirov, igv.js: an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV), Bioinformatics, Volume 39, Issue 1, January 2023, btac830, https://doi.org/10.1093/bioinformatics/btac830