The analysis of extensive genomic data is critical for advancing innovative treatments that enable personalized medicine. However, the management and accessibility of these expansive multi-petabyte datasets pose significant challenges. AWS provides the perfect combination of scale and flexibility needed for these initiatives.
AWS created a Reference Architecture leveraging PetaGene’s PetaSuite Cloud Edition and implemented it with a major biopharma client. The implementation reduced storage requirements, improved data transfer speeds, and increased analysis efficiency while keeping the data fully accessible. This post describes the Reference Architecture and its implementation.
Processing Pipeline
PetaGene compression operates within the genomic processing pipeline of the biopharma client during the Compression and Ingestion phase, as illustrated in the figure below. The Reference Architecture focuses solely on Compression and Ingestion, with other stages included for context.
Data from the alignment/mapping and variant calling phases arrives in variable-sized batches rather than a constant flow. Once that stage concludes, the output data is compressed with PetaGene and ingested into an S3 bucket designated for permanent storage, and a file catalog that maintains metadata for each file is updated. After the Compression and Ingestion phase, a bespoke quality control and post-processing stage runs various checks and tools (such as peddy, VerifyBamID, ExpansionHunter for STR expansions, and TelSeq) before the data is made available for scientific interpretation.
PetaGene’s software operates using multiple instances and threads, allowing it to scale effectively for large batch jobs without a central server that could create a bottleneck. The PetaGene license manager solely tracks billing and status information for the compressions and can currently manage over 200,000 file compression requests per user per hour. PetaGene’s PetaSuite minimizes cloud storage usage and data transfer time while optimizing data access and processing performance.
Initially, the biopharma client analyzed 220,000 whole exomes at a rate of 1,000 sequences per hour, producing 770 TB of BAM files. PetaGene’s lossless compression reduced their size by a factor of more than four, at a rate of 10,000 files per hour, before the files were transferred to tiered long-term storage for periodic retrieval. By using PetaGene’s PetaSuite, the client anticipates savings exceeding $1 million over three years, even after accounting for the PetaGene fee, which is charged as a fixed cost per TB saved. The client ultimately incorporated the PetaGene compression step into their analysis pipeline, as detailed below.
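As a rough illustration of where the savings come from, the compression ratio alone determines how much capacity no longer needs to be stored. The sketch below works through that arithmetic; the cost inputs are placeholders rather than actual S3 pricing or the PetaGene fee for this engagement.

```python
# Back-of-the-envelope view of the storage arithmetic (illustrative only).
original_tb = 770        # BAM output before compression
compression_ratio = 4    # lossless reduction of "more than four times"

compressed_tb = original_tb / compression_ratio   # ~192 TB still stored
saved_tb = original_tb - compressed_tb            # ~578 TB no longer stored

def net_savings(storage_cost_per_tb_month, months, fee_per_tb_saved):
    """Savings from storing less data, minus PetaGene's fixed fee per TB saved.

    The cost inputs are placeholders; actual tiered storage pricing and the
    PetaGene fee for this engagement are not published here.
    """
    return saved_tb * storage_cost_per_tb_month * months - saved_tb * fee_per_tb_saved

print(f"Stored after compression: ~{compressed_tb:.0f} TB (saving ~{saved_tb:.0f} TB)")
```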
Architecture Overview
The diagram below presents a high-level architecture that can be integrated into a new or existing pipeline. Since PetaGene’s PetaSuite Cloud Edition operates effectively within a Docker container, we use AWS Batch and Amazon Elastic Container Service (Amazon ECS) to allocate and auto-scale the compute and memory resources required for compression. Amazon Simple Queue Service (Amazon SQS) tracks the tasks to be completed, while a Lambda function reads from the queue and dispatches jobs to AWS Step Functions. Compression can be triggered in two ways: automatically, by starting the compression workflow as soon as the file creation workflow concludes, or retroactively, by letting users submit batches of compression jobs for existing files.
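For the retroactive path, the submission side can be as small as a script that sends one message per file to the queue. The following is a minimal sketch using boto3; the queue name and message fields are assumptions for illustration, not the client’s actual schema.

```python
# Minimal sketch of retroactive job submission (assumed queue name and
# message schema; the production CLI may differ).
import json
import sys
import uuid

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = sqs.get_queue_url(QueueName="compression-jobs.fifo")["QueueUrl"]

def submit(file_list_path: str) -> None:
    """Send one SQS message per S3 URI listed in a plain-text file."""
    with open(file_list_path) as fh:
        for line in fh:
            s3_uri = line.strip()
            if not s3_uri:
                continue
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps({"s3_uri": s3_uri}),
                MessageGroupId="compression",           # FIFO queues require a group id
                MessageDeduplicationId=str(uuid.uuid4()),
            )

if __name__ == "__main__":
    submit(sys.argv[1])
```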
Step-by-Step Architecture Walkthrough
- After the analysis workflow completes, a message is sent to an Amazon Simple Notification Service (Amazon SNS) topic. A Lambda function subscribed to this topic sends a message to an Amazon SQS queue. For retroactive compression, users can submit file lists using a CLI program (as sketched above), which sends one message to the queue for each file to be compressed.
- The SQS queue operates as a first-in-first-out (FIFO) queue, processing messages in the order they are received.
- A Lambda function retrieves messages from the queue and triggers a Step Functions workflow to manage the compression tasks for each file (see the dispatch Lambda sketch after this list).
- Each file to be compressed initiates a separate Step Functions workflow.
- The compression workflow begins with a Lambda function that creates a temporary JSON file in Amazon S3 with the necessary information for PetaGene’s PetaSuite, including the file name and location in S3.
- AWS Batch is invoked, and the first step of the batch job running in a Docker container retrieves the JSON file created in the previous step. The batch job then executes a wrapper script that invokes PetaSuite with the required parameters.
- The final step of the batch job uploads the compressed files to S3.
- With the compressed files stored in S3, a Lambda function updates a file catalog in a DynamoDB table that tracks each file’s attributes, including file name, size, type, and S3 location. PetaGene’s PetaSuite generates a checksum for each file, which is also stored in the catalog (see the catalog-update sketch after this list).
- The workflow concludes by sending a message to an SNS topic.
- A Lambda function listening to this topic sends a message to another SQS queue to trigger the next workflow in the pipeline.
- Once all downstream processes are finished, tags are applied to the S3 objects, enabling lifecycle management rules that transfer the compressed files to Amazon S3 Glacier for long-term archiving.
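To make the dispatch step concrete, here is a minimal sketch of the Lambda function that consumes messages from the SQS queue and starts one Step Functions execution per file. The state machine ARN environment variable and the message fields are assumptions for illustration, not the client’s actual implementation.

```python
# Sketch of the dispatch Lambda: one Step Functions execution per file
# taken from the SQS queue. Field names and the state machine ARN are
# illustrative assumptions.
import json
import os

import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = os.environ["COMPRESSION_STATE_MACHINE_ARN"]

def handler(event, context):
    # With an SQS event source mapping, each invocation receives a batch
    # of records; each record corresponds to one file to compress.
    for record in event["Records"]:
        message = json.loads(record["body"])
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({"s3_uri": message["s3_uri"]}),
        )
    return {"started": len(event["Records"])}
```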
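The catalog update can be sketched as a Lambda function that writes one item per compressed file to the DynamoDB table. The table name and attribute names below are assumptions; the real catalog schema may differ.

```python
# Sketch of the catalog-update Lambda. Table and attribute names are
# illustrative; the production catalog schema may differ.
import os

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ.get("FILE_CATALOG_TABLE", "file-catalog"))

def handler(event, context):
    # The workflow passes the compressed file's metadata, including the
    # checksum produced by PetaSuite, as the Lambda input.
    table.put_item(
        Item={
            "file_name": event["file_name"],
            "file_type": event["file_type"],      # e.g. bam, fastq
            "size_bytes": event["size_bytes"],
            "s3_location": event["s3_location"],
            "checksum": event["checksum"],        # generated by PetaSuite
        }
    )
    return {"status": "cataloged", "file_name": event["file_name"]}
```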
PetaGene’s PetaSuite was initially used to retroactively compress previously processed data. The compression step is now initiated automatically in the production pipeline handling new data, using the architecture described here.
Getting Started
The architecture outlined above can be deployed via AWS CloudFormation. Additionally, we build a container with PetaGene’s PetaSuite installed and store it in Amazon Elastic Container Registry (Amazon ECR). The steps involved include:
- Container: Build a Docker container and push it to Amazon ECR. The Dockerfile installs PetaGene’s PetaSuite in a few lines, assuming Ubuntu as the base image:
COPY --from=builder /source_files/petasuite-cloud-edition-proxy_amd64.deb /tmp/
RUN gdebi -n /tmp/petasuite-cloud-edition-proxy_amd64.deb && rm -f /tmp/petasuite-cloud-edition-proxy_amd64.deb
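Inside the container, the batch job runs a small wrapper around PetaSuite, as described in the walkthrough above: it fetches the job JSON from S3, downloads the input file, invokes PetaSuite, and uploads the result. The sketch below shows the general shape; the JSON fields and the petasuite command-line invocation are placeholders, so consult PetaGene’s documentation for the actual options.

```python
# Sketch of the batch-job wrapper script. The job JSON fields and the
# petasuite arguments are placeholders, not PetaGene's documented CLI.
import json
import os
import subprocess
import sys

import boto3

s3 = boto3.client("s3")

def run(job_bucket: str, job_key: str) -> None:
    # 1. Fetch the temporary job description written by the workflow.
    job = json.loads(s3.get_object(Bucket=job_bucket, Key=job_key)["Body"].read())

    # 2. Download the input file (e.g. a BAM) to local scratch space.
    local_path = os.path.join("/scratch", os.path.basename(job["input_key"]))
    s3.download_file(job["input_bucket"], job["input_key"], local_path)

    # 3. Invoke PetaSuite. Replace the argument list with the options
    #    documented by PetaGene for your data type and license setup.
    subprocess.run(["petasuite", local_path], check=True)

    # 4. Upload the compressed output to the permanent bucket.
    compressed_path = local_path + ".pgbam"   # assumed output naming
    s3.upload_file(compressed_path, job["output_bucket"], job["output_key"])

if __name__ == "__main__":
    run(sys.argv[1], sys.argv[2])
```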