How Caris Life Sciences Processed 400,000 RNAseq Samples in 2.5 Days with AWS Batch

This article was contributed by Ava Thompson, Liam Johnson, Chanci Turner, and Sophia Nguyen on 28 JAN 2025.

In the realm of genomic data processing for cancer patients, time is of the essence. Caris Life Sciences, a leader in AI-driven technology and precision medicine, faced the urgent need to process whole transcriptome sequencing data from over 400,000 cases for their research projects.

Traditionally, this task could take months on standard infrastructure, but with AWS, Caris achieved it in a remarkable 2.5 days. This transformation allowed them to analyze 23,000 RNA genes per sample while efficiently managing their multimodal database, which exceeds 40 petabytes.

In this post, we delve into the details of this accomplishment. You will discover how Caris used various AWS services, including AWS Batch, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances, and the AWS HealthOmics Sequence Store, to create a highly scalable solution for processing hundreds of thousands of samples while keeping costs in check through effective use of Spot Instances.

The Opportunity

Caris estimated that conducting RNA sequencing analysis for 400,000 samples using their traditional on-premises infrastructure would have taken approximately three months. This timeline posed a significant challenge for a company dedicated to advancing precision medicine, as it represented delays in obtaining insights critical to cancer research and business prospects. Thus, the need for a rapid processing solution without sacrificing cost efficiency became paramount.

The Solution

Instead of relying on a standard RNAseq analysis pipeline based on nf-core, Caris designed a tailored solution that integrated Nextflow with AWS Batch and Amazon EC2. Their infrastructure scaled to around 200,000 concurrent Amazon EC2 Spot vCPUs across multiple Availability Zones, drawing on general-purpose (M-type), compute-optimized (C-type), and memory-optimized (R-type) instance families. At its peak, the setup expanded to 4,000 instances, leveraging the Spot Instance allocation strategy in AWS Batch to keep costs low.
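
A minimal boto3 sketch of what such a managed Spot compute environment might look like; the names, subnets, security group, and instance role below are placeholders, not Caris's actual configuration:

```python
import boto3

batch = boto3.client("batch")

# Hypothetical resource names; only the 200,000 vCPU ceiling and the
# M/C/R instance families come from the article.
batch.create_compute_environment(
    computeEnvironmentName="rnaseq-spot-ce",
    type="MANAGED",
    computeResources={
        "type": "SPOT",
        # Steers Batch toward Spot pools with the deepest spare capacity,
        # which lowers interruption rates at this scale.
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "maxvCpus": 200000,
        # Mix of general-purpose, compute-optimized, and memory-optimized families.
        "instanceTypes": ["m5", "c5", "r5"],
        # Subnets in multiple Availability Zones widen the capacity pool.
        "subnets": ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"],
        "securityGroupIds": ["sg-0example"],
        "instanceRole": "ecsInstanceRole",
        "tags": {"Project": "rnaseq-backfill"},
    },
)
```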

To optimize performance, Caris implemented a gradual scaling approach. They started with manageable batches of 100 samples, steadily increasing to 500, and then to 1,000 samples running in parallel. This method proved effective during their initial test run, where they successfully processed 10,000 samples using 30,000 vCPUs in just 10 hours.

Each Nextflow submission managed a genomic flow cell containing about 100 RNA samples. Each sample required 10 to 20 tasks executed across 5-10 Docker containers equipped with specific bioinformatics tools. The computational demands varied widely, with some tasks finishing in minutes while others ran for up to four hours.

Resource needs were equally diverse, with vCPU requirements ranging from 1 to 64 cores (averaging around 24) and memory needs spanning 4 GB to 64 GB. For example, the STAR alignment step required around 50 GB of memory, while the quality control (QC) steps ran efficiently with far lower memory and CPU allocations.
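
Heterogeneous requirements like these map naturally onto per-task AWS Batch job definitions. The sketch below is illustrative only, with hypothetical image names and figures rounded from the text:

```python
import boto3

batch = boto3.client("batch")

# STAR alignment: memory-heavy (~50 GB per the article).
batch.register_job_definition(
    jobDefinitionName="star-align",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/star:2.7",  # placeholder
        "resourceRequirements": [
            {"type": "VCPU", "value": "24"},
            {"type": "MEMORY", "value": "51200"},  # MiB, roughly 50 GB
        ],
    },
)

# QC: light on both CPU and memory.
batch.register_job_definition(
    jobDefinitionName="rnaseq-qc",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/qc:1.0",  # placeholder
        "resourceRequirements": [
            {"type": "VCPU", "value": "2"},
            {"type": "MEMORY", "value": "4096"},  # 4 GB
        ],
    },
)
```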

A key factor that sped up their processing was the transition from submitting individual AWS Batch jobs to array jobs, which helped them stay within the transactions-per-second (TPS) limits they had been hitting. This approach greatly improved job submission throughput and task execution efficiency. Another critical factor in achieving this scale and speed was storing their FASTQ files in the AWS HealthOmics Sequence Store, which provided a robust foundation for the processing pipeline.
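
The array-job pattern looks roughly like this (the queue name is hypothetical, and the job definition is the one sketched above): one SubmitJob call fans out into many child jobs, so a flow cell's worth of samples costs a single API transaction instead of one hundred:

```python
import boto3

batch = boto3.client("batch")

# One array job covers ~100 samples with a single SubmitJob call,
# instead of 100 calls that would eat into the API TPS limit.
resp = batch.submit_job(
    jobName="flowcell-align",          # hypothetical name
    jobQueue="rnaseq-spot-queue",      # hypothetical queue
    jobDefinition="star-align",
    arrayProperties={"size": 100},
)
print(resp["jobId"])

# Inside the container, each child job reads AWS_BATCH_JOB_ARRAY_INDEX
# (0..99) to select its sample from a manifest, for example:
#   sample = manifest[int(os.environ["AWS_BATCH_JOB_ARRAY_INDEX"])]
```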

Architecture Overview

Caris employed AWS Batch orchestration, leveraging the AWS HealthOmics Sequence Store for FASTQ file storage, distributing processing across Amazon EC2 Spot vCPUs, and storing bioinformatics container images in Amazon Elastic Container Registry (Amazon ECR), with output files saved in Amazon Simple Storage Service (Amazon S3). Nextflow coordinated the pipeline execution while AWS Batch optimized job submission and scaling using array jobs. Additionally, a custom Amazon CloudWatch dashboard facilitated batch runtime monitoring, enabling resource optimization across their extensive parallel processing environment.

Caris also utilized the AWS Batch Runtime Monitoring solution, which provided essential metrics and insights regarding job execution patterns and resource utilization. This open-source monitoring framework proved indispensable for managing their large-scale workload, allowing them to track job statuses, identify bottlenecks, and optimize resource allocation throughout their extensive processing pipeline.

Challenges Faced

Scaling to this level required meticulous attention to various technical constraints and potential bottlenecks. The team collaborated closely with AWS to increase their Amazon EC2 Spot vCPU limit and expand their Amazon Elastic Block Store (Amazon EBS) capacity to 800 TiB. They encountered and tackled several unique challenges along the way.

For instance, when they faced API rate limits while querying Spot Instance requests with DescribeSpotInstanceRequests calls, they devised a solution using instance tagging to monitor costs without overwhelming the EC2 API.
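One way to implement that pattern is to tag instances at launch, for example through the compute environment's tags, and then enumerate them with the Resource Groups Tagging API instead of polling the EC2 Spot APIs. The tag key and value below are assumptions:

```python
import boto3

# Query instances by tag rather than calling DescribeSpotInstanceRequests,
# which was being throttled at this scale.
tagging = boto3.client("resourcegroupstaggingapi")

instance_arns = []
paginator = tagging.get_paginator("get_resources")
for page in paginator.paginate(
    TagFilters=[{"Key": "Project", "Values": ["rnaseq-backfill"]}],  # assumed tag
    ResourceTypeFilters=["ec2:instance"],
):
    instance_arns.extend(r["ResourceARN"] for r in page["ResourceTagMappingList"])

print(f"{len(instance_arns)} tagged instances found")
```

The same tags can double as cost allocation tags, so spend shows up per project in billing reports without any extra API traffic.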

Storage management became vital as the project consumed 18 petabytes of Amazon S3 storage. They optimized their S3 access patterns by distributing objects across different top-level prefixes to alleviate potential request-rate bottlenecks, following best practices.
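
A common way to implement such prefix partitioning, sketched here with an illustrative key scheme rather than Caris's exact one, is to derive a short hash shard as the top-level prefix, since Amazon S3 scales request rates per prefix:

```python
import hashlib

def output_key(sample_id: str, filename: str) -> str:
    """Spread objects across many top-level prefixes so request load
    is partitioned rather than funneled through one hot prefix."""
    shard = hashlib.md5(sample_id.encode()).hexdigest()[:2]  # 256 shards
    return f"{shard}/{sample_id}/{filename}"

# e.g. "3f/SAMPLE-000123/abundance.tsv"
print(output_key("SAMPLE-000123", "abundance.tsv"))
```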

The team also encountered challenges with Docker container cleanup during high-throughput operations. They resolved this by tuning their Amazon ECS configuration parameters and upgrading from gp2 to gp3 EBS volume types for better I/O performance.
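
Both knobs can be set in a launch template that the Batch compute environment references: ECS agent cleanup intervals via user data, and the gp3 root volume via block device mappings. All names and values in this sketch are assumptions, not Caris's settings:

```python
import base64
import boto3

ec2 = boto3.client("ec2")

# ECS agent tuning for more aggressive Docker cleanup; values illustrative.
user_data = """#!/bin/bash
cat <<'EOF' >> /etc/ecs/ecs.config
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=15m
ECS_IMAGE_CLEANUP_INTERVAL=10m
ECS_NUM_IMAGES_DELETE_PER_CYCLE=10
EOF
"""

ec2.create_launch_template(
    LaunchTemplateName="rnaseq-batch-lt",  # hypothetical name
    LaunchTemplateData={
        "UserData": base64.b64encode(user_data.encode()).decode(),
        "BlockDeviceMappings": [{
            "DeviceName": "/dev/xvda",
            # gp3 lets you provision IOPS and throughput independently of
            # volume size, unlike gp2.
            "Ebs": {"VolumeSize": 200, "VolumeType": "gp3",
                    "Iops": 6000, "Throughput": 500},
        }],
    },
)
```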

The AWS HealthOmics Sequence Store played a crucial role but required an increase in the GetReadSetMetadata API throughput limit to 100 TPS. The sequence store sustained a peak throughput of 60 GB/s and averaged 10-15 GB/s through its access points.
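
Even with a raised throttle, client-side backoff still matters when operating near the limit. A minimal sketch using the boto3 omics client with adaptive retries; the store and read set IDs are placeholders:

```python
import boto3
from botocore.config import Config

# Adaptive retry mode backs off automatically on throttling errors,
# smoothing bursts near the 100 TPS GetReadSetMetadata limit.
omics = boto3.client(
    "omics",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

meta = omics.get_read_set_metadata(
    sequenceStoreId="1234567890",  # hypothetical store ID
    id="0987654321",               # hypothetical read set ID
)
print(meta["name"], meta["fileType"], meta["status"])
```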

Additionally, they improved job-level error handling and reliability by implementing automatic retries for AWS Batch jobs.
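
AWS Batch expresses this through the job definition's retryStrategy. The sketch below, reusing the hypothetical STAR definition from earlier, retries host-level failures such as Spot reclaims while exiting promptly on genuine application errors:

```python
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="star-align",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/star:2.7",  # placeholder
        "resourceRequirements": [
            {"type": "VCPU", "value": "24"},
            {"type": "MEMORY", "value": "51200"},
        ],
    },
    retryStrategy={
        "attempts": 3,
        "evaluateOnExit": [
            # Host-level failures (e.g. Spot interruption) are retried...
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            # ...anything else fails fast rather than burning retries.
            {"onReason": "*", "action": "EXIT"},
        ],
    },
)
```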

Conclusion

By harnessing AWS, Caris Life Sciences transformed a months-long computational challenge into a matter of days. This breakthrough significantly accelerated their capacity to derive insights that can drive clinical cancer research, reflecting the immense potential of AWS cloud computing in life sciences. The success of this project paves the way for faster research and better patient care through efficient data processing.

If you’re encountering similar challenges in large-scale genomic processing, the AWS Healthcare and Life Sciences team can assist you in exploring solutions tailored to your needs. Don’t hesitate to reach out to your AWS account team for guidance on expediting your genomic workflows.
