Genomic language models bring the techniques behind large language models to bear on challenges in genomics. This blog post and open-source project demonstrate how to pre-train a genomic language model known as HyenaDNA using genomic data in the AWS Cloud. We use AWS HealthOmics for efficient, economical omics data storage, alongside Amazon SageMaker, a fully managed machine learning (ML) service for model training and deployment.
Understanding Genomic Language Models
Genomic language models introduce a novel methodology in genomics: interpreting the language of DNA. These models harness the transformer architecture, originally developed for natural language processing (NLP), to analyze the extensive genomic data now available, allowing researchers to derive valuable insights more accurately than traditional in silico techniques and more affordably than in situ methods.
By connecting raw genetic data with actionable insights, genomic language models have the potential to impact various sectors, including whole-genome analysis, healthcare delivery, pharmaceuticals, and agriculture. They aid in discovering new gene functions, pinpointing disease-causing mutations, and crafting personalized treatment plans, thereby fostering innovation in genomics-driven fields. The capability to analyze and interpret genomic data at scale is vital for precision medicine, agricultural enhancements, and biotechnological advancements, making genomic language models a strong candidate for foundational technology in these areas.
Some notable genomic language models include:
- DNABERT, one of the early attempts to utilize the transformer architecture for understanding DNA language, employing a Bidirectional Encoder Representations from Transformers (BERT) architecture pre-trained on a human reference genome, yielding promising results for downstream supervised tasks.
- Nucleotide Transformer, which shares a similar architecture with DNABERT, demonstrated that pre-training on larger datasets and expanding the context window size improves accuracy in downstream tasks.
- HyenaDNA, which modifies the transformer architecture by replacing self-attention layers with a Hyena operator, allowing it to process up to 1 million tokens, significantly surpassing the capacity of previous models and enabling it to learn long-range interactions within DNA.
Among the models pushing the frontiers of genetic sequence analysis, we focus on HyenaDNA. Pre-trained HyenaDNA checkpoints are available on Hugging Face, which simplifies integrating them into existing projects or using them as the foundation for new genetic sequence analysis work.
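As a quick illustration, the following sketch loads a pre-trained HyenaDNA checkpoint with the Hugging Face transformers library. The checkpoint name is an assumption here; several sizes are published under the LongSafari organization, so substitute the context length you need.

```python
# Minimal sketch: load a pre-trained HyenaDNA checkpoint from Hugging Face.
# The checkpoint name is illustrative; several sizes are published.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "LongSafari/hyenadna-small-32k-seqlen-hf"  # assumed checkpoint name

# HyenaDNA ships custom modeling code, so trust_remote_code is required
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

# Score a short DNA fragment; tokens are single nucleotides
inputs = tokenizer("ACTGACTGACTG", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (batch, sequence_length, vocab_size)
```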
AWS HealthOmics and Sequence Stores
AWS HealthOmics is a specialized service designed to assist healthcare and life science organizations, along with their software partners, in storing, querying, and analyzing genomic, transcriptomic, and other omics data, ultimately leading to improved health outcomes and deeper biological insights. It supports large-scale analysis and collaborative research through HealthOmics storage, analytics, and workflow capabilities.
With HealthOmics storage, users get a managed, omics-focused, findable, accessible, interoperable, and reusable (FAIR) data store for organizing, sharing, and accessing vast amounts of bioinformatics data at a low cost per gigabase. HealthOmics sequence stores reduce costs through automatic tiering and compression of files based on usage, improve sharing and findability through biologically focused metadata and provenance tracking, and provide instant access to frequently used data through low-latency Amazon Simple Storage Service (Amazon S3) compatible APIs or HealthOmics native APIs. All of this is managed by HealthOmics, relieving customers of the complexities of file organization.
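As a sketch of what this looks like in practice, the following assumes default AWS credentials in a Region where HealthOmics is available; the store name is illustrative, and the exact shape of the S3 access details in the response may vary by SDK version.

```python
# Minimal sketch: create a HealthOmics sequence store and look up its
# S3-compatible access URI. The store name is illustrative.
import boto3

omics = boto3.client("omics")

store = omics.create_sequence_store(name="genomic-lm-pretraining")
store_id = store["id"]

# Sequence stores expose an S3 access point for low-latency reads
details = omics.get_sequence_store(id=store_id)
print(details["s3Access"]["s3Uri"])
```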
Amazon SageMaker
Amazon SageMaker is a fully managed ML service from AWS, aimed at minimizing the time and expenses associated with training and tuning ML models at scale. With SageMaker Training, a managed batch ML compute service, users can train models efficiently without the hassle of managing the underlying infrastructure. SageMaker supports popular deep learning frameworks, including PyTorch, which is crucial for the solutions discussed here.
SageMaker also provides a wide array of ML infrastructure and model deployment options to address all your ML inference requirements.
Solution Overview
In this blog post, we outline the process of pre-training a genomic language model on assembled genome data. That data can be public (such as data from GenBank) or proprietary to your organization.
The workflow starts with genomic data. For illustration, we use a public non-reference mouse genome from GenBank, part of The Mouse Genomes Project, which reflects a consensus genome sequence of inbred mouse strains. This genomic dataset can easily be substituted with proprietary datasets relevant to your research.
We begin with a SageMaker notebook to process the genomic files and import them into a HealthOmics sequence store. A second SageMaker notebook is then utilized to initiate the training job.
During the managed training job in the SageMaker environment, the mouse genome is downloaded via the S3 URI provided by HealthOmics. The training job subsequently retrieves the checkpoint weights of the HyenaDNA model from Hugging Face. These weights are pre-trained on the human reference genome, allowing the model to comprehend and predict genomic sequences, thus establishing a robust baseline for further specialized training across various genomic tasks.
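A sketch of that S3 URI lookup, assuming a read set has already been imported and that the read set metadata exposes a file-level S3 URI (field names may differ slightly across SDK versions; IDs below are placeholders):

```python
# Minimal sketch: fetch the S3 URI of an imported read set so it can be
# passed to a SageMaker training job. IDs are illustrative placeholders.
import boto3

omics = boto3.client("omics")

store_id = "1234567890"      # your sequence store ID
read_set_id = "0987654321"   # your read set ID

meta = omics.get_read_set_metadata(sequenceStoreId=store_id, id=read_set_id)
train_s3_uri = meta["files"]["source1"]["s3Access"]["s3Uri"]
print(train_s3_uri)  # pass this as a training input channel
```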
Using these resources, the HyenaDNA model undergoes training, refining its parameters with the mouse genome data. Upon completion of pre-training, and once validation results meet the required standards, the trained model is stored in Amazon S3. Finally, we deploy the model as a SageMaker real-time inference endpoint and test it against a set of known genome sequences through inference API calls.
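A sketch of the deployment step, assuming the training job has written model.tar.gz to S3 and that a serving script (here called inference.py, a hypothetical name) implements the model-loading and prediction hooks:

```python
# Minimal sketch: deploy the trained artifact as a SageMaker real-time
# endpoint and invoke it. Paths, script name, and instance type are
# illustrative assumptions.
import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

role = sagemaker.get_execution_role()

model = PyTorchModel(
    model_data="s3://my-bucket/hyenadna/model.tar.gz",  # training job output
    role=role,
    entry_point="inference.py",  # hypothetical serving script
    framework_version="2.0",
    py_version="py310",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Test the endpoint with a known genome sequence
print(predictor.predict({"sequence": "ACTGACTGACTG"}))
```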
Data Preparation and Loading into Sequence Store
The initial phase of our machine learning workflow is data preparation: uploading the genomic sequences to a HealthOmics sequence store. Although FASTA is the standard format for reference sequences, we convert the files to FASTQ, the format sequence stores expect for assembled reads from sequenced samples.
In the accompanying Jupyter notebook, we demonstrate how to download FASTA files from GenBank, convert them to FASTQ files, and load them into a HealthOmics sequence store. If you already possess genomic data in a sequence store, you can bypass this step.
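A condensed sketch of those steps, using Biopython for the conversion and assigning a uniform placeholder quality score (FASTQ requires per-base qualities that FASTA lacks); the bucket name, role ARN, and IDs are illustrative:

```python
# Minimal sketch: convert FASTA records to FASTQ with a placeholder quality
# score, then import the staged file into a HealthOmics sequence store.
import boto3
from Bio import SeqIO

def with_quality(records, phred=40):
    # FASTQ requires per-base quality; assign a uniform placeholder score
    for rec in records:
        rec.letter_annotations["phred_quality"] = [phred] * len(rec.seq)
        yield rec

SeqIO.write(
    with_quality(SeqIO.parse("mouse_genome.fasta", "fasta")),
    "mouse_genome.fastq",
    "fastq",
)

# After staging mouse_genome.fastq in S3, import it as a read set
omics = boto3.client("omics")
omics.start_read_set_import_job(
    sequenceStoreId="1234567890",
    roleArn="arn:aws:iam::123456789012:role/OmicsImportRole",
    sources=[{
        "sourceFiles": {"source1": "s3://my-bucket/mouse_genome.fastq"},
        "sourceFileType": "FASTQ",
        "subjectId": "mouse",
        "sampleId": "mouse-genome",
        "name": "mouse-genome",
    }],
)
```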
Training on SageMaker
We use PyTorch and Amazon SageMaker script mode to train the model. Script mode's native support for PyTorch lets us reuse our existing training scripts with minimal changes. The training job pulls the data from the sequence store through the S3 URIs it provides, as shown in the sketch below.
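A sketch of launching that job with the SageMaker Python SDK; the entry point script, hyperparameters, and instance type are assumptions for illustration, and train_s3_uri is the read-set URI obtained from the sequence store:

```python
# Minimal sketch: launch a script-mode training job against the read set's
# S3 URI. Script name, hyperparameters, and instance type are illustrative.
import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()
train_s3_uri = "s3://<sequence-store-access-point>/..."  # from HealthOmics

estimator = PyTorch(
    entry_point="train_hyenadna.py",  # hypothetical training script
    source_dir="scripts",
    role=role,
    framework_version="2.0",
    py_version="py310",
    instance_count=1,
    instance_type="ml.g5.2xlarge",
    hyperparameters={"epochs": 2, "max_length": 32000},
)

estimator.fit({"train": train_s3_uri})
```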