This article was written jointly by Michael Anderson, Senior Director of Engineering, and Emily Chen, Director of Data and Computational Sciences at Vertex Pharmaceuticals. Vertex is a biotechnology company that invests in scientific innovation to develop transformative medicines for people with serious diseases. In this post, we describe how Vertex Pharmaceuticals built a high-performance, cost-effective small molecule search system on AWS.
Overview
Modern drug discovery is heavily data-driven. Its many stages generate vast amounts of data, which scientists analyze to answer complex biological and chemical questions about disease. Traditional data center environments often struggle to meet these demands, because scientific teams may need immediate access to hundreds of terabytes of storage or thousands of CPUs. Speed and efficiency matter at every step of the drug discovery journey, not only because resources are expensive but also because patients are waiting for therapies that could significantly improve their lives.
For disease programs in which Vertex Pharmaceuticals identifies small molecules as a viable therapeutic approach, the first task is to find small molecules likely to act on the underlying biology. Because the number of pharmacologically relevant small molecules exceeds the number of stars in the observable universe, testing them all experimentally is impractical. Vertex therefore uses computational techniques to assess billions of chemical structures and prioritize a select few for experimental evaluation. Its scientists apply a range of methods to these large collections, from computing shape similarity between small molecules to docking molecules into protein binding sites. A brute-force search with these methods across libraries of billions of molecules can take anywhere from days to years, depending on the complexity of the method. For instance, a common commercial library contains roughly 17 billion virtual molecules, and a 3D comparison can average one second per molecule per vCPU, so processing the entire library would take more than 500 CPU-years.
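The back-of-envelope arithmetic behind an estimate of this kind can be checked in a few lines. The sketch below uses the library size and per-molecule time quoted above; the 365-day year, the perfect-scaling assumption, and the function names are ours, not Vertex's. With these inputs the total comes out at several hundred CPU-years.

```python
# Back-of-envelope estimate of brute-force search cost.
# Inputs are the figures quoted in the text; a 365-day year is an assumption.
LIBRARY_SIZE = 17_000_000_000      # virtual molecules in the library
SECONDS_PER_MOLECULE = 1.0         # average 3D comparison time per vCPU
SECONDS_PER_YEAR = 365 * 24 * 3600


def cpu_years(n_molecules: int, seconds_each: float) -> float:
    """Total single-vCPU time, in years, to score every molecule once."""
    return n_molecules * seconds_each / SECONDS_PER_YEAR


def wall_clock_days(n_molecules: int, seconds_each: float, vcpus: int) -> float:
    """Idealized wall-clock days, assuming perfect scaling across `vcpus`."""
    return n_molecules * seconds_each / vcpus / 86_400


total = cpu_years(LIBRARY_SIZE, SECONDS_PER_MOLECULE)
days = wall_clock_days(LIBRARY_SIZE, SECONDS_PER_MOLECULE, vcpus=1000)
print(f"{total:.0f} CPU-years, or about {days:.0f} days on 1,000 vCPUs")
```

Even with a thousand vCPUs running flat out, an exhaustive pass would take months, which is why a search strategy that scores only part of the library is attractive.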
Thompson Sampling
Thompson Sampling, a heuristic first described in 1933, has been used in many settings, including clinical trials and A/B testing. It balances exploration and exploitation: Bayesian updates of prior distributions probabilistically concentrate the search on the most promising regions of the search space. Vertex Pharmaceuticals’ scientists developed an approach that uses Thompson Sampling to search molecule libraries of any size.
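To make the idea concrete, here is a minimal sketch of Thompson Sampling in its textbook form, a Beta-Bernoulli bandit. It is a generic illustration rather than Vertex's molecule search: the arms, their hidden success rates, and the function name are invented for the example.

```python
import random


def thompson_sampling(true_rates, n_rounds, seed=0):
    """Beta-Bernoulli Thompson Sampling over a fixed set of arms.

    true_rates: hidden success probability of each arm (toy data).
    Returns the number of times each arm was played.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [1] * n_arms   # alpha of a Beta(1, 1) prior per arm
    failures = [1] * n_arms    # beta of a Beta(1, 1) prior per arm
    pulls = [0] * n_arms

    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its current posterior.
        draws = [rng.betavariate(successes[i], failures[i]) for i in range(n_arms)]
        # Play the arm with the highest draw; uncertain arms still win sometimes,
        # which is where the exploration comes from.
        arm = max(range(n_arms), key=draws.__getitem__)
        reward = rng.random() < true_rates[arm]
        # Bayesian update of the chosen arm's posterior.
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling([0.2, 0.5, 0.8], n_rounds=2000)
print(pulls)
```

Early rounds spread plays across all arms. As the posteriors sharpen, most plays go to the arm with the highest success rate.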
Architecture Overview
Vertex Pharmaceuticals established several objectives for their distributed Thompson Sampling architecture:
- Performance: To accelerate their drug discovery efforts, they aimed to horizontally scale their compound search for rapid results, even with expanding small molecule libraries.
- Cost Optimization: Considering the expansive scale and frequency of these compound searches, a cost-effective architecture was essential.
- Democratized Access: The computational chemists behind the Thompson Sampling system wanted to ensure that their colleagues could utilize it without needing in-depth knowledge of the system or AWS expertise.
To meet the performance and cost objectives, Vertex built a serverless architecture that runs its compute-intensive, stateless search workers on AWS Fargate Spot. To make the system easy to use, Vertex implemented a serverless API with Amazon API Gateway that dynamically creates a Thompson Sampling search environment from the specifications in each API request.
In this setup, a computational chemist submits an HTTP POST request describing the molecule search parameters to an Amazon API Gateway endpoint. An AWS Lambda function behind the endpoint launches an AWS Fargate task to orchestrate the search, sets up an Amazon MQ for RabbitMQ queue for communication, and submits the search request to the orchestrator. The orchestrator then creates a Fargate Spot service of search workers sized according to the search request (typically 100 to 500 tasks) and coordinates with the workers through the queue to drive the search. The virtual library is defined in terms of reactions and reagents, which are stored in a relational database. Each worker samples reagents within a given reaction to produce the actual product molecules, which are then scored. Workers send scoring information back to the orchestrator through the queue, so the orchestrator can track how each molecule performs and use the Thompson Sampling heuristic to steer the search toward molecules with promising scores.
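The orchestrator and worker loop described above can be sketched in a single process. Everything here is an assumption made for illustration and not Vertex's implementation: in-memory `queue.Queue` objects stand in for the message broker, one inline worker stands in for the Fargate Spot service, `score_product` is a made-up scorer, and each reagent's score is modeled with a Gaussian belief updated by a normal-normal conjugate rule with an assumed noise level.

```python
import queue
import random

rng = random.Random(42)

# Toy two-component reaction: a product is one reagent chosen from each slot.
N_REAGENTS = [50, 50]
# Hidden per-reagent contributions stand in for a real 3D scoring method.
HIDDEN = [[rng.gauss(0, 1) for _ in range(n)] for n in N_REAGENTS]


def score_product(combo):
    """Hypothetical scorer: sum of hidden reagent contributions plus noise."""
    return sum(HIDDEN[slot][r] for slot, r in enumerate(combo)) + rng.gauss(0, 0.1)


# In-process queues stand in for the message broker (assumption).
tasks, results = queue.Queue(), queue.Queue()


def worker_step():
    """Worker: take one reagent combination, score it, and report back."""
    combo = tasks.get()
    results.put((combo, score_product(combo)))


# Orchestrator state: every score observed so far, per reagent.
observed = [[[] for _ in range(n)] for n in N_REAGENTS]


def sample_belief(scores, prior_mean=0.0, prior_sd=2.0, noise_sd=1.0):
    """Draw from the Gaussian posterior over one reagent's mean score."""
    precision = 1 / prior_sd**2 + len(scores) / noise_sd**2
    mean = (prior_mean / prior_sd**2 + sum(scores) / noise_sd**2) / precision
    return rng.gauss(mean, precision ** -0.5)


def pick_combo():
    """Thompson step: in each slot, choose the reagent with the highest draw."""
    return tuple(
        max(range(n), key=lambda r: sample_belief(observed[slot][r]))
        for slot, n in enumerate(N_REAGENTS)
    )


best = {}  # best score seen for each evaluated product
for _ in range(500):
    tasks.put(pick_combo())          # orchestrator publishes a task
    worker_step()                    # a worker scores it
    combo, score = results.get()     # orchestrator consumes the result
    for slot, r in enumerate(combo):
        observed[slot][r].append(score)   # update beliefs for both reagents
    best[combo] = max(score, best.get(combo, float("-inf")))

evaluated_fraction = len(best) / (N_REAGENTS[0] * N_REAGENTS[1])
top_combo = max(best, key=best.get)
print(f"evaluated {evaluated_fraction:.1%} of the library; best combo {top_combo}")
```

Keeping beliefs per reagent rather than per product is what lets a search of this shape cover a combinatorial library without enumerating it: each scored product updates the estimates for every product that shares one of its reagents.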
When the search of the molecule library is complete, the orchestrator saves the resulting set of matching molecules to Amazon S3 for the scientists to retrieve, and the Fargate and Amazon MQ resources are then shut down gracefully.
Conclusion
Thompson Sampling searches can routinely scale to 1,000 vCPUs within minutes, and each search costs only a few hundred dollars, compared with the thousands of dollars required for pre-provisioned and often underutilized on-premises resources. Notably, the Thompson Sampling implementation examined only a small fraction (0.07%) of the entire library, yielding an 18 to 20-fold increase in speed compared to traditional serial approaches. Because the architecture is entirely serverless, it scales down to zero and incurs no cost when idle, and the engineers who work with the scientists can spend their time building new features rather than managing infrastructure. With Thompson Sampling searches, Vertex Pharmaceuticals can evaluate chemical compounds virtually and prioritize a subset for experimental testing. The results open new areas of exploration for scientists as they continue their search for transformative therapeutics for patients.