Distributed Computing with Cross-Regional Dask on AWS
Welcome to the age of data. The volume of information generated daily is continuously on the rise, prompting the need for platforms and solutions to advance accordingly. Services like Amazon Simple Storage Service (Amazon S3) present a scalable and cost-efficient approach for managing growing datasets. The Amazon Sustainability Data Initiative (ASDI) leverages the capabilities of Amazon S3 to offer a free solution for storing and sharing climate science workloads globally. Additionally, Amazon’s Open Data Sponsorship Program enables organizations to host data at no cost on AWS.
Over the past decade, the data science community has embraced a growing number of distributed computing frameworks. One notable example is Dask, known for orchestrating clusters of worker compute nodes to scale complex analyses across vast datasets.
In this post, we will guide you on deploying a custom AWS Cloud Development Kit (AWS CDK) solution that enhances Dask’s functionality for inter-Regional operations across Amazon’s expansive network. The AWS CDK solution sets up a network of Dask workers across two AWS Regions, connecting to a client Region. For further details, refer to Guidance for Distributed Computing with Cross Regional Dask on AWS and the GitHub repository for open-source code.
After the deployment, users will gain access to a Jupyter notebook enabling interaction with two datasets from ASDI on AWS: the Coupled Model Intercomparison Project 6 (CMIP6) and ECMWF ERA5 Reanalysis. CMIP6 concerns the sixth phase of the global coupled ocean-atmosphere general circulation model ensemble, while ERA5 represents the fifth-generation ECMWF atmospheric reanalyses of global climate—an operational service.
This solution was inspired by collaboration with a key AWS customer, the UK Met Office, which has been providing weather and climate predictions since 1854. Their partnership with EUMETSAT, detailed in Data Proximate Computation on a Dask Cluster Distributed Between Data Centres, underscores the urgent need for sustainable and efficient data science solutions. By bringing computation closer to the data, we avoid the excessive costs, latency, and energy consumption associated with moving large datasets.
Solution Overview
The UK Met Office generates up to 300 TB of weather and climate data daily, some of which is published to ASDI for public use. They aim to empower users to apply this data to decision-making around climate change-induced challenges, such as wildfires, floods, and food insecurity (for example, through improved crop yield analysis).
Current practice in climate data science is often time-consuming and unsustainable: petabyte-scale datasets are copied between Regions, which is both slow and costly. Bringing the computation to the data instead is estimated to save energy equivalent to the daily power consumption of 40 homes, while also reducing inter-Region data transfer.
The solution architecture can be divided into three main components: client, workers, and network. Let’s explore each segment further.
Client
The client represents the source Region where data scientists connect. This Region (Region A) includes critical components such as an Amazon SageMaker notebook, an Amazon OpenSearch Service domain, and a Dask scheduler. System administrators can access the built-in Dask dashboard via an Elastic Load Balancer.
Data scientists have access to the Jupyter notebook hosted on SageMaker, which connects to and runs workloads on the Dask scheduler. The OpenSearch Service domain stores metadata on the datasets connected across the Regions. Users can query this service to obtain details like the specific Region of Dask workers without prior knowledge of the data’s location.
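As a minimal AWS CDK sketch of the client Region's core resources, the following illustrates how a SageMaker notebook and the OpenSearch Service domain could be declared. Construct names, instance sizes, and the surrounding vpc, notebookRole, and notebookSg variables are illustrative assumptions, not the repository's actual code:
import * as sagemaker from 'aws-cdk-lib/aws-sagemaker';
import * as opensearch from 'aws-cdk-lib/aws-opensearchservice';
// Inside the client stack's constructor (vpc, notebookRole, notebookSg assumed to exist).
// OpenSearch domain that catalogues which worker Region holds which dataset.
const catalog = new opensearch.Domain(this, 'DatasetCatalog', {
  version: opensearch.EngineVersion.OPENSEARCH_2_3,
  vpc, // client Region VPC
});
// SageMaker notebook instance that data scientists use to drive the Dask scheduler.
new sagemaker.CfnNotebookInstance(this, 'DaskNotebook', {
  instanceType: 'ml.t3.medium',
  roleArn: notebookRole.roleArn,                  // assumed IAM role
  subnetId: vpc.privateSubnets[0].subnetId,
  securityGroupIds: [notebookSg.securityGroupId], // assumed security group
});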
Worker
Each worker Region (Regions B and C) comprises an Amazon Elastic Container Service (Amazon ECS) cluster of Dask workers, an Amazon FSx for Lustre file system, and a standalone Amazon Elastic Compute Cloud (Amazon EC2) instance. FSx for Lustre enables Dask workers to access and process Amazon S3 data through a high-performance file system, providing sub-millisecond latencies, substantial throughput, and millions of IOPS. A standout feature of FSx for Lustre is that only the file system's metadata is synchronized up front; file contents are loaded lazily from S3 when workers actually read them.
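As a rough sketch of how a worker Region could link its FSx for Lustre file system to the dataset's S3 bucket with the CDK (bucket name, sizing, and the vpc variable are assumptions; see the repository for the actual constructs):
import * as fsx from 'aws-cdk-lib/aws-fsx';
import * as s3 from 'aws-cdk-lib/aws-s3';
// Inside the worker stack's constructor: expose the dataset bucket to Dask workers
// as a POSIX file system. Only metadata is imported up front; file contents are
// pulled from S3 on demand when workers read them.
const datasetBucket = s3.Bucket.fromBucketName(this, 'Dataset', 'era5-pds'); // example ASDI bucket
new fsx.LustreFileSystem(this, 'WorkerFsx', {
  vpc,                                  // worker Region VPC (assumed)
  vpcSubnet: vpc.privateSubnets[0],
  storageCapacityGiB: 1200,
  lustreConfiguration: {
    deploymentType: fsx.LustreDeploymentType.SCRATCH_2,
    importPath: datasetBucket.s3UrlForObject(),   // s3://era5-pds
  },
});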
Worker clusters adapt based on CPU usage, provisioning additional workers during high-demand periods and scaling down when resources are idle. Each night at 0:00 UTC, a data sync job updates the Lustre file system with the latest metadata from the attached S3 bucket. The standalone EC2 instance then pushes these updates to the OpenSearch Service, providing the necessary information to the client regarding which pool of workers to utilize for specific datasets.
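The scaling and scheduling behavior described above could be expressed with the CDK along the following lines. The workerService handle, capacity limits, CPU target, and the syncFunction target are assumptions for illustration, not the repository's exact implementation:
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
// Scale the ECS Dask workers on CPU: add tasks under load, remove them when idle.
const scaling = workerService.autoScaleTaskCount({ minCapacity: 2, maxCapacity: 20 });
scaling.scaleOnCpuUtilization('CpuScaling', { targetUtilizationPercent: 70 });
// Kick off the nightly metadata sync at 0:00 UTC via an EventBridge rule.
new events.Rule(this, 'NightlyMetadataSync', {
  schedule: events.Schedule.cron({ minute: '0', hour: '0' }),
  targets: [new targets.LambdaFunction(syncFunction)],   // assumed sync handler
});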
Network
Networking is the foundation of this solution, which runs over Amazon's internal backbone network. By employing AWS Transit Gateway, we can interconnect each Region without relying on the public internet. This setup allows workers to dynamically connect to the Dask scheduler, enabling data scientists to conduct inter-regional queries through Dask.
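A hedged sketch of the Transit Gateway wiring for one worker Region, using the CDK's CloudFormation-level constructs (the peer transit gateway ID and Regions are placeholders; the repository handles sharing these values across Regions):
import { Stack } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
// Transit gateway for this worker Region.
const tgw = new ec2.CfnTransitGateway(this, 'WorkerTgw', {
  description: 'Inter-Region backbone for Dask worker traffic',
});
// Peer it with the client Region's transit gateway so scheduler/worker traffic
// stays on the AWS backbone rather than the public internet.
new ec2.CfnTransitGatewayPeeringAttachment(this, 'PeerToClient', {
  transitGatewayId: tgw.ref,
  peerTransitGatewayId: clientTgwId,   // assumed: exported by the client stack
  peerAccountId: Stack.of(this).account,
  peerRegion: 'us-east-1',             // example client Region (Region A)
});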
Prerequisites
The AWS CDK package is built using the TypeScript programming language. To set up your local environment and bootstrap your development account, follow the steps outlined in Getting Started with the AWS CDK (ensure you bootstrap all Regions specified in the GitHub repository).
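For example, if the client Region were us-east-1 and the worker Regions us-west-2 and eu-west-2 (check bin/variables.ts in the repository for the actual list), the bootstrap command would look like the following, with ACCOUNT-ID replaced by your AWS account ID:
npx cdk bootstrap aws://ACCOUNT-ID/us-east-1 aws://ACCOUNT-ID/us-west-2 aws://ACCOUNT-ID/eu-west-2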
For successful deployment, Docker must be installed and operational on your local machine.
Deploy the AWS CDK Package
Deploying an AWS CDK package is a straightforward process. After installing the prerequisites and bootstrapping your account, you can download the code base.
- Clone the GitHub repository:
git clone https://github.com/aws-solutions-library-samples/distributed-compute-on-aws-with-cross-regional-dask.git
cd distributed-compute-on-aws-with-cross-regional-dask
- Install node modules:
npm install
- Deploy the AWS CDK:
npx cdk deploy --all
The stack deployment may take over an hour and a half.
Code Walkthrough
In this section, we will examine some key features of the code base. For full access to the code, refer to the GitHub repository.
Configure and Customize Your Stack
In the file bin/variables.ts, you'll find two variable declarations: one for the client and one for the workers. The client declaration is a dictionary referencing a Region and CIDR range; modifying these values changes the Region and CIDR range into which the client resources are deployed. The worker variable mirrors this structure but is a list of dictionaries, so you can add or remove worker Regions and their datasets. A hypothetical example follows.
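To make this concrete, a bin/variables.ts could take roughly the following shape. The field names, Regions, CIDR ranges, and bucket names here are illustrative assumptions; consult the repository for the exact interface:
// bin/variables.ts -- illustrative shape only
export const client = {
  region: 'us-east-1',      // Region A: scheduler, notebook, OpenSearch domain
  cidr: '10.0.0.0/16',
};
export const workers = [
  {
    region: 'us-west-2',            // Region B
    cidr: '10.1.0.0/16',
    dataset: 's3://era5-pds',       // ERA5 on ASDI
  },
  {
    region: 'eu-west-2',            // Region C
    cidr: '10.2.0.0/16',
    dataset: 's3://cmip6-pds',      // CMIP6 on ASDI
  },
];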