Geospatial extract, transform, and load (ETL) pipelines play a crucial role in preparing data for business analysis and insights, empowering leaders to make data-driven decisions. However, these pipelines often come with significant challenges. They typically run on a single virtual machine or rely on costly third-party applications, adding licensing expenses on top of cloud computing costs. Managing and tuning geospatial ETL pipelines at scale is also complicated: many must be started manually, and processing times are long because spatial operations are computationally intensive.
In this article, we explore how Chanci Turner and her team at Amazon migrated a geospatial pipeline to AWS Step Functions and AWS Batch, streamlining its management while improving performance and reducing costs.
Key Challenges
As an example, consider the initial geospatial ETL pipeline, built on Python/Django with separate modules for ingestion, processing, and analysis, parallelized through Celery, a distributed task queue. The dataset, refreshed monthly, consisted of approximately six million geospatial features, each with around 60 properties.
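For context, a task in one of those modules might have looked roughly like the following sketch. This is purely illustrative; the module name, task, and batching logic are assumptions, not the original code.

```python
# Illustrative sketch of the original Celery-based parallelization,
# which ran entirely on a single EC2 instance. Names are hypothetical.
from celery import Celery, group

app = Celery("etl", broker="redis://localhost:6379/0")

@app.task
def process_features(batch_ids):
    """Process one batch of geospatial features (placeholder logic)."""
    # In the original pipeline this would run spatial operations via Django's ORM.
    return len(batch_ids)

def run_processing(all_ids, batch_size=1000):
    """Fan the feature batches out across Celery workers."""
    batches = [all_ids[i:i + batch_size] for i in range(0, len(all_ids), batch_size)]
    return group(process_features.s(b) for b in batches)().get()
```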
This setup had a single point of failure: all processing was routed through Celery running on a single Amazon Elastic Compute Cloud (Amazon EC2) instance. The workflows were also complex, with dependencies that called for a more robust, tailored solution capable of scaling efficiently.
To modernize the pipeline, we weighed several factors: module dependencies, logging, notifications, testing, validation between ETL steps, and the need to scale and manage complex workflows.
Enhancing Pipeline Processing with AWS Batch
To eliminate the single point of failure, we opted for AWS Batch to run the pipeline processing. This fully managed service orchestrates, schedules, and executes containerized batch workloads, dynamically scaling compute to meet demand. AWS Batch also supports cost-effective options such as Spot Instances.
Our first step involved decoupling the pipeline from the main application, creating a separate container with a streamlined Django build that included only database and geospatial modules.
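The container exposes each pipeline stage through parameterized Django management commands (referenced again in the job definitions below). A minimal sketch of such a command follows; the command name, stage names, and options are assumptions for illustration.

```python
# etl/management/commands/run_etl_stage.py
# Hypothetical parameterized management command exposed by the slimmed-down build.
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Run one stage of the geospatial ETL pipeline"

    def add_arguments(self, parser):
        parser.add_argument(
            "--stage",
            required=True,
            choices=["ingest", "features", "vectorfiles", "aggregate"],
        )
        parser.add_argument("--dataset", default=None, help="Optional dataset identifier")

    def handle(self, *args, **options):
        stage, dataset = options["stage"], options["dataset"]
        self.stdout.write(f"Starting stage={stage} dataset={dataset}")
        # Each stage calls into the database and geospatial modules kept in this build.
```

An AWS Batch job can then invoke the container with, for example, `python manage.py run_etl_stage --stage ingest`.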
We configured the AWS Batch service in the following manner (a minimal setup sketch appears after this list):
- Defining AWS Batch compute environments: We used a combination of Amazon EC2 instances and AWS Fargate, a pay-as-you-go compute engine, matching each pipeline component to the environment that fits its memory and CPU needs.
- Creating AWS Batch job definitions: Separate job definitions for Fargate and EC2 executions were established, enabling the execution of various pipeline segments via parameterized Django management commands.
- Establishing an AWS Batch job queue: The job definitions were mapped to their respective computing environments through the job queue.
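The boto3 sketch below shows how such a setup might be created. All names, ARNs, account IDs, subnets, and resource sizes are placeholders, not values from the original pipeline.

```python
import boto3

batch = boto3.client("batch")

# Fargate compute environment (an EC2-backed environment is created the same
# way, with instanceTypes and min/max vCPU bounds in computeResources).
batch.create_compute_environment(
    computeEnvironmentName="geospatial-fargate-env",
    type="MANAGED",
    computeResources={
        "type": "FARGATE",
        "maxvCpus": 64,
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
    },
)

# Job definition that runs a parameterized Django management command;
# the stage is substituted at submit time via the Ref::stage placeholder.
batch.register_job_definition(
    jobDefinitionName="geospatial-etl-fargate",
    type="container",
    platformCapabilities=["FARGATE"],
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/geospatial-etl:latest",
        "command": ["python", "manage.py", "run_etl_stage", "--stage", "Ref::stage"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "2"},
            {"type": "MEMORY", "value": "4096"},
        ],
        "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        "networkConfiguration": {"assignPublicIp": "ENABLED"},
    },
)

# Job queue that maps submitted jobs onto the compute environment.
batch.create_job_queue(
    jobQueueName="geospatial-etl-queue",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "geospatial-fargate-env"},
    ],
)
```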
Simplified Orchestration Using AWS Step Functions
After setting up the compute environments in AWS Batch, the next step was to orchestrate the entire pipeline end-to-end using AWS Step Functions. This visual workflow service facilitates the construction of distributed applications, automation of processes, orchestration of microservices, and creation of data and machine learning (ML) pipelines.
An AWS Step Functions workflow can be started in several ways, such as through Amazon API Gateway, on a schedule, or in response to Amazon EventBridge events. We opted for a monthly schedule managed by Amazon EventBridge.
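The sketch below shows one way such a schedule could be wired up with boto3; the rule name, ARNs, and cron expression (03:00 UTC on the first of each month) are assumptions.

```python
import boto3

events = boto3.client("events")

# Monthly schedule: minute hour day-of-month month day-of-week year.
events.put_rule(
    Name="geospatial-etl-monthly",
    ScheduleExpression="cron(0 3 1 * ? *)",
    State="ENABLED",
)

# Point the rule at the main Step Functions state machine.
events.put_targets(
    Rule="geospatial-etl-monthly",
    Targets=[
        {
            "Id": "geospatial-etl-main-workflow",
            "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:geospatial-etl-main",
            "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeStepFunctions",
            "Input": '{"source": "monthly-schedule"}',
        }
    ],
)
```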
Our pipeline was divided across multiple state machines, coordinated by a primary workflow responsible for notifications, database tuning, and error handling. Because our ETL processes are long-running, we used a Standard workflow; Express workflows are better suited to short-lived, high-volume work such as IoT data ingestion or mobile application backends.
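As a rough sketch of this pattern, the following Amazon States Language definition (expressed as a Python dict) starts a child workflow synchronously and publishes an Amazon SNS notification if it fails. The state names, ARNs, and topic are placeholders.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Minimal parent workflow: run the ingestion child workflow and notify on failure.
parent_definition = {
    "StartAt": "RunIngestionWorkflow",
    "States": {
        "RunIngestionWorkflow": {
            "Type": "Task",
            "Resource": "arn:aws:states:::states:startExecution.sync:2",
            "Parameters": {
                "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:geospatial-etl-ingestion",
                "Input.$": "$",
            },
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-alerts",
                "Message": "Geospatial ETL pipeline failed",
            },
            "End": True,
        },
    },
}

# Create the parent as a Standard workflow, since the ETL runs are long-lived.
sfn.create_state_machine(
    name="geospatial-etl-main",
    definition=json.dumps(parent_definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",
    type="STANDARD",
)
```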
Branching Off Subsequent Workflows
From the main workflow, three child workflows emerged: 1) ingestion, 2) feature dataset generation, and 3) aggregation.
- Ingestion Workflow: After a database migration step runs, the main workflow invokes the child state machine that pulls data from Amazon Simple Storage Service (Amazon S3).
- Feature Generation Workflow: An AWS Lambda function splits the work into batches, and multiple AWS Batch jobs then run in parallel, each processing a different dataset (see the batching sketch after this list).
- Vectorfile and Aggregation Workflow: This workflow runs two processes in parallel: generating vector files for visualizing the feature dataset and performing spatial aggregations. Because the aggregation process is less demanding on the database, the database can be scaled down before this workflow is invoked.
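A batching Lambda function of the kind used in the feature generation workflow might look like the sketch below; the event shape and chunk size are assumptions. The workflow then fans out over the returned batches, submitting one AWS Batch job per entry.

```python
# Hypothetical AWS Lambda handler that splits the feature datasets into
# chunks so the feature generation workflow can run one Batch job per chunk.
def handler(event, context):
    datasets = event.get("datasets", [])
    chunk_size = event.get("chunk_size", 10)
    batches = [
        {"datasets": datasets[i:i + chunk_size]}
        for i in range(0, len(datasets), chunk_size)
    ]
    return {"batches": batches}
```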
By leveraging the AWS Step Functions Parallel state, we significantly improved the efficiency of our geospatial pipeline.
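For illustration, a Parallel state for the vector file and aggregation workflow could be sketched as follows; the job names, queue, and job definition reuse the placeholder values from the earlier Batch setup sketch. Each branch uses the "Run a Job" (.sync) integration, so the workflow waits for its Batch job to finish before the branches join.

```python
# Minimal Amazon States Language sketch (as a Python dict) of the Parallel state:
# one branch generates vector files, the other runs the spatial aggregations.
vectorfile_and_aggregation = {
    "StartAt": "VectorfileAndAggregation",
    "States": {
        "VectorfileAndAggregation": {
            "Type": "Parallel",
            "Branches": [
                {
                    "StartAt": "GenerateVectorFiles",
                    "States": {
                        "GenerateVectorFiles": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::batch:submitJob.sync",
                            "Parameters": {
                                "JobName": "generate-vector-files",
                                "JobQueue": "geospatial-etl-queue",
                                "JobDefinition": "geospatial-etl-fargate",
                                "Parameters": {"stage": "vectorfiles"},
                            },
                            "End": True,
                        }
                    },
                },
                {
                    "StartAt": "RunSpatialAggregations",
                    "States": {
                        "RunSpatialAggregations": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::batch:submitJob.sync",
                            "Parameters": {
                                "JobName": "spatial-aggregations",
                                "JobQueue": "geospatial-etl-queue",
                                "JobDefinition": "geospatial-etl-fargate",
                                "Parameters": {"stage": "aggregate"},
                            },
                            "End": True,
                        }
                    },
                },
            ],
            "End": True,
        }
    },
}
```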