I am thrilled to announce the launch of a distributed map feature for AWS Step Functions. This enhancement is designed to facilitate the orchestration of large-scale parallel workloads, particularly for on-demand processing of semi-structured data.
The map state in Step Functions runs the same processing steps for every entry in a dataset. Until now, the map state was limited to 40 parallel iterations at a time, which made it difficult to scale data processing workloads to thousands of items; reaching higher levels of parallelism required complex workarounds built around the existing map state.
With the introduction of the new distributed map state, you can now build Step Functions workflows that coordinate large-scale parallel workloads within your serverless applications. You can iterate over millions of objects, such as logs, images, or CSV files stored in Amazon Simple Storage Service (Amazon S3), and the distributed map state can launch up to 10,000 parallel child workflows to process the data.
You can process the data by composing any service API supported by Step Functions, but typically you will invoke Lambda functions to process the data with code written in your favorite programming language. The distributed map supports a maximum concurrency of up to 10,000 parallel executions, which is higher than the concurrency quotas of many other AWS services, and you can configure that maximum concurrency so you do not exceed the limits of downstream services.
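To make this concrete, here is a minimal Python sketch of a per-item Lambda handler. It assumes the distributed map is configured with an S3 object list as its item source, so each invocation receives a single object reference; the bucket name and the event field names are assumptions for this example.

```python
import boto3

s3 = boto3.client("s3")

# The item reader passes object keys, not the bucket name, so the bucket is
# hard-coded here; in practice, pass it via an environment variable or the
# map state's item selector. The name below is an assumption.
BUCKET = "awsnewsblog-distributed-map"


def lambda_handler(event, context):
    # With an S3 object list as the item source, each child workflow (and
    # therefore each invocation) receives one object reference. The exact
    # field names depend on the item reader configuration; "Key" is assumed.
    key = event["Key"]

    response = s3.get_object(Bucket=BUCKET, Key=key)
    body = response["Body"].read()

    # Placeholder processing step: report the object size.
    return {"key": key, "size_bytes": len(body)}
```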
When considering the integration with other services, it’s important to be aware of two factors: first, the maximum concurrency permissible for your account, and second, the burst and ramping rates that govern how quickly you can reach maximum concurrency.
For instance, in the case of Lambda, the concurrency of your functions refers to the number of instances serving requests simultaneously. The default maximum concurrency limit for Lambda is set at 1,000 per AWS Region, but you may request an increase anytime. During an initial surge of traffic, your functions can reach a cumulative concurrency of between 500 and 3000, which varies by Region. It’s crucial to note that the burst concurrency limit applies to all functions within the Region.
When utilizing a distributed map, it’s essential to check the quotas for downstream services. It’s wise to limit the maximum concurrency of the distributed map during development and plan for potential service quota increases accordingly.
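For example, before settling on a maximum concurrency value, you can check the Lambda concurrency quota for your account and, if needed, reserve concurrency for the processing function so the map cannot starve other functions in the Region. The boto3 sketch below shows one way to do this; the function name and the reserved value are assumptions.

```python
import boto3

lambda_client = boto3.client("lambda")

# Account-level Lambda concurrency limit for the current Region.
account = lambda_client.get_account_settings()
print(
    "Account concurrent executions limit:",
    account["AccountLimit"]["ConcurrentExecutions"],
)

# Optionally cap the processing function. The function name and the
# reserved value are placeholders for this example.
lambda_client.put_function_concurrency(
    FunctionName="AWSNewsBlogDistributedMap",
    ReservedConcurrentExecutions=500,
)
```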
Comparison of Original Map State Flow and New Distributed Map Flow
| Feature | Original Map State Flow | New Distributed Map Flow |
| --- | --- | --- |
| Sub-workflows | Executes a sub-workflow for each item in an array, with events added to the execution history. | Executes a sub-workflow for each item in an array or S3 dataset, with a separate event history for each sub-workflow. |
| Parallel branches | Map iterations run in parallel, with a maximum effective concurrency of around 40. | Can handle millions of items with concurrency of up to 10,000. |
| Input source | Accepts only a JSON array as input. | Accepts input from S3 object lists, JSON arrays, JSON files, CSV files, or S3 inventories. |
| Payload | Limited to 256 KB. | Each iteration receives a reference to a file or a single record from a file. |
| Execution history | Limited to 25,000 events. | Each iteration is a child execution with up to 25,000 events (Express workflows have no limit). |
Sub-workflows within a distributed map can be utilized with both Standard workflows and low-latency, short-duration Express Workflows. This new feature is particularly optimized for use with Amazon S3; you can configure the bucket and prefix for your data directly from the distributed map settings. The distributed map will cease reading after processing 100 million items and supports JSON or CSV files of up to 10 GB.
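If your input is a single large CSV file rather than a list of S3 objects, the map state's item reader is configured differently: it reads the file through S3 GetObject and iterates over individual records. The fragment below is a sketch of that configuration, expressed as a Python dict for readability; the bucket and key are assumptions, and the Step Functions documentation lists the full set of ReaderConfig options.

```python
# Sketch of an ItemReader configuration for a CSV file stored in S3.
# The map state then iterates over records rather than S3 objects.
csv_item_reader = {
    "Resource": "arn:aws:states:::s3:getObject",
    "ReaderConfig": {
        "InputType": "CSV",
        "CSVHeaderLocation": "FIRST_ROW",  # use the first row as column names
    },
    "Parameters": {
        "Bucket": "awsnewsblog-distributed-map",  # placeholder
        "Key": "data/records.csv",                # placeholder
    },
}
```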
When dealing with large files, it’s important to consider the capabilities of downstream services. For example, a Lambda function must be able to fit the input file within the storage and memory available to its execution environment. To make large files easier to handle, Lambda Powertools for Python has introduced a streaming feature that lets you fetch, transform, and process S3 objects with a minimal memory footprint. This enables Lambda functions to work with files larger than the memory or storage of their execution environment. For further information on this functionality, check out the Lambda Powertools documentation.
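The snippet below is a minimal sketch of that streaming approach, using the Powertools S3Object utility with a CSV transformation. The event fields are assumptions for this example, and the exact import paths may differ between Powertools versions.

```python
from aws_lambda_powertools.utilities.streaming import S3Object
from aws_lambda_powertools.utilities.streaming.transformations import CsvTransform


def lambda_handler(event, context):
    # Stream the object instead of loading it into memory or /tmp.
    # The "bucket" and "key" fields are assumed to be provided by the
    # distributed map's item reader or item selector.
    s3_object = S3Object(bucket=event["bucket"], key=event["key"])

    processed = 0
    # CsvTransform yields one record (as a dict) at a time.
    for record in s3_object.transform(CsvTransform()):
        processed += 1  # replace with real per-record processing

    return {"records_processed": processed}
```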
To see this in action, I’ll create a workflow that processes one thousand dog images stored on S3. The images are already available in the S3 bucket.
The workflow and the S3 bucket must exist within the same Region. To start, I will go to the Step Functions section of the AWS Management Console and click on “Create state machine.” On the next screen, I will select the option to design my workflow using the visual editor. The distributed map is compatible with Standard workflows, and I will retain the default settings. After navigating to the visual editor, I will locate the Map component in the left sidebar and drag it into the workflow area.
Next, I will configure the component on the right side. I will set the Processing mode to “Distributed” and specify Amazon S3 as the Item Source. The distributed map works seamlessly with S3; I will enter the bucket name (awsnewsblog-distributed-map) and the prefix (images) where my images are stored.
In the Runtime Settings section, I will select “Express” for the Child workflow type and may choose to limit the concurrency to ensure we stay within the quotas for downstream services (like Lambda in this case). By default, the output of my sub-workflows will be aggregated as state output, capped at 256 KB. However, I can opt to export map state results to Amazon S3 for processing larger outputs.
Finally, I will specify the action to take for each file. In this demo, I intend to invoke an existing Lambda function for each file in the S3 bucket. I will search for and select the Lambda invocation action from the left sidebar, dragging it over to the distributed map component. Then, I’ll configure it to invoke the designated Lambda function, AWSNewsBlogDistributedMap in this example.
Once I’ve completed these steps, I will proceed by selecting “Next” and then “Next” again on the Review generated code page (not shown here).
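The generated code is an Amazon States Language (JSON) definition. As a rough sketch of what it contains, the following Python snippet builds an equivalent definition and creates the state machine with boto3; the state names, MaxConcurrency value, and IAM role ARN are assumptions for this example, and in the console this definition is generated for you.

```python
import json

import boto3

definition = {
    "StartAt": "DistributedMapDemo",
    "States": {
        "DistributedMapDemo": {
            "Type": "Map",
            "Label": "DistributedMapDemo",
            "MaxConcurrency": 1000,  # keep within downstream service quotas
            # Read items directly from the S3 bucket and prefix.
            "ItemReader": {
                "Resource": "arn:aws:states:::s3:listObjectsV2",
                "Parameters": {
                    "Bucket": "awsnewsblog-distributed-map",
                    "Prefix": "images",
                },
            },
            # Each item runs as a child Express workflow that invokes Lambda.
            "ItemProcessor": {
                "ProcessorConfig": {
                    "Mode": "DISTRIBUTED",
                    "ExecutionType": "EXPRESS",
                },
                "StartAt": "InvokeLambda",
                "States": {
                    "InvokeLambda": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::lambda:invoke",
                        "Parameters": {
                            "FunctionName": "AWSNewsBlogDistributedMap",
                            "Payload.$": "$",
                        },
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="DistributedMapDemo",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsDistributedMapRole",  # assumption
)
```

Starting an execution of this state machine then launches one child workflow per object found under the configured prefix, up to the configured maximum concurrency.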