AWS Step Functions has enhanced its Distributed Map feature with support for the JSON Lines (JSONL) format. JSONL is a simple, text-based format that stores structured data as one JSON object per line, which makes it well suited to streaming and processing large datasets.
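For example, a small product-review dataset in JSONL consists of one self-contained JSON object per line (the field names here are illustrative):

```json
{"recordId": "rec-001", "productId": "B00X4WHP5E", "review": "Battery life is outstanding."}
{"recordId": "rec-002", "productId": "B00X4WHP5E", "review": "Stopped working after two weeks."}
```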
With this new functionality, users can directly process large collections of items stored in JSONL format through Distributed Map, and can export the results of a Distributed Map run as a JSONL file. The update also adds support for other delimited file formats, such as semicolon- and tab-delimited files, broadening the range of usable data sources. Finally, enhanced output transformations give developers more control over how results are formatted, making it easier to integrate with downstream processing.
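As a sketch of the delimited-file support, an ItemReader can declare the delimiter in its ReaderConfig. The field names and accepted values shown here (CSVDelimiter, SEMICOLON) and the bucket and key are illustrative and should be confirmed against the current Step Functions ItemReader documentation:

```json
"ItemReader": {
  "Resource": "arn:aws:states:::s3:getObject",
  "ReaderConfig": {
    "InputType": "CSV",
    "CSVHeaderLocation": "FIRST_ROW",
    "CSVDelimiter": "SEMICOLON"
  },
  "Parameters": {
    "Bucket": "my-bucket",
    "Key": "data/reviews.csv"
  }
}
```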
Overview
Distributed Map enables the parallel processing of large-scale data by executing the same processing steps across millions of entries in a dataset, with up to 10,000 parallel child workflow executions. This is particularly advantageous for applications like payroll processing, image conversion, document processing, and data migrations. Previously, datasets could originate from state input, JSON or CSV files stored in Amazon S3, or collections of S3 objects. With the latest feature, users can now also use JSONL files stored in Amazon S3 as their datasets.
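In Amazon States Language (ASL), a Distributed Map is a Map state whose ItemProcessor runs in DISTRIBUTED mode. A minimal skeleton, with placeholder state names and illustrative batching and concurrency settings, looks like this:

```json
"ProcessProductReviews": {
  "Type": "Map",
  "ItemProcessor": {
    "ProcessorConfig": {
      "Mode": "DISTRIBUTED",
      "ExecutionType": "STANDARD"
    },
    "StartAt": "BuildPrompts",
    "States": {
      "BuildPrompts": { "Type": "Pass", "End": true }
    }
  },
  "ItemBatcher": { "MaxItemsPerBatch": 25 },
  "MaxConcurrency": 1000,
  "End": true
}
```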
The AWS Step Functions Workflow
To illustrate this, consider an end-to-end Generative AI (GenAI) batch inference scenario utilizing Amazon Bedrock. Batch inference allows for efficient processing of numerous requests by consolidating them into a single request and saving the results in an S3 bucket. Both the input and output are managed as JSONL files, making this scenario a great example of the new capabilities within Distributed Map.
In this workflow, Distributed Map constructs and aggregates AI prompts for a dataset of product reviews, subsequently invoking the Amazon Bedrock batch inference API. Upon completion, Amazon Bedrock saves the results as a JSONL file in S3. An event from the S3 object creation triggers a second Step Functions workflow, which processes the JSONL file and uploads the results to an Amazon DynamoDB table.
Batch Inference Workflow
The batch inference input generation workflow utilizes Distributed Map to process product review data stored in S3. It launches multiple child workflows that generate AI prompts for each product review’s sentiment analysis, exporting the outcomes as a JSONL file to S3. Following the completion of the Distributed Map state, the workflow calls the Amazon Bedrock batch inference API (CreateModelInvocationJob) with the JSONL file as input. Given that the inference API operates asynchronously, the workflow concludes as soon as it receives a successful response from the API.
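A sketch of the Task state that starts the batch inference job follows, using the AWS SDK service integration for Amazon Bedrock. The job name, model ID, role ARN, and S3 URIs are placeholders, and the request fields mirror the CreateModelInvocationJob API; verify them against the current API reference:

```json
"StartBatchInference": {
  "Type": "Task",
  "Resource": "arn:aws:states:::aws-sdk:bedrock:createModelInvocationJob",
  "Parameters": {
    "JobName": "review-sentiment-batch",
    "ModelId": "anthropic.claude-3-haiku-20240307-v1:0",
    "RoleArn": "arn:aws:iam::111122223333:role/BedrockBatchRole",
    "InputDataConfig": {
      "S3InputDataConfig": { "S3Uri": "s3://my-bucket/prompts/input.jsonl" }
    },
    "OutputDataConfig": {
      "S3OutputDataConfig": { "S3Uri": "s3://my-bucket/inference-output/" }
    }
  },
  "End": true
}
```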
Each child workflow processes batches of product reviews as an array. It utilizes a Pass state to create an array of AI prompts, employing JSONata expressions to manipulate the input, generate unique record IDs, and output results in the expected format for Amazon Bedrock.
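The following Pass state is one way to express this with JSONata. It assumes the child workflow receives its batch under $states.input.Items (as produced by an ItemBatcher) and that the modelInput shape matches the target model; the prompt text and field names are illustrative:

```json
"BuildPrompts": {
  "Type": "Pass",
  "QueryLanguage": "JSONata",
  "Output": "{% $map($states.input.Items, function($v, $i) {{ 'recordId': 'rec-' & $string($i), 'modelInput': { 'prompt': 'Classify the sentiment of this product review: ' & $v.review } }}) %}",
  "End": true
}
```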
Using the New Output Transformations
Once all child workflows have completed, Distributed Map can now use the new output transformations to export results to S3. The updated writer configuration simplifies output processing, allowing for both JSON and JSONL formats. Previously, child workflow execution results were exported into three separate JSON files (successful, failed, and pending). With the new configuration, users can streamline outputs to include only the relevant child workflow execution results, which is particularly useful for map/reduce patterns.
The writer configuration also permits output array flattening. When child workflows handle input batches, they create arrays of results which, when aggregated by Distributed Map, form arrays of arrays. With the new FLATTEN transformation, users can easily flatten these arrays without additional coding.
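Putting both pieces together, a ResultWriter might combine JSONL output with the FLATTEN transformation as shown below. The WriterConfig field names reflect this launch and should be verified against the current documentation; the bucket and prefix are placeholders:

```json
"ResultWriter": {
  "Resource": "arn:aws:states:::s3:putObject",
  "WriterConfig": {
    "Transformation": "FLATTEN",
    "OutputType": "JSONL"
  },
  "Parameters": {
    "Bucket": "my-bucket",
    "Prefix": "map-output/"
  }
}
```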
Introducing the New ItemReader for JSONL
In the batch inference output processing workflow, multiple child workflows are launched via Distributed Map to handle the outputs of the batch inference job. Each child workflow processes item batches, identifying errors and separating successful inferences from failures. Successful inferences are then loaded into a DynamoDB table, while errors are sent to a dead letter queue for further review.
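One possible shape for that routing logic inside the child workflow is a Choice state that branches on the presence of an error field, with placeholder table name, queue URL, and result paths:

```json
"CheckInference": {
  "Type": "Choice",
  "Choices": [
    { "Variable": "$.error", "IsPresent": true, "Next": "SendToDlq" }
  ],
  "Default": "SaveResult"
},
"SaveResult": {
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "Parameters": {
    "TableName": "ReviewSentiment",
    "Item": {
      "recordId": { "S.$": "$.recordId" },
      "sentiment": { "S.$": "$.modelOutput.sentiment" }
    }
  },
  "End": true
},
"SendToDlq": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage",
  "Parameters": {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/111122223333/inference-dlq",
    "MessageBody.$": "States.JsonToString($)"
  },
  "End": true
}
```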
The Distributed Map in this workflow uses the newly introduced ItemReader InputType, JSONL. Previously, the InputType accepted only CSV, JSON, and MANIFEST file formats.
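Selecting the new input type is a one-line change in the ReaderConfig; the bucket and key below are placeholders:

```json
"ItemReader": {
  "Resource": "arn:aws:states:::s3:getObject",
  "ReaderConfig": {
    "InputType": "JSONL"
  },
  "Parameters": {
    "Bucket": "my-bucket",
    "Key": "inference-output/results.jsonl"
  }
}
```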
When Amazon Bedrock completes the batch inference job, it saves the output in the specified S3 location. An EventBridge rule then triggers the batch inference results processing workflow using S3 event notifications. This rule monitors for “Object Created” events from the designated S3 bucket, specifically for objects with the .jsonl extension. Upon detecting a matching event, it starts the workflow.
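The corresponding EventBridge event pattern might look like the following, with a placeholder bucket name:

```json
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": { "name": ["my-inference-output-bucket"] },
    "object": { "key": [{ "suffix": ".jsonl" }] }
  }
}
```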
To monitor failed batch inference jobs, EventBridge rules can be established to listen for Amazon Bedrock status events. Since failed jobs do not generate output files in S3, monitoring status events directly ensures timely detection and handling of job failures.
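A sketch of such a rule's event pattern is shown below; the detail-type and status values are assumptions and should be confirmed against the Amazon Bedrock EventBridge documentation:

```json
{
  "source": ["aws.bedrock"],
  "detail-type": ["Batch Inference Job State Change"],
  "detail": {
    "status": ["Failed", "Stopped"]
  }
}
```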
Key Considerations
The new output transformations still produce the FAILED execution results file, so you can analyze the causes of failures. For more details about output transformation configurations, refer to the Step Functions documentation.