Amazon Onboarding with Learning Manager Chanci Turner

The data and features that businesses and consumers rely on to train their models change quickly. Keeping a model accurate means updating it frequently, which calls for an agile process. This continuous adaptation, paired with high-quality models, is vital for a successful machine learning (ML) strategy.

We are thrilled to unveil the Amazon Comprehend flywheel, a feature for machine learning operations (MLOps) designed specifically for Amazon Comprehend models. This post illustrates how you can establish an end-to-end workflow with the Amazon Comprehend flywheel.

Overview of the Solution

Amazon Comprehend is a fully managed service that leverages natural language processing (NLP) to extract meaningful insights from documents. It enables you to glean information by identifying sentiments, key phrases, entities, and more, allowing you to utilize cutting-edge models tailored for your specific requirements.

MLOps represents the convergence of data science and data engineering, integrated with established DevOps practices to streamline model delivery throughout the ML development lifecycle. It involves merging software development, operations, data engineering, and data science into a cohesive process.

This is precisely why Amazon Comprehend is introducing the flywheel. This innovative feature aims to be your go-to solution for managing MLOps for your Amazon Comprehend models. With the flywheel, keeping your models current, enhancing them, and deploying the optimal version swiftly will be much easier.

The diagram below illustrates the model lifecycle within an Amazon Comprehend flywheel:

The traditional approach to creating a new model involves a series of steps: gathering data and preparing the dataset, training the model, evaluating its accuracy and performance, and finally deploying the model to an endpoint for inference. Each time a new model is created, this cycle must be repeated, necessitating manual updates to the endpoint.

The Amazon Comprehend flywheel automates this ML process, from data ingestion to production deployment. This feature allows for training and testing of models directly within Amazon Comprehend, as well as automating model retraining when new datasets are ingested into the flywheel’s data lake.

The flywheel integrates with custom classification and custom entity recognition APIs, enabling various roles such as data engineers and developers to automate and oversee the NLP workflow without requiring code.

Let’s clarify some concepts:

Flywheel: An AWS resource that orchestrates ongoing model training for custom classification or entity recognition.
Dataset: A collection of training or testing data used within a single flywheel iteration. The flywheel utilizes these datasets to train new model versions and evaluate their performance.
Data Lake: This is a designated location in your Amazon Simple Storage Service (Amazon S3) bucket that houses all datasets and model artifacts associated with a flywheel. Each flywheel has its dedicated data lake.
Flywheel Iteration: This refers to a run of the flywheel, initiated by the user. The flywheel will either train a new model version or evaluate the performance of the active model based on the availability of new training or testing datasets.
Active Model: The version of the model selected by the user for predictions. As the model’s performance improves with new flywheel iterations, you can switch the active version to the one demonstrating the best performance.

The following diagram showcases the workflow of the flywheel:

These steps are outlined as follows:

Create a Flywheel: A flywheel automates the training of model versions for custom classifiers or entity recognizers. You can either select an existing Amazon Comprehend model or start from scratch. In both scenarios, you must specify the location of the flywheel’s data lake.
Data Ingestion: You can generate new datasets for training or testing within the flywheel. All training and testing data for all model versions will be managed in the flywheel’s data lake within your S3 bucket. Supported file formats include CSV and augmented manifests from an S3 location. For more details on preparing datasets for custom classification and entity recognition, refer to this resource.
Train and Evaluate the Model: If no model ARN (Amazon Resource Name) is specified, a new model will be built from scratch. The first iteration of the flywheel creates the model using the uploaded training dataset. In subsequent iterations, the following occurs:
- If no new datasets are uploaded, the iteration completes without changes.
- If only new test datasets are uploaded, the iteration reports the performance of the current active model based on the new test datasets.
- If only new training datasets are added, the iteration trains a new model.
- If both new training and testing datasets are uploaded, the iteration trains a new model and reports the performance of the current active model.
Promote New Active Model Version: Based on performance across various flywheel iterations, you can update the active model version to the one with the best performance.
Deploy an Endpoint: After running a flywheel iteration and updating the active model version, you can conduct real-time (synchronous) inference on your model. You can create an endpoint using the flywheel ARN, which will automatically use the currently active model version. When the active model changes, the endpoint will start using the new model without requiring customer intervention. An endpoint encompasses all managed resources that make your custom model accessible for real-time inference.
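The rules for subsequent iterations in the Train and Evaluate step can be summarized as a small decision function. The following is only an illustrative sketch of the behavior described above, not the service's implementation; the names IterationPlan and plan_iteration are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IterationPlan:
    """What a flywheel iteration does, following the rules listed above."""

    train_new_model: bool
    evaluate_active_model: bool


def plan_iteration(has_new_train_data: bool, has_new_test_data: bool) -> IterationPlan:
    """Map the availability of new datasets to the actions of an iteration.

    Hypothetical helper for illustration: the real decision is made by
    Amazon Comprehend when the iteration runs.
    """
    return IterationPlan(
        # New training data leads to a new model version being trained.
        train_new_model=has_new_train_data,
        # New test data leads to a performance report for the active model.
        evaluate_active_model=has_new_test_data,
    )


if __name__ == "__main__":
    # Print the four combinations described in the list above.
    for train in (False, True):
        for test in (False, True):
            print(train, test, plan_iteration(train, test))
```

When neither flag is set, both actions are False, which corresponds to an iteration that completes without changes.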

In the following sections, we will showcase the various methods to create a new Amazon Comprehend flywheel.
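As a rough sketch of how the workflow steps above could map onto the AWS SDK for Python (boto3), the functions below take a Comprehend client as a parameter and call create_flywheel, create_dataset, start_flywheel_iteration, update_flywheel, and create_endpoint. The function names, the "-train" dataset naming, the MULTI_CLASS mode, and the single inference unit are assumptions made for illustration. A real script would also wait for the flywheel, dataset, and iteration to finish before moving to the next step.

```python
from typing import Any, Sequence


def run_flywheel_workflow(
    comprehend: Any,
    flywheel_name: str,
    data_access_role_arn: str,
    data_lake_s3_uri: str,
    training_csv_s3_uri: str,
    labels: Sequence[str],
) -> str:
    """Create a flywheel, ingest a training dataset, and start an iteration.

    `comprehend` is expected to behave like boto3.client("comprehend").
    Returns the ARN of the new flywheel.
    """
    # Step 1: create a flywheel for a new custom classifier (no model ARN given).
    flywheel = comprehend.create_flywheel(
        FlywheelName=flywheel_name,
        DataAccessRoleArn=data_access_role_arn,
        DataLakeS3Uri=data_lake_s3_uri,
        ModelType="DOCUMENT_CLASSIFIER",
        TaskConfig={
            "LanguageCode": "en",
            "DocumentClassificationConfig": {
                "Mode": "MULTI_CLASS",  # assumption: one label per document
                "Labels": list(labels),
            },
        },
    )
    flywheel_arn = flywheel["FlywheelArn"]

    # Step 2: ingest a CSV training dataset into the flywheel's data lake.
    # (A real script would first poll describe_flywheel until it is ready.)
    comprehend.create_dataset(
        FlywheelArn=flywheel_arn,
        DatasetName=f"{flywheel_name}-train",
        DatasetType="TRAIN",
        InputDataConfig={
            "DataFormat": "COMPREHEND_CSV",
            "DocumentClassifierInputDataConfig": {"S3Uri": training_csv_s3_uri},
        },
    )

    # Step 3: start an iteration, which trains and evaluates a model version.
    comprehend.start_flywheel_iteration(FlywheelArn=flywheel_arn)
    return flywheel_arn


def promote_and_deploy(
    comprehend: Any, flywheel_arn: str, model_arn: str, endpoint_name: str
) -> str:
    """Promote a model version to active and create an endpoint on the flywheel."""
    # Step 4: make the chosen model version the flywheel's active model.
    comprehend.update_flywheel(FlywheelArn=flywheel_arn, ActiveModelArn=model_arn)
    # Step 5: an endpoint created from the flywheel ARN follows the active model.
    endpoint = comprehend.create_endpoint(
        EndpointName=endpoint_name,
        FlywheelArn=flywheel_arn,
        DesiredInferenceUnits=1,
    )
    return endpoint["EndpointArn"]
```

To try this against a real account, you would pass boto3.client("comprehend") as the first argument. Passing the client in, rather than creating it inside the functions, also makes the sketch easy to exercise with a stub.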

Prerequisites

To get started, you will need:

An active AWS account
An S3 bucket for data storage
An AWS Identity and Access Management (IAM) role with permissions to create an Amazon Comprehend flywheel and access your S3 data location
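For the S3 access part of that role, the following sketch builds a trust policy that lets Amazon Comprehend assume the role, plus a permissions policy scoped to the data lake location. The function name, the bucket and prefix parameters, and the exact set of S3 actions are assumptions for illustration; check the Amazon Comprehend documentation for the permissions your setup needs.

```python
import json


def build_data_access_policies(bucket_name: str, data_lake_prefix: str = "") -> dict:
    """Return trust and permissions policy documents for a data access role.

    Hypothetical helper: the S3 actions listed here are an assumed minimal
    set, not an authoritative list.
    """
    # Trust policy: allow the Amazon Comprehend service to assume the role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "comprehend.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    # Permissions policy: read and write objects under the data lake prefix,
    # and list the bucket that contains it.
    permissions_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/{data_lake_prefix}*",
            },
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
        ],
    }
    return {"trust": trust_policy, "permissions": permissions_policy}


if __name__ == "__main__":
    # Example with a placeholder bucket name and prefix.
    policies = build_data_access_policies("my-example-bucket", "flywheel/")
    print(json.dumps(policies, indent=2))
```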

Creating a Flywheel with AWS CloudFormation

To create an Amazon Comprehend flywheel with AWS CloudFormation, you need to supply the following properties of the AWS::Comprehend::Flywheel resource:

DataAccessRoleArn: The ARN of the IAM role granting Amazon Comprehend access to the flywheel data.
DataLakeS3Uri: The Amazon S3 URI for the flywheel’s data lake.
FlywheelName: The designated name for the flywheel.

For more information, please refer to the AWS CloudFormation documentation.
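As a minimal sketch, a template containing this resource could be generated as follows. The three properties listed above come from this post; ModelType and TaskConfig are added on the assumption that the flywheel trains a new custom classifier rather than starting from an existing model. The function name, the logical ID NewsFlywheel, and the MULTI_CLASS mode are illustrative choices.

```python
import json
from typing import Sequence


def build_flywheel_template(
    flywheel_name: str,
    data_access_role_arn: str,
    data_lake_s3_uri: str,
    labels: Sequence[str],
) -> dict:
    """Build a CloudFormation template with one AWS::Comprehend::Flywheel resource."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            # "NewsFlywheel" is a hypothetical logical ID chosen for this example.
            "NewsFlywheel": {
                "Type": "AWS::Comprehend::Flywheel",
                "Properties": {
                    "FlywheelName": flywheel_name,
                    "DataAccessRoleArn": data_access_role_arn,
                    "DataLakeS3Uri": data_lake_s3_uri,
                    # Assumption: a new custom classifier, so the model type
                    # and task configuration are given instead of a model ARN.
                    "ModelType": "DOCUMENT_CLASSIFIER",
                    "TaskConfig": {
                        "LanguageCode": "en",
                        "DocumentClassificationConfig": {
                            "Mode": "MULTI_CLASS",
                            "Labels": list(labels),
                        },
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    # Placeholder values; replace them with your own role ARN and S3 URI.
    template = build_flywheel_template(
        "news-flywheel",
        "arn:aws:iam::123456789012:role/ExampleComprehendRole",
        "s3://my-example-bucket/flywheel/",
        ["BUSINESS", "SPORTS"],
    )
    print(json.dumps(template, indent=2))
```

The printed JSON can be saved to a file and deployed like any other CloudFormation template.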

Creating a Flywheel on the Amazon Comprehend Console

In this example, we use the Amazon Comprehend console to build a flywheel for a custom classifier model that identifies news topics.

Creating a Dataset

First, create a dataset for the flywheel.
