Amazon SageMaker provides a robust framework for scalable training and hyperparameter optimization (HPO). Some applications, however, call for custom machine learning (ML) solutions that support retraining and HPO, for example when a specific HPO library or feature is required. This post walks through building a custom deep learning web application on AWS from the ground up using the Bring Your Own Container (BYOC) model. We show how to build a web application that lets non-technical users run deep learning workflows, such as HPO and retraining, from a user interface (UI). The example can be adapted to any regression or classification task.
Overview of the Solution
The creation of a custom deep learning web application involves two primary steps:
- ML Component: This step emphasizes the process of dockerizing a deep learning solution.
- Full-Stack Application: This entails utilizing the ML component within a complete application.
First, we build a custom Docker image and register it in Amazon Elastic Container Registry (Amazon ECR). Amazon SageMaker uses this image to run Bayesian HPO, training/retraining, and inference. Appendix A walks through dockerizing the deep learning code.
In the second step, we deploy a full-stack application with the AWS Serverless Application Model (AWS SAM). AWS Step Functions and AWS Lambda orchestrate the stages of the ML pipeline. The frontend is hosted in Amazon Simple Storage Service (Amazon S3) and served through Amazon CloudFront, and AWS Amplify with Amazon Cognito handles authentication. The following diagram shows the solution architecture.
Once the application is deployed, users authenticate through Amazon Cognito and can start training or HPO jobs directly from the UI (Step 2 in the diagram). Requests go through Amazon API Gateway to Step Functions, which orchestrates the training or HPO job (Step 3). When the job is complete, users can submit a set of parameters through the UI to API Gateway and Lambda to receive inference results (Step 4).
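To make Step 4 concrete, the following is a minimal sketch of how the UI might assemble the JSON body for an inference request before posting it to API Gateway. The field names and request shape are assumptions for illustration, not the application's actual schema.

```python
import json

def build_inference_request(parameters):
    """Assemble the JSON body for a hypothetical POST /inference call.

    `parameters` is a dict of user-entered feature values from the UI.
    """
    return json.dumps({"data": [parameters]})

body = build_inference_request({"feature_1": 0.42, "feature_2": 7})
print(body)  # {"data": [{"feature_1": 0.42, "feature_2": 7}]}
```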
Application Deployment
For guidance on deploying the application, please refer to the README file in the GitHub repository. The application comprises four primary components:
- Machine Learning: This section includes SageMaker notebooks and scripts for constructing an ML Docker image (for HPO and training), as elaborated in Appendix A.
- Shared Infrastructure: This component encompasses AWS resources utilized by both the backend and frontend, managed via AWS CloudFormation.
- Backend: This part contains the backend code, including APIs and a step function for model retraining, HPO execution, and an Amazon DynamoDB database.
- Frontend: This segment holds the UI code and the necessary infrastructure for hosting it.
Creating Steps for HPO and Training in Step Functions
Training a model for inference using Step Functions entails multiple stages:
- Create a training job.
- Create a model.
- Create an endpoint configuration.
- Optionally, delete the old endpoint.
- Create a new endpoint.
- Wait until the new endpoint is deployed.
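The stages above map onto Step Functions service integrations for SageMaker in the state machine definition. The following condensed sketch shows the shape of that flow; the state names are illustrative and the non-integration details (parameters, the endpoint deletion branch, and the wait state) are omitted:

```json
{
  "StartAt": "Create Training Job",
  "States": {
    "Create Training Job": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
      "Next": "Create Model"
    },
    "Create Model": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createModel",
      "Next": "Create Endpoint Config"
    },
    "Create Endpoint Config": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createEndpointConfig",
      "Next": "Deploy Endpoint"
    },
    "Deploy Endpoint": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createEndpoint",
      "End": true
    }
  }
}
```

The `.sync` suffix on the training job integration makes Step Functions wait for the training job to finish before moving to the next state.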
Running HPO is simpler: it only requires creating an HPO job and logging the results to Amazon CloudWatch Logs. We orchestrate both model training and HPO with Step Functions, defining the steps as a state machine in an Amazon States Language (ASL) definition. The following figure shows this state machine graphically.
First, a Choice state decides whether to run in HPO or training mode, using the following code:
"Mode Choice": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.Mode",
"StringEquals": "HPO",
"Next": "HPOFlow"
}
],
"Default": "TrainingModelFlow"
},
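The mode is taken from the execution input. An input that selects the HPO branch might look like the following; any field beyond Mode is illustrative:

```json
{
  "Mode": "HPO",
  "TrainingJobName": "example-hpo-job-1"
}
```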
Several states are named Create a … Record and Update Status to …. These states create or update records in DynamoDB tables, which the API queries to report job status and the ARNs of generated resources (such as the endpoint ARN used for inference).
Each record uses the Step Functions execution ID as its key and has a status field. As the job progresses, the status moves from TRAINING_MODEL to READY. The state machine also records key outputs such as the S3 model output, model ARN, endpoint configuration ARN, and endpoint ARN.
For example, the following state runs just before endpoint deployment and updates the endpointConfigArn field in the record:
"Update Status to DEPLOYING_ENDPOINT": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:updateItem",
"Parameters": {
"TableName": "${ModelTable}",
"Key": {
"trainingId": {
"S.$": "$$.Execution.Id"
},
"created": {
"S.$": "$$.Execution.StartTime"
}
},
"UpdateExpression": "SET #st = :ns, #eca = :cf",
"ExpressionAttributeNames": {
"#st" : "status",
"#eca" : "endpointConfigArn"
},
"ExpressionAttributeValues": {
":ns" : {
"S": "DEPLOYING_ENDPOINT"
},
":cf" : {
"S.$": "$.EndpointConfigArn"
}
}
},
"ResultPath": "$.taskresult",
"Next": "Deploy"
}
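When the API reads these records back, DynamoDB returns items in the typed attribute format seen above (each value wrapped in a type tag like "S" or "N"). The following is a minimal sketch of flattening such an item into plain JSON for the frontend; it is a hand-rolled stand-in for boto3's TypeDeserializer, and the item contents are illustrative:

```python
def flatten_item(item):
    """Convert a DynamoDB-typed item, e.g. {"status": {"S": "READY"}},
    into a plain dict, e.g. {"status": "READY"}.

    Only handles the attribute types this table uses: strings ("S")
    and numbers ("N").
    """
    out = {}
    for name, typed_value in item.items():
        # Each typed value is a single-entry dict: {type_tag: raw_value}
        (attr_type, value), = typed_value.items()
        if attr_type == "N":
            out[name] = float(value)
        else:  # "S" and anything else: keep as string
            out[name] = value
    return out

record = flatten_item({
    "trainingId": {"S": "arn:aws:states:us-east-1:123456789012:execution:example"},
    "status": {"S": "DEPLOYING_ENDPOINT"},
})
print(record["status"])  # DEPLOYING_ENDPOINT
```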
The following screenshot shows the contents of the DynamoDB table.
In the screenshot, the latest job is still in progress: it has finished training and created an endpoint configuration, but has not yet deployed the endpoint, so the record has no endpointArn.
Another important state is Delete Old Endpoint. Every deployed endpoint keeps an Amazon Elastic Compute Cloud (Amazon EC2) instance running continuously, so as more models are trained and more endpoints are created, inference costs grow linearly with the number of models. To limit costs, this state deletes the oldest endpoint once the number of endpoints exceeds a configured maximum. The default is 5, which can be changed in the CloudFormation template parameters for the backend. You can set this to any value, but SageMaker imposes a soft limit on the number of endpoints allowed at the same time, with additional limits per instance type.
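The decision made by that state can be sketched as follows: given endpoint summaries (shaped like a SageMaker ListEndpoints response) and a maximum count, pick the oldest endpoints to delete. The helper function itself is hypothetical, written for illustration:

```python
from datetime import datetime

def endpoints_to_delete(endpoints, max_endpoints=5):
    """Return the names of the oldest endpoints in excess of max_endpoints.

    Each endpoint summary is a dict with "EndpointName" and "CreationTime"
    keys, mirroring the SageMaker ListEndpoints response shape.
    """
    by_age = sorted(endpoints, key=lambda e: e["CreationTime"])
    excess = len(by_age) - max_endpoints
    return [e["EndpointName"] for e in by_age[:excess]] if excess > 0 else []

eps = [
    {"EndpointName": "ep-a", "CreationTime": datetime(2021, 1, 1)},
    {"EndpointName": "ep-b", "CreationTime": datetime(2021, 3, 1)},
    {"EndpointName": "ep-c", "CreationTime": datetime(2021, 2, 1)},
]
print(endpoints_to_delete(eps, max_endpoints=2))  # ['ep-a']
```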
Finally, there are states for updating the status to ERROR (one for HPO and one for model training). These states are reached through the Catch field when any part of a step fails. They update the DynamoDB record with error and errorCause fields taken from Step Functions (see the following screenshot).
Although this data can be retrieved via Step Functions APIs, it is stored within DynamoDB records for convenient access from the frontend.
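In the state machine definition, this error handling is attached to each task through a Catch field. A condensed sketch, with illustrative state names:

```json
"Create Training Job": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "Update Status to ERROR"
    }
  ],
  "Next": "Create Model Record"
}
```

States.ALL matches any error, and ResultPath carries the error and cause into the input of the error-handling state, which writes them to DynamoDB.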
Automate State Machine Creation with AWS CloudFormation
The state machine definition can be reused to recreate this state machine in any account. The definition includes several variables, such as the names of the DynamoDB tables that track job status and the Lambda functions invoked by states. Because the ARNs of these resources differ per deployment, AWS SAM injects them as substitutions. The state machine resource can be found here.
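A condensed sketch of how such a resource might look in the SAM template; the logical IDs, file path, and policy are illustrative, not the repository's exact template:

```yaml
TrainingStateMachine:
  Type: AWS::Serverless::StateMachine
  Properties:
    DefinitionUri: statemachine/training.asl.json
    DefinitionSubstitutions:
      # Replaces ${ModelTable} in the ASL definition at deploy time
      ModelTable: !Ref ModelTable
    Policies:
      - DynamoDBWritePolicy:
          TableName: !Ref ModelTable
```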