Introducing NVIDIA GPU Support for Bottlerocket on Amazon ECS

Last year, we celebrated the launch of the Amazon Elastic Container Service (Amazon ECS)-optimized Bottlerocket AMI. Bottlerocket is an open-source initiative aimed at enhancing security and maintainability, providing a stable and uniform Linux distribution for containerized workloads. Today, we are excited to announce that you can now deploy ECS workloads accelerated by NVIDIA GPUs using Bottlerocket.

In this article, we will guide you through the process of setting up an Amazon ECS task to execute a workload utilizing NVIDIA GPUs on Bottlerocket.

Why Choose Bottlerocket?

As container adoption continues to grow among customers, AWS recognized the demand for a Linux distribution designed specifically to run containerized applications. Bottlerocket OS was developed to provide a secure foundation for container hosts while reducing the operational effort of managing them at scale. It applies updates as complete images in a single atomic step, with the ability to roll back, so updates can be automated reliably.

For more information on getting started with Bottlerocket and Amazon ECS, check out our blog post, Getting Started with Bottlerocket and Amazon ECS.

Setting Up an ECS Cluster with Bottlerocket and NVIDIA GPUs

Let’s dive into the practical steps involved, using the us-west-2 (Oregon) Region.

Prerequisites:

  • The AWS CLI with the necessary credentials
  • A default VPC in your chosen region (or you can utilize an existing VPC in your account)

First, we will create the ECS cluster named ecs-bottlerocket.

aws ecs --region us-west-2 create-cluster --cluster-name ecs-bottlerocket
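Before moving on, you can confirm that the cluster exists. The following is a minimal sketch; `cluster_status` is a hypothetical helper name, and a newly created cluster should report `ACTIVE`.

```shell
# Hypothetical helper: print the status of an ECS cluster in us-west-2.
# Usage: cluster_status ecs-bottlerocket
cluster_status() {
  aws ecs describe-clusters \
    --region us-west-2 \
    --clusters "$1" \
    --query 'clusters[0].status' \
    --output text
}
```

Running `cluster_status ecs-bottlerocket` after the `create-cluster` call should print the cluster state on a single line.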

The instance we are launching needs an AWS Identity and Access Management (IAM) role so that it can call both the ECS APIs and the AWS Systems Manager Session Manager APIs. I have created an IAM role called ecsInstanceRole, with an instance profile of the same name, that includes both the AmazonSSMManagedInstanceCore and the AmazonEC2ContainerServiceforEC2Role managed policies.
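If you do not have this role yet, the sketch below shows one way to create it with the AWS CLI. It assumes the role and instance profile do not already exist and that your credentials are allowed to make IAM changes. The helper name `create_ecs_instance_role` and the file name `ec2-trust-policy.json` are hypothetical choices for this example.

```shell
# Trust policy that allows EC2 instances to assume the role.
# The file name ec2-trust-policy.json is an arbitrary choice for this sketch.
cat > ./ec2-trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Hypothetical helper: create the role, attach the two managed policies,
# and wrap the role in an instance profile of the same name so that
# run-instances can reference it with --iam-instance-profile Name=ecsInstanceRole.
create_ecs_instance_role() {
  aws iam create-role \
    --role-name ecsInstanceRole \
    --assume-role-policy-document file://ec2-trust-policy.json
  aws iam attach-role-policy \
    --role-name ecsInstanceRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  aws iam attach-role-policy \
    --role-name ecsInstanceRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
  aws iam create-instance-profile \
    --instance-profile-name ecsInstanceRole
  aws iam add-role-to-instance-profile \
    --instance-profile-name ecsInstanceRole \
    --role-name ecsInstanceRole
}
```

IAM is a global service, so these commands take no `--region` flag. A newly created instance profile can take a short time to become usable by `run-instances`.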

The IDs of the Bottlerocket Amazon Machine Images (AMIs) that support NVIDIA GPUs are published in the AWS Systems Manager Parameter Store. Let’s retrieve the AMI ID for the latest Bottlerocket release (available for both x86_64 and aarch64 architectures). In this blog post, we will use the x86_64 AMI.

latest_bottlerocket_ami=$(aws ssm get-parameter --region us-west-2 \
--name "/aws/service/bottlerocket/aws-ecs-1-nvidia/x86_64/latest/image_id" \
--query Parameter.Value --output text)
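The parameter name encodes the Bottlerocket variant (`aws-ecs-1-nvidia`) and the CPU architecture (`x86_64`). The sketch below builds that name from its parts. It assumes that other architectures follow the same path layout with a different architecture segment, so check the exact string in Parameter Store before relying on it. `bottlerocket_ami_parameter` is a hypothetical helper.

```shell
# Hypothetical helper: build the Parameter Store name for a Bottlerocket
# variant and architecture, following the path used above.
# Usage: bottlerocket_ami_parameter aws-ecs-1-nvidia x86_64
bottlerocket_ami_parameter() {
  echo "/aws/service/bottlerocket/$1/$2/latest/image_id"
}

param_name=$(bottlerocket_ami_parameter aws-ecs-1-nvidia x86_64)
echo "$param_name"
```

You can then pass the result to the lookup as `--name "$param_name"`.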

Next, we will list the subnets in the default VPC (or the VPC you chose) that are configured to assign a public IP address on launch.

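# Look up the default VPC; set vpc_id to your own VPC ID if you use an existing one.
vpc_id=$(aws ec2 describe-vpcs --region us-west-2 \
--filters Name=is-default,Values=true --query 'Vpcs[0].VpcId' --output text)

aws ec2 describe-subnets \
--region us-west-2 \
--filters Name=vpc-id,Values=$vpc_id \
--query 'Subnets[?MapPublicIpOnLaunch == `true`].SubnetId'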
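Instead of copying a subnet ID by hand, you can capture the first matching subnet in a variable. This is a sketch that reuses the JMESPath filter from the subnet listing; `first_public_subnet` is a hypothetical helper, and it prints `None` when no subnet in the VPC assigns public IP addresses.

```shell
# Hypothetical helper: print the first subnet in the given VPC that assigns
# public IP addresses on launch, using the same JMESPath filter as above.
# Usage: subnet_id=$(first_public_subnet "$vpc_id")
first_public_subnet() {
  aws ec2 describe-subnets \
    --region us-west-2 \
    --filters "Name=vpc-id,Values=$1" \
    --query 'Subnets[?MapPublicIpOnLaunch == `true`].SubnetId | [0]' \
    --output text
}
```

The resulting `$subnet_id` can be passed to `run-instances` with `--subnet-id "$subnet_id"`.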

To register our EC2 instance with the ECS cluster, we pass Bottlerocket settings as user data when launching the instance. We save these settings in a file named userdata.toml.

cat > ./userdata.toml << 'EOF'
[settings.ecs]
cluster = "ecs-bottlerocket"
EOF

Now, let’s launch a Bottlerocket instance in one of the public subnets listed above; replace subnet-bc8993e6 in the command below with a subnet ID from your own output. We use a public subnet in this blog post for easier debugging and connectivity. The instance type is p3.2xlarge, which has one NVIDIA Tesla V100 GPU.

aws ec2 run-instances \
--subnet-id subnet-bc8993e6 \
--image-id $latest_bottlerocket_ami \
--instance-type p3.2xlarge \
--region us-west-2 \
--tag-specifications 'ResourceType=instance,Tags=[{Key=bottlerocket,Value=quickstart}]' \
--user-data file://userdata.toml \
--iam-instance-profile Name=ecsInstanceRole
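ECS can only place the task after the instance has booted and its ECS agent has registered with the cluster, which can take a few minutes. The sketch below polls the cluster until a container instance appears, giving up after about ten minutes. `wait_for_container_instance` is a hypothetical helper, and the 15-second interval and 40-attempt limit are arbitrary choices.

```shell
# Hypothetical helper: wait until at least one container instance has
# registered with the ecs-bottlerocket cluster, checking every 15 seconds
# for up to 40 attempts.
wait_for_container_instance() {
  attempts=0
  while [ "$attempts" -lt 40 ]; do
    # With --output text, an empty list yields the string "None".
    arn=$(aws ecs list-container-instances \
      --region us-west-2 \
      --cluster ecs-bottlerocket \
      --query 'containerInstanceArns[0]' \
      --output text)
    if [ -n "$arn" ] && [ "$arn" != "None" ]; then
      echo "Registered: $arn"
      return 0
    fi
    echo "Waiting for the instance to join ecs-bottlerocket..."
    attempts=$((attempts + 1))
    sleep 15
  done
  echo "Timed out waiting for a container instance" >&2
  return 1
}
```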

Next, we’ll create the task definition for our sample application.

cat > ./sample-gpu.json << 'EOF'
{
  "containerDefinitions": [
    {
      "memory": 80,
      "essential": true,
      "name": "gpu",
      "image": "nvidia/cuda:11.0-base",
      "resourceRequirements": [
         {
           "type":"GPU",
           "value": "1"
         }
      ],
      "command": [
        "sh",
        "-c",
        "nvidia-smi"
      ],
      "cpu": 100,
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
           "awslogs-group": "/ecs/bottlerocket",
           "awslogs-region": "us-west-2",
           "awslogs-stream-prefix": "demo-gpu"
           }
      }
    }
  ],
  "family": "bottlerocket-gpu"
}
EOF

In the task definition, the resourceRequirements parameter reserves one NVIDIA GPU for the container, and the awslogs log driver sends the container output to the /ecs/bottlerocket log group in Amazon CloudWatch Logs.

Create the required CloudWatch log group as defined in the task configuration.

aws logs create-log-group --log-group-name '/ecs/bottlerocket' --region us-west-2

Next, register the task definition with ECS.

aws ecs register-task-definition \
--region us-west-2 \
--cli-input-json file://sample-gpu.json

Run the task.

aws ecs run-task --cluster ecs-bottlerocket \
--region us-west-2 \
--task-definition bottlerocket-gpu:1

The task runs nvidia-smi inside the container, which prints the GPU configuration, and then exits. You can view the stopped task in the ECS console: choose the task ID, then open the Logs tab to see the output.
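To check the result without the console, you can start the task, wait for it to stop, and read the container's exit code. This is a sketch built on `run-task`, `wait tasks-stopped`, and `describe-tasks`; `run_gpu_task` is a hypothetical helper, and it assumes the task definition revision you pass already exists.

```shell
# Hypothetical helper: start one copy of a task definition, wait for it to
# stop, then print the exit code of its first container (0 means nvidia-smi
# ran successfully).
# Usage: run_gpu_task bottlerocket-gpu:1
run_gpu_task() {
  task_arn=$(aws ecs run-task \
    --region us-west-2 \
    --cluster ecs-bottlerocket \
    --task-definition "$1" \
    --query 'tasks[0].taskArn' \
    --output text)
  # If ECS could not place the task, tasks[] is empty and text output is "None".
  if [ -z "$task_arn" ] || [ "$task_arn" = "None" ]; then
    echo "The task could not be placed" >&2
    return 1
  fi
  echo "Started: $task_arn"
  aws ecs wait tasks-stopped \
    --region us-west-2 \
    --cluster ecs-bottlerocket \
    --tasks "$task_arn"
  aws ecs describe-tasks \
    --region us-west-2 \
    --cluster ecs-bottlerocket \
    --tasks "$task_arn" \
    --query 'tasks[0].containers[0].exitCode' \
    --output text
}
```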

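You can also retrieve the log output directly from the command line by specifying the log group name, the log stream name, and a time window. With the awslogs log driver, the log stream name has the form awslogs-stream-prefix/container-name/task-ID. For example: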

aws logs tail '/ecs/bottlerocket' \
--region us-west-2 \
--log-stream-names 'demo-gpu/gpu/7af782059c644872977da89a06023483' \
--since 1h --format short
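Because the stream name ends with the task ID, and the task ID is the last path segment of the task ARN, you can derive the stream name from the ARN that `run-task` returns. In the sketch below, the ARN is an illustrative value assembled from the task ID shown above, and the account ID 123456789012 is a placeholder. The `${task_arn##*/}` expansion works for both the long ARN format (which includes the cluster name) and the older short format.

```shell
# Illustrative task ARN in the long format (cluster name included).
# The account ID 123456789012 is a placeholder.
task_arn="arn:aws:ecs:us-west-2:123456789012:task/ecs-bottlerocket/7af782059c644872977da89a06023483"

# Strip everything up to the last "/" to get the task ID.
task_id="${task_arn##*/}"

# awslogs-stream-prefix "demo-gpu", container name "gpu", then the task ID.
log_stream="demo-gpu/gpu/${task_id}"
echo "$log_stream"
```

The resulting value can be passed to `aws logs tail` as `--log-stream-names "$log_stream"`.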

Cleanup

To remove the resources created during this process, execute the following commands.

aws ecs deregister-task-definition \
--region us-west-2 \
--task-definition bottlerocket-gpu:1

delete_instances=$(aws ec2 describe-instances --region us-west-2 \
--filters "Name=tag:bottlerocket,Values=quickstart" \
--query 'Reservations[].Instances[].InstanceId' --output text)

for instance in $delete_instances
do aws ec2 terminate-instances --instance-ids $instance --region us-west-2
done

aws ecs delete-cluster \
--region us-west-2 \
--cluster ecs-bottlerocket

aws logs delete-log-group --log-group-name '/ecs/bottlerocket' --region us-west-2
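ECS does not delete a cluster while container instances are still registered to it. If `delete-cluster` reports that the cluster still contains container instances, a sketch like the following force-deregisters them so you can run `delete-cluster` again. `deregister_container_instances` is a hypothetical helper; you can also run `aws ec2 wait instance-terminated` with the instance IDs first so that the instances are fully gone.

```shell
# Hypothetical helper: force-deregister every container instance still
# registered with ecs-bottlerocket so that delete-cluster can succeed.
deregister_container_instances() {
  # Text output prints the ARNs separated by whitespace; an empty list prints nothing.
  for ci in $(aws ecs list-container-instances \
      --region us-west-2 \
      --cluster ecs-bottlerocket \
      --query 'containerInstanceArns[]' \
      --output text); do
    aws ecs deregister-container-instance \
      --region us-west-2 \
      --cluster ecs-bottlerocket \
      --container-instance "$ci" \
      --force
  done
}
```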

Conclusion

In this article, we explored how to create an ECS task definition that enables running a GPU-accelerated workload in a container on Bottlerocket, efficiently and securely. We also demonstrated how to access container logs in CloudWatch and from the command line. For additional examples of GPU-accelerated workloads suitable for Bottlerocket on ECS, visit the NVIDIA GPU-optimized containers available in the NVIDIA NGC catalog on AWS Marketplace.
