Private Package Installation in Amazon SageMaker Operating in Offline Mode | Artificial Intelligence


Amazon SageMaker Studio notebooks and Amazon SageMaker notebook instances have internet access by default. However, many regulated industries, including finance, healthcare, and telecommunications, require network traffic to be routed through their own Amazon Virtual Private Cloud (Amazon VPC) so that public internet access can be restricted and controlled. You can disable direct internet access for SageMaker Studio notebooks and notebook instances, but data scientists still need to be able to install the packages they rely on. The solution is to create isolated development environments that provide your chosen packages and kernels.

In this post, we show how to set up such an environment for Amazon SageMaker notebook instances and SageMaker Studio. We also describe how to integrate this setup with AWS CodeArtifact, a fully managed artifact repository for securely storing, publishing, and sharing software packages throughout the software development lifecycle.

Solution Overview

This post outlines the following steps:

  1. Configure Amazon SageMaker for offline operation.
  2. Set up a Conda repository utilizing Amazon Simple Storage Service (Amazon S3) by creating a bucket to host your Conda channels.
  3. Establish a Python Package Index (PyPI) repository using CodeArtifact, including creating a repository and configuring AWS PrivateLink endpoints for access.
  4. Develop an isolated development environment with Amazon SageMaker notebook instances, employing lifecycle configuration features to create a custom Conda environment and configure your PyPI client.
  5. Install packages within SageMaker Studio notebooks by creating a custom Amazon SageMaker image and using either Conda or pip for package installation.

Configuring Amazon SageMaker for Offline Operation

We assume that you have already set up a VPC, which lets you provision a private, isolated section of the AWS Cloud and launch AWS resources in a virtual network that you define. This VPC hosts Amazon SageMaker along with the other components of your data science infrastructure. For details on building secure environments and following well-architected guidelines, consult the Financial Services Industry Lens: AWS Well-Architected Framework.

Creating an Amazon SageMaker Notebook Instance

You can disable internet access for Amazon SageMaker notebooks and attach them to your secure VPC environment. This lets you apply network-level controls, such as restricting resource access with security groups and managing data ingress and egress.

  1. From the Amazon SageMaker console, select Notebook instances in the navigation pane.
  2. Click Create notebook instance.
  3. Choose your IAM role.
  4. Select your VPC.
  5. Choose your subnet.
  6. Select your security group(s).
  7. For Direct internet access, choose Disable — use VPC only.
  8. Click Create notebook instance.
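
The same console steps can be sketched with the AWS CLI. This is a minimal example, not the only way to do it: the function name `create_offline_notebook`, the argument order, and the default instance type are assumptions made for illustration, and the role, subnet, and security group values must come from your own account.

```shell
# Sketch: create a notebook instance with direct internet access disabled.
# Arguments (all placeholders supplied by the caller):
#   $1 notebook name, $2 IAM role ARN, $3 subnet ID, $4 security group ID
create_offline_notebook() {
  aws sagemaker create-notebook-instance \
    --notebook-instance-name "$1" \
    --instance-type "${INSTANCE_TYPE:-ml.t3.medium}" \
    --role-arn "$2" \
    --subnet-id "$3" \
    --security-group-ids "$4" \
    --direct-internet-access Disabled
}
```

A call would look like `create_offline_notebook my-notebook <role-arn> <subnet-id> <security-group-id>`. The `--direct-internet-access Disabled` flag corresponds to step 7 above.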

Connect to your notebook instance through your VPC instead of the public internet.

Amazon SageMaker notebook instances support VPC interface endpoints, which allow secure communication entirely within the AWS network. For guidance, refer to the documentation on Creating an interface endpoint.

Setting Up SageMaker Studio

Similar to Amazon SageMaker notebook instances, SageMaker Studio can be launched within your chosen VPC while also disabling direct internet access for added security.

  1. From the Amazon SageMaker console, choose Amazon SageMaker Studio in the navigation pane.
  2. Select Standard setup.
  3. In the Network section, choose the VPC only network access type. This disables direct internet access; you can set it when onboarding to Studio in the console or when calling the CreateDomain API.

This configuration prevents Amazon SageMaker from providing internet access to your SageMaker Studio notebooks.
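
For teams that onboard through the API rather than the console, the following sketch shows the equivalent CreateDomain call through the AWS CLI. The function name `create_vpc_only_domain`, the argument order, and the choice of IAM authentication are illustrative assumptions. Security groups are passed in the default user settings because VPC-only domains need them there.

```shell
# Sketch: create a Studio domain that uses VPC-only network access.
# Arguments (all placeholders supplied by the caller):
#   $1 domain name, $2 execution role ARN, $3 VPC ID,
#   $4 subnet ID,   $5 security group ID
create_vpc_only_domain() {
  aws sagemaker create-domain \
    --domain-name "$1" \
    --auth-mode IAM \
    --default-user-settings "ExecutionRole=$2,SecurityGroups=$5" \
    --vpc-id "$3" \
    --subnet-ids "$4" \
    --app-network-access-type VpcOnly
}
```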

Create interface endpoints (via AWS PrivateLink) to access various AWS services, including:

  • Amazon SageMaker API
  • Amazon SageMaker runtime
  • Amazon S3
  • AWS Security Token Service (AWS STS)
  • Amazon CloudWatch
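
As a rough sketch, the interface endpoints listed above could be created in a loop with the AWS CLI. The function name `create_interface_endpoints` and its arguments are assumptions for illustration. The service suffixes follow the usual `com.amazonaws.<region>.<service>` naming, with `logs` and `monitoring` standing in for CloudWatch Logs and CloudWatch metrics. Amazon S3 is left out here because it is commonly reached through a gateway endpoint, which takes route tables rather than subnets.

```shell
# Sketch: create one interface endpoint per service in the given VPC.
# Arguments (all placeholders supplied by the caller):
#   $1 region, $2 VPC ID, $3 subnet ID, $4 security group ID
create_interface_endpoints() {
  for service in sagemaker.api sagemaker.runtime sts logs monitoring; do
    aws ec2 create-vpc-endpoint \
      --vpc-id "$2" \
      --vpc-endpoint-type Interface \
      --service-name "com.amazonaws.$1.$service" \
      --subnet-ids "$3" \
      --security-group-ids "$4" \
      --private-dns-enabled
  done
}
```

Verify the exact service names available in your Region (for example with `aws ec2 describe-vpc-endpoint-services`) before relying on this list.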

Setting Up a Custom Conda Repository Using Amazon S3

Amazon SageMaker notebooks come with several pre-installed environments. Each Jupyter kernel in an Amazon SageMaker notebook corresponds to a separate Conda environment. To use an external library in a specific kernel, you must install the library in that kernel's environment, typically with the conda install command. However, because the notebook instances have no internet access, we need to change the default Conda channel paths to point to a private repository that holds our packages.

To build such a custom channel, create an S3 bucket and upload the necessary packages. The channel can contain packages approved by your organization, packages you build yourself with conda build, or both. The channel must be re-indexed whenever packages are added or updated; the methods for indexing are beyond the scope of this post.
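
The upload step might look like the sketch below. It assumes the bucket already exists and that the local channel directory has already been indexed, so each platform subdirectory contains a `repodata.json`. The function name `upload_conda_channel` is hypothetical, and the check on `noarch/repodata.json` is only a basic sanity test based on Conda's expectation that a channel has an indexed `noarch` subdirectory.

```shell
# Sketch: sync an already-indexed local Conda channel to an S3 prefix.
# Arguments (all placeholders supplied by the caller):
#   $1 local channel directory, $2 bucket name, $3 channel name (S3 prefix)
upload_conda_channel() {
  # Refuse to upload a directory that has not been indexed yet.
  if [ ! -f "$1/noarch/repodata.json" ]; then
    echo "missing $1/noarch/repodata.json; index the channel first" >&2
    return 1
  fi
  aws s3 sync "$1" "s3://$2/$3/"
}
```

For example, `upload_conda_channel ./main user-conda-repository main` would populate the `s3://user-conda-repository/main/` channel.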

Because the notebook has no direct internet access, it cannot reach the S3 bucket until you create a VPC endpoint.

Create an Amazon S3 VPC endpoint to direct traffic through the VPC rather than the public internet.
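
One way to do this is with a gateway endpoint attached to the route table used by the notebook's subnet. The sketch below assumes that setup; the function name `create_s3_gateway_endpoint` and its arguments are illustrative.

```shell
# Sketch: create an S3 gateway endpoint and attach it to a route table.
# Arguments (all placeholders supplied by the caller):
#   $1 region, $2 VPC ID, $3 route table ID
create_s3_gateway_endpoint() {
  aws ec2 create-vpc-endpoint \
    --vpc-id "$2" \
    --vpc-endpoint-type Gateway \
    --service-name "com.amazonaws.$1.s3" \
    --route-table-ids "$3"
}
```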

By establishing a VPC endpoint, you enable your notebook instance to access the bucket containing your channels and their packages.

We recommend also implementing a custom resource-based bucket policy that restricts access to your S3 buckets solely to requests originating from your private VPC. For further instructions, see the documentation on Endpoints for Amazon S3.
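
A bucket policy of this kind could be generated as sketched below. It assumes you restrict by VPC endpoint ID with the `aws:SourceVpce` condition key; `aws:SourceVpc` is an alternative if you prefer to match on the VPC itself. The function name `render_vpce_bucket_policy` and the statement ID are hypothetical.

```shell
# Sketch: print a bucket policy that denies all S3 actions on the bucket
# unless the request arrives through the given VPC endpoint.
# Arguments (placeholders supplied by the caller):
#   $1 bucket name, $2 VPC endpoint ID
render_vpce_bucket_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAccessOutsideVpcEndpoint",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::$1",
        "arn:aws:s3:::$1/*"
      ],
      "Condition": {
        "StringNotEquals": { "aws:SourceVpce": "$2" }
      }
    }
  ]
}
EOF
}
```

You could write the output to a file and apply it with `aws s3api put-bucket-policy --bucket <bucket> --policy file://policy.json`. Note that a blanket deny like this also blocks console and administrative access from outside the endpoint, so review it carefully before applying it.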

Replace the default channels of the Conda environment in your Amazon SageMaker notebooks with your custom channel (which we will cover in the next step when building the isolated development environment):

# remove default channel from the .condarc
conda config --remove channels 'defaults'
# add the conda channels to the .condarc file
conda config --add channels 's3://user-conda-repository/main/'
conda config --add channels 's3://user-conda-repository/condaforge/'

Setting Up a Custom PyPI Repository Using CodeArtifact

Data scientists often rely on package managers such as pip, Maven, and npm to install packages into their environments. By default, pip retrieves packages from the public PyPI repository. To secure your environment, you can use a private package management tool, either on premises (such as Artifactory or Nexus) or on AWS (such as CodeArtifact). This lets you restrict installations to approved packages and run safety checks on them. Alternatively, you can deploy a private PyPI mirror on Amazon Elastic Container Service (Amazon ECS) or AWS Fargate to replicate the public PyPI repository within your private environment. For more on this approach, refer to the article Building Secure Environments.
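
As a minimal sketch of the CodeArtifact route, the functions below create a domain and repository, add the two PrivateLink endpoints CodeArtifact uses, and point pip at the repository. All function names and arguments are assumptions for illustration, and the sketch omits repository permissions and upstream configuration.

```shell
# Sketch: create a CodeArtifact domain and a repository inside it.
#   $1 domain name, $2 repository name
create_codeartifact_repo() {
  aws codeartifact create-domain --domain "$1"
  aws codeartifact create-repository --domain "$1" --repository "$2"
}

# Sketch: create the interface endpoints for the CodeArtifact API and
# for package repository traffic.
#   $1 region, $2 VPC ID, $3 subnet ID, $4 security group ID
create_codeartifact_endpoints() {
  for service in codeartifact.api codeartifact.repositories; do
    aws ec2 create-vpc-endpoint \
      --vpc-id "$2" \
      --vpc-endpoint-type Interface \
      --service-name "com.amazonaws.$1.$service" \
      --subnet-ids "$3" \
      --security-group-ids "$4"
  done
}

# Sketch: configure pip to install from the repository.
#   $1 domain name, $2 AWS account ID that owns the domain, $3 repository name
configure_pip_for_codeartifact() {
  aws codeartifact login --tool pip \
    --domain "$1" --domain-owner "$2" --repository "$3"
}
```

The `login` command writes a temporary authorization token into the pip configuration, so it has to be run again when the token expires.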
