The increasing prominence of AI within large enterprises presents significant challenges in managing AI platforms. Chief among these is building scalable, operationally efficient platforms that meet organizational security and compliance standards. Amazon SageMaker Studio provides an extensive set of features for machine learning (ML) practitioners and data scientists. It offers a fully managed AI development environment with an integrated development environment (IDE) that streamlines the entire ML workflow. Its collaborative features, such as real-time co-editing and notebook sharing, support effective teamwork, and its scalability and performance are well suited to large datasets. With built-in security, cost efficiency, and pre-configured tools such as Amazon SageMaker Autopilot, Amazon SageMaker JumpStart, and Amazon SageMaker Feature Store, SageMaker Studio is a robust platform for accelerating AI initiatives and supporting data scientists at every level of expertise.
Deutsche Bahn, a leading transportation company in Germany, recorded revenues of 56.3 billion EUR in 2022 and employs 336,884 people across 130 countries. It provides a broad spectrum of services, including public transit, regional transport, freight logistics, and rail infrastructure. By integrating traffic and railway services with smart, eco-friendly connections across all transport modes, Deutsche Bahn moves both people and goods efficiently. The organization has taken a leading role in AI adoption and uses SageMaker Studio as its principal AI platform. A dedicated AI platform team manages and operates the SageMaker Studio platform, while numerous data analytics teams use it to develop, train, and run analytics and ML workloads.
The primary goal of the AI platform team is to facilitate seamless access to Workbench services and SageMaker Studio for all Deutsche Bahn teams and projects, particularly focusing on data scientists and ML engineers. This platform enables Deutsche Bahn to explore a wide range of use cases, from railway maintenance and forecasting to prospective applications in generative AI.
The managed service built on SageMaker Studio harmonizes with Deutsche Bahn’s overarching platform strategy. It meets the company’s compliance obligations, expedites project initiation by provisioning a SageMaker domain, and mitigates maintenance burdens through a comprehensive operational model. Key advantages include the service’s high scalability, largely attributable to automation and a self-service model, along with an appealing pricing structure based on resource consumption.
“SageMaker Studio offered us a unified platform that is scalable, compliant with security standards, and meets the developmental requirements of data scientists from various analytics teams within Deutsche Bahn. Previously, each team managed their own JupyterLab notebooks, which was neither efficient nor cost-effective. Within eight weeks, we onboarded over 120 developers, provisioned 25 SageMaker domains, and swiftly began utilizing this platform,” shared Sophia Ramirez, product owner at DB Systel.
This post describes how Deutsche Bahn scaled and operated its AI platform on SageMaker Studio across multiple teams while maintaining stringent security and oversight.
Platform Overview
Deutsche Bahn’s architecture includes a central platform account managed by a dedicated platform team responsible for overseeing the infrastructure and operations of SageMaker Studio. Resources within SageMaker Studio are organized into domains, each comprising an associated Amazon Elastic File System (Amazon EFS) volume, a list of authorized users, and various security, application, policy, and Amazon Virtual Private Cloud (Amazon VPC) configurations. Data scientists across different teams utilize SageMaker domains for their ML tasks; each team operates a dedicated SageMaker domain for developing and testing ML models while collaborating through features like notebook sharing.
From an infrastructure standpoint, the VPC provisioned within the AI platform account, as shown in the accompanying figure, has no outbound internet access, which supports security and compliance requirements. For high availability, multiple identical, isolated private subnets are established. SageMaker Studio domains are deployed in VPC-only mode, which creates an elastic network interface for communication between the SageMaker service account (an AWS service account) and the VPC of the platform account. VPC endpoints for services such as the SageMaker API, SageMaker Studio, and SageMaker notebooks allow secure and reliable communication between the platform account’s VPC and the SageMaker domain managed by AWS in the SageMaker service account.
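To make the VPC-only setup more concrete, the following minimal sketch assembles the request parameters that the SageMaker `CreateDomain` API accepts for such a domain. It is an illustration under stated assumptions, not Deutsche Bahn's actual deployment code: the function name, the `<team>-domain` naming scheme, the `team` tag, and all identifiers are hypothetical. The IAM authentication mode is the one mentioned later in this post.

```python
from typing import Any


def build_domain_request(
    team: str,
    vpc_id: str,
    subnet_ids: list[str],
    execution_role_arn: str,
) -> dict[str, Any]:
    """Assemble keyword arguments for the SageMaker CreateDomain API.

    The function only builds a dictionary; it does not call AWS.
    """
    # Assumption: "high availability" means at least two private subnets.
    if len(subnet_ids) < 2:
        raise ValueError("provide at least two private subnets")
    return {
        "DomainName": f"{team}-domain",  # hypothetical naming scheme
        "AuthMode": "IAM",  # IAM authentication mode
        "AppNetworkAccessType": "VpcOnly",  # no direct internet access
        "VpcId": vpc_id,
        "SubnetIds": list(subnet_ids),
        "DefaultUserSettings": {"ExecutionRole": execution_role_arn},
        "Tags": [{"Key": "team", "Value": team}],  # hypothetical tag
    }


if __name__ == "__main__":
    request = build_domain_request(
        team="analytics",
        vpc_id="vpc-0123456789abcdef0",  # placeholder IDs
        subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
        execution_role_arn=(
            "arn:aws:iam::111122223333:role/SageMaker-execution-role"
        ),
    )
    print(request["AppNetworkAccessType"])
```

With AWS credentials and boto3 available, the dictionary could be passed as `boto3.client("sagemaker").create_domain(**request)`. Keeping the request builder separate from the API call makes the network settings easy to review and test in isolation.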
Each data analytics team can request one or more SageMaker domains through the internal self-service portal. Ordering a SageMaker domain is orchestrated by a separate workflow in AWS Step Functions. During this orchestration, an Azure Active Directory (AD) group is provisioned for the data analytics team, with a name corresponding to the domain. The orchestration then triggers a continuous integration and continuous deployment (CI/CD) pipeline that deploys an AWS Cloud Development Kit (AWS CDK) application containing a SageMaker domain for the respective team.
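The ordering workflow described above could be expressed roughly as the following Amazon States Language definition, built here as a Python dictionary. This is a sketch under assumptions, not the actual workflow: the post does not say how the AD group is created or which CI/CD service is used, so the AWS Lambda function, the AWS CodePipeline pipeline, and all state names are hypothetical.

```python
import json


def build_order_workflow(ad_group_function: str, pipeline_name: str) -> dict:
    """Return a two-step Amazon States Language definition.

    Step 1 invokes a (hypothetical) Lambda function that creates the AD group.
    Step 2 starts a (hypothetical) CodePipeline pipeline that deploys the
    AWS CDK application for the requested domain.
    """
    return {
        "Comment": "Hypothetical SageMaker domain ordering workflow",
        "StartAt": "ProvisionAdGroup",
        "States": {
            "ProvisionAdGroup": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": ad_group_function,
                    "Payload.$": "$",  # pass the order request through
                },
                "ResultPath": "$.adGroup",
                "Next": "StartDeploymentPipeline",
            },
            "StartDeploymentPipeline": {
                "Type": "Task",
                "Resource": (
                    "arn:aws:states:::aws-sdk:codepipeline:"
                    "startPipelineExecution"
                ),
                "Parameters": {"Name": pipeline_name},
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    definition = build_order_workflow(
        ad_group_function="create-ad-group",  # hypothetical function name
        pipeline_name="sagemaker-domain-cdk",  # hypothetical pipeline name
    )
    print(json.dumps(definition, indent=2))
```

The JSON output could be supplied as the definition of a Step Functions state machine. Running the two steps in sequence reflects the order given in the text: the AD group exists before the domain that relies on it for authorization is deployed.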
Alongside the SageMaker domain, the AWS CDK application provisions a customized AWS Identity and Access Management (IAM) role (SageMaker-execution-role), an Amazon Simple Storage Service (Amazon S3) bucket (data-bucket), a customer-managed key (CMK), and additional AWS resources during deployment, as illustrated in the following figure. The AD group contains the data scientists who require access to their team’s SageMaker domain. The AD group name corresponds to the SageMaker domain’s name and is used primarily during the authorization process.
Client separation is achieved at the level of SageMaker domains through IAM authentication mode. A domain-specific IAM role (SageMaker-execution-role) is assigned to each domain, adheres to the principle of least privilege, and is used by the data analytics team during the login process. This role allows data scientists within the team to carry out activities such as running processing jobs, hyperparameter tuning jobs, transform jobs, and experiments, as well as building models. SageMaker executes these ML tasks on behalf of the user by means of the iam:PassRole permission. However, certain actions, such as creating S3 buckets, altering IAM roles, updating SageMaker domains, and provisioning large instances, are restricted to maintain security, compliance, and cost control. The associated IAM policy ensures that the data analytics team has access solely to the S3 bucket and CMK that belong to their authorized domain, as depicted in the following figure. Additionally, the SageMaker-execution-role permits team members to assume roles in other accounts within the Deutsche Bahn organization from SageMaker Studio, which gives them the flexibility to access resources such as Amazon Relational Database Service (Amazon RDS), other S3 buckets, and Amazon Athena. The IAM policy uses the aws:RequestTag and aws:ResourceTag condition keys for precise access control during SageMaker activities such as processing jobs, training jobs, and model creation. These tags also help track the costs associated with the domain.
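The access rules described above can be sketched as an IAM policy document. The example below is a simplified illustration, not the policy Deutsche Bahn uses: the tag key `sagemaker-domain`, the statement IDs, and the selected actions are assumptions chosen to show how a bucket and key restriction combines with the `aws:RequestTag` and `aws:ResourceTag` condition keys.

```python
import json


def build_execution_policy(domain: str, bucket: str, kms_key_arn: str) -> dict:
    """Build a least-privilege policy sketch for one SageMaker domain."""
    tag_key = "sagemaker-domain"  # hypothetical tag key
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Access only to the team's own data bucket.
                "Sid": "TeamBucketOnly",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            },
            {   # Access only to the team's own customer-managed key.
                "Sid": "TeamKeyOnly",
                "Effect": "Allow",
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": kms_key_arn,
            },
            {   # New jobs and models must carry the domain tag.
                "Sid": "CreateOnlyWithDomainTag",
                "Effect": "Allow",
                "Action": [
                    "sagemaker:CreateProcessingJob",
                    "sagemaker:CreateTrainingJob",
                    "sagemaker:CreateModel",
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {f"aws:RequestTag/{tag_key}": domain}
                },
            },
            {   # Existing jobs are visible only if tagged for this domain.
                "Sid": "ManageOnlyDomainTaggedJobs",
                "Effect": "Allow",
                "Action": [
                    "sagemaker:DescribeTrainingJob",
                    "sagemaker:StopTrainingJob",
                    "sagemaker:DescribeProcessingJob",
                    "sagemaker:StopProcessingJob",
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {f"aws:ResourceTag/{tag_key}": domain}
                },
            },
        ],
    }


if __name__ == "__main__":
    policy = build_execution_policy(
        domain="analytics-domain",  # placeholder values
        bucket="analytics-data-bucket",
        kms_key_arn="arn:aws:kms:eu-central-1:111122223333:key/example",
    )
    print(json.dumps(policy, indent=2))
```

A production policy would need further statements, for example `sagemaker:AddTags` so that tags can be attached at creation time and a scoped `iam:PassRole` for the execution role. Because the same tag value appears on every resource a domain creates, it can also serve as a cost allocation tag.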