The increasing impact of artificial intelligence (AI) within large organizations presents significant challenges in managing AI platforms. Key among these challenges is the creation of a platform that is both scalable and operationally efficient while adhering to compliance and security standards. Amazon SageMaker Studio provides a robust suite of features for machine learning (ML) professionals and data scientists. This includes a fully managed AI development environment with an integrated development environment (IDE), streamlining the entire ML workflow. Its collaborative features, such as real-time co-editing and notebook sharing, facilitate effective teamwork, while its scalability and high-performance training capabilities support large datasets. With built-in security measures, cost-effectiveness, and a variety of pre-built tools like Amazon SageMaker Autopilot, Amazon SageMaker JumpStart, and Amazon SageMaker Feature Store, SageMaker Studio emerges as a powerful platform for accelerating AI initiatives and empowering data scientists of all skill levels.
Deutsche Bahn stands as a leading transportation entity in Germany, generating revenue of 56.3 billion EUR in 2022, employing 336,884 individuals (including 221,343 in Germany), and operating across 130 countries. Their extensive services encompass public and regional transport, freight services, and rail infrastructure. Deutsche Bahn connects people and goods through the integrated operation of traffic and railway systems, along with the economically and ecologically intelligent integration of all transportation modes. The organization has taken significant strides in AI adoption, leveraging SageMaker Studio as a pivotal AI platform. A dedicated AI platform team at Deutsche Bahn oversees the management and operation of the SageMaker Studio platform, while multiple data analytics teams within the organization utilize this platform to develop, train, and execute various analytics and ML projects.
The primary goal of the AI platform team is to ensure seamless access to Workbench services and SageMaker Studio for all Deutsche Bahn teams and projects, with a special emphasis on data scientists and ML engineers. This platform enables Deutsche Bahn to explore a wide array of use cases, including railway maintenance, forecasting, and prospective applications in generative AI.
The managed AI platform, built on SageMaker Studio, aligns effectively with Deutsche Bahn’s overarching platform strategy. It complies with the company’s regulatory requirements, facilitates rapid project initiation by provisioning a SageMaker domain, and minimizes maintenance burdens through a comprehensive operational model. Notable advantages include high scalability, largely attributable to automation and a self-service framework, and a compelling pricing model based on resource consumption.
“SageMaker Studio has provided us a unified platform that is scalable, security compliant, and meets the development needs of data scientists from various analytics teams within the DB organization. Previously, each team operated their own JupyterLab notebooks, which was neither efficient nor cost-effective. In just 8 weeks, we onboarded over 120 developers, provisioned 25 SageMaker domains, and quickly commenced using this platform.” – Sarah Williams, product owner at DB Systel.
In this article, we delve into how Deutsche Bahn scaled and managed their AI platform using SageMaker Studio across multiple teams while ensuring stringent security and oversight.
Solution Overview
The architecture at Deutsche Bahn includes a central platform account managed by a dedicated platform team responsible for overseeing infrastructure and operations for SageMaker Studio. Resources in SageMaker Studio are organized by domains, each with an associated Amazon Elastic File System (Amazon EFS) volume, a list of authorized users, and various security, application, policy, and Amazon Virtual Private Cloud (Amazon VPC) configurations. Data scientists from different teams utilize SageMaker domains for their ML tasks; each team operates within a dedicated SageMaker domain for model development and testing, collaborating through features such as notebook sharing.
From an infrastructure standpoint, the VPC in the AI platform account, as illustrated in the accompanying figure, has no outbound internet connectivity, which supports the security and compliance requirements. For higher availability, several identical, isolated private subnets are allocated. The SageMaker Studio domains run in VPC-only mode, which creates an elastic network interface for communication between the SageMaker service account (AWS service account) and the platform account’s VPC. VPC endpoints for the SageMaker API, SageMaker Studio, and SageMaker notebooks provide secure and reliable communication between the platform account’s VPC and the SageMaker domain that AWS manages in the SageMaker service account.
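As a rough illustration of the VPC-only setup described above, the following Python sketch assembles the request parameters accepted by the SageMaker CreateDomain API (for example through boto3's `create_domain`). Every identifier passed in (VPC, subnet, security group, role, and key values) is a placeholder, and the two-subnet minimum is an assumption made for this example. Deutsche Bahn provisions its domains through an AWS CDK application, so this only approximates the resulting configuration.

```python
import json


def build_domain_request(domain_name, vpc_id, subnet_ids, security_group_ids,
                         execution_role_arn, kms_key_id):
    """Assemble CreateDomain parameters for a VPC-only domain with IAM auth."""
    if len(subnet_ids) < 2:
        # Assumption for this sketch: require several private subnets
        # so the domain is spread across more than one subnet.
        raise ValueError("provide at least two private subnets")
    return {
        "DomainName": domain_name,
        "AuthMode": "IAM",                  # IAM authentication mode
        "AppNetworkAccessType": "VpcOnly",  # no direct internet access
        "VpcId": vpc_id,
        "SubnetIds": list(subnet_ids),
        "KmsKeyId": kms_key_id,             # customer-managed key for the EFS volume
        "DefaultUserSettings": {
            "ExecutionRole": execution_role_arn,
            "SecurityGroups": list(security_group_ids),
        },
    }


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    request = build_domain_request(
        domain_name="team-a",
        vpc_id="vpc-0123456789abcdef0",
        subnet_ids=["subnet-aaa", "subnet-bbb"],
        security_group_ids=["sg-0123"],
        execution_role_arn="arn:aws:iam::111122223333:role/SageMaker-execution-role",
        kms_key_id="1234abcd-12ab-34cd-56ef-1234567890ab",
    )
    print(json.dumps(request, indent=2))
    # With boto3 installed and credentials configured, this could be sent as:
    #   boto3.client("sagemaker").create_domain(**request)
```

Building the request as plain data keeps the sketch runnable without AWS credentials; the same fields map onto the corresponding properties of a CDK or CloudFormation domain resource.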
Each data analytics team can request one or more SageMaker domains through the company’s internal self-service portal. The request is orchestrated by a separate workflow built on AWS Step Functions. During this orchestration, an Azure Active Directory (AD) group is provisioned for the data analytics team, with the group name corresponding to the domain name. The workflow then triggers a continuous integration and continuous deployment (CI/CD) pipeline that deploys an AWS Cloud Development Kit (AWS CDK) application, which creates a SageMaker domain for the respective team.
Alongside the SageMaker domain, the AWS CDK app provisions a tailored AWS Identity and Access Management (IAM) role (SageMaker-execution-role), an Amazon Simple Storage Service (Amazon S3) bucket (data-bucket), a customer-managed key (CMK), and other AWS resources during deployment. The AD group consists of the data scientists who require access to their team’s SageMaker domain. Its name matches the SageMaker domain’s name and is primarily used during the authorization process.
Client separation is implemented at the level of SageMaker domains using IAM authentication mode. Each domain is assigned a domain-specific IAM role (SageMaker-execution-role) that follows the principle of least privilege and that the data analytics team assumes during login. This role permits data scientists within the team to carry out activities such as running processing jobs, hyperparameter tuning jobs, transform jobs, and experiments, as well as creating models. SageMaker runs these ML tasks on behalf of the user, which requires the iam:PassRole permission. However, certain actions, such as creating S3 buckets, modifying IAM roles, updating SageMaker domains, and provisioning large instances, are restricted for security, compliance, and cost management purposes. The associated IAM policy ensures that the data analytics team has access only to the S3 bucket and CMK belonging to their domain.
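The least-privilege scoping described above can be sketched as an IAM policy document built in Python. This is an illustrative approximation, not Deutsche Bahn's actual policy: the function name, statement IDs, and the selection of actions are assumptions, and the restriction on large instance types is left out for brevity.

```python
import json


def build_execution_role_policy(bucket_name, kms_key_arn, execution_role_arn):
    """Return a policy limiting a team to its own bucket and key (sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Read and write objects only in the team's data bucket.
                "Sid": "TeamBucketObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {   # Listing is granted on the bucket itself, not on its objects.
                "Sid": "TeamBucketList",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {   # Use only the domain's customer-managed key.
                "Sid": "TeamKey",
                "Effect": "Allow",
                "Action": ["kms:Encrypt", "kms:Decrypt",
                           "kms:GenerateDataKey*", "kms:DescribeKey"],
                "Resource": kms_key_arn,
            },
            {   # Let SageMaker run jobs on the user's behalf with this role only.
                "Sid": "PassRoleToSageMaker",
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": execution_role_arn,
                "Condition": {"StringEquals": {
                    "iam:PassedToService": "sagemaker.amazonaws.com"}},
            },
            {   # Explicitly deny a few of the restricted actions named in the text.
                "Sid": "DenyRestrictedActions",
                "Effect": "Deny",
                "Action": ["s3:CreateBucket", "iam:UpdateRole",
                           "iam:PutRolePolicy", "iam:AttachRolePolicy",
                           "sagemaker:UpdateDomain"],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder names and ARNs, chosen only for the example.
    policy = build_execution_role_policy(
        bucket_name="team-a-data-bucket",
        kms_key_arn="arn:aws:kms:eu-central-1:111122223333:key/example-key-id",
        execution_role_arn="arn:aws:iam::111122223333:role/SageMaker-execution-role",
    )
    print(json.dumps(policy, indent=2))
```

Because the bucket and key are referenced by exact ARN, a role generated for one domain grants nothing on another team's data, which is the separation the paragraph above describes.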
Moreover, the SageMaker-execution-role enables team members to assume roles in other accounts within the Deutsche Bahn organization from SageMaker Studio, giving them the flexibility to access resources such as Amazon Relational Database Service (Amazon RDS), additional S3 buckets, and Amazon Athena. The IAM policy uses the aws:RequestTag and aws:ResourceTag condition keys for precise access control during SageMaker activities, including processing jobs, training jobs, and model creation. These tags also help track the costs associated with each domain.
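To make the tag-based control more concrete, here is a minimal Python sketch of two policy statements that use aws:RequestTag and aws:ResourceTag. The tag key `sagemaker-domain`, the function name, and the chosen actions are hypothetical; the source does not state which tag keys or actions Deutsche Bahn uses.

```python
import json


def build_tag_scoped_statements(domain_name, tag_key="sagemaker-domain"):
    """Return IAM statements that tie SageMaker resources to one domain tag."""
    create_actions = [
        "sagemaker:CreateProcessingJob",
        "sagemaker:CreateTrainingJob",
        "sagemaker:CreateModel",
    ]
    manage_actions = [
        "sagemaker:DescribeProcessingJob", "sagemaker:StopProcessingJob",
        "sagemaker:DescribeTrainingJob", "sagemaker:StopTrainingJob",
        "sagemaker:DescribeModel", "sagemaker:DeleteModel",
    ]
    return [
        {   # Creation succeeds only if the request carries the domain tag.
            "Sid": "CreateOnlyWithDomainTag",
            "Effect": "Allow",
            "Action": create_actions + ["sagemaker:AddTags"],
            "Resource": "*",
            "Condition": {"StringEquals": {
                f"aws:RequestTag/{tag_key}": domain_name}},
        },
        {   # Existing resources are reachable only if tagged for this domain.
            "Sid": "ManageOnlyOwnDomainResources",
            "Effect": "Allow",
            "Action": manage_actions,
            "Resource": "*",
            "Condition": {"StringEquals": {
                f"aws:ResourceTag/{tag_key}": domain_name}},
        },
    ]


if __name__ == "__main__":
    print(json.dumps(build_tag_scoped_statements("team-a"), indent=2))
```

If the same tag key is activated as a cost allocation tag, the costs of jobs and models created under these statements can be grouped per domain, in line with the cost tracking mentioned above.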