R is a widely recognized programming language among data scientists and analysts for data manipulation, statistical analysis, data visualization, and machine learning (ML) model development. RStudio, the integrated development environment for R, offers both open-source tools and enterprise-level software that enables teams to collaborate effectively within their organizations. However, setting up, securing, scaling, and maintaining RStudio can be a complex and time-intensive process.
Running RStudio on AWS provides the scalability and flexibility that on-premises deployments lack, and it frees you from infrastructure management. You can select compute and memory options that match your processing needs and adjust your resources to accommodate varying analytical and ML workloads without an upfront investment. This lets you experiment quickly with new data sources and code and roll out new analytics processes and ML models across the organization. You can also integrate your data lake resources, making them available to developers and data scientists while securing data access with row-level and column-level controls through AWS Lake Formation.
This article outlines two straightforward methods for deploying and operating RStudio on AWS to access data stored in a data lake:
1. Fully Managed on Amazon SageMaker
RStudio on Amazon SageMaker is a managed service that eliminates the need to manage the underlying infrastructure of your RStudio environment. You can easily incorporate your own RStudio Workbench license through AWS License Manager. Moreover, RStudio on SageMaker integrates with AWS Identity and Access Management (IAM) or AWS IAM Identity Center (the successor to AWS Single Sign-On) for implementing user-level security access controls. As discussed later in this article, you can leverage AWS Lake Formation to secure your data lake with row-level and column-level access controls.
With Amazon SageMaker, you can dynamically select an instance with the desired compute and memory from an extensive range of ML instances.
2. Self-Hosted on Amazon Elastic Compute Cloud (EC2)
Alternatively, you can opt for a self-hosted approach by deploying the open-source version of RStudio on an EC2 instance, which we will also cover in this article. This method requires the administrator to create an EC2 instance and install RStudio, either by hand or with AWS CloudFormation. However, this option offers less flexibility for user access controls, since all users have the same access level.
RStudio on Amazon SageMaker
Launching RStudio Workbench in SageMaker takes just a few clicks. SageMaker customers do not have to manage the operational aspects of building, installing, securing, scaling, and maintaining RStudio. They pay for RSession compute only when it is in use, avoiding continuous charges associated with running RStudio Server on a t3.medium instance. Users can scale compute dynamically by switching instances as needed. An administrator must set up a SageMaker domain and the corresponding user profiles and obtain the appropriate RStudio license.
Access permissions can be managed at both the RStudio administrator and RStudio user levels within SageMaker. Only profiles granted one of these roles can access RStudio. For further details on the administrator tasks for setting up RStudio on SageMaker, refer to the Amazon SageMaker documentation.
Implementing Lake Formation Row-Level and Column-Level Security Access
In addition to launching RStudio sessions on SageMaker, you can secure your data lake using row-level and column-level access controls from Lake Formation. For more information, refer to the article on effective data lakes using AWS Lake Formation.
With Lake Formation security controls, you can ensure that each individual has the appropriate access to data within the data lake. Take a look at the following user profiles in the SageMaker domain, each with different execution roles:
User Profile | Execution Role |
---|---|
rstudiouser-fullaccess | AmazonSageMaker-ExecutionRole-FullAccess |
rstudiouser-limitedaccess | AmazonSageMaker-ExecutionRole-LimitedAccess |
The dataset discussed in this article is a public COVID-19 dataset. After creating the user profile and assigning the appropriate role, you can access Lake Formation to crawl the data with AWS Glue, generate metadata and tables, and grant access to the table data. For the AmazonSageMaker-ExecutionRole-FullAccess role, access to all columns in the table is granted, while the AmazonSageMaker-ExecutionRole-LimitedAccess role uses the data filter USA_Filter for row-level and cell-level permissions.
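As a rough illustration of what a filter like USA_Filter involves, the following Python sketch builds the request payloads for the Lake Formation `create_data_cells_filter` and `grant_permissions` calls in boto3. The account ID, database name, table name, role ARN, and the row expression `country_region = 'US'` are all hypothetical placeholders; the article does not state the actual filter expression or column list.

```python
def build_data_cells_filter(catalog_id, database, table, name,
                            row_expression, columns=None):
    """Build the TableData payload for a Lake Formation data cells filter.

    If `columns` is given, only those columns are visible; otherwise all
    columns are included through a column wildcard with no exclusions.
    """
    table_data = {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": name,
        # Row-level restriction: only rows matching this expression are returned.
        "RowFilter": {"FilterExpression": row_expression},
    }
    if columns:
        table_data["ColumnNames"] = list(columns)
    else:
        table_data["ColumnWildcard"] = {"ExcludedColumnNames": []}
    return table_data


def build_filter_grant(role_arn, table_data):
    """Build grant_permissions arguments giving a role SELECT through the filter."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": table_data["TableCatalogId"],
                "DatabaseName": table_data["DatabaseName"],
                "TableName": table_data["TableName"],
                "Name": table_data["Name"],
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_filter(table_data, grant):
    """Send both requests to Lake Formation (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders work without boto3 installed

    client = boto3.client("lakeformation")
    client.create_data_cells_filter(TableData=table_data)
    client.grant_permissions(**grant)


# Hypothetical values, for illustration only.
usa_filter = build_data_cells_filter(
    catalog_id="111122223333",
    database="databasename",
    table="tablename",
    name="USA_Filter",
    row_expression="country_region = 'US'",  # assumed column name and value
)
limited_grant = build_filter_grant(
    "arn:aws:iam::111122223333:role/AmazonSageMaker-ExecutionRole-LimitedAccess",
    usa_filter,
)
```

Calling `apply_filter(usa_filter, limited_grant)` would create the filter and attach it to the limited-access role. The same steps can be carried out in the Lake Formation console, as the article describes.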
With the role permissions assigned to each user profile, you can see how Lake Formation enforces the necessary row-level and column-level permissions. To launch RStudio Workbench, open the app menu for a profile in the user list and select RStudio.
In the SageMaker console, you can launch the app as the rstudiouser-limitedaccess user. You'll then see the RStudio Workbench homepage with its sessions, projects, and published content. Choose a session name to begin your session in SageMaker. Install Paws, an AWS SDK for R, to access the appropriate AWS services. You can then retrieve all fields from the dataset through Amazon Athena with a query such as SELECT * FROM "databasename"."tablename"; the results are stored in an Amazon Simple Storage Service (Amazon S3) bucket.
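The Athena query described above can be sketched as follows. The article runs it from R with Paws; this version uses Python and boto3 purely for illustration, and the database name, table name, and S3 output location are placeholders rather than values from the article.

```python
import time


def build_query_request(database, table, output_location):
    """Build the arguments for Athena's start_query_execution call."""
    return {
        "QueryString": f'SELECT * FROM "{database}"."{table}"',
        "QueryExecutionContext": {"Database": database},
        # Athena writes the query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request, poll_seconds=2.0):
    """Start the query, wait for it to finish, and return the first results page.

    Needs boto3 and AWS credentials; the role behind those credentials
    determines which rows and columns Lake Formation lets the query return.
    """
    import boto3  # imported lazily so build_query_request works without boto3

    athena = boto3.client("athena")
    execution_id = athena.start_query_execution(**request)["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=execution_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {execution_id} ended in state {state}")
    return athena.get_query_results(QueryExecutionId=execution_id)


# Placeholder names; substitute your own database, table, and bucket.
request = build_query_request(
    "databasename", "tablename", "s3://example-bucket/athena-results/"
)
```

Run under the limited-access execution role, `run_query(request)` would be expected to return only the rows and columns that the USA_Filter data filter permits, while the full-access role would see every column.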