
Data lakes, business intelligence, operational analytics, and data warehousing all depend on the same fundamental capability: extracting, transforming, and loading (ETL) data for analytics. Since its launch in 2017, AWS Glue has provided a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning, and application development.

AWS Glue interactive sessions let developers build, test, and run data preparation and analytics applications. They provide on-demand access to a fully managed, serverless Apache Spark environment. Interactive sessions also give advanced users the same Apache Spark engine as AWS Glue 2.0 or 3.0, with built-in cost controls and improved speed, so development teams can become productive quickly with their preferred development tools.

In this article, we show how to use AWS Glue interactive sessions with PyCharm to develop AWS Glue jobs.

Solution Overview

This article offers a detailed walkthrough that expands on the guidelines provided in Getting started with AWS Glue interactive sessions. It takes you through these essential steps:

Create an AWS Identity and Access Management (IAM) policy that limits Amazon Simple Storage Service (Amazon S3) read permissions and the associated role for AWS Glue.
Configure access to a development environment, which can either be a local desktop or an OS running on the AWS Cloud via Amazon Elastic Compute Cloud (Amazon EC2).
Integrate AWS Glue interactive sessions with an integrated development environment (IDE).

For validation purposes, we will use the script Validate_Glue_Interactive_Sessions.ipynb, which is available as a Jupyter notebook.

Prerequisites

Before you begin, make sure you have an AWS account. If you don’t have one yet, refer to How do I create and activate a new AWS account? This guide also assumes that PyCharm and Python 3.7 or later are installed.
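As a quick sanity check on the Python prerequisite, the following minimal sketch compares the running interpreter against the 3.7 minimum stated above. The helper name python_is_supported is hypothetical and used only for illustration.

```python
import sys

# Minimum interpreter version stated in the prerequisites.
MINIMUM_PYTHON = (3, 7)


def python_is_supported(version=None, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor) version meets the minimum.

    `version` defaults to the running interpreter; pass a tuple such as
    (3, 6, 9) to check another version. This helper is hypothetical.
    """
    current = sys.version_info[:2] if version is None else version[:2]
    return tuple(current) >= tuple(minimum)


if python_is_supported():
    print("Python version is sufficient for this walkthrough.")
else:
    print("Python 3.7 or later is required; found", sys.version.split()[0])
```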

Create an IAM Policy

The first step is to create an IAM policy that limits Amazon S3 read access to the bucket s3://awsglue-datasets, which contains the AWS Glue public datasets. You use IAM to define the policies and roles needed for access to AWS Glue.

On the IAM console, select Policies in the navigation pane.
Choose Create policy.
On the JSON tab, input the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3-object-lambda:Get*",
                "s3-object-lambda:List*"
            ],
            "Resource": ["arn:aws:s3:::awsglue-datasets/*"]
        }
    ]
}

Select Next: Tags.
Select Next: Review.
For Policy name, input glue_interactive_policy_limit_s3.
Provide a description.
Select Create policy.
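If you prefer to script this step instead of using the console, the same policy can be created with the AWS SDK for Python. The following is a minimal sketch, assuming boto3 is installed and the caller has permission to create IAM policies. The helper names and the description text are illustrative, not part of the original walkthrough.

```python
import json

POLICY_NAME = "glue_interactive_policy_limit_s3"
BUCKET = "awsglue-datasets"


def build_s3_read_policy(bucket):
    """Return the read-only policy document from the console steps above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*",
                    "s3:List*",
                    "s3-object-lambda:Get*",
                    "s3-object-lambda:List*",
                ],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }


def create_policy(document, name=POLICY_NAME):
    """Create the customer managed policy and return its ARN.

    Assumption: boto3 is installed and credentials with IAM permissions
    are available. The import is deferred so the builder above can be
    used without boto3.
    """
    import boto3

    iam = boto3.client("iam")
    response = iam.create_policy(
        PolicyName=name,
        PolicyDocument=json.dumps(document),
        Description="Read-only access to the AWS Glue public datasets bucket",
    )
    return response["Policy"]["Arn"]


# Printing the document lets you compare it with the JSON shown above.
print(json.dumps(build_s3_read_policy(BUCKET), indent=4))
```

Calling create_policy(build_s3_read_policy(BUCKET)) would return the ARN of the new policy, which is needed when attaching the policy to a role.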

Create an IAM Role for AWS Glue

Next, create a role for AWS Glue with limited Amazon S3 read permissions by following these steps:

On the IAM console, select Roles in the navigation pane.
Choose Create role.
For Trusted entity type, select AWS service.
For Use cases for other AWS services, choose Glue.
Select Next.
On the Add permissions page, search for and select the AWS managed policy AWSGlueServiceRole and the customer managed policy glue_interactive_policy_limit_s3 that you created earlier.
Choose Next.
For Role name, enter glue_interactive_role.
Select Create role.
Make a note of the role’s ARN: arn:aws:iam::<account-id>:role/glue_interactive_role, where <account-id> is your AWS account ID.
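The role creation steps above can also be sketched with boto3. This is an illustrative outline rather than a definitive implementation: it assumes boto3 is installed, that the caller may create IAM roles, and that custom_policy_arn is the ARN of the glue_interactive_policy_limit_s3 policy. The function names are hypothetical.

```python
import json

ROLE_NAME = "glue_interactive_role"
# ARN of the AWS managed policy selected in the console steps.
GLUE_SERVICE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def build_glue_trust_policy():
    """Trust policy that lets the AWS Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_glue_role(custom_policy_arn, role_name=ROLE_NAME):
    """Create the role, attach both policies, and return the role ARN.

    Assumption: boto3 is installed and IAM credentials are configured.
    """
    import boto3  # deferred so the trust policy builder works without boto3

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(build_glue_trust_policy()),
    )
    for policy_arn in (GLUE_SERVICE_POLICY_ARN, custom_policy_arn):
        iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return role["Role"]["Arn"]


print(json.dumps(build_glue_trust_policy(), indent=4))
```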

Set Up Development Environment Access

The second access configuration is done in the developer’s environment. The development environment can be a desktop computer running Windows, macOS, or Linux, or a similar operating system running on the AWS Cloud via Amazon EC2. The following steps walk through the configuration for each environment.

Set Up a Desktop Computer

For desktop setups, we recommend following the steps outlined in Getting started with AWS Glue interactive sessions.

Set Up an AWS Cloud-based Computer with Amazon EC2

This configuration method adheres to best practices for granting access to cloud resources using IAM roles. For further details, see the guidelines on Using an IAM role to grant permissions to applications running on Amazon EC2 instances.

On the IAM console, select Roles in the navigation pane.
Choose Create role.
For Trusted entity type, select AWS service.
For Common use cases, select EC2.
Select Next.
Add the AWSGlueServiceRole policy to the newly created role.
On the Add permissions menu, create an inline policy that allows the instance profile role to assume glue_interactive_role, and save the new role as ec2_glue_demo.

Your new policy will now appear under Permissions policies.
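For reference, an inline policy of this kind might look like the following sketch. It assumes the EC2 role is the ec2_glue_demo role selected in the later Modify IAM role step. The inline policy name, the helper names, and the account ID 123456789012 are placeholders for illustration. Including iam:PassRole alongside sts:AssumeRole is also an assumption, based on interactive sessions needing to pass the role to AWS Glue.

```python
import json

EC2_ROLE_NAME = "ec2_glue_demo"
# Hypothetical name for the inline policy; choose any name you like.
INLINE_POLICY_NAME = "assume_glue_interactive_role"


def build_assume_role_policy(target_role_arn):
    """Policy that lets the holder assume (and pass) the target role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # iam:PassRole is an assumption, see the note above.
                "Action": ["sts:AssumeRole", "iam:PassRole"],
                "Resource": target_role_arn,
            }
        ],
    }


def attach_inline_policy(target_role_arn, role_name=EC2_ROLE_NAME):
    """Attach the inline policy to the EC2 role (requires boto3 and IAM access)."""
    import boto3  # deferred so the builder works without boto3

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=INLINE_POLICY_NAME,
        PolicyDocument=json.dumps(build_assume_role_policy(target_role_arn)),
    )


# 123456789012 is a placeholder account ID, not a real value.
example_arn = "arn:aws:iam::123456789012:role/glue_interactive_role"
print(json.dumps(build_assume_role_policy(example_arn), indent=4))
```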

In the Amazon EC2 console, right-click the instance you want to attach to the newly created role.
Select Security and choose Modify IAM role.
For IAM role, select ec2_glue_demo.
Click Save.
In the IAM console, edit the trust relationship for glue_interactive_role.
Add "AWS": ["arn:aws:iam::<account-id>:user/glue_interactive_user","arn:aws:iam::<account-id>:role/ec2_glue_demo"] to the Principal key of the trust policy JSON, replacing <account-id> with your AWS account ID.
Complete the remaining steps as detailed in Getting started with AWS Glue interactive sessions.
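The trust relationship edit can be expressed as a small transformation of the trust policy document. The sketch below assumes the existing policy has a single statement trusting the AWS Glue service. The helper names and the account ID 123456789012 are placeholders, and the boto3 call is included only as an outline of how the change might be applied.

```python
import copy
import json


def add_trusted_principals(trust_policy, principal_arns):
    """Return a copy of the trust policy with ARNs added under Principal.AWS.

    Only the first statement is modified, which matches a role created
    with a single trust statement (an assumption of this sketch).
    """
    updated = copy.deepcopy(trust_policy)
    principal = updated["Statement"][0].setdefault("Principal", {})
    existing = principal.get("AWS", [])
    if isinstance(existing, str):  # IAM may store a single ARN as a string
        existing = [existing]
    new_arns = [arn for arn in principal_arns if arn not in existing]
    principal["AWS"] = existing + new_arns
    return updated


def update_role_trust(role_name, principal_arns):
    """Fetch, extend, and save the role's trust policy (requires boto3)."""
    import boto3  # deferred so the pure helper above works without boto3

    iam = boto3.client("iam")
    current = iam.get_role(RoleName=role_name)["Role"]["AssumeRolePolicyDocument"]
    iam.update_assume_role_policy(
        RoleName=role_name,
        PolicyDocument=json.dumps(add_trusted_principals(current, principal_arns)),
    )


# Example with placeholder ARNs (123456789012 is not a real account ID).
base = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
arns = [
    "arn:aws:iam::123456789012:user/glue_interactive_user",
    "arn:aws:iam::123456789012:role/ec2_glue_demo",
]
print(json.dumps(add_trusted_principals(base, arns), indent=4))
```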

You won’t need to provide an AWS access key ID or AWS secret access key for the subsequent steps.

Integrate AWS Glue Interactive Sessions with an IDE

You are now prepared to set up and validate your PyCharm integration with AWS Glue interactive sessions.
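Before opening PyCharm, it can help to confirm that the AWS Glue kernels are registered with Jupyter. The following sketch assumes the kernels were installed as described in Getting started with AWS Glue interactive sessions and that the PySpark kernel is registered under the name glue_pyspark. That name and the helper functions are assumptions for illustration.

```python
# Assumed kernel name for the Glue PySpark kernel.
REQUIRED_KERNEL = "glue_pyspark"


def has_glue_kernel(kernel_specs, required=REQUIRED_KERNEL):
    """Return True if the required kernel name is among the installed specs.

    `kernel_specs` is a mapping of kernel name to resource directory, as
    returned by installed_kernel_specs() below.
    """
    return required in kernel_specs


def installed_kernel_specs():
    """Return the kernels known to Jupyter (requires jupyter_client)."""
    from jupyter_client.kernelspec import KernelSpecManager

    return KernelSpecManager().find_kernel_specs()


# Example with a hand-written mapping; the paths are placeholders.
example_specs = {
    "python3": "/path/to/kernels/python3",
    "glue_pyspark": "/path/to/kernels/glue_pyspark",
}
print("Glue PySpark kernel found:", has_glue_kernel(example_specs))
```

In your own environment, has_glue_kernel(installed_kernel_specs()) performs the same check against the kernels that are actually installed.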

On the welcome page, select New Project.
For Location, enter the path for your project, glue-interactive-demo.
Expand Python Interpreter.
Select Previously configured interpreter and choose your previously created virtual environment.
Click Create.

The New Project page looks slightly different depending on the operating system: on Windows, the project location is shown as a path beginning with C:\ followed by the PyCharm project location.

Right-click on the project and choose Jupyter Notebook from the New menu.
Name the notebook Validate_Glue_Interactive_Sessions.

The notebook will feature a drop-down labeled Managed Jupyter server: auto-start, indicating that the Jupyter server will automatically activate when any notebook cell is executed.

Run the following code:

print("This notebook will start the local Python kernel")

You should see that the Jupyter server has begun executing the cell.

From the Python 3 (ipykernel) drop-down, select Glue PySpark.
Execute the following code to initiate a Spark session:

spark

Wait for the confirmation message indicating that a session ID has been generated.

In a new cell, run the following boilerplate code for AWS Glue:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
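With glueContext available, a natural next validation step is to read from the public dataset bucket that the IAM policy grants access to. The following is a minimal sketch, not the original validation notebook. The object path in SAMPLE_PATH is an assumption about the contents of s3://awsglue-datasets, and the helper names are hypothetical.

```python
# Assumed sample object in the public bucket; adjust to a path you know exists.
SAMPLE_PATH = "s3://awsglue-datasets/examples/us-legislators/all/persons.json"


def s3_json_source(paths):
    """Build keyword arguments for reading JSON files from Amazon S3."""
    return {
        "connection_type": "s3",
        "connection_options": {"paths": list(paths)},
        "format": "json",
    }


def load_dynamic_frame(glue_context, paths):
    """Read the given S3 paths into a DynamicFrame using the Glue context."""
    return glue_context.create_dynamic_frame.from_options(**s3_json_source(paths))


# In a notebook cell running on the Glue PySpark kernel, you could then run:
#   dyf = load_dynamic_frame(glueContext, [SAMPLE_PATH])
#   dyf.printSchema()
print(s3_json_source([SAMPLE_PATH]))
```

Keeping the options in a separate helper makes it easy to inspect what will be requested before the session actually reads from Amazon S3.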

This guide aims to equip you with the knowledge to use AWS Glue interactive sessions effectively within PyCharm to develop AWS Glue jobs.