Amazon IXD – VGT2 Las Vegas: Engaging with Amazon Redshift using SQLAlchemy


In the realm of cloud data warehousing, Amazon Redshift stands out as a fast, scalable, and secure solution that allows for extensive data analysis. Developers have multiple options for interacting with an Amazon Redshift database, and one effective approach is through an object-relational mapping (ORM) framework. ORM provides an abstraction layer, enabling developers to write in their preferred programming languages rather than dealing with raw SQL queries. Python’s SQLAlchemy is a widely adopted ORM framework that facilitates communication between Python applications and various databases.

A SQLAlchemy dialect is the mechanism SQLAlchemy uses to communicate with a particular DBAPI implementation and database. Historically, the Amazon Redshift dialect relied on psycopg2, a PostgreSQL connector. However, due to its limitations in supporting specific Amazon Redshift functionalities, such as AWS Identity and Access Management (IAM) authentication and unique data types like SUPER and GEOMETRY, a new approach has emerged. The latest Amazon Redshift SQLAlchemy dialect uses the redshift_connector driver, which connects securely to Amazon Redshift while natively supporting IAM authentication and single sign-on (SSO). It also accommodates Amazon Redshift-specific data types such as SUPER, GEOMETRY, TIMESTAMPTZ, and TIMETZ.

This article explores how to effectively interact with Amazon Redshift using the new SQLAlchemy dialect. We will illustrate how to connect securely via Okta and perform various data manipulation and definition language operations. With the new dialect, users can leverage the connection options provided by redshift_connector, including IAM and identity provider (IdP) plugins. Furthermore, we will highlight the support for IPython SqlMagic, which provides a user-friendly way to execute interactive SQL queries directly from a Jupyter notebook.
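As a sketch of that SqlMagic workflow: with the ipython-sql extension installed, a notebook cell can load the extension, bind it to the cluster via the same redshift+redshift_connector URL used later in this article (the host, credentials, and cluster identifier below are placeholders), and run SQL inline:

```
%load_ext sql
%sql redshift+redshift_connector://awsuser:<pwd>@<clusterid>.xxxxxx.<aws-region>.redshift.amazonaws.com:5439/dev
%sql SELECT current_user;
```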

Prerequisites

Before diving into the implementation, ensure you have the following:

Getting Started with the Amazon Redshift SQLAlchemy Dialect

To begin using the Amazon Redshift SQLAlchemy dialect, you can easily install the sqlalchemy-redshift library via pip. For demonstration, we’ll use a Jupyter notebook. Follow these steps:

  1. Navigate to the Amazon SageMaker console and select Notebook instances.
  2. Create a notebook instance (we’ll call it redshift-sqlalchemy for this guide).
  3. Open your Jupyter notebook instance and create a new conda_python3 Jupyter notebook.
  4. Execute the following commands to install the necessary libraries:
pip install sqlalchemy-redshift
pip install redshift_connector

The redshift_connector offers a variety of connection options that can be tailored to your needs. For more details, refer to the Connection Parameters documentation.

Connecting to Your Amazon Redshift Cluster

We will connect to the Amazon Redshift cluster using two methods: Okta SSO federation and direct connection with a database username and password.

Connecting with Okta SSO Federation

Before starting, ensure that your Amazon Redshift application is set up in Okta. To connect to the Amazon Redshift cluster, we will utilize the create_engine function from SQLAlchemy. This function generates an engine object based on a URL. The sqlalchemy-redshift package provides a custom interface for creating an RFC-1738 compliant URL for connecting to an Amazon Redshift cluster.

Here’s how to construct the SQLAlchemy URL:

import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy import orm as sa_orm
from sqlalchemy_redshift.dialect import SUPER, TIMESTAMPTZ, TIMETZ

# Build the sqlalchemy URL. No need to specify host and port for IAM authentication.
url = URL.create(
    drivername='redshift+redshift_connector',
    database='dev',
    username='johnd@example.com',  # Okta username
    password='<PWD>'                # Okta password
)

# Connection parameters dictionary
conn_params = {
    "iam": True,
    "credentials_provider": "OktaCredentialsProvider",
    "idp_host": "<prefix>.okta.com",
    "app_id": "<appid>",
    "app_name": "amazon_aws_redshift",
    "region": "<region>",
    "cluster_identifier": "<clusterid>",
    "ssl_insecure": False,
}

engine = sa.create_engine(url, connect_args=conn_params)
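As an aside, URL.create renders an RFC-1738 compliant connection string. A minimal stdlib-only sketch of the same composition may clarify what the engine receives (the function name and placeholder values here are ours, not part of either package; note how special characters in credentials are percent-encoded):

```python
from urllib.parse import quote


def build_redshift_url(user, password, database, host=None, port=None):
    """Compose an RFC-1738 style URL for the redshift+redshift_connector
    drivername, percent-encoding the credentials. Host and port are
    optional because, with IAM authentication, redshift_connector can
    resolve the endpoint from the cluster identifier instead."""
    auth = f"{quote(user, safe='')}:{quote(password, safe='')}"
    location = f"{host}:{port}" if host else ""
    return f"redshift+redshift_connector://{auth}@{location}/{database}"


# '@' and '/' in the password are escaped so the URL stays parseable
print(build_redshift_url('awsuser', 'p@ss/word', 'dev',
                         host='example.redshift.amazonaws.com', port=5439))
```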

Connecting with an Amazon Redshift Database User and Password

Alternatively, you can connect using your database username and password. Construct the URL as follows:

import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy import orm as sa_orm

# Build the sqlalchemy URL
url = URL.create(
    drivername='redshift+redshift_connector',
    host='<clusterid>.xxxxxx.<aws-region>.redshift.amazonaws.com',
    port=5439,
    database='dev',
    username='awsuser',
    password='<pwd>'
)

engine = sa.create_engine(url)

Now, create a session with the established engine:

Session = sa_orm.sessionmaker()
Session.configure(bind=engine)
session = Session()

# Define session-based metadata
metadata = sa.MetaData(bind=session.bind)

Creating a Database Table with Amazon Redshift Data Types

The new Amazon Redshift SQLAlchemy dialect allows for the creation of tables using Amazon Redshift-specific data types such as SUPER, GEOMETRY, TIMESTAMPTZ, and TIMETZ. Let’s see how to create a table utilizing these data types:

import datetime
import uuid
import random

table_name = 'product_clickstream_tz'

RedshiftDBTable = sa.Table(
    table_name,
    metadata,
    sa.Column('session_id', sa.VARCHAR(80)),
    sa.Column('click_region', sa.VARCHAR(100), redshift_encode='lzo'),
    sa.Column('product_id', sa.BIGINT),
    sa.Column('click_datetime', TIMESTAMPTZ),
    sa.Column('stream_time', TIMETZ),
    sa.Column('order_detail', SUPER),
    redshift_diststyle='KEY',
    redshift_distkey='session_id',
    redshift_sortkey='click_datetime'
)

# Drop the table if it already exists
if sa.inspect(engine).has_table(table_name):
    RedshiftDBTable.drop(bind=engine)

# Create the table
RedshiftDBTable.create(bind=engine)
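With the table in place, the datetime, uuid, and random imports above can be put to work generating sample rows for insertion. The helper below is our own sketch, not part of the dialect: it builds rows matching the product_clickstream_tz schema, passing the SUPER column as a JSON string, which is one way redshift_connector accepts semi-structured values.

```python
import datetime
import json
import random
import uuid


def make_clickstream_rows(n):
    """Generate n sample rows matching the product_clickstream_tz schema.
    The order_detail SUPER column is serialized as a JSON string."""
    now = datetime.datetime.now(datetime.timezone.utc)
    rows = []
    for _ in range(n):
        rows.append({
            'session_id': str(uuid.uuid4()),
            'click_region': random.choice(['us-east-1', 'us-west-2', 'eu-west-1']),
            'product_id': random.randint(1, 10_000),
            'click_datetime': now,          # TIMESTAMPTZ: timezone-aware datetime
            'stream_time': now.timetz(),    # TIMETZ: timezone-aware time
            'order_detail': json.dumps({'qty': random.randint(1, 5),
                                        'gift_wrap': False}),
        })
    return rows


# Against a live cluster, the rows could then be inserted via the session:
# session.execute(RedshiftDBTable.insert(), make_clickstream_rows(5))
# session.commit()
```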

By following the steps outlined above, you can securely connect to Amazon Redshift and work with its native data types using the new SQLAlchemy dialect. For further exploration, refer to the sqlalchemy-redshift and redshift_connector project documentation.

Location:

Amazon IXD – VGT2
6401 E Howdy Wells Ave,
Las Vegas, NV 89115

