Extracting Real-Time Oracle OLTP Data with GoldenGate for Queries via Amazon Athena

This article explores how to improve efficiency and reduce costs by offloading reporting workloads from an online transaction processing (OLTP) database to Amazon Athena and Amazon S3. The architecture described here lets you build a reporting system in which data is available for querying as soon as it arrives. This solution involves:

  • Utilizing Oracle GoldenGate to create a new row on the target for each change occurring in the source, thus generating Slowly Changing Dimension Type 2 (SCD Type 2) data.
  • Employing Athena to execute ad hoc queries on the SCD Type 2 data.
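To illustrate the SCD Type 2 idea, here is a minimal Python sketch (a toy illustration, not GoldenGate itself; the record fields and helper name are assumptions) that appends a new row for every change instead of updating in place, preserving the full history of each key:

```python
def to_scd2_row(change: dict) -> dict:
    """Turn one CDC change record into an append-only SCD Type 2 row.

    Instead of updating the existing row in place, every insert or update
    becomes a *new* row carrying the operation type and timestamp, so the
    full history of each key is preserved for reporting queries.
    """
    return {
        **change["after"],              # column values after the change
        "op_type": change["op_type"],   # I = insert, U = update
        "op_ts": change["op_ts"],       # when the change happened on the source
    }

# A hypothetical stream of changes for one stock trade:
changes = [
    {"op_type": "I", "op_ts": "2024-01-01 09:00:00",
     "after": {"trade_id": 1, "price": 100.0}},
    {"op_type": "U", "op_ts": "2024-01-01 09:05:00",
     "after": {"trade_id": 1, "price": 101.5}},
]

history = [to_scd2_row(c) for c in changes]
# The update does not overwrite the insert; both rows survive,
# which is exactly the SCD Type 2 shape Athena queries run against.
```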

Modern Reporting Solution Principles

Contemporary database solutions adhere to several principles to establish cost-effective reporting systems, including:

  • Separation of Reporting and OLTP Activities: This strategy ensures resource isolation, allowing databases to effectively scale according to their respective workloads.
  • Utilization of Query Engines on Distributed File Systems: Implementing query engines that operate on open-source HDFS and cloud object storage like Amazon S3 significantly lowers the costs associated with dedicated reporting systems.

Additionally, consider these principles when developing reporting solutions:

  • Transition reporting activities to an open-source database to minimize licensing costs associated with commercial databases.
  • Implement a log-based, real-time change data capture (CDC) solution for data integration, allowing replication of OLTP data from source systems, ideally in real-time, to maintain an up-to-date view of the data.

Prerequisites

If you are leveraging GoldenGate with Kafka and contemplating cloud migration, this article will be beneficial. Prior familiarity with GoldenGate is assumed, as this post does not cover installation and configuration steps. Basic knowledge of Java and Maven is also expected. Ensure you have a VPC with three subnets available for deployment.

Understanding the Solution Architecture

The following workflow diagram illustrates the solution described in this post:

  • Amazon RDS for Oracle serves as the data source.
  • A GoldenGate CDC solution streams data to Amazon Managed Streaming for Apache Kafka (Amazon MSK), where the changes are delivered to consumers.
  • The Apache Flink application operating on Amazon EMR processes the incoming data and stores it in an S3 bucket.
  • Athena facilitates data analysis through queries. Optionally, queries can also be executed from Amazon Redshift Spectrum.

Data Pipeline Overview

Amazon MSK is a fully managed service for Apache Kafka that simplifies the process of provisioning Kafka clusters without the need for extensive manual configuration. Amazon RDS for Oracle is a fully managed database, which alleviates the burden of database administration tasks such as provisioning and backups.

GoldenGate serves as a real-time, log-based, heterogeneous database CDC solution that can replicate data from any supported database to various target systems, including big data platforms like Kafka. Its ability to capture transactional data in multiple formats—such as delimited text, JSON, and Avro—ensures compatibility with various BI tools.
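As an illustration of the JSON output, the snippet below parses a change record shaped like those emitted by the GoldenGate for Big Data Kafka handler (fields such as `table`, `op_type`, `op_ts`, `pos`, and `after` follow that layout; the payload values themselves are made up for this example):

```python
import json

# A sample change record in the JSON layout produced by the GoldenGate
# Kafka handler: operation metadata plus the row image after the change.
message = '''{
  "table": "SOURCE.STOCK_TRADES",
  "op_type": "U",
  "op_ts": "2024-01-01 09:05:00.000000",
  "current_ts": "2024-01-01T09:05:01.000000",
  "pos": "00000000020030005456",
  "after": {"TRADE_ID": 1, "SYMBOL": "AMZN", "PRICE": 101.5}
}'''

record = json.loads(message)
schema, table = record["table"].split(".")
# Flatten the row image plus operation metadata into one output row:
row = {**record["after"], "OP_TYPE": record["op_type"], "OP_TS": record["op_ts"]}
```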

Flink is an open-source stream-processing framework designed for stateful computations over diverse data streams. It supports exactly-once semantics with its checkpointing feature, crucial for maintaining data accuracy during database CDC processing.
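Flink itself runs on the JVM, but the effect of its exactly-once guarantee can be sketched in a toy Python example: even when an at-least-once source redelivers records after a failure, deduplicating on a monotonically increasing position (analogous to the `pos` field in GoldenGate change records; the names here are assumptions) keeps the sink output correct:

```python
def process_exactly_once(records):
    """Consume possibly re-delivered CDC records, emitting each one once.

    Flink achieves this with checkpointed operator state; here a simple
    in-memory set of already-seen positions stands in for that state.
    """
    seen = set()
    output = []
    for rec in records:
        if rec["pos"] in seen:      # duplicate from a replay -- skip it
            continue
        seen.add(rec["pos"])
        output.append(rec)
    return output

# Simulate a source that replays record 2 after a restart:
stream = [{"pos": 1, "v": "a"}, {"pos": 2, "v": "b"},
          {"pos": 2, "v": "b"}, {"pos": 3, "v": "c"}]
deduped = process_exactly_once(stream)
# Despite the redelivery, each position reaches the sink exactly once.
```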

S3 is an object storage service renowned for its high scalability and performance. Coupled with AWS query-in-place services like Athena, it enables efficient big data analytics.
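A common way to keep the S3 data cheap to query from Athena is to partition object keys by date, so the engine scans only the days a query touches. The sketch below (a hypothetical key layout, not prescribed by the original pipeline) derives such a key from a record's operation timestamp:

```python
from datetime import datetime

def s3_key_for(record: dict, prefix: str = "stock_trades") -> str:
    """Build a date-partitioned S3 object key from a CDC record.

    Keys like stock_trades/dt=2024-01-01/... let Athena prune partitions
    and read only the data relevant to a query's date range.
    """
    ts = datetime.strptime(record["op_ts"], "%Y-%m-%d %H:%M:%S")
    return f"{prefix}/dt={ts:%Y-%m-%d}/{record['pos']}.json"

key = s3_key_for({"op_ts": "2024-01-01 09:05:00", "pos": "0000020030005456"})
```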

The detailed view of the data pipeline includes:

  • RDS for Oracle running in a single Availability Zone.
  • GoldenGate operating on an Amazon EC2 instance.
  • The MSK cluster distributed across three Availability Zones.
  • Kafka topics configured within the MSK.
  • Flink executing on an EMR Cluster.
  • Security groups for the Oracle DB and GoldenGate instance, as well as for EMR with Flink.
  • A gateway endpoint for private S3 access.
  • A NAT Gateway for downloading necessary software components on the GoldenGate instance.
  • S3 bucket and Athena for querying.

For simplicity, this setup utilizes a single VPC with multiple subnets to deploy resources.

Automated Deployment with AWS CloudFormation

The AWS CloudFormation template provided in this article automates the end-to-end solution deployment described here. This template provisions all necessary resources, including RDS for Oracle, MSK, EMR, and an S3 bucket, and adds an EMR step to consume messages from the Kafka topic on MSK. Here’s how to launch the template and test the solution:

  1. Launch the AWS CloudFormation template in the us-east-1 region.
  2. After the stack is created, retrieve the public IP of the GoldenGate Hub Server from the Outputs tab of CloudFormation.
  3. Access the GoldenGate hub server using the public IP as ec2-user, then switch to the oracle user.
  4. Connect to the source RDS for Oracle database with the sqlplus client and provide the password (source).
  5. Generate database transactions using the SQL scripts located in the oracle user’s home directory:

     SQL> @s
     SQL> @s1
     SQL> @s2

  6. Query the STOCK_TRADES table from the Amazon Athena console. Note that there may be a slight delay after committing transactions on the source database before changes become available for querying in Athena.

Manual Component Deployment

The steps below outline the configurations necessary to stream Oracle-changed data to MSK and subsequently sink it to an S3 bucket via Flink on EMR. You can then utilize Athena to query the S3 bucket. If you’ve deployed the solution using AWS CloudFormation, skip to testing the solution.

