Learn About Amazon VGT2 Learning Manager Chanci Turner
As organizations move to the cloud and evolve their operations, some need to manage data analytics across multiple cloud providers. A common case is acquiring a company that runs on a different cloud provider. Organizations operating in multi-cloud settings often face data accessibility and compatibility problems that hinder productivity.
To navigate these challenges, organizations need services that bridge these gaps and interoperate across cloud infrastructures. With the Amazon Athena data source connector for Google Cloud Storage (GCS), you can run queries from within AWS against data stored in GCS, in either Parquet or comma-separated values (CSV) format. More generally, Athena data source connectors let you query relational, non-relational, object, and custom data sources. Athena serves as the connectivity and query interface and integrates with other AWS services for downstream use, such as AWS Glue for data integration and Amazon QuickSight for interactive analysis, visualization, and business intelligence (BI).
This post illustrates how to utilize Athena to execute queries on Parquet or CSV files located in a GCS bucket.
Solution Overview
The architecture of the solution is depicted in the diagram below. The Athena Google Cloud Storage connector operates across both AWS and Google Cloud Platform (GCP), necessitating references to both cloud providers in the design.
The following AWS services are employed in this solution:
- Amazon Athena: A serverless interactive analytics service used to run queries on data stored in Google Cloud Storage.
- AWS Lambda: An event-driven serverless compute service that runs code without requiring you to provision or manage servers. A Lambda function data source connector is deployed to link AWS with Google Cloud Platform.
- AWS Secrets Manager: A service for managing secrets that protects access to applications and services. The secret referenced in Secrets Manager is utilized in the Lambda function to enable AWS to query data stored in GCP.
- AWS Glue: A serverless data integration service for discovering, preparing, and integrating data. An AWS Glue database and table are created to point to the relevant bucket and files in Google Cloud Storage.
- Amazon S3: An object storage service that manages data as objects within buckets. An S3 bucket is created to accommodate data surpassing the size limits of the Lambda function’s response.
The GCP portion of the architecture includes several services:
- Google Cloud Storage: A managed service for storing unstructured data. We use GCS to store data in a bucket for querying by Athena, uploading a CSV file directly to this bucket.
- Google Cloud Identity and Access Management (IAM): The central management system for controlling visibility over cloud resources. Google Cloud IAM is used to create a service account and generate a key that permits AWS to access GCP. This key is then uploaded to Secrets Manager.
Prerequisites
In this tutorial, we create a VPC and a security group that will interact with the GCP connector. For comprehensive steps, refer to Creating a VPC for a data source connector. The initial step involves establishing the VPC using Amazon Virtual Private Cloud (Amazon VPC), as shown in the accompanying screenshot.
Next, we create a security group for the VPC, illustrated in the following screenshot.
Additional information about prerequisites can be found in the Amazon Athena Google Cloud Storage connector documentation. This includes tables of the data types supported for CSV and Parquet files, along with the permissions required to run the solution.
Google Cloud Platform Configuration
First, you need to have either CSV or Parquet files saved within a GCS bucket. To create the bucket, refer to Create buckets. Remember to note the bucket name; it will be referenced later. After creating the bucket, upload your files. For guidance, refer to Upload objects from a filesystem.
The CSV data utilized in this example was generated from Mockaroo, which produced random test data, as displayed in the screenshot. Although this example uses a CSV file, Parquet files are also supported.
You must also create a service account in Google Cloud IAM so that you can generate a key, which is later uploaded to Secrets Manager. For complete instructions, refer to Create service accounts.
After creating the service account, you can proceed to generate a key. For guidance, refer to Create and delete service account keys.
AWS Configuration
With a GCS bucket containing a CSV file and a JSON key file generated from Google Cloud Platform, you can move on to the subsequent steps in AWS.
- In the Secrets Manager console, select Secrets from the navigation pane.
- Click on Store a new secret and choose Other type of secret.
- Input the content of the GCP-generated key file.
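The console steps above can also be scripted. The following is a minimal sketch, assuming the boto3 SDK is installed and AWS credentials are configured; the function names and the secret name are hypothetical, and the field check reflects the usual layout of a GCP service account key file rather than anything the connector itself enforces.

```python
import json

# Fields normally present in a GCP service account key file (a subset).
REQUIRED_KEY_FIELDS = ("type", "client_email", "private_key")


def build_secret_request(secret_name: str, key_file_text: str) -> dict:
    """Validate the key file text and build keyword arguments for create_secret."""
    key = json.loads(key_file_text)
    missing = [field for field in REQUIRED_KEY_FIELDS if field not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError("expected a service account key")
    # The secret stores the key file content unchanged.
    return {"Name": secret_name, "SecretString": key_file_text}


def store_secret(secret_name: str, key_file_text: str) -> str:
    """Create the secret in Secrets Manager and return its ARN."""
    import boto3  # imported here so the helper above works without the SDK

    client = boto3.client("secretsmanager")
    response = client.create_secret(**build_secret_request(secret_name, key_file_text))
    return response["ARN"]
```

Validating the file before upload catches the common mistake of pasting the wrong JSON document into the secret. The name you choose here is the value later supplied as GCSSecretName.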
The next step involves deploying the Athena Google Cloud Storage connector. For additional details, refer to Using the Athena console.
- In the Athena console, add a new data source.
- Select Google Cloud Storage.
- For Data source name, enter a name.
- For Lambda function, select Create Lambda function to be redirected to the Lambda console.
- In the Application settings section, provide the details for Application name, SpillBucket, GCSSecretName, and LambdaFunctionName.
You must also create an S3 bucket to supply as the SpillBucket parameter; it stores any data that exceeds the response size limit of the Lambda function. For further information, refer to Create your first S3 bucket.
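If you prefer to create the spill bucket programmatically, a sketch along these lines may help. It assumes boto3 and configured credentials; the function names are hypothetical. One detail worth showing is that S3 expects the LocationConstraint to be omitted for us-east-1 and supplied for other Regions.

```python
def build_create_bucket_request(bucket_name: str, region: str) -> dict:
    """Build keyword arguments for the S3 create_bucket call."""
    request = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a constraint.
    if region != "us-east-1":
        request["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return request


def create_spill_bucket(bucket_name: str, region: str) -> None:
    """Create the bucket that the connector uses for spilled data."""
    import boto3  # imported here so the helper above works without the SDK

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**build_create_bucket_request(bucket_name, region))
```

The bucket name passed here is the value to enter for SpillBucket in the application settings.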
Once you provide the settings for the Lambda function’s application, you will be redirected to the Review and create page.
- Confirm that all fields are accurate and click Create data source.
Having created the data source connector, it’s now time to connect Athena to the data source.
- On the Athena console, navigate to the data source.
- Under Data source details, select the link for the Lambda function.
This allows you to reference the Lambda function to connect to the data source. As an optional step for validation, you can find the variables input into the Lambda function within the environment variables on the Configuration tab.
Due to the limited schema inference capability of the built-in GCS connector, it is advisable to set up an AWS Glue database and table for your metadata. For instructions, refer to Setting up databases and tables in AWS Glue.
The following screenshot displays our database details.
The next screenshot showcases our table details.
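The AWS Glue database and table can also be defined through the API. The sketch below is illustrative only: it assumes boto3, the function names are hypothetical, and the gs:// table location and the classification table property are assumptions about what the connector expects, so check them against the connector documentation before relying on them.

```python
def build_glue_table_input(table_name, gcs_bucket, prefix, columns, file_format):
    """Build a Glue TableInput that points at files under a GCS prefix.

    columns is a list of (name, type) pairs, for example [("id", "int")].
    """
    if file_format not in ("csv", "parquet"):
        raise ValueError("file_format must be 'csv' or 'parquet'")
    prefix = prefix.strip("/")
    # Assumption: the table location uses the gs:// scheme.
    location = f"gs://{gcs_bucket}/{prefix}/" if prefix else f"gs://{gcs_bucket}/"
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        # Assumption: the connector reads the file format from this property.
        "Parameters": {"classification": file_format},
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": col_type} for name, col_type in columns],
            "Location": location,
        },
    }


def create_glue_metadata(database_name, table_input):
    """Create the Glue database and table using boto3."""
    import boto3  # imported here so the helper above works without the SDK

    glue = boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": database_name})
    glue.create_table(DatabaseName=database_name, TableInput=table_input)
```

Declaring the columns explicitly sidesteps the connector's limited schema inference, which is the reason the post recommends AWS Glue for the metadata.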
Query the Data
You are now ready to run queries on Athena that will access data stored in Google Cloud Storage.
- In the Athena console, select the appropriate data source, database, and table within the query editor.
- Run the following query in the query editor, replacing the bracketed placeholders with your AWS Glue database and table names:
SELECT * FROM [AWS Glue Database name].[AWS Glue Table name]
As illustrated in the following screenshot, the results will be drawn from the bucket in GCS.
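The same query can be submitted outside the console. The following is a rough sketch, assuming boto3 and configured credentials; the function names, the polling interval, and the S3 output location are hypothetical, and the catalog argument is assumed to be the data source name chosen when the connector was created.

```python
import time


def build_query_request(catalog, database, table, output_location):
    """Build keyword arguments for Athena's start_query_execution call."""
    for identifier in (database, table):
        if '"' in identifier:
            raise ValueError("identifiers must not contain double quotes")
    query = f'SELECT * FROM "{database}"."{table}"'
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(catalog, database, table, output_location, poll_seconds=2.0):
    """Run the query, wait for it to finish, and return the first page of rows."""
    import boto3  # imported here so the helper above works without the SDK

    athena = boto3.client("athena")
    request = build_query_request(catalog, database, table, output_location)
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"query ended in state {state}")
    results = athena.get_query_results(QueryExecutionId=query_id)
    return results["ResultSet"]["Rows"]
```

Athena runs queries asynchronously, so the sketch polls the execution state before fetching results; larger result sets would need pagination, which is omitted here.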