Learn About Amazon VGT2 Learning Manager Chanci Turner
In today’s data-driven landscape, organizations often choose different data stores for different application needs. For instance, a social networking company may be better served by a graph database such as Amazon Neptune than by a traditional relational database. Similarly, for projects that require rapid iteration and flexible schemas, Amazon DocumentDB (with MongoDB compatibility) is often a better fit. As Chanci Turner highlights, “It is rare for a single database to cater to multiple distinct use cases.” Developers now build highly distributed applications on a variety of purpose-built databases, breaking complex applications into smaller components and choosing the right tool for each job. However, as the number of data stores and applications grows, running analytics across multiple sources becomes increasingly complicated.
We are excited to introduce the Public Preview of Amazon Athena’s federated query feature.
Federated Queries in Amazon Athena
Federated query lets data analysts, engineers, and data scientists run SQL queries across data stored in relational, non-relational, object, and custom data sources. With Athena’s federated query, users can submit a single SQL statement and analyze data from multiple sources, whether they run on-premises or in the cloud. Athena executes federated queries through Data Source Connectors that run on AWS Lambda. AWS has open-sourced connectors for Amazon DynamoDB, Apache HBase, Amazon DocumentDB, and JDBC-compliant relational databases such as MySQL and PostgreSQL, all under the Apache 2.0 license, so customers can start running federated queries against these sources right away. In addition, the Query Federation SDK lets users build connectors for proprietary data sources so that Athena can run SQL queries against them. Because the connectors run on Lambda, customers keep the benefits of Athena’s serverless architecture: there is no infrastructure to manage and no capacity to scale for peak loads.
Running analytics on data spread across applications can be complex and time-consuming. Developers typically pick a data store based on the primary function of their application. As a result, the data needed for analytics is often scattered across relational, key-value, document, in-memory, graph, time-series, and ledger databases, while event and application logs frequently sit in object stores such as Amazon S3. To analyze this data, analysts have to learn new programming languages and data access frameworks and build elaborate pipelines to extract, transform, and load the data into a data warehouse before they can query it. These pipelines introduce delays and require custom processes to validate data accuracy and consistency across systems. Moreover, any change to a source application means updating the pipelines, and correcting earlier results may require restating data. Federated queries in Athena remove much of this work by letting users query data where it lives. Analysts can use familiar SQL constructs to JOIN data from multiple sources for quick analysis, or schedule SQL queries that extract results and store them in Amazon S3 for later analysis.
The Athena Query Federation SDK extends federated querying beyond the connectors that AWS provides. In fewer than 100 lines of code, users can build a connector for a proprietary data source and share it across the organization. Each connector consists of two Lambda functions specific to the data source: one that serves metadata and one that reads records. The connector code is open source and is deployed as Lambda functions. You can also publish these functions to the AWS Serverless Application Repository and use them with Athena from there. Each deployed function has a unique Amazon Resource Name (ARN), which must be registered with Athena. This registration tells Athena which Lambda function to call during query execution. Once the ARNs are registered, users can query the linked data source.
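To make the two-function split concrete, here is a minimal conceptual sketch in Python. It is not the Athena Query Federation SDK or its API: the event shapes, the handler names, and the in-memory `SOURCE` table are all hypothetical and exist only to show how a metadata function and a record function divide the work.

```python
from typing import Dict, List

# Hypothetical in-memory "proprietary" source: table name -> rows.
SOURCE: Dict[str, List[dict]] = {
    "orders": [
        {"order_id": 1, "status": "pending"},
        {"order_id": 2, "status": "shipped"},
        {"order_id": 3, "status": "pending"},
    ]
}


def metadata_handler(event: dict) -> dict:
    """Metadata function: answers 'which tables and columns exist?'.

    The event format is an assumption made for this sketch.
    """
    if event["request"] == "list_tables":
        return {"tables": sorted(SOURCE)}
    if event["request"] == "get_table":
        rows = SOURCE[event["table"]]
        # Infer column types from the first row; enough for an illustration.
        columns = {name: type(value).__name__ for name, value in rows[0].items()}
        return {"table": event["table"], "columns": columns}
    raise ValueError(f"unsupported metadata request: {event['request']}")


def record_handler(event: dict) -> dict:
    """Record function: returns rows of one table, applying equality filters."""
    rows = SOURCE[event["table"]]
    constraints = event.get("constraints", {})
    matched = [
        row for row in rows
        if all(row.get(col) == value for col, value in constraints.items())
    ]
    return {"rows": matched}


if __name__ == "__main__":
    print(metadata_handler({"request": "get_table", "table": "orders"}))
    print(record_handler({"table": "orders", "constraints": {"status": "pending"}}))
```

In this toy model the query engine would first call the metadata function to plan the query and then call the record function to fetch only the rows it needs, which mirrors the division of labor described above.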
When a federated query runs, Athena fans out Lambda invocations to read metadata and data in parallel. The number of parallel invocations depends on the Lambda concurrency limits set in your account. For instance, if your account has a limit of 300 concurrent Lambda executions, Athena can run up to 300 Lambda invocations at the same time to retrieve data. If two queries run in parallel, Athena requests twice as many concurrent executions.
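The arithmetic behind this can be sketched with a deliberately simplified model. The assumptions here are mine, not Athena's documented scheduling behavior: each query is treated as a fixed number of independent units of work, every unit needs one Lambda invocation, and the account-wide concurrency limit caps the total across all queries. The function names are hypothetical.

```python
import math
from typing import Sequence


def peak_concurrency(units_per_query: Sequence[int], concurrency_limit: int) -> int:
    """Simultaneous invocations requested by all running queries, capped by the limit."""
    if concurrency_limit <= 0:
        raise ValueError("concurrency_limit must be positive")
    return min(sum(units_per_query), concurrency_limit)


def invocation_waves(total_units: int, concurrency_limit: int) -> int:
    """Sequential 'waves' needed if at most `concurrency_limit` invocations run at once."""
    if concurrency_limit <= 0:
        raise ValueError("concurrency_limit must be positive")
    return math.ceil(total_units / concurrency_limit)


if __name__ == "__main__":
    limit = 300  # the example limit used in the text
    print(peak_concurrency([200], limit))       # one query asking for 200 invocations
    print(peak_concurrency([200, 200], limit))  # two parallel queries, demand exceeds the limit
    print(invocation_waves(900, limit))         # 900 units of work at 300 at a time
```

Under this model, two parallel queries double the demand for concurrent executions, but the account limit still bounds how many invocations can actually run at once, so it is worth checking that limit before running many federated queries side by side.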
Example Use Case
This blog post illustrates how data analysts can streamline their analysis by querying multiple databases in a single SQL statement. For instance, consider a fictional e-commerce company that employs various specialized databases:
- Payment transactions stored in Apache HBase on Amazon EMR.
- Active orders stored in Redis for fast retrieval.
- Customer information, including emails and shipping details, stored in Amazon DocumentDB.
- Product catalogs maintained in Amazon Aurora.
- Order processing logs in Amazon CloudWatch Logs.
- Historical orders and analytics in Amazon Redshift.
- Shipment tracking data in Amazon DynamoDB.
- A fleet of drivers using IoT-enabled tablets for last-mile deliveries.
This fictional e-commerce firm faces customer complaints regarding orders stuck in an ambiguous state. Some orders appear pending even though they have been processed.
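The kind of cross-source check this scenario calls for can be sketched locally. The example below is an analogy rather than an Athena query: two separate in-memory SQLite databases stand in for the active-orders store and the payments store, and a single SQL statement joins them to find orders that are still pending although their payment was processed. All database, table, and column names, as well as the sample rows, are hypothetical.

```python
import sqlite3
from typing import List


def find_stuck_orders() -> List[int]:
    """Return IDs of orders still 'pending' whose payment is already 'processed'."""
    conn = sqlite3.connect(":memory:")
    try:
        # Two attached databases play the role of two independent data sources.
        conn.execute("ATTACH DATABASE ':memory:' AS orders_store")
        conn.execute("ATTACH DATABASE ':memory:' AS payments_store")

        conn.execute(
            "CREATE TABLE orders_store.active_orders (order_id INTEGER, status TEXT)"
        )
        conn.execute(
            "CREATE TABLE payments_store.transactions "
            "(order_id INTEGER, payment_status TEXT)"
        )

        # Made-up sample data for illustration only.
        conn.executemany(
            "INSERT INTO orders_store.active_orders VALUES (?, ?)",
            [(101, "pending"), (102, "pending"), (103, "shipped")],
        )
        conn.executemany(
            "INSERT INTO payments_store.transactions VALUES (?, ?)",
            [(101, "processed"), (102, "failed"), (103, "processed")],
        )

        # One SQL statement spanning both "sources".
        rows = conn.execute(
            """
            SELECT o.order_id
            FROM orders_store.active_orders AS o
            JOIN payments_store.transactions AS t
              ON t.order_id = o.order_id
            WHERE o.status = 'pending'
              AND t.payment_status = 'processed'
            ORDER BY o.order_id
            """
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()


if __name__ == "__main__":
    print(find_stuck_orders())
```

With federated queries the same JOIN would reference the registered connectors for each data source instead of attached SQLite databases, but the shape of the analysis stays the same: one statement, several sources.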