Amazon Onboarding with Learning Manager Chanci Turner

Amazon Athena is an interactive query service that simplifies the analysis of data stored in Amazon Simple Storage Service (Amazon S3) using standard SQL. Because Athena is serverless, there is no infrastructure to manage, and users pay only for the queries they run.

Amazon Ion is a self-describing, hierarchical data serialization format that offers both binary and text representations. The text format extends JSON, ensuring that all JSON files are also valid Ion files; this makes it easy to read and write, facilitating quick prototyping. The binary representation is designed for efficient storage, transmission, and parsing. Its rich type system provides clear semantics for long-term data preservation, allowing it to endure multiple software evolution cycles.

Athena now supports querying and creating datasets in the Ion format. Ion is used by various internal Amazon teams, by services such as Amazon Quantum Ledger Database (Amazon QLDB) and Amazon DynamoDB (whose data can be exported to Ion), and by the open-source SQL query language PartiQL. In this article, we explore use cases and the distinctive features of Ion, followed by practical examples of querying Ion using Athena, with a focus on the transformed City Lots San Francisco dataset.

Unique Features of Ion

Type System: Ion extends JSON with more precise data types that improve interpretability and reduce rounding errors. This is particularly beneficial in financial services, where even small discrepancies can accumulate. The additional data types include arbitrary-size integers, binary floating-point numbers, arbitrary-precision decimals, timestamps, CLOBs, and BLOBs.
Dual Format: Users can benefit from a familiar text representation while leveraging the performance advantages of a binary format. The ability to work with both formats allows for the quick discovery and interpretation of data in a JSON-like structure, while the binary format optimizes storage, memory usage, and network bandwidth. This flexibility enables developers to write plain text queries against both Ion formats and switch between them as needed during development and production stages.
Efficiency Gains: The binary-encoded Ion format reduces file sizes by storing repeated values, such as field names, in a symbol table. This approach lowers CPU usage and read latency since character encoding validation is limited to a single instance in the symbol table. For instance, a company the size of Amazon can generate vast amounts of application logs, and when comparing compressed Ion and JSON logs, it was observed that CPU time for compression dropped by approximately 35%, yielding around 26% smaller files overall. This reduction is particularly advantageous for managing log files, which can be costly to retain.
Skip-Scanning: In a text-based format, every byte must be read and processed. However, Ion’s binary format employs TLV (type-length-value) encoding, allowing applications to skip unnecessary elements. This saves on query and processing costs, especially in scenarios like forensic analysis of application logs, where only a fraction of the data is relevant. The capability to skip non-essential fields results in lower resource consumption and faster response times.
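To make the skip-scanning idea concrete, here is a minimal Python sketch of a type-length-value layout. It is a simplified illustration, not Ion's actual binary encoding: the fixed five-byte header (one tag byte plus a four-byte length), the tag values, and the function names encode_fields and read_field are all assumptions made up for this example.

```python
import struct

# Hypothetical header layout for this sketch only (NOT Ion's real encoding):
# 1 byte type tag + 4 byte big-endian payload length.
HEADER = ">BI"
HEADER_SIZE = struct.calcsize(HEADER)  # 5 bytes


def encode_fields(fields):
    """Encode (tag, payload_bytes) pairs as tag + length + payload."""
    out = bytearray()
    for tag, payload in fields:
        out += struct.pack(HEADER, tag, len(payload))
        out += payload
    return bytes(out)


def read_field(buf, wanted_index):
    """Return (tag, payload, skipped_bytes) for the field at wanted_index.

    Earlier payloads are never inspected: the length in each header
    tells us how far to jump to reach the next header.
    """
    offset = 0
    index = 0
    skipped = 0
    while offset < len(buf):
        tag, length = struct.unpack_from(HEADER, buf, offset)
        offset += HEADER_SIZE
        if index == wanted_index:
            return tag, buf[offset:offset + length], skipped
        offset += length      # jump over the payload without reading it
        skipped += length
        index += 1
    raise IndexError("no field at index %d" % wanted_index)


# A large log payload, then the small field we care about, then more data.
record = encode_fields([
    (1, b"x" * 1000),
    (2, b"PACIFIC"),
    (1, b"y" * 500),
])
tag, payload, skipped = read_field(record, 1)
print(tag, payload, skipped)
```

In this sketch the reader touches only the two headers before the wanted payload; the 1,000-byte payload in front of it is stepped over using its declared length. A text format offers no such shortcut, because the end of a value can only be found by scanning its characters.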

Querying Ion Datasets with Athena

Athena now supports the querying and creation of Ion-formatted datasets through an Ion-specific SerDe, which, combined with IonInputFormat and IonOutputFormat, facilitates reading and writing valid Ion data. Users can run SELECT queries on Ion data for insights and serialize data into Ion format via CTAS or INSERT INTO queries.

Because Ion text and binary are interchangeable, Athena can read datasets that contain both types of files. Since Ion is a superset of JSON, a table defined with the Ion SerDe can also include JSON files. Unlike the JSON SerDe, which treats every newline character as the start of a new row, the Ion SerDe uses a combination of closing brackets and newline characters for this purpose, which makes it possible to read multi-line JSON records.
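The difference between newline-based and value-based row splitting can be sketched in plain Python. This is only an analogy built on the standard json module and does not show how the Ion SerDe is implemented; the helper name iter_json_values and the second sample record are made up for illustration.

```python
import json


def iter_json_values(text):
    """Yield each top-level JSON value in text, regardless of line breaks."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        value, pos = decoder.raw_decode(text, pos)
        yield value


# One pretty-printed record followed by one single-line record.
sample = """{
  "street": "PACIFIC",
  "st_type": "AVE"
}
{"street": "MARKET", "st_type": "ST"}
"""

rows = list(iter_json_values(sample))
print(len(rows), rows[0]["street"])

# A strict line-per-row reader fails on the first line, which is just "{".
try:
    json.loads(sample.splitlines()[0])
    line_per_row_ok = True
except json.JSONDecodeError:
    line_per_row_ok = False
print(line_per_row_ok)
```

A reader that finds row boundaries by where a value closes accepts both records, while a reader that assumes one record per line rejects the pretty-printed one.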

Creating External Tables

To query Ion-based datasets with Athena, you can define AWS Glue tables with user-defined metadata. Here’s a sample row from the citylots dataset:

{
    "type": "Feature",
    "properties": {
        "mapblklot": "0579021",
        "blklot": "0579024",
        "block_num": "0579",
        "lot_num": "024",
        "from_st": "2160",
        "to_st": "2160",
        "street": "PACIFIC",
        "st_type": "AVE",
        "odd_even": "E"
    },
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-122.4308798855922, ...]]]
    }
}

To create an external table for Ion data, there are two syntactic approaches. The first is to specify STORED AS ION, which is more concise and ideal for simple cases without additional properties. The following example illustrates this method:

CREATE EXTERNAL TABLE city_lots_ion1 (
  type STRING, 
  properties STRUCT<
    mapblklot:STRING,
    blklot:STRING,
    block_num:STRING,
    lot_num:STRING,
    from_st:STRING,
    to_st:STRING,
    street:STRING,
    st_type:STRING,
    odd_even:STRING>, 
  geometry STRUCT<
    type:STRING,
    coordinates:ARRAY<...>
  >
) 
STORED AS ION
LOCATION 's3://path/to/your/ion/data/';

