Amazon IXD – VGT2 Las Vegas: Enhancing Data Retrieval with S3 Select and Glacier Select

Update July 25, 2024 — To optimize your data querying in Amazon S3, consider utilizing Amazon Athena, S3 Object Lambda, or client-side filtering. Learn more about these enhancements in a recent blog post.

Amazon Simple Storage Service (Amazon S3) serves as a data repository for millions of applications across many leading sectors. Many organizations also rely on Amazon Glacier for secure, cost-effective archival storage. S3 allows for the storage of vast numbers of objects, each up to 5 terabytes in size. Traditionally, data in object storage has been accessed in its entirety: a request for a 5-gigabyte object returns all 5 gigabytes. We are now shifting this paradigm with two new features for S3 and Glacier that let you use simple SQL expressions to extract only the bytes you need from those objects. This change can benefit nearly every application that works with objects in S3 or Glacier.

S3 Select

Currently available in preview, S3 Select empowers applications to retrieve only the subset of data they need from an object using simple SQL expressions. By fetching just the relevant data with S3 Select, applications can see dramatic performance improvements, as much as 400% in some cases.

For instance, suppose you work as a developer for a major retailer tasked with analyzing weekly sales data from a specific store, while the data for all 200 stores is compiled into a new GZIP-ed CSV file daily. Without S3 Select, you would be required to download, decompress, and process the entire CSV to extract the relevant data. S3 Select enables you to write a simple SQL expression that retrieves only the information pertinent to your store, considerably reducing the volume of data handled and thereby enhancing application performance.
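To make the savings concrete, here is a client-side sketch of the filtering that S3 Select would instead perform inside S3: without S3 Select, your application must download the whole GZIP-ed CSV, decompress it, and discard every row that belongs to another store. The column name `store_id`, the store number, and the sample data below are hypothetical, chosen only for illustration.

```python
import csv
import gzip
import io

def filter_store_rows(gzipped_bytes, store_id):
    """Decompress a GZIP-ed CSV and keep only the rows for one store.

    This is the work S3 Select moves server-side: with S3 Select,
    only the matching rows would ever cross the network.
    """
    with gzip.open(io.BytesIO(gzipped_bytes), mode="rt", newline="") as f:
        reader = csv.DictReader(f)
        return [row for row in reader if row["store_id"] == store_id]

# Hypothetical daily sales file covering several stores.
raw = "store_id,item,units\n042,widget,7\n117,gadget,3\n042,gizmo,5\n"
gzipped = gzip.compress(raw.encode("utf-8"))

mine = filter_store_rows(gzipped, "042")
```

With S3 Select, the SQL expression replaces this entire download-and-filter step, so the bytes for the other 199 stores never leave S3.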

Here is a quick example using Python that retrieves only the rows of a CSV-formatted object that match a SQL filter:

import boto3
s3 = boto3.client('s3')

r = s3.select_object_content(
        Bucket='jbarr-us-west-2',
        Key='sample-data/airportCodes.csv',
        ExpressionType='SQL',
        Expression="select * from s3object s where s.\"Country (Name)\" like '%United States%'",
        InputSerialization={'CSV': {'FileHeaderInfo': 'USE'}},
        OutputSerialization={'CSV': {}},
)

for event in r['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
    elif 'Stats' in event:
        stats = event['Stats']['Details']
        print("Bytes scanned: {}".format(stats['BytesScanned']))
        print("Bytes processed: {}".format(stats['BytesProcessed']))
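One practical detail when consuming the event stream: S3 Select delivers results in chunks, and a chunk boundary can fall in the middle of a CSV row. A small buffering helper, sketched below (the function name is mine, not part of the API), reassembles complete lines before your application parses them:

```python
def complete_lines(chunks):
    """Join streamed payload chunks and yield only complete lines.

    A Records event can end mid-row, so the trailing partial line is
    buffered until the next chunk arrives.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        lines = buffer.split("\n")
        buffer = lines.pop()   # last element may be a partial line
        for line in lines:
            yield line
    if buffer:                 # flush a final line with no trailing newline
        yield buffer

# Simulated chunk boundary falling mid-row:
rows = list(complete_lines(["SEA,Seattle\nPDX,Por", "tland\nLAS,Las Vegas\n"]))
```

In the loop above, you would feed each decoded `Records` payload into a generator like this instead of printing it directly.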

Impressive, right? We anticipate that S3 Select will be utilized to enhance a variety of applications. This capability for selective data retrieval is especially beneficial for serverless applications crafted with AWS Lambda. In fact, when we adapted the Serverless MapReduce reference architecture to utilize S3 Select for fetching only the necessary data, we noted a 2X performance improvement alongside an 80% cost reduction.

Furthermore, query pushdown using S3 Select is now supported with Spark, Hive, and Presto in Amazon EMR. This feature allows you to offload the computational work of filtering extensive data sets for processing from the EMR cluster directly to Amazon S3, thus improving performance and minimizing data transfer between Amazon EMR and Amazon S3.

Things To Know

Amazon Athena, Amazon Redshift, and Amazon EMR, along with partners such as Cloudera, Databricks, and Hortonworks, will support S3 Select.

Glacier Select

Organizations in highly regulated sectors, including Financial Services and Healthcare, often write data directly to Amazon Glacier to comply with regulations such as SEC Rule 17a-4 and HIPAA. Many S3 users also apply lifecycle policies to move data into Glacier once it is no longer regularly accessed, reducing storage costs. Unlike traditional archival solutions such as on-premises tape libraries, which impose strict retrieval limits and can take weeks to yield useful analytics, Glacier lets you query cold data within minutes.

This presents a wealth of new business opportunities for archived data. Glacier Select enables filtering directly against a Glacier object using standard SQL statements.

Glacier Select operates similarly to other retrieval jobs but includes an additional set of parameters in the job request. Here’s a brief example:

import boto3
glacier = boto3.client("glacier")

# Build a select job: Glacier filters the archive server-side and
# writes the matching rows to the S3 location named in OutputLocation.
jobParameters = {
    "Type": "select",
    "ArchiveId": "ID",
    "Tier": "Expedited",   # Expedited, Standard, or Bulk
    "SelectParameters": {
        "InputSerialization": {"csv": {}},
        "ExpressionType": "SQL",
        # _5 is a positional reference to the fifth CSV column
        "Expression": "SELECT * FROM archive WHERE _5='498960'",
        "OutputSerialization": {"csv": {}},
    },
    "OutputLocation": {
        "S3": {"BucketName": "glacier-select-output", "Prefix": "1"}
    },
}

glacier.initiate_job(vaultName="reInventSecrets", jobParameters=jobParameters)
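Because the job request is plain data, it is easy to wrap in a small builder when you submit many select jobs. The helper below is a sketch of my own (the function name is not part of the boto3 API); it mirrors the request structure shown above and makes the retrieval tier an explicit parameter:

```python
def select_job_parameters(archive_id, expression, bucket, prefix,
                          tier="Standard"):
    """Build the jobParameters dict for a Glacier select job.

    tier must be one of "Expedited", "Standard", or "Bulk"; it controls
    both how quickly results arrive and how much the job costs.
    """
    return {
        "Type": "select",
        "ArchiveId": archive_id,
        "Tier": tier,
        "SelectParameters": {
            "InputSerialization": {"csv": {}},
            "ExpressionType": "SQL",
            "Expression": expression,
            "OutputSerialization": {"csv": {}},
        },
        "OutputLocation": {
            "S3": {"BucketName": bucket, "Prefix": prefix},
        },
    }

params = select_job_parameters(
    "ID", "SELECT * FROM archive WHERE _5='498960'",
    "glacier-select-output", "1", tier="Bulk",
)
```

The resulting dict can be passed straight to `glacier.initiate_job` as the `jobParameters` argument.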

Things To Know

Glacier Select is now generally available in all commercial regions offering Glacier. The pricing structure for Glacier involves three dimensions: GB of Data Scanned, GB of Data Returned, and Select Requests. Costs for each dimension vary based on the desired speed of results: expedited (1-5 minutes), standard (3-5 hours), and bulk (5-12 hours).
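Since the price is the sum of three independent dimensions, estimating a job's cost is simple arithmetic. The sketch below uses hypothetical per-unit rates purely for illustration; consult the current AWS pricing page for the real numbers, which also differ by tier and region:

```python
def glacier_select_cost(gb_scanned, gb_returned, requests,
                        scan_rate, return_rate, request_rate):
    """Sum the three Glacier Select pricing dimensions.

    Rates are per GB scanned, per GB returned, and per request.
    """
    return (gb_scanned * scan_rate
            + gb_returned * return_rate
            + requests * request_rate)

# Hypothetical rates -- NOT actual AWS pricing.
cost = glacier_select_cost(
    gb_scanned=100, gb_returned=1, requests=1,
    scan_rate=0.008, return_rate=0.01, request_rate=0.01,
)
```

Note that a selective expression can return far less data than it scans, so the scanned-GB dimension usually dominates for large archives.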

I hope you find these capabilities useful in enhancing or developing your applications. For additional insights, check out this excellent resource.

– Jordan

Resources:

Location: Amazon IXD – VGT2, 6401 E Howdy Wells Ave, Las Vegas, NV 89115

