Text Analytics on AWS: Implementing a Data Lake Architecture with OpenSearch

Text is one of the most prevalent forms of unstructured data used in analytics. Because it often lacks a fixed structure, it can be challenging to retrieve and analyze. For instance, web pages are filled with text that analysts often gather through web scraping and then preprocess by applying techniques like lowercasing, stemming, and lemmatization. Once the data is cleaned, data scientists and analysts examine it to uncover valuable insights.

In this article, we look at how to manage text data effectively with a data lake architecture on Amazon Web Services (AWS). We illustrate how data teams can independently extract insights from text documents, using OpenSearch as the primary search and analytics service. We also cover how to index and refresh text data in OpenSearch and how to evolve the architecture toward automation.

Architecture Overview

This architecture presents a comprehensive solution for text analytics, utilizing AWS services to facilitate the entire process from data collection and ingestion to data consumption within OpenSearch (Figure 1).

  • Data can be collected from various sources, including SaaS applications, edge devices, logs, streaming media, and social platforms.
  • Depending on the data source, tools such as AWS Database Migration Service (AWS DMS), AWS DataSync, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka (Amazon MSK), AWS IoT Core, and Amazon AppFlow can be employed to ingest data into the AWS data lake.
  • The ingested data is stored in the raw zone of the Amazon Simple Storage Service (Amazon S3) data lake—a temporary storage area where data remains in its original format.
  • Utilizing AWS Glue or Amazon EMR, the data is validated, cleaned, normalized, transformed, and enriched through a series of preprocessing steps (a minimal sketch of such a step follows this list).
  • Data that is ready for indexing is transferred to the indexing zone.
  • AWS Lambda is employed to index documents into OpenSearch and subsequently store them in the data lake with a unique identifier.
  • The clean zone serves as the source of truth for teams to access the data and generate additional metrics.
  • New metrics can be developed, trained, and generated using machine learning (ML) models with Amazon SageMaker or AI services like Amazon Comprehend.
  • The new metrics are stored in the enrich zone along with the identifier of the OpenSearch document.
  • AWS Lambda uses the identifier column from the initial indexing phase to pinpoint the correct documents in OpenSearch and update them with the newly calculated metrics.
  • OpenSearch is then used to search through the documents and visualize metrics with OpenSearch Dashboards.
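
To make the preprocessing step concrete, here is a minimal sketch of the kind of normalization an AWS Glue or Amazon EMR job might apply to each record. The function and the exact cleaning steps are illustrative assumptions, not a prescribed pipeline:

```python
import re

def preprocess(raw_text: str) -> str:
    """Normalize one raw text record before it moves to the indexing zone."""
    text = re.sub(r"<[^>]+>", " ", raw_text)  # strip residual HTML tags
    text = text.lower()                       # lowercase
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(preprocess("<p>Great  POST, thanks!</p>"))  # -> "great post, thanks!"
```

In a real job, this function would run over every record in the raw zone, with steps such as stemming or lemmatization added as the use case requires.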

Considerations

Data Lake Orchestration Across Teams

This architecture enables data teams to work independently on text documents at various stages of their lifecycle. The data engineering team manages the raw and indexing zones, overseeing data ingestion and preprocessing for indexing in OpenSearch. The cleaned data is stored in the clean zone, where data analysts and scientists derive insights and compute new metrics. These metrics are stored in the enrich zone, subsequently indexed as new fields in the OpenSearch documents by the data engineering team.

For example, let’s consider a scenario where a company retrieves comments from a blog site to conduct sentiment analysis using Amazon Comprehend. In this case:

  • The comments are ingested into the raw zone of the data lake.
  • The data engineering team processes these comments and saves them in the indexing zone.
  • A Lambda function indexes the comments into OpenSearch, enriches them with the OpenSearch document ID, and saves them in the clean zone.
  • The data science team analyzes the comments and performs sentiment analysis using Amazon Comprehend.
  • Sentiment analysis metrics are stored in the enrich zone of the data lake. A second Lambda function updates the comments in OpenSearch with the new metrics (a sketch of the Comprehend call follows this list).
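
The sentiment analysis step can be sketched with the Amazon Comprehend DetectSentiment API via boto3. The record layout (the clean_text and id fields) is an illustrative assumption; for large volumes, the BatchDetectSentiment variant is more economical:

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

def enrich_with_sentiment(comment: dict) -> dict:
    """Compute sentiment for one cleaned comment and keep its OpenSearch ID."""
    result = comprehend.detect_sentiment(
        Text=comment["clean_text"],
        LanguageCode="en",
    )
    return {
        "id": comment["id"],                 # OpenSearch document ID from the clean zone
        "sentiment": result["Sentiment"],    # POSITIVE / NEGATIVE / NEUTRAL / MIXED
        "sentiment_score": result["SentimentScore"],
    }

metrics = enrich_with_sentiment({"id": "42", "clean_text": "great post, thanks!"})
```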

If raw data does not require preprocessing, the indexing and clean zones can be combined. You can explore this case and its implementation in the AWS samples repository.

Schema Evolution

As data transitions through the stages of the data lake, its schema evolves and becomes enriched. Continuing with the previous example, Figure 3 illustrates how the schema changes.

In the raw zone, there is a raw text field received directly from the ingestion phase. It's advisable to keep a raw version of the data as a backup, in case processing needs to be repeated later. In the indexing zone, the clean text field replaces the raw text field after preprocessing. In the clean zone, a new ID field is introduced during indexing; it identifies the OpenSearch document that holds the text field. In the enrich zone, the ID field is mandatory, while the other fields, which represent newly calculated metrics, are optional and can be integrated into OpenSearch.
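
Continuing the blog-comment example, the records below sketch how one document's shape might evolve across zones; all field names and values are illustrative assumptions:

```python
# One blog comment as it moves through the zones (all names/values illustrative).

raw_zone      = {"raw_text": "<p>Great POST, thanks!</p>"}

indexing_zone = {"clean_text": "great post, thanks!"}     # clean text replaces raw text

clean_zone    = {"id": "42",                              # OpenSearch document ID added at indexing
                 "clean_text": "great post, thanks!"}

enrich_zone   = {"id": "42",                              # mandatory: links back to the document
                 "sentiment": "POSITIVE",                 # optional newly calculated metrics
                 "sentiment_score": 0.98}
```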

Consumption Layer with OpenSearch

In OpenSearch, data is organized into indices, which can be likened to tables in a relational database. Each index contains documents, similar to table rows, with multiple fields akin to table columns. Documents can be added to an index through indexing, and updated later, using client APIs available for popular programming languages.
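
As a minimal sketch, indexing and then updating a single document with the opensearch-py client could look like the following; the endpoint, credentials, index name, and fields are placeholders:

```python
from opensearchpy import OpenSearch

# Placeholders: replace with your OpenSearch domain endpoint and credentials.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("user", "password"),
    use_ssl=True,
)

# Index a document (a "row") into the comments index (a "table").
client.index(index="comments", id="42", body={"clean_text": "great post, thanks!"})

# Later, update the same document with a new field (a "column").
client.update(index="comments", id="42", body={"doc": {"sentiment": "POSITIVE"}})
```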

Now, let’s examine how our architecture integrates with OpenSearch during the indexing and updating phase.

Indexing and Updating Documents using Python

The index document API operation indexes a document with a custom ID, or auto-generates one if none is provided. To speed up indexing, we can use the bulk API to index multiple documents in a single request. It is crucial to store the IDs returned by the index operation so we can later identify the documents that require updates with new metrics. There are two ways to achieve this:

  • Use the requests library to call the REST Bulk Index API (recommended): the response contains the auto-generated IDs we need (see the sketch after this list).
  • Use the Python Low-Level Client for OpenSearch: the IDs are not returned, necessitating pre-assignment for future storage. An atomic counter in Amazon DynamoDB can facilitate this, allowing multiple Lambda functions to index documents in parallel without ID collisions.
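
A minimal sketch of the first (recommended) method, assuming a placeholder endpoint and basic authentication: each pair of NDJSON lines indexes one document without an ID, and the response items carry the auto-generated IDs.

```python
import json
import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("user", "password")                                # placeholder

def bulk_index(docs, index="comments"):
    """Bulk-index documents without IDs and return the auto-generated IDs."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line, no _id
        lines.append(json.dumps(doc))                           # document line
    body = "\n".join(lines) + "\n"                              # NDJSON needs a trailing newline
    resp = requests.post(
        f"{ENDPOINT}/_bulk",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
        auth=AUTH,
    )
    resp.raise_for_status()
    return [item["index"]["_id"] for item in resp.json()["items"]]

ids = bulk_index([{"clean_text": "great post"}, {"clean_text": "not helpful"}])
```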

As illustrated in Figure 4, the Lambda function:

  • Increases the atomic counter by the number of documents to be indexed into OpenSearch.
  • Retrieves the counter’s value through the API call.
  • Indexes the documents using IDs from the range [current counter value − number of documents + 1, current counter value] (see the sketch after this list).
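
A minimal sketch of the atomic counter for the second method, assuming a DynamoDB table named opensearch-id-counter with a simple pk key; the table name and key schema are illustrative:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("opensearch-id-counter")  # assumed table name and key schema

def reserve_ids(num_docs: int) -> range:
    """Atomically reserve a contiguous block of document IDs for one invocation."""
    resp = table.update_item(
        Key={"pk": "comments"},                     # one counter item per index
        UpdateExpression="ADD doc_counter :n",      # atomic increment
        ExpressionAttributeValues={":n": num_docs},
        ReturnValues="UPDATED_NEW",                 # returns the incremented value
    )
    new_value = int(resp["Attributes"]["doc_counter"])
    return range(new_value - num_docs + 1, new_value + 1)

ids = reserve_ids(3)  # e.g. 101, 102, 103 if the counter moved from 100 to 103
```

Because the increment is atomic, concurrent Lambda invocations each receive a disjoint ID block and can index in parallel without collisions.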

Data Flow Automation

As architectures progress towards automation, the data flow between data lake stages can become event-driven. Referring to our earlier example, we can automate the data processing steps while transitioning from the raw to the indexing zone.

With Amazon EventBridge and AWS Step Functions, we can trigger our preprocessing AWS Glue jobs automatically, ensuring that data is processed without manual intervention.
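
As a sketch of what this automation could look like, the following creates an EventBridge rule that matches S3 Object Created events on the raw zone prefix and targets a Step Functions state machine wrapping the Glue job. All names and ARNs are placeholders, and the bucket must have EventBridge notifications enabled:

```python
import json
import boto3

events = boto3.client("events")

# Assumptions: bucket "my-datalake" with EventBridge notifications enabled,
# and an existing state machine that runs the preprocessing AWS Glue job.
RULE = "raw-zone-object-created"

events.put_rule(
    Name=RULE,
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": ["my-datalake"]},
            "object": {"key": [{"prefix": "raw/"}]},  # only the raw zone triggers it
        },
    }),
)

events.put_targets(
    Rule=RULE,
    Targets=[{
        "Id": "preprocessing-state-machine",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:preprocess",
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-sfn",
    }],
)
```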

