Enhancing Your Amazon Kendra Index with Additional File Formats

If you manage a collection of internal documents that you regularly search through, Amazon Kendra can significantly streamline your content discovery process. These documents may exist in various locations and formats, whether structured or unstructured. Amazon Kendra is a fully managed service that leverages machine learning (ML), eliminating the need for server maintenance or backend ML model management.

One standout feature of Amazon Kendra is its ability to answer questions in natural language, allowing you to query the service conversationally and receive direct answers drawn from your documents.

As of November 2023, Amazon Kendra supports the following document formats:

  • Plaintext
  • HTML
  • PDF
  • Microsoft PowerPoint
  • Microsoft Word
  • RTF (Rich Text Format)
  • Markdown
  • CSV (Comma-Separated Values)
  • XML (eXtensible Markup Language)

In this article, we demonstrate how to bring additional formats into your Amazon Kendra index through a preprocessing pipeline, using RTF and markdown as the worked examples. We also provide a roadmap for supporting further file types with the same pattern.

Solution Overview

Our solution uses an event-driven serverless architecture. The workflow proceeds as follows:

  1. Upload your RTF or markdown files to your Amazon Simple Storage Service (Amazon S3) bucket. These uploads trigger an event through AWS CloudTrail, invoking Amazon EventBridge.
  2. EventBridge places a message for each upload in an Amazon Simple Queue Service (Amazon SQS) queue. Buffering the events in SQS adds fault tolerance: a newly uploaded file stays queued until it has been processed and added to Amazon Kendra.
  3. EventBridge also triggers an AWS Lambda function, which starts an AWS Step Functions workflow. Step Functions provides serverless orchestration, which further improves the architecture’s reliability.
  4. Step Functions ensures that each new file in Amazon S3 is processed. It calls a triage Lambda function that routes each file to a processing function based on its file extension. Because routing depends only on the extension, supporting another format requires just a new route and a matching handler.
  5. The processing Lambda functions (RTF Lambda and MD Lambda) extract text from each document, store the output in Amazon S3, and update the Amazon Kendra index.
  6. Once all files are processed and the SQS queue is empty, every service apart from Amazon S3 and Amazon Kendra shuts down until the next upload.
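
The triage step (step 4) can be pictured as a lookup from file extension to processing function. The following is a minimal sketch rather than the solution's actual triage code: the HANDLERS table, the handler labels, and the triage function name are all assumptions made for illustration.

```python
import os

# Hypothetical routing table: file extension -> label of the processing
# function. These labels are illustrative, not the solution's real names.
HANDLERS = {
    ".rtf": "rtf_lambda",
    ".md": "md_lambda",
}


def triage(key: str) -> str:
    """Return the handler label for an S3 object key, or 'unsupported'."""
    _, extension = os.path.splitext(key)
    # Compare in lowercase so that "REPORT.RTF" and "report.rtf" match.
    return HANDLERS.get(extension.lower(), "unsupported")


if __name__ == "__main__":
    for sample_key in ("reports/Q3-summary.RTF", "notes/readme.md", "images/logo.png"):
        print(sample_key, "->", triage(sample_key))
```

Keeping the routing in a single table means that a new format is one extra entry plus one new processing function.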

Customizing and Enhancing the Solution

You can incorporate additional file types by creating new Lambda functions and adding them to the set of processing functions. Modify the triage function to recognize your new file type, and create a corresponding Lambda function to handle those files.
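
As an illustration of what a handler for a new file type might do, the sketch below extracts plain text from a Jupyter notebook (.ipynb), which is a JSON document. This format is a hypothetical example and not part of the solution; the function name ipynb_to_text is an assumption, and a real handler would wrap it with the same S3 read and write steps that the existing processing functions use.

```python
import json


def ipynb_to_text(raw: str) -> str:
    """Extract the text of markdown and code cells from notebook JSON."""
    notebook = json.loads(raw)
    parts = []
    for cell in notebook.get("cells", []):
        # Skip raw cells and anything else that is not prose or code.
        if cell.get("cell_type") not in ("markdown", "code"):
            continue
        source = cell.get("source", "")
        # The notebook format allows source to be a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            parts.append(source.strip())
    # Separate cells with a blank line so they read as paragraphs.
    return "\n\n".join(parts)


if __name__ == "__main__":
    sample = json.dumps({
        "cells": [
            {"cell_type": "markdown", "source": ["# Title\n", "Some text"]},
            {"cell_type": "code", "source": "print('hi')"},
        ]
    })
    print(ipynb_to_text(sample))
```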

Here’s a sample of the code for the RTF processing Lambda function:

import os

from shared.log import logger
from shared.s3_utils import get_s3_object, upload_to_s3
from striprtf.striprtf import rtf_to_text


def lambda_handler(event, context):
    # Read the fields from the incoming event and the bucket names from
    # the function's environment.
    try:
        receipt_handle = event.get("receipt_handle", "Not Found")
        key = event.get("key", "Not Found")
        s3_key = event.get("s3_key", "Not Found")

        # Index with [] so that a missing variable raises KeyError here
        # instead of surfacing later as a None bucket name.
        outputBucketName = os.environ["OUTPUT_BUCKET_NAME"]
        rawBucketName = os.environ["RAW_BUCKET_NAME"]
    except Exception as error:
        logger.error(f"Error getting the environ or event variables, error text follows:\n {error}")
        raise

    # Download the raw RTF file and decode the returned bytes as UTF-8.
    s3_response = get_s3_object(rawBucketName, key)
    rtf_decoded = s3_response.decode('UTF-8')

    # Convert the RTF markup to plain text and drop the pipe characters
    # left in the converted output.
    text = rtf_to_text(rtf_decoded)
    text = text.replace('|', '')

    # Store the extracted text in the output bucket.
    upload_to_s3(outputBucketName, f"{s3_key}", text)

    # Return the location of the extracted text for the next step.
    return {
        "receipt_handle": receipt_handle,
        "bucketName": outputBucketName,
        "key": s3_key
    }
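
The function above stores the extracted text in Amazon S3; the index update mentioned in the solution overview is not shown in this article. The following is a hedged sketch of how that update could look with the boto3 Kendra client's batch_put_document call. The helper names, the choice of the S3 URI as the document ID, and the role_arn parameter (an IAM role that lets Amazon Kendra read the bucket) are assumptions, not the solution's actual code.

```python
def build_kendra_document(bucket: str, key: str) -> dict:
    """Describe a plain-text S3 object as an Amazon Kendra document."""
    return {
        # Assumption: use the S3 URI as a stable, unique document ID.
        "Id": f"s3://{bucket}/{key}",
        "S3Path": {"Bucket": bucket, "Key": key},
        "ContentType": "PLAIN_TEXT",
    }


def index_document(index_id: str, role_arn: str, bucket: str, key: str) -> dict:
    """Add or update one extracted-text document in the index."""
    # Imported here so the helper above can be used without AWS access.
    import boto3

    kendra = boto3.client("kendra")
    return kendra.batch_put_document(
        IndexId=index_id,
        RoleArn=role_arn,
        Documents=[build_kendra_document(bucket, key)],
    )
```

The response includes a FailedDocuments list, which is worth logging so that rejected documents are not lost silently.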

Deploying the Solution

To deploy the solution, use an AWS CloudFormation template and follow these steps:

  1. Select Launch Stack.
  2. For Stack Name, create a unique identifier.
  3. For Logging Level, choose your preferred logging level (DEBUG, INFO, or WARNING).
  4. For Prefix, specify your desired S3 bucket prefix. We add the AWS account ID to avoid S3 bucket name conflicts.
  5. For Kendra Index, enter the IndexId (not the index name) of an existing Amazon Kendra index in your account and Region. We recommend Amazon Kendra Enterprise Edition for production workloads.

Select the acknowledgment checkboxes, and choose Create Stack. Apart from the S3 buckets and the Amazon Kendra index, the AWS CloudFormation stack sets up the remaining resources that the solution needs. You can now add RTF, markdown, and other configured file types to your Amazon Kendra index.

Clean-Up

To prevent unnecessary charges, use the AWS CloudFormation console to delete the stack you’ve deployed. This action removes the resources that the stack created, but the data in Amazon S3 and your Amazon Kendra index remain intact.

Conclusion

In this article, we outlined a robust, fault-tolerant serverless solution for incorporating additional file formats into your Amazon Kendra index. We specifically implemented support for RTF and markdown formats and provided guidance for expanding this functionality to other similar formats. This solution can serve as a foundation for your own initiatives. For expert assistance, Amazon ML Solutions Lab, AWS Professional Services, and partners are available to support your journey. For more insights on how Amazon Kendra can enhance your business operations, visit the website.

