Dialogue-Driven Intelligent Document Processing with Foundation Models on Amazon SageMaker JumpStart

Intelligent Document Processing (IDP) is an advanced technology that automates the management of substantial amounts of unstructured data, such as text, images, and videos. IDP significantly enhances traditional manual methods and older optical character recognition (OCR) systems by tackling issues like high costs, inaccuracies, and limited scalability, ultimately yielding improved outcomes for organizations and their stakeholders.

Recent advancements in Natural Language Processing (NLP) have elevated the accuracy and overall user experience of IDP. Nevertheless, challenges persist. For example, many IDP systems still lack user-friendliness, making them harder to adopt. Furthermore, several existing solutions do not possess the capability to adjust to changes in data sources, regulations, or user needs through ongoing improvements and updates.

Integrating dialogue capabilities into IDP systems can enhance user interaction by allowing for a more natural and intuitive experience. Through multi-turn dialogue, users can rectify inaccuracies or provide additional information, supported by task automation, leading to systems that are not only more efficient but also more user-friendly.

In this article, we present an approach to IDP built around a dialogue-guided query solution that uses foundation models on Amazon SageMaker JumpStart.

Solution Overview

This innovative solution merges OCR for information extraction, a locally deployed Large Language Model (LLM) for dialogue and autonomous task execution, VectorDB for embedding subtasks, and LangChain-based task automation for seamless integration with external data sources. This combination transforms the way organizations manage and analyze document contexts. By leveraging generative AI technologies, businesses can streamline IDP workflows, improving user experience and overall efficiency.

The accompanying video illustrates the dialogue-guided IDP system as it processes an article from the Federal Reserve Board of Governors, detailing the collapse of Silicon Valley Bank in March 2023.

The system can handle images, large PDFs, and documents in various formats while responding to inquiries derived from the content through interactive text or voice inputs. If users need to ask questions beyond the document’s content, the dialogue-guided IDP can generate a sequence of tasks from the text prompt, referencing external and current data sources for pertinent answers. It also supports multi-turn conversations and accommodates multilingual exchanges, all managed through dialogue.
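Although the orchestration code isn't shown in this post, a minimal sketch of this kind of dialogue-driven task automation with LangChain could look like the following. It assumes the J2 endpoint deployed in the next section, LangChain's classic agent API, and a SerpAPI key for web search as the external data source; the content handler mirrors AI21's request and response schema.

import json
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms.sagemaker_endpoint import SagemakerEndpoint, LLMContentHandler

class J2ContentHandler(LLMContentHandler):
    # Serialize prompts to, and parse completions from, the AI21 J2 endpoint.
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        return json.dumps({"prompt": prompt, **model_kwargs}).encode("utf-8")

    def transform_output(self, output) -> str:
        body = json.loads(output.read().decode("utf-8"))
        return body["completions"][0]["data"]["text"]

# Wrap the deployed SageMaker endpoint as a LangChain LLM.
llm = SagemakerEndpoint(
    endpoint_name="sagemaker-soln-j2-jumbo-instruct",
    region_name="<Your-AWS-Region>",
    model_kwargs={"maxTokens": 200, "temperature": 0.0},
    content_handler=J2ContentHandler(),
)

# A web search tool gives the agent access to current, external data
# (requires a SerpAPI key in the environment).
tools = load_tools(["serpapi"], llm=llm)
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
agent.run("What is the current federal funds rate, and when was it last changed?")

The agent decomposes the user's prompt into a sequence of intermediate tool calls, then composes a final answer from their results.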

Deploy Your Own LLM Using Amazon SageMaker JumpStart Foundation Models

One of the most exciting advancements in generative AI is the integration of LLMs into dialogue systems, creating new possibilities for more intuitive and meaningful interactions. An LLM is an AI model specifically designed to comprehend and generate human-like text. Trained on vast datasets, these models have billions of parameters, allowing them to perform various language-related tasks with remarkable accuracy. This transformative approach facilitates more natural and productive interactions, bridging human intuition and machine intelligence. A major benefit of locally deploying LLMs is enhanced data security, as data does not need to be sent to third-party APIs. Additionally, you can fine-tune your chosen LLM with specialized data, resulting in a more accurate and context-aware understanding of language.

The Jurassic-2 series from AI21 Labs, based on the instruct-tuned 178-billion-parameter Jurassic-1 LLM, is among the foundation models available through Amazon SageMaker JumpStart. Jurassic-2 Instruct was trained specifically to follow instruction-only prompts, known as zero-shot prompts, without requiring examples (few-shot). This makes interaction with the LLM as intuitive as possible: the model infers the desired output for a task from the instruction alone. You can deploy the pre-trained J2-jumbo-instruct, or other Jurassic-2 models available on AWS Marketplace, into your own virtual private cloud (VPC) using Amazon SageMaker. Here’s a sample code snippet:

import ai21
import sagemaker
from sagemaker import ModelPackage

# Define the endpoint name
endpoint_name = "sagemaker-soln-j2-jumbo-instruct"
# Choose a real-time inference instance type. Options include ml.g5.48xlarge and ml.p4de.24xlarge.
real_time_inference_instance_type = "ml.p4d.24xlarge"

# Create a SageMaker endpoint and deploy a pre-trained J2-jumbo-instruct-v1 model from AWS Marketplace.
model_package_arn = "arn:aws:sagemaker:<Your-AWS-Region>:<Your-AWS-Account>:model-package/j2-jumbo-instruct-v1-0-20-8b2be365d1883a15b7d78da7217cdeab"
model = ModelPackage(
    role=sagemaker.get_execution_role(),
    model_package_arn=model_package_arn,
    sagemaker_session=sagemaker.Session(),
)

# Deploy the model
predictor = model.deploy(
    1,
    real_time_inference_instance_type,
    endpoint_name=endpoint_name,
    model_data_download_timeout=3600,
    container_startup_health_check_timeout=600,
)

Once the endpoint is successfully deployed within your VPC, you can initiate an inference task to verify that the deployed LLM is functioning as expected:

response_jumbo_instruct = ai21.Completion.execute(
    sm_endpoint=endpoint_name,
    prompt="Explain deep learning algorithms to 8th graders",
    numResults=1,
    maxTokens=100,
    temperature=0.01,  # A low temperature reduces "hallucination" by favoring common terminology.
)
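To inspect the output, you can read the generated text from the response; the field path below follows the AI21 SDK's completion response schema:

# Print the generated text from the first completion.
print(response_jumbo_instruct['completions'][0]['data']['text'])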

Document Processing, Embedding, and Indexing

We explore the process of creating an efficient search index, which is essential for intelligent and responsive dialogues guiding document processing. First, we convert documents of various formats into text using OCR and Amazon Textract. Next, we read this content and break it into smaller segments, ideally around the length of a sentence. This granular approach allows for more precise and relevant search results, because queries are matched against individual segments rather than an entire document. To enhance this process, we generate sentence embeddings with the sentence-transformers library from Hugging Face, which produces a vector representation of each segment. These vectors provide a compact and meaningful representation of the original text, enabling efficient and accurate semantic matching. Finally, we store the vectors in a vector database for similarity search. This combination of techniques lays the foundation for a document processing framework that delivers accurate and intuitive results for users.
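As an illustration, here is a minimal sketch of the segment-embed-index pipeline using sentence-transformers and a FAISS index as the vector store. The model name, sentence-splitting rule, and sample text are illustrative choices, not values prescribed by the solution.

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Illustrative embedding model; any sentence-transformers model works similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder for the text produced by the OCR step.
document_text = (
    "Silicon Valley Bank was closed by regulators in March 2023. "
    "Rapid deposit outflows and unrealized losses on long-dated bonds preceded the failure."
)

# Break the document into sentence-sized segments for precise matching.
segments = [s.strip() for s in document_text.split(". ") if s.strip()]

# Encode each segment into a normalized vector so inner product equals cosine similarity.
embeddings = np.asarray(model.encode(segments, normalize_embeddings=True), dtype="float32")

# Store the vectors in a FAISS index for similarity search.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Retrieve the segments most relevant to a user query.
query = model.encode(["Why did Silicon Valley Bank fail?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), k=2)
print([segments[i] for i in ids[0]])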

OCR is a critical component of the solution, allowing for the extraction of text from scanned documents or images. We can use Amazon Textract to extract text from PDF or image files. This managed OCR service can identify and analyze text in multi-page documents, including PDFs, JPEGs, or TIFFs, such as invoices and receipts. Processing multi-page documents occurs asynchronously, making it suitable for handling large documents. See the following code:

def pdf_2_text(input_pdf_file, history):
    history = history or []
    key = ...  # S3 object key (the source snippet breaks off here; see the sketch below)
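For reference, a self-contained sketch of this extraction step, using Amazon Textract's asynchronous text-detection API through boto3, might look like the following. The bucket name, key prefix, and polling interval are assumptions, not values from the solution.

import os
import time
import boto3

s3_client = boto3.client("s3")
textract_client = boto3.client("textract")

def extract_text_from_pdf(local_pdf_path, bucket="<your-s3-bucket>", prefix="input-pdf-files"):
    """Upload a PDF to S3 and extract its text with Textract's asynchronous API."""
    key = "{}/{}".format(prefix, os.path.basename(local_pdf_path))
    s3_client.upload_file(local_pdf_path, bucket, key)

    # Start an asynchronous text-detection job, which supports multi-page documents.
    job = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job completes.
    while True:
        result = textract_client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(5)
    if result["JobStatus"] == "FAILED":
        raise RuntimeError("Textract job {} failed".format(job_id))

    # Gather LINE blocks across all paginated result sets.
    lines = [b["Text"] for b in result["Blocks"] if b["BlockType"] == "LINE"]
    next_token = result.get("NextToken")
    while next_token:
        result = textract_client.get_document_text_detection(JobId=job_id, NextToken=next_token)
        lines += [b["Text"] for b in result["Blocks"] if b["BlockType"] == "LINE"]
        next_token = result.get("NextToken")
    return "\n".join(lines)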
