Creating an Agentic Multimodal AI Assistant with Amazon Nova and Amazon Bedrock Data Automation


Modern businesses are inundated with diverse data types—from text documents and PDFs to images, audio recordings, and presentation slides. Picture an AI assistant addressing questions about your company’s quarterly earnings call; this assistant should not only interpret the transcript but also “see” the charts in the slides and “hear” the CEO’s comments. Gartner forecasts that by 2027, 40% of generative AI solutions will be multimodal (covering text, images, audio, and video), a significant rise from just 1% in 2023. This trend highlights the growing importance of multimodal understanding in business applications. To achieve this, a multimodal generative AI assistant is essential—one capable of integrating text, visuals, and other data types. Furthermore, an agentic architecture is necessary so that the AI assistant can actively seek information, plan tasks, and make decisions regarding tool usage, rather than merely responding to prompts.

In this article, we present a solution that accomplishes just that: it uses Amazon Nova Pro, a sophisticated multimodal large language model (LLM) from AWS, as the central orchestrator, alongside powerful new features from Amazon Bedrock, including Amazon Bedrock Data Automation for handling multimodal data. We illustrate how agentic workflow patterns such as Retrieval Augmented Generation (RAG), multi-tool orchestration, and conditional routing with LangGraph can provide comprehensive solutions that AI/ML developers and enterprise architects can adopt and adapt. As an example, we explore a financial management AI assistant that delivers quantitative research and grounded financial advice by analyzing both the earnings call (audio) and presentation slides (images), as well as relevant financial data feeds.

Overview of the Agentic Workflow

The core of the agentic pattern comprises the following stages:

  1. Reason – The agent (often an LLM) evaluates the user’s request and the current context. It determines the next action—whether to provide a direct answer or invoke a tool or sub-task for additional information.
  2. Act – The agent executes that action, which could involve utilizing a tool or function such as a search query, a database lookup, or document analysis via Amazon Bedrock Data Automation.
  3. Observe – The agent reviews the outcome of the action. For instance, it examines the retrieved text or data returned from the tool.
  4. Loop – With the new information, the agent reassesses whether the task is complete or if further steps are needed, continuing this loop until it can deliver a final answer to the user.
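
In code, this loop can be expressed as a minimal sketch like the one below. The reasoning step and the tools are stubbed out for illustration; in the actual solution, the reasoning step would be an Amazon Nova invocation, and the tools would be calls to Amazon Bedrock Data Automation, the knowledge base, or market data feeds.

```python
# Minimal, framework-free sketch of the reason-act-observe loop.
# Tool names and the planning heuristic are illustrative stand-ins.
from typing import Callable

# Hypothetical tool registry (stubbed implementations).
TOOLS: dict[str, Callable[[str], str]] = {
    "query_knowledge_base": lambda q: f"[retrieved passages for: {q}]",
    "stock_search": lambda q: f"[latest quote for: {q}]",
}

def reason(question: str, observations: list[str]) -> dict:
    """Stand-in for the LLM planning step: decide the next action."""
    if not observations:  # nothing gathered yet -> retrieve first
        return {"action": "query_knowledge_base", "input": question}
    return {
        "action": "final_answer",  # enough context -> answer
        "input": f"Answer to '{question}' based on {len(observations)} observation(s).",
    }

def run_agent(question: str, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):                                 # Loop
        decision = reason(question, observations)              # Reason
        if decision["action"] == "final_answer":
            return decision["input"]
        result = TOOLS[decision["action"]](decision["input"])  # Act
        observations.append(result)                            # Observe
    return "Unable to answer within the step budget."

print(run_agent("Summarize the key risks in the Q3 earnings report"))
```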

This iterative decision-making process enables the agent to manage complex requests that cannot be resolved with a single prompt. However, designing agentic systems can be quite challenging, introducing complexities in control flow. Naïve agents may be inefficient (making excessive tool calls or looping unnecessarily) or difficult to manage as they scale. This is where structured frameworks like LangGraph become valuable. LangGraph allows for the definition of a directed graph (or state machine) of potential actions with clearly defined nodes (actions like “Report Writer” or “Query Knowledge Base”) and edges (possible transitions). While the agent’s internal reasoning still dictates which path to pursue, LangGraph ensures the process remains manageable and transparent. This controlled flexibility allows the assistant sufficient autonomy to handle a variety of tasks while maintaining a stable and predictable overall workflow.
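
To make this concrete, the following is a small LangGraph sketch of the routing pattern. The state fields, node names, and routing heuristic are illustrative stand-ins for the solution's actual graph, and the model calls are stubbed; in practice, the router decision comes from Amazon Nova.

```python
# Hedged sketch: a LangGraph state machine with a router node and a
# conditional edge to either a retrieval node or a report-writing node.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict, total=False):
    question: str
    context: str
    answer: str

def router(state: AgentState) -> AgentState:
    # In the real workflow, Amazon Nova decides the route via tool calling.
    return state

def route(state: AgentState) -> str:
    # Toy heuristic standing in for the model's routing decision.
    return "query_knowledge_base" if "earnings" in state["question"].lower() else "report_writer"

def query_knowledge_base(state: AgentState) -> AgentState:
    return {"context": f"[passages retrieved for: {state['question']}]"}

def report_writer(state: AgentState) -> AgentState:
    context = state.get("context", "no retrieved context")
    return {"answer": f"Report for '{state['question']}' using {context}"}

graph = StateGraph(AgentState)
graph.add_node("router", router)
graph.add_node("query_knowledge_base", query_knowledge_base)
graph.add_node("report_writer", report_writer)

graph.add_edge(START, "router")
graph.add_conditional_edges(
    "router",
    route,
    {"query_knowledge_base": "query_knowledge_base", "report_writer": "report_writer"},
)
graph.add_edge("query_knowledge_base", "report_writer")  # retrieved context feeds the writer
graph.add_edge("report_writer", END)

app = graph.compile()
print(app.invoke({"question": "Summarize the key risks in the Q3 earnings call"}))
```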

Solution Overview

This solution centers on a financial management AI assistant designed to help analysts query portfolios, analyze companies, and generate reports. At its heart is Amazon Nova, a multimodal LLM that serves as the reasoning engine. Amazon Nova processes text, images, or documents (such as earnings call slides) and dynamically determines which tools to employ to fulfill requests. Optimized for enterprise tasks, it supports function calling, so the model can plan actions and call tools systematically. With a large context window (up to 300,000 tokens with Amazon Nova Lite and Amazon Nova Pro), it can effectively manage lengthy documents or conversation histories during reasoning.
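
The sketch below shows what this function calling flow can look like with the Amazon Bedrock Converse API. The tool name and schema are hypothetical, the stock lookup is stubbed, and the model ID should be adjusted to the Nova model enabled in your account and Region.

```python
# Hedged sketch: Amazon Nova function calling via the Bedrock Converse API.
import boto3

bedrock = boto3.client("bedrock-runtime")  # assumes credentials/Region are configured
MODEL_ID = "amazon.nova-pro-v1:0"          # adjust to the Nova model ID available in your Region

tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "stock_search",  # hypothetical tool
            "description": "Look up the latest price for a stock ticker.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"ticker": {"type": "string"}},
                "required": ["ticker"],
            }},
        }
    }]
}

def stock_search(ticker: str) -> dict:
    return {"ticker": ticker, "price": 182.34}  # stubbed market data

messages = [{"role": "user", "content": [{"text": "What is Amazon trading at right now?"}]}]
response = bedrock.converse(modelId=MODEL_ID, messages=messages, toolConfig=tool_config)

# If the model decides a tool is needed, run it and send the result back.
if response["stopReason"] == "tool_use":
    assistant_msg = response["output"]["message"]
    messages.append(assistant_msg)
    for block in assistant_msg["content"]:
        if "toolUse" in block:
            tool_use = block["toolUse"]
            result = stock_search(**tool_use["input"])
            messages.append({"role": "user", "content": [{
                "toolResult": {
                    "toolUseId": tool_use["toolUseId"],
                    "content": [{"json": result}],
                }
            }]})
    response = bedrock.converse(modelId=MODEL_ID, messages=messages, toolConfig=tool_config)

print(response["output"]["message"]["content"][0]["text"])
```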

The workflow encompasses the following key components:

  • Knowledge Base Retrieval – Both the earnings call audio file and PowerPoint presentation are processed by Amazon Bedrock Data Automation, a managed service that extracts text, transcribes audio and video, and prepares data for analysis. If a user uploads a PowerPoint file, the system converts each slide into an image (PNG) for efficient search and analysis, a method inspired by generative AI applications like Manus. Amazon Bedrock Data Automation effectively functions as a multimodal AI pipeline from the outset. In our architecture, it serves as a bridge between raw data and the agentic workflow. Subsequently, Amazon Bedrock Knowledge Bases converts these extracted chunks into vector embeddings using Amazon Titan Text Embeddings V2 and stores them in an Amazon OpenSearch Serverless database.
  • Router Agent – When a user poses a question—for example, “Summarize the key risks in this Q3 earnings report”—Amazon Nova first assesses whether the task necessitates retrieving data, processing a file, or generating a response. It retains memory of the dialogue, interprets the user’s request, and strategizes the necessary actions to fulfill it. The “Memory & Planning” module in the solution diagram indicates that the router agent can utilize conversation history and chain-of-thought (CoT) prompting to navigate next steps. Importantly, the router agent discerns if the query can be addressed using internal company data or if it needs external information and tools.
  • Multimodal RAG Agent – For queries pertaining to audio and video information, Amazon Bedrock Data Automation employs a unified API call to extract insights from multimedia data, storing these insights in Amazon Bedrock Knowledge Bases. Amazon Nova accesses Amazon Bedrock Knowledge Bases to retrieve factual answers through semantic search, ensuring responses are grounded in actual data, thus minimizing the risk of hallucination. If Amazon Nova generates an answer, a secondary hallucination check cross-references the response against trusted sources to catch unsupported claims. (See the retrieval sketch after this list.)
  • Hallucination Check (Quality Gate) – To enhance reliability, the workflow may incorporate a post-processing step using a different foundation model (FM) outside of the Amazon Nova family, such as Anthropic’s Claude, Mistral, or Meta’s Llama, to evaluate the answer’s accuracy. For instance, after Amazon Nova produces a response, a hallucination detector model or function can compare the answer against the retrieved sources or established facts. If a potential hallucination is detected (the answer isn’t supported by the reference data), the agent can opt to perform additional retrieval, modify the response, or escalate to a human for review. (A sketch of this check follows after this list.)
  • Multi-tool Collaboration – This capability lets the AI not only locate information but also take actions before formulating a final answer. The supervisor agent can initiate or coordinate multiple tool-specific agents (for instance, a web search agent for general inquiries, a stock search agent for market data, or other specialized agents), following the same function calling pattern shown earlier.
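
To illustrate the retrieval path used by the multimodal RAG agent, the following sketch queries an Amazon Bedrock knowledge base (populated by Amazon Bedrock Data Automation) and asks Amazon Nova to answer only from the retrieved passages. The knowledge base ID and model ID are placeholders; adjust them to your environment.

```python
# Hedged sketch: semantic retrieval from a Bedrock knowledge base, then a
# grounded answer from Amazon Nova using only the retrieved context.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")
bedrock = boto3.client("bedrock-runtime")

KNOWLEDGE_BASE_ID = "KB12345678"   # placeholder: your knowledge base ID
MODEL_ID = "amazon.nova-pro-v1:0"  # adjust to the Nova model enabled in your Region

question = "Summarize the key risks discussed in the Q3 earnings call."

# Semantic search over the vector store behind the knowledge base.
retrieval = agent_runtime.retrieve(
    knowledgeBaseId=KNOWLEDGE_BASE_ID,
    retrievalQuery={"text": question},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)
passages = [r["content"]["text"] for r in retrieval["retrievalResults"]]

# Ask Nova to answer strictly from the retrieved context.
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n---\n".join(passages) + f"\n\nQuestion: {question}"
)
response = bedrock.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)
answer = response["output"]["message"]["content"][0]["text"]
print(answer)
```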

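The hallucination check can likewise be implemented as a second model call. The following sketch assumes a non-Nova grader model (here one of Anthropic’s Claude models) is enabled in your account and reuses the answer and passages variables from the previous sketch; the grader model ID and verdict format are illustrative, not prescriptive.

```python
# Hedged sketch: a second FM grades whether the draft answer is supported
# by the retrieved passages before the workflow returns it to the user.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
GRADER_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # any non-Nova FM enabled in your account

def check_hallucination(answer: str, passages: list[str]) -> dict:
    prompt = (
        "You are a fact-checking assistant. Given source passages and a draft answer, "
        'reply with JSON of the form {"supported": true|false, "unsupported_claims": [...]}.\n\n'
        "Passages:\n" + "\n---\n".join(passages) + f"\n\nDraft answer:\n{answer}"
    )
    response = bedrock.converse(
        modelId=GRADER_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    verdict_text = response["output"]["message"]["content"][0]["text"]
    try:
        return json.loads(verdict_text)
    except json.JSONDecodeError:
        # Grader did not return clean JSON; treat as unverified and escalate.
        return {"supported": False, "unsupported_claims": ["<could not parse grader output>"]}

# verdict = check_hallucination(answer, passages)
# If verdict["supported"] is False, the workflow can re-retrieve, revise, or escalate to a human.
```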