Hybrid Search with Amazon OpenSearch Service | AWS Big Data Blog

Amazon OpenSearch Service has long provided robust support for lexical search, and for vector search through its k-nearest neighbors (k-NN) plugin. By using OpenSearch Service as a vector database, you can combine the benefits of lexical and vector search. The neural search feature introduced in OpenSearch Service 2.9 further simplifies integration with artificial intelligence (AI) and machine learning (ML) models, making it easier to apply semantic search techniques.

For decades, lexical search methods such as TF-IDF and BM25 have been the backbone of search systems. These traditional algorithms match user queries against the exact words or phrases found in your documents. Lexical search provides exact matches with low latency and highly interpretable results, and it generalizes well across domains. However, it often misses the context or meaning behind the words, which can lead to irrelevant results.

More recently, semantic search techniques based on vector embeddings have gained traction as a way to improve search relevance. Semantic search takes a context-aware approach, enabling a better understanding of the natural language queries posed by users. However, embedding-based semantic search typically requires fine-tuning the ML model for the target domain (such as healthcare or retail) and consumes more memory than basic lexical search.

Both lexical and semantic search come with their unique advantages and disadvantages. The combination of these two approaches enhances search result quality by leveraging the strengths of each in a hybrid model. OpenSearch Service 2.11 now provides built-in hybrid query capabilities, making it simple to implement a hybrid search model that integrates both lexical and semantic search.

This article will delve into the mechanics of hybrid search and guide you through building a hybrid search solution using OpenSearch Service. We will conduct experiments with sample queries to explore and compare lexical, semantic, and hybrid search methodologies. All code utilized in this post is publicly available on GitHub.

Hybrid Search with OpenSearch Service

In essence, hybrid search that merges lexical and semantic search consists of the following steps:

  1. Execute a semantic and lexical search using a compound search query clause.
  2. Each query type returns scores on different scales. For instance, a Lucene lexical search query will yield scores ranging from 1 to infinity, while a semantic query using the Faiss engine will produce scores between 0 and 1. Thus, normalization of scores from each query type is necessary to align them on a common scale before combining. In a distributed search engine, this normalization must occur at the global level rather than at the shard or node level.
  3. Once the scores are normalized, they can be combined for each document (a minimal sketch of this normalize-and-combine arithmetic follows this list).
  4. Finally, reorder the documents based on the new combined score and present them as the query response.
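To make Steps 2 and 3 concrete, here is a minimal Python sketch of min-max normalization followed by an arithmetic-mean combination. The raw scores are made-up values for illustration; OpenSearch Service performs this arithmetic for you inside the search pipeline, as described later in this post.

```python
def min_max_normalize(scores):
    """Scale raw scores to the [0, 1] range (min-max normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]  # all scores equal; avoid division by zero
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical raw scores for the same three documents from each subquery
bm25_scores = [12.4, 7.1, 2.3]   # lexical (BM25) scores: unbounded scale
knn_scores = [0.92, 0.35, 0.88]  # semantic (k-NN) scores: already in [0, 1]

bm25_norm = min_max_normalize(bm25_scores)
knn_norm = min_max_normalize(knn_scores)

# Combine the normalized scores with an arithmetic mean, then re-rank
combined = [(b + k) / 2 for b, k in zip(bm25_norm, knn_norm)]
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(combined, ranking)
```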

Before OpenSearch Service 2.11, search practitioners had to rely on compound query types to combine lexical and semantic searches. However, this method did not resolve the challenge of global score normalization as described in Step 2.
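For illustration, a pre-2.11 combination might have looked like the following bool query sketch, which sums the raw scores of a lexical match clause and a semantic neural clause without any global normalization. The field names and model ID are placeholders, not values from this post's dataset.

```python
# Sketch of the pre-2.11 approach: raw BM25 and k-NN scores are added as-is
legacy_query = {
    "query": {
        "bool": {
            "should": [
                # Lexical clause (BM25)
                {"match": {"caption": {"query": "womens shoes"}}},
                # Semantic clause (k-NN via the neural query type)
                {
                    "neural": {
                        "caption_embedding": {
                            "query_text": "womens shoes",
                            "model_id": "<embedding-model-id>",  # placeholder
                            "k": 10,
                        }
                    }
                },
            ]
        }
    }
}
```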

With the introduction of version 2.11, OpenSearch Service now supports hybrid queries through a score normalization processor within search pipelines. This innovation alleviates the burden of building score normalization and combination outside of your OpenSearch Service domain. Search pipelines operate within the OpenSearch Service domain and include three types of processors: search request processor, search response processor, and search phase results processor.
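As a minimal sketch, the following request creates such a search pipeline with the normalization processor. The endpoint, credentials, pipeline name, normalization technique, and subquery weights are assumptions for illustration, not values prescribed by this post.

```python
import requests

host = "https://localhost:9200"  # placeholder: your OpenSearch Service endpoint
auth = ("admin", "admin")        # placeholder credentials

pipeline = {
    "description": "Post-processor for hybrid search",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    # Optional: bias the combined score toward a subquery
                    "parameters": {"weights": [0.3, 0.7]},
                },
            }
        }
    ],
}

resp = requests.put(
    f"{host}/_search/pipeline/nlp-search-pipeline",
    json=pipeline,
    auth=auth,
)
print(resp.json())
```

The weights array lets you favor the semantic subquery (0.7 here) over the lexical one (0.3); tuning these weights is typically an empirical exercise against your own relevance judgments.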

In a hybrid search, the search phase results processor executes between the query phase and fetch phase at the coordinator node (global) level. The following diagram illustrates this workflow.

The hybrid search workflow in OpenSearch Service involves the following phases:

  • Query phase: This initial phase of a search request sees each shard in your index executing the search query locally, returning document IDs that match the search request along with relevance scores for each document.
  • Score normalization and combination: The search phase results processor runs between the query and fetch phases. It uses the normalization processor to bring the scores from the BM25 and k-NN subqueries onto a common scale, supporting the min_max and L2 normalization techniques. It then combines the normalized scores using arithmetic_mean, geometric_mean, or harmonic_mean, compiles the final ranked list of document IDs, and hands it off to the fetch phase (a sample hybrid query that uses this processing is shown after this list).
  • Fetch phase: The concluding phase is the fetch phase, where the coordinator node retrieves the documents that correspond to the finalized ranked list and returns the results of the search query.
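Putting the phases together, the following sketch issues a hybrid query through the pipeline created earlier. The index name, field names, and model ID are placeholders; the hybrid clause bundles a lexical match subquery and a semantic neural subquery, and the search_pipeline parameter routes the results through the normalization processor.

```python
import requests

host = "https://localhost:9200"  # placeholder: your OpenSearch Service endpoint
auth = ("admin", "admin")        # placeholder credentials

query = {
    "_source": {"excludes": ["caption_embedding"]},  # skip the bulky vectors
    "query": {
        "hybrid": {
            "queries": [
                # Lexical subquery (BM25)
                {"match": {"caption": {"query": "womens shoes"}}},
                # Semantic subquery (k-NN via the neural query type)
                {
                    "neural": {
                        "caption_embedding": {
                            "query_text": "womens shoes",
                            "model_id": "<embedding-model-id>",  # placeholder
                            "k": 10,
                        }
                    }
                },
            ]
        }
    },
}

resp = requests.get(
    f"{host}/images/_search",
    params={"search_pipeline": "nlp-search-pipeline"},
    json=query,
    auth=auth,
)
print(resp.json())
```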

Solution Overview

In this post, we will create a web application that enables searching through a sample image dataset in the retail sector, utilizing a hybrid search system powered by OpenSearch Service. Imagine that the web application represents a retail shop where you, as a consumer, need to run queries to find women’s shoes.

For the hybrid search, you will merge a lexical and semantic search query against the text captions of images within the dataset. The high-level architecture of the end-to-end search application is depicted in the following figure.

The workflow consists of these steps:

  1. Using an Amazon SageMaker notebook, you will index image captions and image URLs from the Amazon Berkeley Objects Dataset stored in Amazon Simple Storage Service (Amazon S3) into OpenSearch Service via the OpenSearch ingest pipeline. This dataset features 147,702 product listings with multilingual metadata and 398,212 unique catalog images. For demonstration purposes, approximately 1,600 products will be utilized, focusing only on item images and names in US English.
  2. OpenSearch Service interacts with the embedding model hosted in SageMaker to generate vector embeddings for the image captions. You will employ the GPT-J 6B embedding model, which produces 4,096-dimensional vectors.
  3. You can then input your search query into the web application hosted on an Amazon Elastic Compute Cloud (Amazon EC2) instance (c5.large). The application client triggers the hybrid query in OpenSearch Service.
  4. OpenSearch Service calls the SageMaker embedding model to generate vector embeddings for the search query (a direct-invocation sketch follows this list).
  5. Finally, OpenSearch Service executes the hybrid query, integrating the semantic and lexical search scores for the documents, and sends the search results back to the EC2 application client.
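To illustrate Steps 2 and 4, the following sketch invokes a SageMaker embedding endpoint directly with boto3 (in the real flow, OpenSearch Service makes this call through the ML connector). The endpoint name is a placeholder, and the request and response shapes assume the SageMaker JumpStart GPT-J 6B embedding model.

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

resp = runtime.invoke_endpoint(
    EndpointName="<embedding-endpoint-name>",  # placeholder
    ContentType="application/json",
    Body=json.dumps({"text_inputs": ["womens shoes"]}),
)

# Assumed JumpStart response shape: {"embedding": [[...4096 floats...]]}
embedding = json.loads(resp["Body"].read())["embedding"][0]
print(len(embedding))  # expected: 4096
```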

Let’s delve deeper into Steps 1, 2, 4, and 5.

Step 1: Data Ingestion into OpenSearch

In Step 1, you create an ingest pipeline in OpenSearch Service that uses the text_embedding processor to generate vector embeddings for the image captions. After defining a k-NN index that uses this ingest pipeline, you run a bulk index operation to store your data in the k-NN index. In this solution, you index only the image URLs, the text captions, and the caption embeddings, where the caption embedding field is of type knn_vector.
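The following sketch shows that setup: an ingest pipeline using the text_embedding processor, then a k-NN index that runs it as its default pipeline. The endpoint, credentials, pipeline and index names, model ID, and method settings are assumptions for illustration.

```python
import requests

host = "https://localhost:9200"  # placeholder: your OpenSearch Service endpoint
auth = ("admin", "admin")        # placeholder credentials

# Ingest pipeline: the text_embedding processor calls the model registered in
# OpenSearch (backed by the SageMaker connector) to embed each caption
ingest_pipeline = {
    "description": "Embed image captions at ingest time",
    "processors": [
        {
            "text_embedding": {
                "model_id": "<embedding-model-id>",  # placeholder
                "field_map": {"caption": "caption_embedding"},
            }
        }
    ],
}
requests.put(f"{host}/_ingest/pipeline/caption-pipeline", json=ingest_pipeline, auth=auth)

# k-NN index whose default pipeline embeds every incoming document
index_body = {
    "settings": {"index.knn": True, "default_pipeline": "caption-pipeline"},
    "mappings": {
        "properties": {
            "image_url": {"type": "text"},
            "caption": {"type": "text"},
            "caption_embedding": {
                "type": "knn_vector",
                "dimension": 4096,  # matches the GPT-J 6B embedding size
                "method": {"name": "hnsw", "space_type": "l2", "engine": "faiss"},
            },
        }
    },
}
requests.put(f"{host}/images", json=index_body, auth=auth)
```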

Steps 2 and 4: OpenSearch Service Calls the SageMaker Embedding Model

During these steps, OpenSearch Service uses the SageMaker ML connector to generate embeddings for both the image captions and the search query. The blue box in the preceding architecture diagram represents the integration of OpenSearch Service with SageMaker using the ML connector feature, which is available in OpenSearch Service starting from version 2.9.
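The connector itself is created through the ML Commons connector API. The following sketch shows the general shape; the role ARN, Region, endpoint URL, and request body template are deployment-specific placeholders, and after creation the returned connector_id is used to register and deploy the model in OpenSearch.

```python
import requests

host = "https://localhost:9200"  # placeholder: your OpenSearch Service endpoint
auth = ("admin", "admin")        # placeholder credentials

connector = {
    "name": "sagemaker-embedding-connector",
    "description": "Connector to a SageMaker-hosted embedding model",
    "version": 1,
    "protocol": "aws_sigv4",
    "credential": {"roleArn": "arn:aws:iam::<account-id>:role/<connector-role>"},
    "parameters": {"region": "us-east-1", "service_name": "sagemaker"},
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "headers": {"content-type": "application/json"},
            # Placeholder SageMaker runtime URL for your endpoint
            "url": "https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/<endpoint-name>/invocations",
            # Template mapping the model input to the endpoint's request format
            "request_body": '{"text_inputs": "${parameters.inputs}"}',
        }
    ],
}

resp = requests.post(f"{host}/_plugins/_ml/connectors/_create", json=connector, auth=auth)
print(resp.json())  # contains the connector_id for model registration
```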
