Amazon OpenSearch Service has long supported both lexical and semantic search, the latter made possible through its k-nearest neighbors (k-NN) plugin. By using OpenSearch Service as a vector database, you can combine the benefits of lexical and vector search methods. The neural search feature introduced in OpenSearch Service 2.9 further streamlines integration with artificial intelligence (AI) and machine learning (ML) models, making semantic search easier to implement.
Lexical search, which uses algorithms like TF/IDF or BM25, has been the backbone of search systems for decades. These traditional algorithms match user queries against the exact words or phrases found in documents. Lexical search excels at exact matches, offers low latency, and produces results that are easy to interpret. However, it often misses the context and meaning behind words, which can lead to irrelevant results.
In recent years, semantic search techniques based on vector embeddings have gained traction as a way to improve search. Semantic search takes a more context-aware approach and better understands user queries expressed in natural language. However, semantic search powered by vector embeddings requires fine-tuning the ML model for specific domains (such as healthcare or retail) and consumes more memory than basic lexical search.
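To make the contrast concrete, the following is a minimal sketch of the two query types as OpenSearch request bodies, sent here with Python's requests library. The domain endpoint, credentials, index name (products), field names (caption, caption_embedding), and model ID are illustrative assumptions, not values taken from this walkthrough.

```python
import requests

ENDPOINT = "https://<your-domain-endpoint>"  # assumption: your OpenSearch Service domain
AUTH = ("<user>", "<password>")              # assumption: basic authentication is enabled

# Lexical search: BM25 scores documents by matching the literal query terms.
lexical_query = {"query": {"match": {"caption": "women's shoes"}}}

# Semantic search: the neural query embeds the query text with a deployed
# ML model and runs a k-NN search over the vector field.
semantic_query = {
    "query": {
        "neural": {
            "caption_embedding": {
                "query_text": "women's shoes",
                "model_id": "<model_id>",  # assumption: ID of your embedding model
                "k": 5,
            }
        }
    }
}

for body in (lexical_query, semantic_query):
    response = requests.post(f"{ENDPOINT}/products/_search", json=body, auth=AUTH)
    print(response.json()["hits"]["hits"])
```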
Both lexical and semantic searches come with their unique advantages and drawbacks. Merging these two approaches enhances search result quality by capitalizing on their strengths within a hybrid model. OpenSearch Service 2.11 now provides out-of-the-box hybrid query functionalities, making it easier to implement a hybrid search model that integrates both lexical and semantic search.
This article delves into the inner workings of hybrid search and guides you through building a hybrid search solution using OpenSearch Service. We will experiment with sample queries to compare lexical, semantic, and hybrid search. All code referenced in this post is publicly accessible in the GitHub repository.
Hybrid Search with OpenSearch Service
Generally, hybrid search combines lexical and semantic methods through the following steps:
1. Run a semantic and a lexical search using a compound search query clause.
2. Each query type provides scores on a different scale. For example, a Lucene lexical search query returns unbounded scores starting at 0, while a semantic query using the Faiss engine returns scores between 0 and 1. Therefore, you need to normalize the scores from each query type onto a common scale before combining them. In a distributed search engine, normalization must happen at the global level, not at the shard or node level.
3. After normalization, the scores are combined for each document (a minimal sketch of this arithmetic follows this list).
4. Finally, the documents are reordered based on the combined score, and the results are returned in response to the query.
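The normalization and combination in Steps 2 and 3 amount to simple arithmetic. Here is a minimal, self-contained sketch of min-max normalization followed by an arithmetic-mean combination; the per-document scores are made-up values for illustration only.

```python
# Hypothetical per-document scores from each subquery.
bm25_scores = {"doc1": 12.4, "doc2": 7.8, "doc3": 3.1}   # unbounded BM25 scores
knn_scores = {"doc1": 0.62, "doc2": 0.91, "doc3": 0.55}  # Faiss scores in [0, 1]

def min_max_normalize(scores):
    """Rescale scores to [0, 1] so the subqueries become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {doc: (score - lo) / span for doc, score in scores.items()}

norm_bm25 = min_max_normalize(bm25_scores)
norm_knn = min_max_normalize(knn_scores)

# Arithmetic-mean combination: one final score per document.
combined = {doc: (norm_bm25[doc] + norm_knn[doc]) / 2 for doc in norm_bm25}

# Reorder documents by the combined score.
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)  # ['doc2', 'doc1', 'doc3']
```

In practice, OpenSearch Service performs these steps for you at the coordinator node level, as described next.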
Before OpenSearch Service 2.11, search practitioners had to use compound query types to combine lexical and semantic searches. However, that approach did not address the global score normalization problem described in Step 2.
OpenSearch Service 2.11 adds support for hybrid queries through a score normalization processor in search pipelines. Search pipelines run inside the OpenSearch Service domain, removing the need to build score normalization and combination outside of it. They support three types of processors: the search request processor, the search response processor, and the search phase results processor.
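As a sketch, such a pipeline can be created with a single API call. The pipeline name and subquery weights below are illustrative assumptions; min_max normalization and arithmetic_mean combination are among the techniques described later in this post.

```python
import requests

ENDPOINT = "https://<your-domain-endpoint>"  # assumption: your OpenSearch Service domain
AUTH = ("<user>", "<password>")              # assumption: basic authentication is enabled

pipeline = {
    "description": "Post-processor for hybrid search",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    # Optional subquery weights (lexical, semantic); must sum to 1.0.
                    "parameters": {"weights": [0.3, 0.7]},
                },
            }
        }
    ],
}

response = requests.put(
    f"{ENDPOINT}/_search/pipeline/nlp-search-pipeline", json=pipeline, auth=AUTH
)
print(response.json())
```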
In a hybrid search, the search phase results processor functions between the query phase and the fetch phase at the coordinator node (global) level. The hybrid search workflow in OpenSearch Service comprises the following phases:
- Query Phase: The first phase of a search request, where each shard in your index executes the search query locally and returns the document IDs that match the search request along with relevance scores for each document.
- Score Normalization and Combination: The search phase results processor runs between the query and fetch phases. It uses the normalization processor to normalize the scores returned by the BM25 and k-NN subqueries, supporting the min_max and l2 (L2 norm) normalization techniques. It then combines the normalized scores for each document using the arithmetic_mean, geometric_mean, or harmonic_mean technique, compiles a single ranked list of document IDs, and passes it to the fetch phase.
- Fetch Phase: The final phase is where the coordinator node retrieves the documents that correspond to the final ranked list and returns the search query results.
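Putting the phases together, a hybrid query wraps the lexical and neural subqueries in a single request and names the search pipeline that should normalize and combine their scores. The following sketch reuses the illustrative endpoint, index, field names, model ID, and pipeline name from the earlier examples.

```python
import requests

ENDPOINT = "https://<your-domain-endpoint>"  # assumption
AUTH = ("<user>", "<password>")              # assumption

hybrid_query = {
    "_source": {"excludes": ["caption_embedding"]},  # keep the response compact
    "query": {
        "hybrid": {
            "queries": [
                # Subquery 1: lexical BM25 match on the caption text.
                {"match": {"caption": {"query": "women's shoes"}}},
                # Subquery 2: semantic k-NN search over the caption embeddings.
                {
                    "neural": {
                        "caption_embedding": {
                            "query_text": "women's shoes",
                            "model_id": "<model_id>",  # assumption
                            "k": 5,
                        }
                    }
                },
            ]
        }
    },
}

# The search_pipeline parameter applies the normalization processor created earlier.
response = requests.post(
    f"{ENDPOINT}/products/_search",
    params={"search_pipeline": "nlp-search-pipeline"},
    json=hybrid_query,
    auth=AUTH,
)
print(response.json()["hits"]["hits"])
```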
Solution Overview
In this article, we will develop a web application that enables users to search through a sample image dataset within the retail sector, powered by a hybrid search system utilizing OpenSearch Service. For our example, let’s consider that the web application is for a retail store and you, as a consumer, are looking to search for women’s shoes.
For the hybrid search, you will combine a lexical and semantic search query targeting the text captions associated with images in the dataset. The high-level architecture of the end-to-end search application is depicted in the following image.
The workflow encompasses these steps:
1. You use an Amazon SageMaker notebook to index image captions and URLs from the Amazon Berkeley Objects Dataset, stored in Amazon Simple Storage Service (Amazon S3), into OpenSearch Service using an OpenSearch ingest pipeline. The dataset contains 147,702 product listings with multilingual metadata and 398,212 unique catalog images; for demonstration purposes, we use approximately 1,600 products.
2. OpenSearch Service calls the embedding model hosted in SageMaker to generate vector embeddings for the image captions. The GPT-J-6B variant embedding model generates 4,096-dimensional vectors.
3. You enter your search query in the web application hosted on an Amazon Elastic Compute Cloud (Amazon EC2) instance (c5.large). The application client triggers the hybrid query in OpenSearch Service.
4. OpenSearch Service calls the SageMaker embedding model to generate vector embeddings for the search query.
5. OpenSearch Service runs the hybrid query, combines the semantic and lexical search scores for the documents, and returns the search results to the EC2 application client.
Let’s delve into Steps 1, 2, 4, and 5 in greater detail.
Step 1: Data Ingestion into OpenSearch
In this step, you create an ingest pipeline in OpenSearch Service using the text_embedding processor to generate vector embeddings for the image captions. After defining a k-NN index that uses the ingest pipeline, you run a bulk index operation to store your data in it. In this solution, you index only the image URLs, text captions, and caption embeddings, with the caption embeddings field mapped to the knn_vector field type.
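A minimal sketch of this step follows. The pipeline name, index name, and field names are illustrative assumptions; the text_embedding processor and knn_vector field type are the OpenSearch features described above, and the 4,096 dimension matches the GPT-J-6B embedding model used in this solution.

```python
import requests

ENDPOINT = "https://<your-domain-endpoint>"  # assumption
AUTH = ("<user>", "<password>")              # assumption

# 1. Ingest pipeline: generate caption embeddings at index time.
ingest_pipeline = {
    "description": "Embed image captions",
    "processors": [
        {
            "text_embedding": {
                "model_id": "<model_id>",  # assumption: deployed embedding model
                "field_map": {"caption": "caption_embedding"},
            }
        }
    ],
}
requests.put(f"{ENDPOINT}/_ingest/pipeline/nlp-ingest-pipeline",
             json=ingest_pipeline, auth=AUTH)

# 2. k-NN index that routes every document through the ingest pipeline.
index_body = {
    "settings": {"index.knn": True, "default_pipeline": "nlp-ingest-pipeline"},
    "mappings": {
        "properties": {
            "image_url": {"type": "text"},
            "caption": {"type": "text"},
            "caption_embedding": {
                "type": "knn_vector",
                "dimension": 4096,  # GPT-J-6B embedding size
                "method": {"engine": "faiss", "name": "hnsw", "space_type": "l2"},
            },
        }
    },
}
requests.put(f"{ENDPOINT}/products", json=index_body, auth=AUTH)

# 3. Bulk index a sample document; the pipeline adds caption_embedding automatically.
bulk_body = (
    '{"index": {"_index": "products"}}\n'
    '{"image_url": "https://<bucket>/shoe.jpg", "caption": "red leather women\'s shoes"}\n'
)
requests.post(f"{ENDPOINT}/_bulk", data=bulk_body, auth=AUTH,
              headers={"Content-Type": "application/x-ndjson"})
```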
Steps 2 and 4: OpenSearch Service Calls the SageMaker Embedding Model
During these steps, OpenSearch Service utilizes the SageMaker ML connector to generate embeddings for the image captions and the search query. The blue box in the preceding architecture diagram illustrates the integration between OpenSearch Service and SageMaker via the ML connector feature, which has been available since version 2.9.
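As a sketch, registering such a connector is a single call to the ML Commons connector API. The connector name, IAM role ARN, Region, SageMaker endpoint name, and request body format below are placeholder assumptions you would replace with values matching your own deployment.

```python
import requests

ENDPOINT = "https://<your-domain-endpoint>"  # assumption
AUTH = ("<user>", "<password>")              # assumption

connector = {
    "name": "sagemaker-embedding-connector",
    "description": "Connector to the SageMaker-hosted embedding model",
    "version": 1,
    "protocol": "aws_sigv4",
    # assumption: an IAM role that allows OpenSearch Service to invoke the endpoint
    "credential": {"roleArn": "arn:aws:iam::<account-id>:role/<connector-role>"},
    "parameters": {"region": "us-east-1", "service_name": "sagemaker"},
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "headers": {"content-type": "application/json"},
            # assumption: the SageMaker inference endpoint hosting the embedding model
            "url": "https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/<endpoint-name>/invocations",
            "request_body": '["${parameters.inputs}"]',
        }
    ],
}

response = requests.post(f"{ENDPOINT}/_plugins/_ml/connectors/_create",
                         json=connector, auth=AUTH)
print(response.json())  # returns a connector_id used to register the remote model
```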