Recently, I've been developing a feature to enable semantic search on user-uploaded images. Our application allows users to upload and describe their own images, and we wanted to enhance their ability to find the most relevant content within their collections.
Hybrid search, which combines traditional keyword search with dense vector (semantic) search, offers a powerful way to identify the most pertinent search results. However, implementing this feature presented a unique challenge: how to perform hybrid search in MongoDB while restricting queries to only the user's own images.
In this blog post, I'll guide you through the steps required to set up hybrid search in MongoDB, addressing both the semantic search capabilities and the user-specific filtering requirements. Whether you're looking to implement similar functionality or simply curious about advanced MongoDB search techniques, this guide will provide you with practical insights and solutions.
If you are new to vector search with Mongo, here are a couple of links to previous posts I put together, which explain what vector search is (https://dev.to/shannonlal/building-blocks-for-hybrid-search-combining-keyword-and-semantic-search-236k) and how to implement basic hybrid searching in Mongo (https://dev.to/shannonlal/navigating-hybrid-search-with-mongodb-ugly-approach-1p9g). If you are up to speed on semantic search with Mongo, let's dive in and see how we can leverage MongoDB's features to create a personalized, intelligent search experience for your users.
In the following example, we will create a collection in Mongo called user_image, which stores image descriptions with the following key fields:
- description: A string describing the image
- descriptionValues: An array of numbers representing the vector embedding of the description
- userId: A string identifying the user who uploaded the image
- url: A string link to where the image is stored
- deleted: A boolean flag that indicates that the image has been deleted
- createdAt: A timestamp for when the image was created
- updatedAt: A timestamp for when the image or description was last updated
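To make the field list concrete, here is a minimal sketch of what a single user_image document might look like. The helper name makeUserImage and all of the values are made up for illustration; the only detail taken from this post is the embedding length of 1536, which matches the numDimensions used by the vector index.

```javascript
// Hypothetical helper (not from the original code base): builds a
// user_image document with the fields listed above.
function makeUserImage({ description, descriptionValues, userId, url }, now = new Date()) {
  return {
    description,
    descriptionValues, // embedding of `description`, one number per dimension
    userId,
    url,
    deleted: false, // new uploads start out as not deleted
    createdAt: now,
    updatedAt: now,
  };
}

// Illustrative values only; a real embedding would come from your embedding model.
const example = makeUserImage({
  description: 'A golden retriever catching a frisbee on the beach',
  descriptionValues: new Array(1536).fill(0),
  userId: 'user-123',
  url: 'https://example.com/images/dog.jpg',
});
```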
Our goal is to perform a hybrid search that combines the power of vector similarity and text matching while ensuring that users only see results from their own uploads.
Step 1: Creating a Vector Index
First, we need to create a vector index to support our vector search. Here's the index definition:
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "descriptionValues",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "userId",
      "type": "filter"
    },
    {
      "path": "deleted",
      "type": "filter"
    }
  ]
}
This index allows us to perform vector searches on the descriptionValues field while also enabling pre-filtering on the userId and deleted fields. Only fields declared with the filter type can be used in the $vectorSearch filter.
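If you would rather create the index from code than from the Atlas UI, the following is a minimal sketch. It assumes that `collection` is a Collection object from a recent version of the official mongodb Node.js driver, whose createSearchIndex method accepts a type of 'vectorSearch'. The function name createVectorIndex is hypothetical, and the default index name is taken from the hybrid query later in this post.

```javascript
// Same definition as the JSON above, expressed as a JavaScript object.
const vectorIndexDefinition = {
  fields: [
    { type: 'vector', path: 'descriptionValues', numDimensions: 1536, similarity: 'cosine' },
    { type: 'filter', path: 'userId' },
    { type: 'filter', path: 'deleted' },
  ],
};

// Hypothetical helper. `collection` is assumed to be a driver Collection
// for user_image; the call resolves to the name of the created index.
async function createVectorIndex(collection, name = 'image_description_vector_index') {
  return collection.createSearchIndex({
    name,
    type: 'vectorSearch',
    definition: vectorIndexDefinition,
  });
}
```

Index builds are asynchronous on Atlas, so the index may not be queryable immediately after this call returns.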
Step 2: Creating a Search Index
Next, we create a search index to support text searches:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "description": {
        "analyzer": "lucene.standard",
        "type": "string"
      },
      "deleted": {
        "type": "boolean"
      },
      "userId": {
        "analyzer": "lucene.keyword",
        "type": "string"
      }
    }
  },
  "storedSource": {
    "include": [
      "description",
      "userId",
      "url",
      "createdAt",
      "updatedAt"
    ]
  }
}
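This index enables text searches on the description field and allows filtering on the userId and deleted fields. The lucene.keyword analyzer indexes each userId as a single term, so the filter only matches exact IDs.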
Step 3: Implementing the Vector Search
Now, let's look at the MongoDB aggregation pipeline that performs the vector search with filtering:
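// Parameters passed in:
// - embedding: embedding of the query
// - userId: ID of the user running the search
// - limit: maximum number of results returned
// - cursor: offset used for pagination
return [
  {
    $vectorSearch: {
      index: 'image_description_vector_index',
      path: 'descriptionValues',
      queryVector: embedding,
      numCandidates: limit,
      limit: limit,
      // Pre-filter: only the user's own, non-deleted images are considered
      filter: {
        userId: userId,
        deleted: false,
      },
    },
  },
  {
    // Tag each result with its source and expose the similarity score
    $addFields: {
      searchType: 'vector',
      searchScore: { $meta: 'vectorSearchScore' },
    },
  },
  {
    $sort: { searchScore: -1, createdAt: -1 },
  },
  // Note: $vectorSearch returns at most `limit` documents,
  // so a non-zero skip shortens the page
  { $skip: cursor ? parseInt(cursor, 10) : 0 },
  { $limit: limit },
  {
    $project: {
      _id: 1,
      description: 1,
      url: 1,
      userId: 1,
      createdAt: 1,
      updatedAt: 1,
      searchType: 1,
      searchScore: 1,
    },
  },
];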
The aggregation above takes the following inputs:
- embedding: The embedding of the search term
- userId: The ID of the user whose images should be searched
- limit: The maximum number of results to return, which can also serve as the page size for pagination
- cursor: An optional offset that skips ahead in the results, which can be useful for pagination
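Two details of this pipeline are worth a closer look. $vectorSearch returns at most `limit` documents, so skipping past them leaves nothing for later pages, and setting numCandidates equal to limit gives the approximate search very little room to find good matches. The sketch below is one possible variant, not the code we run in production: the function name buildVectorSearchPipeline, the oversampling factor of 10, and the offset-style cursor are all assumptions chosen for illustration. The cap of 10000 is the maximum Atlas allows for numCandidates.

```javascript
// Assumption: Atlas caps numCandidates at 10000.
const MAX_CANDIDATES = 10000;
// Assumption: oversample by 10x; tune this for your own recall/latency needs.
const OVERSAMPLE = 10;

// Hypothetical helper that builds the filtered vector search pipeline.
function buildVectorSearchPipeline({ embedding, userId, limit = 10, cursor }) {
  const parsed = parseInt(cursor, 10);
  const offset = Number.isInteger(parsed) && parsed > 0 ? parsed : 0;
  // The vector stage has to return enough rows to skip past the offset.
  const wanted = offset + limit;
  if (wanted > MAX_CANDIDATES) {
    throw new RangeError('offset + limit exceeds the numCandidates cap');
  }
  return [
    {
      $vectorSearch: {
        index: 'image_description_vector_index',
        path: 'descriptionValues',
        queryVector: embedding,
        numCandidates: Math.min(wanted * OVERSAMPLE, MAX_CANDIDATES),
        limit: wanted,
        filter: { userId: userId, deleted: false },
      },
    },
    { $addFields: { searchType: 'vector', searchScore: { $meta: 'vectorSearchScore' } } },
    { $sort: { searchScore: -1, createdAt: -1 } },
    { $skip: offset },
    { $limit: limit },
  ];
}

// Third page of ten results for a (shortened, illustrative) query embedding.
const pipeline = buildVectorSearchPipeline({
  embedding: [0.1, 0.2],
  userId: 'user-123',
  limit: 10,
  cursor: '20',
});
```

Offset pagination like this gets more expensive on deep pages because every request re-runs the search for offset + limit results, so it is best suited to a handful of pages.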
Step 4: Implementing Hybrid Search with Search Index Filtering
return [
  {
    $vectorSearch: {
      index: 'image_description_vector_index',
      path: 'descriptionValues',
      queryVector: embedding,
      numCandidates: limit,
      limit: limit,
      // Only fields declared as filter fields in the vector index can be used here
      filter: {
        userId: userId,
        deleted: false,
      },
    },
  },
  {
    $addFields: {
      searchType: 'vector',
      searchScore: { $meta: 'vectorSearchScore' },
    },
  },
  {
    // Append the text search results from the same collection
    $unionWith: {
      coll: 'user_image',
      pipeline: [
        {
          $search: {
            index: 'image_description',
            compound: {
              // Fuzzy keyword match on the description
              must: [
                {
                  text: {
                    query: query,
                    path: 'description',
                    fuzzy: {
                      maxEdits: 1,
                      prefixLength: 1,
                    },
                  },
                },
              ],
              // Restrict matches to the user's own, non-deleted images
              filter: [
                { text: { path: 'userId', query: userId } },
                { equals: { path: 'deleted', value: false } },
              ],
            },
          },
        },
        { $limit: limit },
        {
          $addFields: {
            searchType: 'text',
            searchScore: { $meta: 'searchScore' },
          },
        },
      ],
    },
  },
  {
    $sort: { searchScore: -1, createdAt: -1 },
  },
  { $skip: cursor ? parseInt(cursor, 10) : 0 },
  { $limit: limit },
  {
    $project: {
      _id: 1,
      description: 1,
      url: 1,
      userId: 1,
      createdAt: 1,
      updatedAt: 1,
      searchType: 1,
      searchScore: 1,
    },
  },
];
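The query above leverages $unionWith to incorporate $search into the query. This is needed because $search must be the first stage of a pipeline, so it cannot simply follow $vectorSearch; inside $unionWith it gets a pipeline of its own. The $search stage uses the compound operator to run a fuzzy text search on the description field, while its filter clause restricts matches to the current user's non-deleted images.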
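Because $unionWith simply appends the text results to the vector results, an image that matches both searches can show up twice, once with searchType 'vector' and once with searchType 'text'. The following is a minimal sketch of one way to handle that in application code after running the aggregation. The helper name dedupeById is hypothetical, and it follows this post's approach of comparing the raw scores directly, even though vector and text scores are computed on different scales.

```javascript
// Hypothetical helper: keeps one entry per _id, preferring the higher
// searchScore, and returns the survivors sorted by score (descending).
function dedupeById(results) {
  const best = new Map();
  for (const doc of results) {
    // String() works for both ObjectId and plain string ids.
    const key = String(doc._id);
    const seen = best.get(key);
    if (!seen || doc.searchScore > seen.searchScore) {
      best.set(key, doc);
    }
  }
  return [...best.values()].sort((a, b) => b.searchScore - a.searchScore);
}

// Illustrative input: image "a" was found by both searches.
const merged = dedupeById([
  { _id: 'a', searchType: 'vector', searchScore: 0.91 },
  { _id: 'b', searchType: 'vector', searchScore: 0.75 },
  { _id: 'a', searchType: 'text', searchScore: 2.4 },
]);
```

Since this runs after the page has been fetched, a page that contained duplicates will come back shorter than limit.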
Conclusion
Implementing hybrid search with filtering in MongoDB has allowed us to provide a powerful and personalized search experience for our users. By combining vector and text search capabilities and leveraging MongoDB's aggregation framework, we've created a flexible solution that can be easily adapted to various requirements. This approach not only improves the relevance of search results by utilizing both semantic and keyword matching but also ensures data privacy by filtering results based on user ownership. As you implement this in your own projects, remember to monitor performance, especially as your data grows, and continually refine your search algorithm based on user feedback. With these tools and techniques, you can create a robust, efficient, and user-centric search functionality in your MongoDB-based applications.