DEV Community

Shannon Lal

Implementing Complex Semantic Search with MongoDB

Recently, I've been developing a feature to enable semantic search on user-uploaded images. Our application allows users to upload and describe their own images, and we wanted to enhance their ability to find the most relevant content within their collections.
Hybrid search, which combines traditional keyword search with dense vector (semantic) search, offers a powerful way to identify the most pertinent search results. However, implementing this feature presented a specific challenge: how to perform the search in MongoDB while restricting queries to only the user's own images.

In this blog post, I'll guide you through the steps required to set up hybrid search in MongoDB, addressing both the semantic search capabilities and the user-specific filtering requirements. Whether you're looking to implement similar functionality or simply curious about advanced MongoDB search techniques, this guide will provide you with practical insights and solutions.

If you are new to vector search with MongoDB, here are a couple of my previous posts that explain what vector search is (https://dev.to/shannonlal/building-blocks-for-hybrid-search-combining-keyword-and-semantic-search-236k) and how to implement basic hybrid search in MongoDB (https://dev.to/shannonlal/navigating-hybrid-search-with-mongodb-ugly-approach-1p9g). If you are already up to speed on semantic search with MongoDB, let's dive in and see how we can leverage its features to create a personalized, intelligent search experience for your users.

In the following example, we will create a collection in Mongo called user_image, which stores image descriptions with the following key fields:

  • description: A string describing the image
  • descriptionValues: An array of numbers representing the vector embedding of the description
  • userId: A string identifying the user who uploaded the image
  • url: A string link to where the image is stored
  • deleted: A boolean flag indicating that the image has been deleted
  • createdAt: A timestamp for when the image is created
  • updatedAt: A timestamp for when the image or description is updated
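
To make the schema concrete, here is a minimal sketch of what one document in user_image might look like. The helper name makeUserImage and the sample values are hypothetical; only the field names come from the list above, and the embedding is assumed to come from whichever embedding model you use.

```javascript
// Hypothetical helper: builds a document with the fields listed above.
function makeUserImage({ userId, url, description, embedding }) {
  if (!Array.isArray(embedding) || embedding.some((v) => typeof v !== 'number')) {
    throw new TypeError('embedding must be an array of numbers');
  }
  const now = new Date();
  return {
    description,
    descriptionValues: embedding, // vector embedding of the description
    userId,
    url,
    deleted: false, // new images start out as not deleted
    createdAt: now,
    updatedAt: now,
  };
}

// Example usage with a placeholder vector (real vectors come from the model).
const sampleImage = makeUserImage({
  userId: 'user-123',
  url: 'https://example.com/images/cat.png',
  description: 'A grey cat sleeping on a windowsill',
  embedding: new Array(1536).fill(0),
});
```

The placeholder vector has 1536 entries only so that it matches the numDimensions value used in the vector index in Step 1.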

Our goal is to perform a hybrid search that combines the power of vector similarity and text matching while ensuring that users only see results from their own uploads.

Step 1: Creating a Vector Index

First, we need to create a vector index to support our vector search. Here's the index definition:

{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "descriptionValues",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "userId",
      "type": "filter"
    },
    {
      "path": "deleted",
      "type": "filter"
    }
  ]
}

This index allows us to perform vector searches on the descriptionValues field while also enabling filtering on userId and deleted fields.

Step 2: Creating a Search Index

Next, we create a search index to support text searches:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "description": {
        "analyzer": "lucene.standard",
        "type": "string"
      },
      "deleted": {
        "type": "boolean"
      },
      "userId": {
        "analyzer": "lucene.keyword",
        "type": "string"
      }
    }
  },
  "storedSource": {
    "include": [
      "description",
      "userId",
      "url",
      "createdAt",
      "updatedAt"
    ]
  }
}

This index enables text searches on the description field and allows filtering on userId and deleted fields.
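
Both index definitions can be created in the Atlas UI, but they can also be submitted from code. The following is a rough sketch, assuming a recent version of the official mongodb Node.js driver whose createSearchIndexes method accepts a type field on each index model. The function names buildIndexModels and createImageIndexes are hypothetical, and the index names are passed in as parameters because they depend on your deployment.

```javascript
// Builds the two index models described in Steps 1 and 2.
function buildIndexModels({ vectorIndexName, searchIndexName }) {
  return [
    {
      name: vectorIndexName,
      type: 'vectorSearch',
      definition: {
        fields: [
          { type: 'vector', path: 'descriptionValues', numDimensions: 1536, similarity: 'cosine' },
          { type: 'filter', path: 'userId' },
          { type: 'filter', path: 'deleted' },
        ],
      },
    },
    {
      name: searchIndexName,
      type: 'search',
      definition: {
        mappings: {
          dynamic: false,
          fields: {
            description: { type: 'string', analyzer: 'lucene.standard' },
            deleted: { type: 'boolean' },
            userId: { type: 'string', analyzer: 'lucene.keyword' },
          },
        },
        storedSource: { include: ['description', 'userId', 'url', 'createdAt', 'updatedAt'] },
      },
    },
  ];
}

// `collection` is assumed to be a Collection object from the `mongodb` driver.
async function createImageIndexes(collection, names) {
  // Resolves to the names of the submitted indexes; Atlas builds them asynchronously,
  // so they may not be queryable immediately after this call returns.
  return collection.createSearchIndexes(buildIndexModels(names));
}
```

A call such as createImageIndexes(db.collection('user_image'), { vectorIndexName: 'image_description_vector_index', searchIndexName: 'image_description' }) would submit both definitions in one request.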

Step 3: Implementing the Vector Search

Now, let's look at the MongoDB aggregation pipeline that performs the vector search with filtering:

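Parameters passed in:
- embedding (the embedding of the query)
- userId (restricts results to the current user's images)
- limit (the maximum number of results to return)
- cursor (the offset used for pagination)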

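  // Offset into the results; 0 when no cursor is supplied.
  const skip = cursor ? parseInt(cursor, 10) : 0;

  return [
    {
      $vectorSearch: {
        index: 'ai_image_description_vector_index',
        path: 'descriptionValues',
        queryVector: embedding,
        // Scan more candidates than we return to improve recall (Atlas caps this at 10000).
        numCandidates: Math.min((skip + limit) * 10, 10000),
        // Fetch enough results to cover the skipped rows plus one page;
        // otherwise any page after the first would come back empty.
        limit: skip + limit,
        // Only fields indexed with type "filter" can be used here.
        filter: {
          userId: userId,
          deleted: false,
        },
      },
    },
    {
      // Tag each result with its source and expose the similarity score.
      $addFields: {
        searchType: 'vector',
        searchScore: { $meta: 'vectorSearchScore' },
      },
    },
    {
      $sort: { searchScore: -1, createdAt: -1 },
    },
    { $skip: skip },
    { $limit: limit },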
    {
      $project: {
        _id: 1,
        description: 1,
        url: 1,
        userId: 1,
        createdAt: 1,
        updatedAt: 1,
        searchType: 1,
        searchScore: 1,
      },
    },
  ];

The aggregation above takes the following inputs:

  • embedding - The embedding of the search term
  • userId - The ID of the current user, used to restrict results to their own images
  • limit - The maximum number of results to return per page
  • cursor - Optional; the number of results to skip, which is useful for pagination
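
The inputs above can be wired into a small builder function. This is a minimal sketch rather than the exact production code: the name buildVectorSearchPipeline, the indexName parameter, and the input checks are assumptions added for illustration, and the final projection is omitted for brevity. It also assumes the cursor is a numeric offset, so it asks $vectorSearch for skip + limit results to keep later pages from coming back empty.

```javascript
const EMBEDDING_DIMENSIONS = 1536; // must match numDimensions in the vector index

// Hypothetical builder for the Step 3 pipeline.
function buildVectorSearchPipeline({ embedding, userId, limit, cursor, indexName }) {
  if (!Array.isArray(embedding) || embedding.length !== EMBEDDING_DIMENSIONS) {
    throw new RangeError(`embedding must have ${EMBEDDING_DIMENSIONS} dimensions`);
  }
  const skip = cursor ? parseInt(cursor, 10) : 0;
  if (!Number.isInteger(skip) || skip < 0) {
    throw new RangeError('cursor must be a non-negative integer');
  }
  return [
    {
      $vectorSearch: {
        index: indexName,
        path: 'descriptionValues',
        queryVector: embedding,
        // More candidates than results generally improves recall; 10000 is the Atlas maximum.
        numCandidates: Math.min((skip + limit) * 10, 10000),
        limit: skip + limit,
        filter: { userId: userId, deleted: false },
      },
    },
    { $addFields: { searchType: 'vector', searchScore: { $meta: 'vectorSearchScore' } } },
    { $sort: { searchScore: -1, createdAt: -1 } },
    { $skip: skip },
    { $limit: limit },
  ];
}
```

The resulting array can be passed to collection.aggregate(pipeline).toArray() in the Node.js driver; the cursor for the next page would then be the current offset plus the page size.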

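Step 4: Implementing Hybrid Search with Search Index Filtering

To turn this into a hybrid search, we extend the pipeline with a text search against the search index, applying the same user and deleted filters: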

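  // Offset into the combined results; 0 when no cursor is supplied.
  const skip = cursor ? parseInt(cursor, 10) : 0;

  return [
    {
      $vectorSearch: {
        index: 'image_description_vector_index',
        path: 'descriptionValues',
        queryVector: embedding,
        // Scan more candidates than we return to improve recall (Atlas caps this at 10000).
        numCandidates: Math.min((skip + limit) * 10, 10000),
        // Fetch enough results to cover the skipped rows plus one page.
        limit: skip + limit,
        // Filter on the fields declared as "filter" in the vector index.
        filter: {
          userId: userId,
          deleted: false,
        },
      },
    },
    {
      $addFields: {
        searchType: 'vector',
        searchScore: { $meta: 'vectorSearchScore' },
      },
    },
    {
      // Append the keyword matches from the search index to the vector results.
      $unionWith: {
        coll: 'user_image',
        pipeline: [
          {
            $search: {
              index: 'image_description',
              compound: {
                // must: fuzzy text match on the description (tolerates one typo).
                must: [
                  {
                    text: {
                      query: query,
                      path: 'description',
                      fuzzy: {
                        maxEdits: 1,
                        prefixLength: 1,
                      },
                    },
                  },
                ],
                // filter: restrict matches to the user's non-deleted images without affecting the score.
                filter: [
                  { text: { path: 'userId', query: userId } },
                  { equals: { path: 'deleted', value: false } },
                ],
              },
            },
          },
          { $limit: skip + limit },
          {
            $addFields: {
              searchType: 'text',
              searchScore: { $meta: 'searchScore' },
            },
          },
        ],
      },
    },
    {
      $sort: { searchScore: -1, createdAt: -1 },
    },
    { $skip: skip },
    { $limit: limit },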
    {
      $project: {
        _id: 1,
        description: 1,
        url: 1,
        userId: 1,
        createdAt: 1,
        updatedAt: 1,
        searchType: 1,
        searchScore: 1,
      },
    },
  ];

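The query above uses $unionWith to append the results of a $search stage to the vector search results. The $search stage uses the compound operator: the must clause performs a fuzzy text search on the description field, while the filter clause restricts matches to the current user's non-deleted images. Note that vectorSearchScore and searchScore are computed on different scales, so the combined sort gives a rough ordering rather than a calibrated ranking.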
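
Because the vector branch and the text branch run independently, the same image can appear twice in the combined results. One possible way to handle this, sketched below as an assumption rather than part of the original pipeline, is to deduplicate in application code and keep the higher-scoring entry for each _id. The function name dedupeById is hypothetical.

```javascript
// Keeps one entry per _id, preferring the entry with the higher searchScore.
// The first occurrence of each _id determines its position in the output.
function dedupeById(results) {
  const best = new Map();
  for (const result of results) {
    const key = String(result._id);
    const seen = best.get(key);
    if (!seen || result.searchScore > seen.searchScore) {
      best.set(key, result);
    }
  }
  return [...best.values()];
}

// Example: image "a" was matched by both the vector and the text branch.
const merged = dedupeById([
  { _id: 'a', searchType: 'vector', searchScore: 0.9 },
  { _id: 'b', searchType: 'text', searchScore: 2.1 },
  { _id: 'a', searchType: 'text', searchScore: 3.0 },
]);
```

Deduplicating after pagination can leave a page shorter than the requested limit, so this is a trade-off to weigh against doing the merge inside the pipeline.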

Conclusion

Implementing hybrid search with filtering in MongoDB has allowed us to provide a powerful and personalized search experience for our users. By combining vector and text search capabilities and leveraging MongoDB's aggregation framework, we've created a flexible solution that can be easily adapted to various requirements. This approach not only improves the relevance of search results by utilizing both semantic and keyword matching but also ensures data privacy by filtering results based on user ownership. As you implement this in your own projects, remember to monitor performance, especially as your data grows, and continually refine your search algorithm based on user feedback. With these tools and techniques, you can create a robust, efficient, and user-centric search functionality in your MongoDB-based applications.
