Shannon Lal

Posted on Oct 24, 2024

Implementing Complex Semantic Search with MongoDB

#ai #mongodb #openai #vectordatabase

Recently, I've been developing a feature to enable semantic search on user-uploaded images. Our application allows users to upload and describe their own images, and we wanted to enhance their ability to find the most relevant content within their collections.
Hybrid search, which combines traditional keyword search with dense vector (semantic) search, offers a powerful way to identify the most pertinent search results. However, implementing this feature presented a unique challenge: how to perform hybrid search in MongoDB while restricting queries to only the user's own images.

In this blog post, I'll guide you through the steps required to set up hybrid search in MongoDB, addressing both the semantic search capabilities and the user-specific filtering requirements. Whether you're looking to implement similar functionality or simply curious about advanced MongoDB search techniques, this guide will provide you with practical insights and solutions.

If you are new to vector search with MongoDB, here are a couple of my previous posts: one explains what vector search is (https://dev.to/shannonlal/building-blocks-for-hybrid-search-combining-keyword-and-semantic-search-236k) and the other shows how to implement basic hybrid search in MongoDB (https://dev.to/shannonlal/navigating-hybrid-search-with-mongodb-ugly-approach-1p9g). If you are already up to speed on semantic search with MongoDB, let's dive in and see how we can leverage MongoDB's features to create a personalized, intelligent search experience for your users.

In the following example, we will create a collection in Mongo called user_image, which stores image descriptions with the following key fields:

description: A string describing the image
descriptionValues: An array of numbers representing the vector embedding of the description
userId: A string identifying the user who uploaded the image
url: A string link to where the image is stored
deleted: A boolean flag that indicates whether the image has been deleted
createdAt: A timestamp for when the image is created
updatedAt: A timestamp for when the image or description is updated
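
To make the schema concrete, here is a minimal sketch of how a document for this collection might be assembled before it is inserted. The helper name buildUserImageDoc, the sample values, and the placeholder URL are assumptions for illustration only; the 1536-element check mirrors the numDimensions value used in the vector index definition in Step 1.

```javascript
// Hypothetical helper (not from the original post): assembles a user_image
// document and checks that the embedding has the expected number of dimensions.
const EMBEDDING_DIMENSIONS = 1536; // matches numDimensions in the vector index

function buildUserImageDoc({ description, embedding, userId, url }) {
  if (!Array.isArray(embedding) || embedding.length !== EMBEDDING_DIMENSIONS) {
    throw new Error(`Expected an embedding of ${EMBEDDING_DIMENSIONS} numbers`);
  }
  const now = new Date();
  return {
    description,
    descriptionValues: embedding,
    userId,
    url,
    deleted: false, // new uploads start out as not deleted
    createdAt: now,
    updatedAt: now,
  };
}

// Example with a dummy all-zero embedding; a real one would come from an embedding model.
const doc = buildUserImageDoc({
  description: 'A red bicycle leaning against a brick wall',
  embedding: new Array(EMBEDDING_DIMENSIONS).fill(0),
  userId: 'user-123',
  url: 'https://example.com/images/bicycle.jpg',
});
```

Validating the embedding length up front is a design choice in this sketch: it catches a mismatch between the embedding model and the index before the document is stored.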

Our goal is to perform a hybrid search that combines the power of vector similarity and text matching while ensuring that users only see results from their own uploads.

Step 1: Creating a Vector Index

First, we need to create a vector index to support our vector search. Here's the index definition:

{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "descriptionValues",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "userId",
      "type": "filter"
    },
    {
      "path": "deleted",
      "type": "filter"
    }
  ]
}

This index allows us to perform vector searches on the descriptionValues field while also enabling filtering on userId and deleted fields.
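
The definition can be pasted into the Atlas UI, but it can also be created from application code. The following is a rough sketch, assuming a recent MongoDB Node.js driver whose createSearchIndex method accepts type: 'vectorSearch'. The function name createVectorIndex is hypothetical, and the default index name is taken from the pipeline in Step 4.

```javascript
// Sketch only: createVectorIndex is an illustrative name, not part of the original post.
// `collection` is expected to be a MongoDB Node.js driver Collection for user_image.
const vectorIndexDefinition = {
  fields: [
    { type: 'vector', path: 'descriptionValues', numDimensions: 1536, similarity: 'cosine' },
    { type: 'filter', path: 'userId' },
    { type: 'filter', path: 'deleted' },
  ],
};

async function createVectorIndex(collection, name = 'image_description_vector_index') {
  // type: 'vectorSearch' distinguishes this from a regular Atlas Search index.
  return collection.createSearchIndex({
    name,
    type: 'vectorSearch',
    definition: vectorIndexDefinition,
  });
}
```

Index builds are asynchronous, so the index may not be queryable immediately after this call returns.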

Step 2: Creating a Search Index

Next, we create a search index to support text searches:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "description": {
        "analyzer": "lucene.standard",
        "type": "string"
      },
      "deleted": {
        "type": "boolean"
      },
      "userId": {
        "analyzer": "lucene.keyword",
        "type": "string"
      }
    }
  },
  "storedSource": {
    "include": [
      "description",
      "userId",
      "url",
      "createdAt",
      "updatedAt"
    ]
  }
}

This index enables text searches on the description field and allows filtering on userId and deleted fields.

Step 3: Implementing the Vector Search

Now, let's look at the MongoDB aggregation pipeline that performs the vector search with filtering:

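The pipeline takes the following parameters:
- embedding (the vector embedding of the search query)
- userId (the ID of the user running the search)
- limit (the maximum number of results to return)
- cursor (an optional offset used for pagination)

  // Number of results to skip; 0 when no cursor is supplied.
  const skip = cursor ? parseInt(cursor, 10) : 0;

  return [
    {
      $vectorSearch: {
        index: 'image_description_vector_index',
        path: 'descriptionValues',
        queryVector: embedding,
        // Consider more candidates than are returned to improve recall;
        // MongoDB recommends about 20x the limit, capped at 10000.
        numCandidates: Math.min((skip + limit) * 20, 10000),
        // Fetch enough results to cover the ones skipped below.
        limit: skip + limit,
        // Restrict the search to this user's non-deleted images.
        filter: {
          userId: userId,
          deleted: false,
        },
      },
    },
    {
      // Tag each result with its source and similarity score.
      $addFields: {
        searchType: 'vector',
        searchScore: { $meta: 'vectorSearchScore' },
      },
    },
    {
      // Best matches first; newer images win ties.
      $sort: { searchScore: -1, createdAt: -1 },
    },
    { $skip: skip },
    { $limit: limit },
    {
      $project: {
        _id: 1,
        description: 1,
        url: 1,
        userId: 1,
        createdAt: 1,
        updatedAt: 1,
        searchType: 1,
        searchScore: 1,
      },
    },
  ];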

The aggregation above takes the following inputs:

embedding - The embedding of the search term
userId - The ID of the user whose images are searched
limit - The maximum number of results to return, which also serves as the page size
cursor - Optional; the number of results to skip, which supports pagination
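
Because cursor is simply a stringified offset passed to $skip, the caller needs a way to derive the cursor for the next page. The following is a minimal sketch, assuming a hypothetical helper named nextCursor and assuming that a page shorter than limit means there are no more results.

```javascript
// Hypothetical helper: computes the cursor to send with the next request.
// The cursor is the number of results already consumed, encoded as a string.
function nextCursor(cursor, limit, pageLength) {
  const offset = cursor ? parseInt(cursor, 10) : 0;
  // A short page means the result set is exhausted.
  if (pageLength < limit) return null;
  return String(offset + pageLength);
}

const first = nextCursor(undefined, 10, 10); // '10'
const last = nextCursor('10', 10, 3); // null
```

Offset-based cursors like this are simple, but results can shift between pages if images are added or deleted while the user is paging.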

Step 4: Implement Hybrid Search

To incorporate the search index and its filtering, we extend the vector search pipeline with a $unionWith stage that runs a $search query:

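  // Number of results to skip; 0 when no cursor is supplied.
  const skip = cursor ? parseInt(cursor, 10) : 0;

  return [
    {
      $vectorSearch: {
        index: 'image_description_vector_index',
        path: 'descriptionValues',
        queryVector: embedding,
        // Consider more candidates than are returned to improve recall;
        // MongoDB recommends about 20x the limit, capped at 10000.
        numCandidates: Math.min((skip + limit) * 20, 10000),
        // Fetch enough results to cover the ones skipped below.
        limit: skip + limit,
        // Only fields declared as filter fields in the vector index can be used here.
        filter: {
          userId: userId,
          deleted: false,
        },
      },
    },
    {
      $addFields: {
        searchType: 'vector',
        searchScore: { $meta: 'vectorSearchScore' },
      },
    },
    {
      // Append the text search results to the vector search results.
      $unionWith: {
        coll: 'user_image',
        pipeline: [
          {
            $search: {
              index: 'image_description',
              compound: {
                // Fuzzy keyword match on the description.
                must: [
                  {
                    text: {
                      query: query,
                      path: 'description',
                      fuzzy: {
                        maxEdits: 1,
                        prefixLength: 1,
                      },
                    },
                  },
                ],
                // Restrict matches to this user's non-deleted images
                // without affecting the relevance score.
                filter: [
                  { text: { path: 'userId', query: userId } },
                  { equals: { path: 'deleted', value: false } },
                ],
              },
            },
          },
          { $limit: skip + limit },
          {
            $addFields: {
              searchType: 'text',
              searchScore: { $meta: 'searchScore' },
            },
          },
        ],
      },
    },
    {
      // Best matches first; newer images win ties.
      $sort: { searchScore: -1, createdAt: -1 },
    },
    { $skip: skip },
    { $limit: limit },
    {
      $project: {
        _id: 1,
        description: 1,
        url: 1,
        userId: 1,
        createdAt: 1,
        updatedAt: 1,
        searchType: 1,
        searchScore: 1,
      },
    },
  ];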

The query above uses $unionWith to append the results of a $search stage to the vector search results. The $search stage uses the compound operator: the must clause performs a fuzzy text search on the description field, while the filter clause restricts matches to the user's own, non-deleted images.
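
One thing to keep in mind is that $unionWith does not remove duplicates, so an image that matches both the vector search and the text search can appear twice in the combined results. The following is a minimal client-side sketch for collapsing those duplicates. It assumes a hypothetical helper named dedupeById and assumes the results are already sorted by searchScore in descending order, so the first occurrence of each _id is the one to keep.

```javascript
// Hypothetical helper: keeps the first (highest-ranked) occurrence of each _id.
function dedupeById(results) {
  const seen = new Set();
  const unique = [];
  for (const doc of results) {
    const key = String(doc._id); // works for ObjectId values and plain string ids
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(doc);
  }
  return unique;
}

// Sample data for illustration; real results would come from the aggregation.
const merged = dedupeById([
  { _id: 'a', searchType: 'text', searchScore: 2.4 },
  { _id: 'b', searchType: 'vector', searchScore: 0.91 },
  { _id: 'a', searchType: 'vector', searchScore: 0.88 },
]);
// merged keeps the 'text' entry for 'a' and the 'vector' entry for 'b'
```

Note that de-duplicating after the pipeline's $skip and $limit stages can leave a page with fewer than limit items.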

Conclusion

Implementing hybrid search with filtering in MongoDB has allowed us to provide a powerful and personalized search experience for our users. By combining vector and text search capabilities and leveraging MongoDB's aggregation framework, we've created a flexible solution that can be easily adapted to various requirements. This approach not only improves the relevance of search results by utilizing both semantic and keyword matching but also ensures data privacy by filtering results based on user ownership. As you implement this in your own projects, remember to monitor performance, especially as your data grows, and continually refine your search algorithm based on user feedback. With these tools and techniques, you can create a robust, efficient, and user-centric search functionality in your MongoDB-based applications.
