DEV Community

Cover image for Partial text search on Mongo
Sagnik Bandyopadhyay
Sagnik Bandyopadhyay

Posted on • Edited on

Partial text search on Mongo

Scenario

Lets consider a Mongo collection topics that looks somewhat like:

[
  {
    "_id": 1,
    "title": "computer science"
  },
  {
    "_id": 2,
    "title": "numerical computation"
  },
  {
    "_id": 3,
    "title": "medical science"
  },
  {
    "_id": 4,
    "title": "information technology"
  },
  {
    "_id": 5,
    "title": "sci fi"
  }
  ...
]
Enter fullscreen mode Exit fullscreen mode

We would like to retrieve documents whose title "contains" comp.

Here are few possible ways that we can consider:

OPTION 1 - Simple regex query

We can perform a regex query on the title field, like this:

db.topics.find(title: /.*comp.*/i)
Enter fullscreen mode Exit fullscreen mode

OR

db.topics.find(title: {$regex: "comp", $options: "i"})
Enter fullscreen mode Exit fullscreen mode

NOTE: Option i is used to perform case insensitive search.

Mongo will need to examine every document in the collection and perform a regex match on the title field.

We can define an index (nothing fancy, the usual index) on the title field, like:

db.topics.createIndex({title: 1})
Enter fullscreen mode Exit fullscreen mode

Now if we perform the regex query, mongo will not examine all the documents, but it still needs to examine all the index keys, and then examine and retrieve only those documents that matched.

This works good for a small collection, but for large collections (having more than 100k or 1M docs) the query becomes slow, and it also consumes a lot of resources from the mongo db instances, which is definitely bad.

This solution is the simplest to implement, but doesn't scale well.

OPTION 2 - Index on tokenised words

If we are okay to support searching only by words, and not partial words, then tokenising the title field can help. E.g.:

[
  {
    "_id": 1,
    "title": "computer science",
    "titleTokens": ["computer", "science"]
  },
  {
    "_id": 2,
    "title": "numerical computation",
    "titleTokens": ["numerical", "computation"]
  },
  {
    "_id": 3,
    "title": "medical science",
    "titleTokens": ["medical", "science"]
  },
  {
    "_id": 4,
    "title": "information technology",
    "titleTokens": ["information", "technology"]
  },
  {
    "_id": 5,
    "title": "sci fi",
    "titleTokens": ["sci", "fi"]
  },
  ...
]
Enter fullscreen mode Exit fullscreen mode

While saving / updating documents, we can:

  1. convert the title field to lower case (to support case insensitive searches)
  2. split the text into multiple words
  3. save those words as an array, as titleTokens

And then we can define a multi key index on titleTokens

Now if we need to search for a text, we can:

  1. convert the search text to lower case
  2. split the search text into multiple words (using the same algorithm used to tokenise the title field)
  3. Query the array titleTokens with the tokens obtained from search text, e.g.:
db.topics.find({titleTokens: {$all: ["science"]}})
Enter fullscreen mode Exit fullscreen mode

Above query will retrieve all documents having the token "science" in the titleTokens field, i.e.: document #1 and #3

OR

db.topics.find({titleTokens: {$all: ["computer", "science"]}})
Enter fullscreen mode Exit fullscreen mode

Above query will retrieve all documents having both "computer" and "science" tokens, and hence will retrieve document #1

Using this solution we cannot search by partial words like "comp", but can search with individual words pretty fast.

OPTION 3 - Text index

Mongo has a provision for defining one text index per collection. This works pretty much the same way as OPTION 2, i.e., the title field's value will be tokenised into words and only search by full words (not partial words) is supported.

The index can be created like this:

db.topics.createIndex({title: "text"})
Enter fullscreen mode Exit fullscreen mode

This is one way to query the text index:

db.topics.find({$text: {$search: "computer"}})
Enter fullscreen mode Exit fullscreen mode

Above query will retrieve all documents having the word "computer" in the title field.

Here is another:

db.topics.find({$text: {$search: "computer science"}})
Enter fullscreen mode Exit fullscreen mode

Above query will retrieve all documents having either "computer" or "science" in the title field, i.e.: documents #1 and #3

Yet another:

db.topics.find({$text: {$search: "\"computer sci\""}})
Enter fullscreen mode Exit fullscreen mode

This is a phrase query. It will:

  • tokenise the search text into "computer" and "sci"
  • fetch documents containing either "computer" (this will match doc #1) or "sci" (this will match doc #5)
  • among all these documents, mongo will examine the title fields and check whether the exact phrase "computer sci" is present or not. only doc #1 will pass this check, which will be returned as the result of the query

Using this solution we can efficiently search for full words or for phrases containing at least one full word.

OPTION 4 - Atlas search

This is only applicable when mongo is running on atlas.

Atlas provides a very powerful way to search through documents for partial words. autocomplete indexing and searching perfectly meets our use case.

We can define a search index on topics following these steps:

  1. Choose an analyser, maybe whitespace. This will tokenise the field's value into words
  2. Disable dynamic mapping (assuming that we already know which field to index, and the field name / hierarchy doesn't change)
  3. Define explicit fields to index, in this case it can be the title field
  4. Also define two datatypes for the field:
    1. String - for complete text matching (also needed for autocomplete match if we want to increase the search score of completely matched strings)
    2. Autocomplete - for partial matching
      1. Choose the tokeniser for Autocomplete datatype:
        1. edgeGram: This will allow prefix searches against tokenised words. E.g.: searching by "comp" will return entities #1 and #2, but searching by "puter" wont return anything
        2. nGram: This will behave like a contains search. E.g.: searching by "put", will return entities #1 and #2
      2. Define:
        1. minGrams: The minimum length from which tokenisation will start. If minGrams is 2, then by searching for "c" we wont get any results, we need to have at least 2 characters. We need to be careful with this field. Setting a very low minGrams value (like 1) with nGram tokeniser will result in a huge index, and setting something very high won't be useful enough.
        2. maxGrams: The maximum length of characters which will be tokenised

Once the Atlas search index is ready, we can query the mongo collection using an aggregation pipeline like this:

db.topics.aggregate(
[
  {
    $search: {
      index: <name of atlas search index>,
      autocomplete: {
        query: "comp",
        path: "title"
      }
    }
  }
]
)
Enter fullscreen mode Exit fullscreen mode

NOTE:

  1. These search indexes can accomplish a lot more (fuzzy search, compound search, and maybe more).
  2. The search index can be defined via terraform (if we want to follow infrastructure as code :-)).

Using this solution we can efficiently search for prefixes of words or for any part of the text, if we run mongo on atlas.


Above options were easy solutions to the problem. Below are some not so easy solutions:

ELASTIC SEARCH

If we strictly do not want to use atlas search, we may consider:

  1. Continuously syncing the mongo collection (maybe only the title and id fields) to elastic search
  2. Defining appropriate autocomplete edgeGram / nGram indexes
  3. Executing queries against elastic and retrieving matched documents by id from mongo. Or we may also consider storing the whole document in elastic itself (in that case we need not retrieve docs by id from mongo as we already get the full document from elastic itself).

FROM SCRATCH !

If none of the above options work for us, we can build our own indexes using prefix trie or suffix trie, and point to mongo document ids (or any database reference) from them. Searching the index for all possibilities for a given search text may become a difficult / costly operation. We may consider caching the top results in each trie node. If space becomes an issue, we can distribute the trie across multiple nodes.

This is definitely not an easy job. We'll encounter multiple challenges / questions related to:

  • Pre-computation of search results in each trie node: What algorithm to use? How to keep the trie index in sync with the underlying mongo collection? How much latency is acceptable? How will the data infrastructure look like?
  • Trie sharding: How to evenly distribute the trie across multiple nodes?

Please feel free to suggest better ideas / corrections to this content. Thanks in advance.

Top comments (0)