What I learned from managing hundred millions of data in MongoDB

#webdev #mongodb #database

In this post, I will share real experience that I gained while working with hundred's of millions of pieces of data in MongoDB.

This post cross-published with OnePublish

Don't store all data in a single mongo collection

This was the worst mistake that we made, which became the root cause of all issues. We used to store 500 million complex structured data in single collection, that caused following side effects:

Indexes creation takes very long time.
If collection dropped accidentally, all data is gone.
Querying data takes longer than usual, even though with indexes.
Counting with filters is timing out because of large document scanning.

Probably there will be even more issues the data increases on a daily basis.

Bad document structure

Invest some time to define document structure precisely. This is a critical part that never gets attention when starting the project. Our mistake was excluding the field entirely from some of the documents instead of assigning a default value for the field.

That ended up with slow query scanning because of existence checking. In MongoDB, it's possible to give partial indexes but it's really not working that well where field existence is checked.

The example below shows null value assigned for phone instead of excluding it from the document completely.

{
  "_id": "507f191e810c19729de860ea",
  "name": "John Doe",
  "age": 30,
  "email": "john.doe@example.com",
  "phone":null, //assign default value instead of removing
  "isActive":true,
  "tags": ["user", "active", "new"]
  "registeredOn": "2024-04-19T15:00:00Z"
}

{$exists: true} filter is slow so consider adding default value and index the fields for fast query results.

Use Bulk Operations

Bulk operations are used to execute multiple write operations (inserts, updates, deletes) efficiently in a single command. This method can significantly reduce the number of round trips between your application and the MongoDB server, leading to performance improvements, especially when dealing with large volumes of data.

Since we were to processing millions of data that from RabbitMQ queues it requires to handle messages in batches and make bulk DB operations for each batch.

Here's the documentation for pymongo how to use bulk operations for better write performance.

Use Indexes

Indexing is playing crucial part in DB performance. Create required indexes in collection based on what filters you are using in the codebase. That also refers the part about creating valid document structure. Instead of removing the field completely assign default value to benefit from indexing power.

Instead of count use aggregation pipeline

If you need get count of data based on specific filter I wouldn't recommend using count() , instead use combination of $match and $count aggregation for fast performance.

$match filters the documents first, which means only the relevant documents that meet the specified conditions are passed down to the $count stage. This is more efficient than counting all documents and then filtering them, as count() might initially do without any conditions.

Wrap up

That's all what I have been experienced so far with MongoDB. I will update the article once I find more hints and best practices to help optimise the performance.