Overview
In the engineering process we often face cases where we need to query all the data that belongs to some parent records. For example, say there is a users table with a one-to-many relationship to a transactions table, and you need to fetch all the users and their transactions for a list of user_ids passed as an argument. The simple approach is to fetch all the users whose id is included in user_ids, then fetch the transactions one by one:
def load_data(user_ids)
  result = []
  users = User.where(id: user_ids)
  users.each do |user|
    result << { user: user, transactions: user.transactions }
  end
  result
end
It's really simple logic, but is it good enough? Is it bad? Can our service endure high throughput with it? Is there any way to make it more efficient?
Look Inside the Query
Let's say we have these models:
class User < ApplicationRecord
  has_many :transactions
end

class Transaction < ApplicationRecord
  belongs_to :user
end
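The post doesn't show the underlying schema, but a minimal setup along these lines is assumed; the column names (name, amount) are only illustrative, while transactions.user_id is what the belongs_to association relies on:

# Assumed schema for the example (not shown in the original post).
class CreateUsersAndTransactions < ActiveRecord::Migration[7.0]
  def change
    create_table :users do |t|
      t.string :name
      t.timestamps
    end

    create_table :transactions do |t|
      # Adds a user_id column plus an index and foreign key to users.
      t.references :user, null: false, foreign_key: true
      t.decimal :amount
      t.timestamps
    end
  end
end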
We're going to simulate the query in the Rails console (run: rails console).
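For example, after pasting the load_data method into the console session, we can call it with a few ids that exist locally (1, 2, and 3 here match the log below):

# In the Rails console, with load_data already defined in the session:
load_data([1, 2, 3])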
As we can see from the logged queries below, the load_data method issues 4 queries to the database. The first one fetches all the users matching user_ids, and the other 3 fetch the transactions for each user.
SELECT "users".* FROM "users" WHERE "users"."id" IN (?, ?, ?) [["id", 1], ["id", 2], ["id", 3]]
SELECT "transactions".* FROM "transactions" WHERE "transactions"."user_id" = ? /* loading for inspect */ LIMIT ? [["user_id", 1], ["LIMIT", 11]]
SELECT "transactions".* FROM "transactions" WHERE "transactions"."user_id" = ? /* loading for inspect */ LIMIT ? [["user_id", 2], ["LIMIT", 11]]
SELECT "transactions".* FROM "transactions" WHERE "transactions"."user_id" = ? /* loading for inspect */ LIMIT ? [["user_id", 3], ["LIMIT", 11]]
What happens if user_ids is very large? Will we run one transactions query for every user we have? We are now facing the N+1 queries problem.
N+1 Query Problem?
This is a common problem in database querying: the code executes one query per record instead of just one or two queries in total. In the example above we fetch the three users, then query the transactions for each user, which adds up to 4 queries (1 + 3). If there are N users, we first fetch the N users and then query the transactions for each of them, hence the name N+1 queries.
The problem with N+1 queries is that every query takes some amount of time: the more data we fetch, the more time we need, and we may even hit timeout issues. N+1 queries are bad for performance, and we need to find a solution.
*We can ignore the N+1 queries if the data set is small or the throughput is low.
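To see the growth concretely, here is a minimal sketch (the count_queries helper is ours, not part of the original post) that counts the SQL statements a block issues via Active Record's sql.active_record notifications:

# Count how many real SQL statements are executed inside the given block.
def count_queries
  count = 0
  counter = lambda do |_name, _start, _finish, _id, payload|
    # Skip schema lookups and cached queries so only real statements count.
    count += 1 unless payload[:name] == "SCHEMA" || payload[:cached]
  end
  ActiveSupport::Notifications.subscribed(counter, "sql.active_record") { yield }
  count
end

# user.transactions is lazy, so call #to_a to force it to load, the same way
# the console's inspect did in the log above.
count_queries { load_data([1, 2, 3]).each { |row| row[:transactions].to_a } }
# => 4 (1 users query + 3 transactions queries); with N users it becomes 1 + N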
Eager Load
In Rails, Active Record gives us an eager loading mechanism to load the data and its associations in a small, fixed number of queries instead of one per record. One of the methods that triggers eager loading is .includes:
def load_data_with_eager_load(user_ids)
  result = []
  users = User.includes(:transactions).where(id: user_ids)
  users.each do |user|
    result << { user: user, transactions: user.transactions }
  end
  result
end
The method above gives the same result as before, but if we look closely, load_data_with_eager_load only triggers two queries. The first query fetches all the users and the second fetches the transactions for all of those users at once:
SELECT "users".* FROM "users" WHERE "users"."id" IN (?, ?, ?) [["id", 1], ["id", 2], ["id", 3]]
SELECT "transactions".* FROM "transactions" WHERE "transactions"."user_id" IN (?, ?, ?) [["user_id", 1], ["user_id", 2], ["user_id", 3]]
This reduces the number of database queries significantly: with eager loading it only costs 2 queries no matter how much data we have.
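For comparison, running the same hypothetical count_queries helper from the earlier sketch against the eager-loaded version confirms the fixed cost:

# The associations are already loaded here, so #to_a triggers no extra queries.
count_queries { load_data_with_eager_load([1, 2, 3]).each { |row| row[:transactions].to_a } }
# => 2 (1 users query + 1 transactions query), no matter how many ids we pass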