Intro
Over the past few months, I've encountered significant challenges in improving API response times. This journey began as our company's data volume grew substantially. It became apparent that a lot of legacy code in our backend could benefit from optimization, not only to cut costs but also to reduce response times. Our objective revolves around streamlining query expenses and backend logic. After implementing these optimizations, I feel compelled to document the process. This serves the dual purpose of raising awareness among my colleagues and potentially eliciting fresh insights that may have eluded me during the optimization process. Dive into the content and glean any valuable insights it may offer!
1. Leave Out Unnecessary Join Queries
Before delving into the first approach, it's crucial to make sure indexing is being used effectively. Effective indexing plays a pivotal role in reducing the time complexity of our queries. However, it's imperative to exercise caution, as every index entails additional storage and extra work on writes. Therefore, meticulous management of indexing is paramount to ensure optimal performance.
If we assume that indexing has already been optimized but queries still suffer from lengthy processing times, it's prudent to scrutinize our query structure. Complex logic, especially when it requires relationships across numerous tables, can significantly inflate the cost of join queries. Therefore, it's essential to examine our query logic and identify areas for simplification or optimization. Imagine we have a database with tables for persons, their companies, their pets, and the pets' vets (it may be oversimplified).

Let's say we need to fetch persons along with details about their pets and each pet's vet. The data should also be filtered to only persons that meet a specific condition (call it condition A), then ordered by, for example, their distance to their company. Finally, the result is paginated at 40 items per page. For this case, we might come up with a query like this:
SELECT
    p.*,
    c.*,
    json_agg(json_build_object(...)) AS pets_owned
    ...
FROM persons p
LEFT JOIN companies c ON p.company_id = c.id
LEFT JOIN pets pt ON pt.person_id = p.id
...
WHERE ...
GROUP BY p.id, c.id -- required because json_agg is an aggregate
ORDER BY ...
OFFSET {page_size * page_number} ROWS
FETCH NEXT {page_size} ROWS ONLY
As the dataset grows to millions of records or more, and our operational scope expands, we find ourselves needing to join numerous related tables and implement increasingly complex logic. At this stage, the cost of JOIN operations becomes exorbitant, roughly O(n) per JOIN when the planner falls back to sequential scans (lower when index scans can be used). Notably, we frequently do not require the joined data for all rows, particularly since our typical practice involves fetching only 40 items per page.
In addressing this scenario, we can selectively exclude JOIN queries that are not directly utilized in the main logic. For instance, consider a situation where we intend to join the pets table to access its data, but there's no immediate need to examine that data within the main query's filtering or ordering. In such cases, we can omit the join to reduce complexity and subsequently retrieve the data through a separate query call when necessary. This approach enables us to avoid looping over a million records to perform joins before filtering. Instead, we conserve time by only iterating over the curated set of 40 records. It's important to note that certain relation tables cannot be omitted, particularly those crucial to the main logic, such as tables containing columns used for ordering or filtering.
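To make that concrete, here is a rough sketch of how the curated main query could look once the pets join is left out, assuming a Spring Data JPA setup like the one used in the next section; the name, condition, and distance_to_company columns are just placeholders for condition A and the ordering logic:

import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;

public interface PersonRepository extends JpaRepository<Person, Long> {

    // Only the joins needed for the main logic are kept; pets and vets are
    // intentionally NOT joined here and are fetched separately for the
    // 40 rows of the current page.
    @Query(value = "SELECT p.id AS id, p.name AS name, c.name AS company "
                 + "FROM persons p "
                 + "LEFT JOIN companies c ON p.company_id = c.id "
                 + "WHERE p.condition = 'A' "                 // placeholder filter (condition A)
                 + "ORDER BY p.distance_to_company ASC",      // placeholder ordering column
           countQuery = "SELECT count(*) FROM persons p WHERE p.condition = 'A'",
           nativeQuery = true)
    Page<PersonProjection> findCuratedDataPagable(Pageable pageable);
}

// Interface-based projection: getter names match the aliased columns above
interface PersonProjection {
    Long getId();
    String getName();
    String getCompany();
}

The pets (and their vets) for the 40 returned rows can then be loaded in a separate call, which is exactly what the next section covers.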
2. Avoid N+1 Issue
When handling complex data, there are instances where certain data needs to be queried subsequently. However, if we're not cautious, this can result in the N+1 problem. In essence, the N+1 problem arises when one query fetches N records and we then issue one additional query per record, producing excessive round trips to the database for what is essentially the same query logic.
Let's revisit our previous scenario. Suppose we've successfully retrieved the queried data discussed earlier in our backend. However, since we still need the missing related data, we have to make another query call. For instance, if we're coding in Java, the code might look like this:
...
Page<PersonProjection> pagePersons = personRepository.findCuratedDataPagable(...);
List<PersonProjection> listPersons = pagePersons.getContent();
for (PersonProjection personProjection : listPersons) {
    Person person = new Person();
    ... //assign data from Projection to Object
    List<Pet> pets = petRepository.findAllPetByPersonId(person.getId());
    person.setPetList(pets);
    ...
}
... //another logic until it could be returned as response
-- the query behind findAllPetByPersonId, executed once per person in the loop
SELECT ...
FROM pets p
... -- maybe have another join logic
WHERE p.person_id = :personId
...
From the provided sample, it's evident that each execution of this function necessitates roughly 40 additional round trips to our database (one per person on the page). While this number may not seem significant initially, it can escalate considerably during peak server traffic. Moreover, if we have multiple additional datasets to retrieve, the issue becomes even more pronounced. Therefore, it would be advantageous to avoid querying inside the loop to enhance efficiency.
To optimize this process, we can employ techniques such as eager loading or batching queries to retrieve all necessary data in fewer round trips, reducing the strain on our database and improving overall performance. Additionally, caching frequently accessed data can further alleviate the burden on the database and mitigate the need for repetitive queries. These strategies ensure smoother operation, particularly during periods of heightened server activity.
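As an illustration of the caching idea, and assuming Spring's caching abstraction is configured in the project (which this article does not cover), a frequently requested lookup could be cached declaratively like this:

import java.util.List;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class PetService {

    private final PetRepository petRepository;

    public PetService(PetRepository petRepository) {
        this.petRepository = petRepository;
    }

    // The result is cached per person id, so repeated requests for the same
    // person are served from the cache instead of hitting the database.
    // Requires @EnableCaching and a cache provider to be configured.
    @Cacheable(value = "petsByPerson", key = "#personId")
    public List<Pet> getPetsByPersonId(Long personId) {
        return petRepository.findAllPetByPersonId(personId);
    }
}

The trade-off is cache invalidation: whenever pets change, the corresponding entry needs to be evicted (for example with @CacheEvict), otherwise we keep serving stale data.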
Can we leverage the lessons learned in Data Structures and Algorithms classes to greatly enhance efficiency in this context? Recognizing that the data being fetched is the same kind of data, just for different person IDs, it's beneficial to fetch it collectively before entering the loop. Subsequently, we can assign it iteratively within the loop. To execute this strategy, we build a lookup map (a simple form of memoization) that stores the fetched data by the appropriate relation, such as which pets are owned by which person. The code implementation might resemble the following:
...
Page<PersonProjection> pagePersons = personRepository.findCuratedDataPagable(...);
List<PersonProjection> listPersons = pagePersons.getContent();
List<Long> personIds = listPersons.stream()
        .map(PersonProjection::getId)
        .collect(Collectors.toList());
Map<Long, List<Pet>> mappingPets = petRepository.findAllPetByPersonIds(personIds)
        .stream()
        .collect(Collectors.groupingBy(Pet::getPersonId));
for (PersonProjection personProjection : listPersons) {
    Person person = new Person();
    ... //assign data from Projection to Object
    person.setPetList(mappingPets.getOrDefault(personProjection.getId(), new ArrayList<>()));
    ...
}
... //another logic until it could be returned as response
-- the query behind findAllPetByPersonIds, executed once for the whole page
SELECT ...
FROM pets p
... -- maybe have another join logic
WHERE p.person_id IN :personIds
...
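For completeness, here is one way the batched method could be declared with Spring Data JPA; this is only a sketch, since the real entity mapping is not shown here:

import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface PetRepository extends JpaRepository<Pet, Long> {

    // One round trip: fetches the pets of every person on the current page at once
    @Query("SELECT pt FROM Pet pt WHERE pt.personId IN :personIds")
    List<Pet> findAllPetByPersonIds(@Param("personIds") List<Long> personIds);
}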
By reducing the number of database queries through collective data fetching and using a map for efficient access within the loop, we not only enhance performance but also alleviate strain on the database. As a note, converting the List into a Map takes only O(n), and retrieving a value by its key takes O(1) on average.
3. Use Batching When Bulk-Inserting Massive Data
In addition to data retrieval, we must also address the efficient storage of data in the database. While handling a single record is straightforward, dealing with massive datasets requires careful consideration of performance implications. Efficient storage mechanisms are essential to ensure optimal performance and scalability.
We can draw an analogy to illustrate this scenario. Let's envision the data we aim to store as parcels that require transportation from one location to another. When dealing with a single parcel, there's little need to strategize for efficient delivery. However, when faced with thousands of parcels, several factors come into play:
- Time required for the courier to travel from the origin to the destination.
- Increasing the number of parcels may lead to heavier loads for the courier, potentially resulting in overweight issues.
- Conversely, reducing the number of parcels per trip might necessitate additional trips, increasing the overall delivery time.

Considering these factors, we must determine the optimal number of parcels the courier can carry during each trip to strike a balance between load weight, travel time, and trip frequency. This optimization ensures efficient data storage and retrieval processes.
Based on that illustration, it's evident that meticulous management of the data storage mechanism is paramount, particularly when dealing with bulk insertion of massive datasets. A sample approach worth trying could look like this:
...
int startIndex = 0;
int batchSize = 1000;
while (startIndex < listData.size()) {
    final int endIndex = Math.min(startIndex + batchSize, listData.size());
    bulkInsert(listData.subList(startIndex, endIndex));
    startIndex = endIndex;
}
...
In the example provided, batch processing involves inserting data in groups, with each batch containing 1,000 records. Therefore, if we have 10,000 records to store, the whole operation completes in 10 round trips instead of 10,000.
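For illustration, here is a minimal sketch of what bulkInsert could do under the hood using plain JDBC batching; the table name, column names, and Pet accessors are placeholders, not the real schema:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class PetBulkWriter {

    private final DataSource dataSource;

    public PetBulkWriter(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Inserts one batch (e.g. 1,000 rows) with a single executeBatch() round trip
    public void bulkInsert(List<Pet> batch) throws SQLException {
        String sql = "INSERT INTO pets (person_id, name) VALUES (?, ?)"; // placeholder columns
        try (Connection connection = dataSource.getConnection();
             PreparedStatement statement = connection.prepareStatement(sql)) {
            for (Pet pet : batch) {
                statement.setLong(1, pet.getPersonId());
                statement.setString(2, pet.getName());
                statement.addBatch();
            }
            statement.executeBatch();
        }
    }
}

In a Spring Data JPA project, a similar effect can sometimes be achieved with saveAll combined with the hibernate.jdbc.batch_size property, although whether statements actually get batched depends on the ID generation strategy.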
Conclusion
Ultimately, these approaches represent just a few strategies that I've explored. It's important to recognize that there may be numerous other methods and insights yet to be discovered. It's crucial to remain open-minded, understanding that every approach carries its own set of trade-offs, and every problem warrants a unique solution. I'm grateful for any additional insights or suggestions that may further enhance our approach to these issues.
- Muhammad Azis Husein