We continue our journey into how Metis can prevent, monitor, and troubleshoot our databases. In previous part we started playing with Sequelize, a feature-rich ORM for TypeScript. This time we’re going to see more challenging examples. We are going to see actual insights provided by Metis and how they improve the query performance.
Introduction
Sequelize is an ORM for Node.js and TypeScript. It can connect to PostgreSQL and other SQL engines. It supports transactions, relations, eager loading, lazy loading, functions, and other common operations. Previous part covered some of the queries from the first part of this series. This time we are going to implement more sophisticated scenarios to see how Sequelize can deal with them. Let’s go.
Given two people, list what movies they appeared in together
Let’s find common movies for two people. That’s the application code doing that:
return titleBasic
.findAll({
include: [
{
model: titlePrincipal,
required: true,
duplicating: false,
as: 'titleBasicTitlePrincipal',
where: {
'nconst': actor1
}
},
{
model: titlePrincipal,
required: true,
duplicating: false,
as: 'titleBasicTitlePrincipal2',
where: {
'nconst': actor2
}
},
]
});
The code generates the following query:
SELECT
TitleBasic.tconst,
TitleBasic.titletype,
TitleBasic.primarytitle,
TitleBasic.originaltitle,
TitleBasic.isadult,
TitleBasic.startyear,
TitleBasic.endyear,
TitleBasic.runtimeminutes,
TitleBasic.genres,
titleBasicTitlePrincipal.tconst AS titleBasicTitlePrincipal.tconst,
titleBasicTitlePrincipal.ordering AS titleBasicTitlePrincipal.ordering,
titleBasicTitlePrincipal.nconst AS titleBasicTitlePrincipal.nconst,
titleBasicTitlePrincipal.category AS titleBasicTitlePrincipal.category,
titleBasicTitlePrincipal.job AS titleBasicTitlePrincipal.job,
titleBasicTitlePrincipal.characters AS titleBasicTitlePrincipal.characters,
titleBasicTitlePrincipal2.tconst AS titleBasicTitlePrincipal2.tconst,
titleBasicTitlePrincipal2.ordering AS titleBasicTitlePrincipal2.ordering,
titleBasicTitlePrincipal2.nconst AS titleBasicTitlePrincipal2.nconst,
titleBasicTitlePrincipal2.category AS titleBasicTitlePrincipal2.category,
titleBasicTitlePrincipal2.job AS titleBasicTitlePrincipal2.job,
titleBasicTitlePrincipal2.characters AS titleBasicTitlePrincipal2.characters
FROM
imdb.title_basics AS TitleBasic
INNER JOIN imdb.title_principals AS titleBasicTitlePrincipal ON TitleBasic.tconst = titleBasicTitlePrincipal.tconst
AND titleBasicTitlePrincipal.nconst = 'nm0302368'
INNER JOIN imdb.title_principals AS titleBasicTitlePrincipal2 ON TitleBasic.tconst = titleBasicTitlePrincipal2.tconst
AND titleBasicTitlePrincipal2.nconst = 'nm0001908';
Metis indicates a missing index:
The query takes 5 seconds to finish. We can now improve the query performance by adding an index for title_principals. This is the code to do that:
CREATE INDEX IF NOT EXISTS title_principals_nconst_idx ON imdb.title_principals(nconst) INCLUDE (tconst)
Notice that we had to configure two associations in the model to make this work. This is very misleading and should be avoided if possible. What if we don’t like the second association? We can implement joins in the application code, like this:
const first = titlePrincipal
.findAll({
attributes: ['tconst'],
where: {
'nconst': actor1
}
}).then(titles => titles.map(t => t.tconst));
const second = titlePrincipal
.findAll({
attributes: ['tconst'],
where: {
'nconst': actor2
}
}).then(titles => titles.map(t => t.tconst));
return first.then(firstTitles => second.then(secondTitles => {
return titleBasic
.findAll({
where: {
tconst: {
[Op.and]: [
{
[Op.in]: [...firstTitles]
},
{
[Op.in]: [...secondTitles]
}
]
}
}
});
}));
We first load titles for the first person, then load titles for the second person. Finally, we process the lists together, and we send another query that would use both of them. This is the interaction with the database:
First query:
SELECT
tconst
FROM
imdb.title_principals AS TitlePrincipal
WHERE
TitlePrincipal.nconst = 'nm0302368';
Second query:
SELECT
tconst
FROM
imdb.title_principals AS TitlePrincipal
WHERE
TitlePrincipal.nconst = 'nm0001908';
These two queries are rather straightforward. We could then intersect the datasets in the application code, or we can ask the database to do so. No matter what we do, we end up with a query similar to this one:
SELECT
tconst,
titletype,
primarytitle,
originaltitle,
isadult,
startyear,
endyear,
runtimeminutes,
genres
FROM
imdb.title_basics AS TitleBasic
WHERE
(
TitleBasic.tconst IN (
'tt0000439','tt0014222','tt0021457','tt0029541','tt0447157','tt0020221','tt0020275','tt0460244','tt0829197','tt0008572','tt0009046','tt0040991','tt0155701'
)
AND TitleBasic.tconst IN (
'tt0000439','tt0000452','tt0000501','tt0000528','tt0000578',...
)
);
This scenario is now slower. Processing results in application takes 9 seconds to complete, so it’s slower than plain SQL. We can’t tell that it’s always the case, as sometimes it’s faster to process data in the application, especially when we don’t extract many records from the SQL. Your mileage may vary, so always measure your solution.
List all of the cast and crew in a given movie
Let’s now find all the people involved in the movie. Here is the first solution. We can just send the raw query:
return sequelize.query(`
SELECT DISTINCT NB.*
FROM imdb.title_basics AS TB
LEFT JOIN imdb.title_principals AS TP ON TP.tconst = TB.tconst
LEFT JOIN imdb.title_crew AS TC ON TC.tconst = TB.tconst
LEFT JOIN imdb.name_basics AS NB ON
NB.nconst = TP.nconst
OR TC.directors = NB.nconst
OR TC.directors LIKE NB.nconst || ',%'::text
OR TC.directors LIKE '%,'::text || NB.nconst || ',%'::text
OR TC.directors LIKE '%,'::text || NB.nconst
OR TC.writers = NB.nconst
OR TC.writers LIKE NB.nconst || ',%'::text
OR TC.writers LIKE '%,'::text || NB.nconst || ',%'::text
OR TC.writers LIKE '%,'::text || NB.nconst
WHERE TB.tconst = :tconst
`, {
model: nameBasic,
mapToModel: true,
replacements: {
tconst: tconst
}
});
This works, but is inefficient as we identified in the previous part. We can do the query with union to speed things up:
return sequelize.query(`
SELECT DISTINCT NB.*
FROM imdb.title_principals AS TP
JOIN (
SELECT tconst, directors, writers
FROM imdb.title_crew
WHERE tconst = :tconst
) AS TC ON TC.tconst = TP.tconst
LEFT JOIN imdb.name_basics AS NB ON NB.nconst = TP.nconst
UNION
SELECT DISTINCT NB.*
FROM imdb.title_principals AS TP
JOIN (
SELECT tconst, directors, writers
FROM imdb.title_crew
WHERE tconst = :tconst
) AS TC ON TC.tconst = TP.tconst
LEFT JOIN imdb.name_basics AS NB ON TC.directors LIKE NB.nconst || ',%'::text
UNION
SELECT DISTINCT NB.*
FROM imdb.title_principals AS TP
JOIN (
SELECT tconst, directors, writers
FROM imdb.title_crew
WHERE tconst = :tconst
) AS TC ON TC.tconst = TP.tconst
LEFT JOIN imdb.name_basics AS NB ON TC.directors LIKE '%,'::text || NB.nconst || ',%'::text
UNION
SELECT DISTINCT NB.*
FROM imdb.title_principals AS TP
JOIN (
SELECT tconst, directors, writers
FROM imdb.title_crew
WHERE tconst = :tconst
) AS TC ON TC.tconst = TP.tconst
LEFT JOIN imdb.name_basics AS NB ON TC.directors LIKE '%,'::text || NB.nconst
UNION
SELECT DISTINCT NB.*
FROM imdb.title_principals AS TP
JOIN (
SELECT tconst, directors, writers
FROM imdb.title_crew
WHERE tconst = :tconst
) AS TC ON TC.tconst = TP.tconst
LEFT JOIN imdb.name_basics AS NB ON TC.writers = NB.nconst
UNION
SELECT DISTINCT NB.*
FROM imdb.title_principals AS TP
JOIN (
SELECT tconst, directors, writers
FROM imdb.title_crew
WHERE tconst = :tconst
) AS TC ON TC.tconst = TP.tconst
LEFT JOIN imdb.name_basics AS NB ON TC.writers LIKE NB.nconst || ',%'::text
UNION
SELECT DISTINCT NB.*
FROM imdb.title_principals AS TP
JOIN (
SELECT tconst, directors, writers
FROM imdb.title_crew
WHERE tconst = :tconst
) AS TC ON TC.tconst = TP.tconst
LEFT JOIN imdb.name_basics AS NB ON TC.writers LIKE '%,'::text || NB.nconst || ',%'::text
UNION
SELECT DISTINCT NB.*
FROM imdb.title_principals AS TP
JOIN (
SELECT tconst, directors, writers
FROM imdb.title_crew
WHERE tconst = :tconst
) AS TC ON TC.tconst = TP.tconst
LEFT JOIN imdb.name_basics AS NB ON TC.writers LIKE '%,'::text || NB.nconst
`, {
model: nameBasic,
mapToModel: true,
replacements: {
tconst: tconst
}
});
}
They take 100 seconds and 25 seconds respectively. That’s not very fast. Let’s try implementing the same in the application code:
const crewViaTitlePrincipals = titlePrincipal
.findAll({
attributes: ['nconst'],
where: {
'tconst': tconst
}
}).then(crew => crew.map(c => c.nconst));
const crewViaTitleCrew = titleCrew
.findAll({
where: {
'tconst': tconst
}
});
const crewMatchingNames = crewViaTitleCrew.then(crew => crew.flatMap(c => [
c.directors.split(','),
c.writers.split(',')
].flat()));
const allMatchingNames = crewViaTitlePrincipals.then(crew1 => crewMatchingNames.then(crew2 => new Set([crew1, crew2].flat())));
return allMatchingNames.then(names => nameBasic
.findAll({
where: {
'nconst': { [Op.in]: [...names] }
}
}));
We first take the crew from the title_principals
table. Next, we extract the crew from the title_crew
. Finally, we split directors
and writers
, and we calculate the intersection of both datasets to get the names.
This generates the following trace in Metis:
It’s all green! We can see that we have three queries sent, but they are blazingly fast because of indexes. This time it was much better to do processing in the application instead on the SQL level.
As mentioned in the previous section, there is no silver bullet. You need to measure your code and see where the performance benefits are.
Find the most prolific actor in a given period
Let’s now find the person that starred in the most movies in a given period. Let’s do that using application code:
const titlesMatchingPeriod = titleBasic
.findAll({
attributes: ['tconst'],
where: {
startyear: {
[Op.and]: [
{ [Op.gte]: startYear },
{ [Op.lte]: endYear }
]
}
}
}).then(titles => titles.map(t => t.tconst));
const principals = titlesMatchingPeriod.then(titles => titlePrincipal
.findAll({
attributes: ['nconst'],
where: {
tconst: { [Op.in]: [...new Set(titles)] }
}
}).then(principals => {
const counts = principals
.reduce(
(entryMap, e) => {
entryMap[e.nconst] = (entryMap[e.nconst] || 0) + 1;
return entryMap;
},
{}
);
const keys = Object.keys(counts);
const countsWithKeys = keys.map(k => [counts[k], k]);
countsWithKeys.sort((pair1, pair2) => pair2[0] - pair1[0]);
const topResults = countsWithKeys.splice(0,1);
return topResults;
})
);
return principals.then(countsWithKeys => nameBasic
.findAll({
where: {
nconst: { [Op.in]: countsWithKeys.map(c => "" + c[1]) }
}
}).then(actors => actors.map(a => {
a.movies_count = countsWithKeys.filter(c => c[1] == a.nconst)[0][0];
return a;
})));
We first find titles in a given period. Next, we take the crew for each of them. Finally, we need to group the identifiers and get the top one, and get the details. This is a bit lengthy and is rather slow because we get all the movies in a given period. That’s a lot of data:
Metis indicates nearly 70 000 rows. That’s a lot to process. What’s more, we then take all those identifiers and send in another query which is nearly 100 000 characters long.
Can we do better? Yes, we can send a raw query:
return sequelize.query(`
WITH best_actor AS (
SELECT TP.nconst, COUNT(*) AS number_of_titles
FROM imdb.title_basics AS TB
LEFT JOIN imdb.title_principals AS TP ON TP.tconst = TB.tconst
WHERE TB.startyear >= :startyear AND TB.startyear <= :endyear AND TP.nconst IS NOT NULL
GROUP BY TP.nconst
ORDER BY number_of_titles DESC
LIMIT 1
)
SELECT BA.nconst, BA.number_of_titles, NB.primaryname, nb.birthyear, NB.deathyear, nb.primaryprofession
FROM best_actor AS BA
JOIN imdb.name_basics AS NB ON NB.nconst = BA.nconst
`, {
model: nameBasic,
mapToModel: true,
replacements: {
startyear: startYear,
endyear: endYear
}
});
Metis shows the following:
There are two tables that we can configure indexes on: title_principals
and title_basics
. Once we do that, we get the best performance.
Find most prolific actors in a given genre
Let’s now find actors that did the most movies in a given genre. This is a very similar scenario to the previous one. Let’s start with the application code again:
const titlesMatchingGenre = titleBasic
.findAll({
attributes: ['tconst', 'genres'],
where: {
genres: { [Op.like]: '%' + genre + '%' }
}
}).then(titles => titles
.filter(t => t.genres.split(',').indexOf(genre) >= 0)
.map(t => t.tconst)
);
const principals = titlesMatchingGenre.then(titles => titlePrincipal
.findAll({
attributes: ['nconst'],
where: {
tconst: { [Op.in]: [...new Set(titles)] }
}
}).then(principals => {
const counts = principals
.reduce(
(entryMap, e) => {
entryMap[e.nconst] = (entryMap[e.nconst] || 0) + 1;
return entryMap;
},
{}
);
const keys = Object.keys(counts);
const countsWithKeys = keys.map(k => [counts[k], k]);
countsWithKeys.sort((pair1, pair2) => pair2[0] - pair1[0]);
const topResults = countsWithKeys.splice(0,10);
return topResults;
})
);
return principals.then(countsWithKeys => nameBasic
.findAll({
where: {
nconst: { [Op.in]: countsWithKeys.map(c => "" + c[1]) }
}
}).then(actors => actors.map(a => {
a.movies_count = countsWithKeys.filter(c => c[1] == a.nconst)[0][0];
return a;
})));
This is nearly the same as before, only this time we query for genre. Metis shows the following:
This is even worse than the previous scenario. We now extract around 400 000 rows. While it takes 2.5 seconds to execute the query, sending the data over the wire takes much longer.
To fix that, we can use the raw query:
return sequelize.query(`
WITH best_actors AS (
SELECT TP.nconst, COUNT(*) AS movies_count
FROM imdb.title_basics AS TB
LEFT JOIN imdb.title_principals AS TP ON TP.tconst = TB.tconst
WHERE TB.genres = :genre OR TB.genres LIKE (:genre || '%,') OR TB.genres LIKE ('%,' || :genre || ',%') OR TB.genres LIKE ('%,' || :genre)
GROUP BY TP.nconst
ORDER BY movies_count DESC
LIMIT 10
)
SELECT BA.nconst, NB.primaryname, NB.birthyear, BA.movies_count
FROM best_actors AS BA
JOIN imdb.name_basics AS NB ON NB.nconst = BA.nconst
ORDER BY movies_count DESC
`, {
model: nameBasic,
mapToModel: true,
replacements: {
genre: genre
}
});
This gives the following:
Finally, we can configure indexes and make this query much faster.
Finding most common coworkers
Let’s now try to find people that work together the most. We can do that with a raw query:
return sequelize.query(`
WITH RECURSIVE numbers AS (
SELECT 1 AS number
UNION ALL
SELECT number + 1 AS number FROM numbers WHERE number < 1500
),
titles_for_person AS (
SELECT TC.tconst
FROM imdb.title_crew AS TC
WHERE directors = :nconst OR directors LIKE :nconst || ',%' OR directors LIKE '%,' || :nconst || ',%' OR directors LIKE '%,' || :nconst
UNION
SELECT TC.tconst
FROM imdb.title_crew AS TC
WHERE writers = :nconst OR writers LIKE :nconst || ',%' OR writers LIKE '%,' || :nconst || ',%' OR writers LIKE '%,' || :nconst
UNION
SELECT tconst
FROM imdb.title_principals
WHERE nconst = :nconst
),
titles_corresponding AS (
SELECT TC.tconst, TC.directors, TC.writers
FROM imdb.title_crew AS TC
JOIN titles_for_person AS TFP ON TFP.tconst = TC.tconst
),
split_associations AS (
SELECT TC.tconst, SPLIT_PART(TC.directors, ',', N.number) AS nconst
FROM titles_corresponding AS TC
CROSS JOIN numbers AS N
WHERE directors IS NOT NULL AND CHAR_LENGTH(directors) - CHAR_LENGTH(REPLACE(directors, ',', '')) + 1 >= N.number
UNION
SELECT TC.tconst, SPLIT_PART(TC.writers, ',', N.number) AS nconst
FROM titles_corresponding AS TC
CROSS JOIN numbers AS N
WHERE writers IS NOT NULL AND CHAR_LENGTH(writers) - CHAR_LENGTH(REPLACE(writers, ',', '')) + 1 >= N.number
),
all_associations AS (
SELECT SA.tconst, SA.nconst
FROM split_associations AS SA
UNION
SELECT TP.tconst, TP.nconst
FROM imdb.title_principals AS TP
JOIN titles_for_person AS TFP ON TFP.tconst = TP.tconst
),
other_people AS (
SELECT nconst
FROM all_associations
WHERE nconst != :nconst
),
top_peers AS (
SELECT OP.nconst, COUNT(*) as common_titles
FROM other_people AS OP
GROUP BY nconst
ORDER BY common_titles DESC
LIMIT 5
)
SELECT TP.nconst, TP.common_titles, NB.*
FROM top_peers AS TP
JOIN imdb.name_basics AS NB ON NB.nconst = TP.nconst
ORDER BY TP.common_titles DESC
`, {
model: nameBasic,
mapToModel: true,
replacements: {
nconst: nconst
}
});
Metis shows the following:
We can see multiple table scans and millions of rows read. We can optimize that by adding indexes. Let’s see if we can make it faster just by running the code in application:
const titlesPrincipalMatchingPerson = titlePrincipal
.findAll({
attributes: ['tconst'],
where: {
nconst: nconst
}
}).then(titles => titles.map(t => t.tconst));
const otherTitlePrincipals = titlesPrincipalMatchingPerson.then(titles => titlePrincipal
.findAll({
attributes: ['nconst'],
where: {
tconst: { [Op.in]: titles },
nconst: { [Op.ne]: nconst }
}
})).then(titles => titles.map(t => t.nconst));
const titleCrewMatchingPerson = titleCrew
.findAll({
where: {
[Op.or]: [
{ directors: { [Op.like]: '%' + nconst + '%' } },
{ writers: { [Op.like]: '%' + nconst + '%' } }
]
}
}).then(titles => {
return titles
.filter(t => (t.directors || "").split(",").indexOf(nconst) >= 0 || (t.writers || "").split(",").indexOf(nconst) >= 0)
.map(t => [...new Set([(t.directors || "").split(","), (t.writers || "").split(",")].flat())].filter(n => n != nconst && n != "" && n))
});
const allTeammates = Promise.all([otherTitlePrincipals, titleCrewMatchingPerson]).then(nconsts => {
return nconsts.flat().filter(n => n && n != "");
});
const topTeammates = allTeammates.then(nconsts => {
const counts = nconsts
.reduce(
(entryMap, e) => {
entryMap[e] = (entryMap[e] || 0) + 1;
return entryMap;
},
{}
);
const keys = Object.keys(counts);
const countsWithKeys = keys.map(k => [counts[k], k]);
countsWithKeys.sort((pair1, pair2) => pair2[0] - pair1[0]);
const topResults = countsWithKeys.splice(0,5);
return topResults;
});
return topTeammates.then(countsWithKeys => nameBasic
.findAll({
where: {
nconst: { [Op.in]: countsWithKeys.map(c => "" + c[1]) }
}
}).then(actors => actors.map(a => {
a.common_titles = countsWithKeys.filter(c => c[1] == a.nconst)[0][0];
return a;
})));
First, we want to take that data from the title_principals
table. We start with a known nconst
, find all movies for a given person, and then for all the movies we find other people.
Next, we take similar data from title_crew
. We then parse the data, aggregate it, and find top teammates. Finally, we get the data from the name_basics
table.
We see that Metis identifies indexes that may improve the scenario. Even without these indexes, the query running in application finishes in 6 seconds vs 19 seconds with raw query. That’s a good starting point. Once again, we need to add improvements and measure them to see how things work in practice.
Summary
We examined various real scenarios with Sequelize to see how we can improve the performance. In previous parts of this series we relied on the database only to make it faster. This time we can do some processing with ORM and the application to change the performance characteristics. Metis shows good insights into all what we do, so we can track the optimizations and performance benefits as we proceed with the implementation changes. We can do the same with our actual applications even if we have very few rows in the database.
Top comments (0)