I came across a problem this week, that I was sure, had a canonical solution (i.e. standard and documented). But no, not that I could find. So yet again a process of reading, searching, testing, trying, talking (on Stack Overflow, with ChatGPT) and resulting notes and even a sandbox (demo).
Problem Statement
The context: A database (nominally behind a website) containing Things, and those Things are in Categories.
The environment: Django as a web framework and Python as a backend, but raw SQL will do if needed.
In the Django context the simplest scenario is:
from django.db import models
class Thing(models.Model):
name = models.CharField('Name of the Thing', max_length=100)
class Category(models.Model):
name = models.CharField('Name of the Category', max_length=100)
things = models.ManyToManyField(Thing,
verbose_name='Things',
related_name='categories')
Which of course is backed by three tables in a (nominally, Postgresql, though I did trials in SQLite) database (in pro forma - Django will actually use slightly different names depending on context for such tables):
-
thing
which has one thing per row -
category
- which has one category per row -
thing_category
- which has a (thing id, category id) row for each relationship between things and categories.
You might prefer to think of things and categories in some other way more familiar to you (anything at all, with a many-to-many relationship suffices), perhaps students and courses, papers and authors, people and interests, food and its properties ... whatever really, I have several applications so asked this question in the general terms of things and categories.
The task: Given a list of categories, say [1, 4, 7, 23], write one query which returns all the things that are in all of these categories (but not limited to them, i.e. can be in other categories as well).
In turns out that the partner problem, finding all the things that are in any of those categories is easy. In Django:
things = Thing.objects.filter(categories__in=[1, 4, 7, 23])
this returns duplicate things, a wonderfully bizarre feature of Django, and to prevent that one needs to add although...
.distinct()
This, in effect, is an ANY/OR result, returning all things that are in any of the listed categories, alternately, all things that are in category 1 or category 4 or category 7 or category 23.
What I need is an ALL/AND result, returning all things that in all of the listed categories, alternately, all things that are category 1 and category 4 and category 7 and category 23.
Solutions
TLDR: solution
cats = [1, 4, 7, 23]
things = Thing.objects.annotate(
cat_count=Count('id', filter=Q(categories__in=cats))
).filter(cat_count=len(cats))
Field Lookups
categories__in
is what Django calls a field lookup. They are very handy and the first port-of-call was to try and find a field lookup that does this. Alas none was to be found. The __in
lookup is used to perform an ANY/OR match. There is no equivalent for ALL/AND lookups like say __in_all
- alas. It's possible to write custom lookups and I have done that before, and was tempted here, but the SQL solution is non-trivial to implement in a custom lookup (can be done, but the method is poorly documented and would need more research).
Turns out the SQL should I (or someone else) choose to build a lookup like __in_all
can work on the joining table alone, so with the tables and example above:
SELECT thing_id
FROM thing_category
WHERE category_id in (1, 4, 7, 23)
GROUP BY thing_id
HAVING Count(*) = 4;
This works well because the WHERE clause excludes all other categories, so only those in the list will be counted by the Count(*) function. To wit, this fetches all the relations that are (at least) in the listed categories (they may be in more). Precisely what I am targeting. Of course, this assumes there are no duplicate rows in the table (a safe assumption with Django joining tables).
Interestingly, the __in
lookup which returns the ANY/OR set of things can be written in similar SQL as:
SELECT thing_id
FROM thing_category
WHERE category_id in (1, 4, 7, 23)
GROUP BY thing_id
HAVING Count(*) >= 1;
In lieu of a custom lookup or raw SQL query though, a better Django solution is preferred.
Filtered Counts
Django filters are often a challenge to write. It is both a wonderful abstraction of the underlying database and, at the same time, a very different approach and way of thinking. There's always this small challenge with Django, if you have SQL you know works, translating that to Django filter.
It turns out the Django Count aggregate function has a (poorly documented) filter argument.
This:
things = Thing.objects.annotate(cat_count=Count('categories'))
is fairly standard Django (that many will recognise), which adds a count of the categories as an attribute named cat_count
to each of returned things
.
But we don't want a count of the categories every thing is in, we want a count of the categories from our list that each thing is in. It turns out Django supports that through the rather poorly documented filter addition to the Count aggregate.
So we can limit the count to that list of desire categories using a Count filter, and that does use a field lookup! As in:
cats = [1, 4, 7, 23]
things = Thing.objects.annotate(cat_count=Count('categories', filter=Q(categories__in=cats)))
And now we have all things annotated with a count of categories from 0 to 4 (the length of cats) as the Count was only on a filtered set of them.
To get all things that are in ALL categories we can filter for those that have 4 reported, so:
cats = [1, 4, 7, 23]
things = Thing.objects.annotate(cat_count=Count('categories', filter=Q(categories__in=cats)))
things = things.filter(cat_count=len(cats))
Revisiting ANY/OR (a digression)
Ironically, we of course replicate what the __in
lookup does using this annotation too, so that this:
cats = [1, 4, 7, 23]
things = Thing.objects.filter(categories__in=cats)
is almost the same (as in, produces the same results as):
cats = [1, 4, 7, 23]
things = Thing.objects.annotate(cat_count=Count('categories', filter=Q(categories__in=cats)))
things = things.filter(cat_count__gte=1)
Which uses the __gte
field lookup. Alas, the first cut, duplicates things, Django does a lazy join and then filter on the join table, so things appear once per matching category. Doh! Like that would be of use to anyone, ever, in any context? So We have to help Django do the sensible thing and explicitly request distinct things:
cats = [1, 4, 7, 23]
things = Thing.objects.filter(categories__in=cats).distinct()
And now the two queries return near identical results (the ordering can be different unless we specify it).
Drilling down to SQL
We can of course ask Django for the SQL it generates (turns out that's not trivial but that is a digression for another article). And it's interesting to compare the SQL that the __in
lookup generates compared with our annotating approach.
Here's what an __in lookup generates (in pro forma):
SELECT DISTINCT thing.id, thing.name
FROM thing
INNER JOIN category_things ON (thing.id = category_things.thing_id)
WHERE category_things.category_id IN (1, 4, 7, 23)
and here's what our annotation approach produces (in pro forma):
SELECT
thing.id,
thing.name,
COUNT(category_things.category_id)
FILTER (WHERE category_things.category_id IN (1, 4, 7, 23))
AS "cat_count"
FROM thing
LEFT OUTER JOIN
category_things ON (thing.id = category_things.thing_id)
GROUP BY
thing.id, thing.name
HAVING
COUNT(category_things.category_id)
FILTER (WHERE (category_things.category_id IN (1, 4, 7, 23)))
>= 1
Django's SQL clearly does not adhere to DRY, given it reproduces COUNT(category_things.category_id) FILTER (WHERE category_things.category_id IN (1, 4, 7, 23))
in two places while this is also valid (and tested, it works):
SELECT
thing.id,
thing.name,
COUNT(category_things.category_id)
FILTER (WHERE category_things.category_id IN (1, 4, 7, 23))
AS "cat_count"
FROM thing
LEFT OUTER JOIN
category_things ON (thing.id = category_things.thing_id)
GROUP BY
thing.id, thing.name
HAVING
cat_count >= 1
caveat
this may be an artefact of the SQL drill down, or generation, which as I said is not trivial in Django, but the subject for another article, point here being is we are not necessarily seeing here the SQL Django actually sends to the database engine - though in all honesty it pretty likely is
For completeness of course the SQL for the ALL/AND request that I was pursuing at outset becomes:
SELECT
thing.id,
thing.name,
COUNT(category_things.category_id)
FILTER (WHERE category_things.category_id IN (1, 4, 7, 23))
AS "cat_count"
FROM thing
LEFT OUTER JOIN
category_things ON (thing.id = category_things.thing_id)
GROUP BY
thing.id, thing.name
HAVING
cat_count = 4
The Research Effort
For the record, the trail of research that went into the notes, that morphed into this article, on realisation that esoteric snippets of acquired knowledge are worth recording and sharing are:
Top comments (0)