The basic question is whether predicates in a nested
query operator considers one document at a time (it should according to the docs) and more importantly whether multiple conjunctive (AND) or disjunctive (OR) operations are evaluated and composed on independent records.
Here is the mapping we will test. Let’s assume the index is called shapes
. We are only concerned about the points that make up the shape so other properties are not included in this example.
{
"mappings": {
"_doc": {
"properties": {
"points": {
"type": "nested",
"properties": {
"x": {
"type": "integer"
},
"y": {
"type": "integer"
}
}
}
}
}
}
}
Here are a few shape records we will use to test.
{"points": [
{"x": 1, "y": 1},
{"x": 1, "y": 2},
{"x": 2, "y": 1},
{"x": 2, "y": 2}
]}
{"points": [
{"x": 1, "y": 1},
{"x": 1, "y": 2}
]}
{"points": [
{"x": 1, "y": 1},
{"x": 1, "y": 2},
{"x": 2, "y": 1}
]}
To re-iterate, the question is whether nested
queries (which points
is a nested datatype in this case) considers all sub-records in an array.
To get started, consider the following query:
{
"query": {
"nested": {
"path": "points",
"query": {
"bool": {
"must": [
{"term": {"points.x": 2}}
]
}
}
}
}
}
This translates to "find me all the shapes having at least one point where x = 2
". In the records above, the second shape record does not have a point where x = 2
and Elasticsearch (ES) correctly excludes this from the result.
The second test added a requirement that the point must also have y = 2
rather than any value.
{
"query": {
"nested": {
"path": "points",
"query": {
"bool": {
"must": [
{"term": {"points.x": 2}},
{"term": {"points.y": 2}}
]
}
}
}
}
}
The first shape satisfies this requirement and ES correctly returns only this record.
Now we will try a seemingly, subtly different query. The must
terms will be put in separate nested
expressions.
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "points",
"query": {
"term": {"points.x": 2}
}
}
},
{
"nested": {
"path": "points",
"query": {
"term": {"points.y": 2}
}
}
}
]
}
}
}
The distinction here is that each record is evaluated against each nested
operand independently whereas the above query required a single point to satisfy both conditions.
We can further confirm this independent behavior through the following query which requests shapes "having a point x = 1, y = 1
and at least one point where x = 2
, but not the point x = 2, y = 2
".
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "points",
"query": {
"bool": {
"must": [
{"term": {"points.x": 1}},
{"term": {"points.y": 1}}
]
}
}
}
},
{
"nested": {
"path": "points",
"query": {
"bool": {
"must": [
{"term": {"points.x": 2}}
]
}
}
}
}
],
"must_not": {
"nested": {
"path": "points",
"query": {
"bool": {
"must": [
{"term": {"points.x": 2}},
{"term": {"points.y": 2}}
]
}
}
}
}
}
}
}
Note the last two nested
operands overlap in that both will match on points where x = 2
. All three shapes satisfy the first must
condition, but only shapes 1 and 3 satisfy the second must
condition. The must_not
condition excludes shape 1 since it contains the point x = 2, y = 2
. Therefore the ES result only contains shape 3.
Comparison with SQL
For those familiar with relational databases and how these kinds of queries are expressed in SQL, conditioning on more than one record of a one-to-many relationship is not intuitive at first, but is easy to learn and becomes obvious.
If we were to model this data in a relational form, we would have the following tables:
create table "shapes" (
id integer primary key
);
create table "points" (
shape_id integer references "shapes" ("id"),
x integer,
y integer
);
Since a shape can have many points, it needs to be modeled as a relation that references the shape record the point pertains to (this could be further normalized to not duplicate any unique point, but this optimization is not important to this post).
The insert statements are as follows which utilizes Postgres' multi-row VALUES
syntax.
insert into "shapes" (id) values
(1), (2), (3);
insert into "points" (shape_id, x, y) values
(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2),
(2, 1, 1), (2, 1, 2),
(3, 1, 1), (3, 1, 2), (3, 2, 1);
If we were to express the first ES query above in SQL, we would write:
select
shapes.id
from
shapes
inner join points
on shapes.id = points.shape_id
where
points.x = 2
This would return shapes 1 and 3 as ES did
The second query is equally simple by adding the condition on y
for the same point.
select
shapes.id
from
shapes
inner join points
on shapes.id = points.shape_id
where
points.x = 2 and
points.y = 2
What about the third query? As a reminder, we split these two conditions into separate ES nested
queries which, based on the documented semantics, should operate independently across records. Modeling this in SQL requires expressing independent relations between shapes and points.
select
shapes.id
from
shapes
inner join points p1
on shapes.id = p1.shape_id
inner join points p2
on shapes.id = p2.shape_id
where
p1.x = 2 and
p2.y = 2
The difference here is that each condition (in the where
clause) applies to a different points
relation and thus does not require a single record in either relation to satisfy both conditions, but rather at least one record in both relations to satisfy one of the conditions.
What about the third query?
select
shapes.id
from
shapes
inner join points p1
on shapes.id = p1.shape_id
inner join points p2
on shapes.id = p2.shape_id
inner join points p3
on shapes.id = p3.shape_id
where
p1.x = 1 and p1.y = 1 and
p2.y = 2 and
not (p3.x = 2 and p3.y = 2)
Another join is made to the points
table yielding a new relation, but again this means that any conditions on this relation apply independently from the other joined relations. In this case, the third relation asserts that a shape does not contain a point x = 2, y = 2
.
Given this comparison we can assert that conditions in a nested
operand operate over the collection of records independently from other nested
operands and thus are semantically equivalent to independent joins as represented in the SQL statement above.
Top comments (0)