Two edge cases in PostgreSQL full-text search

#postgres

We widely use PostgreSQL full-text search in our projects. It is fast, reliable, and doesn't add any additional technical complexity. But sometimes it may not work as you expect it too.

Usually we add a ts_vector column right in our data tables. Let's see how it has it is done.

title	tsv
Handyman	‘handyman’:1
Heating & Cooling	‘cool’:2 ‘heat’:1
Painting	‘paint’:1
Roofing	‘roof’:1

Column tsv is generated with to_tsvector function from a title column with a trigger function.

CREATE FUNCTION skills_tsv_update_trigger() RETURNS trigger AS $$
begin
  new.tsv :=
    to_tsvector('pg_catalog.english', COALESCE(new.title, ''))
  return new;
end
$$ LANGUAGE plpgsql;

Then we can query this table using to_tsquery fuction.

SELECT * FROM skills
WHERE
    tsv @@ to_tsquery('pg_catalog.english', 'roof')

-- 1 record found

A record with Roofing has been found, and this makes sense. But what should be found for handy query? You can imagine that it will be a Handyman skill and you are wrong. Before explaining why this happens, let's try another query.

SELECT * FROM skills
WHERE
    tsv @@ to_tsquery('pg_catalog.english', 'roofing')

-- no records found

No records were found, even if we have a record with Roofing title in our table.

The to_tsvector function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries is consulted, where the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized lexemes to represent the token. For example, roofing became roof. Some words are recognized as stop words, like & in our example. It is possible to have many different configurations in the same database, and predefined configurations are available for various languages. In our example we used the default configuration english for the English language.

to_tsquery function parses user input and converts it to tsquery data type. It uses the same dictionaries as to_tsvector function. Let's try to understand what happens in two queries that were listed above.

The result of to_tsquery('pg_catalog.english', 'handy') will be handi because of pluralization rules for words ending on y. The query listed below doesn't return any records because 'handyman':1 vector does not overlap with handi query.

SELECT * FROM skills
WHERE
    tsv @@ to_tsquery('pg_catalog.english', 'handy')

-- no records found

How we can cover this situation? As you remember you can change a configuration that is used to parse user input. Let's try a simple configuration.

select to_tsquery('pg_catalog.simple', 'handy')

-- 'handy'

Looks like this is what we need, let's add this to our query.

SELECT * FROM skills
WHERE
    tsv @@ to_tsquery('pg_catalog.english', 'handy')
    OR
    tsv @@ to_tsquery('pg_catalog.simple', 'handy')    

-- 1 record found

A problem with Roofing and roof can be solved in the same way. Let's change the function that is used to populate tsv column.

CREATE FUNCTION skills_tsv_update_trigger() RETURNS trigger AS $$
begin
  new.tsv :=
    to_tsvector('pg_catalog.english', COALESCE(new.title, '')) ||
    to_tsvector('pg_catalog.simple', COALESCE(new.title, ''))
  return new;
end
$$ LANGUAGE plpgsql;

title	tsv
Handyman	‘handyman’:1,2
Heating & Cooling	‘cool’:2 ‘cooling’:4 ‘heat’:1 ‘heating’:3
Painting	‘paint’:1 ‘painting’:2
Roofing	‘roof’:1 ‘roofing’:2

Now vector has more data and query below works as expected.

SELECT * FROM skills
WHERE
    tsv @@ to_tsquery('pg_catalog.english', 'roofing')

-- 1 record found

DEV Community

Two edge cases in PostgreSQL full-text search

Top comments (0)