Exceptions and wordforms are two useful tools built into Manticore Search, which you can use to improve search recall and precision. They have a lot in common, but also have important differences that I’d like to cover in this article.
About tokenization
What’s the difference between full-text search (also called free-text search) and wildcard kinds of search such as:
- commonly known
LIKE
operator in this or that form - or more complex regular expressions
? Of course there are tons of differences, but it all starts with what we do with the initial input text in each of the approaches:
- with the wildcard search approach we normally consider the text as a whole
- while in the area of full-text search it’s essential to first tokenize the text and then consider each token as a separate entity
When you want to tokenize text you need to decide how to do it, in particular:
- What should be separators and word characters. Normally a separator is a character that doesn’t occur inside a word, for example punctuation marks:
.
,,
,?
,!
,-
etc. - Whether the tokens’ letter case should be retained or not. Normally it’s not since it’s bad for search that you don’t find
Orange
by the keywordorange
.
Manticore does it all automatically. For example, text “What do I have? The list is: a cat, a dog and a parrot.
” gets tokenized into:
mysql> drop table if exists t;
mysql> create table t(f text);
mysql> call keywords('What do I have? The list is: a cat, a dog and a parrot.', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | what | what |
| 2 | do | do |
| 3 | i | i |
| 4 | have | have |
| 5 | the | the |
| 6 | list | list |
| 7 | is | is |
| 8 | a | a |
| 9 | cat | cat |
| 10 | a | a |
| 11 | dog | dog |
| 12 | and | and |
| 13 | a | a |
| 14 | parrot | parrot |
+------+-----------+------------+
14 rows in set (0.00 sec)
As you can see:
- the punctuation marks were removed
- and all the words were lowercased
Problem
Here comes the first problem: in some cases separators are considered regular word characters, for example in “Is c++ the most powerful language?
” it’s obvious that c++
is a separate word. It’s easy to understand for people, but not for full-text search algorithms, since it sees the plus sign, doesn’t find it in it’s list of word characters and removes it from the token, so you end up with:
mysql> drop table if exists t;
mysql> create table t(f text);
mysql> call keywords('Is c++ the most powerful language?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | is | is |
| 2 | c | c |
| 3 | the | the |
| 4 | most | most |
| 5 | powerful | powerful |
| 6 | language | language |
+------+-----------+------------+
6 rows in set (0.00 sec)
OK, but what’s the problem?
The problem is that after this tokenization if you search for c#
, for example, you will find the above sentence:
mysql> drop table if exists t;
mysql> create table t(f text);
mysql> insert into t values(0,'Is c++ the most powerful language?');
mysql> select highlight() from t where match('c#');
+-------------------------------------------+
| highlight() |
+-------------------------------------------+
| Is <b>c</b>++ the most powerful language? |
+-------------------------------------------+
1 row in set (0.01 sec)
It happens because c#
is also tokenized to just c
and then the c
from the search query matches the c
from the document and you get it.
What’s the solution? There are a few options. The first one which probably comes to mind is:
OK, why don’t I put + and # to the word characters list?
It’s a good and fair question. Let’s try.
mysql> drop table if exists t;
mysql> create table t(f text) charset_table='non_cjk,+,#';
mysql> call keywords('Is c++ the most powerful language?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | is | is |
| 2 | c++ | c++ |
| 3 | the | the |
| 4 | most | most |
| 5 | powerful | powerful |
| 6 | language | language |
+------+-----------+------------+
6 rows in set (0.00 sec)
It works, but +
in the list immediately starts affecting other words and searches, for example:
mysql> drop table if exists t;
mysql> create table t(f text) charset_table='non_cjk,+,#';
mysql> call keywords('What is 2+2?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | what | what |
| 2 | is | is |
| 3 | 2+2 | 2+2 |
+------+-----------+------------+
3 rows in set (0.00 sec)
You wanted c++
to be a separate word, but not 2+2
, didn’t you?
Right, so what can we do?
To treat c++
special way you can make it an exception.
Exceptions
So, exceptions
(also known as synonyms) allow to map one or more terms (including terms with characters that would normally be excluded) to a single keyword.
Let’s make c++
an exception by putting it into an exceptions file:
➜ ~ cat /tmp/exceptions
c++ => c++
and using the file when we create the table:
mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is c++ the most powerful language? What is 2+2?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | is | is |
| 2 | c++ | c++ |
| 3 | the | the |
| 4 | most | most |
| 5 | powerful | powerful |
| 6 | language | language |
| 7 | what | what |
| 8 | is | is |
| 9 | 2 | 2 |
| 10 | 2 | 2 |
+------+-----------+------------+
10 rows in set (0.01 sec)
Hooray, c++
is now a separate word and the plus signs are not lost, and all is ok with 2+2 too.
What you need to remember about the exceptions is that exceptions are very dumb, not smart at all, they do exactly what you ask them to do and nothing more. In particular:
- they don’t change the case
- if you make a mistake and put double space they don’t convert it into a single space
and so on. They literally consider your input as an array of bytes.
For example, people write c++
both in lower and upper case. Let’s try the above exception with the upper case?
mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is C++ the most powerful language? How about c++?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | is | is |
| 2 | c | c |
| 3 | the | the |
| 4 | most | most |
| 5 | powerful | powerful |
| 6 | language | language |
| 7 | what | what |
| 8 | is | is |
| 9 | 2 | 2 |
| 10 | 2 | 2 |
+------+-----------+------------+
10 rows in set (0.00 sec)
Oops, C++
was tokenized as just c
, because the exception is c++
(lower case), not C++
(upper case).
But did you notice the exception constitutes a pair of items, not a single one: c++ => c++
. The left part is what triggers the exceptions algorithm in the text, the right part is a resulting token. Let’s try to add mapping of C++
to c++
?
➜ ~ cat /tmp/exceptions
c++ => c++
C++ => c++
mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is C++ the most powerful language? How about c++?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | is | is |
| 2 | c++ | c++ |
| 3 | the | the |
| 4 | most | most |
| 5 | powerful | powerful |
| 6 | language | language |
| 7 | how | how |
| 8 | about | about |
| 9 | c++ | c++ |
+------+-----------+------------+
9 rows in set (0.00 sec)
Alright, now it’s fine again, since both C++
and c++
are tokenized into token c++
. So satisfying.
What are the other good examples of the exceptions:
-
AT&T => AT&T
andat&t => AT&T
. -
M&M's => M&M's
andm&m's => M&M's
andM&m's => M&M's
-
U.S.A. => USA
andUS => USA
What are the bad examples?
-
us => USA
, because we don’t want eachus
becomeUSA
.
So the rule of thumb with the exceptions is:
Tip:
If a term includes special characters and that’s how it’s normally written in text and in a search query - make it an exception.
Synonyms
Manticore Search users also often call exceptions
synonyms, because another use case for them is not to just retain special character and letter case, but to map terms written absolutely differently to the same token, for example:
MS Windows => ms windows
Microsoft Windows => ms windows
Why is this important? Because it enables to easily find documents with Microsoft Windows
by MS Windows
and vice versa.
Example:
mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> insert into t values(0, 'Microsoft Windows is one of the first operating systems');
mysql> select * from t where match('MS Windows');
+---------------------+---------------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------------+
| 1514841286668976139 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)
So at first glance it works fine, but thinking further about it and recalling the exceptions are case and byte sensitive you can ask yourself: “Can’t people write MicroSoft windows
, MS WINDOWS
, microsoft Windows
and so on?”
Yes, they can. So if you want to use the exceptions for that be ready for what’s called in mathematics a combinatorial explosion.
It looks no good at all. What can we do about it?
Wordforms
Another tool which is similar to the exceptions is wordforms
. Unlike the exceptions, the word forms are applied after tokenizing incoming text. So they are:
- case insenstitive (unless your
charset_table
enables case sensitivity) - don’t care about special characters
They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form. For example, to normalize all the variants such as “walks”, “walked”, “walking” to the normal form “walk”:
➜ ~ cat /tmp/wordforms
walks => walk
walked => walk
walking => walk
mysql> drop table if exists t;
mysql> create table t(f text) wordforms='/tmp/wordforms';
mysql> call keywords('walks _WaLkeD! walking', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | walks | walk |
| 2 | walked | walk |
| 3 | walking | walk |
+------+-----------+------------+
3 rows in set (0.00 sec)
As you can see all the 3 words were converted to just walk
and, note, the 2nd word _WaLkeD
! even being very deformed was also normalized fine. Do you see where I’m going with this? Yes, the MS Windows
example. Let’s test if the wordforms can be useful to solve that issue.
Let’s put just 2 lines to the wordforms file:
➜ ~ cat /tmp/wordforms
ms windows => ms windows
microsoft windows => ms windows
and populate the table with a few documents:
mysql> drop table if exists t;
mysql> create table t(f text) wordforms='/tmp/wordforms';
mysql> insert into t values(0, 'Microsoft Windows is one of the first operating systems'), (0, 'porch windows'),(0, 'Windows are rolled down');
mysql> select * from t;
+---------------------+---------------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
| 1514841286668976167 | porch windows |
| 1514841286668976168 | Windows are rolled down |
+---------------------+---------------------------------------------------------+
3 rows in set (0.00 sec)
Let’s now try different queries:
mysql> select * from t where match('MS Windows');
+---------------------+---------------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)
✅MS Windows
finds Microsoft Windows
fine.
mysql> select * from t where match('ms WINDOWS');
+---------------------+---------------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.01 sec)
✅ ms WINDOWS
works fine too.
mysql> select * from t where match('mIcRoSoFt WiNdOwS');
+---------------------+---------------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)
✅ And even mIcRoSoFt WiNdOwS
finds the same document.
mysql> select * from t where match('windows');
+---------------------+---------------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
| 1514841286668976167 | porch windows |
| 1514841286668976168 | Windows are rolled down |
+---------------------+---------------------------------------------------------+
3 rows in set (0.00 sec)
✅ Just basic windows
finds all the documents.
So indeed, wordforms
helps to solve the issue.
The rule of thumb with the wordforms is:
Tip:
Use wordforms for words and phrases that can be written in different forms and don’t contain special characters.
Floor & Decor
Let’s take a look at another example: we want to improve search for the brand name Floor & Decor
. We can assume people can write this name in the following forms:
Floor & Decor
Floor & decor
floor & decor
Floor and Decor
floor and decor
and other letter capitalization combinations.
Also:
Floor & Decor Holdings
Floor & Decor Holdings, inc.
and, again, various combinations with different letter capitalized.
Now that we know how exceptions
and wordforms
work what do we do to cover this brand name?
First of all we can easily notice that the canonical brand name is Floor & Decor
, i.e. it includes a special character which is normally considered a word separator, so should we use exceptions
? But the name is long and can be written in many ways. If we use exceptions
we can end up with a huge list of all the combinations. Moreover there are extended forms Floor & Decor Holdings
and Floor & Decor Holdings, inc.
which can make the list even longer.
The most optimal solution in this case is to just use wordforms
like this:
➜ ~ cat /tmp/wordforms
floor & decor => fnd
floor and decor => fnd
floor & decor holdings => fnd
floor and decor holdings => fnd
floor & decor holdings inc => fnd
floor and decor holdings inc => fnd
Why does it include &
? Actually you can skip it:
floor decor => fnd
floor and decor => fnd
floor decor holdings => fnd
floor and decor holdings => fnd
floor decor holdings inc => fnd
floor and decor holdings inc => fnd
because wordforms
anyway ignores non-word characters, but just for the sake of ease of reading it was left.
As a result you’ll get each combination tokenized as fnd
which will be our shortkey for this brand name.
mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms';
mysql> call keywords('Floor & Decor', 't')
+------+-------------+------------+
| qpos | tokenized | normalized |
+------+-------------+------------+
| 1 | floor decor | fnd |
+------+-------------+------------+
1 row in set (0.00 sec)
mysql> call keywords('floor and Decor', 't')
+------+-----------------+------------+
| qpos | tokenized | normalized |
+------+-----------------+------------+
| 1 | floor and decor | fnd |
+------+-----------------+------------+
1 row in set (0.00 sec)
mysql> call keywords('Floor & Decor holdings', 't')
+------+----------------------+------------+
| qpos | tokenized | normalized |
+------+----------------------+------------+
| 1 | floor decor holdings | fnd |
+------+----------------------+------------+
1 row in set (0.00 sec)
mysql> call keywords('Floor & Decor HOLDINGS INC.', 't')
+------+--------------------------+------------+
| qpos | tokenized | normalized |
+------+--------------------------+------------+
| 1 | floor decor holdings inc | fnd |
+------+--------------------------+------------+
1 row in set (0.00 sec)
Is this the perfect ultimate solution? Unfortunately not as many other things in the area of full-text search. There are always rare cases and in this case too. For example:
mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms';
mysql> insert into t values(0,'It\'s located on the 2nd floor. Decor is also nice');
mysql> select * from t where match('Floor & Decor Holdings');
+---------------------+---------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------+
| 1514841286668976231 | It's located on the 2nd floor. Decor is also nice |
+---------------------+---------------------------------------------------+
1 row in set (0.00 sec)
We can see here that Floor & Decor
Holdings finds the document which has floor
in the end of the first sentence and the following one starts with Decor
. This happens because floor
. Decor
also gets tokenized to fnd
since we use just wordforms
that are insensitive to letter case and special characters:
mysql> call keywords('floor. Decor', 't');
+------+-------------+------------+
| qpos | tokenized | normalized |
+------+-------------+------------+
| 1 | floor decor | fnd |
+------+-------------+------------+
1 row in set (0.00 sec)
The false match is not good. To solve this particular problem we can use Manticore’s functionality to detect sentences and paragraphs.
Now if we enable it we can see that the document is not a match for the keyword any more:
mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms' index_sp='1';
mysql> insert into t values(0,'It\'s located on the 2nd floor. Decor is also nice');
mysql> select * from t where match('Floor & Decor Holdings');
Empty set (0.00 sec)
because:
-
Floor & Decor
, as we remember is converted intofnd
bywordforms
-
index_sp='1'
splits text into sentences - after splitting
floor
. andDecor
end up in different sentences - and do not match
fnd
and therefore all the original forms of it anymore
Conclusion
Manticore’s exceptions and wordforms are powerful tools that can help you fine-tune your search, in particular improve recall and precision when it comes to short terms with special characters that should be retained and longer terms that should be aliased one to another. But you need to help Manticore do it, since it can’t decide what the names should be for you.
Thank you for reading this article!
References:
Documentation about exceptions
- https://manual.manticoresearch.com/Creating_an_index/NLP_and_tokenization/Exceptions
Documentation about wordforms
- https://manual.manticoresearch.com/Creating_an_index/NLP_and_tokenization/Wordforms
Top comments (0)