DEV Community

Paweł bbkr Pabian

Full Text Search in MySQL

Intro

Full text search means looking for a given phrase anywhere in a given text. This article will guide you through doing it efficiently in MySQL and save you from falling into many traps along the way.

Business case

Let's assume we have an address book of our customers, and the goal is to quickly find a person by a fragment of their name or email.

CREATE TABLE `address_book` (
    `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    `name` VARCHAR(128) NOT NULL,
    `email` VARCHAR(128) NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB CHARSET=utf8mb4;

We will populate the address book with 1_000_000 randomly generated people, each inserted in a separate query. The name will always be in tidy form: first name followed by last name. The email will be more chaotic, with varying first/last name order and presence, different separators, and some random numbers.
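As a side note, similarly chaotic test data can be produced with a few lines of Python. This generator is my own illustration, not the one used for the benchmark; the name pools are just taken from the sample rows below.

```python
import random

FIRST_NAMES = ["Reed", "Reilly", "Theodore", "Duncan", "Maranda"]
LAST_NAMES = ["Slavik", "Isaacson", "Klosinski", "Sinke", "Cabrara"]

def random_person():
    """Tidy name, chaotic email: random part order, separator and digits."""
    first = random.choice(FIRST_NAMES)
    last = random.choice(LAST_NAMES)
    digits = str(random.randint(0, 999))
    # pick 2 or 3 of the parts in random order, join with a random separator
    parts = random.sample([first.lower(), last.lower(), digits],
                          k=random.randint(2, 3))
    email = random.choice(["-", "."]).join(parts) + "@example.com"
    return f"{first} {last}", email

name, email = random_person()
```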

> SELECT `name`, `email` FROM `address_book` LIMIT 8;

+--------------------+---------------------------------+
| name               | email                           |
+--------------------+---------------------------------+
| Reed Slavik        | 664-slavik-reed@example.com     |
| Reilly Isaacson    | reilly972isaacson@example.com   |
| Theodore Klosinski | 942.klosinski@example.com       |
| Duncan Sinke       | 912.duncan@example.com          |
| Maranda Cabrara    | cabrara-809-maranda@example.com |
| Hugh Harrop        | hugh765@example.com             |
| Bernard Luetzow    | bernard887luetzow@example.com   |
| Niki Manesis       | niki-247@example.com            |
+--------------------+---------------------------------+

Tests will be performed on stock MySQL 8.0.32 Docker image with default configuration (except when said otherwise). Hardware is AMD 6800U, 32GB RAM, PCIe NVMe 4.0 x4 SSD. Operating system is vanilla Arch Linux with BTRFS and LUKS disk encryption.

No free lunches

There is no such thing as a free lunch. Indexes speed up SELECT but slow down INSERT/UPDATE/DELETE statements, because of the additional CPU cost to calculate them and the additional disk transfer and space needed to store them. I'll try to give a short summary of when to use each method and what the benefits and drawbacks are.

Without indexes

The most naive approach is to have no indexed columns and to use the LIKE '%john%' syntax.

Because there are no indexes to maintain, this method does not increase data loading time or storage space.

$ time cat address_book.sql | mysql

real    23m 31.43s
> SELECT data_length, index_length FROM information_schema.tables WHERE table_name = 'address_book';
+-------------+--------------+
| DATA_LENGTH | INDEX_LENGTH |
+-------------+--------------+
|    71942144 |            0 |
+-------------+--------------+

Performance is poor. When no index can be used, MySQL applies the Turbo Boyer-Moore algorithm to find matching rows.

> SELECT * FROM `address_book` WHERE `name` LIKE '%john%' AND `name` LIKE '%doe%';

+--------+----------------+-------------------------------+
| id     | name           | email                         |
+--------+----------------+-------------------------------+
| 222698 | Johnie Doemel  | doemel.36.johnie@example.com  |
| 316137 | Johnnie Doepel | johnnie-doepel-72@example.com |
+--------+----------------+-------------------------------+
2 rows in set (0.222 sec)

All rows need to be fetched from disk for analysis, as shown by the query EXPLAIN.

> EXPLAIN SELECT * FROM `address_book` WHERE `name` LIKE '%john%' AND `name` LIKE '%doe%'\G

           id: 1
  select_type: SIMPLE
        table: address_book
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 996458
     filtered: 1.23
        Extra: Using where

Use: when your application performs full text search rarely and you are willing to accept low query performance. It works really well on a small data set, and the simple implementation is a big bonus.

Avoid: when full text search is used frequently; you will burn a lot of database performance here, especially on a large data set. Also, because of the full row scan, it may block other queries in your application that need FOR UPDATE locks on the table.

Using B-tree indexes

Slapping an index on a field and calling it a day unfortunately will not work here. A B-tree index converts text into a tree of binary (true/false) tests, working from the start to the end of the searched phrase. For sample data:

1 John
2 Joseph
3 Joseph
4 Ann

It will look something like this.

                   <="a"?
                    /  \
                  yes   no
                  /       \
             <="nn"?     <="jo"?
               /          /
             yes        yes
             /          /
           [4]      <="h"?
                     /  \
                   yes   no
                   /      \
                <="n"?    <="seph"?
                 /          /
               yes        yes
               /          /
             [1]        [2,3]

If you are looking for Joseph, you test the first character. Because j > a, you follow the no path. Then you test the first two characters. Because jo = jo, you trim them from the phrase and follow the yes path. Then you test the next unmatched character, which is h... You continue this series of tests until you finally reach the list of rows containing the phrase you are looking for, in this case 2 and 3. This also shows that this type of index must work from the start to the end of the phrase, meaning the phrase cannot start with a wildcard.
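To make the limitation concrete, here is a small Python sketch (my own illustration, not MySQL internals) of index-style prefix search versus the scan a leading wildcard forces. A B-tree behaves conceptually like binary search over sorted keys, so a known prefix narrows the range, while an infix must touch every key.

```python
import bisect

names = sorted(["Ann", "John", "Joseph", "Joseph"])

def prefix_search(keys, prefix):
    """Binary-search the sorted keys for the range starting with prefix,
    like LIKE 'john%' using an index."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix + "\uffff")
    return keys[lo:hi]

def infix_search(keys, phrase):
    """A leading wildcard like LIKE '%john%' cannot narrow the range,
    so every key must be scanned."""
    return [key for key in keys if phrase in key]

prefix_search(names, "Jo")   # narrows via binary search
infix_search(names, "ose")   # must scan every row
```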

Let's add it to our table.

> ALTER TABLE `address_book` ADD KEY (`name`), ADD KEY (`email`);

As you can see, when the searched phrase starts with a wildcard the index will not be used.

> EXPLAIN SELECT * FROM `address_book` WHERE `name` LIKE '%john%' AND `name` LIKE '%doe%'\G

           id: 1
  select_type: SIMPLE
        table: address_book
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 996458
     filtered: 1.23
        Extra: Using where

If you know that the text has some specific structure (in our case the name goes first), we can leverage this knowledge and ask for the name without a leading wildcard.

> SELECT * FROM `address_book` WHERE `name` LIKE 'john%' AND `name` LIKE '%doe%';
+--------+----------------+-------------------------------+
| id     | name           | email                         |
+--------+----------------+-------------------------------+
| 222698 | Johnie Doemel  | doemel.36.johnie@example.com  |
| 316137 | Johnnie Doepel | johnnie-doepel-72@example.com |
+--------+----------------+-------------------------------+
2 rows in set (0.003 sec)

EXPLAIN shows that this time the index is used: all names starting with john are found within the index, and Boyer-Moore is only needed to filter this set against doe.

> EXPLAIN SELECT * FROM `address_book` WHERE `name` LIKE 'john%' AND `name` LIKE '%doe%'\G

           id: 1
  select_type: SIMPLE
        table: address_book
   partitions: NULL
         type: range
possible_keys: name
          key: name
      key_len: 514
          ref: NULL
         rows: 3602
     filtered: 100.00
        Extra: Using index condition

This approach quickly shows its limitations when it comes to email. It is too chaotic: it may start with the first name, with the last name, or even with something completely different. In this case the query time is the same as in the no-index case.

> SELECT * FROM `address_book` WHERE `email` LIKE '%john%' AND `email` LIKE '%doe%';

+--------+----------------+-------------------------------+
| id     | name           | email                         |
+--------+----------------+-------------------------------+
| 222698 | Johnie Doemel  | doemel.36.johnie@example.com  |
| 316137 | Johnnie Doepel | johnnie-doepel-72@example.com |
+--------+----------------+-------------------------------+
2 rows in set (0.314 sec)

Performance-wise it slows down data loading a bit and doubles the storage space, without being very useful.

$ time cat address_book.sql | mysql

real    24m 12.81s
> SELECT data_length, index_length FROM information_schema.tables WHERE table_name = 'address_book';
+-------------+--------------+
| DATA_LENGTH | INDEX_LENGTH |
+-------------+--------------+
|    71942144 |    112623616 |
+-------------+--------------+

Use: when you can split the text into well-defined columns with their own indexes, for example restructuring the table to store first_name and last_name separately. You must also be willing to sacrifice the leading wildcard.

Avoid: when the text is too unpredictable and unordered, for example emails or the names of various products in your store.

Note: right-to-left languages are no exception; the searched phrase cannot start with a wildcard no matter which direction the text is written in.

Introducing reverse indexes

First, let's explain what a reverse index is. A B-tree index was a series of tests on the searched phrase from start to end. A reverse index takes a different approach: it creates tokens from words. A token can be a whole word or an n-gram (a substring of given length taken from a word; for Johnie the 3-letter n-grams are: joh, ohn, hni, nie).

And that allows the index to be built in a slightly different way. For sample data:

1 Paul
2 Roland
3 Carol

An index of 3-letter n-gram tokens will look like:

pau => [p1r1] # that means this n-gram is at position 1 in row 1
aul => [p2r1]
rol => [p1r2,p3r3]
ola => [p2r2]
lan => [p3r2]
and => [p4r2]
car => [p1r3]
aro => [p2r3]

Now if we look for rol, we immediately know that this token is present in rows 2 and 3. If we search for a longer phrase like roland, the database may use this index twice: if rol is found at some position, then and must be found 3 characters later. Only row 2 matches this condition.
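The same mechanics can be modeled in a few lines of Python. This is a toy illustration of a positional n-gram index, not the InnoDB implementation; it builds exactly the posting lists shown above and intersects them with the position constraint.

```python
from collections import defaultdict

def trigrams(word):
    """All 3-letter n-grams of a word with their (0-based) positions."""
    return [(word[i:i + 3], i) for i in range(len(word) - 2)]

# build: token -> list of (row, position) postings
rows = {1: "paul", 2: "roland", 3: "carol"}
index = defaultdict(list)
for row, word in rows.items():
    for gram, pos in trigrams(word):
        index[gram].append((row, pos))

def search(phrase):
    """Intersect posting lists, requiring consistent relative positions."""
    first, *rest = trigrams(phrase)          # phrase must be >= 3 chars
    candidates = set(index[first[0]])
    for gram, offset in rest:
        postings = set(index[gram])
        candidates = {(row, pos) for row, pos in candidates
                      if (row, pos + offset) in postings}
    return sorted(row for row, _ in candidates)

search("rol")      # rows 2 and 3
search("roland")   # only row 2
```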

Using reverse indexes with default parser

Reverse indexes have their own syntax; let's add some to our table.

ALTER TABLE `address_book` ADD FULLTEXT (`name`), ADD FULLTEXT(`email`);

The default tokenizer uses word boundaries to find tokens, meaning one continuous word is one token.

To utilize a full text index, the MATCH () AGAINST () syntax must be used. The AGAINST section can work in NATURAL LANGUAGE MODE, where the searched text is also tokenized, or in the more useful BOOLEAN MODE, which contains its own powerful mini expression language. I won't dive too deeply into BOOLEAN MODE syntax; basically + means AND.

> SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+johnie +doemel' IN BOOLEAN MODE);

+--------+---------------+------------------------------+
| id     | name          | email                        |
+--------+---------------+------------------------------+
| 222698 | Johnie Doemel | doemel.36.johnie@example.com |
+--------+---------------+------------------------------+
1 row in set (0.001 sec)

> SELECT * FROM `address_book` WHERE MATCH (`email`) AGAINST ('+johnie +doemel' IN BOOLEAN MODE);

+--------+---------------+------------------------------+
| id     | name          | email                        |
+--------+---------------+------------------------------+
| 222698 | Johnie Doemel | doemel.36.johnie@example.com |
+--------+---------------+------------------------------+
1 row in set (0.001 sec)

Whoa, that is fast: over 200x faster than the no-index approach. We are not restricted to searching from the beginning of the phrase like with a B-tree index, so searching in email is fast too. Our index filters rows well according to EXPLAIN.

> EXPLAIN SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+johnie +doemel' IN BOOLEAN MODE)\G

           id: 1
  select_type: SIMPLE
        table: address_book
   partitions: NULL
         type: fulltext
possible_keys: name
          key: name
      key_len: 0
          ref: const
         rows: 1
     filtered: 100.00
        Extra: Using where; Ft_hints: no_ranking

Life is great. Or is it?

> SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+john +doe' IN BOOLEAN MODE);

Empty set (0.002 sec)

First trap! You cannot find a phrase shorter than the token length, and by default whole words are tokens. This is the balance between search speed and index build/storage cost. And we are already paying significant penalties in those areas.

$ time cat address_book.sql | mysql
real    29m 34.44s

# du -bc /var/lib/mysql/default/fts_*
492453888       total

That is 126% of the unindexed loading time, and the fulltext index alone takes 7 times more space than the data itself. Note that there is no easy way to check the fulltext index size from INFORMATION_SCHEMA; it has to be done on the MySQL server filesystem.

Use: when you want to search by whole words. Boolean mode expressions allow some cool tricks, like excluding words or ranking by relevance, which you may find useful. But you must be willing to accept higher write times and much higher storage costs.

Using reverse indexes with n-gram parser

This time each word will be split into n-grams. The default n-gram length is defined in a server configuration variable:

> show variables like 'ngram_token_size';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| ngram_token_size | 2     |
+------------------+-------+

The index creation syntax must define the tokenizer (called a "parser" here) explicitly.

ALTER TABLE `address_book` ADD FULLTEXT (`name`) WITH PARSER ngram, ADD FULLTEXT(`email`) WITH PARSER ngram;

This time rows are found as expected, even when whole words are not used in the search.

> SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+john +doe' IN BOOLEAN MODE);
+--------+----------------+-------------------------------+
| id     | name           | email                         |
+--------+----------------+-------------------------------+
| 222698 | Johnie Doemel  | doemel.36.johnie@example.com  |
| 316137 | Johnnie Doepel | johnnie-doepel-72@example.com |
+--------+----------------+-------------------------------+
2 rows in set (0.266 sec)

But what about this horrible performance? It is slower than no indexes at all! The answer lies in the n-gram size. If the searched phrase does not match the n-gram size, the database must query the index several times and merge the results, or do supplementary unindexed filtering. Let's restart our server with --ngram_token_size=3 and rebuild the table.

> SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+john +doe' IN BOOLEAN MODE);
+--------+----------------+-------------------------------+
| id     | name           | email                         |
+--------+----------------+-------------------------------+
| 222698 | Johnie Doemel  | doemel.36.johnie@example.com  |
| 316137 | Johnnie Doepel | johnnie-doepel-72@example.com |
+--------+----------------+-------------------------------+
2 rows in set (0.087 sec)

So in this case doe matched the token size and the index was used directly, but john had to be found indirectly through this index. This becomes even more obvious if we ask for COUNT.
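The query-time difference can be sketched in Python (an illustration, not MySQL internals): a phrase equal to the token size is a single index lookup, a longer one decomposes into several overlapping tokens whose posting lists must be fetched and merged, and a shorter one cannot use the index at all.

```python
def query_tokens(phrase, n=3):
    """Tokens the engine must look up for a phrase, with n-gram size n."""
    if len(phrase) < n:
        return []  # shorter than token size: index cannot be used
    return [phrase[i:i + n] for i in range(len(phrase) - n + 1)]

query_tokens("doe")    # one direct lookup
query_tokens("john")   # two lookups plus a position merge
query_tokens("jo")     # no usable tokens
```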

> SELECT COUNT(*) FROM `address_book` WHERE MATCH (`email`) AGAINST ('+john' IN BOOLEAN MODE);

+----------+
| COUNT(*) |
+----------+
|     3563 |
+----------+
1 row in set (0.064 sec)   # phrase longer than token

> SELECT COUNT(*) FROM `address_book` WHERE MATCH (`email`) AGAINST ('+doe' IN BOOLEAN MODE);

+----------+
| COUNT(*) |
+----------+
|      431 |
+----------+
1 row in set (0.003 sec)    # phrase equal to token

So we sacrificed the ability to search by 2 characters using the index, gained a lot of boost when searching by 3 characters, and gained a mediocre boost in the other cases.

Using this method is a pile of tradeoffs. No, you cannot have several indexes on the same field with different n-gram sizes to handle various search phrase lengths. It gets worse: the configuration variable is global, so you cannot even have two FULLTEXT indexes on different tables with different n-gram sizes. One config must fit all your needs server-wide.

How about write performance and storage penalty?

$ time cat address_book.sql | mysql
real    26m 31.05s

# du -bc /var/lib/mysql/default/fts_*
362430464       total

Unfortunately they are big; the index takes 5 times more space than the data.

Use: when you want to search by parts of words. Boolean mode expressions also work here. But first you must find the right server-wide balance of token length, and accept higher write times and much higher storage costs. Phrases of a length different than the token size will still be faster than the unindexed approach, but without the "wow" factor.

Avoid: when your text uses an ideographic language (like Chinese or Japanese) and requires single-character tokens. There is a separate MeCab tokenizer for Japanese, but that is beyond the scope of this article.

InnoDB reverse index performance degradation

Let's reuse the data from the previous chapter and delete all rows.

> DELETE FROM `address_book`;

> SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+john +doe' IN BOOLEAN MODE);

Empty set (0.233 sec)

So the time was 0.087s for a table with data, but now it is 0.233s for an empty table? That is because when a row is deleted from an InnoDB table, it is not deleted from the FULLTEXT index. Instead, a separate hidden table tracks removed rows, and a search in the outdated index must compare outdated results on 1_000_000 rows against the list of 1_000_000 deleted rows. This gets progressively worse. Let's add, remove, add, remove, and add our data again, so we are back at the original 1_000_000 rows in the table, the same amount we started with.
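This behavior can be modeled with a toy Python class, a deliberate simplification of InnoDB's hidden delete-tracking table written only to illustrate the mechanism: deletes only accumulate tombstones, and every search pays to filter its hits against them until the index is rebuilt.

```python
class ToyFulltext:
    """Simplified model: deletes accumulate, searches filter against them."""

    def __init__(self):
        self.postings = {}      # token -> set of row ids
        self.deleted = set()    # tombstones, purged only by an index rebuild

    def insert(self, row_id, text):
        for word in text.lower().split():
            self.postings.setdefault(word, set()).add(row_id)
        self.deleted.discard(row_id)    # re-adding a row revives it

    def delete(self, row_id):
        self.deleted.add(row_id)        # index entries stay behind

    def search(self, word):
        hits = self.postings.get(word.lower(), set())
        # extra work here grows with the number of tombstones
        return sorted(hits - self.deleted)

idx = ToyFulltext()
idx.insert(1, "Johnie Doemel")
idx.delete(1)
idx.search("johnie")   # empty, but the stale entry is still consulted
idx.insert(1, "Johnie Doemel")
idx.search("johnie")   # row 1 again
```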

> SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+john +doe' IN BOOLEAN MODE);

+--------+----------------+-------------------------------+
| id     | name           | email                         |
+--------+----------------+-------------------------------+
| 222698 | Johnie Doemel  | doemel.36.johnie@example.com  |
| 316137 | Johnnie Doepel | johnnie-doepel-72@example.com |
+--------+----------------+-------------------------------+
2 rows in set (7.038 sec)

That escalated quickly... Now it is time to enter very trippy territory. To rebuild an InnoDB FULLTEXT index and regain performance, you must alter the whole table. That requires extensive database user privileges and is very likely to cause application downtime. But fear not: there is a global innodb_optimize_fulltext_only=ON flag that globally(!) changes ALTER/OPTIMIZE (in InnoDB those are synonyms) to only purge old entries from the FULLTEXT index. You can configure how many tokens are purged per run by setting the innodb_ft_num_word_optimize variable; the maximum value is 10_000. And there is no feedback on whether you are done. I repeat once again: there is no feedback on whether you are done. You are supposed to run consecutive ALTERs, hoping that at some point your FULLTEXT index has no outdated entries.

That is garbage UI design.

The cure is worse than the disease. MyISAM purges the FULLTEXT index on the fly, so it does not degrade with data retention. You may convert your InnoDB table to MyISAM, losing all the InnoDB goodies. Or you can build a supplementary MyISAM table like address_book_fts, keep the FULLTEXT indexes there, and synchronize data from the InnoDB table using triggers. And when you think you are victorious, GTID consistency kicks in: if you are using GTID transaction identifiers in replication, you cannot update an InnoDB and a MyISAM table in the same transaction, meaning you must risk autocommit writes in your flow. Yuck.

Alternatives

I hope you now better understand MySQL's full text search capabilities. There are tradeoffs and there are traps. If you haven't found a solution matching your needs, I recommend:

  • Switching to PostgreSQL. I try to be objective, but in my experience full text search in MySQL is a weird, unfinished patchwork. The PostgreSQL solution is much better, and maybe I'll write a follow-up to this article using Postgres.
  • Staying in the MySQL ecosystem but using the Sphinx plugin instead of the built-in solutions.
  • Outsourcing full text search to ElasticSearch. That of course comes with a whole new set of challenges related to data synchronization, but ES capabilities are awesome.

Outro

Thanks for reading. If you have successfully implemented full text search in your application, please share the technology you used and the issues you encountered in a comment.
