MySQL – Full-Text Search with InnoDB

Full-Text Search with InnoDB.

Full-Text Search with InnoDB

MySQL’s latest InnoDB engine can now do extensive, high-performance, full text search. A quick primer delivers all the goodies.

Oracle recently provided access to many new MySQL 5.6 features through http://labs.mysql.com for the user community to test and comment on. One notable feature is the InnoDB Full-Text Search (FTS) engine. It lets users build FULLTEXT indexes on InnoDB tables to represent text-based content and speed up searches for words and phrases. Combining full-text search with InnoDB tables opens up text capability to transactional applications, where the textual data is frequently inserted, updated, and deleted. Given the importance of this feature to the MySQL audience, this article explains the design of InnoDB FTS and provides recipes for its use, as well as a short list of the limitations of this release.

The Design of Full Text Search

Like most database-powered search features, InnoDB Full-Text Search is designed as an “inverted index,” in which the incoming text strings are tokenized into individual words, and these words are stored in one or more auxiliary tables. For each word, a list of Document IDs and word position pairs is stored. We call such a Document ID/Position pair list an “ilist.” Thus, in the index table, two important columns are word and ilist. InnoDB FTS supports Proximity Search, which has been lacking in the MyISAM FTS feature, by storing the position (as a byte offset) for each word.

InnoDB FTS represents each search index as six tables. The words are divided among these tables based on their first characters. Currently, the partition criteria are hard coded, targeted to the Latin character set.

The partitioned design helps to parallelize the operation. In the current release, InnoDB only parallelizes the CREATE INDEX operation, but it can parallelize other operations such as query and full-text search later.

Let’s move to an example to illustrate the auxiliary tables with an InnoDB FTS index.

Starting in MySQL 5.6, you can examine InnoDB metadata (the InnoDB system tables) through corresponding tables in the INFORMATION_SCHEMA database. Let’s see how these auxiliary tables are created alongside the FTS index. These tables are specific to InnoDB, so you will not find them in MySQL’s metadata, and you cannot query them through normal SQL interfaces.

Here is the definition of the quotes table used for our example:

  create table quotes
  (    id int unsigned auto_increment primary key
    , author varchar(64)    
    , quote varchar(400)
    , source varchar(64)
  ) engine=innodb;

-- Create the fulltext index
create fulltext index idx on quotes(quote);

-- Now check tables in the InnoDB SYS_TABLES:

mysql> SELECT table_id, name FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES;

This generates:

table_id name
548 test/FTS_0000000000000223_0000000000000257_INDEX_1
549 test/FTS_0000000000000223_0000000000000257_INDEX_2
550 test/FTS_0000000000000223_0000000000000257_INDEX_3
547 test/quotes

19 rows in set (0.01 sec)

This output shows that there are multiple auxiliary tables with names prefixed with FTS_ and postfixed with INDEX_*. They are all inverted index tables associated with table test/quotes through its table_id column. The table test/quotes has a table_id value of 547, whose hex representation is 0x223, and “223” is embedded in the names of all the auxiliary tables associated with the quotes table. Another hex value in the auxiliary table’s name is the FULLTEXT index ID. For example, in the aforementioned example, for the table test/FTS_0000000000000223_0000000000000257_INDEX_3, the hex value 0x223 (547) means it is an auxiliary index table for table quotes, and its index ID is 0x257 (599):

The index information can be retrieved through INFORMATION_SCHEMA.INNODB_SYS_INDEXES, using SELECT index_id, table_id, name FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES WHERE table_id=547;

generating:

index_id table_id name
598 547 PRIMARY
599 547 idx
600 547 FTS_DOC_ID_INDEX

3 rows in set (0.01 sec)

Index idx is the FTS index we just created. There is also one index FTS_DOC_ID_INDEX on the table. This is the index on the FTS_DOC_ID column (which we will discuss later), and it is created along with the FULLTEXT index if it does not exist already.

Batching the Insert Value with the InnoDB FTS Index Cache

The InnoDB FULLTEXT index is represented on disk as a set of auxiliary tables, each consisting of columns word and ilist. InnoDB has an additional in-memory structure, the FTS Index Cache, which batches the inserted values before they are synchronized to the auxiliary index tables on disk. It acts like the InnoDB change buffer to save I/O during intensive DML operations.

The FTS Index Cache represents the same word values as the FULLTEXT index, using a red-black tree data structure. By caching the words from the inserted documents, we can avoid reading from disk while doing the initial filtering phase of a search, then merge (via union) the in-memory word list with the document IDs and positional data from the FTS index tables when producing the result set.

The buffering by the FTS Index Cache avoids frequent updates to the underlying FTS index tables during busy inserts and updates; the FULLTEXT index only needs to be synchronized when the index cache is full. The batching technique also minimizes the number of entries for each word in the FULLTEXT index. Instead of flushing each word with a single ilist entry (with a single Document ID), it combines results from many inserted documents, creating an ilist with multiple Document ID/Position pairs, and storing this information to disk as a single entry. This technique reduces redundant data, making the FULLTEXT index smaller.

The index cache size is configurable by the option innodb_ft_cache_size, with a default value of 32 MB. This option will be discussed in greater detail with the specifics of DML operations later.

InnoDB FTS Document ID

Another important concept is Document ID management. As with most FTS engines, the mapping from word to record goes through a unique ID. In our case, it is represented by the FTS_DOC_ID (uppercase required) column. If this column is not defined, InnoDB automatically adds it to the user table when creating the FULLTEXT index. The column itself must be of BIGINT UNSIGNED NOT NULL type, with a unique index named FTS_DOC_ID_INDEX. When you define this column during table creation, it saves considerable time in creating the FULLTEXT index after loading data, since InnoDB does not need to reorganize the table (as happens when a new column is added). In such a case, you must manage the Document ID column to avoid empty or duplicate values.

Using the InnoDB FTS

New features often involve new syntax or changes to the existing syntax. However, this InnoDB feature is compatible with the existing MySQL syntax for full-text search using MyISAM tables. If you are familiar with the existing MyISAM FTS feature, you can quickly start using the equivalent InnoDB full-text search. However, there are some recipes and tips. (After these, we discuss a few limitations of this release.)

Create an FTS Index on an InnoDB Table

For creating a full-text index, we recommend loading the table with data first, then creating the FULLTEXT index. This is much faster than creating the table with an empty FULLTEXT index, then inserting documents into it.

use test;

drop table if exists quotes;
-- The InnoDB full-text search feature in the 5.6 Labs release
-- lets us define a full-text index on an InnoDB table.

create table quotes
  (    id int unsigned auto_increment primary key
    , author varchar(64)    
    , quote varchar(400)
    , source varchar(64)
  ) engine=innodb;

-- Insert some words and phrases to search for into the table.
insert into quotes (author, quote, source) values
  ('Abraham Lincoln', 'Fourscore and seven years ago...',
  'Gettysburg Address')
, ('George Harrison', 'All those years ago...', 'Live In Japan')
, ('Arthur C. Clarke', 'Then 10 years ago the monolith was discovered.',
  '2010: The Year We Make Contact')
, ('Benjamin Franklin',
  'Early to bed and early to rise, makes a man healthy, wealthy, and wise.',
  'Poor Richard''s Almanack')
, ('James Thurber',
  'Early to rise and early to bed makes a male healthy and wealthy and dead.',
  'The New Yorker')
, ('K', '1500 hundred years ago, everybody knew that the Earth was the center of the universe.', 'Men in Black');

-- Create the fulltext index
create fulltext index idx on quotes(quote);

-- You can also create fulltext index at create table time
-- create table quotes
--  (    id int unsigned auto_increment primary key
--    , author varchar(64)    
--    , quote varchar(400)
--    , source varchar(64)
--    , fulltext (quote)
--  ) engine=innodb;

InnoDB can create the index in parallel through InnoDB’s Fast Index Creation (FIC) feature, which supports parallel sorting. This technique actually allows you to tokenize, sort, and create an InnoDB FULLTEXT index in parallel.

To control the parallel sort degree, you can use a new configuration variable innodb_ft_sort_pll_degree (default 2, maximum 32). This variable specifies how many ways to parallelize the tokenization and sort operations. Experiments show, on a non-I/O bound system, the create index performance scales well with the number of CPUs and innodb_ft_sort_pll_degree.

The following table shows a quick run on 2.7 GB of Wikipedia data on an 8-core Linux x86 machine:

Server Time (min)
MyISAM 11 min 47.90
InnoDB (default) 7 min 25.21 sec
InnoDB pll_degree 5 min 34.98
InnoDB pll_degree = 8 4 min 9 sec
InnoDB pll_degree = 16 3 min 39.51 sec

Each InnoDB table with a FULLTEXT index includes a column FTS_DOC_ID. Specifying this column in the original table definition saves a rebuild of the entire table when an InnoDB FULLTEXT index is created. To do so, add the column FTS_DOC_ID (all uppercase) to the table with the FTS index. The column must be of BIGINT UNSIGNED NOT NULL datatype. It does not need to be an auto-increment column, but auto_increment could make the loading easier:

CREATE TABLE fts_test (
FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
title VARCHAR(200),
body TEXT) ENGINE=InnoDB;

-- The unique "FTS_DOC_ID_INDEX" index on FTS_DOC_ID is also optional,
-- without it, InnoDB will automatically create it.:
CREATE UNIQUE INDEX FTS_DOC_ID_INDEX on fts_test(FTS_DOC_ID);

Then load the data and:

CREATE FULLTEXT INDEX idx on fts_test (title, body);

Please note, currently this special column must be named FTS_DOC_ID, and the name must be all uppercase. If you leave out this column, InnoDB adds FTS_DOC_ID as a hidden column automatically when the FULLTEXT index is created — and that is an expensive operation for an InnoDB table, so it is much more efficient to define the column yourself.

Query, Natural Language Search

Once the data is loaded and committed, you can run queries using the MATCH(columns) AGAINST (search expression)operator to conduct actual searches. You can combine this operator with the WHERE clause and similar clauses in the SELECT statement.

-- Search for a single word.
select author as "Monolith" from quotes
  where match(quote) against ('monolith' in natural language mode);
Monolith
Arthur C. Clarke

1 row in set (0.01 sec)

-- The default minimum length is 3 rather than 4, and the search
-- returns words that appear in a high proportion of the table rows.
select author as "Ago" from quotes
  where match(quote) against ('ago' in natural language mode);
Ago
Abraham Lincoln
George Harrison
Arthur C. Clarke
K

4 rows in set (0.00 sec)

Query, Boolean Search

For more complicated searches, you can have multiple words and phrases, and search for different combinations of optional and required terms, not necessarily in the same order. This technique typically involves several data values that you query from elsewhere, or splitting apart a user-entered string and applying your own rules to the words and phrases inside.

-- Search for a combination of words, not in the same order as the original.
select author as "Ago and Years" from quotes
  where match(quote) against ('+ago +years' in boolean mode);

Resulting in:

Ago and Years
Abraham Lincoln
George Harrison
Arthur C. Clarke
K

4 rows in set (0.00 sec)

Proximity Search

Proximity search is a new feature in InnoDB full-text search. It is a special case of Boolean search using the @ operator within the AGAINST() string. You supply two or more words, double-quoted, within the single-quoted AGAINST() string, followed by @distance to specify how far apart these words can be. The distance represents the maximum number of bytes (which might not be equal to the number of characters) between the starting points of all these words.

-- The starting points for these words are too far apart
-- (not within 20 bytes), so no results.
select quote as "Too Far Apart" from quotes
  where match(quote) against ('"early wise" @20' in boolean mode);
Empty set (0.00 sec)

-- But the starting points of all words are within 100 bytes,
-- so this query does give results.
select quote as "Early...Wise" from quotes
  where match(quote) against ('"early wise" @100' in boolean mode);

This results in:

Early…Wise
Early to bed and early to rise, makes a man healthy, wealthy, and wise.

1 row in set (0.00 sec)

Transactions

One of the key ideas behind bringing full-text search to InnoDB tables is to make this feature compatible with transactions. You can design tables with both full-text columns and other columns. Multiple sessions can update the full-text column data (and other columns in the table) simultaneously. The full-text data doesn’t have to be treated as read-only or read-mostly.

One thing to note is that, just like the search feature in most transactional databases (such as Oracle Text for the Oracle Database), the tokenization for inserted strings is performed only at commit time, so a full-text search does not see the uncommitted data.

create table quotes_uncommitted
  (
      author varchar(64)
    , quote varchar(4000)
    , source varchar(64)
    , fulltext(quote)
    , primary key (author, quote(128))
  );

-- Insert but don't immediately commit.
insert into quotes_uncommitted select author, quote, source from quotes;

-- Within the same transaction, a full-text search does not see the uncommitted data.
select count(author), author as "Uncommitted Results" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);

Which gives us:

count(author) Uncommitted Results
0 NULL

1 row in set (0.00 sec)

-- OK, let's start with some committed data in the table, then empty the table,
-- then try some FTS queries, both before and after the commit.
insert into quotes_uncommitted select author, quote, source from quotes;
commit;
delete from quotes_uncommitted;
select count(author), author as "Deleted but still not committed" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);

Resulting in:

count(author) Deleted but still not committed
0 NULL

1 row in set (0.00 sec)

rollback;
select count(author), author as "Deleted and rolled back" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);

Which results in:

count(author) Deleted and rolled back
4 Abraham Lincoln

1 row in set (0.00 sec)

delete from quotes_uncommitted;
commit;
select count(author), author as "Deleted and committed" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
count(author) Deleted and committed
0 NULL

1 row in set (0.00 sec)

insert into quotes_uncommitted select author, quote, source from quotes;
commit;
truncate table quotes_uncommitted;
select count(author), author as "Truncated" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);

With the result of:

count(author) Truncated
0 NULL

1 row in set (0.00 sec)

DML (Insert, Update, and Delete)

As mentioned in the last section, the inserting string processing (tokenization) is performed only at commit time. This behavior is consistent with FTS in most transactional database systems, since the DMLs on tables with full-text search indexes are costly operations. Typically, you synchronize the results of DML operations to the FTS index periodically rather than in real time.

A typical example is the Oracle Text component of Oracle Database, which synchronizes table data with the search index periodically or manually. Starting in Oracle Database 11g, it now allows users to specify when data updates are reflected in the full-text index — manually, on commit, or at a regular interval.

For InnoDB Full-Text Search, the FTS index is updated at transaction commit time, when all the document tokenization and inserting into the FTS index (or associated cache) happens. All other DML operations still follow their usual rules about commit/rollback and isolation levels.

  • Insert. The inserted document is tokenized at commit time and inserted into the FTS Index Cache. This cache has a configurable size (innodb_ft_cache_size) with a default of 32 MB. Once this cache is full, it is synchronized to the on-disk tables that represent the index. During normal server shutdown, any content in the FTS Index Cache is synchronized to the FULLTEXT index on disk. If there is a server crash and the content of FTS Index Cache was not saved to disk, after the server reboot, the data is made consistent when you first query or insert into the FULLTEXT index. Any documents missing from the FULLTEXT index are read from the original table, retokenized, and added to the FTS Index Cache.
  • Delete. When you delete rows from a table containing an InnoDB FULLTEXT index, InnoDB does not delete the corresponding word entries in the FTS auxiliary tables. At commit time, InnoDB records the deleted Document IDs in another auxiliary table named DELETED. For each query, InnoDB consults this DELETED table to filter out any deleted documents. This design makes the delete operation simple and fast, without the need to update thousands of word entries for each deleted document. Since the word entries are not removed from the FULLTEXT index automatically, you need to perform index optimization periodically (as described in the next section) to keep the index size reasonable.
  • Update. For updates affecting any FTS-indexed columns, the updates are performed as a combination of Delete + Insert operations. In-place update happens only if the update affects none of the columns referenced by the FULLTEXT index.

Index Optimization

As just discussed, if there is substantial update and delete activity on an InnoDB FULLTEXT index, the index could become bloated, as InnoDB logs the deleted Document IDs in the DELETED auxiliary table. Over time, the FULLTEXT index could grow bigger despite rows being removed from the original table. To resolve this, you can optimize the index. This operation does two things: it removes the deleted Document ID from the word’s Document ID/Position pair list (ilist), and it consolidates multiple entries for the same word to one entry if possible by consolidating their Document ID/Position pair list (ilist).

Currently, MySQL runs the optimization operation only as part of the OPTIMIZE TABLE command, when the innodb_optimize_fulltext_only configuration variable is enabled:

	mysql> set global innodb_optimize_fulltext_only=1;
	mysql> optimize table articles;

As the table could be large, and optimization could take a significant amount of time, you typically should do the optimization in stages. The configuration variable innodb_ft_num_word_optimize specifies how many words to optimize for each OPTIMIZE TABLE command. It defaults to 2000 words. When the next OPTIMIZE TABLE command is issued, the server continues the process from where it left off.

Stopword Handling

Stopwords are common or trivial words, such as “the” and “an,” which are omitted from the FULLTEXT index. They are potentially different for each language or application. InnoDB FTS provides two sources of stopwords:

  • MySQL server predefined stopwords. If no user stopword list is defined, this default list is used. You can view this list by querying the table INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD:
mysql> select * from INNODB_FT_DEFAULT_STOPWORD;
  • User-defined stopwords.You can define your own stopwords by creating a table with a single column named value, with datatype varchar, and pointing the global variable innodb_ft_server_stopword_table to this table. MySQL loads stopwords from this user table, rather than the default stopword list, when creating the FTS index. For example:
	# Define a correctly formatted user stopword table 
	create table user_stopword(value varchar(30)) engine = innodb;
	# Point to this stopword table with "db name/table name" 
	set global innodb_ft_server_stopword_table = "test/user_stopword";

Query Performance

Earlier in this article, we demonstrated ways to make creating an InnoDB FULLTEXT index faster. Preliminary results on a 2.7 GB Wikipedia data set are shown in the following table. The queries used include various combinations of include, exclude, wildcard, and proximity operators, even an intentional misspelling (“preassumtions”). They are identified by the number in the leftmost column:


1) select count(*) from wp where match(title,text) against ('Creationist +Abrahamic -creationism' in boolean mode);
2) select count(*) from wp where match(title,text) against ('Abrahamic');
3) select count(*) from wp where match(title,text) against ('preassumtions +orangutan' in boolean mode);
4) select count(*) from wp where match(title,text) against ('orangutan +falsified ~naturalistically' in boolean mode);
5) select count(*) from wp where match(title,text) against ('+india* -leader +gandh*' in boolean mode);
6) select count(*) from wp where match(title,text) against ('"american culture"@09′ in boolean mode);
Query InnoDB (ms) * InnoDB FTS Search Processing Time MyISAM (ms)
1 10 2 10
2 130 1 10
3 1 1 10
4 2 2 10
5 320 116 1000
6 (New InnoDB Feature) 5250 N/A

* FTS Search processing time is the actual time spent by InnoDB full-text index scan.

Limitations

As this version of FTS is fairly new, there are some aspects that have limitations that Oracle will remove in succedding versions. These include:

  • Ranking: Currently, the FTS uses a very simple ranking mechanism (term frequency, inverse document frequency) for document result ranking. The more the word appears in a document, and the less frequent the word appears in overall documents, the higher the selected document is ranked.
  • Stemming, which matches alternative forms such as singular, plural, past tense, and other forms derived from the same root word is not supported.
  • Ideographic languages, Chinese, Japanese and Korean (referred to as CJK), which do not have word delimiters, do not yet have support for N-GRAM parsing. (MyISAM FTS has a similar limitation.)
  • Single character set: Although the use of multiple character sets within a single table is supported, all columns in a FULLTEXT index must use the same character set and collation. (MyISAM FTS has a similar limitation.)

However, the InnoDB full-text search gives InnoDB an important capability in using MySQL to handle text documents in transaction-intensive database applications. Subsequent releases will continue to increase its performance, usability, and feature set.

This entry was posted in MySQL, Uncategorized and tagged . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s