Full-Text Search with InnoDB
MySQL’s latest InnoDB engine can now do extensive, high-performance, full text search. A quick primer delivers all the goodies.
Oracle recently provided access to many new MySQL 5.6 features through http://labs.mysql.com for the user community to test and comment on. One notable feature is the InnoDB Full-Text Search (FTS) engine. It lets users build
FULLTEXT indexes on InnoDB tables to represent text-based content and speed up searches for words and phrases. Combining full-text search with InnoDB tables opens up text capability to transactional applications, where the textual data is frequently inserted, updated, and deleted. Given the importance of this feature to the MySQL audience, this article explains the design of InnoDB FTS and provides recipes for its use, as well as a short list of the limitations of this release.
The Design of Full Text Search
Like most database-powered search features, InnoDB Full-Text Search is designed as an “inverted index,” in which the incoming text strings are tokenized into individual words, and these words are stored in one or more auxiliary tables. For each word, a list of Document IDs and word position pairs is stored. We call such a Document ID/Position pair list an “ilist.” Thus, in the index table, two important columns are
ilist. InnoDB FTS supports Proximity Search, which has been lacking in the MyISAM FTS feature, by storing the position (as a byte offset) for each word.
InnoDB FTS represents each search index as six tables. The words are divided among these tables based on their first characters. Currently, the partition criteria are hard coded, targeted to the Latin character set.
The partitioned design helps to parallelize the operation. In the current release, InnoDB only parallelizes the
CREATE INDEX operation, but it can parallelize other operations such as query and full-text search later.
Let’s move to an example to illustrate the auxiliary tables with an InnoDB FTS index.
Starting in MySQL 5.6, you can examine InnoDB metadata (the InnoDB system tables) through corresponding tables in the
INFORMATION_SCHEMA database. Let’s see how these auxiliary tables are created alongside the FTS index. These tables are specific to InnoDB, so you will not find them in MySQL’s metadata, and you cannot query them through normal SQL interfaces.
Here is the definition of the
quotes table used for our example:
create table quotes ( id int unsigned auto_increment primary key , author varchar(64) , quote varchar(400) , source varchar(64) ) engine=innodb; -- Create the fulltext index create fulltext index idx on quotes(quote); -- Now check tables in the InnoDB SYS_TABLES: mysql> SELECT table_id, name FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES;
19 rows in set (0.01 sec)
This output shows that there are multiple auxiliary tables with names prefixed with
FTS_ and postfixed with
INDEX_*. They are all inverted index tables associated with table
test/quotes through its
table_id column. The table
test/quotes has a
table_id value of 547, whose hex representation is 0x223, and “223” is embedded in the names of all the auxiliary tables associated with the
quotes table. Another hex value in the auxiliary table’s name is the
FULLTEXT index ID. For example, in the aforementioned example, for the table
test/FTS_0000000000000223_0000000000000257_INDEX_3, the hex value 0x223 (547) means it is an auxiliary index table for table
quotes, and its index ID is 0x257 (599):
The index information can be retrieved through
SELECT index_id, table_id, name FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES WHERE table_id=547;
3 rows in set (0.01 sec)
idx is the FTS index we just created. There is also one index
FTS_DOC_ID_INDEX on the table. This is the index on the
FTS_DOC_ID column (which we will discuss later), and it is created along with the
FULLTEXT index if it does not exist already.
Batching the Insert Value with the InnoDB FTS Index Cache
FULLTEXT index is represented on disk as a set of auxiliary tables, each consisting of columns
ilist. InnoDB has an additional in-memory structure, the FTS Index Cache, which batches the inserted values before they are synchronized to the auxiliary index tables on disk. It acts like the InnoDB change buffer to save I/O during intensive DML operations.
The FTS Index Cache represents the same
word values as the
FULLTEXT index, using a red-black tree data structure. By caching the words from the inserted documents, we can avoid reading from disk while doing the initial filtering phase of a search, then merge (via union) the in-memory word list with the document IDs and positional data from the FTS index tables when producing the result set.
The buffering by the FTS Index Cache avoids frequent updates to the underlying FTS index tables during busy inserts and updates; the
FULLTEXT index only needs to be synchronized when the index cache is full. The batching technique also minimizes the number of entries for each word in the
FULLTEXT index. Instead of flushing each word with a single
ilist entry (with a single Document ID), it combines results from many inserted documents, creating an
ilist with multiple Document ID/Position pairs, and storing this information to disk as a single entry. This technique reduces redundant data, making the
FULLTEXT index smaller.
The index cache size is configurable by the option
innodb_ft_cache_size, with a default value of 32 MB. This option will be discussed in greater detail with the specifics of DML operations later.
InnoDB FTS Document ID
Another important concept is Document ID management. As with most FTS engines, the mapping from word to record goes through a unique ID. In our case, it is represented by the
FTS_DOC_ID (uppercase required) column. If this column is not defined, InnoDB automatically adds it to the user table when creating the
FULLTEXT index. The column itself must be of
BIGINT UNSIGNED NOT NULL type, with a unique index named
FTS_DOC_ID_INDEX. When you define this column during table creation, it saves considerable time in creating the
FULLTEXT index after loading data, since InnoDB does not need to reorganize the table (as happens when a new column is added). In such a case, you must manage the Document ID column to avoid empty or duplicate values.
Using the InnoDB FTS
New features often involve new syntax or changes to the existing syntax. However, this InnoDB feature is compatible with the existing MySQL syntax for full-text search using MyISAM tables. If you are familiar with the existing MyISAM FTS feature, you can quickly start using the equivalent InnoDB full-text search. However, there are some recipes and tips. (After these, we discuss a few limitations of this release.)
Create an FTS Index on an InnoDB Table
For creating a full-text index, we recommend loading the table with data first, then creating the
FULLTEXT index. This is much faster than creating the table with an empty
FULLTEXT index, then inserting documents into it.
use test; drop table if exists quotes; -- The InnoDB full-text search feature in the 5.6 Labs release -- lets us define a full-text index on an InnoDB table. create table quotes ( id int unsigned auto_increment primary key , author varchar(64) , quote varchar(400) , source varchar(64) ) engine=innodb; -- Insert some words and phrases to search for into the table. insert into quotes (author, quote, source) values ('Abraham Lincoln', 'Fourscore and seven years ago...', 'Gettysburg Address') , ('George Harrison', 'All those years ago...', 'Live In Japan') , ('Arthur C. Clarke', 'Then 10 years ago the monolith was discovered.', '2010: The Year We Make Contact') , ('Benjamin Franklin', 'Early to bed and early to rise, makes a man healthy, wealthy, and wise.', 'Poor Richard''s Almanack') , ('James Thurber', 'Early to rise and early to bed makes a male healthy and wealthy and dead.', 'The New Yorker') , ('K', '1500 hundred years ago, everybody knew that the Earth was the center of the universe.', 'Men in Black'); -- Create the fulltext index create fulltext index idx on quotes(quote); -- You can also create fulltext index at create table time -- create table quotes -- ( id int unsigned auto_increment primary key -- , author varchar(64) -- , quote varchar(400) -- , source varchar(64) -- , fulltext (quote) -- ) engine=innodb;
InnoDB can create the index in parallel through InnoDB’s Fast Index Creation (FIC) feature, which supports parallel sorting. This technique actually allows you to tokenize, sort, and create an InnoDB
FULLTEXT index in parallel.
To control the parallel sort degree, you can use a new configuration variable
innodb_ft_sort_pll_degree (default 2, maximum 32). This variable specifies how many ways to parallelize the tokenization and sort operations. Experiments show, on a non-I/O bound system, the create index performance scales well with the number of CPUs and
The following table shows a quick run on 2.7 GB of Wikipedia data on an 8-core Linux x86 machine:
|MyISAM||11 min 47.90|
|InnoDB (default)||7 min 25.21 sec|
|InnoDB pll_degree||5 min 34.98|
|InnoDB pll_degree = 8||4 min 9 sec|
|InnoDB pll_degree = 16||3 min 39.51 sec|
Each InnoDB table with a
FULLTEXT index includes a column
FTS_DOC_ID. Specifying this column in the original table definition saves a rebuild of the entire table when an InnoDB
FULLTEXT index is created. To do so, add the column
FTS_DOC_ID (all uppercase) to the table with the FTS index. The column must be of
BIGINT UNSIGNED NOT NULL datatype. It does not need to be an auto-increment column, but
auto_increment could make the loading easier:
CREATE TABLE fts_test ( FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL, title VARCHAR(200), body TEXT) ENGINE=InnoDB; -- The unique "FTS_DOC_ID_INDEX" index on FTS_DOC_ID is also optional, -- without it, InnoDB will automatically create it.: CREATE UNIQUE INDEX FTS_DOC_ID_INDEX on fts_test(FTS_DOC_ID);
Then load the data and:
CREATE FULLTEXT INDEX idx on fts_test (title, body);
Please note, currently this special column must be named
FTS_DOC_ID, and the name must be all uppercase. If you leave out this column, InnoDB adds
FTS_DOC_ID as a hidden column automatically when the
FULLTEXT index is created — and that is an expensive operation for an InnoDB table, so it is much more efficient to define the column yourself.
Query, Natural Language Search
Once the data is loaded and committed, you can run queries using the
AGAINST (search expression)operator to conduct actual searches. You can combine this operator with the
WHERE clause and similar clauses in the
-- Search for a single word. select author as "Monolith" from quotes where match(quote) against ('monolith' in natural language mode);
|Arthur C. Clarke|
1 row in set (0.01 sec)
-- The default minimum length is 3 rather than 4, and the search -- returns words that appear in a high proportion of the table rows. select author as "Ago" from quotes where match(quote) against ('ago' in natural language mode);
Arthur C. Clarke
4 rows in set (0.00 sec)
Query, Boolean Search
For more complicated searches, you can have multiple words and phrases, and search for different combinations of optional and required terms, not necessarily in the same order. This technique typically involves several data values that you query from elsewhere, or splitting apart a user-entered string and applying your own rules to the words and phrases inside.
-- Search for a combination of words, not in the same order as the original. select author as "Ago and Years" from quotes where match(quote) against ('+ago +years' in boolean mode);
|Ago and Years|
Arthur C. Clarke
4 rows in set (0.00 sec)
Proximity search is a new feature in InnoDB full-text search. It is a special case of Boolean search using the
@ operator within the
AGAINST() string. You supply two or more words, double-quoted, within the single-quoted
AGAINST() string, followed by
@distance to specify how far apart these words can be. The distance represents the maximum number of bytes (which might not be equal to the number of characters) between the starting points of all these words.
-- The starting points for these words are too far apart -- (not within 20 bytes), so no results. select quote as "Too Far Apart" from quotes where match(quote) against ('"early wise" @20' in boolean mode); Empty set (0.00 sec) -- But the starting points of all words are within 100 bytes, -- so this query does give results. select quote as "Early...Wise" from quotes where match(quote) against ('"early wise" @100' in boolean mode);
This results in:
|Early to bed and early to rise, makes a man healthy, wealthy, and wise.|
1 row in set (0.00 sec)
One of the key ideas behind bringing full-text search to InnoDB tables is to make this feature compatible with transactions. You can design tables with both full-text columns and other columns. Multiple sessions can update the full-text column data (and other columns in the table) simultaneously. The full-text data doesn’t have to be treated as read-only or read-mostly.
One thing to note is that, just like the search feature in most transactional databases (such as Oracle Text for the Oracle Database), the tokenization for inserted strings is performed only at commit time, so a full-text search does not see the uncommitted data.
create table quotes_uncommitted ( author varchar(64) , quote varchar(4000) , source varchar(64) , fulltext(quote) , primary key (author, quote(128)) ); -- Insert but don't immediately commit. insert into quotes_uncommitted select author, quote, source from quotes; -- Within the same transaction, a full-text search does not see the uncommitted data. select count(author), author as "Uncommitted Results" from quotes_uncommitted where match(quote) against ('ago' in natural language mode);
Which gives us:
1 row in set (0.00 sec)
-- OK, let's start with some committed data in the table, then empty the table, -- then try some FTS queries, both before and after the commit. insert into quotes_uncommitted select author, quote, source from quotes; commit; delete from quotes_uncommitted; select count(author), author as "Deleted but still not committed" from quotes_uncommitted where match(quote) against ('ago' in natural language mode);
|count(author)||Deleted but still not committed|
1 row in set (0.00 sec)
rollback; select count(author), author as "Deleted and rolled back" from quotes_uncommitted where match(quote) against ('ago' in natural language mode);
Which results in:
|count(author)||Deleted and rolled back|
1 row in set (0.00 sec)
delete from quotes_uncommitted; commit; select count(author), author as "Deleted and committed" from quotes_uncommitted where match(quote) against ('ago' in natural language mode);
|count(author)||Deleted and committed|
1 row in set (0.00 sec)
insert into quotes_uncommitted select author, quote, source from quotes; commit; truncate table quotes_uncommitted; select count(author), author as "Truncated" from quotes_uncommitted where match(quote) against ('ago' in natural language mode);
With the result of:
1 row in set (0.00 sec)
DML (Insert, Update, and Delete)
As mentioned in the last section, the inserting string processing (tokenization) is performed only at commit time. This behavior is consistent with FTS in most transactional database systems, since the DMLs on tables with full-text search indexes are costly operations. Typically, you synchronize the results of DML operations to the FTS index periodically rather than in real time.
A typical example is the Oracle Text component of Oracle Database, which synchronizes table data with the search index periodically or manually. Starting in Oracle Database 11g, it now allows users to specify when data updates are reflected in the full-text index — manually, on commit, or at a regular interval.
For InnoDB Full-Text Search, the FTS index is updated at transaction commit time, when all the document tokenization and inserting into the FTS index (or associated cache) happens. All other DML operations still follow their usual rules about commit/rollback and isolation levels.
- Insert. The inserted document is tokenized at commit time and inserted into the FTS Index Cache. This cache has a configurable size (
innodb_ft_cache_size) with a default of 32 MB. Once this cache is full, it is synchronized to the on-disk tables that represent the index. During normal server shutdown, any content in the FTS Index Cache is synchronized to the
FULLTEXTindex on disk. If there is a server crash and the content of FTS Index Cache was not saved to disk, after the server reboot, the data is made consistent when you first query or insert into the
FULLTEXTindex. Any documents missing from the
FULLTEXTindex are read from the original table, retokenized, and added to the FTS Index Cache.
- Delete. When you delete rows from a table containing an InnoDB
FULLTEXTindex, InnoDB does not delete the corresponding word entries in the FTS auxiliary tables. At commit time, InnoDB records the deleted Document IDs in another auxiliary table named
DELETED. For each query, InnoDB consults this
DELETEDtable to filter out any deleted documents. This design makes the delete operation simple and fast, without the need to update thousands of word entries for each deleted document. Since the word entries are not removed from the
FULLTEXTindex automatically, you need to perform index optimization periodically (as described in the next section) to keep the index size reasonable.
- Update. For updates affecting any FTS-indexed columns, the updates are performed as a combination of Delete + Insert operations. In-place update happens only if the update affects none of the columns referenced by the
As just discussed, if there is substantial update and delete activity on an InnoDB
FULLTEXT index, the index could become bloated, as InnoDB logs the deleted Document IDs in the
DELETED auxiliary table. Over time, the
FULLTEXT index could grow bigger despite rows being removed from the original table. To resolve this, you can optimize the index. This operation does two things: it removes the deleted Document ID from the word’s Document ID/Position pair list (
ilist), and it consolidates multiple entries for the same word to one entry if possible by consolidating their Document ID/Position pair list (
Currently, MySQL runs the optimization operation only as part of the
OPTIMIZE TABLE command, when the
innodb_optimize_fulltext_only configuration variable is enabled:
mysql> set global innodb_optimize_fulltext_only=1; mysql> optimize table articles;
As the table could be large, and optimization could take a significant amount of time, you typically should do the optimization in stages. The configuration variable
innodb_ft_num_word_optimize specifies how many words to optimize for each
OPTIMIZE TABLE command. It defaults to 2000 words. When the next
OPTIMIZE TABLE command is issued, the server continues the process from where it left off.
Stopwords are common or trivial words, such as “the” and “an,” which are omitted from the
FULLTEXT index. They are potentially different for each language or application. InnoDB FTS provides two sources of stopwords:
- MySQL server predefined stopwords. If no user stopword list is defined, this default list is used. You can view this list by querying the table
mysql> select * from INNODB_FT_DEFAULT_STOPWORD;
- User-defined stopwords.You can define your own stopwords by creating a table with a single column named
value, with datatype
varchar, and pointing the global variable
innodb_ft_server_stopword_tableto this table. MySQL loads stopwords from this user table, rather than the default stopword list, when creating the FTS index. For example:
# Define a correctly formatted user stopword table create table user_stopword(value varchar(30)) engine = innodb; # Point to this stopword table with "db name/table name" set global innodb_ft_server_stopword_table = "test/user_stopword";
Earlier in this article, we demonstrated ways to make creating an InnoDB
FULLTEXT index faster. Preliminary results on a 2.7 GB Wikipedia data set are shown in the following table. The queries used include various combinations of include, exclude, wildcard, and proximity operators, even an intentional misspelling (“preassumtions”). They are identified by the number in the leftmost column:
1) select count(*) from wp where match(title,text) against ('Creationist +Abrahamic -creationism' in boolean mode); 2) select count(*) from wp where match(title,text) against ('Abrahamic'); 3) select count(*) from wp where match(title,text) against ('preassumtions +orangutan' in boolean mode); 4) select count(*) from wp where match(title,text) against ('orangutan +falsified ~naturalistically' in boolean mode); 5) select count(*) from wp where match(title,text) against ('+india* -leader +gandh*' in boolean mode); 6) select count(*) from wp where match(title,text) against ('"american culture"@09′ in boolean mode);
|Query||InnoDB (ms)||* InnoDB FTS Search Processing Time||MyISAM (ms)|
|6 (New InnoDB Feature)||5250||N/A|
* FTS Search processing time is the actual time spent by InnoDB full-text index scan.
As this version of FTS is fairly new, there are some aspects that have limitations that Oracle will remove in succedding versions. These include:
- Ranking: Currently, the FTS uses a very simple ranking mechanism (term frequency, inverse document frequency) for document result ranking. The more the word appears in a document, and the less frequent the word appears in overall documents, the higher the selected document is ranked.
- Stemming, which matches alternative forms such as singular, plural, past tense, and other forms derived from the same root word is not supported.
- Ideographic languages, Chinese, Japanese and Korean (referred to as CJK), which do not have word delimiters, do not yet have support for N-GRAM parsing. (MyISAM FTS has a similar limitation.)
- Single character set: Although the use of multiple character sets within a single table is supported, all columns in a
FULLTEXTindex must use the same character set and collation. (MyISAM FTS has a similar limitation.)
However, the InnoDB full-text search gives InnoDB an important capability in using MySQL to handle text documents in transaction-intensive database applications. Subsequent releases will continue to increase its performance, usability, and feature set.