12.7. Full-Text Search Functions
MATCH (col1
,col2
,...) AGAINST (expr
[search_modifier
])
search_modifier:
{
IN BOOLEAN MODE
| IN NATURAL LANGUAGE MODE
| IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
| WITH QUERY EXPANSION
}
MySQL has support for full-text indexing and searching:
A full-text index in MySQL is an index of type
FULLTEXT
.
Full-text indexes can be used only with
MyISAM
tables, and can be created only for
CHAR
, VARCHAR
, or
TEXT
columns.
A FULLTEXT
index definition can be given in
the CREATE TABLE
statement when a table is
created, or added later using ALTER TABLE
or CREATE INDEX
.
For large datasets, it is much faster to load your data into a
table that has no FULLTEXT
index and then
create the index after that, than to load data into a table
that has an existing FULLTEXT
index.
Full-text searching is performed using MATCH() ...
AGAINST
syntax. MATCH()
takes a
comma-separated list that names the columns to be searched.
AGAINST
takes a string to search for, and an
optional modifier that indicates what type of search to perform.
The search string must be a literal string, not a variable or a
column name. There are three types of full-text searches:
A boolean search interprets the search string using the rules
of a special query language. The string contains the words to
search for. It can also contain operators that specify
requirements such that a word must be present or absent in
matching rows, or that it should be weighted higher or lower
than usual. Common words such as “some” or
“then” are stopwords and do not match if present
in the search string. The IN BOOLEAN MODE
modifier specifies a boolean search. For more information, see
Section 12.7.1, “Boolean Full-Text Searches”.
A natural language search interprets the search string as a
phrase in natural human language (a phrase in free text).
There are no special operators. The stopword list applies. In
addition, words that are present in more than 50% of the rows
are considered common and do not match. Full-text searches are
natural language searches if the IN NATURAL LANGUAGE
MODE
modifier is given or if no modifier is given.
A query expansion search is a modification of a natural
language search. The search string is used to perform a
natural language search. Then words from the most relevant
rows returned by the search are added to the search string and
the search is done again. The query returns the rows from the
second search. The IN NATURAL LANGUAGE MODE WITH
QUERY EXPANSION
or WITH QUERY
EXPANSION
modifier specifies a query expansion
search. For more information, see
Section 12.7.2, “Full-Text Searches with Query Expansion”.
The IN NATURAL LANGUAGE MODE
and IN
NATURAL LANGUAGE MODE WITH QUERY EXPANSION
modifiers
were added in MySQL 5.1.7.
Constraints on full-text searching are listed in
Section 12.7.4, “Full-Text Restrictions”.
mysql> CREATE TABLE articles (
-> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
-> title VARCHAR(200),
-> body TEXT,
-> FULLTEXT (title,body)
-> );
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO articles (title,body) VALUES
-> ('MySQL Tutorial','DBMS stands for DataBase ...'),
-> ('How To Use MySQL Well','After you went through a ...'),
-> ('Optimizing MySQL','In this tutorial we will show ...'),
-> ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
-> ('MySQL vs. YourSQL','In the following database comparison ...'),
-> ('MySQL Security','When configured properly, MySQL ...');
Query OK, 6 rows affected (0.00 sec)
Records: 6 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM articles
-> WHERE MATCH (title,body)
-> AGAINST ('database' IN NATURAL LANGUAGE MODE);
+----+-------------------+------------------------------------------+
| id | title | body |
+----+-------------------+------------------------------------------+
| 5 | MySQL vs. YourSQL | In the following database comparison ... |
| 1 | MySQL Tutorial | DBMS stands for DataBase ... |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)
The MATCH()
function performs a natural
language search for a string against a text
collection. A collection is a set of one or more
columns included in a FULLTEXT
index. The
search string is given as the argument to
AGAINST()
. For each row in the table,
MATCH()
returns a relevance value; that is, a
similarity measure between the search string and the text in that
row in the columns named in the MATCH()
list.
By default, the search is performed in case-insensitive fashion.
However, you can perform a case-sensitive full-text search by
using a binary collation for the indexed columns. For example, a
column that uses the latin1
character set of
can be assigned a collation of latin1_bin
to
make it case sensitive for full-text searches.
When MATCH()
is used in a
WHERE
clause, as in the example shown earlier,
the rows returned are automatically sorted with the highest
relevance first. Relevance values are non-negative floating-point
numbers. Zero relevance means no similarity. Relevance is computed
based on the number of words in the row, the number of unique
words in that row, the total number of words in the collection,
and the number of documents (rows) that contain a particular word.
For natural-language full-text searches, it is a requirement that
the columns named in the MATCH()
function be
the same columns included in some FULLTEXT
index in your table. For the preceding query, note that the
columns named in the MATCH()
function
(title
and body
) are the
same as those named in the definition of the
article
table's FULLTEXT
index. If you wanted to search the title
or
body
separately, you would need to create
separate FULLTEXT
indexes for each column.
It is also possible to perform a boolean search or a search with
query expansion. These search types are described in
Section 12.7.1, “Boolean Full-Text Searches”, and
Section 12.7.2, “Full-Text Searches with Query Expansion”.
The preceding example is a basic illustration that shows how to
use the MATCH()
function where rows are
returned in order of decreasing relevance. The next example shows
how to retrieve the relevance values explicitly. Returned rows are
not ordered because the SELECT
statement
includes neither WHERE
nor ORDER
BY
clauses:
mysql> SELECT id, MATCH (title,body)
-> AGAINST ('Tutorial' IN NATURAL LANGUAGE MODE) AS score
-> FROM articles;
+----+------------------+
| id | score |
+----+------------------+
| 1 | 0.65545833110809 |
| 2 | 0 |
| 3 | 0.66266459226608 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
+----+------------------+
6 rows in set (0.00 sec)
The following example is more complex. The query returns the
relevance values and it also sorts the rows in order of decreasing
relevance. To achieve this result, you should specify
MATCH()
twice: once in the
SELECT
list and once in the
WHERE
clause. This causes no additional
overhead, because the MySQL optimizer notices that the two
MATCH()
calls are identical and invokes the
full-text search code only once.
mysql> SELECT id, body, MATCH (title,body) AGAINST
-> ('Security implications of running MySQL as root'
-> IN NATURAL LANGUAGE MODE) AS score
-> FROM articles WHERE MATCH (title,body) AGAINST
-> ('Security implications of running MySQL as root'
-> IN NATURAL LANGUAGE MODE);
+----+-------------------------------------+-----------------+
| id | body | score |
+----+-------------------------------------+-----------------+
| 4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
| 6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)
The MySQL FULLTEXT
implementation regards any
sequence of true word characters (letters, digits, and
underscores) as a word. That sequence may also contain apostrophes
('
), but not more than one in a row. This means
that aaa'bbb
is regarded as one word, but
aaa''bbb
is regarded as two words. Apostrophes
at the beginning or the end of a word are stripped by the
FULLTEXT
parser; 'aaa'bbb'
would be parsed as aaa'bbb
.
The FULLTEXT
parser determines where words
start and end by looking for certain delimiter characters; for
example, ‘
’ (space),
‘,
’ (comma), and
‘.
’ (period). If words are not
separated by delimiters (as in, for example, Chinese), the
FULLTEXT
parser cannot determine where a word
begins or ends. To be able to add words or other indexed terms in
such languages to a FULLTEXT
index, you must
preprocess them so that they are separated by some arbitrary
delimiter such as ‘"
’.
In MySQL 5.1, it is possible to write a plugin that
replaces the built-in full-text parser. For details, see
Section 27.2, “The MySQL Plugin Interface”. For example parser plugin source
code, see the plugin/fulltext
directory of a
MySQL source distribution.
Some words are ignored in full-text searches:
Any word that is too short is ignored. The default minimum
length of words that are found by full-text searches is four
characters.
Words in the stopword list are ignored. A stopword is a word
such as “the” or “some” that is so
common that it is considered to have zero semantic value.
There is a built-in stopword list, but it can be overwritten
by a user-defined list.
The default stopword list is given in
Section 12.7.3, “Full-Text Stopwords”. The default minimum word
length and stopword list can be changed as described in
Section 12.7.5, “Fine-Tuning MySQL Full-Text Search”.
Every correct word in the collection and in the query is weighted
according to its significance in the collection or query.
Consequently, a word that is present in many documents has a lower
weight (and may even have a zero weight), because it has lower
semantic value in this particular collection. Conversely, if the
word is rare, it receives a higher weight. The weights of the
words are combined to compute the relevance of the row.
Such a technique works best with large collections (in fact, it
was carefully tuned this way). For very small tables, word
distribution does not adequately reflect their semantic value, and
this model may sometimes produce bizarre results. For example,
although the word “MySQL” is present in every row of
the articles
table shown earlier, a search for
the word produces no results:
mysql> SELECT * FROM articles
-> WHERE MATCH (title,body)
-> AGAINST ('MySQL' IN NATURAL LANGUAGE MODE);
Empty set (0.00 sec)
The search result is empty because the word “MySQL”
is present in at least 50% of the rows. As such, it is effectively
treated as a stopword. For large datasets, this is the most
desirable behavior: A natural language query should not return
every second row from a 1GB table. For small datasets, it may be
less desirable.
A word that matches half of the rows in a table is less likely to
locate relevant documents. In fact, it most likely finds plenty of
irrelevant documents. We all know this happens far too often when
we are trying to find something on the Internet with a search
engine. It is with this reasoning that rows containing the word
are assigned a low semantic value for the particular
dataset in which they occur. A given word may exceed
the 50% threshold in one dataset but not another.
The 50% threshold has a significant implication when you first try
full-text searching to see how it works: If you create a table and
insert only one or two rows of text into it, every word in the
text occurs in at least 50% of the rows. As a result, no search
returns any results. Be sure to insert at least three rows, and
preferably many more. Users who need to bypass the 50% limitation
can use the boolean search mode; see
Section 12.7.1, “Boolean Full-Text Searches”.