10.9.1. Unicode Character Sets
MySQL has two Unicode character sets, ucs2 and utf8. You can
store text in about 650 languages using these character sets.
The ucs2_hungarian_ci and utf8_hungarian_ci collations were
added in MySQL 5.1.5.
MySQL implements the utf8_unicode_ci
collation according to the Unicode Collation Algorithm (UCA)
described at
https://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
https://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
The following discussion uses utf8_unicode_ci, but it is also
true for ucs2_unicode_ci.
Currently, the utf8_unicode_ci
collation has
only partial support for the Unicode Collation Algorithm. Some
characters are not supported yet. Also, combining marks are not
fully supported. This affects primarily Vietnamese and some
minority languages in Russia such as Udmurt, Tatar, Bashkir, and
Mari.
The most significant feature in utf8_unicode_ci is that it
supports expansions; that is, cases in which one character
compares as equal to a combination of other characters. For
example, in German and some other languages ‘ß’ is equal to
‘ss’.
utf8_general_ci is a legacy collation that does not support
expansions. It can make only one-to-one comparisons between
characters. This means that comparisons for the utf8_general_ci
collation are faster, but slightly less correct, than
comparisons for utf8_unicode_ci.
For example, the following equalities hold in both
utf8_general_ci and utf8_unicode_ci:
Ä = A
Ö = O
Ü = U
A difference between the collations is that this is true for
utf8_general_ci:
ß = s
Whereas this is true for utf8_unicode_ci:
ß = ss
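The equalities above can be checked with comparisons that name the collation explicitly. The following is a minimal sketch, assuming a client connection that uses the utf8 character set and a server version in which these collation names exist. The column aliases are illustrative only.

```sql
-- Assumed setup: make the connection treat string literals as utf8.
SET NAMES utf8;

SELECT
  'Ä' = 'A'  COLLATE utf8_general_ci AS a_general,   -- accent ignored
  'Ä' = 'A'  COLLATE utf8_unicode_ci AS a_unicode,   -- accent ignored
  'ß' = 's'  COLLATE utf8_general_ci AS sz_general,  -- one-to-one comparison
  'ß' = 'ss' COLLATE utf8_unicode_ci AS sz_unicode;  -- expansion
```

According to the rules described above, each of the four comparisons should return 1.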
MySQL implements language-specific collations for the utf8
character set only if the ordering with utf8_unicode_ci does
not work well for a language. For example, utf8_unicode_ci
works fine for German and French, so there is no need to create
special utf8 collations for these two languages.
utf8_general_ci is also satisfactory for both German and
French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If
this is acceptable for your application, use utf8_general_ci
because it is faster. Otherwise, use utf8_unicode_ci because it
is more accurate.
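The choice between the two collations does not have to be global. As a hedged sketch, a column can default to the faster collation while an individual query asks for the more accurate one. The table name customer and column name last_name are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table: the column defaults to the faster collation.
CREATE TABLE customer (
  last_name VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci
);

-- Override the collation for one query where ordering accuracy
-- matters more than speed.
SELECT last_name
FROM customer
ORDER BY last_name COLLATE utf8_unicode_ci;
```

Reversing the two collation names gives the opposite trade-off: accurate comparisons by default, with the faster collation requested per query.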
utf8_swedish_ci, like other utf8 language-specific collations,
is derived from utf8_unicode_ci with additional language rules.
For example, in Swedish, the following relationship holds,
which is not something expected by a German or French speaker:
Ü = Y < Ö
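The Swedish relationship above can be compared against the generic Unicode collation in a single statement. This is a sketch under the same assumptions as before (a utf8 client connection and a server that provides these collations). The column aliases are illustrative.

```sql
-- Assumed setup: make the connection treat string literals as utf8.
SET NAMES utf8;

SELECT
  'Ü' = 'Y' COLLATE utf8_swedish_ci AS u_equals_y_swedish,  -- Swedish rule
  'Y' < 'Ö' COLLATE utf8_swedish_ci AS y_before_o_swedish,  -- Swedish rule
  'Ü' = 'Y' COLLATE utf8_unicode_ci AS u_equals_y_unicode;  -- generic rule
```

Following the rules in this section, the first two columns should be 1. The third should be 0, because in utf8_unicode_ci Ü compares as equal to U rather than Y.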
The utf8_spanish_ci and utf8_spanish2_ci collations correspond
to modern Spanish and traditional Spanish, respectively. In
both collations, ‘ñ’ (n-tilde) is a separate letter between ‘n’
and ‘o’. In addition, for traditional Spanish, ‘ch’ is a
separate letter between ‘c’ and ‘d’, and ‘ll’ is a separate
letter between ‘l’ and ‘m’.
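The difference between the two Spanish collations shows up most clearly when sorting. The sketch below uses a hypothetical temporary table named words with a few sample values, and assumes a utf8 client connection.

```sql
-- Hypothetical sample data for illustration.
CREATE TEMPORARY TABLE words (w VARCHAR(20)) CHARACTER SET utf8;
INSERT INTO words VALUES
  ('cuna'), ('chico'), ('dama'), ('llama'), ('luz'), ('mano');

-- Modern Spanish: 'ch' and 'll' are ordinary letter pairs.
SELECT w FROM words ORDER BY w COLLATE utf8_spanish_ci;

-- Traditional Spanish: 'ch' sorts after all other 'c' words,
-- and 'll' sorts after all other 'l' words.
SELECT w FROM words ORDER BY w COLLATE utf8_spanish2_ci;
```

By the rules described above, the first query should list chico before cuna and llama before luz, while the second should list cuna before chico and luz before llama.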