MySQL 5.1 Database Reference Guide :: 10.9.1 Unicode Character Sets

Previous / Next / Up / Table of Contents

10.9.1. Unicode Character Sets

MySQL Essentials
eBook

$8.99

eBookFrenzy.com

MySQL has two Unicode character sets. You can store text in about 650 languages using these character sets.

ucs2 (UCS-2 Unicode) collations:
- ucs2_bin
- ucs2_czech_ci
- ucs2_danish_ci
- ucs2_esperanto_ci
- ucs2_estonian_ci
- ucs2_general_ci (default)
- ucs2_hungarian_ci
- ucs2_icelandic_ci
- ucs2_latvian_ci
- ucs2_lithuanian_ci
- ucs2_persian_ci
- ucs2_polish_ci
- ucs2_roman_ci
- ucs2_romanian_ci
- ucs2_slovak_ci
- ucs2_slovenian_ci
- ucs2_spanish2_ci
- ucs2_spanish_ci
- ucs2_swedish_ci
- ucs2_turkish_ci
- ucs2_unicode_ci
utf8 (UTF-8 Unicode) collations:
- utf8_bin
- utf8_czech_ci
- utf8_danish_ci
- utf8_esperanto_ci
- utf8_estonian_ci
- utf8_general_ci (default)
- utf8_hungarian_ci
- utf8_icelandic_ci
- utf8_latvian_ci
- utf8_lithuanian_ci
- utf8_persian_ci
- utf8_polish_ci
- utf8_roman_ci
- utf8_romanian_ci
- utf8_slovak_ci
- utf8_slovenian_ci
- utf8_spanish2_ci
- utf8_spanish_ci
- utf8_swedish_ci
- utf8_turkish_ci
- utf8_unicode_ci

The ucs2_hungarian_ci and utf8_hungarian_ci collations were added in MySQL 5.1.5.

MySQL implements the utf8_unicode_ci collation according to the Unicode Collation Algorithm (UCA) described at https://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: https://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. The following discussion uses utf8_unicode_ci, but it is also true for ucs2_unicode_ci.

Currently, the utf8_unicode_ci collation has only partial support for the Unicode Collation Algorithm. Some characters are not supported yet. Also, combining marks are not fully supported. This affects primarily Vietnamese and some minority languages in Russia such as Udmurt, Tatar, Bashkir, and Mari.

The most significant feature in utf8_unicode_ci is that it supports expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages ‘ß’ is equal to ‘ss’.

utf8_general_ci is a legacy collation that does not support expansions. It can make only one-to-one comparisons between characters. This means that comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci.

For example, the following equalities hold in both utf8_general_ci and utf8_unicode_ci:

Ä = A
Ö = O
Ü = U

A difference between the collations is that this is true for utf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci:

ß = ss

MySQL implements language-specific collations for the utf8 character set only if the ordering with utf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German and French, so there is no need to create special utf8 collations for these two languages.

utf8_general_ci also is satisfactory for both German and French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.

utf8_swedish_ci, like other utf8 language-specific collations, is derived from utf8_unicode_ci with additional language rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:

Ü = Y < Ö

The utf8_spanish_ci and utf8_spanish2_ci collations correspond to modern Spanish and traditional Spanish, respectively. In both collations, ‘ñ’ (n-tilde) is a separate letter between ‘n’ and ‘o’. In addition, for traditional Spanish, ‘ch’ is a separate letter between ‘c’ and d, and ‘ll’ is a separate letter between ‘l’ and ‘m’

Previous / Next / Up / Table of Contents