MySQL 5.1 supports two character sets for storing
Unicode data:
- ucs2, the UCS-2 Unicode character set.
- utf8, the UTF-8 encoding of the Unicode character set.
In UCS-2 (binary Unicode representation), every character is
represented by a two-byte Unicode code with the most significant
byte first. For example: LATIN CAPITAL LETTER A has the code 0x0041
and it is stored as a two-byte sequence: 0x00 0x41. CYRILLIC SMALL
LETTER YERU (Unicode 0x044B) is stored as a two-byte sequence:
0x04 0x4B. For Unicode characters and their codes, please refer to
the Unicode Home Page.
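The two-byte, most-significant-byte-first layout described above can be reproduced outside the server. The following is a minimal Python sketch, not MySQL code; the helper name ucs2_bytes is hypothetical and exists only for this illustration.

```python
def ucs2_bytes(ch: str) -> bytes:
    """Return the two-byte, big-endian UCS-2 form of one character.

    Hypothetical helper for illustration; it is not part of MySQL.
    """
    code = ord(ch)
    if code > 0xFFFF:
        # UCS-2 has exactly two bytes per character, so larger codes do not fit.
        raise ValueError("character does not fit in two bytes")
    return code.to_bytes(2, "big")  # most significant byte first


for ch in ("A", "\u044b"):  # LATIN CAPITAL LETTER A, CYRILLIC SMALL LETTER YERU
    stored = " ".join(f"0x{b:02X}" for b in ucs2_bytes(ch))
    print(f"0x{ord(ch):04X} is stored as {stored}")
# prints:
# 0x0041 is stored as 0x00 0x41
# 0x044B is stored as 0x04 0x4B
```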
Currently, UCS-2 cannot be used as a client character set, which
means that SET NAMES 'ucs2' does not work.
The UTF-8 character set (transform Unicode representation) is an
alternative way to store Unicode data. It is implemented according
to RFC 3629. The idea of the UTF-8 character set is that various
Unicode characters are encoded using byte sequences of different
lengths:
- Basic Latin letters, digits, and punctuation signs use one
  byte.
- Most European and Middle East script letters fit into a
  two-byte sequence: extended Latin letters (with tilde, macron,
  acute, grave and other accents), Cyrillic, Greek, Armenian,
  Hebrew, Arabic, Syriac, and others.
- Korean, Chinese, and Japanese ideographs use three-byte
  sequences.
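The sequence lengths listed above can be checked with any standard UTF-8 encoder. This small Python sketch uses Python's built-in codec rather than MySQL itself; the helper name utf8_length and the sample characters are chosen here purely for illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes in the UTF-8 encoding of a single character."""
    return len(ch.encode("utf-8"))


# Sample characters picked for this sketch, one or two per class in the list.
samples = [
    ("Basic Latin", "A"),
    ("Latin with tilde", "\u00f1"),
    ("Cyrillic", "\u044b"),
    ("Hebrew", "\u05d0"),
    ("Chinese ideograph", "\u4e2d"),
    ("Korean syllable", "\ud55c"),
]

for label, ch in samples:
    print(f"{label}: {utf8_length(ch)} byte(s)")
# prints 1 for Basic Latin, 2 for the next three, and 3 for the last two
```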
RFC 3629 describes encoding sequences that take from one to four
bytes. Currently, MySQL support for UTF-8 does not include
four-byte sequences. (An older standard for UTF-8 encoding is
given by RFC 2279, which describes UTF-8 sequences that take from
one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this
reason, sequences with five and six bytes are no longer used.)
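Because four-byte sequences are not supported, it can be useful to screen text on the client side before storing it in a utf8 column. The following Python sketch is one possible approach, not a MySQL facility; the function name unsupported_chars and the constant MAX_UTF8_BYTES are hypothetical names introduced for this example.

```python
MAX_UTF8_BYTES = 3  # longest UTF-8 sequence covered by utf8, per the text above


def unsupported_chars(text: str) -> list:
    """Return the characters in text whose UTF-8 form exceeds MAX_UTF8_BYTES."""
    return [ch for ch in text if len(ch.encode("utf-8")) > MAX_UTF8_BYTES]


# U+1D11E (MUSICAL SYMBOL G CLEF) needs four bytes in UTF-8.
text = "A\u044b\u4e2d\U0001D11E"
bad = unsupported_chars(text)
print([f"U+{ord(ch):04X}" for ch in bad])  # prints ['U+1D11E']
```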
Tip: To save space with UTF-8, use VARCHAR instead of CHAR.
Otherwise, MySQL must reserve three bytes for each character in a
CHAR CHARACTER SET utf8 column because that is the maximum possible
length. For example, MySQL must reserve 30 bytes for a
CHAR(10) CHARACTER SET utf8 column.
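The arithmetic behind this tip can be sketched as follows. This is a rough model in Python, not a description of any particular storage engine: it assumes that a VARCHAR value occupies its encoded bytes plus a length prefix of one byte when the column can hold at most 255 bytes and two bytes otherwise. The function names are hypothetical.

```python
BYTES_PER_CHAR = 3  # maximum bytes per character in utf8, per the text above


def char_reserved_bytes(n: int) -> int:
    """Bytes reserved for a CHAR(n) CHARACTER SET utf8 column."""
    return BYTES_PER_CHAR * n


def varchar_used_bytes(value: str, n: int) -> int:
    """Approximate bytes used by value in a VARCHAR(n) CHARACTER SET utf8 column.

    Assumption of this sketch: encoded data plus a 1- or 2-byte length prefix.
    """
    if len(value) > n:
        raise ValueError("value is longer than the declared column length")
    prefix = 1 if BYTES_PER_CHAR * n <= 255 else 2
    return len(value.encode("utf-8")) + prefix


print(char_reserved_bytes(10))          # prints 30
print(varchar_used_bytes("hello", 10))  # prints 6
```

Under these assumptions, the five-letter Latin value takes 6 bytes in the VARCHAR(10) column, compared with the 30 bytes reserved for CHAR(10).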