9.7.3 Introduction to locales
A full locale description consists of three parts: xx_YY.ZZZZ.
-
xx: ISO 639 language codes (lower case)
-
YY: ISO 3166 country codes (upper case)
-
ZZZZ: codeset, i.e., character set or encoding
identifier.
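As a minimal sketch of the xx_YY.ZZZZ layout, the following Python function splits a locale name into its three parts. The name split_locale is hypothetical, and the sketch assumes that the language and country parts are always present; it does not handle modifiers such as "@euro", which this section does not cover.

```python
def split_locale(name):
    """Split 'xx_YY.ZZZZ' into (language, country, codeset).

    Hypothetical helper for illustration only; codeset is None when the
    '.ZZZZ' part is omitted.
    """
    base, dot, codeset = name.partition(".")
    language, underscore, country = base.partition("_")
    if not underscore:
        raise ValueError("expected xx_YY[.ZZZZ], got %r" % name)
    return (language, country, codeset if dot else None)


print(split_locale("en_US.UTF-8"))  # ('en', 'US', 'UTF-8')
print(split_locale("ja_JP.eucJP"))  # ('ja', 'JP', 'eucJP')
```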
For language codes and country codes, see the pertinent description in the
gettext info manual (info gettext).
Please note that the codeset part may be normalized internally, to achieve
cross-platform compatibility, by removing all "-" characters and by converting
all characters into lower case. Typical codesets are:
-
UTF-8: Unicode for all regions, mostly in 1-3 Octets (new de
facto standard)
-
ISO-8859-1: western Europe (de facto old standard)
-
ISO-8859-2: eastern Europe (Bosnian, Croatian, Czech,
Hungarian, Polish, Romanian, Serbian, Slovak, Slovenian)
-
ISO-8859-3: Maltese
-
ISO-8859-5: Macedonian, Serbian
-
ISO-8859-6: Arabic
-
ISO-8859-7: Greek
-
ISO-8859-8: Hebrew
-
ISO-8859-9: Turkish
-
ISO-8859-11: Thai (=TIS-620)
-
ISO-8859-13: Latvian, Lithuanian, Maori
-
ISO-8859-14: Welsh
-
ISO-8859-15: western Europe with euro
-
KOI8-R: Russian
-
KOI8-U: Ukrainian
-
CP1250: Czech, Hungarian, Polish (MS Windows origin)
-
CP1251: Bulgarian, Belarusian (MS Windows origin)
-
eucJP: Unix style Japanese (=ujis)
-
eucKR: Unix style Korean
-
GB2312: Unix style Simplified Chinese (=GB, =eucCN) for zh_CN
-
Big5: Traditional Chinese for zh_TW
-
sjis: Microsoft style Japanese (Shift-JIS)
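The codeset normalization mentioned before this list (dropping every "-" and lower-casing the rest) can be sketched in Python as follows. The function name normalize_codeset is hypothetical, and the sketch implements only the two steps stated in this section; real C libraries may apply further rules.

```python
def normalize_codeset(codeset):
    """Drop every '-' and lower-case the rest, as the text describes.

    Hypothetical helper; not the actual C library routine.
    """
    return codeset.replace("-", "").lower()


for name in ("UTF-8", "ISO-8859-15", "eucJP", "KOI8-R"):
    print(name, "->", normalize_codeset(name))
```

With this rule, "UTF-8" and "utf8" compare equal after normalization, which is why both spellings are commonly accepted in locale names.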
The basic encoding system terms have the following meanings:
-
ASCII: 7 bits (0-0x7f)
-
ISO-8859-?: 8 bits (0-0xff)
-
ISO-10646-1: Universal Character Set (UCS) (31 bits,
0-0x7fffffff)
-
UCS-2: First 16 bits of UCS as straight 2 Octets (Unicode:
0-0xffff)
-
UCS-4: UCS as straight 4 Octets (UCS: 0-0x7fffffff)
-
UTF-8: UCS encoded in 1-6 Octets (mostly in 1-3 Octets; at most 4 Octets
for Unicode since RFC 3629)
-
ISO-2022: 7 bits (0-0x7f) with escape sequences.
ISO-2022-JP is the most popular encoding for Japanese e-mail.
-
EUC: 8 bits + 16 bits combination (0-0xff), Unix style
-
Shift-JIS: 8 bits + 16 bits combination (0-0xff), Microsoft
style.
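To make the octet counts in the list above concrete, the following Python sketch encodes one sample character with several codecs and prints the length of each result. The sample character is an arbitrary choice, and Python's "utf-16-be" and "utf-32-be" codecs are used here as stand-ins for UCS-2 and UCS-4, with which they coincide for this character.

```python
ch = "\u3042"  # HIRAGANA LETTER A, an arbitrary non-ASCII sample

# Python codec names; utf-16-be / utf-32-be stand in for UCS-2 / UCS-4.
codecs_to_try = (
    "utf-16-be",   # straight 2 octets
    "utf-32-be",   # straight 4 octets
    "utf-8",       # variable length
    "euc_jp",      # EUC, Unix style
    "shift_jis",   # Shift-JIS, Microsoft style
    "iso2022_jp",  # 7-bit octets plus escape sequences
)

for codec in codecs_to_try:
    data = ch.encode(codec)
    print("%-10s %d octets: %s" % (codec, len(data), data.hex()))
```

The ISO-2022-JP result is the longest because the escape sequences that switch character sets are part of the encoded data.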
ISO-8859-?, EUC, ISO-10646-1, UCS-2, UCS-4, and UTF-8 assign the same codes as
ASCII to the 7-bit characters. EUC and Shift-JIS use high-bit bytes
(0x80-0xff) to indicate that part of the encoding is 16 bits wide. UTF-8 also
uses high-bit bytes (0x80-0xff), for every byte of a non-7-bit character
sequence, which makes it the most sane encoding system for handling non-ASCII
characters.
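The ASCII compatibility and the use of high-bit bytes can be checked with a short Python sketch. The helper all_high_bit and the sample character are illustrative choices, not part of any library. The Shift-JIS case shows a trail byte that falls in the ASCII range, which UTF-8 never produces.

```python
def all_high_bit(data):
    """True if every octet in data is in the range 0x80-0xff."""
    return all(octet >= 0x80 for octet in data)


# 7-bit characters keep their ASCII code in the byte-oriented encodings.
for codec in ("latin_1", "euc_jp", "shift_jis", "utf-8"):
    assert "A".encode(codec) == b"A"

# UCS-2 and UCS-4 keep the same code value, padded with zero octets
# (Python's utf-16-be / utf-32-be codecs stand in for them here).
assert "A".encode("utf-16-be") == b"\x00A"
assert "A".encode("utf-32-be") == b"\x00\x00\x00A"

sample = "\u30bd"                  # KATAKANA LETTER SO
utf8 = sample.encode("utf-8")      # e3 82 bd
sjis = sample.encode("shift_jis")  # 83 5c (second octet is ASCII backslash)

print(all_high_bit(utf8))  # True
print(all_high_bit(sjis))  # False
```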
Please note the byte order differences between Unicode implementations:
-
Standard UCS-2, UCS-4: big endian
-
Microsoft UCS-2, UCS-4: little endian for ix86
(machine-dependent)
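The byte order difference above can be seen by encoding one character both ways. This is a small Python sketch; the "utf-16-*" and "utf-32-*" codecs are used as stand-ins for UCS-2 and UCS-4, with which they coincide for the sample character chosen here.

```python
import sys

ch = "\u3042"  # code value 0x3042

print(ch.encode("utf-16-be").hex())  # 3042      (big endian, 2 octets)
print(ch.encode("utf-16-le").hex())  # 4230      (little endian, 2 octets)
print(ch.encode("utf-32-be").hex())  # 00003042  (big endian, 4 octets)
print(ch.encode("utf-32-le").hex())  # 42300000  (little endian, 4 octets)

# Native byte order of the machine running this script.
print(sys.byteorder)
```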
See Convert a text file with recode, Section 8.6.12, for conversion between
various character sets. For more, see Introduction to i18n.