5.1.3 Unicode Strings
This manual section was written by Marc-Andre Lemburg mal at lemburg.com.
Python supports characters in different languages using
the Unicode standard. Unicode data can be stored and
manipulated in the same way as strings.
For example, creating Unicode strings in Python is as simple as creating
normal strings:
>>> u'Hello World !'
u'Hello World !'
The prefix ‘u’ in front of the quote indicates that a
Unicode string is to be created. If you want to include
special characters in the string, you can do so using the Python
Unicode-Escape encoding. The following example shows how:
>>> u'Hello\u0020World !'
u'Hello World !'
The escape sequence \u0020 inserts the Unicode
character with the hexadecimal value 0x0020 (the space character) at the
given position.
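As a small sketch of the escape mechanism just described, the following compares an escaped string with its literal spelling. The variable names are chosen for illustration only and are not part of this manual.

```python
# \uXXXX names a character by its four-digit hexadecimal code point.
space = u'\u0020'
greeting = u'Hello\u0020World !'

# The escape yields exactly the same string as typing the character itself.
same = (greeting == u'Hello World !')

# ord() returns the code point of a one-character Unicode string.
umlaut_code = ord(u'\u00e4')

print(same)
print(hex(umlaut_code))
```

The escape only affects how the string is written in source code; the resulting object holds the character itself.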
There is also a raw mode like the one for normal strings,
using the prefix ‘ur’ to specify
Raw-Unicode-Escape encoding of the string. In raw mode, the
\uXXXX conversion described above is applied only if an odd number of
backslashes precedes the lowercase 'u'.
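The backslash-counting rule can be tried out without a ‘ur’ literal by decoding bytes with the raw_unicode_escape codec. This is only a sketch: it assumes that this codec follows the same rule as the ‘ur’ prefix, and the variable names are hypothetical.

```python
# Assumption: the 'raw_unicode_escape' codec applies the same rule as ur'...'.

# One backslash before the 'u' (an odd count): the escape is converted.
one = b'Hello\\u0020World !'.decode('raw_unicode_escape')

# Two backslashes before the 'u' (an even count): nothing is converted,
# and both backslashes stay in the result.
two = b'Hello\\\\u0020World !'.decode('raw_unicode_escape')

print(one == u'Hello World !')
print(two == u'Hello\\\\u0020World !')
```

Raw mode is mainly useful when a string contains many backslashes, for example in regular expressions.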
Python provides additional functions for manipulating Unicode strings.
The built-in function unicode() provides access to standard Unicode
encodings such as latin-1, ascii, utf-8, and utf-16.
The default encoding is normally set to ascii, which passes
through characters in the range 0 to 127 and rejects any other
characters with an error. When a Unicode string is printed,
written to a file, or converted with str(),
conversion takes place using this default encoding.
>>> u"abc"
u'abc'
>>> str(u"abc")
'abc'
>>> u"\u00e4\u00f6\u00fc"
u'\xe4\xf6\xfc'
>>> str(u"\u00e4\u00f6\u00fc")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
To convert a Unicode string into an 8-bit string using a specific
encoding, Unicode objects provide an encode() method
that takes one argument, the name of the encoding.
>>> u"\u00e4\u00f6\u00fc".encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
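To make the effect of the encoding name more concrete, the sketch below encodes the same three characters with several of the encodings named in this section and catches the error raised by ascii. The variable names are illustrative assumptions.

```python
text = u"\u00e4\u00f6\u00fc"

utf8 = text.encode('utf-8')      # two bytes for each of these characters
latin1 = text.encode('latin-1')  # one byte per character
utf16 = text.encode('utf-16')    # byte-order mark, then two bytes per character

# ascii cannot represent characters outside the range 0 to 127,
# so the conversion fails with a UnicodeError.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeError:
    ascii_ok = False

print(len(utf8))
print(len(latin1))
print(len(utf16))
print(ascii_ok)
```

The same Unicode string therefore has a different length and byte content in each encoding, which is why the encoding has to be named explicitly.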
If you have data in a specific encoding and want to produce a
corresponding Unicode string from it, you can use the
unicode() function with the encoding name as the second
argument.
>>> unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8')
u'\xe4\xf6\xfc'
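The reverse direction can be sketched as a round trip. This example uses the decode() method of 8-bit strings rather than unicode(data, encoding); that substitution is an assumption made so the sketch does not depend on the unicode() built-in, and the variable names are hypothetical.

```python
data = b'\xc3\xa4\xc3\xb6\xc3\xbc'

# Decoding with the correct encoding recovers the three original characters.
text = data.decode('utf-8')

# Encoding again with the same encoding returns the original bytes.
round_trip = text.encode('utf-8')

# Decoding with the wrong encoding does not fail here, but it yields
# six unrelated characters instead of three.
wrong = data.decode('latin-1')

print(len(text))
print(round_trip == data)
print(len(wrong))
```

Because a wrong encoding name can silently produce incorrect text, the encoding of incoming data should be known rather than guessed.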