An Introduction to Python - Unicode Strings

5.1.3 Unicode Strings

This manual section was written by Marc-Andre Lemburg (mal at lemburg.com).

Python supports text in many languages through the Unicode standard. Unicode data is stored in Unicode objects, which can be manipulated in the same way as ordinary strings.

For example, creating Unicode strings in Python is as simple as creating normal strings:

    >>> u'Hello World !'
    u'Hello World !'

The prefix ‘u’ in front of the quote indicates that a Unicode string is to be created. If you want to include special characters in the string, you can do so using the Python Unicode-Escape encoding. The following example shows how:

    >>> u'Hello\u0020World !'
    u'Hello World !'

The escape sequence \u0020 inserts the Unicode character with the hexadecimal value 0x0020 (the space character) at the given position.
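As a small illustration of the escape notation, the sketch below checks that the escaped and the plain spelling give the same string and that the four hex digits are the character's code point. The variable names are made up for this example.

```python
# \u0020 and a literal space denote the same character,
# so both spellings produce equal strings.
escaped = u'Hello\u0020World !'
plain = u'Hello World !'
same = (escaped == plain)

# ord() returns the code point; hexadecimal 0x0020 is decimal 32.
space_code = ord(u'\u0020')

# The same notation reaches characters outside ASCII,
# here the letter a with diaeresis (0x00e4).
umlaut_code = ord(u'\u00e4')
```

After running it, `same` is true, `space_code` is 32 and `umlaut_code` is 228.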

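There is also a raw mode like the one for normal strings: the prefix ‘ur’ makes Python use the Raw-Unicode-Escape encoding. In this mode the \uXXXX conversion described above is applied only if there is an odd number of backslashes in front of the small 'u'. For instance, ur'Hello\u0020World !' produces u'Hello World !', while ur'Hello\\u0020World !' leaves the escape untouched and produces u'Hello\\\\u0020World !'.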

Python provides additional functions for manipulating Unicode strings. The built-in function unicode() provides access to the registered codecs, which convert between Unicode and encodings such as Latin-1, ASCII, UTF-8, and UTF-16. The default encoding is normally set to ASCII, which passes through characters in the range 0 to 127 and rejects any other character with an error. When a Unicode string is printed, written to a file, or converted with str(), the conversion uses this default encoding.

    >>> u"abc"
    u'abc'
    >>> str(u"abc")
    'abc'
    >>> u"\u00e4\u00f6\u00fc"
    u'\xe4\xf6\xfc'
    >>> str(u"\u00e4\u00f6\u00fc")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: ASCII encoding error: ordinal not in range(128)
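The failure above comes from the implicit ASCII conversion performed by str(). A minimal sketch of the same check made explicitly, using the encode() method of Unicode objects with the encoding name 'ascii', might look like this. The 'replace' error handler in the last step is an optional second argument of encode() that this section does not otherwise cover; it is shown only as one possible way to avoid the exception.

```python
text = u"\u00e4\u00f6\u00fc"

# Encoding explicitly to ASCII fails for code points above 127,
# which is what str() runs into with the default encoding.
try:
    text.encode('ascii')
    rejected = False
except UnicodeError:
    rejected = True

# Text made only of characters in the range 0 to 127 passes through.
ascii_ok = u"abc".encode('ascii')

# With the 'replace' handler each unencodable character becomes '?'.
replaced = text.encode('ascii', 'replace')
```

Catching UnicodeError keeps the sketch independent of the exact wording of the error message, which differs between Python releases.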

To convert a Unicode string into an 8-bit string using a specific encoding, Unicode objects provide an encode() method that takes one argument, the name of the encoding.

    >>> u"\u00e4\u00f6\u00fc".encode('utf-8')
    '\xc3\xa4\xc3\xb6\xc3\xbc'
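To see how the choice of encoding changes the resulting 8-bit string, the following sketch encodes the same three characters with three of the encodings named earlier and compares the byte counts. The variable names are illustrative only.

```python
text = u"\u00e4\u00f6\u00fc"

latin1 = text.encode('latin-1')  # one byte per character for this text
utf8 = text.encode('utf-8')      # two bytes per character for this text
utf16 = text.encode('utf-16')    # byte-order mark, then two bytes per character

# Collect the lengths of the three encoded forms.
sizes = (len(latin1), len(utf8), len(utf16))
```

For this input `sizes` comes out as (3, 6, 8): the UTF-16 result carries a two-byte byte-order mark in front of the data.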

If you have data in a specific encoding and want to produce a corresponding Unicode string from it, you can use the unicode() function with the encoding name as the second argument.

    >>> unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8')
    u'\xe4\xf6\xfc'
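The conversion can be checked with a round trip. The sketch below assumes the decode() method of 8-bit strings, which is not described in this section, as a counterpart to encode(); calling unicode() with the same encoding name, as above, is expected to give the same result. The b prefix marks an 8-bit string literal and is accepted from Python 2.6 on.

```python
raw = b'\xc3\xa4\xc3\xb6\xc3\xbc'   # UTF-8 bytes for the three characters

text = raw.decode('utf-8')          # 8-bit string -> Unicode
round_trip = text.encode('utf-8')   # Unicode -> 8-bit string again

# Decoding with the wrong encoding name raises no error here,
# but it yields six unrelated characters instead of three.
wrong = raw.decode('latin-1')
```

Here `round_trip` equals `raw`, while `wrong` has length 6, which shows why the encoding name has to match the encoding the data was actually written in.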

Published under the terms of the Python License
