Вы находитесь на странице: 1из 2

Python Unicode Objects http://effbot.org/zone/unicode-objects.

htm

Pyt h on U n i cod e Obj ect s


Som e Obser vat i on s on W or k i n g W i t h N on -ASCI I
Ch ar act er Set s
This note pr ovides some br ief infor mation on best pr actices for wor king with
non-ASCI I data in Python 2.0 and later . As ever ything else on this site, this is
a wor k in pr ogr ess.
Updated June 21, 2004 | February 11, 2002 | Fredrik Lundh

Pythons Unicode string type stores characters from the Unicode character set.
In this set, each distinct character has its own number, the code point.
Unicode supports more than one million code points. Unicode characters
dont have an encoding; each character is represented by its code. The Unicode
string type uses some unknown mechanism to store the characters; in your
Python code, Unicode strings simply appear as sequences of characters, just
like 8-bit strings appear as sequences of bytes.

Observations:

1. Text files always contain encoded text, not characters. Each character in
the text is encoded as one or more bytes in the file.

2. Most popular encodings (UTF-8, ISO-8859-X, etc) are supersets of


ASCII. This means that the first 128 characters have the usual meaning,
and that the usual characters are used for line endings. In other words,
r eadl i n e( ) will work just fine.

3. You can mix Python Unicode strings with 8-bit Python strings, as long
as the 8-bit string only contains ASCII characters. A Unicode-aware
library may chose to use 8-bit strings for text that only contains ASCII,
to save space and time.

4. If you read a line of text from a file, you get bytes, not characters.

5. To decode an encoded string into a string of well-defined characters,


you have to know what encoding it uses.

6. To decode a string, use the decode( ) method on the input string, and
pass it the name of the encoding:
fileencoding = "iso-8859-1"

raw = file.readline()
txt = raw.decode(fileencoding)

(the result is a Python Unicode string).

1 de 2 11/04/2017 09:10 p.m.


Python Unicode Objects http://effbot.org/zone/unicode-objects.htm

The decode method was added in Python 2.2. In earlier versions (or if
you think it reads better), use the u n i code constructor instead:
txt = unicode(raw, fileencoding)

7. Pythons regular expression engine supports Unicode. You can apply


the same pattern to either 8-bit (encoded) or Unicode strings. To create
a regular expression pattern that uses Unicode character classes for \ w
(and \s, and \b), use the (?u) flag prefix, or the r e.U N I COD E flag:
pattern = re.compile("(?u)pattern")

pattern = re.compile("pattern", re.UNICODE)

8. To write a Unicode string to a file or other device, you have to convert it


to the encoding used by the file. The en code method converts from
Unicode to an encoded string.
out = txt.encode(encoding)

If the string contains characters that cannot be represented in the given


encoding, Python raises an exception. You can change this by passing in
a second argument to en code:
# s k i p bad c har s
out = txt.encode(encoding, "ignore")

# r epl ac e bad c har s wi t h " ?"


out = txt.encode(encoding, "replace")

For more on string encoding, see Conver ting Unicode Str ings to 8-bit
Str ings.

9. To print a Unicode string to your output device, you have to convert it


to the encoding used by your terminal. The en code( ) method converts
from Unicode back to an encoded string. You can use the
l ocal e.get def au l t l ocal e( ) function to get the current output
encoding.
i mpor t locale
language, output_encoding = locale.getdefaultlocale()

pr i nt txt.encode(output_encoding)

Ther e ar e lots of shor tcuts in Python, including coded str eams, using default
locales for patter n matching, I SO-8859-1 as a subset of Unicode, etc, but
thats outside the scope of this note. At least for the moment.

rendered by a django application. hosted by webfaction.

2 de 2 11/04/2017 09:10 p.m.

Вам также может понравиться