Unicode
Handling Unicode and bytes/ASCII
All the examples here are in Python 3.
Unicode escaped characters
Inside a string, write \u followed by four hex digits to produce a Unicode character from its code point.
Python displays the resulting character in human-readable form.
>>> '\u003e'
'>'
>>> print('\u003e')
>
>>> '\u00cd'
'Í'
>>> '\uabcd'
'ꯍ'
>>> 'foo \u003e \u00cd \xf0\x9f\x98\x80 baz'
'foo > Í ð\x9f\x98\x80 baz'
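The last example above may be surprising: in a str, each \x escape is a code point, not a UTF-8 byte, so the four \x escapes do not combine into an emoji. A minimal sketch of the escape forms follows; the variable names are illustrative only.

```python
# Three escape forms for writing a character by its code point.
gt = '\u003e'          # \u takes exactly four hex digits
emoji = '\U0001f600'   # \U takes exactly eight hex digits
eth = '\xf0'           # \x takes two hex digits: code point U+00F0, not a byte

print(gt, hex(ord(gt)))  # ord() returns the code point of a character
print(len(emoji))        # the emoji is a single character

# Four \x escapes in a str are four separate characters.
four_chars = '\xf0\x9f\x98\x80'
print(len(four_chars), four_chars == emoji)

# Reinterpreting those code points as raw bytes recovers the emoji.
recovered = four_chars.encode('latin-1').decode('utf-8')
print(recovered == emoji)
```

To write the emoji directly in a str, use the \U form or the literal character rather than its UTF-8 bytes.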
Raw
In a raw string, Python does not interpret the escape. The backslash stays a literal character, which the repr shows as a double backslash.
>>> r'\u003e'
'\\u003e'
This is the same as escaping the backslash yourself with a double backslash.
>>> '\\u003e'
'\\u003e'
>>> print(r'\u003e')
\u003e
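A short sketch of the difference, comparing lengths rather than displayed output. The names escaped and raw are just for illustration.

```python
escaped = '\u003e'   # the escape is interpreted: one character, '>'
raw = r'\u003e'      # the escape is kept: backslash, 'u', and four digits

print(len(escaped))        # 1
print(len(raw))            # 6
print(raw == '\\u003e')    # True: a raw string equals the double-backslash form
print(raw[0] == '\\')      # True: the first character is a single backslash
```

The double backslash only appears in the repr. The string itself contains one backslash, which is why print() shows \u003e.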
Unicode characters
Note that in Python 3, all strings are Unicode strings. There is no separate unicode type anymore, only str.
An emoji:
>>> '😀'
'😀'
Convert from string to bytes
>>> 'Hello 😀'.encode('utf-8')
b'Hello \xf0\x9f\x98\x80'
You could leave out 'utf-8' here and get the same result, since it is the default encoding, but it is good to be explicit.
Convert from bytes to string
>>> b'Hello \xf0\x9f\x98\x80'.decode('utf-8')
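'Hello 😀'
Note that \u is not an escape sequence in a bytes literal, so those sequences are left as literal backslashes, while the \x bytes are decoded as UTF-8. Newer Python versions also warn about the invalid escape sequence.
>>> b'foo \u003e \u00cd \xf0\x9f\x98\x80 baz'.decode('utf-8')
'foo \\u003e \\u00cd 😀 baz'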
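Encoding and decoding with the same codec are inverses of each other. The sketch below checks the round trip and shows that the character count and the byte count differ; the variable names are illustrative.

```python
text = 'Hello \U0001f600'      # 'Hello ' plus one emoji
data = text.encode('utf-8')    # str -> bytes

print(len(text))   # 7 characters
print(len(data))   # 10 bytes: the emoji takes 4 bytes in UTF-8

# Decoding with the same codec returns the original string.
round_trip = data.decode('utf-8')
print(round_trip == text)

# 'utf-8' is the default for both methods.
print(text.encode() == data, data.decode() == text)
```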
Encode as ASCII
Convert from string to bytes using the ASCII codec, which raises an error on any character it cannot represent.
The default is errors='strict', which raises a UnicodeEncodeError.
>>> 'Hello 😀'.encode('ascii')
# UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f600' in position 6: ordinal not in range(128)
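If you would rather handle the failure than let it propagate, the exception can be caught and inspected. This is a minimal sketch using the standard attributes of UnicodeEncodeError; the variable names are illustrative.

```python
text = 'Hello \U0001f600'

try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    # exc.object is the string being encoded; start/end bound the bad span.
    bad = exc.object[exc.start:exc.end]
    position = exc.start
    codec = exc.encoding
    print(f'{codec} cannot encode {bad!r} at position {position}')
```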
We can choose to replace or ignore Unicode characters.
>>> 'Hello 😀 хелло world'.encode('ascii', errors='replace')
b'Hello ? ????? world'
>>> 'Hello 😀 хелло world'.encode('ascii', errors='ignore')
b'Hello   world'
Note that you'll get bytes above, so convert back to a string with .decode() if you want to work with the result as a string.
>>> 'Hello 😀 хелло world'.encode('ascii', errors='replace').decode()
'Hello ? ????? world'
This replaces non-ASCII characters with a placeholder. Use errors='ignore' instead to strip them out.
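The encode-then-decode round trip can be wrapped in a small helper. The function name to_ascii below is hypothetical, not part of any library; it simply forwards the errors argument to str.encode(). The sample string is written with escapes but is the same 'Hello 😀 хелло world' text used above.

```python
def to_ascii(text, errors='ignore'):
    """Return text with non-ASCII characters handled per the errors policy."""
    return text.encode('ascii', errors=errors).decode('ascii')

# 'Hello 😀 хелло world', written with escapes
sample = 'Hello \U0001f600 \u0445\u0435\u043b\u043b\u043e world'

stripped = to_ascii(sample)                             # drop non-ASCII
replaced = to_ascii(sample, errors='replace')           # one '?' per character
escaped = to_ascii(sample, errors='backslashreplace')   # keep as escape text

print(repr(stripped))
print(repr(replaced))
print(repr(escaped))
```

Note that 'ignore' leaves the surrounding spaces in place, so the stripped result contains three consecutive spaces. If that matters, ' '.join(stripped.split()) collapses them.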
Codecs
Using the built-in codecs module.
Example
>>> import codecs
>>> codecs.decode(r'\u003e', 'unicode-escape')
'>'
API
Some of the module's functions:
codecs.encode(obj, encoding='utf-8', errors='strict')
codecs.decode(obj, encoding='utf-8', errors='strict')
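A brief sketch of the two functions above in use. They behave like str.encode() and bytes.decode() for text codecs, and they also work with the unicode-escape codec from the earlier example. Variable names are illustrative.

```python
import codecs

# Text codec: equivalent to 'Hello 😀'.encode('utf-8') and back.
raw = codecs.encode('Hello \U0001f600', encoding='utf-8')
text = codecs.decode(raw, encoding='utf-8')
print(raw, text == 'Hello \U0001f600')

# The errors argument works the same way as with str.encode().
ascii_bytes = codecs.encode('Hello \U0001f600', 'ascii', errors='replace')
print(ascii_bytes)

# unicode-escape turns characters into backslash escapes and back.
escaped = codecs.encode('\uabcd', 'unicode-escape')
unescaped = codecs.decode(escaped, 'unicode-escape')
print(escaped, unescaped == '\uabcd')
```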