Unicode

All the examples here are in Python 3.

Unicode escaped characters

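Inside a string literal, write \u followed by four hexadecimal digits to produce a Unicode character from its code point.

The interpreter displays the resulting character in human-readable form rather than as the escape sequence.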

>>> '\u003e'
'>'

>>> print('\u003e')
>

>>> '\u00cd'
'Í'
>>> '\uabcd'
'ꯍ'

>>> 'foo \u003e \u00cd \xf0\x9f\x98\x80 baz'
'foo > Í ð\x9f\x98\x80 baz'
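The last result above can look surprising. Inside a str literal, each \x escape names a code point, not a byte, so \xf0\x9f\x98\x80 produces four separate characters rather than the emoji. A minimal sketch of the difference, using illustrative variable names; it also shows \U with eight hex digits, which is the escape for code points above U+FFFF:

```python
# \u takes exactly four hex digits; \U takes eight and covers code points above U+FFFF.
greater_than = '\u003e'
grinning_face = '\U0001f600'

# In a str literal, each \xNN is one code point in the range 0-255, not one byte.
four_chars = '\xf0\x9f\x98\x80'

# Latin-1 maps code points 0-255 to the same byte values, so it recovers
# the intended bytes, which can then be decoded as UTF-8.
recovered = four_chars.encode('latin-1').decode('utf-8')

print(greater_than)
print(len(four_chars))
print(recovered == grinning_face)
```

If the goal is the emoji itself, writing '\U0001f600' or the literal character is simpler than going through the byte values.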

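If you make it a raw string, Python does not interpret \u as an escape. The backslash stays a literal character, which the repr shows as a doubled backslash.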

>>> r'\u003e'
'\\u003e'

The same string written as a regular literal, with the backslash escaped by doubling it.

>>> '\\u003e'
'\\u003e'

>>> print(r'\u003e')
\u003e
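To confirm that the raw string and the double-backslash string are the same value, and that both differ from the interpreted escape, a short check along these lines may help (the variable names are only for illustration):

```python
raw = r'\u003e'          # six characters: backslash, u, 0, 0, 3, e
escaped = '\\u003e'      # the same six characters, backslash doubled in the source
interpreted = '\u003e'   # a single character: '>'

print(raw == escaped)
print(len(raw), len(interpreted))
```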

Note that in Python 3, all strings are Unicode strings. There is no separate unicode type anymore, only str.

An emoji:

>>> '😀'
'😀'

>>> 'Hello 😀'.encode('utf-8')
b'Hello \xf0\x9f\x98\x80'

Here you could leave out 'utf-8' and get the same result, since UTF-8 is the default encoding for str.encode(), but it is good to be explicit.

>>> b'Hello \xf0\x9f\x98\x80'.decode('utf-8')
'Hello 😀'
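The encode and decode steps above can be put together as a round trip. This sketch also compares the explicit and default encodings and shows that one emoji code point becomes four UTF-8 bytes; the variable names are illustrative:

```python
text = 'Hello 😀'

explicit = text.encode('utf-8')
implicit = text.encode()  # str.encode() defaults to UTF-8

print(explicit == implicit)

# Seven code points, but ten bytes: the emoji alone takes four bytes in UTF-8.
print(len(text), len(explicit))

round_trip = explicit.decode('utf-8')
print(round_trip == text)
```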

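Note that \u escapes are not recognized in bytes literals, so they are left as literal backslash sequences, while the \x escapes are real bytes that decode as UTF-8. Recent Python versions also emit a warning about the invalid escape sequence.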

>>> b'foo \u003e \u00cd \xf0\x9f\x98\x80 baz'.decode('utf-8')
'foo \\u003e \\u00cd 😀 baz'

Convert from string to bytes using the ASCII codec, which raises an error for any character it cannot represent.

The default is errors='strict', which raises UnicodeEncodeError on such characters.

>>> 'Hello 😀'.encode('ascii')
# UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f600' in position 6: ordinal not in range(128)
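If the input might contain non-ASCII characters, the strict error can be caught rather than left to propagate. A minimal sketch, assuming a hypothetical helper named try_encode_ascii; the start, end and object attributes used here are standard on UnicodeEncodeError:

```python
def try_encode_ascii(text):
    """Return ASCII bytes, or None if text has a character ASCII cannot hold."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError as exc:
        # exc.object is the original string; start/end mark the offending slice.
        offending = exc.object[exc.start:exc.end]
        print(f'Cannot encode {offending!r} at position {exc.start}')
        return None


ok = try_encode_ascii('Hello')
failed = try_encode_ascii('Hello 😀')
```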

We can choose to replace or ignore the characters that cannot be encoded.

>>> 'Hello 😀 хелло world'.encode('ascii', errors='replace')
b'Hello ? ????? world'

>>> 'Hello 😀 хелло world'.encode('ascii', errors='ignore')
b'Hello   world'

Note that you get bytes above, so convert back to a string with .decode() if you want to work with the result as a string.

>>> 'Hello 😀 хелло world'.encode('ascii', errors='replace').decode()
'Hello ? ????? world'

That is useful for replacing non-ASCII characters with a placeholder. Use errors='ignore' instead to strip them out entirely.
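The encode-then-decode pattern can be wrapped in a small function. The sketch below assumes a hypothetical helper name, strip_non_ascii, and also shows two other standard error handlers, 'backslashreplace' and 'xmlcharrefreplace', which keep the lost characters in escaped form instead of dropping them:

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range (hypothetical helper)."""
    return text.encode('ascii', errors='ignore').decode('ascii')


sample = 'Hello 😀 хелло world'
stripped = strip_non_ascii(sample)

# These handlers preserve the information as ASCII-only escape sequences.
backslashed = 'Hello 😀'.encode('ascii', errors='backslashreplace')
xml_refs = 'Hello 😀'.encode('ascii', errors='xmlcharrefreplace')

print(stripped)
print(backslashed)
print(xml_refs)
```

Note that stripping leaves the surrounding spaces in place, so the result can contain runs of consecutive spaces.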

Using the standard library codecs module.

>>> import codecs
>>> codecs.decode(r'\u003e', 'unicode-escape')
'>'

The signatures of its general-purpose functions:

codecs.encode(obj, encoding='utf-8', errors='strict')

codecs.decode(obj, encoding='utf-8', errors='strict')
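As a rough illustration of how these two functions are called, the sketch below round-trips a string through UTF-8 and through the 'unicode_escape' codec. The variable names are illustrative, and for plain text conversions str.encode() and bytes.decode() do the same job:

```python
import codecs

text = 'Hello 😀'

# Equivalent to text.encode('utf-8') and encoded.decode('utf-8').
encoded = codecs.encode(text, 'utf-8')
decoded = codecs.decode(encoded, 'utf-8')

# 'unicode_escape' writes non-ASCII characters as backslash escapes
# in ASCII-only bytes, and decoding reverses that.
escaped = codecs.encode(text, 'unicode_escape')
unescaped = codecs.decode(escaped, 'unicode_escape')

print(encoded)
print(decoded == text)
print(unescaped == text)
```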