Strings are among the most commonly used data types in Python, and there might be times when you want to (or have to) work with strings containing or entirely made up of characters outside of the standard ASCII set (e.g. characters with accents or other markings).
Python 2.x provides a data type called a Unicode string for working with Unicode data using string encoding and decoding methods. If you want to learn more about Unicode strings, be sure to check out Wikipedia’s article on Unicode.
Note: When a Python 2.x script contains non-ASCII characters, you must put the following line on the first or second line of the file to tell Python that the source code is encoded in UTF-8.
# -*- coding: utf-8 -*-
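As a rough illustration of the note above, here is a minimal sketch of a script that carries the declaration and a non-ASCII literal. The variable names are made up for this example, and the file is assumed to be saved as UTF-8.

```python
# -*- coding: utf-8 -*-
# The declaration above has to sit on the first or second line of the file.

text = u"Flügel"                    # non-ASCII literal typed straight into the source
utf8_bytes = text.encode("utf-8")   # the bytes this text takes up as UTF-8

print(len(text))         # 6 characters
print(len(utf8_bytes))   # 7 bytes: the "ü" takes two bytes in UTF-8
```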
Python Unicode: Overview
In order to figure out what “encoding” and “decoding” are all about, let’s look at an example string:
>>> s = "Flügel"
We can see our string s has a non-ASCII character in it, namely “ü” or “umlaut-u.” Assuming we’re in the standard Python 2.x interactive mode, let’s see what happens when we reference the string, and when it’s printed:
>>> s
'Fl\xfcgel'
>>> print(s)
Flügel
Printing gave us the value that we assigned to the variable, but something obviously happened along the way that turned what we typed into the interpreter into something seemingly incomprehensible. The non-ASCII character ü was translated into a code phrase, i.e. “\xfc,” by a set of rules behind the scenes. In other words, it was encoded.
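To see where “\xfc” comes from, here is a small sketch. It assumes the console was using the latin-1 encoding, which the session above suggests but does not state outright. The names char and encoded are only for illustration.

```python
# Assumption: the console encoding is latin-1 (ISO-8859-1).
char = u"\xfc"                      # the character "ü", written as an escape
encoded = char.encode("latin-1")    # the single byte an 8-bit string would hold

print(repr(encoded))     # the byte, displayed as the escape \xfc
print(hex(ord(char)))    # 0xfc
```

The escape “\xfc” is simply how Python displays a byte whose value is 0xFC in hexadecimal.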
At this point, s is an 8-bit string, which to us basically means it isn’t a Unicode string. Let’s examine how to make a Unicode string with the same data. The simplest way is with a “u” prefix in front of the literal string marking it as a Unicode string:
>>> u = u"Flügel"
If we reference and print u like we did with s, we’ll find something similar:
>>> u
u'Fl\xfcgel'
>>> print(u)
Flügel
We can see that the code phrase for our “umlaut-u” is still “\xfc” and it prints the same, so does that mean our Unicode string is encoded the same way as our 8-bit string s? To figure that out, let’s look at what the encode method does when we try it on u and s:
>>> u.encode('latin_1')
'Fl\xfcgel'
>>> s.encode('latin_1')
Traceback (most recent call last):
  File "<pyshell#35>", line 1, in <module>
    s.encode('latin_1')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 2: ordinal not in range(128)
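The UnicodeDecodeError in the traceback above may look odd for a call to encode. In Python 2.x, encoding an 8-bit string first decodes it implicitly with the default 'ascii' codec, and that hidden step is what fails. The sketch below reproduces just that step; it uses a b'' literal so it also runs on Python 3, and the names raw and error_message are only for illustration.

```python
raw = b"Fl\xfcgel"   # the same bytes that s holds in the session above

try:
    # Python 2.x performs this decode implicitly before encoding an 8-bit string.
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```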
Now it seems encoding the Unicode string (with the ‘latin-1’ encoding) returned the same value as string s, but the encode method didn’t work on string s. Since we couldn’t encode s, what about decoding it? Will it give us the same value as u? Let’s find out:
>>> s.decode('latin-1')
u'Fl\xfcgel'
That’s exactly what it does after all. So then, what difference does it make that s is an 8-bit string and u is a Unicode string? They behave the same way, don’t they? In our “umlaut-u” example, there seemed to be little difference, aside from the “u” hanging out in front of the Unicode string.
Well, the difference is that the Unicode string u was using a code phrase that the Unicode standard defines for the character “umlaut-u,” and the 8-bit string s was using a code phrase that the “latin-1” codec (rule-set) defines for “umlaut-u.”
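The two code phrases match here because latin-1 maps its 256 byte values directly onto the first 256 Unicode code points. A quick sketch, with variable names chosen only for this example, shows that the same text yields different bytes under a different codec:

```python
text = u"Fl\xfcgel"

latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")

print(ord(text[2]) == 0xFC)             # True: the Unicode code point of "ü" is U+00FC
print(latin1_bytes == b"Fl\xfcgel")     # True: latin-1 stores it as the single byte 0xFC
print(utf8_bytes == b"Fl\xc3\xbcgel")   # True: UTF-8 stores it as two bytes
```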
OK, well… that’s great, but… they were still the same, right? So why does that matter?
To illustrate the difference and why it matters, let’s consider a new 8-bit string:
>>> new_s = '\xe5\xad\x97'
Unlike the first one, our new 8-bit string is only code phrases: completely incomprehensible.
Why didn’t we just type in (or copy-and-paste) the characters like the last 8-bit string? Well, assuming we’re still using the standard Python 2.x console/IDLE, we couldn’t type or paste this value into the interpreter, as it would not accept the value if we did. Why? Because our new_s is an encoded string for an Asian script character (yes, only one character), and the interactive mode/IDLE is averse to such input (if you have the appropriate input keyboard installed for your system you can try this and find out).
The question now is how can we turn these code phrases into the character they’re supposed to display? In the first example, using a print statement on s worked fine, so it should be the same with new_s, right? Let’s see what our unknown Asian script character is:
>>> print new_s
å—
Uh-oh… that isn’t right. First of all, that is not an Asian script character. Secondly, it’s more than one character. Simply referencing new_s would only give us back the string we assigned to it, and printing it evidently doesn’t produce the character we’re after.
To create our new Unicode string new_u, we can’t follow the method in our first example: to do that we’d have to input the literal character of our string with a “u” prefix (we haven’t seen our character yet, and anyway the interactive mode/IDLE wouldn’t accept it as input).
However, we did get the value of u by decoding s, so in the same way we should be able to get the value for our new_u by decoding new_s. Let’s try decoding as we did in the first example:
>>> new_u = new_s.decode('latin_1')
>>> new_u
u'\xe5\xad\x97'
Great, now that we’ve stored the decoded new_s string value using the same method as in our first example, let’s print our Unicode string and see what our script character is:
>>> print(new_u)
å
Uh… Isn’t that the same thing we got when we tried to print the new_s string? So then using the Unicode string really isn’t any different?
Not so fast. There’s one detail that was purposefully glossed over to prove a point: the encoding we used to decode the string is the same as in the first example, the ‘latin-1’ codec. However, the 8-bit string new_s was not encoded in ‘latin-1’; it was encoded in ‘utf-8.’
Okay, so there was no way for you really to know that unless explicitly told, but this still illustrates the point: the encoding/codec/rule-set makes all the difference when encoding/decoding strings.
With the right encoding, let’s see what happens:
>>> new_u = new_s.decode('utf_8')
>>> new_u
u'\u5b57'
>>> print(new_u)
字
FINALLY! Our long-lost script character has been found, and it looks pretty good. Now, try copying-and-pasting the character as input—you’ll find it doesn’t work (we’re still talking about Python 2.x interactive mode/IDLE).
You might also notice something different about the value of new_u, i.e. it appears to consist of only one code phrase (this time in the form ‘\uXXXX’). The Unicode standard assigns a unique code phrase (a code point) to every character it defines, whatever the script. With that in mind, you can also tell that the first time we tried to decode new_s, the value was wrong (“u’\xe5\xad\x97’” has 3 code phrases, and for the Unicode standard, that means 3 separate characters).
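The character count is an easy way to check this. The sketch below decodes the same three bytes with different codecs and compares the results. The cp1252 line is an assumption: it guesses that the console in the earlier print example was using Windows-1252, which would explain the dash in that output. The variable names are only for illustration.

```python
new_s = b"\xe5\xad\x97"   # the same three bytes as in the example above

as_latin1 = new_s.decode("latin-1")   # wrong codec: every byte becomes one character
as_utf8 = new_s.decode("utf-8")       # right codec: three bytes become one character
as_cp1252 = new_s.decode("cp1252")    # assumed console codec (Windows-1252)

print(len(as_latin1))                  # 3
print(len(as_utf8))                    # 1
print(hex(ord(as_utf8)))               # 0x5b57
print(as_cp1252[2] == u"\u2014")       # True: byte 0x97 is a dash in Windows-1252
```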
Well, now that those annoying examples are finished, let’s recap the main points of all this hoopla:
- Strings are one of the most common data types in Python, and sometimes they’ll include non-ASCII characters.
- When strings contain non-ASCII characters, they can either be 8-bit strings (encoded strings), or they can be Unicode strings (decoded strings).
- To print or display some strings properly, they need to be decoded (Unicode strings).
- THE ENCODING/CODEC MAKES ALL THE DIFFERENCE WHEN ENCODING/DECODING STRINGS.
The encoding/codec is like the DNA of your string: even with the same nutrients (input), using the wrong DNA (codec) will give you an orange when you should have had an apple.
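To tie the recap together, here is a minimal sketch of the apple-and-orange effect. The helper roundtrip is hypothetical, written only for this example, and the code is meant to run on both Python 2.7 and Python 3.

```python
def roundtrip(text, encode_with, decode_with):
    """Encode text with one codec, then decode the resulting bytes with another."""
    return text.encode(encode_with).decode(decode_with)

original = u"\u5b57"   # the character from the example above

same = roundtrip(original, "utf-8", "utf-8")       # matching codecs
garbled = roundtrip(original, "utf-8", "latin-1")  # mismatched codecs

try:
    # latin-1 has no byte for this character, so encoding fails outright.
    original.encode("latin-1")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(same == original)      # True
print(garbled == original)   # False
print(encode_failed)         # True
```

A mismatched codec does not always raise an error. As the garbled value shows, it can quietly hand back the wrong characters instead.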