Understanding Unicode (Part 3): Handling Decoding Errors
In a previous article, we discussed how Python handles encoding errors. In this article, we'll explore the errors that can occur during the decoding process, which is when we convert a sequence of bytes back into a string.
Handling Decoding Errors
Decoding errors occur when the byte sequence is not valid in the chosen encoding scheme: a byte may not map to any character, or a multi-byte sequence may be incomplete. For example:
# Encoding a string in UTF-8
>>> b1 = 'cà phê'.encode('utf-8')
# b1 is now: b'c\xc3\xa0 ph\xc3\xaa'
# Creating an invalid byte sequence by removing the byte \xa0
>>> b1_invalid = b'c\xc3 ph\xc3\xaa'
>>> b1_invalid.decode('utf-8')
# Raises an error:
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1: invalid continuation byte
When a valid UTF-8 byte sequence like \xc3\xa0 is replaced by \xc3 (which is not valid on its own in UTF-8), a UnicodeDecodeError is raised. It's similar to removing a letter from a word in English: "Go" is a valid word, but if you remove the "o", the remaining "G" isn't a meaningful word on its own.
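If you would rather not let the exception stop your program, bytes.decode() also accepts an errors argument that tells Python what to do with invalid bytes. Here is a small sketch using the same b1_invalid bytes as above, with the standard 'replace' and 'ignore' handlers that are built into Python:
>>> b1_invalid.decode('utf-8', errors='replace')
# Outputs: 'c� phê' (the invalid byte becomes U+FFFD, the replacement character)
>>> b1_invalid.decode('utf-8', errors='ignore')
# Outputs: 'c phê' (the invalid byte is silently dropped)
Which handler is appropriate depends on whether you prefer a visible placeholder in the text or silently losing data.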
However, some 8-bit encoding schemes like cp1252 or iso8859_1 will decode any byte sequence they encounter, without indicating any errors, leading to incorrect text display:
# Attempting to decode the same invalid byte sequence with 'cp1252'
>>> b1_invalid.decode('cp1252')
# Outputs: 'cÃ phÃª'
>>> b'\xc3'.decode('cp1252')
# Outputs: 'Ã'
>>> b'\xaa'.decode('cp1252')
# Outputs: 'ª'
Because cp1252 decodes each byte independently, without considering the surrounding bytes, it cannot recognize multi-byte sequences. In UTF-8, the two bytes \xc3\xaa together represent the character ê, but cp1252 interprets \xc3 as Ã and \xaa as ª, resulting in garbled text.
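Incidentally, when the original bytes were valid UTF-8, this kind of mix-up can usually be undone by reversing the two steps: re-encode the garbled string with cp1252 to recover the original bytes, then decode them as UTF-8. A small sketch reusing b1 from the first example:
>>> garbled = b1.decode('cp1252')
# garbled is now: 'cÃ\xa0 phÃª'
>>> garbled.encode('cp1252').decode('utf-8')
# Outputs: 'cà phê'
Note that this round trip only works when every original byte survives; with the truncated b1_invalid above, the re-encoded bytes are still invalid UTF-8.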
For example, early in my career, I used an older version of SQL Management Studio to display text from a database. The tool's default encoding was cp1252, so when it displayed text that was actually stored as UTF-16 in the database, the result looked strange.
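To give a rough idea of that effect (not the exact output I saw back then), here is a small sketch of what happens when UTF-16 bytes are decoded as cp1252; I'm assuming little-endian UTF-16 here, whose interleaved \x00 bytes show up as stray null characters:
>>> b2 = 'cà phê'.encode('utf-16-le')
# b2 is now: b'c\x00\xe0\x00 \x00p\x00h\x00\xea\x00'
>>> b2.decode('cp1252')
# Outputs: 'c\x00à\x00 \x00p\x00h\x00ê\x00'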
Conclusion
We’ve discussed what happens when you use a decoding scheme that doesn't match the original encoding of the text, and we saw how popular encodings like UTF-8 and cp1252 handle invalid byte sequences differently. In the next articles, we'll dive deeper into the complexities of handling accented strings, such as comparing and sorting them. Thanks for reading, and I hope you join me in exploring more in the following articles.