Understanding Unicode: Fundamental Knowledge to Avoid Encoding/Decoding Errors (Part I)

From the outset of my career as a developer, I frequently encountered the challenge of converting and normalizing text data between two systems. Often, text that displayed correctly in the original system would appear as a series of strange characters, such as ????, in the new system. This issue stems from a lack of solid understanding of Unicode and the associated concepts of encoding and decoding. In this series, we will review these concepts together to better understand and solve such challenges.

[Image: encoding/decoding Unicode summary]


Understanding Unicode

As software developers, when we talk about "strings," we mean sequences of characters. But what exactly is a "character," and how is it represented in a computer? Traditionally, characters were represented using the ASCII standard, which defines 128 characters, each stored in a single byte. However, ASCII can only represent unaccented English letters, digits, and basic punctuation, making it inadequate for global use.

Enter Unicode, a comprehensive text encoding standard that can represent all written text worldwide. In Unicode, each character is mapped to a unique number known as a "code point," which ranges from 0 to 1,114,111. A character in Unicode might be as simple as 'a' or 'b,' or as complex as an emoji like 😘 or 🥰.
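For example, here is a minimal Python sketch (the specific characters are just an illustration) showing how the built-in ord() and chr() functions map between characters and their code points:

print(ord("a"))       # 97       -- the same value the character has in ASCII
print(ord("😘"))      # 128536   -- code point U+1F618, far beyond ASCII's 128 slots
print(chr(0x1F970))   # 🥰       -- chr() goes the other way: code point -> character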

[Image: definition of a character]

Encoding

Encoding is the process of converting code points into a sequence of bytes (e.g., b'\xff\xfe\x03\x01') so they can be held in memory or stored on disk. How each code point is represented depends on the encoding scheme, such as UTF-8 or UTF-16. While Unicode defines a "dictionary" mapping every character to a number, encoding schemes determine how those numbers are laid out as bytes. UTF-8, for example, uses 1 byte for the first 128 code points (the same as ASCII) and 2 to 4 bytes for all other characters. UTF-16 encodes each character as 2 or 4 bytes, depending on the character.

[Image: the same character produces different byte sequences: 2 bytes in UTF-8, 4 bytes in UTF-16]
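To make this concrete, here is a small Python sketch (the character 'ă', U+0103, is just an example; the UTF-16 output assumes a little-endian platform, where Python prepends a 2-byte byte-order mark):

text = "ă"                            # U+0103, LATIN SMALL LETTER A WITH BREVE
utf8_bytes = text.encode("utf-8")     # b'\xc4\x83'          -> 2 bytes
utf16_bytes = text.encode("utf-16")   # b'\xff\xfe\x03\x01'  -> 4 bytes (BOM + one 2-byte code unit)
print(len(utf8_bytes), len(utf16_bytes))   # 2 4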

[Image: encoding a character the target encoding cannot represent]
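As a rough illustration of that failure mode (reusing the hypothetical 'ă' example), encoding a character that the target scheme cannot represent raises a UnicodeEncodeError in Python unless you opt into lossy error handling:

text = "ă"
try:
    text.encode("ascii")                       # 'ă' is not among ASCII's 128 characters
except UnicodeEncodeError as err:
    print(err)                                 # "'ascii' codec can't encode character ..."
print(text.encode("ascii", errors="replace"))  # b'?' -- the data is silently lost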

Decoding

Decoding is the reverse process: a sequence of bytes from memory or disk is converted back into Unicode code points, which are then mapped to their corresponding characters. Proper decoding requires knowing which encoding scheme produced the bytes; otherwise, errors may occur, leading to incorrect character representations. Using the correct encoding and decoding scheme on both sides is crucial to avoid data corruption or misinterpretation.

[Image: decoding with an incorrect vs. the correct scheme]
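Continuing the same hypothetical 'ă' example, the bytes round-trip cleanly when decoded with the scheme that produced them, and turn into mojibake, or fail outright, when decoded with the wrong one:

data = "ă".encode("utf-8")       # b'\xc4\x83'
print(data.decode("utf-8"))      # 'ă'  -- correct scheme: original text restored
print(data.decode("cp1252"))     # 'Äƒ' -- wrong scheme: mojibake
try:
    data.decode("ascii")         # wrong scheme that cannot even finish
except UnicodeDecodeError as err:
    print(err)                   # "'ascii' codec can't decode byte 0xc4 ..."

This is the kind of garbled output described in the introduction: the bytes are intact, but they are being interpreted with the wrong scheme.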

Conclusion

This article has outlined the basics of Unicode, encoding, and decoding. In future posts, we will delve deeper into how to handle string operations correctly and safely, including normalizing, comparing and sorting accented strings. Thank you for reading, and stay tuned for more!

