ASCII vs UTF-8: Why Emojis Break Things
I used to think that a "character" was just a "byte." If I had a string of 10 characters, it should be 10 bytes long.
That works fine if you only use the English alphabet. But the moment someone adds an emoji (like ⚡) or a character from another language, the math breaks.
Let's look at why character encoding is so confusing.
The Big Picture: Mapping Numbers to Letters
Computers only understand numbers. To show text, we need a "map" that says "this number = this letter."
ASCII: The Old School (7 bits)
Back in the 60s, we used ASCII. It used 7 bits, which meant it could only represent 128 characters (0 to 127). It had uppercase, lowercase, numbers, and some symbols.
It was simple: 1 character = 1 byte.
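You can see that one-to-one mapping for yourself. A quick sketch in Python 3:

```python
# Pure-ASCII text: one character really is one byte.
text = "Hello"
data = text.encode("ascii")

print(len(text))  # 5 characters
print(len(data))  # 5 bytes
print(data[0])    # 72 -- the ASCII code for "H"
```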
UTF-8: The New Standard (Variable length)
Today, we use UTF-8. It's brilliant because it's "backward compatible" with ASCII, yet it can represent every one of the 140,000+ characters Unicode defines.
How? It uses variable length:
- Standard English letters still use 1 byte.
- Symbols like © use 2 bytes.
- Emojis use 3 or 4 bytes (⚡ is 3; most newer emojis like 🎉 are 4).
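You can verify those sizes yourself with a small Python 3 snippet:

```python
# Each character below takes a different number of bytes in UTF-8.
for ch in ["A", "©", "⚡", "🎉"]:
    size = len(ch.encode("utf-8"))
    print(ch, "->", size, "byte(s)")  # A -> 1, © -> 2, ⚡ -> 3, 🎉 -> 4
```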
Wait, but why is this a problem?
If your code assumes that the length of a string equals its number of bytes, you'll run into bugs:
If you try to "cut" a string in the middle of a multi-byte character, you end up with "mojibake": those weird � symbols you see on old websites.
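Here's a minimal Python 3 demonstration of that failure mode, cutting the 3-byte ⚡ after 2 bytes:

```python
zap = "⚡".encode("utf-8")   # b'\xe2\x9a\xa1' -- 3 bytes
broken = zap[:2]             # cut in the middle of the character

# Strict decoding refuses the truncated sequence...
try:
    broken.decode("utf-8")
except UnicodeDecodeError as err:
    print("decode failed:", err)

# ...and lenient decoding gives you the replacement character.
print(broken.decode("utf-8", errors="replace"))  # prints "�"
```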
Common gotchas
- I always forget that "UTF-8" and "Unicode" aren't the same thing. Unicode is the list of characters; UTF-8 is the way we encode that list into bytes.
- Watch out for database limits: if your column is VARCHAR(10), does that mean 10 bytes or 10 characters? It matters!
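To make the Unicode-vs-UTF-8 distinction concrete: here's the same character as a raw code point (the Unicode "list" entry) and in two different byte encodings, in Python 3:

```python
zap = "⚡"

print(hex(ord(zap)))            # 0x26a1 -- the Unicode code point
print(zap.encode("utf-8"))      # 3 bytes -- one way to write that number down
print(zap.encode("utf-16-be"))  # 2 bytes -- another way, same character
```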
Try it yourself
Open your terminal and check the byte size of different strings:
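One way to do it, using Python 3 (run `python3` in your terminal and paste this in):

```python
# Characters vs. bytes for a few strings -- watch the numbers drift apart.
for s in ["hello", "héllo", "h⚡llo"]:
    print(s, "->", len(s), "characters,", len(s.encode("utf-8")), "bytes")
```

"hello" is 5 and 5, but "héllo" is 5 characters and 6 bytes, and "h⚡llo" is 5 characters and 7 bytes.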
Further reading
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets – The classic guide by Joel Spolsky.
- Kibibyte vs Kilobyte – More unit confusion!
— Nadeem 🔤