ASCII vs UTF-8: Why Emojis Break Things
Why is my string length wrong? A guide to character encoding and how we went from 128 characters to over 140,000.
Sep 14, 2025 · by Nadeem Siddique
I used to think that a “character” was just a “byte.” If I had a string of 10 characters, it should be 10 bytes long.
That works fine if you only use the English alphabet. But the moment someone adds an emoji (like ⚡) or a character from another language, the math breaks.
Let’s look at why character encoding is so confusing.
The Big Picture: Mapping Numbers to Letters
Computers only understand numbers. To show text, we need a “map” that says “this number = this letter.”
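If you want to poke at this map yourself, here is a tiny sketch in Python using the built-in `ord` and `chr` functions. The variable names are just mine for illustration.

```python
# ord() looks up the number for a character; chr() goes the other way.
number_for_a = ord("A")
letter_for_66 = chr(66)
print(number_for_a)   # 65
print(letter_for_66)  # B

# The same map covers characters far beyond English.
print(hex(ord("\u26a1")))  # 0x26a1, the code point for ⚡
```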
    [ Number ] --( Encoding )--> [ Character ]
        65     ---------------->       A
        66     ---------------->       B

ASCII: The Old School (7 bits)
Back in the 60s, we used ASCII. It used 7 bits, which meant it could only represent 128 characters (0 to 127). It had uppercase, lowercase, numbers, and some symbols.
It was simple: 1 character = 1 byte.
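A quick way to see the "1 character = 1 byte" rule, and where it stops working, is to encode a string as ASCII in Python. This is only a sketch; `outside_ascii_ok` is a flag I made up for the demo.

```python
text = "Hello"
encoded = text.encode("ascii")

print(len(text), len(encoded))        # 5 5: one byte per character
print(all(b < 128 for b in encoded))  # True: every value fits in 7 bits

# Anything outside the 128-character table cannot be encoded at all.
try:
    "\u26a1".encode("ascii")  # ⚡
    outside_ascii_ok = True
except UnicodeEncodeError:
    outside_ascii_ok = False
print(outside_ascii_ok)  # False
```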
UTF-8: The New Standard (Variable length)
Today, we use UTF-8. It’s brilliant because it’s “backward compatible” with ASCII (any ASCII text is already valid UTF-8), but it can encode every character in Unicode, which is over 140,000 of them.
How? It uses variable length.
- Standard English letters still use 1 byte.
- Symbols like © use 2 bytes.
- Emojis like ⚡ use 3 bytes, and many others (like 😀) use 4.
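To check the byte counts in the list above, this Python sketch encodes one sample character from each size class. The labels in `samples` are my own, and I wrote the characters as escapes so the exact code points are unambiguous.

```python
samples = {
    "A": "A",
    "copyright sign": "\u00a9",       # ©
    "high voltage": "\u26a1",         # ⚡
    "grinning face": "\U0001F600",    # 😀
}

byte_counts = {}
for name, ch in samples.items():
    data = ch.encode("utf-8")
    byte_counts[name] = len(data)
    # Show the code point, the raw bytes in hex, and how many bytes there are.
    print(f"{name}: U+{ord(ch):04X} -> {data.hex()} ({len(data)} bytes)")

print(byte_counts)
# {'A': 1, 'copyright sign': 2, 'high voltage': 3, 'grinning face': 4}
```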
Wait, but why is this a problem?
If your code assumes that length of string = number of bytes, you’ll run into bugs:
# In some languages...s = "⚡"print(len(s))# You might expect 1, but you might get 3 or 4!If you try to “cut” a string in the middle of a multi-byte character, you end up with “mojibake”—those weird “ symbols you see on old websites.
Common gotchas
- I always forget that “UTF-8” and “Unicode” aren’t the same thing. Unicode is the list of characters; UTF-8 is the way we encode that list into bytes.
- Watch out for database limits: If your database column is VARCHAR(10), does it mean 10 bytes or 10 characters? It matters!
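Both gotchas are easy to see in a few lines of Python. In this sketch, one character has a single Unicode number but different bytes under different encodings, and the same string passes or fails a limit of 10 depending on the unit. `fits_limit` is a made-up helper, and which unit your database actually uses depends on the database and its settings.

```python
# Unicode assigns the number; the encoding decides the bytes.
ch = "\u00e9"  # é
print(ord(ch))                 # 233
print(ch.encode("utf-8"))      # b'\xc3\xa9'
print(ch.encode("utf-16-le"))  # b'\xe9\x00'


def fits_limit(text: str, limit: int, unit: str = "chars") -> bool:
    """Hypothetical check: does text fit a limit counted in chars or UTF-8 bytes?"""
    size = len(text) if unit == "chars" else len(text.encode("utf-8"))
    return size <= limit


name = "caf\u00e9 \u26a1\u26a1\u26a1"  # "café ⚡⚡⚡"
print(len(name))                      # 8 characters
print(len(name.encode("utf-8")))      # 15 bytes
print(fits_limit(name, 10, "chars"))  # True
print(fits_limit(name, 10, "bytes"))  # False
```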
Try it yourself
Open your terminal and check the byte size of different strings:
    echo -n "A" | wc -c
    # Output: 1
    echo -n "⚡" | wc -c
    # Output: 3 (on most systems)

Further reading
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets – The classic guide by Joel Spolsky.
- Kibibyte vs Kilobyte – More unit confusion!
— Nadeem 🔤