ASCII vs UTF-8: Why Emojis Break Things

I used to think that a "character" was just a "byte." If I had a string of 10 characters, it should be 10 bytes long.

That works fine if you only use the English alphabet. But the moment someone adds an emoji (like ⚡) or a character from another language, the math breaks.

Let's look at why character encoding is so confusing.

The Big Picture: Mapping Numbers to Letters

Computers only understand numbers. To show text, we need a "map" that says "this number = this letter."

[ Number ] --( Encoding )--> [ Character ]
    65     ---------------->      A
    66     ---------------->      B
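Python exposes this map directly through its built-in `ord` and `chr` functions. A minimal sketch (the variable names are just for illustration):

```python
# ord() gives the number behind a character; chr() goes the other way.
number_for_a = ord("A")
letter_for_66 = chr(66)

print(number_for_a)   # 65
print(letter_for_66)  # B
```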

ASCII: The Old School (7 bits)

Back in the 60s, we used ASCII. It used 7 bits, which meant it could only represent 128 characters (0 to 127). It had uppercase, lowercase, numbers, and some symbols.

It was simple: 1 character = 1 byte.
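Here is a small sketch of that rule in Python, plus what happens when a character falls outside the 128-entry table. The sample string and variable names are my own, chosen only for illustration:

```python
text = "Hello"
ascii_bytes = text.encode("ascii")

# Every ASCII character fits in one byte, and every byte value is below 128.
print(len(text), len(ascii_bytes))        # 5 5
print(all(b < 128 for b in ascii_bytes))  # True

# Anything outside the 128-character table cannot be encoded as ASCII at all.
try:
    "⚡".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(ascii_failed)  # True
```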

UTF-8: The New Standard (Variable length)

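Today, we use UTF-8. It's brilliant because it's "backward compatible" with ASCII (the first 128 characters are encoded as the exact same single bytes), but it can encode every Unicode character, and Unicode defines over 140,000 of them.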

How? It uses variable length:

  • Standard English letters still use 1 byte.
  • Symbols like © use 2 bytes.
  • Emojis use 3 or 4 bytes (⚡ takes 3).
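You can check these sizes yourself by encoding each character and counting the bytes. This is a rough sketch; the 😀 sample is my own addition to show a 4-byte case:

```python
samples = ["A", "©", "⚡", "😀"]

# Encode each character as UTF-8 and count the resulting bytes.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X} {ch} -> {n} byte(s)")

# Backward compatibility: plain ASCII text produces identical bytes in both encodings.
ascii_same = "A".encode("utf-8") == "A".encode("ascii")
print(ascii_same)  # True
```

The four samples come out as 1, 2, 3, and 4 bytes respectively.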


Wait, but why is this a problem?

If your code assumes that the length of a string equals its number of bytes, you'll run into bugs:

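# Python 3 counts code points; Go's len() and C's strlen() count bytes.
s = "⚡"
print(len(s))                  # 1 (characters)
print(len(s.encode("utf-8")))  # 3 (bytes)
# In a byte-counting language, the "length" of ⚡ is 3, not 1!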

If you try to "cut" a string in the middle of a multi-byte character, you end up with "mojibake"—those weird `` symbols you see on old websites.


Common gotchas

  • I always forget that "UTF-8" and "Unicode" aren't the same thing. Unicode is the list of characters; UTF-8 is the way we encode that list into bytes.
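  • Watch out for database limits: if your database column is VARCHAR(10), does it mean 10 bytes or 10 characters? It depends on the database: PostgreSQL and MySQL count characters, while Oracle's VARCHAR2 counts bytes by default. It matters!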
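Both gotchas are easy to see in a few lines of Python. This is only an illustration: the ten-emoji value is a made-up example of something that might go into a VARCHAR(10) column, not real data.

```python
ch = "⚡"

# Unicode: the character's number in the big list.
code_point = ord(ch)
print(f"U+{code_point:04X}")  # U+26A1

# Encodings: different ways of turning that one number into bytes.
encodings = {name: ch.encode(name) for name in ("utf-8", "utf-16-le", "utf-32-le")}
for name, data in encodings.items():
    print(name, data.hex(), len(data))

# A hypothetical value for a VARCHAR(10) column: 10 characters, but 30 bytes.
value = "⚡" * 10
char_count = len(value)
byte_count = len(value.encode("utf-8"))
print(char_count, byte_count)  # 10 30
```

The same code point takes 3 bytes in UTF-8, 2 in UTF-16, and 4 in UTF-32, which is why "Unicode" alone never tells you how big a string is.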

Try it yourself

Open your terminal and check the byte size of different strings:

echo -n "A" | wc -c
# Output: 1

echo -n "⚡" | wc -c
# Output: 3 (in a UTF-8 terminal, where ⚡ is sent as 3 bytes)

— Nadeem 🔤