Unicode vs ASCII: What Every Developer Should Know

Text encoding bugs — garbled characters, truncated strings, database failures — are predictable once you understand how ASCII, Unicode, and UTF-8 relate.

ASCII: The 128-Character Foundation

ASCII (1963) maps 128 characters to values 0–127, fitting in 7 bits. UTF-8 is backward compatible with ASCII — any valid ASCII file is valid UTF-8.

A=65   a=97   0=48   space=32   newline=10   tab=9
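The values above can be checked directly. The following is a minimal Python sketch; the names `ascii_values`, `same_bytes`, and `is_ascii` are illustrative and not part of any library.

```python
# Look up the ASCII values listed above with the built-in ord().
ascii_values = {ch: ord(ch) for ch in ["A", "a", "0", " ", "\n", "\t"]}

# Pure-ASCII text produces identical bytes under the ASCII and UTF-8 codecs,
# which is what "backward compatible" means in practice.
text = "Hello, world"
same_bytes = text.encode("ascii") == text.encode("utf-8")


def is_ascii(data: bytes) -> bool:
    """Return True when every byte is in the 7-bit range 0-127."""
    return all(b < 128 for b in data)


print(ascii_values["A"], ascii_values["a"], ascii_values["0"])
print(same_bytes)
print(is_ascii("é".encode("utf-8")))
```

A byte of 128 or higher is a quick signal that the data is not plain ASCII and that the encoding has to be known before decoding.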

Unicode: A Universal Character Set

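Unicode assigns a unique code point (written U+XXXX) to each character it encodes, with the goal of covering every writing system. Unicode 15.1 defines over 149,000 characters within a code space of 1,114,112 code points (U+0000–U+10FFFF). Unicode defines characters, not how they are stored; that is the job of an encoding such as UTF-8, UTF-16, or UTF-32.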
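To make the split between a code point and its stored bytes concrete, here is a small Python sketch. The variable names are illustrative; the codecs are the standard ones shipped with Python.

```python
ch = "é"

# The code point is an abstract number, independent of any storage format.
code_point = ord(ch)
label = f"U+{code_point:04X}"

# The same code point becomes different byte sequences under different encodings.
utf8_bytes = ch.encode("utf-8")       # 2 bytes
utf16_bytes = ch.encode("utf-16-le")  # 2 bytes (one 16-bit code unit)
utf32_bytes = ch.encode("utf-32-le")  # 4 bytes (one 32-bit code unit)

print(label, utf8_bytes.hex(), utf16_bytes.hex(), utf32_bytes.hex())
```

The `-le` codec variants are used here only to avoid the byte order mark that the plain `utf-16` and `utf-32` codecs prepend.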

UTF-8: The Dominant Encoding

Code point range     Bytes   Example
U+0000–U+007F        1       A, 0, newline
U+0080–U+07FF        2       é, ñ, ü
U+0800–U+FFFF        3       中, €
U+10000–U+10FFFF     4       😀, 🎉
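The range boundaries in the table translate into a short lookup. This is a sketch rather than a full encoder; `utf8_length` is a hypothetical helper name, and it does not reject surrogate code points (U+D800–U+DFFF), which cannot be encoded as UTF-8.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a code point, per the table above."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("code point is outside the Unicode range")


# Compare the table-based prediction with what Python's encoder produces.
for ch in ["A", "é", "中", "€", "😀"]:
    predicted = utf8_length(ord(ch))
    actual = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: predicted={predicted} actual={actual}")
```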

Why Emojis Are 4 Bytes

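Most emoji fall in the U+1F000+ range, above U+FFFF, so UTF-8 needs 4 bytes for each. This causes classic bugs: JavaScript's string.length returns 2 for one emoji (JS strings are sequences of UTF-16 code units, and a code point above U+FFFF takes a surrogate pair of two units), and MySQL's utf8 charset cannot store emoji.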
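The three different "lengths" of a single emoji can be compared side by side. The sketch below uses Python, whose `len()` counts code points, and emulates the JavaScript count by encoding to UTF-16; `utf16_length` is an illustrative helper, not a standard function.

```python
def utf16_length(s: str) -> int:
    """Count UTF-16 code units, which is what JavaScript's string.length reports."""
    return len(s.encode("utf-16-le")) // 2


emoji = "😀"  # U+1F600

code_points = len(emoji)               # Python counts code points
utf16_units = utf16_length(emoji)      # surrogate pair: two 16-bit units
utf8_size = len(emoji.encode("utf-8"))  # bytes on disk or on the wire

print(code_points, utf16_units, utf8_size)
```

Which count is the right one depends on the question: storage limits care about bytes, JavaScript indexing cares about UTF-16 code units, and neither matches what a user perceives as one character when emoji are combined into sequences.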

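MySQL: Use utf8mb4, not utf8. MySQL's utf8 (an alias for utf8mb3) stores at most 3 bytes per character, so it cannot represent anything above U+FFFF, including all emoji. Depending on the SQL mode and connection settings, inserting such a character either fails with an error or leaves the value truncated or replaced with ?.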
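While a migration is pending, an application can flag input that a 3-byte charset cannot hold. This is a minimal sketch under that assumption; `needs_utf8mb4` is a hypothetical helper and does not talk to MySQL at all.

```python
def needs_utf8mb4(s: str) -> bool:
    """Return True if the string contains a code point above U+FFFF.

    Such code points take 4 bytes in UTF-8 and therefore do not fit in
    MySQL's 3-byte utf8 (utf8mb3) charset.
    """
    return any(ord(ch) > 0xFFFF for ch in s)


print(needs_utf8mb4("Price: €10"))  # 3-byte characters are fine
print(needs_utf8mb4("Party 🎉"))     # emoji requires utf8mb4
```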

Common Encoding Bugs

Symptom                         Cause                         Fix
Â£ instead of £                 UTF-8 decoded as Latin-1      Declare charset=UTF-8 in the Content-Type HTTP header
Emoji stored as ????            MySQL utf8                    ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4
Wrong string length for emoji   JS counts UTF-16 code units   Use [...str].length (counts code points)
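The Latin-1 mix-up is easy to reproduce, which also shows how to undo it when the garbled text has not been altered further. A short Python sketch, with illustrative variable names:

```python
original = "£"

# UTF-8 encodes the pound sign as two bytes: 0xC2 0xA3.
raw = original.encode("utf-8")

# Decoding those bytes as Latin-1 turns each byte into its own character.
garbled = raw.decode("latin-1")

# Reversing the wrong step recovers the text, assuming nothing else changed it.
repaired = garbled.encode("latin-1").decode("utf-8")

print(raw.hex(), garbled, repaired)
```

The repair only works when every garbled character still maps back to its original byte; once the text has been re-saved through a lossy conversion, the original bytes may be gone.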