Unicode vs ASCII: What Every Developer Should Know
Text encoding bugs — garbled characters, truncated strings, database failures — are predictable once you understand how ASCII, Unicode, and UTF-8 relate.
ASCII: The 128-Character Foundation
ASCII (1963) maps 128 characters to values 0–127, fitting in 7 bits. UTF-8 is backward compatible with ASCII — any valid ASCII file is valid UTF-8.
A = 65, a = 97, 0 = 48, space = 32, newline = 10, tab = 9
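These values are easy to check for yourself. The following is a minimal Python sketch; the names `ascii_values`, `raw`, and `is_ascii` are illustrative, not from any library.

```python
# Look up the ASCII values listed above with the built-in ord().
ascii_values = {ch: ord(ch) for ch in ["A", "a", "0", " ", "\n", "\t"]}
print(ascii_values)

# Pure-ASCII bytes decode to the same text as ASCII and as UTF-8,
# which is what "backward compatible" means in practice.
raw = b"Hello, world!\n"
same_text = raw.decode("ascii") == raw.decode("utf-8")

# Every ASCII byte fits in 7 bits, so its value is below 128.
is_ascii = all(byte < 128 for byte in raw)
print(same_text, is_ascii)
```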
Unicode: A Universal Character Set
Unicode assigns a unique code point (written U+XXXX) to each character it covers, with the goal of including every writing system. Unicode 15.1 defines over 149,000 characters. Unicode defines characters, not how they are stored as bytes; that is the job of an encoding such as UTF-8, UTF-16, or UTF-32.
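The split between a code point and its stored bytes can be seen by encoding one character several ways. This is a small Python sketch; `label` and `encodings` are names chosen for illustration.

```python
ch = "é"

# The code point is a plain integer, independent of any encoding.
code_point = ord(ch)
label = f"U+{code_point:04X}"

# The same code point becomes different byte sequences per encoding.
# The "-le" variants are little-endian and emit no byte order mark.
encodings = {name: ch.encode(name) for name in ("utf-8", "utf-16-le", "utf-32-le")}

print(label)
for name, data in encodings.items():
    print(f"{name:10} {data.hex(' ')} ({len(data)} bytes)")
```

One character, one code point, three different byte lengths: 2, 2, and 4 bytes respectively.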
UTF-8: The Dominant Encoding
| Code point range | Bytes | Example |
|---|---|---|
| U+0000–U+007F | 1 | A, 0, newline |
| U+0080–U+07FF | 2 | é, ñ, ü |
| U+0800–U+FFFF | 3 | 中, € |
| U+10000–U+10FFFF | 4 | 😀, 🎉 |
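The table translates directly into a range check. The sketch below uses a hypothetical helper, `utf8_length`, and compares it with what Python's own encoder produces for the example characters.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a code point, per the table above.

    Note: surrogate code points (U+D800 to U+DFFF) fall in the 3-byte range
    but are not valid in UTF-8; this sketch does not reject them.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# Cross-check the helper against the real encoder.
for ch in ["A", "é", "中", "€", "😀"]:
    actual = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} {utf8_length(ord(ch))} {actual}")
```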
Why Emojis Are 4 Bytes
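Most emoji fall in the U+1F000+ range, above U+FFFF, so UTF-8 encodes each of them in 4 bytes (a few older ones, such as ☀ at U+2600, need only 3). This causes classic bugs: JavaScript's string.length returns 2 for a single emoji such as 😀, because JS strings are sequences of UTF-16 code units and code points above U+FFFF take two of them, and MySQL's legacy utf8 charset cannot store these characters at all.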
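The three different "lengths" of one emoji can be computed side by side. This Python sketch imitates the UTF-16 count rather than running JavaScript; `utf16_code_units` and `needs_utf8mb4` are hypothetical helper names.

```python
def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units, the unit a UTF-16 based string length reports."""
    return len(text.encode("utf-16-le")) // 2


def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF and so needs 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


emoji = "😀"
code_points = len(emoji)              # Python counts code points
code_units = utf16_code_units(emoji)  # UTF-16 needs a surrogate pair
utf8_bytes = len(emoji.encode("utf-8"))

print(code_points, code_units, utf8_bytes)
print(needs_utf8mb4("plain text"), needs_utf8mb4("party 🎉"))
```

A check like `needs_utf8mb4` is one way to flag input before it reaches a 3-byte-only column, though fixing the column charset is the more robust option.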
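Use utf8mb4, not utf8. MySQL's utf8 (an alias for utf8mb3) stores at most 3 bytes per character, so it cannot hold anything above U+FFFF, including most emoji. Depending on the SQL mode, the write either fails with an error or the value is truncated or mangled.
Common Encoding Bugs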
| Symptom | Cause | Fix |
|---|---|---|
| Â£ instead of £ | UTF-8 bytes decoded as Latin-1 | Declare charset=utf-8 in the Content-Type HTTP header |
| Emoji stored as ???? | MySQL utf8 (3-byte) column or connection charset | ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8mb4, and connect with utf8mb4 |
| Wrong string length for emoji | JS counts UTF-16 code units | Use [...str].length to count code points |
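The first row of the table is easy to reproduce, which helps when diagnosing it in the wild. This is a minimal Python sketch: it encodes the pound sign as UTF-8, decodes the bytes with the wrong charset, and then reverses the mistake. The repair step only works when the text was mis-decoded exactly once and no bytes were lost along the way.

```python
original = "£"

# Correct bytes, wrong decoder: each UTF-8 byte becomes its own Latin-1 character.
utf8_bytes = original.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")

# Reverse the mistake: recover the raw bytes, then decode them as UTF-8.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(utf8_bytes.hex(" "), mojibake, repaired)
```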