Unicode vs ASCII: What Every Developer Should Know

By WordCaseFix · 8 min read · Developer

Text encoding bugs — garbled characters, truncated strings, database failures — are predictable once you understand how ASCII, Unicode, and UTF-8 relate.

ASCII: The 128-Character Foundation

ASCII (first published in 1963) maps 128 characters to the values 0–127, so each character fits in 7 bits. UTF-8 is backward compatible with ASCII: any valid ASCII file is also valid UTF-8, byte for byte.

A=65 a=97 0=48 space=32 newline=10 tab=9
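
These values are easy to verify. The following is a minimal sketch in Python; the sample string and variable names are illustrative and not part of any library.

```python
# Check the ASCII values listed above with the built-in ord().
samples = {"A": 65, "a": 97, "0": 48, " ": 32, "\n": 10, "\t": 9}
for char, expected in samples.items():
    assert ord(char) == expected

# Every ASCII byte is below 128, so ASCII-encoded bytes
# decode unchanged when read as UTF-8.
ascii_bytes = "Hello, world!\n".encode("ascii")
assert all(b < 128 for b in ascii_bytes)
assert ascii_bytes.decode("utf-8") == "Hello, world!\n"
```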

Unicode: A Universal Character Set

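Unicode assigns a unique code point, written U+XXXX with four to six hexadecimal digits, to each character it covers, with the goal of covering every writing system. Unicode 15.1 defines over 149,000 characters within a code space that runs from U+0000 to U+10FFFF. Unicode defines characters, not how they are stored as bytes; that is the job of an encoding such as UTF-8, UTF-16, or UTF-32.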
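
To make the split between a code point and its stored bytes concrete, here is a small Python sketch. It takes one character and serializes it with three encodings; the choice of the euro sign and the variable names are assumptions made for illustration.

```python
char = "\u20ac"  # EURO SIGN

# The code point is an abstract number, independent of any byte layout.
code_point = ord(char)
print(f"U+{code_point:04X}")  # prints U+20AC

# Each encoding serializes that same number into different bytes.
utf8_bytes = char.encode("utf-8")        # 3 bytes: E2 82 AC
utf16_bytes = char.encode("utf-16-be")   # 2 bytes: 20 AC
utf32_bytes = char.encode("utf-32-be")   # 4 bytes: 00 00 20 AC

for name, data in [("UTF-8", utf8_bytes), ("UTF-16", utf16_bytes), ("UTF-32", utf32_bytes)]:
    print(name, data.hex(" "))
```

The big-endian codec names are used so that Python does not prepend a byte order mark to the output.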

UTF-8: The Dominant Encoding

Code point range	Bytes	Example
U+0000–U+007F	1	A, 0, newline
U+0080–U+07FF	2	é, ñ, ü
U+0800–U+FFFF	3	中, €
U+10000–U+10FFFF	4	😀, 🎉
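
The ranges in the table translate directly into a lookup. The sketch below uses a hypothetical helper, utf8_byte_count, and checks it against Python's own encoder; it is an illustration of the table, not a full UTF-8 implementation.

```python
def utf8_byte_count(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one code point (hypothetical helper).

    Surrogate code points (U+D800 to U+DFFF) cannot be encoded in UTF-8;
    this sketch does not check for them.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# A, e-acute, a CJK ideograph, the euro sign, and an emoji.
for char in ["A", "\u00e9", "\u4e2d", "\u20ac", "\U0001F600"]:
    # The table-based count should match what the real encoder produces.
    assert utf8_byte_count(ord(char)) == len(char.encode("utf-8"))
```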

Why Emojis Are 4 Bytes

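Most emoji sit in the supplementary planes, at U+1F000 and above, so UTF-8 needs 4 bytes for each. This causes classic bugs: JavaScript's string.length returns 2 for a single such emoji (JS strings count UTF-16 code units, and a code point above U+FFFF takes a surrogate pair), and MySQL's utf8 charset cannot store them.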
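
The three different "lengths" of one emoji can be compared side by side. This Python sketch reproduces the UTF-16 count that JavaScript reports by encoding to UTF-16 and dividing by two; utf16_code_units is a hypothetical helper name.

```python
def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units, the unit JavaScript's length property reports."""
    return len(text.encode("utf-16-le")) // 2

emoji = "\U0001F600"  # grinning face, a single code point above U+FFFF

code_points = len(emoji)                # Python's len() counts code points
code_units = utf16_code_units(emoji)    # surrogate pair in UTF-16
utf8_bytes = len(emoji.encode("utf-8")) # bytes on disk or on the wire

print(code_points, code_units, utf8_bytes)  # prints 1 2 4
```

Emoji built from several code points, such as flags or family sequences, produce larger numbers for all three counts.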

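MySQL: Use utf8mb4, not utf8. MySQL's utf8 is an alias for utf8mb3, which stores at most 3 bytes per character and so cannot hold anything above U+FFFF, including most emoji. Depending on the SQL mode, the insert either fails with an "Incorrect string value" error or the value is truncated at the first 4-byte character with only a warning.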
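
While a migration to utf8mb4 is pending, an application can at least detect text that a 3-byte column cannot hold. The following Python sketch uses a hypothetical helper, find_supplementary_chars; it only inspects the string and does not talk to MySQL.

```python
from typing import List

def find_supplementary_chars(text: str) -> List[str]:
    """Return the characters above U+FFFF, which need 4 bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

comment = "Great post \U0001F389"  # ends with a party popper emoji
offenders = find_supplementary_chars(comment)
if offenders:
    # A 3-byte utf8 column cannot store these characters.
    print("needs utf8mb4:", [f"U+{ord(ch):04X}" for ch in offenders])
```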

Common Encoding Bugs

Symptom	Cause	Fix
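Â£ instead of £	UTF-8 bytes decoded as Latin-1	Declare charset=UTF-8 in the Content-Type HTTP header
Emoji stored as ???? or rejected	MySQL utf8 (3-byte) column or connection charset	ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4, and use utf8mb4 for the connection
Wrong string length for emoji	JS counts UTF-16 code units	Use `[...str].length` to count code points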
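
The first row of the table can be reproduced in a few lines. This Python sketch encodes the pound sign as UTF-8, decodes the bytes with the wrong codec to produce the garbled text, and then reverses the mistake; the variable names are illustrative.

```python
original = "\u00a3"  # POUND SIGN

# UTF-8 stores the pound sign as two bytes: C2 A3.
raw = original.encode("utf-8")

# Reading those bytes as Latin-1 turns each byte into its own character.
mojibake = raw.decode("latin-1")
print(mojibake)  # two characters: A-circumflex followed by the pound sign

# If nothing else has altered the text, the damage is reversible:
# re-encode with the wrong codec, then decode with the right one.
repaired = mojibake.encode("latin-1").decode("utf-8")
assert repaired == original
```

The repair step only works when the wrong decoding is known and no characters were lost along the way, so declaring the charset correctly at the source remains the real fix.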