Let’s keep in touch! Join me on the Javier Tiniaco Leyba newsletter 📩

From ASCII to Unicode: How Computers Understand Text

[Figure: Unicode character set encoding and decoding representation with different character sets]

Every time you type a letter, send an emoji, or read text online, you’re interacting with one of the most foundational — yet invisible — layers of computing: character representation. It’s what ensures your name appears correctly on a web form, or your code doesn’t break because of a hidden symbol.

But long before the world agreed on how to handle diverse scripts and symbols, computers spoke very different “text languages.” Understanding how modern systems evolved from ASCII to Unicode, and what UTF-8, UTF-16, or UTF-32 actually mean, helps any developer, writer, or curious mind appreciate the true complexity behind simple text.

You’ll grasp the concepts in this post better if you read, understand, and experiment with this code, where you’ll see them in practice.

Understanding the Basics

What is a Character?

A character is an abstract symbol — a letter, number, punctuation mark, or even an emoji. The letter “A,” the digit “3,” and the smiley “😊” are all characters in this sense. Computers can’t store “letters” as we see them, so they use numbers to represent them.

What is a Character Set (Repertoire)?

A character set (or character repertoire) is the complete collection of symbols that a system can represent. For example, one system might support only English letters, while another supports thousands of scripts and emojis.

What is a Code Point?

Each character in a set is assigned a unique numerical identifier called a code point. Think of it like a dictionary where every symbol has a page number. For example, in ASCII, “A” corresponds to code point 65. In Unicode, “😊” corresponds to U+1F60A.
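A quick sketch in Python (any recent Python 3): `ord()` maps a character to its code point and `chr()` maps a code point back to its character.

```python
# ord() gives a character's code point; chr() is the inverse.
print(ord("A"))              # 65
print(chr(65))               # A
print(f"U+{ord('😊'):04X}")  # U+1F60A
```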

Character vs. Glyph

A character is an abstract concept, a symbol representing meaning. A glyph, on the other hand, is how that character is visually rendered on your screen or printed page.

  • For example, the letter “A” is a character, while “A” in Times New Roman font vs. “A” in Arial font are different glyphs of the same character.
  • The same character can have multiple glyphs due to font design, ligatures, or stylistic variations.
  • Another example: in Arabic, the letter “م” (meem) changes shape depending on its position in a word — all are glyphs of the same character.

Unicode assigns code points to characters, not to glyphs. How a glyph appears is handled separately by rendering engines and fonts.

ASCII: The Original Character Set

The Birth of ASCII

In the early 1960s, computers primarily spoke English. The American Standard Code for Information Interchange (ASCII) became the first widely adopted standard, using 7 bits (values from 0 to 127) to represent 128 characters.

Code Points in ASCII

ASCII maps uppercase and lowercase letters, digits, and control codes to specific numbers:

  • ‘A’ = 65
  • ‘a’ = 97
  • Space = 32
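These mappings are easy to verify in Python, where encoding a string as ASCII yields exactly one byte per character, with the byte values matching the code points above.

```python
# Each ASCII character is exactly one byte, and the byte values
# are the code points listed above.
text = "Hello"
data = text.encode("ascii")
print(len(data))                     # 5, one byte per character
print(list(data))                    # [72, 101, 108, 108, 111]
print(ord("A"), ord("a"), ord(" "))  # 65 97 32
```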

Advantages of ASCII

  • Compact and simple: Only one byte per character.
  • Compatible: Easy to transmit and interpret by early systems.
  • Standardized: Reduced communication errors between computers.

Limitations of ASCII

ASCII’s 128-character limit excluded accented characters (like “é”), non-Latin scripts, and special symbols used worldwide. Each region eventually extended ASCII differently to handle local symbols — leading to a messy landscape of incompatible encodings.

Other Notable Pre-Unicode Character Sets

  • ISO‑8859‑1 (Latin‑1): Added Western European symbols.
  • Windows‑1252: Microsoft’s extended Latin‑1 version used in Windows.
  • Shift‑JIS, GB2312, Big5: Region-specific systems for Japanese and Chinese text.

Each solved a local problem but broke global compatibility.

Unicode: A Universal Character Set

The Need for Unicode

By the 1990s, the internet was connecting a multilingual world. The old regional systems couldn’t coexist cleanly — a message in one encoding could be misread by another system entirely. Unicode emerged to unify global writing systems under one standard.

The Structure of Unicode

Unicode defines over 140,000 characters across all major scripts, mathematical symbols, and emojis. Each symbol is assigned:

  • A code point (e.g. “A” = U+0041, “é” = U+00E9).
  • Optionally, metadata like category, direction, or script.

Examples of Unicode Code Points

| Character | Name | Code Point |
|-----------|------|------------|
| A | Latin Capital Letter A | U+0041 |
| é | Latin Small Letter E with Acute | U+00E9 |
| 😊 | Smiling Face with Smiling Eyes | U+1F60A |
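Python’s standard library exposes these assignments directly; `unicodedata.name()` recovers the official name for each character in the table.

```python
import unicodedata

# Print the code point and official Unicode name for each character.
for ch in "Aé😊":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```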

Advantages of Unicode

  • Universal: One system for all languages and symbols.
  • Consistent: Every character has a fixed identity.
  • Extensible: New code points can be added over time.

Challenges and Misconceptions

Unicode defines characters and code points, but not how those code points are stored as bytes. That’s where encodings like UTF-8 or UTF-16 come in — and is the source of frequent confusion among developers.

Unicode Planes

Unicode’s space of possible code points runs from U+0000 to U+10FFFF and is divided into 17 planes, each containing 65,536 points. They’re like layers of an address space.

  • Basic Multilingual Plane (BMP): U+0000 to U+FFFF — contains almost all modern scripts (Latin, Cyrillic, Greek, etc.), digits, punctuation, and a large portion of symbols.
  • Supplementary Planes: Used for historic scripts, lesser-used languages, emojis, and advanced symbols.

Examples:

  • U+0041 → “A” (Latin Capital Letter A) — BMP
  • U+1F600 → 😀 (Grinning Face) — Supplementary Multilingual Plane (SMP)

Scripts used every day tend to fall within the BMP; characters in supplementary planes often require more bytes (because their code points are larger).
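Since each plane holds 65,536 (0x10000) code points, a character’s plane number is just its code point divided by 0x10000 — a small helper makes this concrete (the `plane` function is ours, not part of any library).

```python
# A code point's plane number is its value divided by 0x10000 (65,536).
def plane(ch: str) -> int:
    return ord(ch) // 0x10000

print(plane("A"))   # 0, the Basic Multilingual Plane
print(plane("😀"))  # 1, the Supplementary Multilingual Plane
```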

From Character Sets to Encodings

What is a Character Encoding?

A character encoding defines how code points are represented as bytes that computers can actually store or transmit. While the character set is the abstract idea of which symbols exist, the encoding is the concrete method of writing them down in binary.

Analogy:

  • Character set: The alphabet and spelling rules.
  • Encoding: The handwriting system or font that writes the letters.

Why Encoding Matters

When two systems use different encodings, they may interpret the same sequence of bytes differently — producing unreadable text (e.g., “Ã©” instead of “é”).

Unicode Encodings (UTF Family)

“UTF” stands for Unicode Transformation Format — the collection of encoding rules that turn Unicode code points into bytes.

  • UTF-8: Uses 1–4 bytes per character. ASCII characters stay 1 byte, making it efficient for English text and backward compatible with older systems.
  • UTF-16: Uses 2 or 4 bytes. Efficient for Asian scripts but less for English text. Used internally by Windows and Java.
  • UTF-32: Uses exactly 4 bytes per character. Simplifies processing but doubles or quadruples memory use compared to UTF-8.

Comparison Between UTF Variants

| Encoding | Byte Length | ASCII Compatible | Typical Use | Advantages | Drawbacks |
|----------|-------------|------------------|-------------|------------|-----------|
| UTF-8 | 1–4 bytes | Yes | Web, Linux, APIs | Compact, universal | Slightly slower decoding |
| UTF-16 | 2 or 4 bytes | No (code points match, bytes differ) | Windows, Java | Efficient for non-Latin scripts | Larger for ASCII text |
| UTF-32 | Fixed 4 bytes | No | Internal processing | Simple logic | High memory use |
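You can see the trade-offs directly by encoding one string under each scheme (the “-le” codec variants are used here so Python adds no BOM to the output):

```python
# Byte counts for the same string under each UTF encoding.
text = "Hola 😊"  # five ASCII characters plus one emoji
print(len(text.encode("utf-8")))      # 9  (5*1 + 4)
print(len(text.encode("utf-16-le")))  # 14 (5*2 + 4)
print(len(text.encode("utf-32-le")))  # 24 (6*4)
```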

Surrogate Pairs (UTF‑16 Detail)

UTF‑16 uses 16-bit code units. Characters in the Basic Multilingual Plane (BMP) fit neatly into a single 16-bit value. However, characters beyond the BMP (like many emojis or rare symbols) cannot fit into just two bytes — instead, they are represented as a surrogate pair, meaning two 16-bit units (4 bytes total).

Example:

  • Emoji “😊” = U+1F60A.
    • In UTF‑16, this becomes a surrogate pair: D83D DE0A (two 16-bit values).
    • In UTF‑8, it’s encoded as four bytes: F0 9F 98 8A.

If you’ve ever worked with JavaScript’s string.length and noticed that "😊".length returns 2, that’s because UTF‑16 counts code units, not Unicode characters.

UTF‑16’s surrogate pairs are a reminder that not every visible symbol equals one unit — this is why developers need to think carefully about encoding when measuring or slicing strings.
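The emoji example above can be checked in a few lines of Python. Note the contrast with JavaScript: Python 3’s `len()` counts code points, so the surrogate pair is only visible when you look at the UTF-16 bytes.

```python
# One emoji: one code point, two UTF-16 code units, four UTF-8 bytes.
s = "😊"
print(len(s))                          # 1 (Python 3 counts code points)
print(s.encode("utf-16-be").hex(" "))  # d8 3d de 0a, the surrogate pair
print(s.encode("utf-8").hex(" "))      # f0 9f 98 8a
```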

Code Unit vs. Code Point

A code point is the logical identifier (e.g., U+1F60A), whereas a code unit is the smallest storage chunk used by an encoding (8 bits for UTF‑8, 16 for UTF‑16, 32 for UTF‑32).

Example: In UTF‑16, “😊” = two 16-bit code units (D83D DE0A), one code point (U+1F60A), and one character (😊).

Keeping these distinctions clear helps when working in programming environments that expose either code units or code points, especially in string length calculations or slicing.

Endianness in UTF‑16 and UTF‑32

Computers store multi-byte data with different byte orders:

  • Big‑endian: most significant byte first.
  • Little‑endian: least significant byte first.

UTF‑16 and UTF‑32 may use either, so the same character can have different byte sequences depending on machine architecture.

Example:

  • Character “A” = U+0041
    • UTF‑16 big‑endian: 00 41
    • UTF‑16 little‑endian: 41 00

That’s why you’ll see file encodings like UTF‑16LE or UTF‑16BE, explicitly marking byte order.

Endianness doesn’t affect the logical code points, but it matters for consistent cross-platform interpretation of binary text data.
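Python’s explicit byte-order codecs reproduce the “A” example above exactly:

```python
# The same code point "A" (U+0041), two byte orders.
print("A".encode("utf-16-be").hex(" "))  # 00 41
print("A".encode("utf-16-le").hex(" "))  # 41 00
```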

Byte Order Mark (BOM)

A Byte Order Mark (BOM) is a special byte sequence placed at the start of a text file to signal which encoding (and potentially byte order) the file uses. It’s optional but can be helpful in ambiguous contexts.

Examples:

| Encoding | BOM Bytes | Meaning |
|----------|-----------|---------|
| UTF‑8 | EF BB BF | File is UTF‑8 encoded |
| UTF‑16 LE | FF FE | UTF‑16 little‑endian |
| UTF‑16 BE | FE FF | UTF‑16 big‑endian |

Some text editors automatically add a BOM; others don’t. It can also cause subtle bugs, for instance, if a file with a UTF‑8 BOM is fed into a parser not expecting one. BOMs help detect encoding but can introduce invisible bytes that confuse systems expecting raw ASCII — so use them carefully.
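A small sketch of both behaviors in Python: the `utf-8-sig` codec writes the BOM on encode and strips it on decode, while plain `utf-8` keeps it as an invisible U+FEFF character — exactly the kind of stray byte that confuses parsers.

```python
# "utf-8-sig" adds the UTF-8 BOM when encoding...
print("A".encode("utf-8-sig").hex(" "))       # ef bb bf 41
# ...and strips it when decoding.
print(b"\xef\xbb\xbfA".decode("utf-8-sig"))   # A
# Plain "utf-8" decoding keeps the BOM as an invisible character:
print(len(b"\xef\xbb\xbfA".decode("utf-8")))  # 2, not 1
```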

Other Famous Encodings (Non-UTF Systems)

Before UTF encodings unified text processing, several other encodings tied specific byte sequences to specific scripts:

  • ISO‑8859‑1: Western Europe.
  • CP‑1252: Similar to ISO‑8859‑1 but adds more printable characters.
  • Shift‑JIS: Japanese.
  • GB2312 / Big5: Simplified and Traditional Chinese, respectively.

These older encodings each defined both the character set and the byte mapping — tightly coupling them. Unicode broke that link, defining one global character set (Unicode) and multiple possible binary encoding methods (UTF-8/16/32).

Unicode vs. UTF vs. Legacy Encodings — Summary

  • Character set (e.g., Unicode): List of all possible characters and their code points.
  • Encoding (e.g., UTF-8): The method for storing those code points as bytes.
  • Legacy encodings: Combine both in one region-specific design.
  • Modern Unicode: Decouples meaning (code point) from representation (encoding).

Encoding and Decoding Process

Encoding and decoding are inverse processes.

  • Encoding: Taking a sequence of characters and converting them into bytes.
  • Decoding: Taking bytes and converting them back to characters.

Example with “é” (U+00E9):

  • Unicode code point: U+00E9
  • UTF‑8 encoding: C3 A9
  • ISO‑8859‑1 encoding: E9

If text saved in UTF‑8 (bytes C3 A9) is later read as ISO‑8859‑1, it displays as “Ã©” (mojibake); read the other way, the lone ISO‑8859‑1 byte E9 isn’t even valid UTF‑8, and decoding fails.

Both sender and receiver must agree on the same encoding scheme — otherwise, text corruption occurs during decoding.
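Both failure modes are easy to reproduce in Python by deliberately mismatching the encode and decode steps:

```python
# Mojibake: encode as UTF-8, then decode as Latin-1.
print("é".encode("utf-8"))                    # b'\xc3\xa9'
print("é".encode("utf-8").decode("latin-1"))  # Ã©

# The reverse mismatch fails outright: E9 alone is not valid UTF-8.
try:
    "é".encode("latin-1").decode("utf-8")
except UnicodeDecodeError as err:
    print("decode failed:", err.reason)
```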

Storage Efficiency and Trade-offs

Different encodings affect memory use and efficiency — and that impacts performance across systems, databases, and networks.

Example trade‑offs:

  • UTF‑8:
    • For English text: highly efficient (1 byte per character).
    • For texts with many Asian characters: may use 3 bytes per symbol.
  • UTF‑16:
    • Common for East Asian languages; often more compact there.
    • But less efficient for English data.
  • UTF‑32:
    • Always predictable (4 bytes per character).
    • Wastes space but simplifies indexing.

Practical Example:
If you store 1 million ASCII characters:

  • In UTF‑8 → 1 MB.
  • In UTF‑32 → 4 MB.

That’s why UTF‑8 dominates the web — it’s a universal standard and memory‑efficient for Latin-based languages. Choosing an encoding affects not just correctness but also performance and storage costs, particularly in large-scale systems or data pipelines.
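The 1 MB vs. 4 MB figure above can be verified directly (the “-le” variant keeps a BOM out of the byte count):

```python
# One million ASCII characters under two encodings.
text = "a" * 1_000_000
print(len(text.encode("utf-8")))      # 1,000,000 bytes, about 1 MB
print(len(text.encode("utf-32-le")))  # 4,000,000 bytes, about 4 MB
```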

Why All This Matters

The Impact on Modern Tech

Text travels through operating systems, databases, and networks daily. Using a universal and consistent encoding like UTF-8 enables you to store multilingual data safely and ensures web pages, APIs, and files display correctly worldwide.

Common Encoding Problems

If you see gibberish like “Ã©” instead of “é,” your data was likely encoded in one format (UTF‑8) and decoded in another (perhaps Latin‑1). This mismatch, called mojibake, highlights why consistent encoding is essential at every step of data handling.

Best Practices for Developers

  • Store and serve text in UTF‑8 by default.
  • Explicitly declare encoding in HTML, database schemas, and source code files.
  • Understand that “Unicode” (the character set) and “UTF‑8” (the encoding) are related but distinct.
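As a small sketch of the first two practices, Python’s `open()` lets you declare the encoding explicitly instead of relying on platform defaults (the file name `notes.txt` is just an illustrative placeholder):

```python
# Declare the encoding explicitly at the I/O boundary.
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write("café 😊")

with open("notes.txt", "r", encoding="utf-8") as f:
    print(f.read())  # café 😊
```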

Closing Thoughts

From the simplicity of ASCII to the universality of Unicode, humanity’s written languages have finally found a common digital home. Every character you see — from “A” to “😊” — travels as carefully encoded bytes across the internet, decoded faithfully into meaning on your screen.

Understanding this invisible infrastructure not only prevents bugs; it also deepens your appreciation of the linguistic and cultural complexity that modern technology must handle gracefully.
