Unicode, Fonts & Fancy Text · January 15, 2025

What Is Unicode? (The Complete Beginner's Guide)

A clear beginner's guide to Unicode: code points, UTF-8, planes, and why text shows up correctly across devices, fonts, and languages.

By Muhammad Umair · Founder & Editor at TextKit

If you've ever pasted a ✅ emoji into a Slack message, used a fancy 𝓫𝓸𝓵𝓭 font on Instagram, or watched a Turkish friend's name arrive in your inbox intact instead of as a string of ? characters, you have already benefited from Unicode, even if you've never heard of it. Unicode is the quiet backbone of modern text. It is the reason a tweet composed in Tokyo renders correctly on a phone in Toronto.

This guide explains what Unicode actually is, why it had to be invented, how it works under the hood, and why understanding it (even at a surface level) makes you better at everything from debugging broken text to choosing fonts for a website.

The problem Unicode was invented to solve

For most of computing history, text encoding was a mess. The ASCII standard, first published in 1963 and revised in 1967 to add lowercase letters, assigned numbers to 128 characters: the English alphabet (uppercase and lowercase), digits, common punctuation, and a handful of control codes like "newline." Each character fit in 7 bits. That was fine if you only wrote English.

But the moment you needed an accented letter (é), a German ß, a Japanese kanji, or a Russian Я, ASCII broke down. Dozens of incompatible encodings sprang up to fill the gap: Latin-1 (ISO 8859-1) for Western European languages, Shift-JIS for Japanese, GB2312 for Chinese, KOI8-R for Russian. A file written in one encoding looked like garbage when opened in another. The infamous "mojibake" (text like Ã© or ï¿½) is the visible scar of an encoding mismatch.
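The mismatch described above is easy to reproduce. The following is a minimal Python sketch, assuming the text was saved as UTF-8 and then read back as Latin-1; the variable names are illustrative only.

```python
# Reproduce mojibake by decoding UTF-8 bytes with the wrong encoding.
original = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

utf8_bytes = original.encode("utf-8")    # two bytes: 0xC3 0xA9
garbled = utf8_bytes.decode("latin-1")   # each byte is read as its own character

print(garbled)       # Ã©
print(len(garbled))  # 2

# The damage can be undone only while no bytes have been lost or replaced.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

Reversing the mistake works here because Latin-1 maps every byte to a character; with other code pages some bytes are undefined and the round trip can fail.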

The Unicode Consortium was founded in 1991 to end this. Their idea was radical in its simplicity: one number for every character, in every language, forever. No more "which code page?" No more guessing games. Just a single universal mapping.

What Unicode actually is (and isn't)

Unicode is not a font. It is not a piece of software. It is not an encoding (despite what people mean when they say "save the file as Unicode").

Unicode is a standard: a published, versioned list that assigns a unique number, called a code point, to every character the consortium has chosen to encode. As of Unicode 16.0 (released September 2024), there are 154,998 assigned characters covering 168 scripts, thousands of symbols, and nearly 4,000 emoji (counting emoji sequences).

A code point is written as U+ followed by a four-to-six digit hexadecimal number. For example:

  • U+0041: Latin capital A
  • U+00E9: Latin small letter e with acute (é)
  • U+03A9: Greek capital Omega (Ω)
  • U+4E2D: CJK Unified Ideograph "middle" (中)
  • U+1F600: Grinning face emoji (😀)
  • U+1D400: Mathematical bold capital A (𝐀)
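To poke at this notation yourself, Python's built-in ord() and chr() convert between a character and its code point. This is a small sketch; the helper name u_plus is made up for illustration.

```python
def u_plus(ch: str) -> str:
    """Format a single character as U+XXXX (at least four hex digits)."""
    return f"U+{ord(ch):04X}"

# The six examples from the list above, written as escapes.
samples = ["A", "\u00e9", "\u03a9", "\u4e2d", "\U0001F600", "\U0001D400"]

for ch in samples:
    print(u_plus(ch), ch)

# chr() goes the other way: from a number back to a character.
print(chr(0x41))  # A
```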

The standard also defines rules for how characters combine (does é count as one character or as e + ´?), how text should be sorted, how bidirectional text (like Arabic mixed with English) should flow, and much more.

The 17 planes

Code points are grouped into 17 planes, each holding 65,536 code points (the range U+XX0000 to U+XXFFFF, where XX is the plane number in hexadecimal, 00 to 10). In practice, almost everything you'll ever type lives in the first three:

| Plane | Range | What's there |
|---|---|---|
| Plane 0 (BMP, Basic Multilingual Plane) | U+0000–U+FFFF | Almost all modern scripts: Latin, Greek, Cyrillic, Arabic, Hebrew, CJK, common symbols, older emoji |
| Plane 1 (SMP, Supplementary Multilingual Plane) | U+10000–U+1FFFF | Historic scripts, musical symbols, mathematical alphanumeric symbols (where fancy fonts live), newer emoji |
| Plane 2 (SIP, Supplementary Ideographic Plane) | U+20000–U+2FFFF | Rare and historic CJK ideographs |
| Plane 3 (TIP, Tertiary Ideographic Plane) | U+30000–U+3FFFF | More rare and historic CJK ideographs |
| Planes 4–13 | U+40000–U+DFFFF | Unassigned (reserved for future use) |
| Plane 14 (SSP, Supplementary Special-purpose Plane) | U+E0000–U+EFFFF | Tag characters, supplementary variation selectors |
| Planes 15–16 | U+F0000–U+10FFFF | Private Use Areas (no agreed meaning; applications define their own) |
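Because each plane holds exactly 65,536 (0x10000) code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with plane_of as a hypothetical helper name:

```python
def plane_of(ch: str) -> int:
    """Return the plane number (0 to 16) of a single character."""
    return ord(ch) >> 16  # same as ord(ch) // 0x10000

for ch in ["A", "\u4e2d", "\U0001F600", "\U0001D400", "\U00020000"]:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```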

The split between BMP and supplementary planes matters because of a historical accident called UTF-16, which we'll get to in a moment.

Encodings: how code points become bytes

Here is the thing that trips up almost every beginner: Unicode defines the numbers, but it doesn't tell you how to write those numbers to disk. That's the job of an encoding: a specific recipe for turning code points into bytes. Unicode defines three official encodings: UTF-8, UTF-16, and UTF-32.

UTF-8

UTF-8 is the encoding of the modern web. As of 2025, it's used on roughly 98% of web pages, and it's the default for files and network I/O in most modern programming languages and databases. A few platforms (Windows, Java, JavaScript) still use UTF-16 for strings internally.

It's variable-length: a single code point takes between 1 and 4 bytes. The genius of UTF-8 is that the original 128 ASCII characters (U+0000 to U+007F) are encoded as themselves, in a single byte. So a plain English text file in ASCII and the same file in UTF-8 are byte-for-byte identical. This backwards compatibility is the reason UTF-8 won.

Larger code points take more bytes:

| Code point range | Bytes | Example |
|---|---|---|
| U+0000–U+007F | 1 | A → 0x41 |
| U+0080–U+07FF | 2 | é → 0xC3 0xA9 |
| U+0800–U+FFFF | 3 | 中 → 0xE4 0xB8 0xAD |
| U+10000–U+10FFFF | 4 | 😀 → 0xF0 0x9F 0x98 0x80 |
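The byte patterns in the table come from packing the code point's bits into fixed templates. The sketch below hand-rolls that packing for illustration and checks it against Python's built-in encoder; in real code you would just call str.encode("utf-8"). The function name utf8_encode is an assumption of this example.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # matches the built-in encoder
    print(f"U+{ord(ch):04X}", " ".join(f"0x{b:02X}" for b in encoded))
```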

UTF-16

UTF-16 uses 16-bit code units (2 bytes each). Every BMP character fits in one unit; supplementary characters (like most emoji) require a pair of units called a surrogate pair. This is why "😀".length in JavaScript returns 2: JS strings are UTF-16, and the engine sees two code units, not one character. We'll come back to this; it's the source of countless bugs.
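The surrogate pair for a supplementary code point is computed with a little arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A sketch of that calculation, with to_surrogates as a hypothetical helper name:

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = cp - 0x10000             # 20 bits remain
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check against Python's own big-endian UTF-16 encoder.
assert "\U0001F600".encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDE, 0x00])
```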

UTF-32

UTF-32 uses 4 bytes per code point, fixed-width. It's simple but wasteful: English text balloons to four times its ASCII size. It's almost never used for storage or transport, only for in-memory manipulation in some libraries.
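A quick way to see the size trade-off is to encode the same text three ways and count bytes. This sketch uses the little-endian codec names so that Python does not prepend a byte order mark; encoded_sizes is an illustrative helper.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte length of the same text in each Unicode encoding (no BOM)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(encoded_sizes("hello"))         # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes("\u4e2d\u6587"))  # {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
```

For CJK-heavy text UTF-16 can be smaller than UTF-8, which is one reason it has not disappeared entirely.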

Why this matters: real bugs you'll hit

A "character" is not a "byte," and it's not even a "code point" in Unicode. The standard draws careful distinctions:

  • A code point is a number assigned to a character.
  • A code unit is the chunk size your encoding uses (1 byte in UTF-8, 2 bytes in UTF-16, 4 in UTF-32).
  • A grapheme cluster is what a user perceives as one character. The family emoji 👨‍👩‍👧‍👦 is a single grapheme cluster made up of seven code points (four people joined by three zero-width joiners).

This is why the length of "👨‍👩‍👧‍👦" is 11 in JavaScript (.length counts UTF-16 code units), 7 in Python 3 (len() counts code points), and 1 in Swift (.count counts grapheme clusters). Different languages count different things. When you truncate a string to "100 characters" for a UI, you might be splitting a grapheme in half, leaving a stray emoji piece or a detached accent.
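The different counts can be reproduced from Python alone. This sketch builds the family emoji from escapes and counts it three ways. It leaves out grapheme clusters because Python's standard library has no text segmentation; a third-party library would be needed for that.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])

code_points = len(family)                           # what Python's len() counts
utf16_units = len(family.encode("utf-16-le")) // 2  # what JavaScript's .length counts
utf8_bytes = len(family.encode("utf-8"))            # what ends up on disk

print(code_points)  # 7
print(utf16_units)  # 11
print(utf8_bytes)   # 25
```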

Normalization: when "é" ≠ "é"

The letter é can be encoded two ways in Unicode:

  1. Precomposed (NFC): U+00E9, one code point.
  2. Decomposed (NFD): U+0065 (e) + U+0301 (combining acute accent), two code points.

Both render identically. Both mean the same thing. But they are different code point sequences (and therefore different bytes), so é == é can return false if one is precomposed and the other decomposed. This breaks string comparisons, deduplication, database lookups, and search. Unicode defines four normalization forms (NFC, NFD, NFKC, NFKD) to handle this. NFC is the most common in practice and is the form the W3C recommends for web content.
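In Python the comparison problem and its fix look like this. The sketch uses the standard unicodedata.normalize() function; same_text is a hypothetical helper, and NFC is one reasonable choice of form rather than the only one.

```python
import unicodedata

composed = "\u00e9"      # one code point
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text(composed, decomposed))   # True
print(same_text("caf\u00e9", "cafe\u0301"))  # True
```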

How to actually look things up

The Unicode Standard is published free online, and a couple of tools make day-to-day lookups easier. The BabelPad editor is unmatched for poking around obscure code points. For developers, Python's unicodedata module gives you name(), category(), decimal(), and normalize() out of the box.
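As a quick illustration of the unicodedata functions mentioned above, here is a small sketch. The describe helper is invented for this example, and the results depend on the Unicode version bundled with your Python build.

```python
import unicodedata

def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, official name, general category) for one character."""
    return (f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),  # control codes have no name
            unicodedata.category(ch))

for ch in ["A", "\u00e9", "\u0301", "\U0001F600"]:
    print(describe(ch))

# decimal() returns the numeric value of a decimal digit character.
print(unicodedata.decimal("7"))  # 7
```

The general category is a two-letter code: Lu is an uppercase letter, Ll a lowercase letter, Mn a non-spacing mark, So an "other" symbol.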

A quick cheat sheet of blocks you'll meet constantly:

U+0000–U+007F   Basic Latin (ASCII)
U+0080–U+00FF   Latin-1 Supplement (à, é, ñ, ü, etc.)
U+0100–U+017F   Latin Extended-A (Central European)
U+0300–U+036F   Combining Diacritical Marks (accents, used by Zalgo)
U+0370–U+03FF   Greek and Coptic
U+0400–U+04FF   Cyrillic
U+2000–U+206F   General Punctuation (smart quotes, em dash, ellipsis)
U+2070–U+209F   Superscripts and Subscripts
U+20A0–U+20CF   Currency Symbols (€, ₿, ₹)
U+2100–U+214F   Letterlike Symbols (ℂ, ℍ, ℕ, ℚ, ℝ, ℤ)
U+2200–U+22FF   Mathematical Operators
U+2600–U+26FF   Misc Symbols (☀ ☂ ☎)
U+1F300–U+1F5FF Misc Symbols and Pictographs (emoji core)
U+1F600–U+1F64F Emoticons
U+1D400–U+1D7FF Mathematical Alphanumeric Symbols (fancy fonts!)

That last block is where most "fancy text" generators get their glyphs. We cover it in depth in Why Fancy Fonts Work Everywhere (The Unicode Secret).

Why this matters for everyone, not just engineers

You don't need to memorize code points to benefit from understanding Unicode. Here are practical payoffs:

  1. Avoiding the "smart quotes" bug. Word processors silently replace " with " and ". These look identical but are different code points, and they break CSV imports, JSON parsing, and shell scripts. Knowing this exists is half the battle.
  1. Picking the right font. A font is a mapping from code points to glyphs. If a font doesn't include a glyph for U+1F600, you get a tofu box (β–‘) instead of πŸ˜€. This is why websites ship font fallback stacks β€” the browser falls through to the next font until it finds one that has the glyph.
  1. Search and dedupe. Two strings can look identical and be unequal (precomposed vs decomposed). Normalization fixes this. Searching for cafΓ© should still find cafΓ©.
  1. Truncation without breaking emoji. If you cut a UTF-16 string at an arbitrary index, you can split a surrogate pair. The result is an unpaired surrogate that renders as οΏ½. Always truncate on grapheme boundaries.
  1. SEO and URL slugs. URLs can technically contain non-ASCII characters but are usually percent-encoded. Knowing that cafΓ© becomes caf%C3%A9 in a URL helps you debug weird analytics reports.
  1. Accessibility. Screen readers depend on Unicode's character properties to pronounce text correctly. A fancy 𝓱 (Mathematical Bold Script H) is not the letter H to a screen reader β€” it's "MATHEMATICAL BOLD SCRIPT CAPITAL H," which sounds awful. This is why we caution against decorative Unicode in body text.
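Three of these payoffs can be sketched in a few lines of Python using only the standard library. The function names below are invented for this example, and each one is a starting point under simple assumptions rather than a complete solution.

```python
import unicodedata
from urllib.parse import quote

# Map the four common curly quotes back to their ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # left and right double quotes
    "\u2018": "'", "\u2019": "'",   # left and right single quotes
})

def straighten_quotes(text: str) -> str:
    """Replace curly quotes so CSV, JSON, and shell tools see plain ones."""
    return text.translate(SMART_QUOTES)

def percent_encode(text: str) -> str:
    """Percent-encode non-ASCII text as UTF-8 bytes, as URLs usually do."""
    return quote(text)

def plain_letters(text: str) -> str:
    """Fold decorative 'fancy font' letters back to ordinary ones (NFKC)."""
    return unicodedata.normalize("NFKC", text)

print(straighten_quotes("\u201chello\u201d"))  # "hello"
print(percent_encode("caf\u00e9"))             # caf%C3%A9
print(plain_letters("\U0001D4D7"))             # H
```

NFKC is lossy by design: it also folds things like superscripts and ligatures, so apply it to search keys or accessibility fallbacks rather than to the text you store.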

A short, slightly opinionated history

Unicode 1.0 (1991) covered roughly 7,100 characters from 24 scripts. It was originally planned as a 16-bit fixed-width encoding: 65,536 slots, "enough for everyone." That ambition collapsed within a decade as CJK expansion and the explosion of symbols demanded more room. UTF-16's surrogate mechanism was added in Unicode 2.0 (1996) as a workaround. UTF-8 is older than that: Ken Thompson and Rob Pike designed it in September 1992 on a placemat in a New Jersey diner (literally) while trying to make it work for the Plan 9 operating system.

The emoji revolution started in 2010 with Unicode 6.0, which adopted 722 emoji, most of them from Japanese carrier standards. New emoji have arrived in most releases since, and they're now the most-discussed part of Unicode by a wide margin, though they remain a tiny fraction of the standard by code point count.

What Unicode is still figuring out

Unicode isn't done. Active debates include:

  • Han unification. Chinese, Japanese, and Korean share thousands of historically related characters. Unicode encodes them once, with regional glyph variants. This saves space but means a Japanese font and a Chinese font will render the same code point differently, sometimes in ways that confuse readers.
  • Minority and historic scripts. Recent additions include Tangut, Wancho, Toto, and Cypro-Minoan, and more scripts are still in the pipeline. Each addition is a multi-year process involving linguists and native communities.
  • Emoji representation. Which new emoji to add is a recurring debate at the Unicode Technical Committee. Candidates now come in through a public emoji proposal process.
  • Security. Lookalike characters from different scripts (Latin a vs. Cyrillic а) enable phishing. Unicode's confusables data lists many known lookalike pairs.
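As a rough illustration of the lookalike problem, the sketch below flags strings that mix letters from more than one script. It guesses the script from the first word of each character's Unicode name, which is a crude heuristic invented for this example and not the detection method the Unicode security specifications define.

```python
import unicodedata

def scripts_used(text: str) -> set[str]:
    """Rough script guess: the first word of each letter's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in text if ch.isalpha()}

def looks_mixed(text: str) -> bool:
    """True if the letters appear to come from more than one script."""
    return len(scripts_used(text)) > 1

genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(genuine == spoofed)     # False, although they look alike on screen
print(looks_mixed(genuine))   # False
print(looks_mixed(spoofed))   # True
```

Mixed scripts are not always malicious (think of a Japanese sentence with an English brand name), so a check like this is best used to raise a warning, not to reject input outright.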

TL;DR

  • Unicode is a standard that assigns a unique number (code point) to every character in every language.
  • UTF-8, UTF-16, UTF-32 are encodings: recipes for writing those numbers as bytes. UTF-8 won the web.
  • A character you see is often a grapheme cluster, which may be one or many code points.
  • The same character can have multiple byte representations (precomposed vs. decomposed). Normalize before comparing.
  • Almost every weird text bug you'll encounter (mojibake, broken emoji, mismatched strings, smart-quote failures) traces back to one of these concepts.

If you remember nothing else from this guide, remember this: text is not bytes, and a character is not a code point. Once that idea settles in, the rest of Unicode starts to make sense.

For the next layer of practical Unicode (how those decorative bold/italic/script "fonts" you see on social media actually work), read our deep dive on why fancy fonts work everywhere.

Last reviewed: January 15, 2025. This article is part of TextKit's original content library.