What is UTF-8?

UTF-8 (Unicode Transformation Format - 8-bit) is a variable-width character encoding that can represent every Unicode code point, using one to four bytes per character. It’s the dominant character encoding on the web, used by over 98% of all websites.

How it works

UTF-8 encodes characters using a variable number of bytes:

  • 1 byte (0-127): ASCII characters. A = 0x41, z = 0x7A. This makes UTF-8 backward-compatible with ASCII.
  • 2 bytes (128-2047): Latin extensions, Greek, Cyrillic, Arabic, Hebrew. é = 0xC3 0xA9.
  • 3 bytes (2048-65535): CJK characters, most of the Basic Multilingual Plane. 中 = 0xE4 0xB8 0xAD.
  • 4 bytes (65536+): Emoji, historic scripts, math symbols. 🚀 = 0xF0 0x9F 0x9A 0x80.
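The four tiers above can be checked with a few lines of Python. This is a minimal sketch that relies only on the built-in str.encode; the helper name utf8_hex is made up for illustration.

```python
# One sample character from each byte-length tier listed above.
samples = ["A", "é", "中", "🚀"]


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hex (hypothetical helper)."""
    return " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the code point, how many bytes it needs, and the bytes themselves.
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s) -> {utf8_hex(ch)}")
```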

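The encoding is self-synchronizing: any single byte tells you whether it starts a character or continues one, so a decoder that lands in the middle of a stream can find the next character boundary without rescanning from the beginning. Single-byte characters have the form 0xxxxxxx, leading bytes of multi-byte characters start with a pattern that also gives the sequence length (110xxxxx for 2-byte, 1110xxxx for 3-byte, 11110xxx for 4-byte), and continuation bytes always have the form 10xxxxxx.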

Why UTF-8 won

UTF-8 has three properties that made it ubiquitous:

  1. ASCII compatibility: Every valid ASCII document is also a valid UTF-8 document, byte for byte. This made adoption painless.
  2. Space efficiency: English text uses 1 byte per character (same as ASCII). Other scripts use 2-4 bytes only when needed.
  3. No byte-order issues: Unlike UTF-16 and UTF-32, UTF-8 doesn’t have endianness problems. No BOM (byte order mark) required.
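Each of the three properties can be observed directly. The snippet below is a small sketch using Python's standard codecs; the sample string and variable names are arbitrary.

```python
text = "Hello, UTF-8"  # 12 ASCII characters

# 1. ASCII compatibility: the ASCII and UTF-8 encodings are byte-for-byte identical.
same_bytes = text.encode("ascii") == text.encode("utf-8")

# 2. Space efficiency: compare encoded sizes (the -le variants add no BOM).
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# 3. Byte order: UTF-16 has two possible layouts for the same character,
#    while UTF-8 has exactly one.
utf16_le = "é".encode("utf-16-le")  # b'\xe9\x00'
utf16_be = "é".encode("utf-16-be")  # b'\x00\xe9'
utf8 = "é".encode("utf-8")          # b'\xc3\xa9'

print(same_bytes, sizes, utf16_le, utf16_be, utf8)
```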

Common encoding problems

The infamous é appearing instead of é happens when UTF-8 bytes are interpreted as Latin-1 (ISO 8859-1). The reverse — mojibake — produces garbled text when the encoding declaration doesn’t match the actual encoding.
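The mismatch is easy to reproduce, and in the simple single-round case it can be undone. The following is a sketch, not a general repair tool: it assumes the text was garbled exactly once by a UTF-8 to Latin-1 mix-up.

```python
original = "é"

# UTF-8 bytes (0xC3 0xA9) misread as Latin-1 become two characters.
garbled = original.encode("utf-8").decode("latin-1")  # 'Ã©'

# Reversing the mistake restores the text, provided nothing else altered it.
repaired = garbled.encode("latin-1").decode("utf-8")  # 'é'

# The opposite mismatch: Latin-1 bytes read as UTF-8 are not valid UTF-8,
# so a lenient decoder substitutes U+FFFD (the replacement character).
latin1_bytes = "café".encode("latin-1")                     # b'caf\xe9'
lossy = latin1_bytes.decode("utf-8", errors="replace")      # 'caf\ufffd'

print(garbled, repaired, lossy)
```

Note that the second case is lossy: once a byte has been replaced with U+FFFD, the original character cannot be recovered from the decoded string.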

Prevention rules:

  • Declare <meta charset="utf-8"> in HTML
  • Set Content-Type: application/json; charset=utf-8 in HTTP headers
  • Save source files as UTF-8 (most editors default to this now)
  • Use UTF-8 for database columns (utf8mb4 in MySQL; the legacy utf8 charset, an alias for utf8mb3, stores at most 3 bytes per character and cannot hold emoji or other 4-byte characters)

UTF-8 vs. UTF-16

UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane and 4 bytes for supplementary characters. JavaScript strings and Java’s char type are built on 16-bit UTF-16 code units. This is why "🚀".length returns 2 in JavaScript: the emoji is stored as a surrogate pair of two 16-bit code units.
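The surrogate-pair arithmetic can be sketched in Python, whose own len counts code points rather than UTF-16 code units. The helpers utf16_code_units and surrogate_pair are hypothetical names introduced for this example.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units, i.e. what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(ch: str) -> tuple[int, int]:
    """Split a supplementary character (U+10000 and above) into its surrogates."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("BMP characters are a single code unit, not a pair")
    offset = cp - 0x10000                 # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


rocket = "🚀"                              # U+1F680
print(len(rocket))                         # 1 code point
print(utf16_code_units(rocket))            # 2 code units
print([hex(u) for u in surrogate_pair(rocket)])  # ['0xd83d', '0xde80']
```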

Encode and decode UTF-8 with the UTF-8 Encoder/Decoder. Look up Unicode characters with the Unicode Character Lookup or browse the full set in the Unicode Table.
