# What is UTF-8?

> UTF-8 is a variable-width character encoding that can represent every Unicode code point, using one to four bytes per character, and is the dominant encoding on the web.

- URL: https://www.browserutils.dev/glossary/utf-8
- Published: 2026-03-21
- Updated: 2026-03-16

---

**UTF-8 (Unicode Transformation Format - 8-bit)** is a variable-width character encoding that can represent every Unicode code point, using one to four bytes per character. It's the dominant character encoding on the web, used by over 98% of all websites.

## How it works

UTF-8 encodes each code point in one to four bytes, depending on its numeric value (the ranges below are decimal code point values):

- **1 byte** (0-127): ASCII characters. `A` = `0x41`, `z` = `0x7A`. This makes UTF-8 backward-compatible with ASCII.
- **2 bytes** (128-2047): Latin extensions, Greek, Cyrillic, Arabic, Hebrew. `é` = `0xC3 0xA9`.
- **3 bytes** (2048-65535): CJK characters, most of the Basic Multilingual Plane. `中` = `0xE4 0xB8 0xAD`.
- **4 bytes** (65536-1114111): Emoji, historic scripts, and other supplementary-plane characters such as mathematical alphanumeric symbols. `🚀` = `0xF0 0x9F 0x9A 0x80`.
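
The byte sequences listed above can be checked with a few lines of Python. This is a minimal sketch; the helper name `utf8_hex` is made up for illustration, and only the built-in `str.encode` does the real work.

```python
# Encode one sample character from each length class and show its bytes.
samples = ["A", "é", "中", "🚀"]


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hex (hypothetical helper)."""
    return " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the code point rather than the glyph so the output is plain ASCII.
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s) -> {utf8_hex(ch)}")
```

The four lines of output should report 1, 2, 3, and 4 bytes respectively, matching the examples in the list.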

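The encoding is self-synchronizing: the high bits of any single byte tell you whether it begins a character or continues one, so a decoder that lands in the middle of a stream can find the next character boundary without rescanning from the start. Single-byte (ASCII) characters have the form `0xxxxxxx`, leading bytes of multi-byte sequences start with `110xxxxx` (2-byte), `1110xxxx` (3-byte), or `11110xxx` (4-byte), and continuation bytes always have the form `10xxxxxx`.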
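
One way to see the self-synchronizing property is to classify bytes by their high bits and walk backwards from an arbitrary offset. The sketch below assumes the input is already valid UTF-8. The names `classify` and `char_start` are illustrative, and the check covers bit patterns only, not full validation (it does not reject overlong forms, for example).

```python
def classify(byte: int) -> str:
    """Classify one byte by its leading bits (bit patterns only, not full validation)."""
    if (byte & 0b1000_0000) == 0b0000_0000:
        return "ascii"          # 0xxxxxxx
    if (byte & 0b1100_0000) == 0b1000_0000:
        return "continuation"   # 10xxxxxx
    if (byte & 0b1110_0000) == 0b1100_0000:
        return "lead-2"         # 110xxxxx
    if (byte & 0b1111_0000) == 0b1110_0000:
        return "lead-3"         # 1110xxxx
    if (byte & 0b1111_1000) == 0b1111_0000:
        return "lead-4"         # 11110xxx
    return "invalid"            # 11111xxx never occurs in UTF-8


def char_start(data: bytes, i: int) -> int:
    """Step back from index i to the first byte of the character that contains it."""
    while i > 0 and classify(data[i]) == "continuation":
        i -= 1
    return i


data = "é中🚀".encode("utf-8")        # 2 + 3 + 4 = 9 bytes
print([classify(b) for b in data])
print(char_start(data, 7))            # index 7 is inside the emoji, which starts at index 5
```

Because a continuation byte can never be mistaken for a leading byte, the backward walk needs at most three steps.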

## Why UTF-8 won

UTF-8 has three properties that made it ubiquitous:

1. **ASCII compatibility**: Every valid ASCII document is also a valid UTF-8 document, byte for byte. This made adoption painless.
2. **Space efficiency**: English text uses 1 byte per character (same as ASCII). Other scripts use 2-4 bytes only when needed.
3. **No byte-order issues**: Unlike UTF-16 and UTF-32, UTF-8 doesn't have endianness problems. No BOM (byte order mark) required.
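
These three properties are easy to observe directly. The following is a small sketch using only Python's built-in codecs; the sample string `"Hello"` is arbitrary.

```python
text = "Hello"

# 1. ASCII compatibility: the ASCII and UTF-8 encodings are byte-for-byte identical.
same_bytes = text.encode("ascii") == text.encode("utf-8")

# 2. Space efficiency: bytes needed for the same English text in each encoding.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# 3. Byte order: UTF-16 has two possible serializations, UTF-8 has exactly one.
utf16_le = "é".encode("utf-16-le")    # b'\xe9\x00'
utf16_be = "é".encode("utf-16-be")    # b'\x00\xe9'
utf8 = "é".encode("utf-8")            # b'\xc3\xa9'

print(same_bytes)
print(sizes)
print(utf16_le, utf16_be, utf8)
```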

## Common encoding problems

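The infamous `Ã©` appearing instead of `é` happens when UTF-8 bytes are interpreted as Latin-1 (ISO 8859-1) or the closely related Windows-1252: the two bytes `0xC3 0xA9` are read as two separate characters. This is a classic case of mojibake, the garbled text produced whenever the declared encoding doesn't match the actual encoding. The reverse mistake, decoding Latin-1 bytes as UTF-8, usually yields replacement characters (`�`) or decode errors, because a lone high byte is not a valid UTF-8 sequence.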
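
The `Ã©` case can be reproduced, and in simple cases undone, with Python's built-in codecs. This is a minimal sketch: the round-trip repair only works when the wrong decode was lossless, as it is with Latin-1, and the variable names are illustrative.

```python
original = "é"
raw = original.encode("utf-8")                  # b'\xc3\xa9'

# The mistake: decode UTF-8 bytes as Latin-1, one character per byte.
garbled = raw.decode("latin-1")                 # 'Ã©'

# The repair: reverse the wrong decode, then decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

# The opposite mistake: Latin-1 bytes decoded as UTF-8.
# A lone 0xE9 is an incomplete sequence, so it becomes U+FFFD.
replaced = original.encode("latin-1").decode("utf-8", errors="replace")

print(len(garbled), repaired == original, replaced == "\ufffd")
```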

Prevention rules:
- Declare `<meta charset="utf-8">` in HTML
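- Set the charset in the HTTP `Content-Type` header, e.g. `Content-Type: text/html; charset=utf-8` (JSON is UTF-8 by specification, so `application/json; charset=utf-8` is harmless but redundant)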
- Save source files as UTF-8 (most editors default to this now)
- Use UTF-8 for database columns (`utf8mb4` in MySQL; the legacy `utf8` charset stores at most 3 bytes per character, so it cannot hold emoji or other 4-byte characters)

## UTF-8 vs. UTF-16

UTF-16 uses one 16-bit code unit (2 bytes) for every character in the Basic Multilingual Plane and two code units (4 bytes) for supplementary characters. JavaScript strings are sequences of UTF-16 code units, and a Java `char` is a single UTF-16 code unit. This is why `"🚀".length` returns `2` in JavaScript: the emoji is stored as a surrogate pair of two 16-bit code units.
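
The same counts can be reproduced outside JavaScript. The sketch below uses Python, whose `len` counts code points rather than UTF-16 code units; `utf16_units` is a hypothetical helper that mimics what JavaScript's `.length` reports.

```python
def utf16_units(s: str) -> int:
    """Count the 16-bit code units in the UTF-16 encoding of s."""
    return len(s.encode("utf-16-le")) // 2


rocket = "🚀"                              # U+1F680, outside the BMP

code_points = len(rocket)                  # 1: Python counts code points
code_units = utf16_units(rocket)           # 2: a surrogate pair in UTF-16
utf8_bytes = len(rocket.encode("utf-8"))   # 4: one 4-byte sequence in UTF-8

# Split the big-endian UTF-16 bytes into the two surrogates.
be = rocket.encode("utf-16-be")
high = int.from_bytes(be[:2], "big")       # 0xD83D (high surrogate)
low = int.from_bytes(be[2:], "big")        # 0xDE80 (low surrogate)

print(code_points, code_units, utf8_bytes, hex(high), hex(low))
```

Note that both encodings spend 4 bytes on this character. The size difference shows up for ASCII text, where UTF-8 needs 1 byte per character and UTF-16 needs 2.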

Encode and decode UTF-8 with the [UTF-8 Encoder/Decoder](/tools/utf8-encoder-decoder). Look up Unicode characters with the [Unicode Character Lookup](/tools/unicode-character-lookup) or browse the full set in the [Unicode Table](/tools/unicode-table).