UTF-8 Encoding
UTF-8 is a variable-width encoding for Unicode. Code points U+0000–U+007F use one byte identical to ASCII; larger scalars use 2–4 bytes. It is the dominant encoding on the web.
Byte layout
| Code point range | Bytes | Bit pattern (first byte …) |
|---|---|---|
| U+0000 – U+007F | 1 | 0xxxxxxx |
| U+0080 – U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800 – U+FFFF | 3 | 1110xxxx + 2× 10xxxxxx |
| U+10000 – U+10FFFF | 4 | 11110xxx + 3× 10xxxxxx |
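
As a rough illustration of the table above, the following Python sketch packs a single code point into bytes by hand. The function name `encode_code_point` is hypothetical and exists only for this example; in real code `str.encode("utf-8")` does the same job.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (illustrative helper, not a library API)."""
    # Reject values outside the Unicode range and the surrogate block.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For example, `encode_code_point(0x20AC)` (the euro sign) yields the three bytes E2 82 AC, matching the third row of the table.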
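Continuation bytes always begin with the bits 10, and no lead byte does. This makes UTF-8 self-synchronizing: after corruption, or when starting in the middle of a stream, a decoder can skip forward over continuation bytes (at most three in well-formed data) to reach the next character boundary.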
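
A minimal sketch of resynchronization, assuming the data is held in a `bytes` object: skip every byte whose top two bits are 10 until a lead byte (or the end of the data) is reached. The helper name `next_boundary` is made up for this example.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first non-continuation byte at or after pos.

    Hypothetical helper: a byte is a continuation byte when its top two
    bits are 10, i.e. (byte & 0xC0) == 0x80.
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# "a", "é" (2 bytes), "€" (3 bytes): 61 | C3 A9 | E2 82 AC
sample = "aé€".encode("utf-8")
# Index 4 lands in the middle of "€"; the next boundary is the end of the data.
resume_at = next_boundary(sample, 4)
```

This only locates a plausible boundary; whether the sequence starting there is itself valid still has to be checked by the decoder.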
Surrogate pairs
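UTF-16 represents code points in the supplementary planes (U+10000 and above) as pairs of surrogate code units. UTF-8 encodes those code points directly in four bytes, so it has no use for surrogates: the surrogate code points (U+D800–U+DFFF) never appear in well-formed UTF-8, and a byte sequence that would decode to one is ill-formed.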
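
The difference can be seen with Python's built-in codecs, used here only as one convenient illustration. The sketch encodes one supplementary-plane character both ways, then shows that the strict UTF-8 codec refuses both a lone surrogate on encode and the three-byte pattern ED A0 80, which would decode to U+D800.

```python
ch = "\U0001F600"  # a code point in a supplementary plane

utf16_bytes = ch.encode("utf-16-be")  # two surrogate code units: D83D DE00
utf8_bytes = ch.encode("utf-8")       # one four-byte sequence: F0 9F 98 80

# Encoding a lone surrogate as UTF-8 is refused by the strict codec.
try:
    "\ud800".encode("utf-8")
    surrogate_encoded = True
except UnicodeEncodeError:
    surrogate_encoded = False

# Bytes that would decode to U+D800 are rejected as ill-formed.
try:
    b"\xed\xa0\x80".decode("utf-8")
    surrogate_decoded = True
except UnicodeDecodeError:
    surrogate_decoded = False
```

Other languages and libraries may expose looser variants that let surrogates through, so this behavior should not be assumed everywhere.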
BOM
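The UTF-8 byte order mark (BOM) is the byte sequence EF BB BF, the encoding of U+FEFF. It is optional at the start of text files, where it acts only as an encoding signature because UTF-8 has no byte-order ambiguity. It is discouraged in web protocols where the charset is declared explicitly.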
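
As a small example of handling an optional BOM, Python's standard library offers the `utf-8-sig` codec, which drops a leading BOM when decoding, while the plain `utf-8` codec keeps it as the character U+FEFF. The sample payload below is invented for illustration.

```python
import codecs

# Hypothetical file contents: a BOM followed by the text "hi".
raw = codecs.BOM_UTF8 + b"hi"

with_bom = raw.decode("utf-8")         # BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # leading BOM is stripped

# utf-8-sig also accepts input that has no BOM at all.
plain = b"hi".decode("utf-8-sig")
```

Decoding with `utf-8-sig` is a reasonable default when reading files of unknown origin, since it handles both cases.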
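Invalid sequences
Decoders should reject overlong encodings (a code point encoded in more bytes than necessary), values above U+10FFFF, and unexpected continuation bytes. "Lossy" repair strategies, such as substituting U+FFFD REPLACEMENT CHARACTER for the bad bytes, differ by language and library.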
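
The following sketch checks a few ill-formed inputs against Python's strict UTF-8 decoder and then shows one lossy strategy, `errors="replace"`. The helper `is_well_formed` is a name chosen for this example, and the behavior shown is Python's; other decoders may substitute a different number of replacement characters for the same input.

```python
def is_well_formed(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


overlong = b"\xc0\xaf"              # "/" (U+002F) encoded in two bytes instead of one
out_of_range = b"\xf4\x90\x80\x80"  # would decode to U+110000, above U+10FFFF
stray = b"a\x80b"                   # continuation byte with no lead byte

results = [is_well_formed(b) for b in (overlong, out_of_range, stray)]

# Lossy repair: the bad byte becomes U+FFFD instead of raising an error.
repaired = stray.decode("utf-8", errors="replace")
```

Strict rejection is the safer choice for security-sensitive input, since overlong forms have historically been used to smuggle characters past filters; lossy decoding is better suited to display or logging.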