Byte layout

Code point range      Bytes  Bit pattern (first byte …)
U+0000 – U+007F       1      0xxxxxxx
U+0080 – U+07FF       2      110xxxxx 10xxxxxx
U+0800 – U+FFFF       3      1110xxxx + 2× 10xxxxxx
U+10000 – U+10FFFF    4      11110xxx + 3× 10xxxxxx
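
The table can be read as a small encoding routine. The following is a minimal sketch, not a reference implementation; the function name encode_code_point is hypothetical, and the surrogate check is included only so the result agrees with Python's built-in encoder.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, one branch per table row."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx followed by two continuation bytes
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx followed by three continuation bytes
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For example, encode_code_point(0x20AC) yields the three bytes E2 82 AC, the same result as "€".encode("utf-8").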

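Continuation bytes always begin with the bits 10, and no lead byte does. This self-synchronizing property lets a decoder resync after corruption: it skips bytes of the form 10xxxxxx until it reaches the next lead byte.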
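
One way to picture resynchronization is a scan that skips continuation bytes. This is a sketch under the assumption that the decoder simply discards the damaged character; the helper names is_continuation and next_boundary are hypothetical.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index at or after pos that is not a continuation byte."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# "héllo" is 68 C3 A9 6C 6C 6F; dropping the lead byte C3 leaves a stray A9.
damaged = b"h\xa9llo"
resume_at = next_boundary(damaged, 1)
recovered = damaged[resume_at:].decode("utf-8")
```

Here the scan starts on the stray continuation byte at index 1 and resumes decoding at index 2, so only the damaged character is lost.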

Surrogate pairs

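UTF-16 represents code points in the supplementary planes (U+10000 and above) as pairs of surrogate code units. UTF-8 encodes those code points directly in four bytes and never emits surrogate code points (U+D800–U+DFFF); a well-formed UTF-8 sequence must not decode to them.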
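
The difference can be observed with Python's built-in codecs. This is an illustrative sketch of CPython's strict behaviour; other languages and libraries may be more permissive, and the helper names are hypothetical.

```python
ch = "\U0001F600"  # a code point in a supplementary plane

utf8 = ch.encode("utf-8")       # four bytes, no surrogates involved
utf16 = ch.encode("utf-16-be")  # two 16-bit surrogate code units


def rejects_surrogate_encode() -> bool:
    """Check that a lone surrogate cannot be encoded as UTF-8."""
    try:
        "\ud800".encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def rejects_surrogate_decode() -> bool:
    """Check that ED A0 80, the would-be encoding of U+D800, is rejected."""
    try:
        b"\xed\xa0\x80".decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

In this example utf8 is F0 9F 98 80, while utf16 is the surrogate pair D83D DE00.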

BOM

The byte order mark (BOM) for UTF-8 is the byte sequence EF BB BF, the encoding of U+FEFF. It is optional at the start of text files and discouraged in web protocols where the charset is declared explicitly.
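
As an illustration, Python's standard library exposes the BOM as codecs.BOM_UTF8 and provides a "utf-8-sig" codec that strips it on decode. The sketch below assumes the input is a byte string that may or may not start with a BOM.

```python
import codecs

raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain codec keeps the BOM as a leading U+FEFF character.
plain = raw.decode("utf-8")

# The "utf-8-sig" codec removes a leading BOM if one is present.
stripped = raw.decode("utf-8-sig")

# Input without a BOM passes through "utf-8-sig" unchanged.
no_bom = "hello".encode("utf-8").decode("utf-8-sig")
```

Decoding with "utf-8-sig" is a reasonable default when reading text files of unknown origin, since it handles both cases.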

Invalid sequences

Decoders should reject overlong encodings (for example, C0 80 for U+0000), values above U+10FFFF, and unexpected continuation bytes. “Lossy” repair strategies, such as substituting U+FFFD for invalid bytes, differ by language and library.
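
The cases above can be checked against a strict decoder. This sketch uses Python's built-in UTF-8 codec as one example of such a decoder; the helper is_well_formed is hypothetical, and the errors="replace" and errors="ignore" handlers show only Python's repair strategies, not a universal behaviour.

```python
def is_well_formed(data: bytes) -> bool:
    """Return True if a strict UTF-8 decode of data succeeds."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


overlong = b"\xc0\x80"              # two-byte form of U+0000
out_of_range = b"\xf4\x90\x80\x80"  # would decode to U+110000
stray = b"a\x80b"                   # continuation byte with no lead byte

# Lossy repair: substitute U+FFFD, or drop the offending byte entirely.
replaced = stray.decode("utf-8", errors="replace")
ignored = stray.decode("utf-8", errors="ignore")
```

With these inputs, is_well_formed returns False for all three byte strings, replaced is "a\ufffdb", and ignored is "ab".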