UTF-8 Encoding — Defient.io

Byte layout

Code point range	Bytes	Bit pattern (first byte …)
U+0000 – U+007F	1	`0xxxxxxx`
U+0080 – U+07FF	2	`110xxxxx` `10xxxxxx`
U+0800 – U+FFFF	3	`1110xxxx` + 2× `10xxxxxx`
U+10000 – U+10FFFF	4	`11110xxx` + 3× `10xxxxxx`

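Continuation bytes always begin with the bits `10`, and no lead byte does. This makes UTF-8 self-synchronizing: a decoder that lands mid-sequence or hits corruption can skip forward to the next byte that is not a continuation byte and resume decoding there.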
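
The table and the resync rule can be sketched in Python as follows. This is a minimal illustration, not a production codec; the function names `encode_code_point`, `is_continuation`, and `next_boundary` are made up for this example.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point by following the byte-layout table."""
    # Surrogates and values above U+10FFFF have no well-formed encoding.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"cannot encode U+{cp:04X}")
    if cp <= 0x7F:        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx + 2 continuation bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx + 3 continuation bytes
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def is_continuation(byte: int) -> bool:
    """True if the top two bits of the byte are 10."""
    return (byte & 0xC0) == 0x80


def next_boundary(data: bytes, start: int) -> int:
    """Skip continuation bytes from `start` to the next possible lead byte."""
    i = start
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i


euro = encode_code_point(0x20AC)       # U+20AC EURO SIGN
print(euro.hex(" "))                   # e2 82 ac
print(next_boundary(euro + b"A", 1))   # 3, the index of "A"
```

Starting at index 1 drops the decoder into the middle of the three-byte sequence; skipping the two continuation bytes brings it to the next lead byte.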

Surrogate pairs

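UTF-16 encodes supplementary-plane code points (U+10000 and above) as pairs of surrogate code units. UTF-8 encodes those code points directly as 4-byte sequences and never uses surrogates: the surrogate code points (U+D800–U+DFFF) must not appear in well-formed UTF-8, and a decoder must reject any byte sequence that would decode to one.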
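
A short Python sketch of the difference, using U+1F600 as the supplementary-plane example. The helper names are illustrative only, and the behavior shown is that of Python's built-in strict `utf-8` codec; other languages may expose this differently.

```python
emoji = "\U0001F600"

# UTF-16 represents the code point as a surrogate pair (two 16-bit units).
print(emoji.encode("utf-16-be").hex(" "))   # d8 3d de 00

# UTF-8 encodes the same code point directly as one 4-byte sequence.
print(emoji.encode("utf-8").hex(" "))       # f0 9f 98 80


def encodes_to_utf8(text: str) -> bool:
    """True if the strict UTF-8 encoder accepts the string."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def decodes_from_utf8(raw: bytes) -> bool:
    """True if the strict UTF-8 decoder accepts the bytes."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A lone surrogate code point cannot be encoded as UTF-8.
print(encodes_to_utf8("\ud800"))            # False

# ED A0 80 is what a naive 3-byte encoding of U+D800 would look like;
# a conforming decoder rejects it.
print(decodes_from_utf8(b"\xed\xa0\x80"))   # False
```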

BOM

The UTF-8 byte order mark (BOM) is the byte sequence EF BB BF. It is optional at the start of text files and discouraged in web protocols where the charset is declared explicitly.
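
As an illustration, Python's standard library exposes the BOM as `codecs.BOM_UTF8`, and its `utf-8-sig` codec drops a leading BOM on decode. The helper `strip_utf8_bom` below is a hypothetical name used only for this sketch.

```python
import codecs

print(codecs.BOM_UTF8.hex(" "))   # ef bb bf


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return `raw` without a leading UTF-8 BOM, if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain utf-8 codec keeps the BOM as the character U+FEFF.
print(repr(data.decode("utf-8")))       # '\ufeffhello'

# The utf-8-sig codec removes it while decoding.
print(repr(data.decode("utf-8-sig")))   # 'hello'

# Stripping at the byte level gives the same result.
print(strip_utf8_bom(data))             # b'hello'
```

Handling the BOM at the boundary, when bytes are first read, keeps U+FEFF from leaking into parsed content.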

Invalid sequences

Decoders should reject overlong encodings, out-of-range values, and unexpected continuation bytes. “Lossy” repair strategies differ by language and library.
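
The three cases can be checked against Python's built-in decoder, taken here as one example of a strict implementation. The sample byte strings and the helper `is_well_formed_utf8` are chosen for illustration. The `errors="replace"` mode is one lossy strategy: it substitutes U+FFFD for bytes it cannot decode, and the number of replacement characters produced for a given input can vary between libraries.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")   # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "overlong encoding of '/'": b"\xc0\xaf",
    "value above U+10FFFF": b"\xf4\x90\x80\x80",
    "unexpected continuation byte": b"\x80",
}

for label, raw in samples.items():
    lossy = raw.decode("utf-8", errors="replace")
    print(f"{label}: well-formed={is_well_formed_utf8(raw)}, lossy={lossy!r}")
```

Strict rejection is the safer default for input that feeds security-sensitive checks, since overlong forms could otherwise smuggle characters such as `/` past byte-level filters.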

