Code points and scalar values

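A Unicode code point is any integer from 0 to 0x10FFFF, written U+0000 to U+10FFFF. This range includes the surrogate code points U+D800–U+DFFF, which are reserved for UTF-16 and never represent characters on their own. A Unicode scalar value is any code point that is not a surrogate, i.e. U+0000–U+D7FF or U+E000–U+10FFFF.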
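
The difference between a code point and a scalar value comes down to two range checks. The sketch below is a minimal illustration; the function and constant names are made up for this example and do not come from any library.

```python
# Illustrative names only; none of these come from a standard API.
MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_code_point(n: int) -> bool:
    """True if n lies anywhere in the Unicode codespace."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n: int) -> bool:
    """True if n is in the range reserved for UTF-16 surrogates."""
    return SURROGATE_FIRST <= n <= SURROGATE_LAST


def is_scalar_value(n: int) -> bool:
    """True if n is a code point that is not a surrogate."""
    return is_code_point(n) and not is_surrogate(n)
```

With these definitions, 0xD800 is a code point but not a scalar value, while 0x110000 is neither.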

Planes

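The codespace is divided into 17 planes of 65,536 code points each (17 × 65,536 = 1,114,112 code points in total). Plane 0 is the Basic Multilingual Plane (BMP), U+0000–U+FFFF; planes 1 to 16 are the supplementary planes. Most emoji and many historic scripts are encoded in Plane 1, the Supplementary Multilingual Plane.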
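
Because each plane holds exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A small sketch, with hypothetical helper names:

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # same as code_point // 0x10000


def is_bmp(code_point: int) -> bool:
    """True if the code point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0
```

For example, plane_of(ord("A")) is 0, plane_of(0x1F600) is 1 (an emoji in a supplementary plane), and plane_of(0x10FFFF) is 16.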

Normalization

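The same text can have more than one canonically equivalent code point sequence, e.g. precomposed “é” (U+00E9) vs “e” followed by a combining acute accent (U+0065 U+0301). Unicode defines four normalization forms: NFC and NFD (canonical composition and decomposition) and NFKC and NFKD (their compatibility counterparts). NFC is the usual choice on the web; macOS file systems have historically stored filenames in a decomposed form close to NFD.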
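
The “é” example can be reproduced with Python's standard unicodedata module. The helper name canonically_equal is an assumption made for this sketch; it compares two strings after normalizing both to NFC.

```python
import unicodedata

PRECOMPOSED = "\u00e9"   # é as a single code point
DECOMPOSED = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw strings differ even though they render the same.
raw_equal = PRECOMPOSED == DECOMPOSED

# NFC composes, NFD decomposes.
composed = unicodedata.normalize("NFC", DECOMPOSED)
decomposed = unicodedata.normalize("NFD", PRECOMPOSED)

# The compatibility forms additionally fold characters such as the
# ligature U+FB01 into plain letters; the canonical forms leave it alone.
ligature_nfc = unicodedata.normalize("NFC", "\ufb01")
ligature_nfkc = unicodedata.normalize("NFKC", "\ufb01")
```

Comparing or hashing user-supplied text without normalizing first is a common source of "identical-looking strings do not match" bugs, which is why a helper like the one above is usually applied at input boundaries.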

Grapheme clusters

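What users perceive as one “character” (a grapheme cluster) may span multiple code points, as with emoji skin tone modifiers and ZWJ sequences. A string's length in code units, or even in code points, therefore does not equal the number of user-perceived characters. Use a grapheme cluster segmentation API (Unicode text segmentation, optionally tailored per locale) for editing and cursor movement.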
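
To make the mismatch concrete, the sketch below counts one string three ways using only the standard library. The function name length_report is hypothetical. Python's standard library does not, to my knowledge, include a grapheme cluster segmenter, so the sketch only shows the counts that should not be used as a character count.

```python
def length_report(s: str) -> dict:
    """Count a string in code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        "code_points": len(s),  # Python str is indexed by code point
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "utf8_bytes": len(s.encode("utf-8")),
    }


# Thumbs up + medium skin tone modifier: one perceived character.
THUMBS_UP_MEDIUM = "\U0001F44D\U0001F3FD"

# Woman + ZERO WIDTH JOINER + laptop: one perceived character.
WOMAN_TECHNOLOGIST = "\U0001F469\u200D\U0001F4BB"

thumbs = length_report(THUMBS_UP_MEDIUM)
# {'code_points': 2, 'utf16_units': 4, 'utf8_bytes': 8}

technologist = length_report(WOMAN_TECHNOLOGIST)
# {'code_points': 3, 'utf16_units': 5, 'utf8_bytes': 11}
```

Each string is a single grapheme cluster, yet none of the three counts is 1. Truncating or moving a cursor by any of these units can split the cluster.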