2019-06-24

UTF-8

UTF-8 is one of the smartest things I've seen. It is a byte stream encoding for simple values (like Unicode characters) that can be bigger than bytes. UTF-8 has some wonderful properties:

ASCII is a proper subset, so all ASCII encoded streams are also UTF-8 streams.
The first byte of a multi-byte character tells the number of bytes.
Continuation bytes are easily distinguished from first bytes.
The sort order is preserved, not that it matters for Unicode.
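These properties can be checked with a few lines of Python. This is only a sketch using Python's built-in `utf-8` codec; the sample strings and the helper name `is_continuation` are arbitrary choices for illustration.

```python
# ASCII is a subset: encoding pure ASCII text either way gives the same bytes.
ascii_text = "Hello, world"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

def is_continuation(byte):
    """A continuation byte always has the top two bits 10."""
    return byte & 0xC0 == 0x80

# Three characters that need 2, 3, and 4 bytes respectively.
encoded = "é€😀".encode("utf-8")
first_bytes = [b for b in encoded if not is_continuation(b)]

# Sort order: comparing encoded bytes gives the same order as
# comparing the character values themselves.
words = ["zebra", "Zürich", "apple", "éclair", "😀", "€uro"]
by_value = sorted(words)
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

print(same_bytes, len(first_bytes), by_value == by_bytes)
```

Because first bytes and continuation bytes never overlap, a reader dropped into the middle of a stream can skip forward to the next first byte and resynchronize.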

A first byte can contain between 2 and 7 bits of data. A first byte also determines the number of continuation bytes. Each continuation byte carries 6 bits of data.
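The arithmetic can be sketched in Python. The layout below follows the table at the end of this post, including its 5-byte and 6-byte forms, which belong to the formulation described here and are not part of standard UTF-8 (that stops at 4 bytes). The names `FIRST_BYTE_LAYOUT`, `describe_first_byte`, and `total_data_bits` are made up for this example.

```python
# (lowest first byte, data bits in the first byte, continuation bytes)
# Ordered from the longest form to the shortest so the first match wins.
FIRST_BYTE_LAYOUT = [
    (0xFC, 2, 5),
    (0xF8, 2, 4),
    (0xF0, 3, 3),
    (0xE0, 4, 2),
    (0xC0, 5, 1),
]

def describe_first_byte(byte):
    """Return (data bits in this byte, number of continuation bytes)."""
    if byte < 0x80:
        return (7, 0)  # single-byte form: 0xxxxxxx
    if byte < 0xC0:
        raise ValueError("continuation byte, not a first byte")
    for lowest, bits, continuations in FIRST_BYTE_LAYOUT:
        if byte >= lowest:
            return (bits, continuations)

def total_data_bits(byte):
    """Data bits in the whole sequence: first byte plus 6 per continuation."""
    bits, continuations = describe_first_byte(byte)
    return bits + 6 * continuations

for first in (0x41, 0xC3, 0xE2, 0xF0, 0xF8, 0xFC):
    print(hex(first), describe_first_byte(first), total_data_bits(first))
```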

UTF-8 has one unfortunate disadvantage: many 16-bit characters are encoded in 3 bytes. This disadvantage is more than offset by its advantages, and by having a single, simple encoding that can work in all languages and contexts. The benefits range from greater reliability to better security. That is why JSON recommends UTF-8. UTF-8 is the good stuff. Thank you, Ken Thompson.

In my own work, I use a formulation that works well with 32-bit characters.

Binary	First Byte From	First Byte Thru	Value From	Value Thru	Number of Bytes	Total Data Bits
`0xxxxxxx`	`00`	`7F`	`00`	`7F`	1	7
`10xxxxxx`	`80`	`BF`	continuation
`110xxxxx`	`C0`	`DF`	`80`	`7FF`	2	11
`1110xxxx`	`E0`	`EF`	`800`	`FFFF`	3	16
`11110xxx`	`F0`	`F7`	`1 0000`	`1F FFFF`	4	21
`111110xx`	`F8`	`FB`	`20 0000`	`3FF FFFF`	5	26
`111111xx`	`FC`	`FF`	`400 0000`	`FFFF FFFF`	6	32