Douglas Crockford

UTF-8

UTF-8 is one of the smartest things I've seen. It is a byte stream encoding for simple values (like Unicode characters) that can be bigger than bytes. UTF-8 has some wonderful properties:

A first byte can contain between 2 and 7 bits of data. A first byte also determines the number of continuation bytes. Each continuation byte carries 6 bits of data.
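
As a rough illustration of the first-byte rule, the Python sketch below maps a byte to the number of continuation bytes it announces and the number of data bits it carries itself. The function names are hypothetical, and the byte ranges are taken from the 32-bit formulation tabulated at the end of this page; standard UTF-8 stops at 4-byte sequences.

```python
def classify_first_byte(byte):
    """Return (continuation_count, data_bits_in_first_byte).

    Returns None when the byte is a continuation byte (10xxxxxx),
    which can never start a sequence.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    if byte < 0x80:          # 0xxxxxxx: single byte, 7 data bits
        return (0, 7)
    if byte < 0xC0:          # 10xxxxxx: continuation byte
        return None
    if byte < 0xE0:          # 110xxxxx
        return (1, 5)
    if byte < 0xF0:          # 1110xxxx
        return (2, 4)
    if byte < 0xF8:          # 11110xxx
        return (3, 3)
    if byte < 0xFC:          # 111110xx (beyond standard UTF-8)
        return (4, 2)
    return (5, 2)            # 111111xx (beyond standard UTF-8)


def total_data_bits(byte):
    """Total data bits in the sequence that this first byte starts."""
    info = classify_first_byte(byte)
    if info is None:
        raise ValueError("a continuation byte cannot start a sequence")
    continuation_count, first_bits = info
    return first_bits + 6 * continuation_count
```

For example, `classify_first_byte(0xE2)` gives `(2, 4)`: two continuation bytes follow, for a total of 4 + 6 + 6 = 16 data bits.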

UTF-8 has one unfortunate disadvantage: many 16-bit characters are encoded in 3 bytes. This disadvantage is more than offset by its advantages, and by having a single, simple encoding that can work in all languages and contexts. The benefits range from greater reliability to better security. That is why JSON recommends UTF-8. UTF-8 is the good stuff. Thank you, Ken Thompson.
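
The 3-byte cost can be seen with Python's built-in codecs. This is only a small sketch: the helper name `encoded_sizes` and the sample characters are chosen here for illustration and do not come from the text above.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# U+0041 'A', U+00E9 (e with acute), U+20AC (euro sign), U+4E2D (CJK ideograph)
for ch in ["\u0041", "\u00e9", "\u20ac", "\u4e2d"]:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Every value from 800 through FFFF needs 3 bytes in UTF-8 but fits in a single 16-bit unit, which is the overhead described above. ASCII characters go the other way: 1 byte instead of 2.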

In my own work, I use a formulation that works well with 32-bit characters.

Binary     Byte From  Byte Thru  Range From  Range Thru  Number of Bytes  Total Data Bits
0xxxxxxx   00         7F         00          7F          1                7
10xxxxxx   80         BF         continuation
110xxxxx   C0         DF         80          7FF         2                11
1110xxxx   E0         EF         800         FFFF        3                16
11110xxx   F0         F7         1 0000      1F FFFF     4                21
111110xx   F8         FB         20 0000     3FF FFFF    5                26
111111xx   FC         FF         400 0000    FFFF FFFF   6                32
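
One possible reading of the table as code is sketched below in Python. The names `LAYOUT`, `encode32`, and `decode32` are hypothetical, and this is not the author's implementation. It assumes each value is encoded in the shortest form the table allows, and it follows the table as written, so it does not exclude any values inside the listed ranges.

```python
# (highest value in range, first-byte prefix, number of continuation bytes)
LAYOUT = [
    (0x7F,       0x00, 0),
    (0x7FF,      0xC0, 1),
    (0xFFFF,     0xE0, 2),
    (0x1FFFFF,   0xF0, 3),
    (0x3FFFFFF,  0xF8, 4),
    (0xFFFFFFFF, 0xFC, 5),
]


def encode32(value):
    """Encode an unsigned 32-bit value using the table's byte layout."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    for limit, prefix, count in LAYOUT:
        if value <= limit:
            # The first byte holds the high bits under its prefix.
            first = prefix | (value >> (6 * count))
            # Each continuation byte holds the next 6 bits under 10xxxxxx.
            rest = [0x80 | ((value >> (6 * i)) & 0x3F)
                    for i in range(count - 1, -1, -1)]
            return bytes([first] + rest)


def decode32(data):
    """Decode exactly one sequence produced by encode32."""
    if len(data) == 0:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:
        count, value = 0, first
    elif first < 0xC0:
        raise ValueError("a continuation byte cannot start a sequence")
    elif first < 0xE0:
        count, value = 1, first & 0x1F
    elif first < 0xF0:
        count, value = 2, first & 0x0F
    elif first < 0xF8:
        count, value = 3, first & 0x07
    else:
        # 111110xx and 111111xx both carry 2 data bits.
        count, value = (4 if first < 0xFC else 5), first & 0x03
    if len(data) != count + 1:
        raise ValueError("wrong number of continuation bytes")
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0x3F)
    # Reject sequences that are longer than the table's ranges allow.
    if encode32(value) != bytes(data):
        raise ValueError("not the shortest encoding")
    return value
```

For values up to 1F FFFF the layout is the same as 4-byte UTF-8, so `encode32(0x20AC)` matches `"\u20ac".encode("utf-8")`. The 5-byte and 6-byte rows go beyond what standard UTF-8 decoders accept.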