UTF-8 is one of the smartest things I've seen. It is a byte stream encoding for simple values (like Unicode characters) that can be bigger than bytes. UTF-8 has some wonderful properties:
A first byte can contain between 2 and 7 bits of data. A first byte also determines the number of continuation bytes. Each continuation byte carries 6 bits of data.
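The arithmetic behind that rule can be sketched in a few lines of Python. This is an illustration, not a reference implementation: the function names `first_byte_info` and `total_data_bits` are hypothetical, and the byte ranges assume the 32-bit formulation tabulated in this piece, in which a first byte of FC through FF carries 2 data bits and is followed by 5 continuation bytes.

```python
def first_byte_info(byte):
    """Return (data_bits_in_first_byte, continuation_count).

    Returns None when the byte is itself a continuation byte (10xxxxxx).
    Ranges assume the 32-bit formulation described in this article.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    if byte < 0x80:      # 0xxxxxxx
        return (7, 0)
    if byte < 0xC0:      # 10xxxxxx, a continuation byte, not a first byte
        return None
    if byte < 0xE0:      # 110xxxxx
        return (5, 1)
    if byte < 0xF0:      # 1110xxxx
        return (4, 2)
    if byte < 0xF8:      # 11110xxx
        return (3, 3)
    if byte < 0xFC:      # 111110xx
        return (2, 4)
    return (2, 5)        # 111111xx


def total_data_bits(byte):
    """Data bits in the whole sequence: first-byte bits plus 6 per continuation byte."""
    info = first_byte_info(byte)
    if info is None:
        return None
    bits, count = info
    return bits + 6 * count
```

For example, a first byte of E2 carries 4 bits and announces 2 continuation bytes, so the sequence holds 4 + 6 × 2 = 16 data bits.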
UTF-8 has one unfortunate disadvantage: many 16-bit characters are encoded in 3 bytes. This disadvantage is more than offset by its advantages, and by having a single, simple encoding that can work in all languages and contexts. The benefits range from greater reliability to better security. That is why JSON recommends UTF-8. UTF-8 is the good stuff. Thank you, Ken Thompson.
In my own work, I use a formulation that works well with 32-bit characters.
| Binary | Byte From | Byte Thru | Range From | Range Thru | Number of Bytes | Total Data Bits |
|---|---|---|---|---|---|---|
| 0xxxxxxx | 00 | 7F | 00 | 7F | 1 | 7 |
| 10xxxxxx | 80 | BF | continuation | | | |
| 110xxxxx | C0 | DF | 80 | 7FF | 2 | 11 |
| 1110xxxx | E0 | EF | 800 | FFFF | 3 | 16 |
| 11110xxx | F0 | F7 | 1 0000 | 1F FFFF | 4 | 21 |
| 111110xx | F8 | FB | 20 0000 | 3FF FFFF | 5 | 26 |
| 111111xx | FC | FF | 400 0000 | FFFF FFFF | 6 | 32 |
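
The table can be turned into an encoder and decoder fairly directly. What follows is a minimal sketch under stated assumptions, not the code I use: the names `ROWS`, `encode`, and `decode` are hypothetical, each call handles a single value rather than a stream, and the decoder does not reject overlong forms. For values up to 1F FFFF the bytes it produces match standard UTF-8; the 5 and 6 byte rows follow the 32-bit formulation in the table.

```python
# One entry per table row: (range upper limit, first-byte prefix, continuation count).
ROWS = [
    (0x7F,       0x00, 0),
    (0x7FF,      0xC0, 1),
    (0xFFFF,     0xE0, 2),
    (0x1FFFFF,   0xF0, 3),
    (0x3FFFFFF,  0xF8, 4),
    (0xFFFFFFFF, 0xFC, 5),
]


def encode(value):
    """Encode one unsigned 32-bit value as 1 to 6 bytes."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    # Pick the first row whose range can hold the value.
    for limit, prefix, count in ROWS:
        if value <= limit:
            break
    # The first byte holds the high-order bits under its prefix.
    out = [prefix | (value >> (6 * count))]
    # Each continuation byte is 10xxxxxx with the next 6 bits.
    for i in range(count - 1, -1, -1):
        out.append(0x80 | ((value >> (6 * i)) & 0x3F))
    return bytes(out)


def decode(data):
    """Decode exactly one encoded value from a bytes object."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:
        count, value = 0, first
    elif first < 0xC0:
        raise ValueError("unexpected continuation byte")
    elif first < 0xE0:
        count, value = 1, first & 0x1F
    elif first < 0xF0:
        count, value = 2, first & 0x0F
    elif first < 0xF8:
        count, value = 3, first & 0x07
    else:
        # 111110xx and 111111xx both carry 2 data bits.
        count, value = (4 if first < 0xFC else 5), first & 0x03
    if len(data) != count + 1:
        raise ValueError("wrong number of continuation bytes")
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value
```

As a usage example, `encode(0x20AC)` yields the three bytes E2 82 AC, and `decode` applied to those bytes returns 20AC again.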