2019-11-20

UTF-8 and Kim

Unicode is a 21 bit character set. Its 21-ness seems like an odd choice, resulting from a long series of unintended consequences stretching back to the design of ASCII. There are three popular conventions for how to represent Unicode:

UTF-32, where each complete codepoint is placed in its own 32 bit word. This is the best representation from the perspective of programs that correctly manipulate texts, but it is unpopular because it seems wasteful.
UTF-16, where each codepoint is split into one or two codeunits. Each codeunit is placed in a 16 bit word. This is more memory efficient if most characters are situated in the Basic Multilingual Plane, but it is more error prone.
UTF-8, where each codepoint is split into a variable number of bytes. This is the preferred form to put on the wire, even though it makes CJK characters 50% bigger than UTF-16 does (three bytes instead of two).

UTF-8 was designed by Ken Thompson. It has some very nice properties:

ASCII characters are unchanged. This is much liked by American programmers because it makes it easier to ignore the rest of the world.
Given a point in the middle of a text, it is possible to find the beginning of a multibyte character.
It sorts: comparing the encoded bytes preserves the Unicode codepoint sequence. The Unicode collating sequence is far from ideal, so this is perhaps not important.

UTF-8 has some disadvantages. It is a little complicated in encoding and decoding. It represents ASCII very efficiently, but for all of the other characters, it only delivers about 5 bits per byte. UTF-8 can have aliases, overlong forms in which a character is encoded in more bytes than necessary, so additional rules and policies are required to forbid them. This is an attractive nuisance; some systems intentionally violate the rules.

I propose another representation for the wire called Kim. Kim is a very simple encoding that delivers 7 bits per byte. The bottom 7 bits of each byte contain data, which can be accumulated to produce 21 bit characters. The top bit of each byte is 1 if the byte is not the last byte of a character. The top bit of each byte is 0 if the byte is the last byte of a character, contributing the least significant 7 bits. This gives up the sorting property of UTF-8 in exchange for greater simplicity and performance.
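The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `kim_encode_codepoint`, `kim_encode`, and `kim_decode` are invented here, and the range and truncation checks are assumptions about what a careful decoder would want.

```python
def kim_encode_codepoint(cp: int) -> bytes:
    """Encode one codepoint (0..0x10FFFF) as 1 to 3 Kim bytes."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("codepoint out of range")
    out = bytearray()
    if cp > 0x3FFF:
        # Most significant 7 bits, top bit set: more bytes follow.
        out.append(0x80 | (cp >> 14))
    if cp > 0x7F:
        # Middle 7 bits, top bit set: more bytes follow.
        out.append(0x80 | ((cp >> 7) & 0x7F))
    # Least significant 7 bits, top bit clear: last byte of the character.
    out.append(cp & 0x7F)
    return bytes(out)


def kim_encode(text: str) -> bytes:
    """Encode a whole string, one codepoint at a time."""
    return b"".join(kim_encode_codepoint(ord(ch)) for ch in text)


def kim_decode(data: bytes) -> str:
    """Accumulate 7 bits per byte until a byte with a clear top bit."""
    chars = []
    cp = 0
    pending = False  # True while we are in the middle of a character
    for byte in data:
        cp = (cp << 7) | (byte & 0x7F)
        if cp > 0x10FFFF:
            raise ValueError("value exceeds the Unicode range")
        if byte & 0x80:
            pending = True
        else:
            chars.append(chr(cp))
            cp = 0
            pending = False
    if pending:
        raise ValueError("truncated Kim sequence")
    return "".join(chars)


if __name__ == "__main__":
    sample = "héllo, 世界 😀"
    encoded = kim_encode(sample)
    print(encoded.hex(" "))
    print(kim_decode(encoded) == sample)
```

The decoder needs no lookup of a length from the first byte: it shifts and ors until it sees a byte below 0x80.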

Number of bytes	Highest codepoint (Kim)	Highest codepoint (UTF-8)
1	`U+007F`	`U+007F`
2	`U+3FFF`	`U+07FF`
3	`U+10FFFF`	`U+FFFF`
4		`U+10FFFF`

Characters in the ranges U+0800 thru U+3FFF and U+10000 thru U+10FFFF will be one byte smaller when encoded in Kim compared to UTF-8.
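The size comparison can be checked at the range boundaries with a short script. This is only a sketch: `kim_length`, `utf8_length`, and `bytes_saved` are hypothetical helper names, and the UTF-8 length is taken from Python's built-in encoder rather than computed by hand.

```python
def kim_length(cp: int) -> int:
    """Number of Kim bytes for a codepoint: 7 data bits per byte."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x3FFF:
        return 2
    return 3


def utf8_length(cp: int) -> int:
    """Number of UTF-8 bytes, as produced by Python's own encoder."""
    return len(chr(cp).encode("utf-8"))


def bytes_saved(cp: int) -> int:
    """How many bytes Kim saves over UTF-8 for this codepoint."""
    return utf8_length(cp) - kim_length(cp)


if __name__ == "__main__":
    # Codepoints on either side of each boundary in the table.
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0x3FFF, 0x4000,
               0xFFFF, 0x10000, 0x10FFFF):
        print(f"U+{cp:04X}: Kim {kim_length(cp)}, "
              f"UTF-8 {utf8_length(cp)}, saved {bytes_saved(cp)}")
```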

Kim is beneficial when using scripts such as Aramaic, Avestan, Balinese, Batak, Bopomofo, Buginese, Buhid, Carian, Cherokee, Coptic, Cyrillic, Deseret, Egyptian Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Hangul Jamo, Hanunoo, Hiragana, Kanbun, Kaithi, Kannada, Katakana, Kharoshthi, Khmer, Lao, Lepcha, Limbu, Lycian, Lydian, Malayalam, Mandaic, Meroitic, Miao, Mongolian, Myanmar, New Tai Lue, Ol Chiki, Old South Arabian, Old Turkic, Oriya, Osmanya, Pahlavi, Parthian, Phags-Pa, Phoenician, Samaritan, Sharada, Sinhala, Sora Sompeng, Tagalog, Tagbanwa, Takri, Tai Le, Tai Tham, Tamil, Telugu, Thai, Tibetan, Tifinagh, and Unified Canadian Aboriginal Syllabics.

As with UTF-8, it is possible to detect character boundaries within a byte sequence. A byte is a first byte if it begins the sequence or if the preceding byte has a top bit of zero.
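A backward scan is enough to apply that test from an arbitrary position. The sketch below assumes the data is well-formed Kim; `kim_char_start` is a name made up for this example.

```python
def kim_char_start(data: bytes, index: int) -> int:
    """Return the index of the first byte of the character containing data[index].

    Steps backward while the preceding byte has its top bit set, because
    such a byte belongs to the same character as the byte after it.
    """
    if not 0 <= index < len(data):
        raise IndexError("index out of range")
    while index > 0 and data[index - 1] & 0x80:
        index -= 1
    return index


if __name__ == "__main__":
    # 'A' (1 byte), U+4000 (3 bytes), U+3FFF (2 bytes)
    data = b"A" + b"\x81\x80\x00" + b"\xff\x7f"
    print([kim_char_start(data, i) for i in range(len(data))])
```

Since a character never takes more than three bytes, the loop steps back at most twice on valid input.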

Kim can also be used to represent block lengths, large integers, and other transmission values. Applications that require negative numbers may use a leading 0x80 to represent the minus sign. A leading 0x80 must not be immediately followed by another 0x80 or by 0x00.
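One way to read the integer rule is as sign and magnitude: a negative number is a single 0x80 followed by the Kim encoding of its absolute value. That reading is an assumption, not something the text spells out, and `kim_encode_int` and `kim_decode_int` are hypothetical names. Under it, the two forbidden sequences correspond to a doubled minus sign and to negative zero.

```python
def kim_encode_int(n: int) -> bytes:
    """Encode an integer of any size, 7 bits per byte, most significant first.

    Assumption: negatives are a leading 0x80 followed by the magnitude.
    """
    sign = b"\x80" if n < 0 else b""
    n = abs(n)
    groups = [n & 0x7F]            # last byte: top bit clear
    n >>= 7
    while n:
        groups.append(0x80 | (n & 0x7F))  # earlier bytes: top bit set
        n >>= 7
    return sign + bytes(reversed(groups))


def kim_decode_int(data: bytes) -> int:
    """Decode exactly one integer from data."""
    if not data or data[-1] & 0x80:
        raise ValueError("truncated Kim integer")
    if any(byte & 0x80 == 0 for byte in data[:-1]):
        raise ValueError("more than one value in data")
    negative = data[0] == 0x80
    if negative and data[1] in (0x80, 0x00):
        raise ValueError("leading 0x80 followed by 0x80 or 0x00")
    value = 0
    for byte in data:
        # A leading 0x80 contributes seven zero bits, so it needs no special case.
        value = (value << 7) | (byte & 0x7F)
    return -value if negative else value


if __name__ == "__main__":
    for n in (0, 127, 128, 300, -1, -300, 2**64):
        encoded = kim_encode_int(n)
        print(n, encoded.hex(" "), kim_decode_int(encoded))
```

A canonical magnitude never begins with 0x80, since that would be a leading group of zero bits, so a leading 0x80 is free to act as the sign.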

UTF-8 is one of the world's great inventions. While Kim is simpler and more efficient, it is not clear that it is worth the expense of transition. But it is ideal in new applications.