CESU-8

CESU-8 (short for Compatibility Encoding Scheme for UTF-16: 8-Bit) is a variant of UTF-8 described in Unicode Technical Report #26. A code point is first expressed in UTF-16, and each resulting 16-bit code unit is then encoded with the UTF-8 scheme as if it were a UCS-2 character.

Encoding

CESU-8-encoded text arises when a UTF-8 encoder ignores the fact that its input is UTF-16 encoded, whether out of ignorance or because the program code dates from the time when Unicode was only a 16-bit character set.

For characters in the Basic Multilingual Plane (BMP, code points up to 65,535), UTF-8 and CESU-8 are identical. UTF-16 represents each character outside the BMP by two 16-bit values (a surrogate pair) taken from the range D800 hex to DFFF hex, which is reserved for this purpose. If these two values are converted to UTF-8 individually, the result is two 3-byte sequences in the range ED A0 xx to ED BF xx, which cannot occur in regular UTF-8. A correct UTF-8 encoder, by contrast, must first recognize and decode the UTF-16 encoding of its input (which can yield code points above 65,535) and only then perform the UTF-8 encoding, in which values above 65,535 are encoded as 4-byte sequences that start with F0 hex to F4 hex.
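The procedure described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name cesu8_encode is hypothetical, and the sketch relies on Python's "surrogatepass" error handler to let the UTF-8 codec emit the 3-byte form of a single surrogate.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper, for illustration only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            # BMP characters: CESU-8 and UTF-8 are identical.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary characters: split into a UTF-16 surrogate pair ...
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            # ... and encode each surrogate separately as a 3-byte sequence.
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


sample = "E\u0205\U00010400"
print(cesu8_encode(sample).hex(" "))     # CESU-8 bytes
print(sample.encode("utf-8").hex(" "))   # regular UTF-8 bytes, for comparison
```

For text that contains only BMP characters, the two printed byte sequences are the same; they differ only where a supplementary character occurs.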

Use

Because this "incorrect UTF-8 encoding" had become fairly widespread, the Unicode Consortium standardized it after the fact, albeit under the new name CESU-8. CESU-8 is expressly not recommended as a data interchange format; it is intended only as an internal format where compatibility with UTF-16 is required.

CESU-8 is used, for example, by the Oracle database software: version 8 introduced a character set named "UTF8", which in reality corresponds to the CESU-8 encoding. Version 9.0 introduced a correct UTF-8 character set, which was named "AL32UTF8" in order to preserve compatibility with existing, older databases.

Example

Encoding	U+0045	U+0205	U+10400
UTF-8	45	C8 85	F0 90 90 80
UTF-16	0045	0205	D801 DC00
CESU-8	45	C8 85	ED A0 81 ED B0 80
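The CESU-8 row of the table can be turned back into text with a short Python sketch. The function name cesu8_decode is hypothetical, and the approach is deliberately lenient: it lets the UTF-8 codec pass surrogates through with the "surrogatepass" error handler and then joins surrogate pairs by a round trip through UTF-16. Unlike a strict CESU-8 decoder, it would also accept regular 4-byte UTF-8 sequences.

```python
def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes to text (lenient, hypothetical helper)."""
    # Step 1: decode the 3-byte sequences; surrogates stay as lone code points.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: write the code units out as UTF-16, then read them back so that
    # each high/low surrogate pair is combined into one supplementary character.
    utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le")


cesu8_bytes = bytes.fromhex("45c885eda081edb080")
text = cesu8_decode(cesu8_bytes)
print([f"U+{ord(ch):04X}" for ch in text])
```

The second step raises UnicodeDecodeError if the input contains an unpaired surrogate, which is the desired behaviour for malformed data in this sketch.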

Same example with binary representation

Encoding		Hexadecimal	Binary	Unicode code point
UTF-8		45	0100 0101	U+0045 (E, Latin capital letter E)
UTF-16		00 45	0000 0000 0100 0101
CESU-8		45	0100 0101
UTF-8		C8 85	1100 1000 1000 0101	U+0205 (ȅ, Latin small letter E with double grave)
UTF-16		02 05	0000 0010 0000 0101
CESU-8		C8 85	1100 1000 1000 0101
UTF-8		F0 90 90 80	1111 0000 1001 0000 1001 0000 1000 0000	U+10400 (𐐀, Deseret capital letter long I)
UTF-16	High surrogate	D8 01	1101 1000 0000 0001
UTF-16	Low surrogate	DC 00	1101 1100 0000 0000
CESU-8	High	ED A0 81	1110 1101 1010 0000 1000 0001
CESU-8	Low	ED B0 80	1110 1101 1011 0000 1000 0000

Legend
0100 0101 etc.	Data bits
10000 hex	Size of plane 0, the Basic Multilingual Plane (subtracted for UTF-16 encoding)
110110	UTF-16 high-surrogate coding bits
110111	UTF-16 low-surrogate coding bits
110, 1110, 11110, 10	UTF-8 coding bits
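The bit patterns in the table and legend can be reproduced by hand with a few shifts and masks. The following Python sketch does this for U+10400 without using any codec; the helper names three_byte_utf8 and bits are hypothetical and exist only for this illustration.

```python
cp = 0x10400

# UTF-16: subtract the size of the BMP, then split the remaining 20 bits
# into two 10-bit halves and prefix them with 110110 and 110111.
v = cp - 0x10000
high = 0xD800 | (v >> 10)    # 110110 + upper 10 bits
low = 0xDC00 | (v & 0x3FF)   # 110111 + lower 10 bits


def three_byte_utf8(unit: int) -> bytes:
    """Encode one 16-bit value (0x0800..0xFFFF) with the 3-byte UTF-8 pattern."""
    return bytes([
        0xE0 | (unit >> 12),          # 1110 + top 4 bits
        0x80 | ((unit >> 6) & 0x3F),  # 10 + middle 6 bits
        0x80 | (unit & 0x3F),         # 10 + bottom 6 bits
    ])


def bits(data: bytes) -> str:
    """Format bytes as space-separated 8-bit binary strings."""
    return " ".join(format(b, "08b") for b in data)


print(f"high surrogate {high:04X} -> {three_byte_utf8(high).hex(' ')} -> {bits(three_byte_utf8(high))}")
print(f"low surrogate  {low:04X} -> {three_byte_utf8(low).hex(' ')} -> {bits(three_byte_utf8(low))}")
```

Concatenating the two 3-byte results gives the six-byte CESU-8 form of the character.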

Web links

Unicode Technical Report #26