CESU-8
CESU-8 (short for Compatibility Encoding Scheme for UTF-16: 8-Bit) is a variant of UTF-8 described in Unicode Technical Report #26. A code point is first expressed in UTF-16, and each resulting 16-bit code unit is then encoded in UTF-8 as if it were a UCS-2 character.
Encoding
CESU-8-encoded text arises when a UTF-8 encoder does not take into account that its input is UTF-16-encoded, whether through oversight or because the program code dates from the time when Unicode was still a purely 16-bit character set.
For characters in the Basic Multilingual Plane (BMP; code points up to 65,535, i.e. U+FFFF), UTF-8 and CESU-8 are identical. Characters outside the BMP are represented in UTF-16 by two 16-bit values each, a surrogate pair taken from the range D800 hex to DFFF hex, which is reserved for this purpose. If these two values are converted to UTF-8 individually, the result is two 3-byte sequences in the range ED A0 xx ... ED BF xx, which cannot occur in standard UTF-8. A correct UTF-8 encoder, by contrast, must first recognize and decode the surrogate pairs in its UTF-16 input (yielding code points above 65,535) and only then perform the UTF-8 encoding, in which code points above 65,535 are encoded as 4-byte sequences whose first byte lies between F0 hex and F4 hex.
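The encoding steps described above can be sketched in Python. This is a minimal illustration, not a reference implementation; the names `cesu8_encode` and `_utf8_unit` are hypothetical and do not come from any library.

```python
def _utf8_unit(u: int) -> bytes:
    """Encode one 16-bit value (0..0xFFFF) with the 1- to 3-byte UTF-8 patterns."""
    if u < 0x80:
        return bytes([u])
    if u < 0x800:
        return bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
    return bytes([0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)])


def cesu8_encode(text: str) -> bytes:
    """Encode a string as CESU-8 (hypothetical helper, for illustration only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP character: one UTF-16 code unit, same bytes as UTF-8.
            units = [cp]
        else:
            # Supplementary character: split into a UTF-16 surrogate pair.
            v = cp - 0x10000
            units = [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
        for u in units:
            # Each code unit is encoded on its own, as if it were UCS-2.
            out += _utf8_unit(u)
    return bytes(out)


print(cesu8_encode("E\u0205\U00010400").hex())
```

For the string consisting of U+0045, U+0205 and U+10400, the sketch produces the bytes 45 C8 85 ED A0 81 ED B0 80, whereas Python's built-in `"utf-8"` codec produces 45 C8 85 F0 90 90 80 for the same string.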
Use
Because this “wrong UTF-8 encoding” had become widespread, the Unicode Consortium subsequently standardized it, albeit under the new name CESU-8. CESU-8 is expressly not recommended as a data exchange format, but only as an internal format where compatibility with UTF-16 is required.
CESU-8 is used, for example, by the Oracle database software: version 8 introduced a character set named "UTF8" which in reality corresponds to the CESU-8 encoding. Version 9.0 introduced a correct UTF-8 character set, which was named "AL32UTF8" in order to preserve compatibility with existing, older databases.
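Because CESU-8 is meant only as an internal format, data usually has to be converted to standard UTF-8 before it is exchanged. The following is a rough sketch of such a conversion using only Python's standard codecs; the function name `cesu8_to_utf8` is an assumption made for this example, and the approach is lenient rather than a strict CESU-8 validator (it also accepts input that already contains 4-byte UTF-8 sequences).

```python
def cesu8_to_utf8(data: bytes) -> bytes:
    """Convert CESU-8 bytes to standard UTF-8 (hypothetical helper)."""
    # 'surrogatepass' makes the UTF-8 codec accept the ED A0..BF xx sequences
    # and return a str that contains the individual surrogate code points.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins each high/low surrogate pair into one
    # supplementary code point; an unpaired surrogate raises UnicodeDecodeError.
    text = with_surrogates.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return text.encode("utf-8")


cesu8 = bytes.fromhex("45 C8 85 ED A0 81 ED B0 80")
print(cesu8_to_utf8(cesu8).hex())
```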
Example
Encoding | U+0045 | U+0205 | U+10400 |
---|---|---|---|
UTF-8 | 45 | C8 85 | F0 90 90 80 |
UTF-16 | 0045 | 0205 | D801 DC00 |
CESU-8 | 45 | C8 85 | ED A0 81 ED B0 80 |
Same example with binary representation
Unicode code point | Encoding | Hexadecimal | Binary |
---|---|---|---|
U+0045 (E, Latin capital letter E) | UTF-8 | 45 | 0100 0101 |
U+0045 | UTF-16 | 00 45 | 0000 0000 0100 0101 |
U+0045 | CESU-8 | 45 | 0100 0101 |
U+0205 (ȅ, Latin small letter E with double grave) | UTF-8 | C8 85 | 1100 1000 1000 0101 |
U+0205 | UTF-16 | 02 05 | 0000 0010 0000 0101 |
U+0205 | CESU-8 | C8 85 | 1100 1000 1000 0101 |
U+10400 (𐐀, Deseret capital letter long I) | UTF-8 | F0 90 90 80 | 1111 0000 1001 0000 1001 0000 1000 0000 |
U+10400 | UTF-16, high surrogate | D8 01 | 1101 1000 0000 0001 |
U+10400 | UTF-16, low surrogate | DC 00 | 1101 1100 0000 0000 |
U+10400 | CESU-8, high surrogate | ED A0 81 | 1110 1101 1010 0000 1000 0001 |
U+10400 | CESU-8, low surrogate | ED B0 80 | 1110 1101 1011 0000 1000 0000 |
Legend | |
---|---|
0100 0101 etc. | Data bits |
10000 hex | Size of plane 0, the Basic Multilingual Plane (subtracted from the code point for UTF-16 encoding) |
110110 | UTF-16 high-surrogate coding bits |
110111 | UTF-16 low-surrogate coding bits |
110, 1110, 11110, 10 | UTF-8 coding bits (leading bits of each byte) |
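The bit patterns in the tables above can be reproduced with a few lines of Python. This is only an illustrative sketch for the single code point U+10400; the helper `bits` is a name chosen for this example, and the `surrogatepass` error handler is used to let Python's UTF-8 codec encode the two surrogates individually.

```python
cp = 0x10400

# UTF-16: subtract 10000 hex, then split the remaining 20 bits in half.
v = cp - 0x10000                  # 0x00400
high = 0xD800 | (v >> 10)         # upper 10 bits after the prefix 110110
low = 0xDC00 | (v & 0x3FF)        # lower 10 bits after the prefix 110111


def bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit binary groups."""
    return " ".join(f"{b:08b}" for b in data)


# Standard UTF-8 encodes the code point itself in 4 bytes.
utf8 = chr(cp).encode("utf-8")
# CESU-8 encodes each surrogate separately in 3 bytes.
cesu8 = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")

print(f"UTF-16: {high:04X} {low:04X}")
print(f"UTF-8:  {bits(utf8)}")
print(f"CESU-8: {bits(cesu8)}")
```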