CESU-8

from Wikipedia, the free encyclopedia

CESU-8 (short for Compatibility Encoding Scheme for UTF-16: 8-Bit) is a variant of UTF-8 described in Unicode Technical Report #26. A code point is first expressed in UTF-16, and the resulting 16-bit code units are then encoded in UTF-8 as if they were UCS-2 values.

Encoding

CESU-8-encoded text arises when a UTF-8 encoder does not take into account that its input data is UTF-16-encoded, whether out of ignorance or because the program code dates from the time when Unicode was only a 16-bit character set.

For characters in the Basic Multilingual Plane (BMP, code points up to 65,535), UTF-8 and CESU-8 are identical. Characters outside the BMP are represented in UTF-16 by two 16-bit values each (surrogates, taken from the range D800 hex to DFFF hex reserved for this purpose). If these two values are converted to UTF-8 individually, the result is two 3-byte sequences in the range ED A0 xx to ED BF xx, which cannot occur in normal UTF-8. A correct UTF-8 encoder, by contrast, must first recognize and decode the UTF-16 encoding of the input data (which can yield code values > 65,535) and only then perform the UTF-8 encoding, in which values > 65,535 are encoded as 4-byte sequences that start with F0 hex to F4 hex.
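The procedure described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `cesu8_encode` is hypothetical, and the sketch relies on Python's `surrogatepass` error handler to make the built-in UTF-8 codec emit the 3-byte form of each surrogate.

```python
import struct


def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper, for illustration only)."""
    # Step 1: express every code point in UTF-16 (big-endian, no BOM).
    utf16 = text.encode("utf-16-be")
    # Step 2: split the result into 16-bit code units.
    units = struct.unpack(f">{len(utf16) // 2}H", utf16)
    # Step 3: encode each unit as if it were a UCS-2 value. "surrogatepass"
    # allows surrogates (D800..DFFF) to be written as 3-byte sequences.
    return b"".join(chr(unit).encode("utf-8", "surrogatepass") for unit in units)


sample = "E\u0205\U00010400"
print("CESU-8:", cesu8_encode(sample).hex(" "))
print("UTF-8: ", sample.encode("utf-8").hex(" "))
```

For BMP-only input the two printed byte sequences are identical; they differ only where the input contains a character outside the BMP.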

Use

Because this “incorrect UTF-8 encoding” had become fairly widespread, the Unicode Consortium subsequently standardized it, albeit under the new name CESU-8. CESU-8 is expressly not recommended as a data interchange format, only as an internal format where compatibility with UTF-16 is required.

CESU-8 is used, for example, by the Oracle database software: version 8 introduced a character set named "UTF8", which in reality corresponds to the CESU-8 encoding. Version 9.0 introduced a correct UTF-8 character set, which was named "AL32UTF8" in order to preserve compatibility with existing, older databases.

Example

Encoding   Unicode code point
           U+0045   U+0205   U+10400
UTF-8      45       C8 85    F0 90 90 80
UTF-16     0045     0205     D801 DC00
CESU-8     45       C8 85    ED A0 81 ED B0 80
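The CESU-8 bytes in this example can be used to illustrate why such data is not valid UTF-8. The following Python sketch first shows a strict UTF-8 decoder rejecting them and then recovers the original text. The helper name `cesu8_decode` is hypothetical, and the decoder is deliberately lenient: it would also accept regular 4-byte UTF-8 sequences, which a strict CESU-8 decoder should not.

```python
def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes to text (hypothetical, lenient helper)."""
    # Step 1: decode the 3-byte sequences, keeping each surrogate as its
    # own code point instead of raising an error.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through UTF-16 so that each high/low surrogate
    # pair is merged into a single code point outside the BMP.
    return with_surrogates.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


cesu8_bytes = bytes.fromhex("45 C8 85 ED A0 81 ED B0 80")

try:
    cesu8_bytes.decode("utf-8")  # strict UTF-8 decoding
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

decoded = cesu8_decode(cesu8_bytes)
print("accepted by strict UTF-8 decoder:", strict_ok)
print("code points:", [f"U+{ord(ch):04X}" for ch in decoded])
```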

Same example with binary representation

Encoding                Hexadecimal   Binary                                       Unicode code point
UTF-8                   45            0100 0101                                    U+0045 (E, Latin capital letter E)
UTF-16                  00 45         0000 0000 0100 0101
CESU-8                  45            0100 0101
UTF-8                   C8 85         110 0 1000 10 00 0101                        U+0205 (ȅ, Latin small letter E with double grave)
UTF-16                  02 05         0000 0010 0000 0101
CESU-8                  C8 85         110 0 1000 10 00 0101
UTF-8                   F0 90 90 80   11110 000 10 0 1 0000 10 01 0000 10 00 0000  U+10400 (𐐀, Deseret capital letter long I)
UTF-16 high surrogate   D8 01         110110 00 0000 0001
UTF-16 low surrogate    DC 00         110111 00 0000 0000
CESU-8 high surrogate   ED A0 81      1110 1101 10 10 0000 10 00 0001
CESU-8 low surrogate    ED B0 80      1110 1101 10 11 0000 10 00 0000
Legend
0100 0101 etc.         Data bits
10000 hex              Size of plane 0, the Basic Multilingual Plane (subtracted for UTF-16 encoding)
110110                 UTF-16 high-surrogate marker bits
110111                 UTF-16 low-surrogate marker bits
110, 1110, 11110, 10   UTF-8 marker bits
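The bit layout shown above can be reproduced step by step with plain integer arithmetic. The Python sketch below is an illustration only; the helper names `to_surrogates` and `three_byte_utf8` are hypothetical. It subtracts 10000 hex, splits the remaining 20 bits into two halves of 10 bits, adds the surrogate marker bits, and then packs each surrogate into the 3-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point outside the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP have surrogates")
    offset = code_point - 0x10000          # 20 data bits remain
    high = 0xD800 | (offset >> 10)         # marker 110110 + upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)        # marker 110111 + lower 10 bits
    return high, low


def three_byte_utf8(unit: int) -> bytes:
    """Pack a 16-bit value into the 3-byte UTF-8 bit pattern."""
    return bytes([
        0xE0 | (unit >> 12),               # 1110 + top 4 bits
        0x80 | ((unit >> 6) & 0x3F),       # 10 + middle 6 bits
        0x80 | (unit & 0x3F),              # 10 + bottom 6 bits
    ])


high, low = to_surrogates(0x10400)
cesu8 = three_byte_utf8(high) + three_byte_utf8(low)
print(f"surrogates: {high:04X} {low:04X}")
print("CESU-8:", cesu8.hex(" ").upper())
```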

Web links