UTF-16
UTF-16 (Universal Multiple-Octet Coded Character Set (UCS) Transformation Format for 16 Planes of Group 00) is a variable-length encoding for Unicode characters. UTF-16 is optimized for the frequently used characters of the Basic Multilingual Plane (BMP). It is the oldest of the Unicode encoding formats.
General
In UTF-16 encoding, each Unicode character is assigned a specially encoded sequence of one or two 16-bit code units, i.e. two or four bytes, so that, as with the other UTF formats, all Unicode characters can be represented.
While UTF-8 is of central importance in Internet protocols, UTF-16 is used in many places for the internal representation of strings, e.g. in current versions of .NET, Java and Tcl.
Properties
Because all BMP characters are encoded in two bytes, UTF-16 requires twice as much space as a suitable ISO 8859 encoding or UTF-8 for texts consisting mainly of Latin letters. If, however, many BMP characters beyond code point U+007F are encoded, UTF-16 requires a comparable amount of space to UTF-8, or less.
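This size comparison can be illustrated in Python (the sample strings below are my own, chosen to cover the three cases):

```python
# Compare encoded sizes of the same text in UTF-8 and UTF-16.
samples = {
    "Latin":  "Hello, world",  # ASCII only: UTF-16 doubles the size
    "Greek":  "αβγδε",         # U+0080..U+07FF: two bytes per character in both
    "CJK":    "漢字文化圏",     # U+0800..U+FFFF: UTF-8 needs 3 bytes, UTF-16 only 2
}
for name, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-be"))
    print(f"{name}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

For the Latin sample UTF-16 needs twice the bytes, for the Greek sample both are equal, and for the CJK sample UTF-16 is smaller.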
In contrast to UTF-8, there is no encoding reserve. If a UTF-16-encoded text is interpreted as ISO 8859-1, all letters contained in that encoding can be recognized, but separated by null bytes; with other ISO 8859 encodings the compatibility is worse.
Standardization
UTF-16 is defined by both the Unicode Consortium and ISO/IEC 10646. Unicode defines additional semantics; a more precise comparison can be found in Appendix C of the Unicode 4.0 standard. The ISO standard also defines a UCS-2 encoding, which, however, permits only 16-bit representations of the BMP.
Encoding
Characters in the BMP
The 65,536 Unicode code points U+0000 to U+FFFF of the BMP are each mapped directly to a single 16-bit code unit, i.e. two bytes.
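A minimal sketch in Python (the helper function is my own, not a standard API): for a BMP character outside the surrogate range, the UTF-16BE byte sequence is simply the 16-bit code point value itself.

```python
import struct

def utf16be_bmp(ch):
    """Encode a single BMP character (outside the surrogate range) as UTF-16BE."""
    cp = ord(ch)
    assert cp <= 0xFFFF and not 0xD800 <= cp <= 0xDFFF
    return struct.pack(">H", cp)  # one big-endian 16-bit code unit

print(utf16be_bmp("€").hex())  # 20ac, identical to the code point U+20AC
```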
Characters outside the BMP
Unicode characters outside the BMP (i.e. U+10000 to U+10FFFF) are each represented by a pair of 16-bit code units that belong together, i.e. four bytes.
To do this, the number 65536 (hex 10000, the size of the BMP) is first subtracted from the character's code point (here called U), which yields a 20-bit number U′ in the range from hex 00000 to hex FFFFF. This number is then split into two blocks of 10 bits each:
- the first block (the 10 high-order bits of U′) is prefixed with the bit sequence 110110; the resulting 16-bit code unit of two bytes is called the high surrogate;
- the second block (the 10 low-order bits of U′) is prefixed with the bit sequence 110111; the resulting 16-bit code unit of two bytes is called the low surrogate.
The following code ranges are reserved specifically for such surrogates, i.e. UTF-16 replacement code units, and therefore contain no independent characters:
- U+D800 to U+DBFF (2^10 = 1024 high surrogates)
- U+DC00 to U+DFFF (2^10 = 1024 low surrogates).
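The steps above can be sketched in Python (the function name is mine, not from any standard library):

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    u = cp - 0x10000              # 20-bit value U'
    high = 0xD800 | (u >> 10)     # prefix 110110 + upper 10 bits
    low  = 0xDC00 | (u & 0x3FF)   # prefix 110111 + lower 10 bits
    return high, low

print([hex(w) for w in to_surrogates(0x1D11E)])  # ['0xd834', '0xdd1e']
```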
When converting UTF-16-encoded strings into UTF-8 byte sequences, note that pairs of high and low surrogates must first be recombined into a single Unicode code point before that code point is converted into a UTF-8 byte sequence (see the example in the description of UTF-8). Since this is often not taken into account, a different, incompatible encoding of the surrogate code units became established, which was subsequently standardized as CESU-8.
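The difference can be demonstrated in Python. This is a sketch only; Python ships no CESU-8 codec, so the helper below is my own: it encodes each surrogate separately, as if it were a BMP character, yielding six bytes instead of the correct four-byte UTF-8 sequence.

```python
def cesu8(cp):
    """Encode a supplementary code point the CESU-8 way: split it into
    surrogates, then give each 16-bit unit the three-byte UTF-8 pattern
    1110xxxx 10xxxxxx 10xxxxxx."""
    u = cp - 0x10000
    out = bytearray()
    for unit in (0xD800 | (u >> 10), 0xDC00 | (u & 0x3FF)):
        out += bytes([0xE0 | unit >> 12,
                      0x80 | (unit >> 6) & 0x3F,
                      0x80 | unit & 0x3F])
    return bytes(out)

print("\U0001D11E".encode("utf-8").hex())  # f09d849e (correct UTF-8, 4 bytes)
print(cesu8(0x1D11E).hex())                # eda0b4edb49e (CESU-8, 6 bytes)
```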
Byte order
Depending on which of the two bytes of a 16-bit code unit is transmitted or stored first, one speaks of big-endian (UTF-16BE) or little-endian (UTF-16LE). Regardless of the byte order, the high surrogate always precedes the low surrogate.
For ASCII characters translated to UTF-16, this means that the added null byte (the more significant byte) comes first with big-endian and last with little-endian.
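In Python this is directly visible:

```python
# For the ASCII letter "y" (U+0079), the null byte's position depends on byte order:
print("y".encode("utf-16-be"))  # b'\x00y'  -> null byte first (big-endian)
print("y".encode("utf-16-le"))  # b'y\x00'  -> null byte last  (little-endian)
```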
With protocols that do not adequately specify the byte order, it is recommended to place the Unicode character U+FEFF (BOM, byte order mark), which represents a zero-width no-break space, at the beginning of the data stream. If the receiver instead reads it as the invalid code point U+FFFE (a noncharacter), the byte order differs between sender and receiver, and the bytes of each 16-bit code unit must be swapped at the receiver in order to evaluate the subsequent data stream correctly.
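A minimal sketch of BOM-based byte-order detection (the helper name is mine):

```python
def detect_utf16_order(data):
    """Inspect the first two bytes of a stream for a byte order mark (U+FEFF)."""
    if data[:2] == b"\xfe\xff":
        return "utf-16-be"
    if data[:2] == b"\xff\xfe":
        return "utf-16-le"
    return None  # no BOM: the byte order must be known from elsewhere

stream = "\ufeffText".encode("utf-16-le")  # sender prepends U+FEFF
order = detect_utf16_order(stream)         # -> 'utf-16-le'
print(stream[2:].decode(order))            # -> Text
```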
Examples
The following table gives some encoding examples for UTF-16:
Character | Unicode | Unicode binary | UTF-16BE binary | UTF-16BE hexadecimal |
---|---|---|---|---|
Letter y | U+0079 | 00000000 01111001 | 00000000 01111001 | 00 79 |
Letter ä | U+00E4 | 00000000 11100100 | 00000000 11100100 | 00 E4 |
Euro sign € | U+20AC | 00100000 10101100 | 00100000 10101100 | 20 AC |
Treble clef 𝄞 | U+1D11E | 00000001 11010001 00011110 | 11011000 00110100 11011101 00011110 | D8 34 DD 1E |
CJK ideograph | U+24F5C | 00000010 01001111 01011100 | 11011000 01010011 11011111 01011100 | D8 53 DF 5C |
The last two examples lie outside the BMP. Since many fonts do not yet contain these newer Unicode ranges, the characters they contain cannot be displayed correctly on many platforms; a replacement character is shown instead as a placeholder. As the examples show, subtracting hex 10000 changes only one or two bits, and the surrogates are then formed from the resulting bit pattern.
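The table rows can be checked in Python, whose UTF-16 codec also decodes a surrogate pair back to the original character:

```python
# Re-encode the table's example characters and compare with the listed bytes.
examples = {"y": "0079", "ä": "00e4", "€": "20ac",
            "\U0001D11E": "d834dd1e", "\U00024F5C": "d853df5c"}
for ch, expected in examples.items():
    assert ch.encode("utf-16-be").hex() == expected

# Decoding reverses the process, recombining the surrogate pair:
print(bytes.fromhex("d834dd1e").decode("utf-16-be"))  # the treble clef, U+1D11E
```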
Example calculation of surrogates

All numbers below are given in base 16.

For a Unicode code point U (with U ≥ 10000):

U′ = U − 10000
SG-Word1 = U′ div 400 + D800 (high surrogate)
SG-Word2 = U′ mod 400 + DC00 (low surrogate)

Example for U = 64321:

U′ = 64321 − 10000 = 54321
SG-Word1 = 54321 div 400 + D800 = 150 + D800 = D950
SG-Word2 = 54321 mod 400 + DC00 = 321 + DC00 = DF21
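The same base-16 arithmetic, sketched in Python and cross-checked against the built-in codec:

```python
U = 0x64321                    # code point from the example above
u = U - 0x10000                # U' = 54321
word1 = u // 0x400 + 0xD800    # 54321 div 400 + D800 = D950
word2 = u %  0x400 + 0xDC00    # 54321 mod 400 + DC00 = DF21
print(hex(word1), hex(word2))  # 0xd950 0xdf21

# The built-in UTF-16BE encoder produces exactly these two code units:
assert chr(U).encode("utf-16-be").hex() == "d950df21"
```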
References

- Unicode 4.0, Appendix C (PDF; 155 kB)