UTF-16

UTF-16 (short for Universal Multiple-Octet Coded Character Set (UCS) Transformation Format for 16 Planes of Group 00) is a variable-length encoding for Unicode characters. UTF-16 is optimized for the frequently used characters of the Basic Multilingual Plane (BMP). It is the oldest of the Unicode encoding formats.

General

In UTF-16, each Unicode character is assigned a specially encoded sequence of one or two 16-bit units, i.e. two or four bytes, so that, as with the other UTF formats, all Unicode characters can be represented.

While UTF-8 is of central importance in Internet protocols, UTF-16 is used in many places for the internal representation of character strings, e.g. in current versions of .NET, Java and Tcl.

Properties

Because every BMP character is encoded in two bytes, UTF-16 needs twice as much space as a suitable ISO 8859 encoding or UTF-8 for texts that consist mainly of Latin letters. However, if a text contains many BMP characters above code point U+007F, UTF-16 needs a comparable amount of space or less than UTF-8.
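The space trade-off can be checked with a small sketch. The helper name `encoded_sizes` and the sample strings are illustrative assumptions, not part of any standard; the BOM is left out by using the explicit `utf-16-be` codec.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count), both without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


# Hypothetical sample texts, written with escapes to avoid ambiguity.
samples = {
    "ASCII only": "Hello",                       # U+0000..U+007F
    "Greek letters": "\u03b1\u03b2\u03b3\u03b4",  # BMP, 2 bytes each in UTF-8
    "CJK ideographs": "\u65e5\u672c\u8a9e",       # BMP, 3 bytes each in UTF-8
}

for label, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{label}: UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
```

For the ASCII sample UTF-16 is twice as large, for the Greek sample both encodings are equal in size, and for the CJK sample UTF-16 is smaller.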

In contrast to UTF-8, there is no encoding reserve. If UTF-16-encoded text is interpreted as ISO 8859-1, all letters that exist in the latter encoding remain recognizable but are separated by zero bytes; with other ISO 8859 encodings, compatibility is worse.
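The effect of misreading UTF-16 as ISO 8859-1 can be illustrated as follows. This is only a sketch; the sample word is an arbitrary assumption.

```python
# "Käse", written with an escape for the umlaut (U+00E4).
original = "K\u00e4se"

# Encode as big-endian UTF-16, then misinterpret the bytes as ISO 8859-1.
raw = original.encode("utf-16-be")
misread = raw.decode("iso-8859-1")

# Every letter is still present, but each one is preceded by a zero byte.
print(repr(misread))

# Removing the zero bytes recovers the text, because all four letters
# also exist in ISO 8859-1.
recovered = misread.replace("\x00", "")
print(recovered == original)
```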

Standardization

UTF-16 is defined by both the Unicode Consortium and ISO/IEC 10646. Unicode defines additional semantics. A more detailed comparison can be found in Appendix C of the Unicode 4.0 standard [1]. The ISO standard also defined an encoding called UCS-2, which, however, permits only 16-bit representations of the BMP.

Encoding

Characters in the BMP

The 65,536 Unicode code points U+0000 to U+FFFF of the BMP are each mapped directly to a single 16-bit word, i.e. to two bytes.
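A minimal sketch of this direct mapping, assuming big-endian byte order. The function name `encode_bmp` is hypothetical, and the sketch rejects the surrogate range because those code points do not stand for characters of their own.

```python
import struct


def encode_bmp(code_point):
    """Encode one BMP code point as a single big-endian 16-bit word."""
    if not 0x0000 <= code_point <= 0xFFFF:
        raise ValueError("code point is outside the BMP")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    # ">H" packs an unsigned 16-bit integer, most significant byte first.
    return struct.pack(">H", code_point)


# The euro sign U+20AC becomes the two bytes 20 AC.
print(encode_bmp(0x20AC).hex())
```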

Characters outside the BMP

[Figure: formation and internal structure of the two surrogate words. U' is not the original code point U but the value after subtraction: U' = U - 10000 hex.]

Unicode characters outside the BMP (i.e. U+10000 to U+10FFFF) are each represented by two associated 16-bit words (code units), i.e. by four bytes.

To do this, the number 65536 (10000 hex, the size of the BMP) is first subtracted from the character's code point (called U here), which yields a 20-bit number U' in the range 00000 hex to FFFFF hex. This number is then split into two blocks of 10 bits each:

the first block (the 10 most significant bits of U') is prefixed with the bit sequence 110110; the resulting 16-bit word (two bytes) is called the high surrogate
the second block (the 10 least significant bits of U') is prefixed with the bit sequence 110111; the resulting 16-bit word (two bytes) is called the low surrogate.
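The two steps above can be sketched with bit operations. The function name `to_surrogates` is an assumption chosen for illustration; the prefixes 110110 and 110111 correspond to the base values D800 hex and DC00 hex.

```python
def to_surrogates(code_point):
    """Split a code point outside the BMP into (high, low) surrogate words."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in U+10000..U+10FFFF")
    u_prime = code_point - 0x10000          # 20-bit value U'
    high = 0xD800 | (u_prime >> 10)         # prefix 110110 + upper 10 bits
    low = 0xDC00 | (u_prime & 0x3FF)        # prefix 110111 + lower 10 bits
    return high, low


high, low = to_surrogates(0x1D11E)
print(f"{high:04X} {low:04X}")
```

For U+1D11E this yields the pair D834 DD1E.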

The following code ranges are reserved specifically for such surrogates, i.e. UTF-16 substitute code units, and therefore contain no characters of their own:

from U+D800 to U+DBFF (2¹⁰ = 1024 high surrogates)
from U+DC00 to U+DFFF (2¹⁰ = 1024 low surrogates).

When converting UTF-16-encoded strings into UTF-8 byte sequences, note that each pair of high and low surrogates must first be recombined into a single Unicode code point before that code point can be converted into a UTF-8 byte sequence (see the example in the description of UTF-8). Because this is often overlooked, a different, incompatible encoding of the surrogates became established, which was later standardized as CESU-8.
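The difference between the correct conversion and the per-surrogate conversion can be sketched as follows. The function name `from_surrogates` is hypothetical; the second byte string only imitates the CESU-8 byte layout by encoding each surrogate separately through Python's `surrogatepass` error handler.

```python
def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


pair = (0xD834, 0xDD1E)

# Correct: recombine first, then encode the single code point (4 bytes).
code_point = from_surrogates(*pair)
utf8 = chr(code_point).encode("utf-8")

# Per-surrogate encoding (CESU-8 style layout): 3 bytes per surrogate.
per_surrogate = b"".join(
    chr(word).encode("utf-8", "surrogatepass") for word in pair
)

print(f"U+{code_point:X}", utf8.hex(), per_surrogate.hex())
```

The first result is the four-byte sequence F0 9D 84 9E, while the per-surrogate variant produces six bytes (ED A0 B4 ED B4 9E) that a strict UTF-8 decoder rejects.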

Byte order

Depending on which of the two bytes of a 16-bit word is transmitted or stored first, the encoding is called big-endian (UTF-16BE) or little-endian (UTF-16LE). In either case, the high surrogate word always precedes the low surrogate word.

For ASCII characters converted to UTF-16, this means that the added zero byte (the most significant byte) is placed

before the ASCII byte in big-endian and
after the ASCII byte in little-endian.
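Both byte orders can be compared with Python's built-in codecs. The sample string (the letter A followed by U+1D11E) is an arbitrary choice for illustration.

```python
# One ASCII character followed by one character outside the BMP.
text = "A\U0001D11E"

big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")

# Big-endian: zero byte first for "A"; each surrogate word high byte first.
print(big_endian.hex())

# Little-endian: bytes swapped within each 16-bit word, but the high
# surrogate word still comes before the low surrogate word.
print(little_endian.hex())
```

The output is 0041d834dd1e for big-endian and 410034d81edd for little-endian.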

For insufficiently specified protocols, it is recommended to place the Unicode character U+FEFF (BOM, byte order mark), which stands for a zero-width no-break space, at the beginning of the data stream. If the receiver reads it as the invalid Unicode code point U+FFFE (not a character), the byte order of sender and receiver differs, and the receiver must swap the bytes of every 16-bit word to evaluate the rest of the data stream correctly.
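A receiver might use the BOM along these lines. This is a sketch under the assumption that the stream always starts with a BOM; the function names `detect_byte_order` and `decode_with_bom` are hypothetical, while `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are constants from the Python standard library.

```python
import codecs


def detect_byte_order(data):
    """Return the codec name indicated by a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return "utf-16-le"
    raise ValueError("no UTF-16 byte order mark found")


def decode_with_bom(data):
    """Strip the two BOM bytes and decode the rest in the detected order."""
    encoding = detect_byte_order(data)
    return data[2:].decode(encoding)


print(decode_with_bom(b"\xfe\xff\x00A"))   # big-endian stream
print(decode_with_bom(b"\xff\xfeA\x00"))   # little-endian stream
```

Both calls print the letter A, because the BOM tells the decoder which byte of each word comes first.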

Examples

Some coding examples for UTF-16 are given in the following table:

Examples of UTF-16 encodings
Character	Unicode	Unicode binary	UTF-16BE binary	UTF-16BE hexadecimal
Letter y	U+0079	00000000 01111001	00000000 01111001	00 79
Letter ä	U+00E4	00000000 11100100	00000000 11100100	00 E4
Euro sign €	U+20AC	00100000 10101100	00100000 10101100	20 AC
Treble clef 𝄞	U+1D11E	00000001 11010001 00011110	11011000 00110100 11011101 00011110	D8 34 DD 1E
CJK ideogram 𤽜	U+24F5C	00000010 01001111 01011100	11011000 01010011 11011111 01011100	D8 53 DF 5C

The last two examples lie outside the BMP. Because many fonts do not yet cover these newer Unicode ranges, the characters in them cannot be displayed correctly on many platforms; a replacement character is shown instead as a placeholder. In the examples, subtracting 10000 hex changes only one or two bits, and the surrogates are formed from the resulting bits.
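The hexadecimal column of the table can be reproduced with Python's built-in `utf-16-be` codec. The dictionary name `examples` is an assumption for this sketch; the characters are written as escapes so that the code does not depend on font support.

```python
# Character -> expected UTF-16BE bytes in hexadecimal, taken from the table.
examples = {
    "y": "0079",
    "\u00e4": "00e4",          # letter ä
    "\u20ac": "20ac",          # euro sign
    "\U0001D11E": "d834dd1e",  # treble clef, outside the BMP
    "\U00024F5C": "d853df5c",  # CJK ideogram, outside the BMP
}

results = {}
for character, expected in examples.items():
    encoded = character.encode("utf-16-be").hex()
    results[character] = encoded
    status = "matches" if encoded == expected else "differs"
    print(f"U+{ord(character):04X} -> {encoded} ({status})")
```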

Example calculation of surrogates

All numbers below are in base 16.

For the Unicode code point v:

SG-Word1 =  $\left\lfloor {\tfrac {v-10000}{400}} \right\rfloor$  + D800
SG-Word2 =  $(v\;{\bmod {\;}}400)$  + DC00

 $v$         = 64321
SG-Word1 =  $\left\lfloor {\tfrac {64321-10000}{400}} \right\rfloor$  + D800
         = D950

SG-Word2 =  $(64321\;{\bmod {\;}}400)$  + DC00
         = DF21
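The same calculation can be written with integer division and modulo. This is a sketch; the function name `surrogate_words` is hypothetical. Because 10000 hex is a multiple of 400 hex, taking v modulo 400 hex gives the same remainder as taking (v - 10000 hex) modulo 400 hex.

```python
def surrogate_words(v):
    """Compute both surrogate words arithmetically for v in U+10000..U+10FFFF."""
    word1 = (v - 0x10000) // 0x400 + 0xD800   # integer division drops the remainder
    word2 = v % 0x400 + 0xDC00
    return word1, word2


word1, word2 = surrogate_words(0x64321)
print(f"SG-Word1 = {word1:X}, SG-Word2 = {word2:X}")

# Cross-check against Python's built-in UTF-16 encoder.
expected = chr(0x64321).encode("utf-16-be").hex().upper()
print(expected == f"{word1:04X}{word2:04X}")
```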

References

[1] Unicode 4.0, Appendix C (PDF; 155 kB)