Unicode Transformation Format

from Wikipedia, the free encyclopedia

A Unicode Transformation Format, also UCS Transformation Format, abbreviated to UTF, is a method for mapping Unicode characters onto sequences of bytes.

There are various transformation formats for representing Unicode characters for the purpose of electronic data processing. All characters (code points) contained in the Unicode standard can be represented in each of the formats. Each of these formats can also be converted into any other UTF variant without loss.
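The lossless conversion can be sketched in Python with the standard library codecs. The sample string and the variable names are arbitrary choices for illustration, not part of any standard:

```python
# Sample text mixing ASCII, Latin, Cyrillic, CJK and one code point outside the BMP.
text = "Veränderung Промена 变化 \U0001F600"

# Encode once as UTF-8, then convert UTF-8 -> UTF-16 -> UTF-32 -> UTF-8.
utf8 = text.encode("utf-8")
utf16 = utf8.decode("utf-8").encode("utf-16-le")
utf32 = utf16.decode("utf-16-le").encode("utf-32-le")
back_to_utf8 = utf32.decode("utf-32-le").encode("utf-8")

# After the full round trip the bytes match the original UTF-8 bytes.
round_trip_ok = back_to_utf8 == utf8
print(round_trip_ok)
```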

The various formats differ in their space requirements on storage media (storage efficiency), the encoding and decoding effort (runtime behavior) and their compatibility with other (older) encodings such as ASCII. For example, some formats allow very efficient access (random access) to individual characters within a string, while others use storage space sparingly. When selecting a Unicode transformation format, the one most suitable for the intended application must therefore be determined.

UTF-8, UTF-16 and UTF-32

  • UTF-32 always encodes a character in exactly 32 bits and is therefore the simplest format, since characters have no variable length and no elaborate algorithm is required. This comes at the expense of memory: if only characters from the ASCII character set are used, the text takes more than four times as much space as in ASCII (which needs 7 bits per character). Depending on the byte order, that is, whether the least significant or the most significant byte is transmitted first, the format is called little endian (UTF-32LE) or big endian (UTF-32BE).
  • UTF-16 is the oldest encoding method in which one or two 16-bit units (2 or 4 bytes) are used to encode a character. Here, too, a distinction is made between the more common UTF-16LE and UTF-16BE, depending on the byte order. For languages with non-Latin characters, this is often the most space-saving variant, as their characters usually fit in 2 bytes.
  • UTF-8 encodes a Unicode character in a variable number of bytes, from 1 to 4. The code points 0 to 127, which correspond to the ASCII character set, are encoded in one byte whose most significant bit is always 0. A set most significant (eighth) bit marks a byte as part of a longer sequence of 2, 3 or 4 bytes that encodes a single character. This is the most memory-efficient format for scripts based on the Latin alphabet.
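The space trade-offs described in this list can be checked with a short Python sketch. The helper name `encoded_sizes` and the sample strings are hypothetical and chosen only for illustration; the codec names are those of the Python standard library, and the "-le" variants are used so that no BOM is counted:

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes text occupies in each format (without BOM)."""
    return {name: len(text.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}

# ASCII only, Latin with an umlaut, Cyrillic, CJK, and one code point outside the BMP.
samples = ["change", "Veränderung", "Промена", "变化", "\U0001F600"]
for sample in samples:
    print(sample, encoded_sizes(sample))
```

For the pure ASCII sample, UTF-8 needs 6 bytes, UTF-16 needs 12 and UTF-32 needs 24. For the two Chinese characters, UTF-16 is the smallest with 4 bytes, against 6 bytes in UTF-8.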

All three formats can be transmitted or stored with or without a unique signature at the beginning, the byte order mark (BOM). The BOM helps with correct identification, especially when files are processed with different programs and on different systems. If the encoding is clearly defined beforehand or is communicated by other means (for example through the “charset” meta information in an HTML document), the BOM can be omitted.
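How a program might use the BOM for identification can be sketched as follows. The function `detect_bom` and its `default` parameter are hypothetical names introduced for this illustration; the BOM constants come from Python's standard `codecs` module. Falling back to UTF-8 when no BOM is present is an assumption, not a rule from the standard:

```python
import codecs

# Known BOMs, with the 4-byte UTF-32 marks first so that the UTF-32LE BOM
# (FF FE 00 00) is not mistaken for the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding, payload without BOM); use default if no BOM is found."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return default, data

# Usage: a UTF-16LE file that starts with a BOM.
data = codecs.BOM_UTF16_LE + "Veränderung".encode("utf-16-le")
encoding, payload = detect_bom(data)
print(encoding, payload.decode(encoding))
```

A UTF-16LE text whose first character is U+0000 would be misread as UTF-32LE by this sketch, which is one reason why out-of-band information such as a declared charset is preferable when it is available.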

Examples

As an example, the word "change" is shown below in different languages, scripts and encodings, as it would appear in a hex editor. Each example is labeled with its ISO language code.

Veränderung (de)
56 00 00 00|65 00 00 00|72 00 00 00|E4 00 00 00|6E 00 00 00|64 00 00 00    | UTF-32LE ↵
00 00 00 56|00 00 00 65|00 00 00 72|00 00 00 E4|00 00 00 6E|00 00 00 64    | UTF-32BE ↵
V          |e          |r          |ä          |n          |d              | Veränd  ↵
65 00 00 00|72 00 00 00|75 00 00 00|6E 00 00 00|67 00 00 00                | UTF-32LE
00 00 00 65|00 00 00 72|00 00 00 75|00 00 00 6E|00 00 00 67                | UTF-32BE
e          |r          |u          |n          |g                          | erung
56 00|65 00|72 00|E4 00|6E 00|64 00|65 00|72 00|75 00|6E 00|67 00          | UTF-16LE
00 56|00 65|00 72|00 E4|00 6E|00 64|00 65|00 72|00 75|00 6E|00 67          | UTF-16BE
V    |e    |r    |ä    |n    |d    |e    |r    |u    |n    |g              | Veränderung
56|65|72|C3 A4|6E|64|65|72|75|6E|67                                        | UTF-8
V |e |r |ä    |n |d |e |r |u |n |g                                         | Veränderung
Промена - Macedonian language with Cyrillic alphabet (mk)
1F 04 00 00|40 04 00 00|3E 04 00 00|3C 04 00 00 | UTF-32LE ↵
00 00 04 1F|00 00 04 40|00 00 04 3E|00 00 04 3C | UTF-32BE ↵
П          |р          |о          |м           | Пром    ↵
35 04 00 00|3D 04 00 00|30 04 00 00             | UTF-32LE
00 00 04 35|00 00 04 3D|00 00 04 30             | UTF-32BE
е          |н          |а                       | ена
1F 04|40 04|3E 04|3C 04|35 04|3D 04|30 04       | UTF-16LE
04 1F|04 40|04 3E|04 3C|04 35|04 3D|04 30       | UTF-16BE
П    |р    |о    |м    |е    |н    |а           | Промена
D0 9F|D1 80|D0 BE|D0 BC|D0 B5|D0 BD|D0 B0       | UTF-8
П    |р    |о    |м    |е    |н    |а           | Промена

Nepali is written in Devanagari, an alphasyllabic script. A syllable corresponds to one written character, and a small set of base characters is modified by attached vowel signs to form other syllables. (This is similar to typing an E followed by an acute accent on a computer, except that the computer turns the result into É, a separate character in Unicode. The Nepali characters, by contrast, remain composed of several code points in Unicode. The dotted circle displayed with a combining sign is a placeholder for the base character to which the sign attaches.) The example below therefore consists of two base characters, the first modified twice and the second modified once. This is in contrast to Chinese, where there are many different syllable characters. Modifying Unicode characters of this kind also exist in the Hebrew script, for example.
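The difference between a precomposed character such as É and the composed Nepali syllables can be inspected with Python's standard `unicodedata` module. This is a sketch that uses NFC normalization as a stand-in for the composition step; the text above does not name a specific mechanism, so that choice is an assumption:

```python
import unicodedata

# "E" followed by U+0301 COMBINING ACUTE ACCENT composes to the single
# precomposed character U+00C9 (É) under NFC normalization.
composed = unicodedata.normalize("NFC", "E\u0301")

# The Nepali word has no precomposed forms: NFC keeps all five code points.
nepali = "\u091A\u093E\u0902\u091C\u0947"  # चांजे
nepali_nfc = unicodedata.normalize("NFC", nepali)

for ch in nepali:
    # Category "Lo" is a base letter; "Mc" and "Mn" are attached signs.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

print(len(composed), len(nepali_nfc))
```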

चांजे - Nepali (ne)
1A 09 00 00|3E 09 00 00|02 09 00 00|1C 09 00 00|47 09 00 00  | UTF-32LE
00 00 09 1A|00 00 09 3E|00 00 09 02|00 00 09 1C|00 00 09 47  | UTF-32BE
च           ा           ं         |ज           े            | चांजे
1A 09|3E 09|02 09|1C 09|47 09                                | UTF-16LE
09 1A|09 3E|09 02|09 1C|09 47                                | UTF-16BE
च     ा    ं    |ज     े                                    | चांजे
E0 A4 9A|E0 A4 BE|E0 A4 82|E0 A4 9C|E0 A5 87                 | UTF-8
च        ा       ं       |ज        े                        | चांजे
变化 - Chinese languages (zh)
D8 53 00 00|16 53 00 00  | UTF-32LE
00 00 53 D8|00 00 53 16  | UTF-32BE
变         |化           | 变化
D8 53|16 53              | UTF-16LE
53 D8|53 16              | UTF-16BE
变   |化                 | 变化
E5 8F 98|E5 8C 96        | UTF-8
变      |化              | 变化
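
The byte sequences in these tables can be reproduced with a few lines of Python. The helper name `hex_dump` is hypothetical; the sketch encodes each character separately so that the "|" separators fall on character boundaries, as in the tables:

```python
def hex_dump(text: str, encoding: str) -> str:
    """Encode text character by character and join the hex bytes with '|'."""
    groups = []
    for ch in text:
        encoded = ch.encode(encoding)
        groups.append(" ".join(f"{byte:02X}" for byte in encoded))
    return "|".join(groups)

for word in ("Veränderung", "Промена", "चांजे", "变化"):
    for encoding in ("utf-32-le", "utf-32-be", "utf-16-le", "utf-16-be", "utf-8"):
        print(f"{hex_dump(word, encoding)} | {encoding}")
```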

Other Unicode encodings

The Unicode standard only defines UTF-32, UTF-16 and UTF-8. In addition, there are other encodings which can also encode all Unicode characters. Some examples are listed below.

UTF-1

UTF-1 was the first 8-bit encoding for Unicode, but it did not catch on due to several weaknesses.

UTF-7

UTF-7 is an outdated format that encodes Unicode characters as printable ASCII characters (which need only the lower 7 bits of a byte, hence the name of the format). The format was intended for transmitting Unicode text over 7-bit channels (e.g. e-mail or Usenet), but it did not catch on. Instead, UTF-8 combined with a MIME transfer encoding such as Base64 or quoted-printable is usually used for this purpose, or UTF-8 over an 8-bit channel.

Example: The German word "Übergröße" (oversize) in UTF-7 becomes +ANw-bergr+APYA3w-e, which at 19 bytes is somewhat more compact than the 24 bytes required by quoted-printable UTF-8: =C3=9Cbergr=C3=B6=C3=9Fe.
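The two encoded forms can be checked with Python's built-in `utf-7` codec and the standard `quopri` module. The sketch assumes that the example word is the German "Übergröße", which is what both byte strings decode to:

```python
import quopri

word = "Übergröße"

utf7 = b"+ANw-bergr+APYA3w-e"          # UTF-7 form quoted in the text
qp_utf8 = b"=C3=9Cbergr=C3=B6=C3=9Fe"  # quoted-printable UTF-8 form

# Both byte strings decode back to the same word.
from_utf7 = utf7.decode("utf-7")
from_qp = quopri.decodestring(qp_utf8).decode("utf-8")

print(from_utf7 == word, len(utf7))
print(from_qp == word, len(qp_utf8))
```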

UTF-EBCDIC

UTF-EBCDIC is a Unicode encoding that builds on the proprietary 8-bit EBCDIC format of IBM mainframes, comparable to the way UTF-8 builds on ASCII.

However, it encodes the first 160 characters (65 control characters and 95 graphic characters) in one byte each at the positions customary in EBCDIC, where they exist. The remaining Unicode repertoire is encoded, analogously to UTF-8, in two to five bytes (or up to seven for code positions that cannot be represented with UTF-16 and will therefore probably never be assigned characters), using byte values that are assigned different graphic characters in the various EBCDIC code pages. For example, the BOM becomes the four-byte sequence DD 73 66 73 (hexadecimal). Depending on the code position, the same character is sometimes encoded shorter or longer than in UTF-8.

It was developed with the aim of facilitating the processing of Unicode data in existing mainframe applications. In practice, UTF-EBCDIC is only rarely used on mainframes.

EBCDIC-based mainframe operating systems like z/OS usually use UTF-16. For example, UTF-16 is supported by components such as DB2, COBOL, PL/I, Java, and the IBM XML Toolkit.

UTF-5, UTF-6, UTF-9 and UTF-18

UTF-5 and UTF-6 were proposals for use in Internationalized Domain Names (IDN). Punycode was standardized instead. UTF-9 and UTF-18 were an April Fools' joke, but can in principle be implemented on computers with 9-bit bytes.

SCSU

The Standard Compression Scheme for Unicode (SCSU) is an encoding geared primarily towards a small memory footprint. All Unicode characters can be represented; one byte per character is sufficient for most languages. In contrast to other encodings, the same text can be encoded in many different ways. In practice, however, SCSU did not catch on.

CESU-8

CESU-8 (short for Compatibility Encoding Scheme for UTF-16: 8-Bit) is a variant of UTF-8. The code point is first expressed in UTF-16, and the resulting 16-bit units are then re-encoded in UTF-8 as if they were UCS-2.
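The two-step procedure can be sketched in Python. The function name `cesu8_encode` is hypothetical, and the sketch relies on the `surrogatepass` error handler of Python's UTF-8 codec to emit three-byte forms for individual surrogates; it is an illustration, not a reference implementation:

```python
import struct

def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: split into UTF-16 units, then encode each like UTF-8."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    # Walk the 16-bit code units; the two halves of a surrogate pair
    # are encoded one at a time.
    for (unit,) in struct.iter_unpack(">H", utf16):
        # "surrogatepass" lets the UTF-8 codec encode a lone surrogate in 3 bytes.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

# U+10400 lies outside the BMP: 4 bytes in UTF-8, 6 bytes in CESU-8.
print(cesu8_encode("\U00010400").hex(" ").upper())
print("\U00010400".encode("utf-8").hex(" ").upper())
```

For code points inside the Basic Multilingual Plane the result is identical to UTF-8; only code points that need a surrogate pair in UTF-16 differ.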

GB18030

The GB18030 character encoding can also be regarded as a Unicode Transformation Format, as it can map all Unicode code points. It was designed to be compatible with the GBK and GB2312 encodings, which it is intended to replace.

Because of this compatibility, the encoding is significantly more complex than UTF-8, since it is not systematic. It is therefore usually implemented using lookup tables. ASCII characters are encoded in one byte and correspond to normal ASCII encoding. Other characters are encoded in two or four bytes; in contrast to UTF-8, these multi-byte sequences reuse byte values from the ASCII range.
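
The one-, two- and four-byte forms, and the reuse of ASCII-range byte values, can be observed with the `gb18030` codec that ships with Python. The three sample characters are arbitrary choices for illustration:

```python
samples = {
    "A": "A",                 # ASCII character
    "变": "\u53D8",           # CJK character from the examples above
    "U+10000": "\U00010000",  # first code point beyond the BMP
}

for label, ch in samples.items():
    encoded = ch.encode("gb18030")
    # Bytes below 0x80 fall in the value range of ASCII characters.
    ascii_range = [f"{b:02X}" for b in encoded if b < 0x80]
    print(label, encoded.hex(" ").upper(), "ASCII-range bytes:", ascii_range)
```

In the four-byte sequence for U+10000, the second and fourth bytes have the value 0x30, which is the ASCII digit "0". A byte-wise search for ASCII characters can therefore produce false matches in GB18030 text, which cannot happen in UTF-8.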
