UTF-8

from Wikipedia, the free encyclopedia

UTF-8 (abbreviation for 8-bit UCS Transformation Format, where UCS in turn abbreviates Universal Coded Character Set) is the most widely used encoding for Unicode characters (Unicode and UCS are practically identical). The encoding was devised by Ken Thompson and Rob Pike in September 1992 while they were working on the Plan 9 operating system. In the context of X/Open it was initially called FSS-UTF (file system safe UTF, as opposed to UTF-1, which lacks this property).

UTF-8 is identical to ASCII in the first 128 characters (code points 0–127). Because it usually needs only one byte per character for many Western languages, it is particularly well suited to encoding English-language texts. Such texts can generally be edited without impairment even in text editors that are not UTF-8-capable, which is one of the reasons for UTF-8's status as the de facto standard character encoding of the Internet and associated document types. In March 2019, 93.1% of all websites used UTF-8, as did 94.8% of the top 1000.

In other languages, the memory requirement in bytes per character is greater wherever the characters fall outside the ASCII character set: even the German umlauts require two bytes, as do Greek or Cyrillic characters. Characters from languages of the Far East and of Africa, on the other hand, occupy up to 4 bytes per character. Processing UTF-8 as a multi-byte character string requires every byte to be analysed, so compared with encodings that use a fixed number of bytes per character it needs more computational effort and, for certain languages, more storage space. Depending on the application scenario, other UTF encodings are therefore also used to represent Unicode: Microsoft Windows, the most widely used desktop operating system, internally uses UTF-16 little-endian as a compromise between UTF-8 and UTF-32.
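The byte counts mentioned above can be checked with a few lines of Python. This is only a small sketch: the helper name utf8_length and the choice of sample characters are illustrative assumptions, not part of any standard.

```python
# Sample characters, written as escapes so the source file stays pure ASCII.
SAMPLES = {
    "A": "Latin letter (ASCII)",
    "\u00e4": "German umlaut a",
    "\u03b1": "Greek alpha",
    "\u044f": "Cyrillic ya",
    "\u20ac": "euro sign",
    "\U0001D11E": "treble clef (outside the BMP)",
}


def utf8_length(char: str) -> int:
    """Return the number of bytes that one character occupies in UTF-8."""
    return len(char.encode("utf-8"))


for char, description in SAMPLES.items():
    print(f"U+{ord(char):04X}  {description}: {utf8_length(char)} byte(s)")
```

The same characters occupy two bytes each in UTF-16 (four for the treble clef) and four bytes each in UTF-32, which is the trade-off the paragraph above describes.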

General

In UTF-8, each Unicode character is assigned a specially encoded byte sequence of variable length. UTF-8 supports sequences of up to four bytes, onto which, as with all UTF formats, all Unicode characters can be mapped.

UTF-8 is of central importance as a global character encoding on the Internet. The Internet Engineering Task Force requires that all new Internet communication protocols declare their character encoding and that UTF-8 be one of the supported encodings. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and send UTF-8.

In HTML, the markup language used in web browsers, UTF-8 is likewise becoming increasingly common for representing language-specific characters and is replacing the HTML entities used previously.

Properties

  • Multi-byte character encoding (MBCS) similar to CP950 / CP936 / CP932 (Chinese / Japanese), but without the (at that time important and useful) property that double-width characters are two bytes long
  • Any 7-bit ASCII text is at the same time valid UTF-8, which makes UTF-8 highly compatible with earlier 8-bit character sets
  • Multi-byte sequences never contain 7-bit ASCII byte values (which allows processing and parsing with the usual 7-bit character constants)
  • Compared to UTF-16, relatively compact when the proportion of ASCII characters is high, but more space-intensive for characters between U+0800 and U+FFFF (especially Asian languages, see list of Unicode blocks)
  • Sort order is preserved: two UTF-8 strings sort in the same order as the corresponding unencoded Unicode strings
  • Searchable in both directions (not the case with earlier MBCS)
  • Simple transcoding function (also easy to implement in hardware)
  • Ample encoding headroom (in case something changes in the Unicode standard)
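Two of the properties listed above, preserved sort order and the absence of ASCII byte values inside multi-byte sequences, can be illustrated with a short Python sketch. The function names are hypothetical and the word list is an arbitrary sample, so this demonstrates the properties rather than proving them.

```python
def sort_orders_agree(strings):
    """Compare sorting by code point with sorting by raw UTF-8 bytes."""
    by_code_point = sorted(strings)  # Python compares str values by code point
    by_utf8_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
    return by_code_point == by_utf8_bytes


def multibyte_sequences_avoid_ascii(text):
    """Check that no byte below 0x80 occurs inside a multi-byte sequence."""
    for char in text:
        encoded = char.encode("utf-8")
        if len(encoded) > 1 and any(byte < 0x80 for byte in encoded):
            return False
    return True


words = ["z", "\u00e4", "\u20ac", "a", "\U0001D11E", "\u0416", "\uff5e"]
print(sort_orders_agree(words))
print(multibyte_sequences_avoid_ascii("".join(words)))
```

Because every byte of a multi-byte sequence has its highest bit set, a search for an ASCII delimiter such as "/" in the raw bytes can never match in the middle of another character.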

Standardization

UTF-8 is currently defined identically by the IETF, the Unicode Consortium and ISO in the following standard documents:

  • RFC 3629 / STD 63 (2003)
  • The Unicode Standard, Version 4.0, §3.9–§3.10 (2003)
  • ISO/IEC 10646-1:2000 Annex D (2000)

These replace older, partly divergent definitions, some of which are still used by older software:

  • ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)
  • The Unicode Standard, Version 2.0, Appendix A (1996)
  • RFC 2044 (1996)
  • RFC 2279 (1998)
  • The Unicode Standard, Version 3.0, §2.3 (2000) and Corrigendum #1: UTF-8 Shortest Form (2000)
  • Unicode Standard Annex #27: Unicode 3.1 (2001)

Encoding

Algorithm

Unicode characters with values in the range 0 to 127 (0 to 7F hexadecimal) are represented in UTF-8 as a single byte with the same value. Data that contains only genuine ASCII characters is therefore identical in both representations.

Unicode characters with values greater than 127 are encoded in UTF-8 as byte sequences two to four bytes long.

Unicode range (hexadecimal)   UTF-8 encoding (binary scheme)         Encodable characters (theoretical maximum)
0000 0000 - 0000 007F         0xxxxxxx                               2^7 = 128
0000 0080 - 0000 07FF         110xxxxx 10xxxxxx                      2^11 - 2^7 = 1,920 (2^11 = 2,048)
0000 0800 - 0000 FFFF         1110xxxx 10xxxxxx 10xxxxxx             2^16 - 2^11 = 63,488 (2^16 = 65,536)
0001 0000 - 0010 FFFF         11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    2^20 = 1,048,576 (2^21 = 2,097,152)

Algorithm / explanations:

  • In the range 0000 0000 - 0000 007F (128 characters), UTF-8 corresponds exactly to the ASCII code: the highest bit is 0, and the remaining 7-bit combination is the ASCII character.
  • In all other ranges, the first byte always begins with 11 and the following bytes with 10. The x positions stand for the bits of the Unicode character value. The least significant bit of the character value is mapped to the rightmost x in the last byte, and the more significant bits follow from right to left. The number of ones before the first 0 in the first byte equals the total number of bytes for the character.
  • The value in parentheses is the theoretical maximum number of encodable characters, which may not be fully usable because of restrictions in the Unicode or UTF-8 standard.
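The scheme in the table can be written down as a short encoder. The following Python function is a minimal sketch under stated assumptions: the name encode_code_point is hypothetical, and the rejection of surrogates and of values above 10FFFF follows the restrictions described elsewhere in this article rather than anything in the table itself. In practice Python's built-in str.encode("utf-8") does the same job.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point using the four-row scheme of the table."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    if cp <= 0x7F:
        # 0xxxxxxx: the byte equals the ASCII value.
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx: 5 + 6 payload bits.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits.
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 payload bits.
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(encode_code_point(0x20AC).hex())  # euro sign
```

The least significant six bits always end up in the last byte, which is why each branch masks with 0x3F after shifting right in steps of six.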

Remarks

The algorithm theoretically allows byte sequences up to eight bytes long and thus over four trillion characters. The last level would have 11111111 as its first byte, followed by seven continuation bytes with six payload bits each, so this level alone would cover 2^(7 × 6) = 2^42 = 4,398,046,511,104 characters. Originally, start bytes up to 1111110x were defined, followed by up to five continuation bytes of the form 10xxxxxx, i.e. six bytes in total carrying 31 bits for the Unicode value. In its use as a UTF encoding, however, UTF-8 is restricted to the code space common to all Unicode encodings, i.e. 0 to 0010 FFFF (1,114,112 possibilities), and to byte sequences at most four bytes long. The value range available for character codes is therefore not fully used. Longer byte sequences and larger values are now considered invalid codes and must be treated accordingly.
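The bit counts quoted in this remark follow from simple arithmetic, sketched here in Python. The function name payload_bits is a hypothetical helper; the lengths 5 to 8 describe the theoretical extension discussed above, not valid UTF-8.

```python
def payload_bits(sequence_length: int) -> int:
    """Payload bits carried by a sequence of the given total length in bytes."""
    if sequence_length == 1:
        return 7  # 0xxxxxxx
    if not 2 <= sequence_length <= 8:
        raise ValueError("sequence length must be between 1 and 8")
    # The start byte spends one bit per byte of the sequence plus a 0 separator;
    # for lengths 7 and 8 no payload bits remain in the start byte.
    start_byte_bits = max(0, 7 - sequence_length)
    return start_byte_bits + 6 * (sequence_length - 1)


for length in range(1, 9):
    print(length, payload_bits(length), 2 ** payload_bits(length))
```

For the four-byte form this gives 21 bits, enough for 2,097,152 values, of which only the 1,114,112 code points up to 10FFFF are permitted.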

The first byte of a UTF-8-encoded character is called the start byte; any further bytes are called continuation bytes. Start bytes always begin with 0 or 11, continuation bytes always with 10.

  • If the highest bit of the first byte is 0, it is an ASCII character, since ASCII is a 7-bit encoding and the first 128 Unicode characters correspond to the ASCII characters. All ASCII strings are therefore automatically upward compatible with UTF-8.
  • If the highest bit of the first byte is 1, it is a multi-byte character, i.e. a Unicode character with a character number greater than 127.
  • If the highest two bits of a byte are 11, it is the start byte of a multi-byte character; if they are 10, it is a continuation byte.
  • The lexical order by byte values corresponds to the lexical order by character numbers, since higher character numbers are encoded with correspondingly more 1 bits in the start byte.
  • In the start byte of a multi-byte character, the number of leading 1 bits gives the total number of bytes of the encoded Unicode character. Put differently, the number of 1 bits to the left of the highest 0 bit equals the number of continuation bytes plus one. For example, in 1110xxxx 10xxxxxx 10xxxxxx there are three 1 bits before the highest 0 bit, hence three bytes in total; two of them follow the highest 1 bit, hence two continuation bytes.
  • Start bytes (0… or 11…) and continuation bytes (10…) can be clearly distinguished from one another. A byte stream can therefore be read from the middle without decoding problems, which is particularly important when recovering damaged data: bytes starting with 10 are simply skipped until 0… or 11… is found. With encodings that lack this property, reading a data stream whose beginning is unknown may not be possible.
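The two rules above, reading the sequence length from the start byte and skipping continuation bytes to resynchronise, might look as follows in Python. Both function names are hypothetical, and the sketch checks only the bit patterns: it does not reject start bytes that are invalid for other reasons.

```python
def sequence_length(start_byte: int) -> int:
    """Total sequence length announced by a start byte (leading 1 bits)."""
    if start_byte & 0x80 == 0:
        return 1  # 0xxxxxxx: single ASCII byte
    if start_byte & 0xC0 == 0x80:
        raise ValueError("10xxxxxx is a continuation byte, not a start byte")
    ones = 0
    mask = 0x80
    while start_byte & mask:  # count 1 bits before the first 0 bit
        ones += 1
        mask >>= 1
    return ones


def next_start(data: bytes, pos: int) -> int:
    """Skip continuation bytes from pos and return the index of the next start byte."""
    while pos < len(data) and data[pos] & 0xC0 == 0x80:
        pos += 1
    return pos


data = "a\u20acb".encode("utf-8")   # 61 E2 82 AC 62
resume = next_start(data, 2)        # position 2 is in the middle of the euro sign
print(resume, data[resume:].decode("utf-8"))
```

Starting in the middle of the three-byte euro sign, the sketch skips the two remaining continuation bytes and resumes decoding at the following letter.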

Note:

  • Theoretically, the same character can be encoded in different ways (for example "a" as 01100001 or, incorrectly, as 11000001 10100001). However, only the shortest possible encoding is allowed. This has repeatedly caused problems: faced with invalid encodings, programs have crashed, interpreted them as valid, or simply ignored them. The combination of the last two behaviours has led, for example, to firewalls that fail to recognise dangerous content because of the invalid encoding while the client to be protected interprets the encoding as valid and is thereby endangered.
  • If a character occupies several bytes, the bits are right-aligned: the lowest bit (least significant bit) of the Unicode character is always in the lowest bit of the last UTF-8 byte.
  • Originally there were also encodings with more than four octets (up to six), but these have been excluded because there are no corresponding characters in Unicode and ISO 10646 has been aligned with Unicode in its possible character range.
  • For all scripts based on the Latin alphabet, UTF-8 is a particularly space-saving method of representing Unicode characters.
  • The Unicode ranges U+D800 to U+DBFF and U+DC00 to U+DFFF are expressly not characters; they are used in UTF-16 to encode characters outside the Basic Multilingual Plane and were formerly referred to as high and low surrogates, respectively. Consequently, byte sequences corresponding to these ranges are not valid UTF-8. For example, U+10400 is represented in UTF-16 as D801 DC00, but in UTF-8 it should be expressed as F0 90 90 80 rather than ED A0 81 ED B0 80. Java has supported this since version 1.5. Because the wrong encoding was in widespread use, especially in databases, it was subsequently standardised as CESU-8.
  • UTF-8, UTF-16 and UTF-32 can each encode the entire Unicode value range.
  • If a byte sequence cannot be interpreted as UTF-8 characters, it is usually replaced on reading by the Unicode replacement character U+FFFD (bytes EF BF BD).
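Several of these notes can be observed with Python's built-in UTF-8 codec, which is strict by default. The helper is_valid_utf8 is a hypothetical name; the byte strings are the examples from the notes above.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Shortest form of "a" versus the overlong two-byte form.
print(is_valid_utf8(b"\x61"), is_valid_utf8(b"\xc1\xa1"))

# U+10400: the four-byte form is valid, the surrogate-based form is not.
print(is_valid_utf8(b"\xf0\x90\x90\x80"))
print(is_valid_utf8(b"\xed\xa0\x81\xed\xb0\x80"))

# UTF-16 representation of the same character (big-endian code units).
print("\U00010400".encode("utf-16-be").hex())

# Lenient decoding inserts U+FFFD for the byte that cannot be interpreted.
print(b"\xf6".decode("utf-8", errors="replace") == "\ufffd")
```

A validator that rejects overlong forms and surrogates before any security check is applied avoids the firewall problem described in the first note, because checker and client then agree on what the bytes mean.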

Permitted bytes and their meaning

Because of the UTF-8 encoding rules, certain byte values are not permitted. The following table lists all 256 byte values together with their use and validity: values that directly represent a character, values that begin a sequence of two or more bytes, values that continue such a sequence, and values that are not permitted.

UTF-8 range of values and meaning

Binary              Hexadecimal  Decimal   Meaning
00000000-01111111   00-7F        0-127     One-byte characters, identical to US-ASCII.
10000000-10111111   80-BF        128-191   Second, third or fourth byte of a byte sequence.
11000000-11000001   C0-C1        192-193   Not permitted: start of a 2-byte sequence that would encode the code range 0 to 127.
11000010-11011111   C2-DF        194-223   Start of a 2-byte sequence (U+0080…U+07FF).
11100000-11101111   E0-EF        224-239   Start of a 3-byte sequence (U+0800…U+FFFF).
11110000-11110100   F0-F4        240-244   Start of a 4-byte sequence (including the invalid code ranges from 110000 to 13FFFF).
11110101-11110111   F5-F7        245-247   Invalid according to RFC 3629: start of a 4-byte sequence for the code range above 140000.
11111000-11111011   F8-FB        248-251   Invalid according to RFC 3629: start of a 5-byte sequence.
11111100-11111101   FC-FD        252-253   Invalid according to RFC 3629: start of a 6-byte sequence.
11111110-11111111   FE-FF        254-255   Invalid. Not defined in the original UTF-8 specification.

Start bytes of 2-byte sequences (C2-DF)

Start byte   Covered code range
C2           U+0080…U+00BF
C3           U+00C0…U+00FF
C4           U+0100…U+013F
C5           U+0140…U+017F
C6           U+0180…U+01BF
C7           U+01C0…U+01FF
C8           U+0200…U+023F
C9           U+0240…U+027F
CA           U+0280…U+02BF
CB           U+02C0…U+02FF
CC           U+0300…U+033F
CD           U+0340…U+037F
CE           U+0380…U+03BF
CF           U+03C0…U+03FF
D0           U+0400…U+043F
D1           U+0440…U+047F
D2           U+0480…U+04BF
D3           U+04C0…U+04FF
D4           U+0500…U+053F
D5           U+0540…U+057F
D6           U+0580…U+05BF
D7           U+05C0…U+05FF
D8           U+0600…U+063F
D9           U+0640…U+067F
DA           U+0680…U+06BF
DB           U+06C0…U+06FF
DC           U+0700…U+073F
DD           U+0740…U+077F
DE           U+0780…U+07BF
DF           U+07C0…U+07FF

Start bytes of 3-byte sequences (E0-EF)

Start byte   Covered code range   Annotation
E0           U+0800…U+0FFF        2nd byte 80…9F: invalid encoding for U+0000…U+07FF; 2nd byte A0…BF: U+0800…U+0FFF
E1           U+1000…U+1FFF
E2           U+2000…U+2FFF
E3           U+3000…U+3FFF
E4           U+4000…U+4FFF
E5           U+5000…U+5FFF
E6           U+6000…U+6FFF
E7           U+7000…U+7FFF
E8           U+8000…U+8FFF
E9           U+9000…U+9FFF
EA           U+A000…U+AFFF
EB           U+B000…U+BFFF
EC           U+C000…U+CFFF
ED           U+D000…U+DFFF        2nd byte 80…9F: U+D000…U+D7FF; 2nd byte A0…BF: not permitted, see CESU-8
EE           U+E000…U+EFFF        Private Use Area
EF           U+F000…U+FFFF        Private Use Area if the 2nd byte is in the range 80…A3

Start bytes of 4-byte sequences (F0-F4)

Start byte   Covered code range     Annotation
F0           U+10000…U+3FFFF        2nd byte must be in the range 90…BF, where B0…BF corresponds to plane 3, which has not been used so far
F1           U+40000…U+7FFFF        currently no valid characters in this range
F2           U+80000…U+BFFFF        currently no valid characters in this range
F3           U+C0000…U+FFFFF
F4           U+100000…U+10FFFF      2nd byte must be in the range 80…8F
Code  …0   …1   …2   …3   …4   …5   …6   …7   …8   …9   …A   …B   …C   …D   …E   …F
0…    NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1…    DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2…    SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3…    0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4…    @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5…    P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6…    `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7…    p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
8… to B…   Second, third or fourth byte of a byte sequence.
C… to D…   Start of a 2-byte sequence.
E…         Start of a 3-byte sequence.
F…         Start of a 4-byte sequence.
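The ranges in the table translate directly into a byte classifier. The following Python sketch uses a hypothetical function name and hypothetical label strings; it classifies a single byte in isolation and therefore does not apply the additional second-byte restrictions listed for E0, ED, F0 and F4.

```python
def classify_byte(value: int) -> str:
    """Classify one byte value according to the ranges in the table above."""
    if not 0 <= value <= 0xFF:
        raise ValueError("not a byte value")
    if value <= 0x7F:
        return "ascii"
    if value <= 0xBF:
        return "continuation"
    if value <= 0xC1:
        return "invalid"      # would start an overlong 2-byte sequence
    if value <= 0xDF:
        return "start-2"
    if value <= 0xEF:
        return "start-3"
    if value <= 0xF4:
        return "start-4"
    return "invalid"          # F5-FF: beyond the Unicode code space or undefined


invalid = [f"{b:02X}" for b in range(256) if classify_byte(b) == "invalid"]
print(len(invalid), " ".join(invalid))
```

The listing shows that thirteen byte values (C0, C1 and F5 to FF) can never occur anywhere in valid UTF-8, which is a cheap first test when guessing whether unknown data is UTF-8.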

Examples

Some coding examples for UTF-8 are given in the following table:

Examples of UTF-8 encodings
Character                       Unicode   Unicode binary               UTF-8 binary                          UTF-8 hexadecimal
Letter y                        U+0079    00000000 01111001            01111001                              79
Letter ä                        U+00E4    00000000 11100100            11000011 10100100                     C3 A4
Registered trademark sign ®     U+00AE    00000000 10101110            11000010 10101110                     C2 AE
Euro sign                       U+20AC    00100000 10101100            11100010 10000010 10101100            E2 82 AC
Treble clef 𝄞                   U+1D11E   00000001 11010001 00011110   11110000 10011101 10000100 10011110   F0 9D 84 9E

The last example lies outside the 16-bit code range originally covered by Unicode (before version 2.0), which the current Unicode version contains as the BMP (plane 0). Since many fonts do not yet cover these newer Unicode ranges, the characters they contain cannot be displayed correctly on many platforms. Instead, a replacement glyph is shown as a placeholder.
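The rows of the example table can be reproduced with Python's built-in codec. The helper names utf8_hex and utf8_binary are hypothetical; the code points are the five from the table.

```python
def utf8_hex(code_point: int) -> str:
    """UTF-8 bytes of a code point as space-separated uppercase hex."""
    return " ".join(f"{byte:02X}" for byte in chr(code_point).encode("utf-8"))


def utf8_binary(code_point: int) -> str:
    """UTF-8 bytes of a code point as space-separated 8-bit binary groups."""
    return " ".join(f"{byte:08b}" for byte in chr(code_point).encode("utf-8"))


for code_point in (0x0079, 0x00E4, 0x00AE, 0x20AC, 0x1D11E):
    print(f"U+{code_point:04X}  {utf8_binary(code_point)}  {utf8_hex(code_point)}")
```

In the binary column the prefixes 0, 110, 1110 and 11110 of the first byte and the prefix 10 of every further byte are directly visible.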

Representation in editors

Byte Order Mark

Although the problem of differing byte orders cannot arise in UTF-8 because of its encoding principle, some programs insert a byte order mark (BOM) at the beginning of UTF-8 files. The BOM consists of the byte sequence EF BB BF, which usually appears in non-UTF-8-capable text editors and browsers as the ISO-8859-1 character sequence ï»¿ and can cause compatibility problems.
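A program that has to accept files with or without a BOM can strip it before decoding. The sketch below uses the constant codecs.BOM_UTF8 and the "utf-8-sig" codec from the Python standard library; the function name strip_utf8_bom is a hypothetical helper.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 byte order mark (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


raw = codecs.BOM_UTF8 + "abc".encode("utf-8")

print(strip_utf8_bom(raw))                    # BOM removed by hand
print(raw.decode("utf-8-sig"))                # codec that drops the BOM itself
print(repr(raw.decode("utf-8")))              # plain UTF-8 keeps U+FEFF
print(codecs.BOM_UTF8.decode("iso-8859-1"))   # what a Latin-1 viewer shows
```

Decoding with plain "utf-8" keeps the mark as the invisible character U+FEFF at the start of the string, which is a common source of hard-to-find comparison failures.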

Characters not in the Basic Latin Unicode block

The letters of the basic Latin alphabet and the most important punctuation marks are displayed identically in UTF-8 and ISO-8859-*. Problems with a wrongly chosen character encoding arise with the other characters, for example umlauts. In German-language texts, however, these characters occur only sporadically, so the text looks badly garbled but mostly remains legible.

In UTF-8, the umlauts of the German alphabet (provided they are in normalization form NFC, i.e. precomposed characters) and the ß consist of two bytes; in ISO 8859, each character is encoded as one byte, and each byte is turned into one character when read. The first byte C3 hex, which is common to the UTF-8 encodings of these letters, is decoded differently depending on the ISO 8859 part, as the table shows, and so is the second byte of the encodings of ä, ö and ü. For Ä, Ö, Ü and ß, by contrast, the second byte is either not displayed at all or shown as the same error character, because 7F hex to 9F hex are not defined in ISO 8859, which makes the text harder to read.

When a text encoded in ISO 8859 is interpreted as UTF-8, the letters ö and ü lead to the display of a replacement character because the corresponding byte value, as shown in the table below, is not defined. For the letters ä, Ä, Ö, Ü and ß, a start byte is assumed, and the decoder attempts to interpret the next byte as a continuation byte belonging to the same character. This often fails, because the encodings of most letters are not valid continuation bytes. For an ä, the decoder even attempts to interpret the next two bytes as continuation bytes, which regularly fails for the same reason. Depending on how the displaying program is written, a corresponding number of letters may disappear from the text.

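UTF-8 text opened with a different encoding:
Unicode   UTF-8 bytes   Character   Read as ISO-8859-1           Read as ISO-8859-15          Read as UTF-16 (big-endian)
U+00E4    C3A4 hex      ä           Ã¤                           Ã€                           U+C3A4 (Hangul syllable)
U+00F6    C3B6 hex      ö           Ã¶                           Ã¶                           U+C3B6 (Hangul syllable)
U+00FC    C3BC hex      ü           Ã¼                           ÃŒ                           U+C3BC (Hangul syllable)
U+00DF    C39F hex      ß           Ã + undefined byte 9F        Ã + undefined byte 9F        U+C39F (Hangul syllable)
U+00C4    C384 hex      Ä           Ã + undefined byte 84        Ã + undefined byte 84        U+C384 (Hangul syllable)
U+00D6    C396 hex      Ö           Ã + undefined byte 96        Ã + undefined byte 96        U+C396 (Hangul syllable)
U+00DC    C39C hex      Ü           Ã + undefined byte 9C        Ã + undefined byte 9C        U+C39C (Hangul syllable)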
Byte values in ISO/IEC 8859 and in UTF-8. The character column lists the characters assigned in ISO/IEC 8859-1, -2, -3, -4, -9, -10, -13, -14, -15 and -16 (ISO Latin 1 to 10); a single entry means the character is the same in all parts.

Binary      Oct   Dec   Hex   ISO/IEC 8859 characters   UTF-8
1010 0100   244   164   A4    ¤ ¤ Ī ¤ Ċ                 Continuation byte, +24 hex
1011 0110   266   182   B6    ś ĥ ļ ķ                   Continuation byte, +36 hex
1011 1100   274   188   BC    ¼ ź ĵ ŧ ¼ ž ¼ Œ           Continuation byte, +3C hex
1100 0011   303   195   C3    Ã Ă Ã Ć Ã Ă               Start byte, Latin, 00C0
1100 0100   304   196   C4    Ä                         Start byte, Latin, 0100
1101 0110   326   214   D6    Ö                         Start byte, Hebrew, 0580
1101 1100   334   220   DC    Ü                         Start byte, Syriac, 0700
1101 1111   337   223   DF    ß                         Start byte, N'Ko, 07C0
1110 0100   344   228   E4    ä                         Start byte, CJK, 4000
1111 0110   366   246   F6    ö                         Not permitted
1111 1100   374   252   FC    ü                         Not permitted

An example of the word height :

UTF-8 text in ISO-8859-1 / 9 / 13-16 environment
Heightheight . ; ISO-8859-1 text in UTF-8 environment
HeightH he or error message with abort. A byte with the hexadecimal value F6 is not allowed in UTF-8. It is common practice to insert the replacement character (U + FFFD) for non-convertible characters .
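Both directions of this mix-up can be reproduced in Python using the German word Höhe ("height"). The function misread is a hypothetical helper; the codec names "utf-8", "iso-8859-1" and "iso-8859-15" are those of Python's standard codecs.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one encoding and decode the bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


word = "H\u00f6he"  # the German word for "height", with o-umlaut

# UTF-8 bytes shown by an ISO-8859-1 program: the umlaut becomes two characters.
print(misread(word, "utf-8", "iso-8859-1"))

# ISO-8859-1 bytes read as UTF-8: byte F6 is invalid and becomes U+FFFD.
print(misread(word, "iso-8859-1", "utf-8"))

# Without errors="replace" the strict decoder aborts instead.
try:
    word.encode("iso-8859-1").decode("utf-8")
except UnicodeDecodeError as error:
    print("decoding aborted:", error.reason)
```

The first case loses no information and can be reversed by re-encoding as ISO-8859-1 and decoding as UTF-8, whereas the replacement character in the second case discards the original byte.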

