UTF-8
UTF-8 (short for 8-bit UCS Transformation Format, where UCS in turn stands for Universal Coded Character Set) is the most widely used encoding for Unicode characters (Unicode and UCS are practically identical). The encoding was devised by Ken Thompson and Rob Pike in September 1992 while they were working on the Plan 9 operating system. In the context of X/Open it was initially called FSS-UTF (filesystem safe UTF, as opposed to UTF-1, which does not have this property).
UTF-8 is identical to ASCII in the first 128 characters (indices 0–127) and, since characters in many Western languages usually need only one byte, is particularly well suited to encoding English-language texts. Such texts can usually be edited without modification or impairment even in text editors that do not support UTF-8, which is one of the reasons UTF-8 has become the de facto standard character encoding of the Internet and associated document types. In March 2019, 93.1% of all websites and 94.8% of the top 1000 were using UTF-8.
In other languages, the memory requirement per character is greater wherever the characters fall outside the ASCII character set: even the German umlauts require two bytes, as do Greek and Cyrillic characters. Characters from Far Eastern and African languages, on the other hand, occupy up to four bytes each. Because UTF-8 is a multi-byte encoding in which every byte has to be analysed, processing it takes more computational effort, and for certain languages more storage space, than encodings with a fixed number of bytes per character. Depending on the application, other UTF encodings are therefore also used to represent Unicode character sets: Microsoft Windows, the most widely used desktop operating system, internally uses UTF-16 little-endian as a compromise between UTF-8 and UTF-32.
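The per-character sizes described above can be checked with Python's built-in codecs. The following is only a small sketch; the sample characters and the helper name `encoded_sizes` are chosen here for illustration.

```python
# Sample characters chosen for illustration; any others would work as well.
samples = {
    "A": "Latin letter (ASCII)",
    "ä": "German umlaut",
    "λ": "Greek letter",
    "я": "Cyrillic letter",
    "日": "CJK ideograph",
    "𝄞": "musical symbol outside the BMP",
}


def encoded_sizes(ch: str) -> tuple[int, int, int]:
    """Return the byte length of ch in UTF-8, UTF-16 LE and UTF-32 LE."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # explicit endianness, so no BOM is added
        len(ch.encode("utf-32-le")),
    )


for ch, description in samples.items():
    u8, u16, u32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} {description}: UTF-8={u8}, UTF-16={u16}, UTF-32={u32}")
```

For the CJK sample, UTF-8 needs three bytes where UTF-16 needs two, which illustrates the trade-off mentioned above.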
General
With UTF-8 encoding, each Unicode character is assigned a specially coded byte sequence of variable length. UTF-8 supports byte sequences of up to four bytes, onto which, as with all UTF formats, every Unicode character can be mapped.
UTF-8 is of central importance as a global character encoding on the Internet. The Internet Engineering Task Force requires that all new Internet communication protocols declare their character encoding and that UTF-8 be one of the supported encodings. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and send UTF-8.
In HTML, the markup language used in web browsers, UTF-8 is likewise becoming increasingly common for representing language-specific characters, replacing the HTML entities previously used for this purpose.
Properties
- Multi-byte character encoding (MBCS) similar to CP950 / CP936 / CP932 (Chinese / Japanese), but without the property (important and useful at the time) that double-width characters are two bytes long
- 7-bit ASCII is at the same time valid UTF-8, which makes UTF-8 highly compatible with earlier 8-bit character sets
- Multi-byte sequences never contain bytes from the 7-bit ASCII range (which allows processing and parsing with the usual 7-bit character constants)
- Compared to UTF-16, relatively compact for text with a high proportion of ASCII characters, but more space-intensive for characters between U+0800 and U+FFFF (especially Asian languages, see the list of Unicode blocks)
- Sort order is preserved: two UTF-8 byte strings sort in the same order as the corresponding unencoded Unicode strings
- Searchable in both directions (not the case with earlier MBCS)
- Simple transcoding function (also easy to implement in hardware)
- Ample coding reserve (in case something changes in the Unicode standard)
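Two of the properties listed above, preserved sort order and the absence of ASCII bytes inside multi-byte sequences, can be checked with a short sketch. The function names and the word list are hypothetical and serve only as an illustration.

```python
def sorts_identically(strings: list[str]) -> bool:
    """Check that byte-wise order of the UTF-8 forms equals code point order."""
    by_code_point = sorted(strings)  # Python compares str values by code point
    by_utf8_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
    return by_code_point == by_utf8_bytes


def has_no_ascii_bytes(ch: str) -> bool:
    """For a non-ASCII character, every UTF-8 byte should be 0x80 or higher."""
    return all(byte >= 0x80 for byte in ch.encode("utf-8"))


words = ["zebra", "Äpfel", "apple", "€uro", "日本", "𝄞clef", "éclair"]
print(sorts_identically(words))
print(all(has_no_ascii_bytes(ch) for ch in "äß€日𝄞"))
```

Because no byte of a multi-byte sequence falls in the ASCII range, searching UTF-8 data for an ASCII delimiter such as "/" cannot produce a false match inside another character.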
Standardization
UTF-8 is currently defined identically by the IETF, the Unicode Consortium and the ISO in the following standard documents:
- RFC 3629 / STD 63 (2003)
- The Unicode Standard, Version 4.0, §3.9–§3.10 (2003)
- ISO/IEC 10646-1:2000 Annex D (2000)
These replace older, partly differing definitions, some of which are still used by older software:
- ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)
- The Unicode Standard, Version 2.0, Appendix A (1996)
- RFC 2044 (1996)
- RFC 2279 (1998)
- The Unicode Standard, Version 3.0, §2.3 (2000) and Corrigendum #1: UTF-8 Shortest Form (2000)
- Unicode Standard Annex #27: Unicode 3.1 (2001)
Coding
Algorithm
Unicode characters with values in the range from 0 to 127 (0 to 7F hexadecimal) are represented in UTF-8 as a single byte with the same value. Data consisting only of true ASCII characters is therefore identical in both representations.
Unicode characters with values greater than 127 are encoded in UTF-8 as byte sequences two to four bytes long.
Unicode range (hexadecimal) | UTF-8 encoding (binary, scheme) | Algorithm / explanations | Number of encodable characters (formula) | Number of encodable characters (count) |
---|---|---|---|---|
0000 0000 – 0000 007F | 0xxxxxxx | In this range (128 characters), UTF-8 corresponds exactly to the ASCII code: the highest bit is 0, and the remaining 7-bit combination is the ASCII character. | 2^7 | 128 |
0000 0080 – 0000 07FF | 110xxxxx 10xxxxxx | The first byte always begins with 11, the following bytes with 10. The x positions stand for the bits of the Unicode character value. The least significant bit of the character value is mapped to the rightmost x in the last byte, with the more significant bits following from right to left. The number of ones before the first 0 in the first byte equals the total number of bytes for the character. (The figures in brackets give the theoretical maximum number of encodable characters, which may not be usable in full because of restrictions in the Unicode or UTF-8 standard.) | 2^11 - 2^7 (2^11) | 1,920 (2,048) |
0000 0800 – 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx | As in the row above. | 2^16 - 2^11 (2^16) | 63,488 (65,536) |
0001 0000 – 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | As in the row above. | 2^20 (2^21) | 1,048,576 (2,097,152) |
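The bit layout in the table can be written out as a small encoder. This is a minimal sketch rather than a reference implementation: the function name `encode_code_point` is hypothetical, and the rejection of surrogates and of values above 10FFFF anticipates restrictions described later in this article.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the table's bit layout."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    if cp <= 0x7F:
        return bytes([cp])                          # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),             # 110xxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),            # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    return bytes([0xF0 | (cp >> 18),                # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),       # 10xxxxxx
                  0x80 | ((cp >> 6) & 0x3F),        # 10xxxxxx
                  0x80 | (cp & 0x3F)])              # 10xxxxxx


print(encode_code_point(0x20AC).hex(" "))  # euro sign
```

The result can be compared against Python's own codec, for example `chr(0x20AC).encode("utf-8")`.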
Remarks
In theory, the algorithm allows byte sequences of up to eight bytes and thus over four trillion characters. The last level would have 11111111 as its first byte, followed by seven continuation bytes with six payload bits each, so it alone would cover 2^(7·6) = 2^42 = 4,398,046,511,104 characters. The original definition allowed first bytes up to 1111110x followed by five continuation bytes of the form 10xxxxxx, i.e. six bytes in total carrying 31 bits of the Unicode value. In its use as a UTF encoding, however, UTF-8 is restricted to the code space common to all Unicode encodings, i.e. 0 to 0010 FFFF (1,114,112 possibilities), and to byte sequences of at most four bytes. The range of values available in the encoding is therefore not fully used. Longer byte sequences and larger values are now considered invalid and must be handled accordingly.
The first byte of a UTF-8-encoded character is called the start byte; any further bytes are called continuation bytes. Start bytes always begin with 0 or 11, continuation bytes always with 10.
- If the highest bit of the first byte is 0, the byte is an ASCII character, since ASCII is a 7-bit encoding and the first 128 Unicode characters correspond to the ASCII characters. All ASCII strings are therefore automatically upward compatible with UTF-8.
- If the highest bit of the first byte is 1, the byte belongs to a multi-byte character, i.e. a Unicode character with a code point greater than 127.
- If the highest two bits of a byte are 11, it is the start byte of a multi-byte character; if they are 10, it is a continuation byte.
- The lexical order by byte values corresponds to the lexical order by code points, since higher code points are encoded with correspondingly more 1-bits in the start byte.
- In the start byte of a multi-byte character, the number of leading 1-bits gives the total number of bytes in the sequence. Put differently, the number of 1-bits to the left of the highest 0-bit equals the number of continuation bytes plus one. For example, in 1110xxxx 10xxxxxx 10xxxxxx there are three 1-bits before the highest 0-bit, so the sequence has three bytes in total, two of which are continuation bytes.
- Start bytes (0… or 11…) and continuation bytes (10…) can be told apart unambiguously. A byte stream can therefore be read from the middle without decoding problems, which is particularly important when recovering damaged data: bytes beginning with 10 are simply skipped until a byte beginning with 0 or 11 is found. With encodings that lack this property, it may be impossible to read a data stream whose beginning is unknown.
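The rules for start and continuation bytes translate directly into a resynchronization routine. The sketch below assumes hypothetical helper names (`is_continuation`, `next_boundary`, `sequence_length`) and only inspects the leading bits; it does not validate the sequence as a whole.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that does not hold a continuation byte."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


def sequence_length(start_byte: int) -> int:
    """Total sequence length announced by the leading 1-bits of a start byte."""
    if start_byte < 0x80:
        return 1  # 0xxxxxxx: single-byte character
    ones = 0
    mask = 0x80
    while start_byte & mask:
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 4:
        raise ValueError("not a valid start byte")
    return ones


data = "Höhe €".encode("utf-8")
start = next_boundary(data, 2)      # index 2 is the second byte of "ö"
print(data[start:].decode("utf-8"))
```

Starting in the middle of "ö", the routine skips one continuation byte and decoding resumes cleanly at the following character.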
Note:
- In theory, the same character can be encoded in different ways (for example "a" as 01100001 or, incorrectly, as 11000001 10100001). However, only the shortest possible encoding is permitted. This has caused problems on several occasions, with programs crashing on invalid encodings, interpreting them as valid, or simply ignoring them. The combination of the last two behaviours has led, for example, to firewalls that fail to recognize dangerous content because of its invalid encoding, while the client they are meant to protect interprets that encoding as valid and is thereby put at risk.
- If a character occupies several bytes, the bits are right-aligned: the lowest bit (least significant bit) of the Unicode character is always in the lowest bit of the last UTF-8 byte.
- Originally there were also encodings with more than four octets (up to six), but these have been excluded because Unicode contains no corresponding characters and ISO 10646 has been aligned with Unicode in its possible character range.
- For all scripts based on the Latin alphabet, UTF-8 is a particularly space-saving way of representing Unicode characters.
- The Unicode ranges U+D800 to U+DBFF and U+DC00 to U+DFFF are expressly not characters. They are used only in UTF-16 to encode characters outside the Basic Multilingual Plane and were formerly referred to as high and low surrogates, respectively. Consequently, byte sequences corresponding to these ranges are not valid UTF-8. For example, U+10400 is represented in UTF-16 as D801 DC00, but in UTF-8 it must be expressed as F0 90 90 80 rather than ED A0 81 ED B0 80. Java has supported this since version 1.5. Because the incorrect encoding was in widespread use, especially in databases, it was subsequently standardized as CESU-8.
- UTF-8, UTF-16 and UTF-32 each encode the entire Unicode value range.
- If a byte sequence cannot be interpreted as UTF-8 characters, it is usually replaced on reading by the Unicode replacement character U+FFFD (bytes EF BF BD).
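The restrictions in these notes (shortest form only, no surrogates, nothing above 10FFFF) can be illustrated with a strict single-character decoder. This is a sketch under stated assumptions: `decode_one` and `is_rejected` are hypothetical names, and a production decoder would normally rely on a library codec instead.

```python
def decode_one(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one UTF-8 sequence at pos; return (code point, bytes consumed).

    Raises ValueError for anything the notes above describe as not permitted.
    """
    first = data[pos]
    if first < 0x80:
        return first, 1
    if 0xC2 <= first <= 0xDF:
        length, cp, minimum = 2, first & 0x1F, 0x80
    elif 0xE0 <= first <= 0xEF:
        length, cp, minimum = 3, first & 0x0F, 0x800
    elif 0xF0 <= first <= 0xF4:
        length, cp, minimum = 4, first & 0x07, 0x10000
    else:
        raise ValueError(f"invalid start byte 0x{first:02X}")
    tail = data[pos + 1:pos + length]
    if len(tail) != length - 1:
        raise ValueError("truncated sequence")
    for byte in tail:
        if (byte & 0xC0) != 0x80:
            raise ValueError("expected a continuation byte")
        cp = (cp << 6) | (byte & 0x3F)
    if cp < minimum:
        raise ValueError("overlong encoding")      # not the shortest form
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code point")   # the CESU-8 style form
    if cp > 0x10FFFF:
        raise ValueError("beyond U+10FFFF")
    return cp, length


def is_rejected(data: bytes) -> bool:
    """True if decode_one refuses the sequence at the start of data."""
    try:
        decode_one(data)
    except ValueError:
        return True
    return False


print(decode_one(b"\xf0\x90\x90\x80"))             # U+10400 in four bytes
print(is_rejected(b"\xc1\xa1"))                    # overlong "a"
print(is_rejected(b"\xed\xa0\x81\xed\xb0\x80"))    # surrogate pair form
```

Rejecting these forms outright, rather than decoding them leniently, avoids the situation described above in which two components disagree about what a byte sequence means.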
Permitted bytes and their meaning
Due to the UTF-8 encoding rules, certain byte values are not permitted. The following table lists all 256 possible values together with their use and validity: values that directly represent a character, values that begin a sequence of two or more bytes, values that continue such a sequence, and values that are not permitted.
Binary | Hexadecimal | Decimal | Meaning |
---|---|---|---|
00000000–01111111 | 00–7F | 0–127 | One-byte characters, identical to US-ASCII. |
10000000–10111111 | 80–BF | 128–191 | Second, third or fourth byte of a byte sequence. |
11000000–11000001 | C0–C1 | 192–193 | Not permitted: start of a 2-byte sequence that would map the code range 0 to 127. |
11000010–11011111 | C2–DF | 194–223 | Start of a 2-byte sequence (U+0080 … U+07FF). |
11100000–11101111 | E0–EF | 224–239 | Start of a 3-byte sequence (U+0800 … U+FFFF). |
11110000–11110100 | F0–F4 | 240–244 | Start of a 4-byte sequence (including the invalid code ranges from 110000 to 13FFFF). |
11110101–11110111 | F5–F7 | 245–247 | Invalid according to RFC 3629: start of a 4-byte sequence for the code range from 140000 upward. |
11111000–11111011 | F8–FB | 248–251 | Invalid according to RFC 3629: start of a 5-byte sequence. |
11111100–11111101 | FC–FD | 252–253 | Invalid according to RFC 3629: start of a 6-byte sequence. |
11111110–11111111 | FE–FF | 254–255 | Invalid. Not defined in the original UTF-8 specification. |
Code | …0 | …1 | …2 | …3 | …4 | …5 | …6 | …7 | …8 | …9 | …A | …B | …C | …D | …E | …F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0… | NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL | BS | HT | LF | VT | FF | CR | SO | SI |
1… | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | US |
2… | SP | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
3… | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
4… | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
5… | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
6… | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
7… | p | q | r | s | t | u | v | w | x | y | z | { | \| | } | ~ | DEL |
8… to B… | Second, third or fourth byte of a byte sequence. |
C… | C0 and C1 not permitted; C2 to CF: start of a 2-byte sequence. |
D… | Start of a 2-byte sequence. |
E… | Start of a 3-byte sequence. |
F… | F0 to F4: start of a 4-byte sequence; F5 to FF not permitted. |
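The classification of the 256 byte values can also be expressed as a lookup function. The following sketch uses a hypothetical name, `classify_byte`, and its category labels are simplified versions of the table's wording.

```python
from collections import Counter


def classify_byte(value: int) -> str:
    """Return the role of a single byte value (0 to 255) in UTF-8."""
    if not 0 <= value <= 0xFF:
        raise ValueError("not a byte value")
    if value <= 0x7F:
        return "single-byte character (ASCII)"
    if value <= 0xBF:
        return "continuation byte"
    if value <= 0xC1:
        return "invalid"                      # would start an overlong 2-byte form
    if value <= 0xDF:
        return "start of 2-byte sequence"
    if value <= 0xEF:
        return "start of 3-byte sequence"
    if value <= 0xF4:
        return "start of 4-byte sequence"
    return "invalid"                          # F5 to FF


# Count how many of the 256 values fall into each category.
counts = Counter(classify_byte(value) for value in range(256))
for role, count in counts.items():
    print(f"{count:3d}  {role}")
```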
Examples
Some coding examples for UTF-8 are given in the following table:
Character | Unicode | Unicode binary | UTF-8 binary | UTF-8 hexadecimal |
---|---|---|---|---|
Letter y | U+0079 | 00000000 01111001 | 01111001 | 79 |
Letter ä | U+00E4 | 00000000 11100100 | 11000011 10100100 | C3 A4 |
Registered trademark sign ® | U+00AE | 00000000 10101110 | 11000010 10101110 | C2 AE |
Euro sign € | U+20AC | 00100000 10101100 | 11100010 10000010 10101100 | E2 82 AC |
Treble clef 𝄞 | U+1D11E | 00000001 11010001 00011110 | 11110000 10011101 10000100 10011110 | F0 9D 84 9E |
The last example lies outside the code range (16 bits) originally covered by Unicode (before version 2.0), which the current Unicode version contains as the BMP (plane 0). Since many fonts do not yet include these newer Unicode ranges, the characters in them cannot be displayed correctly on many platforms. A replacement character is shown instead as a placeholder.
Representation in editors
Byte Order Mark
Although the problem of differing byte orders cannot arise in UTF-8 because of the way the encoding works, some programs add a byte order mark (BOM) at the beginning of UTF-8 files. The BOM consists of the byte sequence EF BB BF, which text editors and browsers that do not support UTF-8 usually display as the ISO-8859-1 character sequence ï»¿ and which can cause compatibility problems.
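A reader that wants to tolerate a leading BOM can detect and strip it. The sketch below uses Python's standard `codecs.BOM_UTF8` constant and the `utf-8-sig` codec; the helper name `strip_utf8_bom` is hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 byte order mark (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "Höhe".encode("utf-8")

print(strip_utf8_bom(with_bom).decode("utf-8"))  # BOM removed by hand
print(with_bom.decode("utf-8-sig"))              # codec that skips a leading BOM
print(repr(with_bom.decode("iso-8859-1")))       # what a Latin-1 viewer would show
```

Decoding the same bytes as ISO-8859-1 reproduces the ï»¿ prefix described above.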
Characters not in the Basic Latin Unicode block
The letters of the basic Latin alphabet and the most important punctuation marks are represented identically in UTF-8 and ISO-8859-*. Problems caused by a wrongly chosen character encoding arise only with other characters, for example umlauts. In German-language texts, however, these characters occur only sporadically, so the text looks badly garbled but usually remains legible.
In UTF-8, the umlauts of the German alphabet (provided they are in normalization form NFC, i.e. precomposed characters) and the ß consist of two bytes; in ISO 8859, each character is encoded as one byte, and each byte is turned into one character when reading. The first byte, C3 hex, which is common to the UTF-8 encoding of these letters, is decoded differently depending on the ISO 8859 variant, as the table shows, as is the second byte of the encoding of ä, ö and ü. For Ä, Ö, Ü and ß, however, the second byte is either not displayed at all or displayed as the same error character, because 7F hex to 9F hex are not defined in ISO 8859, which makes the text harder to read.
When a text encoded in ISO-8859 is interpreted as UTF-8, the letters ö and ü cause a replacement character to be displayed, because the corresponding byte values, as shown in the table below, are not defined. For the letters Ä, Ö, Ü and ß, a start byte is assumed and an attempt is made to interpret the next byte as a continuation byte belonging to the same character. This usually fails, because the encodings of most letters are not valid continuation bytes. For an ä, an attempt is even made to interpret the next two bytes as continuation bytes, which regularly fails for the same reason. Depending on how the displaying program is written, a corresponding number of letters may disappear from the text.
Code point | UTF-8 bytes | Read as UTF-8 | Read as ISO-8859-1 | Read as ISO-8859-15 | Read as UTF-16 |
---|---|---|---|---|---|
U+00E4 | C3A4 hex | ä | Ã¤ | Ã€ | 쎤 |
U+00F6 | C3B6 hex | ö | Ã¶ | Ã¶ | 쎶 |
U+00FC | C3BC hex | ü | Ã¼ | ÃŒ | 쎼 |
U+00DF | C39F hex | ß | Ã (9F undefined) | Ã (9F undefined) | 쎟 |
U+00C4 | C384 hex | Ä | Ã (84 undefined) | Ã (84 undefined) | 쎄 |
U+00D6 | C396 hex | Ö | Ã (96 undefined) | Ã (96 undefined) | 쎖 |
U+00DC | C39C hex | Ü | Ã (9C undefined) | Ã (9C undefined) | 쎜 |
Binary | Oct | Dec | Hex | ISO/IEC 8859 parts 1, 2, 3, 4, 9, 10, 13, 14, 15, 16 (ISO Latin 1 to 10) | UTF-8 role | UTF-8 value |
---|---|---|---|---|---|---|
1010 0100 | 244 | 164 | A4 | ¤, Ī, Ċ or €, depending on the part | Continuation byte | +24 |
1011 0110 | 266 | 182 | B6 | ¶, ś, ĥ, ļ or ķ, depending on the part | Continuation byte | +36 |
1011 1100 | 274 | 188 | BC | ¼, ź, ĵ, ŧ, ž, ỳ or Œ, depending on the part | Continuation byte | +3C |
1100 0011 | 303 | 195 | C3 | Ã, Ă or Ć, depending on the part | Start byte | Latin 00C0 |
1100 0100 | 304 | 196 | C4 | Ä | Start byte | Latin 0100 |
1101 0110 | 326 | 214 | D6 | Ö | Start byte | Hebrew 0580 |
1101 1100 | 334 | 220 | DC | Ü | Start byte | Syriac 0700 |
1101 1111 | 337 | 223 | DF | ß | Start byte | N'Ko 07C0 |
1110 0100 | 344 | 228 | E4 | ä | Start byte | CJK 4000 |
1111 0110 | 366 | 246 | F6 | ö | Not permitted | |
1111 1100 | 374 | 252 | FC | ü | Not permitted | |
An example using the German word Höhe ("height"):
- UTF-8 text in an ISO-8859-1/9/13–16 environment: Höhe → HÃ¶he
- ISO-8859-1 text in a UTF-8 environment: Höhe → H�he, or an error message and abort. A byte with the hexadecimal value F6 is not allowed in UTF-8. It is common practice to insert the replacement character (U+FFFD) for characters that cannot be converted.
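Both directions of this mix-up can be reproduced with Python's codecs. This is a small sketch; the helper name `misread` is hypothetical, and `errors="replace"` is the standard codec option that substitutes U+FFFD for undecodable bytes.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one encoding and decode the bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


# UTF-8 text displayed in ISO-8859 environments
print(misread("Höhe", "utf-8", "iso-8859-1"))
print(misread("ä", "utf-8", "iso-8859-15"))

# ISO-8859-1 text displayed in a UTF-8 environment
print(misread("Höhe", "iso-8859-1", "utf-8"))
```

Without `errors="replace"`, the last call would raise a `UnicodeDecodeError`, which corresponds to the "error message and abort" case.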
Web links
- RFC 3629 - UTF-8, a transformation format of ISO 10646 (English)
- UTF-8 code table with Unicode characters - UTF-8 coding of all Unicode positions from the BMP with additional information and named HTML entities
- Dieter Pawelczak: Coding of strings. Example UCS / UTF8. In: University of the Federal Armed Forces, Munich. Institute for Software Engineering.
- Pavel Radzivilovsky, Yakov Galka, Slava Novgorodov: UTF-8 Everywhere. Manifesto. (English)