UTF-8

UTF-8 (abbreviation for 8-bit UCS Transformation Format, where UCS in turn abbreviates Universal Coded Character Set) is the most widely used encoding for Unicode characters (Unicode and UCS are practically identical). The encoding was devised by Ken Thompson and Rob Pike in September 1992 while they were working on the Plan 9 operating system. In the context of X/Open it was initially referred to as FSS-UTF (file system safe UTF, as opposed to UTF-1, which does not have this property).

UTF-8 is identical to ASCII in the first 128 characters (code points 0–127) and, since characters in many Western languages usually need only one byte, is particularly well suited to encoding English-language texts. Such texts can generally be edited without impairment even in text editors that are not UTF-8-capable, which is one of the reasons why UTF-8 has become the de facto standard character encoding of the Internet and associated document types. In March 2019, 93.1% of all websites and 94.8% of the top 1000 were using UTF-8.

In other languages, the memory requirement in bytes per character is greater wherever they go beyond the ASCII character set: even the German umlauts require two bytes, as do Greek or Cyrillic characters. Characters from Far Eastern and African languages, on the other hand, occupy up to 4 bytes per character. Processing UTF-8 as a multi-byte string requires more computation than encodings with a fixed number of bytes per character, because each byte has to be analyzed, and for certain languages it also requires more storage. Depending on the application, other UTF encodings are therefore also used to represent Unicode character sets: Microsoft Windows, the most widely used desktop operating system, internally uses UTF-16 little endian as a compromise between UTF-8 and UTF-32.
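The storage figures above can be checked with a short Python sketch. The sample characters and the helper name `encoded_sizes` are chosen here purely for illustration; the `-le` codec variants are used so that no byte order mark is counted.

```python
# Compare how many bytes a few sample characters need in each UTF encoding.
# "A" is ASCII, "\u00e4" is the umlaut ä, "\u20ac" is the euro sign,
# "\U0001d11e" is the treble clef outside the Basic Multilingual Plane.
samples = ["A", "\u00e4", "\u20ac", "\U0001d11e"]

def encoded_sizes(ch: str) -> dict:
    """Return the encoded length in bytes of one character per encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # "-le" avoids adding a BOM
        "utf-32": len(ch.encode("utf-32-le")),
    }

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

For these samples UTF-8 is the shortest form for ASCII, ties with UTF-16 for the umlaut, and is one byte longer than UTF-16 for the euro sign.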

General

With UTF-8 encoding, each Unicode character is assigned a specially encoded byte sequence of variable length. UTF-8 supports byte sequences up to four bytes long, onto which, as with all UTF formats, all Unicode characters can be mapped.

UTF-8 is of central importance as a global character encoding on the Internet. The Internet Engineering Task Force requires of all new Internet communication protocols that the character encoding be declared and that UTF-8 be one of the supported encodings. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and send UTF-8.

In HTML, the markup language used in web browsers, UTF-8 is likewise becoming increasingly common for representing language-specific characters and is replacing the previously used HTML entities.

Properties

Multi-byte character encoding (MBCS) similar to CP950 / CP936 / CP932 (Chinese / Japanese), but without the (at that time important and useful) property that double-width characters are two bytes long
7-bit ASCII is at the same time valid UTF-8, and highly compatible with previous 8-bit character sets
Multi-byte sequences never contain bytes from the 7-bit ASCII range (this enables processing and parsing with the usual 7-bit character constants)
Compared to UTF-16, relatively compact for text with a high proportion of ASCII characters, but more space-intensive for characters between U+0800 and U+FFFF (especially Asian languages, see list of Unicode blocks)
Sort order is preserved: two UTF-8 byte strings sort in the same order as the corresponding unencoded Unicode strings
Searchable in both directions (not the case with previous MBCS)
Simple transcoding function (also easy to implement in hardware)
Plenty of encoding reserve (in case something changes in the Unicode standard)
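Two of the properties listed above, the preserved sort order and the absence of ASCII byte values inside multi-byte sequences, can be checked with a small Python sketch. The word list and the helper name `multibyte_is_ascii_free` are illustrative assumptions, not part of any standard.

```python
words = ["zebra", "\u00c4pfel", "apple", "\u20acuro", "\U0001d11eclef", "\u65e5\u672c"]

# Python compares str objects by code point and bytes objects by byte value,
# so both orderings can be produced with the built-in sorted().
by_code_point = sorted(words)
by_utf8_bytes = [b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)]
print(by_code_point == by_utf8_bytes)

def multibyte_is_ascii_free(text: str) -> bool:
    """True if no multi-byte sequence in the UTF-8 form contains a byte below 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True

print(all(multibyte_is_ascii_free(w) for w in words))
```

Because no byte of a multi-byte sequence falls into the ASCII range, a search for an ASCII delimiter such as `/` or a NUL byte cannot match in the middle of another character.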

Standardization

UTF-8 is currently defined identically by the IETF, the Unicode Consortium and the ISO in the following standard documents:

RFC 3629 / STD 63 (2003)
The Unicode Standard, Version 4.0, §3.9–§3.10 (2003)
ISO/IEC 10646-1:2000 Annex D (2000)

These replace older, partly different definitions, some of which are still used by older software:

ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)
The Unicode Standard, Version 2.0, Appendix A (1996)
RFC 2044 (1996)
RFC 2279 (1998)
The Unicode Standard, Version 3.0, §2.3 (2000) and Corrigendum #1: UTF-8 Shortest Form (2000)
Unicode Standard Annex #27: Unicode 3.1 (2001)

Coding

Algorithm

Unicode characters with values in the range from 0 to 127 (0 to 7F hexadecimal) are represented in UTF-8 as one byte with the same value. Therefore, all data that uses only genuine ASCII characters is identical in both representations.

Unicode characters greater than 127 are encoded in UTF-8 as byte sequences two to four bytes long.

Unicode range (hexadecimal)	UTF-8 encoding (binary, scheme)	Algorithm / explanation	Number of encodable characters
0000 0000 - 0000 007F	0xxxxxxx	In this range (128 characters), UTF-8 corresponds exactly to the ASCII code: the highest bit is 0, the remaining 7-bit combination is the ASCII character.	2⁷	128
0000 0080 - 0000 07FF	110xxxxx 10xxxxxx	The first byte always begins with 11, the following bytes with 10. The x's stand for the bits of the Unicode character value. The least significant bit of the character value is mapped to the rightmost x in the last byte, the more significant bits follow from right to left. The number of ones before the first 0 in the first byte equals the total number of bytes for the character. (The figures in parentheses on the right give the theoretical maximum number of encodable characters, which, however, may not be fully usable because of restrictions in the Unicode or UTF-8 standard.)	2¹¹ - 2⁷ (2¹¹)	1920 (2048)
0000 0800 - 0000 FFFF	1110xxxx 10xxxxxx 10xxxxxx		2¹⁶ - 2¹¹ (2¹⁶)	63,488 (65,536)
0001 0000 - 0010 FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx		2²⁰ (2²¹)	1,048,576 (2,097,152)
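The bit layout in the table can be written out as a minimal Python sketch. The function name `utf8_encode` is hypothetical; production code would normally rely on the built-in `str.encode("utf-8")`, which is used here only as a cross-check. The sketch additionally rejects the surrogate range U+D800 to U+DFFF, which is not valid in UTF-8.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point by hand, following the four table rows."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Cross-check against Python's own encoder for one code point per row.
for cp in (0x79, 0xE4, 0x20AC, 0x1D11E):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} ->", utf8_encode(cp).hex())
```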

Remarks

The algorithm theoretically allows byte sequences up to eight bytes long and thus over four trillion characters. The last level would have 11111111 as the first byte, followed by seven continuation bytes with six payload bits each. The entire code space would then comprise 2^(7 · 6) = 2⁴² = 4,398,046,511,104 characters. Originally, start bytes up to 1111110x were defined, giving up to five continuation bytes of the form 10xxxxxx, i.e. six bytes in total with 31 bits for the Unicode value. In its use as a UTF encoding, however, UTF-8 is limited to the code space common to all Unicode encodings, i.e. 0 to 0010 FFFF (1,114,112 possibilities), and its byte sequences are at most four bytes long. The value range available for the character code is thus ultimately not fully used. Correspondingly long byte sequences and large values are now considered invalid codes and must be treated as such.

The first byte of a UTF-8-encoded character is called the start byte; further bytes are called continuation bytes. Start bytes always begin with 0 or 11, continuation bytes always with 10.

If the highest bit of the first byte is 0, it is an ASCII character, since ASCII is a 7-bit encoding and the first 128 Unicode characters correspond to the ASCII characters. This means that all ASCII strings are automatically upward compatible with UTF-8.
If the highest bit of the first byte is 1, it is a multi-byte character, i.e. a Unicode character with a code point greater than 127.
If the highest two bits of a byte are 11, it is the start byte of a multi-byte character; if they are 10, it is a continuation byte.
The lexical order by byte values corresponds to the lexical order by code points, since higher code points are encoded with correspondingly more 1-bits in the start byte.
In the start byte of a multi-byte character, the number of leading 1-bits gives the total number of bytes of the encoded Unicode character. Put differently, the number of 1-bits to the left of the highest 0-bit equals the number of continuation bytes plus one. For example, in 1110xxxx 10xxxxxx 10xxxxxx there are three 1-bits before the highest 0-bit, hence three bytes in total, and two 1-bits after the highest 1-bit, hence two continuation bytes.
Start bytes (0… or 11…) and continuation bytes (10…) can be clearly distinguished from one another. A byte stream can therefore be read from the middle without decoding problems, which is particularly important when recovering damaged data: bytes starting with 10 are simply skipped until one starting with 0 or 11 is found. With encodings that lack this property, it may be impossible to read a data stream whose beginning is unknown.
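The rules in this list can be sketched in a few lines of Python. The helper names `leading_ones`, `is_continuation` and `resync` are invented for this illustration; the sketch only shows how a reader might find the next character boundary and does not validate the data.

```python
def leading_ones(byte: int) -> int:
    """Count the 1-bits before the first 0-bit, starting at the top bit."""
    count, mask = 0, 0x80
    while mask and (byte & mask):
        count += 1
        mask >>= 1
    return count

def is_continuation(byte: int) -> bool:
    """Continuation bytes have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def resync(data: bytes, offset: int) -> int:
    """Return the index of the first start byte at or after offset."""
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset

stream = "a\u20acb".encode("utf-8")   # 61 E2 82 AC 62
start = resync(stream, 2)             # offset 2 lies inside the euro sign
print(start, stream[start:].decode("utf-8"))
```

Here `leading_ones` returns 0 for an ASCII byte, 1 for a continuation byte, and the total sequence length for the start byte of a multi-byte character.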

Note:

Theoretically, the same character can be encoded in different ways (for example “a” as 01100001 or, incorrectly, as 11000001 10100001). However, only the shortest possible encoding is allowed. This has led to problems several times: when given invalid encodings, programs crash, interpret them as valid, or simply ignore them. The combination of the last two behaviors has led, for example, to firewalls that fail to recognize dangerous content because of the invalid encoding, while the client to be protected interprets the encoding as valid and is thereby endangered.
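The difference between a lenient and a conforming decoder can be illustrated as follows. `naive_decode_2byte` is a deliberately incomplete, hypothetical decoder that omits the shortest-form check; `strict_decode` is a thin wrapper around Python's built-in decoder, which rejects the overlong form.

```python
def naive_decode_2byte(b0: int, b1: int) -> int:
    """Decode a 2-byte sequence WITHOUT checking for the shortest form."""
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)

def strict_decode(data: bytes):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

overlong_a = bytes([0b11000001, 0b10100001])  # C1 A1, the forbidden long form of "a"

print(chr(naive_decode_2byte(*overlong_a)))   # the naive decoder yields "a"
print(strict_decode(overlong_a))              # the conforming decoder yields None
```

A filter built on the strict behavior and a client built on the naive behavior would disagree about the same input, which is exactly the situation described above.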

If a character occupies several bytes, the bits are right-aligned: the lowest bit (least significant bit) of the Unicode character is always in the lowest bit of the last UTF-8 byte.

Originally there were also encodings with more than four octets (up to six), but these have been excluded because there are no corresponding characters in Unicode and ISO 10646 has been aligned with Unicode in its possible character range.

For all scripts based on the Latin alphabet, UTF-8 is a particularly space-saving method for representing Unicode characters.

The Unicode ranges U+D800 to U+DBFF and U+DC00 to U+DFFF are expressly not characters; they are used in UTF-16 to encode characters outside the Basic Multilingual Plane and were formerly referred to as high and low surrogates. Consequently, byte sequences corresponding to these ranges are not valid UTF-8. For example, U+10400 is represented in UTF-16 as D801 DC00, but in UTF-8 it should be expressed as F0 90 90 80 rather than ED A0 81 ED B0 80. Java has supported this since version 1.5. Because the incorrect encoding was in widespread use, especially in databases, it was subsequently standardized as CESU-8.
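The three byte sequences for U+10400 can be reproduced in Python. Python has no CESU-8 codec; as an assumption for this sketch, the CESU-8 style bytes are produced by passing the two surrogates through the UTF-8 codec with the `surrogatepass` error handler.

```python
ch = "\U00010400"

utf16 = ch.encode("utf-16-be")   # surrogate pair D801 DC00
utf8 = ch.encode("utf-8")        # the only valid UTF-8 form

# Encoding each UTF-16 surrogate as if it were a character gives the
# CESU-8 style sequence; "surrogatepass" makes the codec emit surrogates.
cesu8_style = "\ud801\udc00".encode("utf-8", "surrogatepass")

def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(utf16.hex(), utf8.hex(), cesu8_style.hex())
print(is_valid_utf8(utf8), is_valid_utf8(cesu8_style))
```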

UTF-8, UTF-16 and UTF-32 can each encode the entire Unicode value range.

If a byte sequence cannot be interpreted as UTF-8 characters, it is usually replaced on reading by the Unicode replacement character U+FFFD (bytes EF BF BD).
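In Python this substitution corresponds to the `errors="replace"` mode of the built-in decoder. The wrapper name `decode_lossy` and the sample bytes are chosen only for this illustration.

```python
REPLACEMENT = "\ufffd"

def decode_lossy(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for undecodable bytes."""
    return data.decode("utf-8", errors="replace")

damaged = b"abc\xffdef"                       # FF never occurs in valid UTF-8
print(decode_lossy(damaged))
print(REPLACEMENT.encode("utf-8").hex())      # the bytes of U+FFFD itself
```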

Permitted bytes and their meaning

Due to the UTF-8 encoding rules, certain byte values are not permitted. The following table lists all 256 possible byte values together with their use and validity: values that directly represent a character, values that begin a sequence of two or more bytes, values that continue such a sequence, and values that are not permitted.

UTF-8 value ranges and their meaning

Binary	Hexadecimal	Decimal	Meaning
00000000-01111111	00-7F	0-127	One-byte characters, identical to US-ASCII.
10000000-10111111	80-BF	128-191	Second, third or fourth byte of a byte sequence.
11000000-11000001	C0-C1	192-193	Not permitted: start of a 2-byte sequence that would map the code range from 0 to 127.
11000010-11011111	C2-DF	194-223	Start of a 2-byte sequence (U+0080…U+07FF)

Start byte	Covered code range
C2	U+0080…U+00BF
C3	U+00C0…U+00FF
C4	U+0100…U+013F
C5	U+0140…U+017F
C6	U+0180…U+01BF
C7	U+01C0…U+01FF
C8	U+0200…U+023F
C9	U+0240…U+027F
CA	U+0280…U+02BF
CB	U+02C0…U+02FF
CC	U+0300…U+033F
CD	U+0340…U+037F
CE	U+0380…U+03BF
CF	U+03C0…U+03FF
D0	U+0400…U+043F
D1	U+0440…U+047F
D2	U+0480…U+04BF
D3	U+04C0…U+04FF
D4	U+0500…U+053F
D5	U+0540…U+057F
D6	U+0580…U+05BF
D7	U+05C0…U+05FF
D8	U+0600…U+063F
D9	U+0640…U+067F
DA	U+0680…U+06BF
DB	U+06C0…U+06FF
DC	U+0700…U+073F
DD	U+0740…U+077F
DE	U+0780…U+07BF
DF	U+07C0…U+07FF

11100000-11101111	E0-EF	224-239	Start of a 3-byte sequence (U+0800…U+FFFF)

Start byte	Covered code range	Note
E0	U+0800…U+0FFF	2nd byte 80…9F: impermissible encoding for U+0000…U+07FF; 2nd byte A0…BF: U+0800…U+0FFF
E1	U+1000…U+1FFF
E2	U+2000…U+2FFF
E3	U+3000…U+3FFF
E4	U+4000…U+4FFF
E5	U+5000…U+5FFF
E6	U+6000…U+6FFF
E7	U+7000…U+7FFF
E8	U+8000…U+8FFF
E9	U+9000…U+9FFF
EA	U+A000…U+AFFF
EB	U+B000…U+BFFF
EC	U+C000…U+CFFF
ED	U+D000…U+DFFF	2nd byte 80…9F: U+D000…U+D7FF; 2nd byte A0…BF: not permitted (see CESU-8)
EE	U+E000…U+EFFF	(Private Use Area)
EF	U+F000…U+FFFF	(Private Use Area if the 2nd byte is in the range 80…A3)

11110000-11110100	F0-F4	240-244	Start of a 4-byte sequence (including the invalid code ranges from 110000 to 13FFFF)

Start byte	Covered code range
F0	U+10000…U+3FFFF (2nd byte must be in the range 90…BF, where B0…BF corresponds to plane 3, which long remained unused)
F1	U+40000…U+7FFFF (currently no valid characters in this range)
F2	U+80000…U+BFFFF (currently no valid characters in this range)
F3	U+C0000…U+FFFFF
F4	U+100000…U+10FFFF (2nd byte must be in the range 80…8F!)

11110101-11110111	F5-F7	245-247	Invalid according to RFC 3629: start of a 4-byte sequence for the code range from 140000 upward
11111000-11111011	F8-FB	248-251	Invalid according to RFC 3629: start of a 5-byte sequence
11111100-11111101	FC-FD	252-253	Invalid according to RFC 3629: start of a 6-byte sequence
11111110-11111111	FE-FF	254-255	Invalid. Not defined in the original UTF-8 specification.

Code	…0	…1	…2	…3	…4	…5	…6	…7	…8	…9	…A	…B	…C	…D	…E	…F
0…	NUL	SOH	STX	ETX	EOT	ENQ	ACK	BEL	BS	HT	LF	VT	FF	CR	SO	SI
1…	DLE	DC1	DC2	DC3	DC4	NAK	SYN	ETB	CAN	EM	SUB	ESC	FS	GS	RS	US
2…	SP	!	"	#	$	%	&	'	(	)	*	+	,	-	.	/
3…	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?
4…	@	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O
5…	P	Q	R	S	T	U	V	W	X	Y	Z	[	\	]	^	_
6…	`	a	b	c	d	e	f	g	h	i	j	k	l	m	n	o
7…	p	q	r	s	t	u	v	w	x	y	z	{	|	}	~	DEL
8…	Second, third or fourth byte of a byte sequence.
9…
A…
B…
C…	(not permitted)	(not permitted)	Start of a 2-byte sequence.
D…	Start of a 2-byte sequence.
E…	Start of a 3-byte sequence.
F…	Start of a 4-byte sequence (F0…F4; F5…FF are not permitted).
	…0	…1	…2	…3	…4	…5	…6	…7	…8	…9	…A	…B	…C	…D	…E	…F
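The classification in the tables above can be condensed into a small Python function. The name `classify_byte` and the category labels are assumptions made for this sketch; the function looks at a single byte only and does not apply the additional second-byte restrictions noted for E0, ED, F0 and F4.

```python
from collections import Counter

def classify_byte(b: int) -> str:
    """Classify one byte value (0..255) according to the table of permitted bytes."""
    if b <= 0x7F:
        return "ascii"          # one-byte character
    if b <= 0xBF:
        return "continuation"   # second, third or fourth byte
    if b <= 0xC1:
        return "invalid"        # would start an overlong 2-byte form
    if b <= 0xDF:
        return "start-2"
    if b <= 0xEF:
        return "start-3"
    if b <= 0xF4:
        return "start-4"
    return "invalid"            # F5..FF

counts = Counter(classify_byte(b) for b in range(256))
print(dict(counts))
```

Counting the categories shows that 13 of the 256 byte values can never appear in valid UTF-8.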

Examples

Some encoding examples for UTF-8 are given in the following table:

Examples of UTF-8 encodings
Character	Unicode	Unicode binary	UTF-8 binary	UTF-8 hexadecimal
Letter y	U+0079	00000000 01111001	01111001	79
Letter ä	U+00E4	00000000 11100100	11000011 10100100	C3 A4
Registered trademark sign ®	U+00AE	00000000 10101110	11000010 10101110	C2 AE
Euro sign €	U+20AC	00100000 10101100	11100010 10000010 10101100	E2 82 AC
Treble clef 𝄞	U+1D11E	00000001 11010001 00011110	11110000 10011101 10000100 10011110	F0 9D 84 9E
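The rows of the table can be reproduced with Python's built-in encoder. The helper names `utf8_hex` and `utf8_bin` are hypothetical, and the characters are written as escape sequences so that the sketch does not depend on the encoding of the source file.

```python
def utf8_hex(ch: str) -> str:
    """UTF-8 bytes of a character as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))

def utf8_bin(ch: str) -> str:
    """UTF-8 bytes of a character as space-separated 8-bit binary."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))

# y, ä, registered sign, euro sign, treble clef
examples = ["\u0079", "\u00e4", "\u00ae", "\u20ac", "\U0001d11e"]

for ch in examples:
    print(f"U+{ord(ch):04X}", utf8_bin(ch), utf8_hex(ch))
```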

The last example lies outside the 16-bit code range originally contained in Unicode (before version 2.0), which the current Unicode version contains as the BMP (plane 0). Since many fonts do not yet cover these newer Unicode ranges, the characters they contain cannot be displayed correctly on many platforms. Instead, a replacement character is shown as a placeholder.

Representation in editors

Byte Order Mark

Although the problem of different byte orders cannot occur in UTF-8 because of the way it is encoded, some programs add a byte order mark (BOM) at the beginning of UTF-8 files. The BOM consists of the byte sequence EF BB BF, which usually appears in non-UTF-8-capable text editors and browsers as the ISO-8859-1 character sequence ï»¿ and can cause compatibility problems.
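How the BOM shows up in practice can be seen in a brief Python sketch. It uses the standard-library constant `codecs.BOM_UTF8` and the `utf-8-sig` codec, which strips a leading BOM when decoding; the sample content is arbitrary.

```python
import codecs

data = codecs.BOM_UTF8 + "text".encode("utf-8")   # EF BB BF 74 65 78 74

with_bom = data.decode("utf-8")          # plain UTF-8 keeps U+FEFF as a character
without_bom = data.decode("utf-8-sig")   # "utf-8-sig" drops a leading BOM
misread = data[:3].decode("iso-8859-1")  # what a non-UTF-8-capable editor shows

print(repr(with_bom), repr(without_bom), misread)
```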

Characters not in the Basic Latin Unicode block

The letters of the basic Latin alphabet and the most important punctuation marks are represented identically in UTF-8 and ISO-8859-*. Problems with a wrongly chosen character encoding arise with the other characters, for example umlauts. In German-language texts, however, these characters occur only sporadically, so that the text looks badly garbled but mostly remains legible.

In UTF-8, the umlauts of the German alphabet (provided they are in normalization form NFC, i.e. as precomposed characters) and the ß consist of two bytes; in ISO 8859, each character is encoded as one byte, and each byte is turned into one character when read. The first byte C3 (hex), which is common to the UTF-8 encoding of these letters, is decoded differently, as can be seen in the table, as is the second byte of the encoding of äöü. For ÄÖÜß, however, the second byte is either not displayed or shown as the same error character, because 7F (hex) to 9F (hex) are not defined in ISO 8859, which makes the text more difficult to read.

When a text encoded in ISO-8859 is interpreted as UTF-8, the letters öü lead to the display of a replacement character, because the corresponding byte values, as shown in the table below, are not defined. For the letters ÄÖÜß a start byte is assumed, and an attempt is made to interpret the next byte as a continuation byte so that the two together form one character. This often fails, of course, because the encodings of most letters are not valid continuation bytes. In the case of an ä, an attempt is even made to interpret the next two bytes as continuation bytes, which regularly fails for the same reason. Depending on how the displaying program is written, a corresponding number of letters may disappear from the text.

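UTF-8 text opened with a different encoding:
Code point	UTF-8 bytes (hex)	Character	ISO-8859-1	ISO-8859-15	UTF-16
U+00E4	C3A4	ä	Ã¤	Ã€	쎤
U+00F6	C3B6	ö	Ã¶	Ã¶	쎶
U+00FC	C3BC	ü	Ã¼	ÃŒ	쎼
U+00DF	C39F	ß	Ã (2nd byte 9F undefined)	Ã (2nd byte 9F undefined)	쎟
U+00C4	C384	Ä	Ã (2nd byte 84 undefined)	Ã (2nd byte 84 undefined)	쎄
U+00D6	C396	Ö	Ã (2nd byte 96 undefined)	Ã (2nd byte 96 undefined)	쎖
U+00DC	C39C	Ü	Ã (2nd byte 9C undefined)	Ã (2nd byte 9C undefined)	쎜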


ISO Latin				1	2	3	4	5	6	7	8	9	10	UTF-8
ISO/IEC 8859-				1	2	3	4	9	10	13	14	15	16	UTF-8
1010 0100	244	164	A4	¤				¤	Ī	¤	Ċ	€		Continuation byte	+24
1011 0110	266	182	B6	¶	ś	ĥ	ļ	¶	ķ	¶				Continuation byte	+36
1011 1100	274	188	BC	¼	ź	ĵ	ŧ	¼	ž	¼	ỳ	Œ		Continuation byte	+3C
1100 0011	303	195	C3	Ã	Ă		Ã			Ć	Ã		Ă	Start byte	Latin 00C0
1100 0100	304	196	C4	Ä										Start byte	Latin 0100
1101 0110	326	214	D6	Ö										Start byte	Hebrew 0580
1101 1100	334	220	DC	Ü										Start byte	Syriac 0700
1101 1111	337	223	DF	ß										Start byte	N'Ko 07C0
1110 0100	344	228	E4	ä										Start byte	CJK 4000
1111 0110	366	246	F6	ö										Not permitted
1111 1100	374	252	FC	ü										Not permitted
Bin	Oct	Dec	Hex	ISO Latin / ISO/IEC 8859-										UTF-8

An example using the German word Höhe (height):

UTF-8 text in an ISO-8859-1/9/13-16 environment: Höhe → HÃ¶he. ISO-8859-1 text in a UTF-8 environment: Höhe → H�he, or an error message with abort. A byte with the hexadecimal value F6 is not allowed in UTF-8. It is common practice to insert the replacement character (U+FFFD) for non-convertible characters.
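Both directions of this mix-up can be reproduced in Python with the German word Höhe. The variable names are illustrative; `errors="replace"` selects the replacement-character behavior, while the default strict mode corresponds to the error message with abort.

```python
word = "H\u00f6he"                                 # Höhe

# UTF-8 bytes read as ISO-8859-1: the two bytes C3 B6 become two characters.
as_utf8 = word.encode("utf-8")                     # 48 C3 B6 68 65
misread_latin1 = as_utf8.decode("iso-8859-1")

# ISO-8859-1 bytes read as UTF-8: F6 is not a permitted byte.
as_latin1 = word.encode("iso-8859-1")              # 48 F6 68 65
misread_utf8 = as_latin1.decode("utf-8", errors="replace")

try:
    as_latin1.decode("utf-8")                      # strict mode aborts instead
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(misread_latin1, misread_utf8, strict_failed)
```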

Web links

RFC 3629 - UTF-8, a transformation format of ISO 10646 (English)
UTF-8 code table with Unicode characters - UTF-8 coding of all Unicode positions from the BMP with additional information and named HTML entities
Dieter Pawelczak: Coding of strings. Example UCS / UTF8. In: University of the Federal Armed Forces, Munich. Institute for Software Engineering.
Pavel Radzivilovsky, Yakov Galka, Slava Novgorodov: UTF-8 Everywhere. Manifesto. (English)

References

↑ RFC 3629 UTF-8, a transformation format of ISO 10646. Chapter 1 (Introduction), English.
↑ Historical trends in the usage of character encodings for websites. In: W3Techs. Q-Success, accessed on March 5, 2019 .
↑ Usage of character encodings broken down by ranking. In: W3Techs. Q-Success, accessed March 7, 2019 .
↑ Using International Characters in Internet Mail. ( Memento of October 26, 2007 in the Internet Archive ) Internet Mail Consortium, August 1, 1998, accessed July 12, 2012.
↑ Usage of character encodings for websites. In: W3Techs. Q-Success, accessed on July 12, 2012 (English, March 14, 2012).
↑ Norbert Lindenberg, Masayoshi Okutsu: Supplementary Characters in the Java Platform. In: Oracle website. Sun Microsystems, May 2004, accessed June 9, 2019 .
