UTF-7

UTF-7 is an encoding of the Unicode character set defined in RFC 2152. Despite the similarity of its name to other encodings, UTF-7 is not part of the Unicode standard. It allows Unicode text to be used in environments that are not 8-bit clean.

Motivation

Many Internet protocols (such as SMTP for e-mail and NNTP for news) require the use of ASCII. This character encoding allows only 128 different characters, each stored in 7 bits. All other UTF encodings use at least 8 bits to encode a character, so transmitting UTF-8 text over such protocols would require an additional 7-bit encoding step.

There are various transfer encodings (see MIME), such as Base64 and quoted-printable, that convert arbitrary 8-bit binary data into 7-bit ASCII text. Depending on the method and on the data, the encoded output is larger than the input. UTF-7 was designed to keep this overhead as low as possible for texts that contain only a few non-ASCII characters, while leaving passages that can be represented in 7-bit ASCII directly readable.

Encoding

In UTF-7, the characters A–Z, a–z, 0–9 and ' ( ) , . / : ? - are transmitted as they are. The ASCII characters ! " # $ % & * ; < = > @ [ ] ^ _ ` { | } may also be transmitted directly, but it can be preferable to encode them, since they may not pass correctly through all e-mail gateways.

All other characters are specially encoded. A sequence of such characters is first represented as a stream of 2-byte code units (UTF-16, with surrogate pairs where necessary) and then converted into ASCII characters using a modified Base64 method (without the terminating =). The start of such an encoded sequence is indicated by a plus sign (+), the end by a minus sign (-) or by the first ASCII character that cannot occur in Base64 output. Unused bits at the end of the encoded sequence must be set to 0.
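
The procedure described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the reference algorithm of RFC 2152: the names utf7_encode, _shift, DIRECT and BASE64_CHARS are hypothetical, the sketch always encodes the optional characters (such as = or !) instead of passing them through, and it assumes that space, tab, CR and LF are sent directly and that a literal plus sign is written as +-.

```python
import base64

# Characters this sketch passes through unchanged: the directly transmitted
# set named in the text, plus space, tab, CR and LF (an assumption of this
# sketch). The optional characters such as '=' or '!' are always encoded.
DIRECT = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "'(),-./:? \t\r\n"
)
BASE64_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
)


def _shift(run: str, next_char: str) -> str:
    """Encode one run of non-direct characters as '+<modified Base64>'."""
    # 2-byte big-endian code units; surrogate pairs are produced as needed.
    units = run.encode("utf-16-be")
    # Standard Base64 pads the last group with zero bits; dropping the '='
    # padding leaves exactly the "unused bits set to 0" form.
    body = base64.b64encode(units).decode("ascii").rstrip("=")
    # An explicit '-' is needed at the end of the input, or when the next
    # character is '-' or could be mistaken for a Base64 character.
    needs_minus = (
        next_char == "" or next_char == "-" or next_char in BASE64_CHARS
    )
    return "+" + body + ("-" if needs_minus else "")


def utf7_encode(text: str) -> str:
    """Return a UTF-7 representation of text as an ASCII-only string."""
    out = []
    run = []  # pending characters that must be Base64-encoded
    for ch in text:
        if ch in DIRECT:
            if run:
                out.append(_shift("".join(run), ch))
                run = []
            out.append(ch)
        elif ch == "+" and not run:
            out.append("+-")  # escaped literal plus sign
        else:
            run.append(ch)
    if run:
        out.append(_shift("".join(run), ""))
    return "".join(out)
```

Because the terminating minus sign is emitted only where it is needed, a run followed by a space ends implicitly at that space, while a run followed by a letter or digit is closed explicitly. The result can be cross-checked by decoding it with Python's built-in "utf-7" codec.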

English text in this encoding can be read by people without difficulty, since encoded special characters appear only rarely. The umlauts and special characters of other Western European languages, however, have to be encoded, which already distorts the text noticeably. Texts in languages that do not use the Latin alphabet are no longer readily human-readable.

Examples
  • For example, the German text “Wikipedia – Die freie Enzyklopädie” (“Wikipedia – The Free Encyclopedia”) becomes Wikipedia +IBM Die freie Enzyklop+AOQ-die in UTF-7.
  • The German word Übergröße (“oversize”) becomes +ANw-bergr+APYA3w-e in UTF-7, which at 19 bytes is somewhat more compact than the 24 bytes required by quoted-printable UTF-8: =C3=9Cbergr=C3=B6=C3=9Fe.
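
The byte counts in the second example can be checked with Python's standard library. The sketch below assumes the original word is the German Übergröße (which is what the encoded strings decode to) and uses the built-in "utf-7" codec and the quopri module; the function name encoded_sizes is hypothetical.

```python
import quopri


def encoded_sizes(word: str) -> tuple[int, int]:
    """Return (UTF-7 size, quoted-printable UTF-8 size) in bytes."""
    utf7 = word.encode("utf-7")                      # built-in UTF-7 codec
    qp = quopri.encodestring(word.encode("utf-8"))   # quoted-printable over UTF-8
    return len(utf7), len(qp)


if __name__ == "__main__":
    word = "Übergröße"
    print(word.encode("utf-7"))
    print(quopri.encodestring(word.encode("utf-8")))
    print(encoded_sizes(word))
```

For pure ASCII input both sizes equal the length of the text, so the difference only shows up for the non-ASCII characters.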

Despite its somewhat higher encoding efficiency, UTF-7 has not gained wide acceptance, since other methods such as quoted-printable and Base64 are understood by almost every e-mail and news program, and their larger encoding overhead does not matter in practice.