Character encoding

A character encoding (or simply encoding) allows the unambiguous assignment of characters (generally letters or digits) and symbols within a character set. In electronic data processing, characters are encoded as numerical values so that they can be transmitted or stored. For example, the German umlaut Ü is encoded in the ISO-8859-1 character set with the decimal value 220. In the EBCDIC character set, the same value 220 encodes the curly bracket }. To display a character correctly, the character encoding must therefore be known; the numerical value alone is not enough.
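
As a minimal sketch in Python of why the numerical value alone is not enough, the same value 220 can be decoded under two different encodings. The codec name `cp037` is an assumption made for illustration: it is only one of several EBCDIC code pages that ship with Python, and other EBCDIC code pages assign the value 220 to different characters.

```python
# The same numerical value, interpreted under two different encodings.
raw = bytes([220])

as_latin1 = raw.decode("iso-8859-1")  # ISO-8859-1 maps 220 to the umlaut Ü
as_ebcdic = raw.decode("cp037")       # assumption: one of several EBCDIC code pages

# The two results differ although the stored value is identical.
print(repr(as_latin1), repr(as_ebcdic))
```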

The numerical values of a character encoding can be stored or transmitted in various ways, e.g. as Morse code, as tones of different pitch (fax machine), or as voltages of different levels.

Binary systems have always been of particular importance, since the greater the number of basic elements of the code, the greater the risk of confusion.

History

The beginnings of this technique go back to antiquity. For example, Agamemnon used the light of a fire to signal from a ship to his troops that he wanted to begin the invasion of Troy. Smoke signals among Native Americans and the transmission of messages by drum signals in Africa are also well known.

The techniques were later refined, especially for communication between ship formations at sea. For the communication of his squadron on the voyage to South America in 1617, Sir Walter Raleigh invented a kind of forerunner of flag coding.

In 1648, England's future King James II introduced the first signal flag system in the British Navy.

After the invention of telegraphy, a character encoding was required here as well. From the original ideas of the Englishman Alfred Brain, the original Morse code emerged in 1837 and the modified Morse code in 1844.

The CCITT (Comité Consultatif International Télégraphique et Téléphonique) was ultimately the first institution to define a standardized character set. This character set was based on the Baudot code, a five-unit (5-bit) code alphabet that Jean-Maurice-Émile Baudot developed in 1870 for his synchronous telegraph and whose principle is still used today.

Computer and data exchange

With the development of the computer, binary character encoding, which had in principle been in use since the Baudot code, came to be implemented as bit sequences. Internally these are mostly represented by distinct electrical voltage levels, entirely analogous to the pitch or signal duration previously used to distinguish signal values.

To assign representable characters to these bit sequences, translation tables, so-called character sets (charsets), had to be defined. In 1963 the ASA (American Standards Association) defined the first 7-bit version of the ASCII code in order to standardize character encoding. Although IBM had taken part in the definition, it introduced its own 8-bit character code, EBCDIC, in 1964. Both are still used in computer technology today.

Since many languages require diacritical marks that modify the letters of the Latin writing system, separate character sets exist for many language groups. With the ISO 8859 series of standards, the ISO has standardized character encodings for all European languages (including Turkish) as well as for Arabic, Hebrew and Thai.

In 1991 the Unicode Consortium finally published the first version of the standard of the same name, which aims to define all characters of all languages in coded form. Unicode is also the international standard ISO 10646.

Before a text is processed electronically, the character set used and the character encoding must be determined. The following information is used, for example:

Definition of the character set in an HTML page

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Definition of the character set in the headers of an e-mail or an HTTP packet

Content-Type: text/plain; charset=ISO-8859-1
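
How a program might act on such a header can be sketched in Python with the standard library's `email.message.Message` class. The helper name `charset_from_content_type`, the fallback value `us-ascii` and the sample bytes are assumptions made for this illustration.

```python
from email.message import Message

def charset_from_content_type(header_value, default="us-ascii"):
    """Return the charset parameter of a Content-Type value in lower case."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default

charset = charset_from_content_type("text/plain; charset=ISO-8859-1")

# Hypothetical payload: four bytes as they might arrive with that header.
body = bytes([220, 98, 101, 114])
text = body.decode(charset)
print(charset, text)
```

Without the declared character set, the first byte of the payload could not be interpreted reliably.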

Graphic representation

The presence of software for character encoding and decoding does not guarantee the correct display on the computer screen. For this purpose, a font must also be available that contains the characters of the character set.

Differentiation of terms with the introduction of Unicode

With the introduction of Unicode, characters had to be represented by more than one byte, and more precise terms became necessary. In German, the terms for character set, code, coding and encoding are currently used sometimes synonymously and sometimes with distinct meanings. In English, clear distinctions already exist:

A character set (or character repertoire) is a set S of distinct characters.
A code space is a finite subset M of the natural numbers.
A character code (coded character set, CCS, code page) is a character set S together with a code space M and an injective mapping of the characters in S to the numbers in M.
A code point is an element of the code space M that designates its associated character from S. A text is represented by the code points of its characters, i.e. as a sequence of numbers from M.
An encoded character is a character from S together with its code point in M.
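
The definitions above can be sketched as a toy model in Python. The three-character repertoire and the function names are assumptions chosen for illustration; the code points are the ISO-8859-1 values.

```python
# A toy coded character set: repertoire S, code space M, injective mapping S -> M.
repertoire = {"A", "B", "Ü"}            # character set S
code_space = range(256)                 # code space M, a finite subset of the naturals
mapping = {"A": 65, "B": 66, "Ü": 220}  # code points as in ISO-8859-1

def is_coded_character_set(s, m, f):
    """Check that f covers s, stays inside m and uses no code point twice."""
    covers_s = set(f) == set(s)
    stays_in_m = all(cp in m for cp in f.values())
    injective = len(set(f.values())) == len(f)
    return covers_s and stays_in_m and injective

def to_code_points(text, f):
    """Represent a text as the sequence of code points of its characters."""
    return [f[ch] for ch in text]

print(is_coded_character_set(repertoire, code_space, mapping))
print(to_code_points("ABÜ", mapping))
```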

Next, the representation of the code points in the computer (the encoding) has to be defined. The encoding is divided into two parts: encoding form and encoding scheme. A generally accepted translation of these terms does not yet exist in German.

The encoding form (character encoding form, CEF) describes a mapping of the code points to sequences of code units. A code unit is a unit of fixed size, for example one byte in UTF-8 or two bytes in UTF-16. One code unit or a sequence of several code units is assigned to each code point, and the length of the sequence does not have to be the same for all code points.
The encoding scheme (character encoding scheme, CES) describes the byte order in which a code unit is stored in memory.

In simple cases there are no more than 256 = 2⁸ code points, so that each code point can be stored in one byte, e.g. when using one of the character codes defined in ISO 8859. This is no longer possible with Unicode, since S contains far more than 256 characters. Common encodings are UTF-8, UTF-16, UCS-2 and UTF-32.

With UTF-16 (CEF), the code points between 0 and 2¹⁶-1 are stored in two bytes and all larger ones in four bytes. As with all encodings whose element length is more than one byte, there are at least the two schemes (CES) UTF-16BE (big-endian) and UTF-16LE (little-endian), which differ in the order of the bytes within a code unit.
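
The split between encoding form and encoding scheme can be sketched in Python for UTF-16. The function names are assumptions for this illustration. The text does not describe how the four-byte case works; the sketch uses the surrogate-pair arithmetic defined for UTF-16 and, for brevity, does not reject invalid input such as code points in the surrogate range.

```python
def utf16_code_units(code_point):
    """Encoding form: map one code point to one or two 16-bit code units."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000    # 20 bits remain
    high = 0xD800 + (offset >> 10)   # upper 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # lower 10 bits -> low surrogate
    return [high, low]

def serialize(units, byteorder):
    """Encoding scheme: write each code unit as two bytes, 'big' or 'little'."""
    return b"".join(u.to_bytes(2, byteorder) for u in units)

units = utf16_code_units(0x1F600)
print([hex(u) for u in units])
print(serialize(units, "big").hex(), serialize(units, "little").hex())
```

The code units are the same in both schemes; only their byte serialization differs.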

UTF-32 always uses four bytes for each code point, whereas UTF-8 uses one or more bytes depending on the code point: the code points 0 through 127 are stored in a single byte, so this representation is space-saving for most English and European texts, because the characters with these code points (ASCII characters) are by far the most common. Other methods include SCSU, BOCU and Punycode. Complex schemes can switch between several variants (ISO/IEC 2022).
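
The different storage requirements can be compared with a short Python sketch. The helper name `encoded_sizes` and the sample characters are assumptions for illustration; the big-endian codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character occupies in three encodings."""
    return {name: len(ch.encode(name))
            for name in ("utf-8", "utf-16-be", "utf-32-be")}

# An ASCII letter, a Latin letter with diaeresis, a CJK character, an emoji.
for ch in ["A", "Ü", "\u5c71", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```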

To indicate the order of the bytes in a code unit unambiguously, a BOM (byte order mark) is often prefixed (0xEF, 0xBB, 0xBF for UTF-8; 0xFF, 0xFE for UTF-16LE; 0xFE, 0xFF for UTF-16BE).
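
A reader of such data might check for these marks roughly as follows. This is a sketch that covers only the three marks listed above; the function name `detect_bom` is an assumption, and UTF-32 marks, which begin with the same bytes as the UTF-16 ones, are deliberately ignored.

```python
import codecs

# The byte order marks listed above, paired with Python codec names.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data):
    """Return (codec name, payload without BOM), or (None, data) if none matches."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

name, payload = detect_bom(b"\xff\xfe\x71\x5c")
print(name, payload.decode(name))
```

If no mark is found, the encoding has to be determined from other information, such as a declared character set.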

A glyph is a graphic representation of a single character.

Example: The Chinese character for mountain, shān, 山, has the code point U+5C71 in Unicode and requires 15 bits for its representation. With UTF-16 as CEF, it is stored as a single code unit. With the big-endian CES the bytes 5C, 71 are in memory, with little-endian 71, 5C. With UTF-8 the three code units E5, B1, B1 are in memory. The glyph is 山.
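
The values in this example can be reproduced with Python's built-in codecs. The helper name `hex_bytes` is an assumption for illustration.

```python
def hex_bytes(data):
    """Format a byte string as a list of two-digit hexadecimal values."""
    return [f"{b:02X}" for b in data]

ch = "\u5c71"  # the character for mountain
code_point = ord(ch)

print(f"U+{code_point:04X}", "needs", code_point.bit_length(), "bits")
print("UTF-16BE:", hex_bytes(ch.encode("utf-16-be")))
print("UTF-16LE:", hex_bytes(ch.encode("utf-16-le")))
print("UTF-8:   ", hex_bytes(ch.encode("utf-8")))
```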

To make things easier for the confused reader, it should be noted that the vast majority of texts are stored in one of the three Unicode encodings UTF-8, UTF-16BE or UTF-16LE, which makes it much easier to work with texts.
