Character set table

Computers represent text, words and characters as numbers, so an association between numbers and characters must be established. This assignment is defined by a character set table, which assigns numerical values to the displayable characters and to control characters. Alternative terms for the character set table are code page and character map.

History

Historical character set tables are often limited to 256 characters, so a single table can usually hold only one additional alphabet besides the Latin alphabet. The use of these early, simple character set tables created problems. In some tables not all characters are adequately documented, or certain entries are used differently by different systems. Furthermore, a text can often use only one character set table, which makes it difficult to include characters from other languages in the same text.

Unicode was introduced to solve these problems. Unlike conventional character set tables, Unicode separates the assignment of numbers (so-called code points) to characters from the encoding of those characters as bytes. The various Unicode encoding schemes can in turn be understood as character set tables.

While a character set table defines the assignment of numbers to characters, fonts store the appearance of the characters. Displaying text on a computer therefore usually requires both a character set table and a font.
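The separation between code points and encodings can be illustrated with a short Python sketch. It is only an illustration: the choice of the character "ä" and of the five codecs is an assumption made for this example, and the codec names are those of Python's standard library.

```python
# One character, one Unicode code point, several possible byte encodings.
char = "\u00e4"                 # the letter "ä"
code_point = ord(char)          # the number Unicode assigns to the character

print(f"U+{code_point:04X}")    # code point in the usual U+XXXX notation

# The same code point serialized under different encodings / code pages.
for codec in ("utf-8", "utf-16-le", "utf-16-be", "cp1252", "cp437"):
    encoded = char.encode(codec)
    print(f"{codec:10} {encoded.hex(' ')}")
```

The code point stays the same in every case; only the byte sequence chosen by each encoding differs.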

Displaying text or file names with the wrong character set table produces incorrect characters. In German texts this typically affects umlauts and the Eszett (ß), although the text remains largely legible. Texts in other writing systems, by contrast, become illegible when displayed with the wrong character set table (mojibake).
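The effect on German text can be reproduced in a few lines of Python. This is a minimal sketch that assumes the text was written as UTF-8 and is then wrongly read as Windows-1252 (code page 1252); the sample word is chosen only for illustration.

```python
original = "Gr\u00f6\u00dfe"    # the German word "Größe"

# Bytes as a UTF-8 aware program would write them.
utf8_bytes = original.encode("utf-8")

# A reader that wrongly assumes code page 1252 turns each umlaut
# (two bytes in UTF-8) into two unrelated characters.
garbled = utf8_bytes.decode("cp1252")

# Every byte involved here is defined in cp1252, so in this particular
# case the damage can be undone by reversing the wrong step.
repaired = garbled.encode("cp1252").decode("utf-8")

if __name__ == "__main__":
    print(garbled)
    print(repaired)
```

The unaccented letters survive because both encodings agree on the ASCII range, which is why such text stays largely legible. The repair step is not generally reliable: it fails as soon as a byte has no assignment in the wrongly assumed table.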

Examples

IBM PC (OEM) character set tables

These character set tables should only be used for compatibility with existing documents and systems. The use of Unicode is recommended for new systems and texts.
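Migrating an existing document from an OEM code page to Unicode amounts to decoding with the old table and re-encoding. The following Python sketch assumes the source bytes are in code page 437; the helper name `oem_to_utf8` and the sample bytes are hypothetical and serve only as an illustration.

```python
def oem_to_utf8(data: bytes, oem_codec: str = "cp437") -> bytes:
    """Re-encode bytes stored under a legacy OEM code page as UTF-8."""
    return data.decode(oem_codec).encode("utf-8")


# "Größe" as it would be stored under code page 437.
legacy = b"Gr\x94\xe1e"
converted = oem_to_utf8(legacy)

if __name__ == "__main__":
    print(converted.hex(" "))
```

The conversion is only correct if the assumed source code page is the one the document was actually written with; the bytes themselves do not record this.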

DBCS / MBCS

These code pages allow the storage of East Asian characters, for which the 256 values available in 8 bits are not sufficient. They use double-byte or multi-byte character sets (DBCS / MBCS), in which a character is stored as either one or two bytes; a full 16-bit value allows up to 65,536 different characters.
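The mix of 8-bit and 16-bit characters in such code pages can be made visible by measuring how many bytes each character occupies. The sketch below uses Python's `cp932` codec as an assumed stand-in for code page 932; the helper `byte_widths` is a hypothetical name introduced for this example.

```python
def byte_widths(text, codec="cp932"):
    """Return (character, byte count) pairs for text encoded with codec."""
    return [(ch, len(ch.encode(codec))) for ch in text]


# A Latin letter followed by two kanji ("A" plus the word for "Japan").
sample = "A\u65e5\u672c"
widths = byte_widths(sample)

if __name__ == "__main__":
    for ch, n in widths:
        print(f"U+{ord(ch):04X} -> {n} byte(s)")
```

The Latin letter occupies a single byte and each kanji two, so a program cannot find character boundaries by counting bytes alone.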

Important character set tables

For efficient processing on computers, character set tables are identified by numbers. This numbering is not standardized, however, so different computers or operating systems may use different numbers for the same table.

Code page number | Meaning | Character encoding
437 | The original character set table of the IBM PC | char (8 bit)
720 | Arabic alphabet | char (8 bit)
737 | Greek alphabet | char (8 bit)
775 | Estonian, Lithuanian and Latvian alphabets | char (8 bit)
819 | "Latin-1", corresponds to ISO 8859-1 | char (8 bit)
850 | "Multilingual (DOS-Latin-1)", Western European languages | char (8 bit)
852 | Slavic languages (Latin-2), Central and Eastern European languages | char (8 bit)
855 | Cyrillic alphabet | char (8 bit)
857 | Turkish alphabet | char (8 bit)
858 | "Multilingual" with euro sign | char (8 bit)
860 | Latin alphabet with special Portuguese characters | char (8 bit)
861 | Icelandic alphabet | char (8 bit)
862 | Hebrew alphabet | char (8 bit)
863 | Latin alphabet with special French characters | char (8 bit)
864 | Arabic alphabet | char (8 bit)
865 | Danish and Norwegian; differs from 437 mainly in having Ø and ø in place of ¥ and ¢ | char (8 bit)
866 | Cyrillic alphabet | char (8 bit)
869 | Greek alphabet | char (8 bit)
874 | Thai alphabet | char (8 bit)
932 | Japanese writing systems (DBCS) | Mixed 8 and 16 bit
936 | GBK for simplified Chinese characters (DBCS) | Mixed 8 and 16 bit
949 | Hangul / Korean characters (DBCS) | Mixed 8 and 16 bit
950 | Traditional Chinese characters (DBCS) | Mixed 8 and 16 bit
1200 | UTF-16 LE, little-endian (Unicode) | Sequences of 16-bit words
1201 | UTF-16 BE, big-endian (Unicode) | Sequences of 16-bit words
1250 | Central and Eastern European languages | char (8 bit)
1251 | Cyrillic alphabet | char (8 bit)
1252 | Western European languages | char (8 bit)
1253 | Greek alphabet | char (8 bit)
1254 | Turkish alphabet | char (8 bit)
1255 | Hebrew alphabet | char (8 bit)
1256 | Arabic alphabet | char (8 bit)
1257 | Baltic languages | char (8 bit)
1258 | Vietnamese | char (8 bit)
10000 | Macintosh Roman | char (8 bit)
10007 | Macintosh Cyrillic | char (8 bit)
10029 | Macintosh, Central European languages | char (8 bit)
20127 | US-ASCII | char (7 bit)
28591 | ISO-8859-1 | char (8 bit)
65000 | UTF-7 (Unicode) | Sequences of 8-bit words
65001 | UTF-8 (Unicode) | Sequences of 8-bit words
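A program that receives a code page number still has to translate it into whatever name its own runtime uses. The Python sketch below shows one possible lookup for a handful of the numbers listed above. The dictionary `CODE_PAGE_TO_CODEC` and the function `decode_with_code_page` are hypothetical names for this example, the right-hand values are codec names from Python's standard library, and the selection of entries is an arbitrary assumption.

```python
# Hypothetical lookup: code page number -> Python codec name.
CODE_PAGE_TO_CODEC = {
    437: "cp437",
    850: "cp850",
    932: "cp932",
    1200: "utf-16-le",
    1201: "utf-16-be",
    1252: "cp1252",
    20127: "ascii",
    28591: "latin_1",
    65001: "utf-8",
}


def decode_with_code_page(data: bytes, code_page: int) -> str:
    """Decode data using the table identified by a code page number."""
    try:
        codec = CODE_PAGE_TO_CODEC[code_page]
    except KeyError:
        raise ValueError(f"unsupported code page: {code_page}") from None
    return data.decode(codec)


if __name__ == "__main__":
    # The same letter is stored as different bytes under different code pages.
    for page, raw in ((437, b"\x84"), (1252, b"\xe4"), (65001, b"\xc3\xa4")):
        print(page, raw.hex(" "), ascii(decode_with_code_page(raw, page)))
```

Keeping the mapping explicit, rather than deriving a codec name from the number, avoids silently assuming that another system uses the same numbering.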
