Character set table
Computers represent text, words and characters as numbers, so an association between numbers and characters has to be established. This assignment is defined by a character set table, which assigns numerical values to the displayable characters and to the control characters. Alternative terms for the character set table are code page and character map.
History
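As a minimal sketch of this assignment, the following Python snippet decodes the same numerical value with three character set tables. The codec names `cp437`, `cp1252` and `cp866` are the names Python's standard library uses for these tables; the byte value 0xE4 is chosen only for illustration.

```python
# One byte value, three character set tables, three different characters.
raw = bytes([0xE4])

# Map each table name to the character it assigns to the value 0xE4.
decoded = {codec: raw.decode(codec) for codec in ("cp437", "cp1252", "cp866")}

for codec, char in decoded.items():
    print(f"{codec}: 0x{raw[0]:02X} -> {char!r} (U+{ord(char):04X})")
```

The number stays the same; only the table used to interpret it changes.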
Historical character set tables are often limited to 256 characters, so a single table can usually hold only one additional alphabet besides the Latin alphabet. These early, simple character set tables caused problems. In some tables not all characters are adequately documented, or certain entries are used differently from one system to another. Furthermore, a text can often use only one character set table, which makes it difficult to include characters from other languages in the same text.
Unicode was introduced to solve these problems. In contrast to conventional character set tables, Unicode separates the assignment of numbers (so-called code points) to characters from the encoding of those characters. The various Unicode encoding schemes can in turn be understood as character set tables.
While a character set table defines the assignment of numbers to characters, fonts store the appearance of the characters. To display text on a computer, both a character set table and a font are usually necessary.
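The separation between code points and their encoding can be illustrated with a short Python sketch. The euro sign is an arbitrary example character, and the three encodings shown are only a selection of the Unicode encoding schemes.

```python
char = "€"

# The code point is the number Unicode assigns to the character.
code_point = ord(char)
print(f"code point: U+{code_point:04X}")

# The encoding decides which bytes are actually stored for that code point.
encodings = {
    name: char.encode(name) for name in ("utf-8", "utf-16-le", "utf-16-be")
}
for name, data in encodings.items():
    print(f"{name}: {data.hex(' ')}")
```

The code point is identical in all three cases, while the stored byte sequences differ in both length and byte order.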
Displaying texts or file names with the wrong character set table produces incorrect characters. In German texts, the umlauts and the Eszett are often affected, although the text remains largely legible. Texts in other writing systems, on the other hand, become illegible when displayed with the wrong character set table (Mojibake).
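The effect on a German word can be reproduced in a few lines of Python. This is a sketch under the assumption that the text was stored as UTF-8 and is then read with code page 1252; the word "Käse" is just an example.

```python
original = "Käse"

# The text is stored as UTF-8 ...
stored = original.encode("utf-8")

# ... but read back with the wrong character set table (code page 1252).
garbled = stored.decode("cp1252")
print(garbled)  # the umlaut turns into two unrelated characters

# In this case no information was lost, so the mistake can be reversed.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)
```

The reversal works here because every byte involved has an entry in code page 1252; it is not guaranteed for arbitrary input.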
Examples
IBM PC (OEM) character set tables
These character set tables should only be used for compatibility with existing documents and systems. The use of Unicode is recommended for new systems and texts.
DBCS / MBCS
These code pages allow the storage of East Asian characters, for which the 256 values available in 8 bits are not sufficient. They use double-byte or multi-byte encodings (DBCS / MBCS), in which a character can occupy 16 bits; 16 bits allow up to 65,536 different values.
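A small Python sketch shows the mixed byte lengths in such a code page. It assumes Python's `cp932` codec (the Japanese code page 932 listed in the table of this article) and uses one Latin letter and one hiragana character as sample input.

```python
text = "Aあ"

# Encode each character separately to see how many bytes it occupies.
for ch in text:
    encoded = ch.encode("cp932")
    print(f"{ch!r}: {len(encoded)} byte(s), hex {encoded.hex()}")

lengths = [len(ch.encode("cp932")) for ch in text]
```

The Latin letter keeps its single-byte value, while the Japanese character is stored in two bytes.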
Important character set tables
For efficient processing on computers, character set tables are identified by numbers. This numbering is not standardized, however, so different computers or operating systems may use different numbers for the same table.
Code page number | Meaning | Character encoding |
---|---|---|
437 | The original character set table of the IBM PC | char (8 bit) |
720 | Arabic alphabet | char (8 bit) |
737 | Greek alphabet | char (8 bit) |
775 | Estonian alphabet, Lithuanian alphabet, and Latvian alphabet | char (8 bit) |
819 | "Latin-1", corresponds to ISO 8859-1 | char (8 bit) |
850 | "Multilingual (DOS-Latin-1)", Western European languages | char (8 bit) |
852 | Slavic languages (Latin-2), Central European and Eastern European languages | char (8 bit) |
855 | Cyrillic alphabet | char (8 bit) |
857 | Turkish alphabet | char (8 bit) |
858 | "Multilingual" with euro sign | char (8 bit) |
860 | Latin alphabet with special Portuguese characters | char (8 bit) |
861 | Icelandic alphabet | char (8 bit) |
862 | Hebrew alphabet | char (8 bit) |
863 | Latin alphabet with special French characters | char (8 bit) |
864 | Arabic alphabet | char (8 bit) |
865 | Danish and Norwegian; differs from 437 chiefly in having Ø and ø in place of ¥ and ¢ | char (8 bit) |
866 | Cyrillic alphabet | char (8 bit) |
869 | Greek alphabet | char (8 bit) |
874 | Thai alphabet | char (8 bit) |
932 | Japanese writing systems (DBCS) | Mixed 8 and 16 bit |
936 | GBK for Simplified Chinese characters (DBCS) | Mixed 8 and 16 bit |
949 | Hangul / Korean characters (DBCS) | Mixed 8 and 16 bit |
950 | Traditional Chinese characters (DBCS) | Mixed 8 and 16 bit |
1200 | UTF-16LE, little-endian (Unicode) | Sequences of 16-bit words |
1201 | UTF-16BE, big-endian (Unicode) | Sequences of 16-bit words |
1250 | Central and Eastern European languages | char (8 bit) |
1251 | Cyrillic alphabet | char (8 bit) |
1252 | Western European languages | char (8 bit) |
1253 | Greek alphabet | char (8 bit) |
1254 | Turkish alphabet | char (8 bit) |
1255 | Hebrew alphabet | char (8 bit) |
1256 | Arabic alphabet | char (8 bit) |
1257 | Baltic languages | char (8 bit) |
1258 | Vietnamese | char (8 bit) |
10000 | Macintosh Roman | char (8 bit) |
10007 | Macintosh Cyrillic | char (8 bit) |
10029 | Macintosh, Central European Languages | char (8 bit) |
20127 | US-ASCII | char (7 bit) |
28591 | ISO-8859-1 | char (8 bit) |
65000 | UTF-7 (Unicode) | Sequences of 8-bit words |
65001 | UTF-8 (Unicode) | Sequences of 8-bit words |
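
The numbers in the table can be mapped to the codec names of a programming language. The following Python sketch is one possible way to do this; `codec_for_code_page` is a hypothetical helper written for this illustration, not a standard library function, and it assumes that most numbers correspond to a Python codec named `cp<number>`, with a small hand-written table for the exceptions.

```python
import codecs

# Code page numbers whose Python codec name does not follow the "cp<number>" pattern.
# This mapping is an assumption made for the example and covers only table entries.
SPECIAL_NAMES = {
    1200: "utf-16-le",
    1201: "utf-16-be",
    20127: "ascii",
    28591: "iso-8859-1",
    65000: "utf-7",
    65001: "utf-8",
}


def codec_for_code_page(number: int) -> str:
    """Return a Python codec name for a code page number (hypothetical helper)."""
    name = SPECIAL_NAMES.get(number, f"cp{number}")
    # codecs.lookup raises LookupError if Python ships no codec of that name.
    return codecs.lookup(name).name


# The same byte decoded with code pages 437 and 865.
raw = bytes([0x9B])
for page in (437, 865):
    print(page, repr(raw.decode(codec_for_code_page(page))))
```

Not every code page in the table necessarily has a matching Python codec; in that case the helper raises `LookupError` rather than guessing.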