GB 18030

from Wikipedia, the free encyclopedia

The Chinese character encoding standard GB18030 is an encoding of all Unicode characters (currently 128,172), including the 75,963 Han characters encoded as of Unicode 7.0, that is, Chinese characters and their variants used in Japan, Korea and Vietnam. Since September 1, 2001, it has been mandatory for all operating systems and programs sold in the People's Republic of China; it is the successor to the encodings GBK and GB2312 and covers both traditional and simplified characters. The official name is GB18030-2000, in which GB stands for Guojia Biaozhun (國家標準 / 国家标准), meaning "national standard". The standard was published on March 17, 2000, and an update appeared on November 21, 2000.

GB18030 can be seen as the Chinese equivalent of UTF-8 because it covers the code points of the entire Unicode range, including code points that have not yet been assigned. Like UTF-8, it is an encoding that is backward compatible with ASCII and can represent over a million additional code points (in its 4-byte range). Unlike UTF-8, however, GB18030 also maintains compatibility with GBK and GB2312: part of the mapping table was taken directly from GBK, and the rest was determined algorithmically. In addition, GB18030 includes the characters from the Taiwanese Big5.
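The compatibility claims above can be checked with a small sketch. It assumes Python's built-in `gb18030`, `gbk` and `gb2312` codecs; the helper names `same_bytes` and `round_trips` are made up for this illustration. The capacity figure uses the 4-byte ranges given under "Structure of the characters" (first and third byte 81 hex … FE hex, second and fourth byte 30 hex … 39 hex).

```python
# Count the distinct 4-byte sequences allowed by the byte ranges:
# bytes 1 and 3 in 0x81..0xFE, bytes 2 and 4 in 0x30..0x39.
LEAD_VALUES = 0xFE - 0x81 + 1    # 126 possible values
DIGIT_VALUES = 0x39 - 0x30 + 1   # 10 possible values
FOUR_BYTE_CAPACITY = LEAD_VALUES * DIGIT_VALUES * LEAD_VALUES * DIGIT_VALUES


def same_bytes(text: str, legacy_codec: str) -> bool:
    """Return True if text encodes to identical bytes in legacy_codec and GB18030."""
    return text.encode(legacy_codec) == text.encode("gb18030")


def round_trips(code_point: int) -> bool:
    """Return True if the code point survives encoding to GB18030 and back."""
    ch = chr(code_point)
    return ch.encode("gb18030").decode("gb18030") == ch


if __name__ == "__main__":
    print(FOUR_BYTE_CAPACITY)            # 1587600
    print(same_bytes("Hello", "ascii"))  # ASCII text is unchanged
    print(same_bytes("中文", "gb2312"))   # GB2312 text is unchanged
    print(same_bytes("中文", "gbk"))      # GBK text is unchanged
    print(round_trips(0x10FFFF))         # highest Unicode code point
```

The 1,587,600 possible 4-byte sequences exceed the 1,114,112 code points of the Unicode code space, which is consistent with the "over a million" figure in the text.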

Most (Western) computer systems had already standardized on a variant of Unicode when GB18030 appeared. The technical simplification of treating Unicode as fixed-width 16-bit units (UCS-2) could no longer be maintained after its publication. Operating system vendors and programmers were, so to speak, forced by decree of the People's Republic to use either variable-width formats such as UTF-8 or UTF-16, or larger fixed-width formats such as UCS-4 or UTF-32. Microsoft took this step with Windows 2000; Linux had already supported such formats before GB18030 was introduced.

The GB18030-encoded computer font SimSun (Founder Extended) provided glyphs, i.e. concrete character shapes, for screen display and printing for the entire character repertoire of the then-current Unicode 3.0. It therefore already included the Unicode block "CJK Unified Ideographs Extension A" and, in anticipation, also "Extension B" from Unicode 3.1, which was not published until March of the following year, 2001. Other well-known fonts with early support for "Extension A" are SimSun 18030 and Code2000.

Structure of the characters

Sequences of one byte correspond to ASCII and range from 00 hex to 7F hex. Sequences of 2 bytes correspond to GBK (which in turn contains GB2312) and consist of a lead byte from the range 81 hex … FE hex, followed by a byte from the range 40 hex … FE hex, excluding 7F hex. Sequences of 4 bytes map the Unicode characters not covered by the shorter forms. Their first and third bytes are from the range 81 hex … FE hex, their second and fourth bytes from 30 hex … 39 hex. In contrast to UTF-8, one cannot assume that an octet in the range 30 hex … 7E hex represents an ASCII character; such a byte value can have different meanings depending on its position in a sequence.

code        0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0…          NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1…          DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2…          SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3…          30 … 39: ASCII, or second or fourth byte of a 4-byte sequence; 3A … 3F: ASCII  :  ;  <  =  >  ?
4… to 7…    40 … 7E: ASCII, or second byte of a 2-byte sequence; 7F: DEL
8… to F…    First or third byte of a 4-byte sequence, or first or second byte of a 2-byte sequence
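The byte ranges described in this section are enough to split a GB18030 byte stream into individual sequences without a mapping table. The following is a minimal sketch of that idea, not a full decoder: the function name `split_gb18030` is hypothetical, and the check that 7F hex is not a valid second byte is an assumption based on the standard rather than on the ranges listed in the original text.

```python
from typing import List


def split_gb18030(data: bytes) -> List[bytes]:
    """Split GB18030-encoded bytes into 1-, 2- and 4-byte sequences."""
    sequences = []
    i = 0
    n = len(data)
    while i < n:
        first = data[i]
        if first <= 0x7F:
            length = 1                      # single byte: ASCII
        elif 0x81 <= first <= 0xFE:
            if i + 1 >= n:
                raise ValueError("truncated sequence at offset %d" % i)
            second = data[i + 1]
            if 0x30 <= second <= 0x39:
                length = 4                  # second byte is a "digit": 4-byte form
            elif 0x40 <= second <= 0xFE and second != 0x7F:
                length = 2                  # 2-byte form (assumes 7F is excluded)
            else:
                raise ValueError("invalid second byte at offset %d" % (i + 1))
        else:
            raise ValueError("invalid lead byte at offset %d" % i)

        if i + length > n:
            raise ValueError("truncated sequence at offset %d" % i)
        if length == 4:
            third, fourth = data[i + 2], data[i + 3]
            if not (0x81 <= third <= 0xFE and 0x30 <= fourth <= 0x39):
                raise ValueError("invalid 4-byte sequence at offset %d" % i)

        sequences.append(data[i:i + length])
        i += length
    return sequences


if __name__ == "__main__":
    encoded = "A中\U00020000z".encode("gb18030")
    print([len(seq) for seq in split_gb18030(encoded)])  # [1, 2, 4, 1]
    # The byte 0x40 is '@' in ASCII, but here it is the second byte of a
    # 2-byte sequence, so it must not be interpreted on its own.
    print(split_gb18030(b"\x81\x40A"))
```

The second call illustrates the position dependence noted in the text: a scanner that searched the raw bytes for `@` would find a false match inside the 2-byte sequence.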

Web links