GB 18030
GB18030 is a Chinese character encoding standard that covers all Unicode characters (currently 128,172), including the 75,963 Han characters encoded as of Unicode 7.0, that is, Chinese characters and their variants used in Japan, Korea and Vietnam. It has been mandatory since September 1, 2001 for all operating systems and programs sold in the People's Republic of China. It succeeds the GBK and GB2312 encodings and covers both traditional and simplified characters. The official designation is GB18030-2000; GB stands for Guojia Biaozhun (國家標準 / 国家标准), meaning "national standard". The standard was published on March 17, 2000, and an update followed on November 21, 2000.
GB18030 can be regarded as the Chinese counterpart of UTF-8: it provides a code for every Unicode code point, including those that are not yet assigned. Like UTF-8, it is backward compatible with ASCII, and its four-byte range offers room for over a million additional code points. Unlike UTF-8, however, GB18030 also remains compatible with GBK and GB2312: part of the mapping table was taken directly from GBK, and the rest was defined algorithmically. GB18030 also includes the characters of the Taiwanese Big5 character set.
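The compatibility claims above can be checked with a short script. This is only a sketch: it relies on the `gb18030`, `gbk` and `gb2312` codecs bundled with CPython, and the helper names `same_bytes` and `round_trips` as well as the sample strings are chosen here for illustration.

```python
# Illustrative check of the compatibility claims, using the "gb18030",
# "gbk" and "gb2312" codecs that ship with CPython.

def same_bytes(text: str, codec_a: str, codec_b: str) -> bool:
    """Return True if both codecs encode `text` to identical bytes."""
    return text.encode(codec_a) == text.encode(codec_b)


def round_trips(text: str, codec: str = "gb18030") -> bool:
    """Return True if encoding and then decoding `text` gives it back."""
    return text.encode(codec).decode(codec) == text


ascii_text = "GB 18030"
han_text = "国家标准"        # simplified characters found in GB2312 and GBK
beyond_bmp = "\U00020000"    # first code point of CJK Extension B

print("ASCII compatible: ", same_bytes(ascii_text, "gb18030", "ascii"))
print("GBK compatible:   ", same_bytes(han_text, "gb18030", "gbk"))
print("GB2312 compatible:", same_bytes(han_text, "gb18030", "gb2312"))
print("Round trip:       ", round_trips(beyond_bmp))
print("Bytes used:       ", len(beyond_bmp.encode("gb18030")))
```

The sample strings only spot-check a few characters; they do not prove compatibility for the whole repertoire.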
Most (Western) computer systems had already standardized on some form of Unicode when GB18030 appeared. After its publication, the simplification of treating Unicode as fixed-width 16-bit units (UCS-2) could no longer be sustained. Operating system vendors and programmers were, so to speak, forced by decree of the People's Republic to use either variable-width formats such as UTF-8 or UTF-16, or wider fixed-width formats such as UCS-4 or UTF-32. Microsoft took this step with Windows 2000; Linux had supported such formats even before GB18030 was introduced.
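To see why a fixed 16-bit unit is not enough, one can count the code units that a character outside the Basic Multilingual Plane needs in each format. The following sketch uses Python's built-in codecs; the function name `code_units` and the two sample characters are assumptions made for this example.

```python
def code_units(ch: str) -> dict:
    """Count the bytes or code units needed for one character per format."""
    return {
        "UTF-8 bytes": len(ch.encode("utf-8")),
        "UTF-16 code units": len(ch.encode("utf-16-be")) // 2,
        "UTF-32 code units": len(ch.encode("utf-32-be")) // 4,
        "GB18030 bytes": len(ch.encode("gb18030")),
    }


# One character inside the Basic Multilingual Plane, one outside it.
for ch in ("中", "\U00020000"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

The second character takes two UTF-16 code units (a surrogate pair), which is exactly the case a UCS-2 implementation cannot represent.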
The GB18030-encoded font SimSun (Founder Extended) provided glyphs, that is, concrete character shapes for screen display and printing, for the entire character repertoire of the then-current Unicode 3.0. It therefore already included the Unicode block "CJK Unified Ideographs Extension A" and, in anticipation, "Extension B" from Unicode 3.1, which was not published until March of the following year, 2001. Other well-known fonts with early support for Extension A are SimSun 18030 and Code2000.
Structure of the characters
One-byte sequences correspond to ASCII and lie in the range 0x00 … 0x7F. Two-byte sequences correspond to GBK, which extends GB2312, and consist of a lead byte in the range 0x81 … 0xFE followed by a byte in the range 0x40 … 0xFE (0x7F is not used as a second byte). Four-byte sequences map the Unicode characters not covered by the shorter forms: the first and third bytes come from the range 0x81 … 0xFE, the second and fourth bytes from 0x30 … 0x39. Unlike in UTF-8, a byte in the range 0x30 … 0x7E cannot be assumed to represent an ASCII character; its meaning depends on its position within a sequence.
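The byte ranges above are enough to split a GB18030 byte stream into individual sequences without a mapping table. The following is a minimal sketch under those ranges; the function names `sequence_length` and `split_sequences` are made up for this example, and excluding 0x7F as a second byte is an assumption taken from the GBK layout.

```python
from typing import List


def sequence_length(data: bytes, pos: int = 0) -> int:
    """Return the length (1, 2 or 4) of the sequence starting at `pos`."""
    first = data[pos]
    if first <= 0x7F:
        return 1  # plain ASCII
    if not 0x81 <= first <= 0xFE:
        raise ValueError(f"invalid lead byte 0x{first:02X} at offset {pos}")
    if pos + 1 >= len(data):
        raise ValueError(f"truncated sequence at offset {pos}")
    second = data[pos + 1]
    if 0x30 <= second <= 0x39:
        # Four-byte form: the third and fourth bytes must fit as well.
        rest = data[pos + 2:pos + 4]
        if len(rest) == 2 and 0x81 <= rest[0] <= 0xFE and 0x30 <= rest[1] <= 0x39:
            return 4
        raise ValueError(f"malformed four-byte sequence at offset {pos}")
    if 0x40 <= second <= 0xFE and second != 0x7F:
        return 2  # two-byte form inherited from GBK
    raise ValueError(f"invalid second byte 0x{second:02X} at offset {pos + 1}")


def split_sequences(data: bytes) -> List[bytes]:
    """Split a GB18030 byte string into its one-, two- and four-byte parts."""
    chunks = []
    pos = 0
    while pos < len(data):
        length = sequence_length(data, pos)
        chunks.append(data[pos:pos + length])
        pos += length
    return chunks


encoded = "A中\u0080\U00020000".encode("gb18030")
print([chunk.hex() for chunk in split_sequences(encoded)])
```

Note that the second byte decides between the two-byte and the four-byte form, which is why the digits 0x30 … 0x39 cannot be read as ASCII without knowing what precedes them.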
Byte value (hex) | Role |
---|---|
00 … 1F | ASCII control characters (NUL … US) |
20 … 2F | ASCII space and punctuation (SP ! " # $ % & ' ( ) * + , - . /) |
30 … 39 | ASCII digit, or second or fourth byte of a four-byte sequence |
3A … 3F | ASCII punctuation (: ; < = > ?) |
40 … 7E | ASCII character, or second byte of a two-byte sequence |
7F | ASCII DEL (not used as a second byte) |
80 | Second byte of a two-byte sequence |
81 … FE | First byte of a two-byte or four-byte sequence, second byte of a two-byte sequence, or third byte of a four-byte sequence |
FF | Not used |
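The four-byte ranges described in this section allow 126 × 10 × 126 × 10 = 1,587,600 combinations, which is where the room for over a million code points comes from. For the supplementary planes (U+10000 to U+10FFFF) the assignment is a plain linear count. The sketch below assumes that this count starts at the sequence 90 30 81 30 for U+10000; that starting point is not stated in the text above and is compared here against Python's built-in `gb18030` codec. The function names are chosen for this example only.

```python
# Number of distinct four-byte sequences: 126 lead bytes, 10 second bytes,
# 126 third bytes and 10 fourth bytes.
FOUR_BYTE_CAPACITY = 126 * 10 * 126 * 10


def encode_supplementary(code_point: int) -> bytes:
    """Map a code point in U+10000..U+10FFFF to four bytes by linear counting."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points are handled")
    index = code_point - 0x10000
    index, b4 = divmod(index, 10)    # fourth byte counts fastest
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


def decode_supplementary(seq: bytes) -> int:
    """Inverse of encode_supplementary for a four-byte sequence."""
    b1, b2, b3, b4 = seq
    index = (((b1 - 0x90) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)
    return 0x10000 + index


print(FOUR_BYTE_CAPACITY)
for cp in (0x10000, 0x20000, 0x10FFFF):
    ours = encode_supplementary(cp)
    reference = chr(cp).encode("gb18030")
    print(f"U+{cp:06X}", ours.hex(), ours == reference)
```

Characters in the Basic Multilingual Plane that have no two-byte code also use four-byte sequences, but their assignment depends on the GBK table and is not covered by this sketch.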