GBK
Windows code page | Script or language |
---|---|
874 | Thai |
932 | Japanese |
936 | Simplified Chinese |
949 | Korean |
950 | Traditional Chinese |
1250 | Central European |
1251 | Cyrillic |
1252 | Western European |
1253 | Greek |
1254 | Turkish |
1255 | Hebrew |
1256 | Arabic |
1257 | Baltic |
1258 | Vietnamese |
GBK (short for Chinese 国家标准扩展, pinyin Guójiā biāozhǔn kuòzhǎn; from "GB standard" and Chinese 汉字内码扩展规范, pinyin Hànzì nèimǎ kuòzhǎn guīfàn, English "Chinese Internal Code Specification") is a Chinese character set. It extends GB2312 to include traditional characters and characters that were simplified after the introduction of GB2312 in 1981.
History
In 1993, Unicode 1.1 was published, containing 20,902 Chinese characters. The Chinese government then published GB13000.1-93, which is identical to Unicode 1.1. To bridge the gap between this standard and the older GB2312 (1980), GBK was introduced as well, extending GB2312 with the characters from GB13000.1-93. Because GBK never became an official standard, however, it was not given a regular GB number. In 1995, GBK was expanded by 95 additional characters.
In Windows 95, GBK was adopted unchanged as code page 936. This greatly increased the popularity of GBK, which became the de facto standard. Later the euro sign was added to code page 936, which made the code page incompatible with GBK.
In most versions of Windows, however, GBK is misleadingly referred to as GB2312. Only with Windows XP was the original GB2312 standard also made available, as code page 20936 with the designation "GB2312-80".
GBK was officially superseded by GB 18030 in 2000.
Structure
GBK is a variable-width encoding with at most 16 bits per character; that is, a character occupies either one or two bytes. Bytes in the range 00-7F hex are identical to ASCII and each form a one-byte character. A byte in the range 81-FE hex, by contrast, is the first byte of a two-byte character.
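The one-or-two-byte rule can be illustrated with a small forward scanner. This is a minimal sketch, not a full decoder: the function name `split_gbk` is made up for illustration, the check on the second byte (40-FE except 7F) is taken from the table of two-byte ranges in this article, and the sample text relies on Python's built-in `gbk` codec.

```python
def split_gbk(data: bytes) -> list:
    """Split GBK-encoded bytes into one- and two-byte character units."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            # 00-7F: a single-byte ASCII character
            units.append(data[i:i + 1])
            i += 1
        elif 0x81 <= b <= 0xFE:
            # 81-FE: first byte of a two-byte character
            if i + 1 >= len(data):
                raise ValueError(f"truncated two-byte character at offset {i}")
            second = data[i + 1]
            if not (0x40 <= second <= 0xFE) or second == 0x7F:
                raise ValueError(f"invalid second byte 0x{second:02X} at offset {i + 1}")
            units.append(data[i:i + 2])
            i += 2
        else:
            # 80 and FF do not start a character in GBK
            raise ValueError(f"invalid byte 0x{b:02X} at offset {i}")
    return units


sample = "GBK汉字".encode("gbk")
for unit in split_gbk(sample):
    print(unit.hex(), len(unit), unit.decode("gbk"))
```

The three ASCII letters come out as one-byte units and the two Chinese characters as two-byte units, so the scanner always knows how far to advance from the first byte alone.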
Text encoded in GBK can only be scanned reliably in the forward direction: for an arbitrary byte it is not possible to tell whether it is the first or the second byte of a two-byte character. To decide, the text must be examined from the beginning. GBK shares this drawback with GB2312 and GB18030 and with the other East Asian encodings Shift-JIS (Japanese), Big5 (Traditional Chinese) and EUC-KR (Korean).
With GB2312, an ASCII character (byte value below 128) found by searching backwards can serve as a starting point for forward parsing, since these values never occur inside two-byte characters. With GBK this option is limited to ASCII characters in the range 0 to 63, since byte values from 64 to 127 are also used as the second byte of a two-byte character.
The Unicode transformation format UTF-8 avoids this problem. Although it requires up to four bytes per character, each byte can be identified unambiguously as a one-byte character, the start byte of a multi-byte character, or a continuation byte of a multi-byte character.
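The resynchronisation rule can be sketched as follows: walk backwards to the nearest byte in the range 00-3F (or to the start of the text), then parse forwards from there. The helper names `is_lead` and `char_start` are hypothetical, and the sketch assumes the input is valid GBK. The sample bytes `81 41` form one two-byte character whose second byte equals ASCII "A".

```python
def is_lead(b: int) -> bool:
    """True if b is the first byte of a two-byte GBK character."""
    return 0x81 <= b <= 0xFE


def char_start(data: bytes, pos: int) -> int:
    """Return the offset of the first byte of the character covering data[pos].

    Assumes data is valid GBK.
    """
    # Walk backwards to a safe starting point: offset 0, or a byte in
    # 00-3F, which never occurs inside a two-byte character.
    sync = pos
    while sync > 0 and data[sync] >= 0x40:
        sync -= 1
    # Parse forwards from the safe point, one character at a time.
    i = sync
    while True:
        width = 2 if is_lead(data[i]) else 1
        if i + width > pos:
            return i
        i += width


# "A", then twice the two-byte character 81 41.
data = b"A" + b"\x81\x41" + b"\x81\x41"

# A naive byte search sees three "A" bytes, but only one is a real "A".
print(data.count(b"A"), data.decode("gbk").count("A"))

# The byte at offset 2 is 0x41, yet it belongs to the character at offset 1.
print(char_start(data, 2))
```

In the worst case the backward walk reaches the start of the text, which is the cost the paragraph above describes.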
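For comparison, a UTF-8 byte can be classified from its value alone. The sketch below uses the byte ranges of well-formed UTF-8; the function names `utf8_byte_kind` and `utf8_char_start` are illustrative, not part of any library.

```python
def utf8_byte_kind(b: int) -> str:
    """Classify a single byte of well-formed UTF-8 by its value."""
    if b <= 0x7F:
        return "single"        # 0xxxxxxx: one-byte character
    if b <= 0xBF:
        return "continuation"  # 10xxxxxx: second to fourth byte
    if 0xC2 <= b <= 0xF4:
        return "lead"          # start byte of a 2-, 3- or 4-byte character
    return "invalid"           # C0, C1 and F5-FF never occur in valid UTF-8


def utf8_char_start(data: bytes, pos: int) -> int:
    """Step back over continuation bytes to the start of the character."""
    while pos > 0 and utf8_byte_kind(data[pos]) == "continuation":
        pos -= 1
    return pos


data = "a汉b".encode("utf-8")
print([utf8_byte_kind(b) for b in data])
print(utf8_char_start(data, 3))
```

Because a character has at most three continuation bytes, the backward step is bounded by a small constant, independent of the surrounding text.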
The two-byte range is divided into eight regions, five numbered levels and three user-defined areas:
Level | 1st byte | 2nd byte | Available code points | Characters in GB 18030 | Characters in GBK 1.0 | Characters in GB 2312 |
---|---|---|---|---|---|---|
GBK/1 | A1-A9 | A1-FE | 846 | 728 | 717 | 682 |
GBK/2 | B0-F7 | A1-FE | 6,768 | 6,763 | 6,763 | 6,763 |
GBK/3 | 81-A0 | 40-FE except 7F | 6,080 | 6,080 | 6,080 | |
GBK/4 | AA-FE | 40-A0 except 7F | 8,160 | 8,160 | 8,160 | |
GBK/5 | A8-A9 | 40-A0 except 7F | 192 | 166 | 166 | |
User-defined | AA-AF | A1-FE | 564 | | | |
User-defined | F8-FE | A1-FE | 658 | | | |
User-defined | A1-A7 | 40-A0 except 7F | 672 | | | |
Total | | | 23,940 | 21,897 | 21,886 | 7,445 |
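The "available code points" figures follow directly from the byte ranges: the number of possible first bytes multiplied by the number of possible second bytes, leaving out 7F. The following sketch recomputes them and assigns a two-byte code to its region. The names `REGIONS`, `span`, `capacity` and `region_of`, and the numbering of the user-defined areas, are illustrative assumptions; only the byte ranges are taken from the table.

```python
# Byte ranges copied from the table. The numbered "user-defined" labels
# are illustrative names, not taken from any standard.
REGIONS = [
    ("GBK/1",          (0xA1, 0xA9), (0xA1, 0xFE)),
    ("GBK/2",          (0xB0, 0xF7), (0xA1, 0xFE)),
    ("GBK/3",          (0x81, 0xA0), (0x40, 0xFE)),
    ("GBK/4",          (0xAA, 0xFE), (0x40, 0xA0)),
    ("GBK/5",          (0xA8, 0xA9), (0x40, 0xA0)),
    ("user-defined 1", (0xAA, 0xAF), (0xA1, 0xFE)),
    ("user-defined 2", (0xF8, 0xFE), (0xA1, 0xFE)),
    ("user-defined 3", (0xA1, 0xA7), (0x40, 0xA0)),
]


def span(lo: int, hi: int) -> int:
    """Number of byte values in lo..hi, leaving out 7F if it falls inside."""
    return (hi - lo + 1) - (1 if lo <= 0x7F <= hi else 0)


def capacity(name: str) -> int:
    """Available code points of a region: first bytes times second bytes."""
    for region, (f_lo, f_hi), (s_lo, s_hi) in REGIONS:
        if region == name:
            return span(f_lo, f_hi) * span(s_lo, s_hi)
    raise KeyError(name)


def region_of(first: int, second: int):
    """Return the region name for a two-byte code, or None if out of range."""
    if second == 0x7F:
        return None
    for region, (f_lo, f_hi), (s_lo, s_hi) in REGIONS:
        if f_lo <= first <= f_hi and s_lo <= second <= s_hi:
            return region
    return None


for region, _, _ in REGIONS:
    print(region, capacity(region))
print("total", sum(capacity(region) for region, _, _ in REGIONS))

first, second = "汉".encode("gbk")
print(region_of(first, second))
```

The eight regions do not overlap and together cover every combination of a first byte 81-FE with a second byte 40-FE other than 7F, which is why the total equals 126 × 190 = 23,940.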
The following chart shows the first byte. Values 00-7F are single-byte characters identical to ASCII; the cells in rows 8… to F… are empty because bytes 81-FE have no single-byte meaning and instead begin a two-byte character.
Code | …0 | …1 | …2 | …3 | …4 | …5 | …6 | …7 | …8 | …9 | …A | …B | …C | …D | …E | …F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0… | NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL | BS | HT | LF | VT | FF | CR | SO | SI |
1… | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | US |
2… | SP | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
3… | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
4… | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
5… | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
6… | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
7… | p | q | r | s | t | u | v | w | x | y | z | { | \| | } | ~ | DEL |
8… | | | | | | | | | | | | | | | | |
9… | | | | | | | | | | | | | | | | |
A… | | | | | | | | | | | | | | | | |
B… | | | | | | | | | | | | | | | | |
C… | | | | | | | | | | | | | | | | |
D… | | | | | | | | | | | | | | | | |
E… | | | | | | | | | | | | | | | | |
F… | | | | | | | | | | | | | | | | |
Web links
- Windows code page 936
- Development of Chinese character encodings (dead link)
- Chinese character encoding