GBK


GBK (short for Chinese 国家标准扩展, Pinyin Guójiā biāozhǔn kuòzhǎn; from "GB standard" and Chinese 汉字内码扩展规范, Pinyin Hànzì nèimǎ kuòzhǎn guīfàn, English "Chinese Internal Code Specification") is a Chinese character set. It extends GB2312 with traditional characters and with characters that were simplified after GB2312 was introduced in 1981.

History

Unicode 1.1, published in 1993, contains 20,902 Chinese characters. The Chinese government subsequently published GB13000.1-93, which is identical to Unicode 1.1. To bridge the gap between this standard and the older GB2312 (1980), GBK was introduced as well; it extends GB2312 with the characters from GB13000.1-93. Because GBK never became an official standard, however, it was not given a regular GB number. In 1995, GBK was extended by a further 95 characters.

In Windows 95, GBK was adopted unchanged as code page 936. This greatly increased the popularity of GBK, and it became the de facto standard. The euro sign was later added to code page 936, which made the code page incompatible with GBK.

In most versions of Windows, however, GBK is misleadingly labeled GB2312. Only with Windows XP did Windows also offer the original GB2312 standard, as code page 20936 with the designation "GB2312-80".

GBK was officially superseded by GB 18030 in 2000.

Structure

GBK is a variable-width encoding with at most 16 bits per character, i.e. a character occupies either one or two bytes. Characters in the range 00 hex to 7F hex are identical to ASCII and consist of a single byte. A first byte in the range 81 hex to FE hex, by contrast, introduces a two-byte character.
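The byte rule above translates directly into a forward scan. The following Python sketch is illustrative only and not a reference decoder: the name `split_gbk` is made up here, and the code assumes well-formed input (it does not check that the second byte is in a valid range).

```python
def split_gbk(data: bytes) -> list[bytes]:
    """Split GBK-encoded bytes into one- and two-byte units (hypothetical helper)."""
    units = []
    i = 0
    while i < len(data):
        # A first byte in 0x81..0xFE announces a two-byte character.
        if 0x81 <= data[i] <= 0xFE:
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte sequence at end of input")
            units.append(data[i:i + 2])
            i += 2
        else:
            # ASCII range; this sketch also passes the stray bytes 0x80
            # and 0xFF through as single units instead of rejecting them.
            units.append(data[i:i + 1])
            i += 1
    return units
```

For example, `[u.decode("gbk") for u in split_gbk("GBK编码".encode("gbk"))]` yields `['G', 'B', 'K', '编', '码']`, using Python's built-in `gbk` codec for the round trip.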

Text encoded in GBK can only be searched forwards, because for an arbitrary byte it is impossible to tell whether it is the first or the second byte of a two-byte sequence. To decide this, the text must be examined from the beginning. GBK shares this drawback with GB2312 and GB18030 and with the other East Asian encodings Shift-JIS (Japanese), Big5 (Traditional Chinese) and EUC-KR (Korean).

In GB2312, an ASCII character (byte value below 128) found by a backward search can still serve as the starting point for a forward scan, since such values never occur inside two-byte characters. In GBK this is only possible for ASCII characters in the range 0 to 63, because byte values from 64 to 127 are also used as the second byte of a two-byte character.
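The resynchronization idea can be sketched as follows: step backwards to the nearest byte below 64 (0x40), or to the start of the text, and scan forwards from there. This is a minimal Python illustration under the assumption of well-formed GBK input; `is_char_start` is a hypothetical helper name, not part of any library.

```python
def is_char_start(data: bytes, pos: int) -> bool:
    """Return True if data[pos] begins a character (hypothetical helper).

    Assumes data is well-formed GBK and 0 <= pos < len(data).
    """
    # Step 1: search backwards for a safe anchor. A byte below 0x40 is never
    # part of a two-byte character, so the position right after it is a
    # character boundary. Without such a byte, fall back to the text start.
    start = pos
    while start > 0 and data[start - 1] >= 0x40:
        start -= 1
    # Step 2: scan forwards from that boundary, one character at a time.
    i = start
    while i < pos:
        i += 2 if 0x81 <= data[i] <= 0xFE else 1
    return i == pos


# 'A', the two-byte code 81 5C, then a real backslash (5C).
sample = b"A" + bytes([0x81, 0x5C]) + b"\\"
print(is_char_start(sample, 2))  # 5C as second byte of 81 5C
print(is_char_start(sample, 3))  # 5C as a stand-alone backslash
```

In the sample, the byte 5C at index 2 is only the second half of a two-byte character, whereas the same byte value at index 3 is a genuine ASCII backslash; a plain byte search cannot tell the two apart.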

The Unicode transformation format UTF-8 avoids this problem. Although it needs up to four bytes per character, every byte can be identified unambiguously as a one-byte character, the start byte of a multi-byte character, or a continuation or end byte of a multi-byte character.
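For comparison, the self-synchronizing property of UTF-8 can be shown with a check on the leading bits of each byte. The sketch below is illustrative; both function names are invented for this example, and the classification looks only at the bit pattern, so it does not reject byte values that never occur in valid UTF-8 (such as C0, C1 or F5 to FF).

```python
def utf8_byte_kind(b: int) -> str:
    """Classify one byte of UTF-8 text by its leading bits (hypothetical helper)."""
    if (b & 0x80) == 0x00:   # 0xxxxxxx: one-byte character
        return "single"
    if (b & 0xC0) == 0x80:   # 10xxxxxx: continuation or end byte
        return "continuation"
    return "start"           # 11xxxxxx: start byte of a multi-byte character


def utf8_char_start(data: bytes, pos: int) -> int:
    """Step back from pos to the first byte of the character that contains it."""
    while pos > 0 and utf8_byte_kind(data[pos]) == "continuation":
        pos -= 1
    return pos
```

Because a character has at most three continuation bytes, `utf8_char_start` needs to step back at most three positions on valid input, regardless of how long the text is.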

The two-byte range is divided into eight areas, namely five levels and three custom ranges:

GBK levels
Level    1st byte  2nd byte         Code points  Characters  Characters  Characters
                                    available    GB 18030    GBK 1.0     GB 2312
GBK/1    A1-A9     A1-FE                    846         728         717         682
GBK/2    B0-F7     A1-FE                  6,768       6,763       6,763       6,763
GBK/3    81-A0     40-FE except 7F        6,080       6,080       6,080           -
GBK/4    AA-FE     40-A0 except 7F        8,160       8,160       8,160           -
GBK/5    A8-A9     40-A0 except 7F          192         166         166           -
custom   AA-AF     A1-FE                    564           -           -           -
custom   F8-FE     A1-FE                    658           -           -           -
custom   A1-A7     40-A0 except 7F          672           -           -           -
Total                                    23,940      21,897      21,886       7,445
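The "available code points" column follows from the byte ranges alone: the number of possible first bytes times the number of possible second bytes, with 7F excluded wherever it falls inside the second-byte range. A short Python check of that arithmetic (the names `REGIONS` and `code_points` are chosen here for illustration):

```python
# (label, first-byte low, first-byte high, second-byte low, second-byte high)
REGIONS = [
    ("GBK/1",  0xA1, 0xA9, 0xA1, 0xFE),
    ("GBK/2",  0xB0, 0xF7, 0xA1, 0xFE),
    ("GBK/3",  0x81, 0xA0, 0x40, 0xFE),
    ("GBK/4",  0xAA, 0xFE, 0x40, 0xA0),
    ("GBK/5",  0xA8, 0xA9, 0x40, 0xA0),
    ("custom", 0xAA, 0xAF, 0xA1, 0xFE),
    ("custom", 0xF8, 0xFE, 0xA1, 0xFE),
    ("custom", 0xA1, 0xA7, 0x40, 0xA0),
]


def code_points(first_lo: int, first_hi: int, second_lo: int, second_hi: int) -> int:
    """Number of two-byte codes in a region; 7F is never a second byte."""
    firsts = first_hi - first_lo + 1
    seconds = sum(1 for b in range(second_lo, second_hi + 1) if b != 0x7F)
    return firsts * seconds


counts = [code_points(*region[1:]) for region in REGIONS]
total = sum(counts)
for (label, *_), n in zip(REGIONS, counts):
    print(f"{label:7} {n:6,}")
print(f"{'total':7} {total:6,}")
```

For GBK/3, for instance, this gives 32 first bytes times 190 second bytes, i.e. 6,080 code points, and the eight regions add up to 23,940.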
Code   …0  …1  …2  …3  …4  …5  …6  …7  …8  …9  …A  …B  …C  …D  …E  …F
0…     NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
1…     DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
2…     SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3…     0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4…-7…  ASCII or second byte of a two-byte sequence (7F: DEL)
8…-F…  First or second byte of a two-byte sequence
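The chart can be condensed into a lookup that reports which roles a byte value may play. The Python sketch below derives the roles from the ranges given in the text and the level table (first byte 81 to FE, second byte 40 to FE except 7F); `gbk_byte_roles` is a hypothetical name, and the treatment of 80 and FF follows from those ranges rather than from the chart itself.

```python
def gbk_byte_roles(b: int) -> set[str]:
    """Return the roles a byte value can take in GBK text (hypothetical helper)."""
    roles = set()
    if b <= 0x7F:
        roles.add("ascii")   # one-byte character, identical to ASCII
    if 0x81 <= b <= 0xFE:
        roles.add("first")   # first byte of a two-byte sequence
    if 0x40 <= b <= 0xFE and b != 0x7F:
        roles.add("second")  # second byte of a two-byte sequence
    return roles
```

Only bytes below 40 hex (and DEL) come back with the single role `ascii`, which is why they are the only unambiguous synchronization points in GBK text.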
