GBK

Windows code pages
0874	Thai
0932	Japanese
0936	Simplified Chinese
0949	Korean
0950	Traditional Chinese
1250	Central European
1251	Cyrillic
1252	Western European
1253	Greek
1254	Turkish
1255	Hebrew
1256	Arabic
1257	Baltic
1258	Vietnamese

GBK (short for Chinese 国家标准扩展, Pinyin Guójiā biāozhǔn kuòzhǎn; from "GB standard" and Chinese 汉字内码扩展规范, Pinyin Hànzì nèimǎ kuòzhǎn guīfàn, English Chinese Internal Code Specification) is a Chinese character set. It extends GB2312 with traditional characters and with characters that were simplified after the introduction of GB2312 in 1981.
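The relationship between the two character sets can be illustrated with the `gbk` and `gb2312` codecs that ship with Python. This is a small sketch, not a conformance test; the sample characters 国 (simplified) and 國 (traditional) are chosen here for illustration only.

```python
# A simplified character that is already part of GB2312
simplified = "国"
# Its traditional form, which GB2312 does not contain
traditional = "國"

# GBK is a superset of GB2312, so shared characters keep their byte values.
same_bytes = simplified.encode("gb2312") == simplified.encode("gbk")

# The traditional character can be encoded in GBK as a two-byte sequence ...
gbk_bytes = traditional.encode("gbk")

# ... but the stricter gb2312 codec rejects it.
try:
    traditional.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

print(same_bytes, len(gbk_bytes), in_gb2312)
```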

History

In 1993, Unicode 1.1 was published, containing 20,902 Chinese characters. The Chinese government then published GB13000.1-93, which is identical to Unicode 1.1. To bridge the gap between this standard and the older GB2312 (1980), GBK was introduced as an extension of GB2312 that adds the characters from GB13000.1-93. Because GBK never became an official standard, however, it was not given a regular GB number. In 1995, GBK was expanded by a further 95 characters.

In Windows 95, GBK was adopted unchanged as code page 936. This increased the popularity of GBK enormously, and GBK became the de facto standard. Later the euro symbol was added to code page 936, which made the code page incompatible with GBK.

In most versions of Windows, however, GBK is misleadingly referred to as GB2312. Only with Windows XP was the original GB2312 standard also offered under Windows, as code page 20936 with the designation "GB2312-80".

GBK has been officially replaced by GB 18030 since 2000.

Structure

GBK is a variable-width encoding of up to 16 bits, i.e. a character occupies either one or two bytes. The characters in the range 00-7F (hexadecimal) are identical to ASCII and consist of a single byte. Characters whose first byte lies in the range 81-FE (hexadecimal), however, consist of two bytes.
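The one-byte/two-byte rule can be sketched as a simple forward scan. The function name `split_gbk` is hypothetical, and the sketch only looks at the first byte of each character; it does not check that the second byte falls into a valid range.

```python
def split_gbk(data: bytes) -> list:
    """Split GBK-encoded bytes into one- and two-byte units by scanning forwards."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            # 00-7F: single-byte ASCII character
            units.append(data[i:i + 1])
            i += 1
        elif 0x81 <= b <= 0xFE:
            # 81-FE: first byte of a two-byte character
            if i + 1 >= len(data):
                raise ValueError(f"truncated two-byte sequence at offset {i}")
            units.append(data[i:i + 2])
            i += 2
        else:
            # 80 and FF are not used as first bytes in GBK
            raise ValueError(f"invalid first byte {b:#04x} at offset {i}")
    return units


sample = "GBK编码".encode("gbk")
units = split_gbk(sample)
print([u.hex() for u in units])
```

For the sample string, the three Latin letters come out as one-byte units and the two Chinese characters as two-byte units.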

A text encoded in GBK can only be searched forwards, since for an arbitrary byte it is not possible to tell whether it is the first or the second byte of a two-byte sequence. To decide this, the text must be examined from the beginning. GBK shares this disadvantage with GB2312 and GB18030 and with the other Asian encodings Shift JIS (Japanese), Big5 (traditional Chinese) and EUC-KR (Korean).

With GB2312, an ASCII character (byte value less than 128) found by a backward search can also serve as the starting point for a forward analysis, since these values do not occur in two-byte characters. With GBK this option is reduced to ASCII characters in the range 0 to 63, since byte values in the range 64 to 127 are also used as the second byte of a two-byte character.
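The backward search for a safe starting point followed by a forward analysis can be sketched as follows. The function name `char_start` is hypothetical, and the sketch assumes well-formed GBK input; it treats the beginning of the text as a safe starting point as well.

```python
def char_start(data: bytes, pos: int) -> int:
    """Return the offset of the first byte of the GBK character that covers data[pos]."""
    # Search backwards for a byte below 0x40 (0 to 63): such a byte is never
    # part of a two-byte sequence, so it marks a character boundary.
    anchor = pos
    while anchor > 0 and data[anchor] >= 0x40:
        anchor -= 1

    # Analyse forwards from the anchor until the character covering pos is found.
    i = anchor
    while i < pos:
        step = 2 if 0x81 <= data[i] <= 0xFE else 1
        if i + step > pos:
            return i  # pos is the second byte of the character starting at i
        i += step
    return i


# "1", two two-byte characters ending in 0x40 and 0x41, then a real ASCII "A"
data = b"1" + bytes([0x81, 0x40]) + bytes([0x81, 0x41]) + b"A"
starts = [char_start(data, p) for p in range(len(data))]
print(starts)  # [0, 1, 1, 3, 3, 5]
```

The bytes at offsets 2 and 4 have the same values as the ASCII characters "@" and "A", yet the scan correctly assigns them to the two-byte characters starting at offsets 1 and 3.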

The Unicode transformation format UTF-8 avoids this problem. Although up to four bytes per character are required there, each byte can be identified unambiguously as a one-byte character, the start byte of a multi-byte character, or a continuation or end byte of a multi-byte character.
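For comparison, the following sketch classifies UTF-8 bytes by value alone, without looking at any neighbouring byte. The function name `utf8_byte_kind` and the label strings are hypothetical.

```python
def utf8_byte_kind(b: int) -> str:
    """Classify a single UTF-8 byte without any context."""
    if b <= 0x7F:
        return "single"        # one-byte (ASCII) character
    if b <= 0xBF:
        return "continuation"  # 80-BF: continuation or end byte
    if 0xC2 <= b <= 0xF4:
        return "start"         # start byte of a 2-, 3- or 4-byte character
    return "invalid"           # C0, C1 and F5-FF never occur in UTF-8


encoded = "a汉".encode("utf-8")
kinds = [utf8_byte_kind(b) for b in encoded]
print(kinds)  # ['single', 'start', 'continuation', 'continuation']
```

Because the three ranges do not overlap, a backward search only has to skip continuation bytes to reach the start of a character.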

The two-byte area is divided into eight regions, five GBK levels and three user-defined areas:

GBK levels
Level	1st byte	2nd byte	Available code points	Characters in GB 18030	Characters in GBK 1.0	Characters in GB 2312
GBK/1	`A1`-`A9`	`A1`-`FE`	846	728	717	682
GBK/2	`B0`-`F7`	`A1`-`FE`	6,768	6,763	6,763	6,763
GBK/3	`81`-`A0`	`40`-`FE` except `7F`	6,080	6,080	6,080
GBK/4	`AA`-`FE`	`40`-`A0` except `7F`	8,160	8,160	8,160
GBK/5	`A8`-`A9`	`40`-`A0` except `7F`	192	166	166
User-defined	`AA`-`AF`	`A1`-`FE`	564
User-defined	`F8`-`FE`	`A1`-`FE`	658
User-defined	`A1`-`A7`	`40`-`A0` except `7F`	672
Total			23,940	21,897	21,886	7,445
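The "available code points" column follows directly from the byte ranges: the number of first bytes multiplied by the number of second bytes, leaving out `7F` where the second-byte range contains it. The sketch below recomputes the column; the names `span`, `regions` and `counts` are hypothetical.

```python
def span(lo: int, hi: int) -> int:
    """Number of byte values in lo..hi, skipping 0x7F if it lies in the range."""
    n = hi - lo + 1
    return n - 1 if lo <= 0x7F <= hi else n


# (label, first-byte range, second-byte range), taken from the table
regions = [
    ("GBK/1", (0xA1, 0xA9), (0xA1, 0xFE)),
    ("GBK/2", (0xB0, 0xF7), (0xA1, 0xFE)),
    ("GBK/3", (0x81, 0xA0), (0x40, 0xFE)),
    ("GBK/4", (0xAA, 0xFE), (0x40, 0xA0)),
    ("GBK/5", (0xA8, 0xA9), (0x40, 0xA0)),
    ("User-defined 1", (0xAA, 0xAF), (0xA1, 0xFE)),
    ("User-defined 2", (0xF8, 0xFE), (0xA1, 0xFE)),
    ("User-defined 3", (0xA1, 0xA7), (0x40, 0xA0)),
]

counts = {name: span(*first) * span(*second) for name, first, second in regions}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n}")
print(f"Total: {total}")  # Total: 23940
```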

Code	…0	…1	…2	…3	…4	…5	…6	…7	…8	…9	…A	…B	…C	…D	…E	…F
0…	NUL	SOH	STX	ETX	EOT	ENQ	ACK	BEL	BS	HT	LF	VT	FF	CR	SO	SI
1…	DLE	DC1	DC2	DC3	DC4	NAK	SYN	ETB	CAN	EM	SUB	ESC	FS	GS	RS	US
2…	SP	!	"	#	$	%	&	'	(	)	*	+	,	-	.	/
3…	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?
4… to 7…	ASCII or second byte of a two-byte sequence (7F is DEL)
8… to F…	First or second byte of a two-byte sequence
