GB 18030

GB18030 is a Chinese character encoding standard that covers all Unicode characters (currently 128,172), including the 75,963 Han characters encoded as of Unicode 7.0, that is, Chinese characters and their variants used in Japan, Korea and Vietnam. Since September 1, 2001, it has been mandatory for all operating systems and programs sold in the People's Republic of China. It is the successor to the GBK and GB2312 encodings and covers both traditional and simplified characters. The official name is GB18030-2000; GB stands for Guojia Biaozhun (國家標準 / 国家标准), which means "national standard". The standard was published on March 17, 2000, and an update appeared on November 21, 2000.

GB18030 can be seen as the Chinese counterpart of UTF-8 because it provides code points for the entire Unicode range, including code points that have not yet been assigned. Like UTF-8, it is backward compatible with ASCII, and it can represent more than a million additional code points in four-byte sequences. Unlike UTF-8, however, GB18030 also maintains compatibility with GBK and GB2312: part of the mapping table was taken directly from GBK, and the rest was determined algorithmically. GB18030 also includes the characters of the Taiwanese Big5 encoding.
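The algorithmic part is easiest to see for the supplementary planes (U+10000 to U+10FFFF), which are commonly described as being mapped linearly onto four-byte sequences starting at 90 30 81 30. The following Python sketch assumes that description and the four-byte layout in which bytes 1 and 3 come from 81…FE (126 values) and bytes 2 and 4 from 30…39 (10 values). The function name is hypothetical, and the Basic Multilingual Plane, which additionally needs a lookup table, is not covered.

```python
def supplementary_to_gb18030(code_point: int) -> bytes:
    """Map a supplementary-plane code point to a four-byte GB18030 sequence.

    Assumption: the linear layout that starts at 90 30 81 30 for U+10000.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are handled by this sketch")
    index = code_point - 0x10000
    # The four bytes act as digits of a mixed-radix number:
    # byte 4 has 10 values, byte 3 has 126, byte 2 has 10.
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


# Number of distinct four-byte combinations: well over a million.
FOUR_BYTE_CAPACITY = 126 * 10 * 126 * 10

# Cross-check the sketch against Python's built-in "gb18030" codec.
for cp in (0x10000, 0x1F600, 0x10FFFF):
    assert supplementary_to_gb18030(cp) == chr(cp).encode("gb18030")
```

The cross-check at the end relies only on Python's standard `gb18030` codec, so the sketch can be compared with a real implementation for any supplementary code point.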

Most (Western) computer systems had already standardized on some form of Unicode when GB18030 appeared. After its publication, the technical simplification of treating Unicode as fixed-width 16-bit units (UCS-2) could no longer be maintained. Operating system vendors and programmers were, so to speak, forced by decree of the People's Republic to use either variable-width formats such as UTF-8 or UTF-16, or wider fixed-width formats such as UCS-4 or UTF-32. Microsoft took this step with Windows 2000; Linux had supported it even before GB18030 was introduced.

The GB18030-encoded font SimSun (Founder Extended) provided glyphs, i.e. concrete character shapes, for screen display and printing for the entire character repertoire of what was then Unicode 3.0. This already included the Unicode block "CJK Unified Ideographs Extension A" and, in anticipation, "Extension B" from Unicode 3.1, which was not published until March 2001. Other well-known fonts with early support for "Extension A" are SimSun 18030 and Code2000.

Structure of the characters

One-byte sequences correspond to ASCII and range from 00 to 7F (all byte values here are hexadecimal). Two-byte sequences correspond to GBK, which in turn contains GB2312; they consist of a lead byte in the range 81…FE followed by a byte in the range 40…FE (excluding 7F). Four-byte sequences map the Unicode characters not covered by the shorter forms. Their first and third bytes are in the range 81…FE, and their second and fourth bytes are in the range 30…39. Unlike in UTF-8, a byte in the range 30…7F cannot be assumed to represent an ASCII character; its meaning depends on its position in the sequence.
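The practical consequence of this position dependence can be illustrated with a short Python sketch. It uses Python's built-in `gb18030` codec; the two helper functions are hypothetical names introduced only for this illustration. A byte-level search reports an ASCII character that is not actually present in the text, while a search on the decoded string does not.

```python
def naive_contains(data: bytes, ascii_char: str) -> bool:
    """Byte-level search that ignores GB18030 sequence boundaries."""
    return ascii_char.encode("ascii") in data


def decoded_contains(data: bytes, ascii_char: str) -> bool:
    """Search after decoding, so only whole characters are compared."""
    return ascii_char in data.decode("gb18030")


two_byte = bytes([0x81, 0x40])               # second byte equals ASCII "@"
four_byte = bytes([0x81, 0x30, 0x81, 0x30])  # bytes 2 and 4 equal ASCII "0"

for data, ch in ((two_byte, "@"), (four_byte, "0")):
    # Columns: bytes in hex, byte-level result, character-level result
    print(data.hex(" "), naive_contains(data, ch), decoded_contains(data, ch))
```

In both cases the byte-level search finds a match and the character-level search does not, which is why tools that scan GB18030 data for ASCII delimiters need to track sequence boundaries.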

code	…0	…1	…2	…3	…4	…5	…6	…7	…8	…9	…A	…B	…C	…D	…E	…F
0…	NUL	SOH	STX	ETX	EOT	ENQ	ACK	BEL	BS	HT	LF	VT	FF	CR	SO	SI
1…	DLE	DC1	DC2	DC3	DC4	NAK	SYN	ETB	CAN	EM	SUB	ESC	FS	GS	RS	US
2…	SP	!	"	#	$	%	&	'	(	)	*	+	,	-	.	/
3…	ASCII, or second or fourth byte of a 4-byte sequence										:	;	<	=	>	?
4…–6…	ASCII, or second byte of a 2-byte sequence
7…	ASCII, or second byte of a 2-byte sequence (70–7E only)															DEL
8…–F…	First or third byte of a 4-byte sequence, or first or second byte of a 2-byte sequence
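The byte ranges above are enough to find sequence boundaries without any mapping table. The following Python sketch (the function name is hypothetical) splits a byte string into one-, two- and four-byte sequences using only those ranges. It checks the shape of each sequence, not whether the sequence is actually assigned to a character.

```python
def split_gb18030(data: bytes) -> list[bytes]:
    """Split GB18030 bytes into 1-, 2- and 4-byte sequences by byte ranges."""
    sequences = []
    i = 0
    while i < len(data):
        first = data[i]
        if first <= 0x7F:
            length = 1                      # ASCII
        elif 0x81 <= first <= 0xFE and i + 1 < len(data):
            second = data[i + 1]
            if 0x30 <= second <= 0x39:
                length = 4                  # four-byte sequence
            elif 0x40 <= second <= 0xFE and second != 0x7F:
                length = 2                  # two-byte (GBK) sequence
            else:
                raise ValueError(f"invalid second byte at offset {i + 1}")
        else:
            raise ValueError(f"invalid or truncated lead byte at offset {i}")
        seq = data[i:i + length]
        if len(seq) < length:
            raise ValueError(f"truncated sequence at offset {i}")
        if length == 4 and not (0x81 <= seq[2] <= 0xFE and 0x30 <= seq[3] <= 0x39):
            raise ValueError(f"invalid four-byte sequence at offset {i}")
        sequences.append(seq)
        i += length
    return sequences


# One ASCII letter, one GB2312 character, one supplementary-plane character.
parts = split_gb18030("A中😀".encode("gb18030"))
print([len(p) for p in parts])  # → [1, 2, 4]
```

Note that the decision between a two-byte and a four-byte sequence is made on the second byte alone, which is what allows the encoding to stay compatible with GBK.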

External links

IANA Charset Registration for GB18030
English summary of GB 18030-2000 (PDF, 408 kB)
Authoritative concordance between GB18030 and Unicode (note: the page may load slowly or fail to load in a browser)
ICU Converter Explorer: GB18030
Unicode CJK Unified Ideographs Extension A (PDF, 1.5 MB)
Unicode CJK Unified Ideographs Extension B (PDF, 13 MB)
GB18030 Support Package for Windows 2000/XP from Microsoft, with support for Chinese, Tibetan, Yi, Mongolian and Thai
SIL's free fonts, editors and documentation