Chinese character encoding

Chinese character encodings (Chinese 漢字編碼方法 / 汉字编码方法, Pinyin Hànzì biānmǎ fāngfǎ) assign byte sequences to Chinese characters so that they can be processed and stored on a computer. All Chinese character encodings also include an encoding of the ASCII characters.

Probably no other language or script has as many encoding and input methods as Chinese. According to statistics, more than five hundred encoding schemes have been proposed for inputting Chinese characters. Of these, around 40 to 50 have been implemented in software and formally tested on a computer, but no more than ten are commercially viable and in common use.

This is due in part to the large number of Chinese characters and their complicated shapes. It is also directly connected to the many dialects spoken in China: spoken and written language differ from region to region, and the standard language is not yet widespread enough.

Coding and input

Most encoding methods for entering Chinese characters with a keyboard fall roughly into four categories:

"Flowing coding" (流水碼 / 流水码, Liúshuǐmǎ),
Coding according to the form of the character (字形碼 / 字形码, Zìxíngmǎ),
Coding according to the sound of the character (字音碼 / 字音码, Zìyīnmǎ),
Coding according to the sound and form of the character (形音碼 / 形音码, Xíngyīnmǎ, or 音形碼 / 音形码, Yīnxíngmǎ).

Liushui coding

Also called 無理碼 / 无理码, wúlǐmǎ ("unreasonable coding", that is, a code with no systematic relation to the shape or sound of the character).

Arabic numerals or Latin letters are commonly used to encode the Chinese characters. A typical Liushui encoding was the Sima-dianbao, the four-digit telegraph code used by the Ministry of Post and Telecommunications. In principle, the numbers from 0001 to 9999 can encode almost ten thousand characters. The code was intended for writing telegrams, but the Ministry also used it as a general method of coding Chinese characters.

The Guojia biaozhun (national standard) "Information exchange with the basic collection of characters for coding Chinese characters" (GB 2312-80) encodes 6763 Chinese characters at positions 1601 to 8794, following the sequential order of a Liushui code. This code is known as 區位碼 / 区位码, Qūwèimǎ (zone code). For the two characters 中国 (Zhōngguó, China), the telegram code is 0022 and 0948, and the zone code is 5448 and 2590.
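The zone code can be recovered from the two-byte GB2312 form of a character, because each byte is the zone or position number offset by 0xA0. The following is a minimal sketch that assumes Python's built-in gb2312 codec; the function name quwei is only illustrative.

```python
def quwei(ch: str) -> str:
    """Return the four-digit zone code of a single GB 2312 Chinese character."""
    # A GB 2312 Chinese character encodes to exactly two bytes.
    hi, lo = ch.encode("gb2312")
    # Each byte is the zone (or position) number plus 0xA0.
    return f"{hi - 0xA0:02d}{lo - 0xA0:02d}"


print(quwei("中"), quwei("国"))  # 5448 2590
```

The result agrees with the zone codes 5448 and 2590 given for 中国.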

Coding according to the shape of the character

Coding according to the shape of the characters can be divided into three types: coding by the shape of the strokes, coding by the root of the character, and coding by characteristic features of the character.

Coding for the shape of the strokes

Coding by stroke shape uses the most basic strokes as input units.

Li Jinkai's eight-stroke coding is a typical example. It divides the strokes of Chinese characters into eight types, "一" Heng, "丨" Shu, "丿" Pie, "丶" Dian, Zhe, Wan, Cha and Fang, and encodes them with the digits one to eight. For example, the codes for the two characters 中国 are 82 and 81714.
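The eight-digit scheme can be sketched as a lookup from stroke type to digit. The digit assignment follows the order in which the eight types Heng, Shu, Pie, Dian, Zhe, Wan, Cha and Fang are listed. The decompositions of 中 and 国 are an assumption: they are inferred from the codes 82 and 81714 and are not taken from Li Jinkai's own tables.

```python
# Digits one to eight, in the order the stroke types are listed.
STROKE_DIGIT = {
    "heng": 1, "shu": 2, "pie": 3, "dian": 4,
    "zhe": 5, "wan": 6, "cha": 7, "fang": 8,
}

# Hypothetical decompositions, inferred from the codes 82 and 81714.
DECOMPOSITION = {
    "中": ["fang", "shu"],
    "国": ["fang", "heng", "cha", "heng", "dian"],
}


def stroke_code(ch: str) -> str:
    """Concatenate the digit of every stroke type in the decomposition."""
    return "".join(str(STROKE_DIGIT[s]) for s in DECOMPOSITION[ch])


print(stroke_code("中"), stroke_code("国"))  # 82 81714
```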

Stroke coding in the Wubizixing code is the "method of the divided character". The strokes "一" Heng, "丨" Shu, "丿" Pie, Na and Zhe are encoded with the digits one to five.

Coding for the root of the character

Also called radical coding or structure coding; it uses the radicals of the Chinese characters as input units.

Wang Yongmin's Wubizixing code is a typical example of coding by character roots. He identified 130 basic roots and arranged them on the keyboard, six roots to a key, so that each key stands for several roots. The "L" key, for example, stands for 车, 力, 甲, 田, 四 and 口. To enter a character, you press the keys for the corresponding letter combination. For example, typing "khk" and "lgyi" displays the two characters 中国 on the screen.

Coding for characteristics of the character

Characters are coded according to regularities in their contour features. An example is 角碼 / 角码, Jiǎomǎ (corner code), which includes the three-corner coding by Wang An and the four-corner number coding by Wang Yunwu, among others.

Coding according to the sound of the character

[Figure: keyboard layout for "double spelling"]

Coding according to the sound of the character is also called Pinyin, Zhuyin or Bopomofo input coding, depending on the phonetic notation used (Pinyin in China, Zhuyin or Bopomofo in Taiwan). Pinyin is used together with intelligent input systems that work with Latin letters.

The characters are coded by their sound. The relevant factors are usually the initial, the final and the tone. Coding according to the sound of the characters can be further divided into three types:

"Complete spelling" (全拼, quán pīn),
"Double spelling" (雙拼 / 双拼, shuāng pīn) and
"Mixed spelling" (混拼, hùn pīn).

Take 中国, Zhōngguó, as an example. In complete spelling you enter the eight letters "zhongguo". In double spelling you enter the four-letter code "vsgo", in which "v" and "g" stand for the initials "zh" and "g", and "s" and "o" stand for the finals "ong" and "uo". In mixed spelling you enter the five-letter code "jiaty".
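The difference between complete and double spelling can be sketched with two small lookup tables. The tables contain only the four key assignments named here (zh to v, g to g, ong to s, uo to o); a real double-spelling scheme assigns a key to every initial and final, and the function name is only illustrative.

```python
# Only the key assignments mentioned in the text; not a full scheme.
INITIAL_KEY = {"zh": "v", "g": "g"}
FINAL_KEY = {"ong": "s", "uo": "o"}


def double_spelling(syllables):
    """Encode (initial, final) pairs as two keystrokes per syllable."""
    return "".join(INITIAL_KEY[i] + FINAL_KEY[f] for i, f in syllables)


zhongguo = [("zh", "ong"), ("g", "uo")]
full = "".join(i + f for i, f in zhongguo)

print(full, len(full))            # zhongguo 8
print(double_spelling(zhongguo))  # vsgo
```

Double spelling always needs exactly two keystrokes per syllable, whereas the length of the complete spelling varies with the syllable.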

Of these three types, only complete spelling corresponds to the standardized romanization of Chinese (pinyin); double spelling and mixed spelling were devised by the designers of the respective codes. The double-spelling example above comes from the "natural code", and the mixed-spelling example is a design specific to the CCDOS system.

Coding according to the sound and shape of the character

This type of coding combines coding according to the shape of the characters with coding according to their sound. It can be divided into sound-form coding, form-sound coding, sound-meaning coding and others.

Current usage

Four ways of encoding or entering Chinese characters were described above. In current practice, those who speak standard Chinese and know pinyin favor the pinyin input method. Those who speak a dialect prefer a coding based on the shape of the characters, and Wubizixing is mastered by most professional typists.

Coding on the Internet

If you want to set your browser correctly when loading Chinese-language websites, you will usually encounter the following encodings:

Big5

The Big5 character encoding comes from Taiwan and is used for Traditional Chinese. ASCII characters are encoded in one byte, identical to standard ASCII. Chinese characters are encoded in two bytes.

GB2312

The GB2312 character encoding is used for Simplified Chinese. ASCII characters are encoded in one byte, identical to standard ASCII. Chinese characters are encoded in two bytes.
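The one-byte and two-byte behavior of Big5 and GB2312 can be checked with a short sketch. It assumes Python's built-in big5 and gb2312 codecs; the helper name byte_lengths is only illustrative.

```python
def byte_lengths(text: str, codec: str) -> list:
    """Return the encoded length in bytes of each character in text."""
    return [len(ch.encode(codec)) for ch in text]


for codec in ("big5", "gb2312"):
    # "A" is an ASCII character, "中" a Chinese character found in both sets.
    print(codec, byte_lengths("A中", codec))
    # The ASCII character keeps its ASCII byte value.
    print(codec, "A".encode(codec) == "A".encode("ascii"))
```

The two codecs assign different byte values to the same Chinese character, so a page decoded with the wrong one of them displays incorrectly.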

GB18030

The GB18030 character encoding extends GB2312 to cover the Unicode character set and is used for Simplified Chinese. ASCII characters are encoded in one byte, identical to standard ASCII. Chinese characters are encoded in two or four bytes. In the version GB 18030-2000, 110,000 characters are defined.
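A short sketch of the variable length of GB18030, assuming Python's built-in gb18030 codec. The character U+20000 is chosen here only as an example of a character outside the two-byte range.

```python
def gb18030_length(ch: str) -> int:
    """Number of bytes GB18030 uses for a single character."""
    return len(ch.encode("gb18030"))


for ch in ("A", "中", "\U00020000"):
    print(f"U+{ord(ch):04X}: {gb18030_length(ch)} byte(s)")

# A character from GB2312 keeps its two-byte GB2312 code in GB18030.
same_bytes = "中".encode("gb18030") == "中".encode("gb2312")
print(same_bytes)  # True
```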

Unicode

Unicode differs from the other Chinese character encodings in that it is not tied to either Simplified or Traditional Chinese. Instead, Chinese, Japanese and Korean characters are identified with one another as far as possible through Han unification.

Unicode Transformation Formats

Unicode first assigns abstract numbers (code points) to the characters. Their conversion into byte sequences is defined by the Unicode Transformation Formats:

In UTF-8, ASCII characters are encoded in one byte and Chinese characters in three or four bytes.
In UTF-16, ASCII characters are encoded in two bytes and Chinese characters in two or four bytes.
In UTF-32, all characters are encoded in four bytes without exception.
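These lengths can be verified with a minimal sketch using Python's built-in codecs. The little-endian variants are used so that no byte order mark is counted; the helper name utf_lengths is only illustrative.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def utf_lengths(ch: str) -> dict:
    """Encoded length in bytes of one character in each format."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


# An ASCII letter, a common Chinese character, and a rare one (U+20000).
for ch in ("A", "中", "\U00020000"):
    print(f"U+{ord(ch):04X}", utf_lengths(ch))
```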

These Unicode Transformation Formats are also called encodings. Each is characterized by the size of its storage unit (1, 2 or 4 bytes) and, for units longer than one byte, by its endianness, which defines the byte order (big endian or little endian).
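The effect of byte order can be illustrated with the character 中 (U+4E2D) in UTF-16. This is a small sketch using Python's built-in codecs.

```python
ch = "中"  # code point U+4E2D

big = ch.encode("utf-16-be")     # most significant byte first
little = ch.encode("utf-16-le")  # least significant byte first

print(big.hex(), little.hex())   # 4e2d 2d4e

# With a leading byte order mark, the generic "utf-16" decoder
# can work out the byte order on its own.
with_bom = b"\xff\xfe" + little
print(with_bom.decode("utf-16") == ch)  # True
```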

SIP

The code points for a large number of rarely used characters are allocated in the Supplementary Ideographic Plane, i.e. in the range U+20000 to U+2FFFF.
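Whether a character lies in this plane can be tested directly on its code point. This is a minimal sketch; the names in_sip and plane are only illustrative.

```python
SIP_FIRST, SIP_LAST = 0x20000, 0x2FFFF


def in_sip(ch: str) -> bool:
    """True if the character lies in the Supplementary Ideographic Plane."""
    return SIP_FIRST <= ord(ch) <= SIP_LAST


def plane(ch: str) -> int:
    """Unicode plane number: the code point divided by 0x10000."""
    return ord(ch) >> 16


print(in_sip("中"), plane("中"))                  # False 0
print(in_sip("\U00020000"), plane("\U00020000"))  # True 2
```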

Other Unicode areas

Unicode also has areas for bopomofo, radicals and special characters used in typography. The Latin letters with tone marks used for pinyin are either encoded individually or can be represented using the block of combining diacritical marks.
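The two representations of pinyin tone marks can be converted into each other with Unicode normalization. The sketch below uses Python's standard unicodedata module and the word Zhōngguó as an example.

```python
import unicodedata

# Precomposed form: "ō" is U+014D and "ó" is U+00F3.
precomposed = "Zh\u014dnggu\u00f3"

# NFD splits each accented letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)

print(len(precomposed), len(decomposed))  # 8 10
print([f"U+{ord(c):04X}" for c in decomposed if unicodedata.combining(c)])

# NFC recombines them into the individually encoded letters.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```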
