Han unification

from Wikipedia, the free encyclopedia
Example of Han unification: an ideogram in simplified characters, traditional characters, Kanji and Hanja (from left to right)

In computer science, the term Han unification refers to the unification of Chinese Hanzi, Japanese Kanji and Korean Hanja (CJK), and more rarely the Vietnamese Chữ nôm, in a single character set. The term is mostly used in connection with Unicode and the Han unification carried out there.

The idea of uniting the various Han scripts in one character set is not new: as early as 1980, the Chinese Character Code for Information Interchange (CCCII) combined simplified characters, traditional characters and Kanji in one character set. The same idea was pursued when the Unicode standard was developed. In February 1990 a group specializing in Han unification, the CJK-JRG, was founded. It was renamed IRG a little later.

When China announced the development of a new character set, GB 13000, Unicode and China agreed to jointly develop the Han character set.

Han unification in Unicode

Table as a graphic

The Ideographic Rapporteur Group (IRG) is responsible for Han unification in Unicode. It reviews all encoding proposals and identifies characters that can be unified. Unification in Unicode follows strict rules:

  • To make conversion from older character sets to Unicode easier, the source separation rule was applied to the 20,902 characters of the first Unicode version: two ideograms that are distinguished in an older character set are also distinguished in Unicode. This rule is no longer applied to CJK ideograms encoded later.
  • If ideograms are not related in their historical meaning, they are not unified either. This applies, for example, to the characters 土 (earth) and 士 (warrior), which look similar but have completely different meanings and origins.

Next, the ideograms are broken down into their individual strokes. The number and position of the strokes, the structure, the encoding in an older character set and the radical of each character are then determined. If all of these match, the characters are unified; otherwise they are not.
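The decision procedure described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the IRG's actual process: the `Ideograph` record, its fields and the `can_unify` function are hypothetical names introduced here, and the stroke descriptions in the usage example are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Ideograph:
    """Hypothetical feature record for one candidate ideogram."""
    label: str
    origin: str          # tag for historical meaning and origin
    strokes: tuple       # strokes in order, each with a rough position
    structure: str       # overall layout, e.g. "single" or "left-right"
    radical: int         # radical number
    legacy: dict = field(default_factory=dict)  # legacy charset name -> code


def can_unify(a: Ideograph, b: Ideograph, source_separation: bool = True) -> bool:
    """Return True if the two records would be unified under these toy rules."""
    # Source separation: a legacy set that encodes both separately keeps them apart.
    if source_separation:
        for charset in a.legacy.keys() & b.legacy.keys():
            if a.legacy[charset] != b.legacy[charset]:
                return False
    # Historically unrelated ideograms are never unified.
    if a.origin != b.origin:
        return False
    # Strokes (count and position), structure and radical must all match.
    return (a.strokes, a.structure, a.radical) == (b.strokes, b.structure, b.radical)


# Illustrative values only; the stroke tuples are invented for this example.
earth = Ideograph("earth", "earth",
                  (("h", "middle"), ("v", "middle"), ("h", "bottom-long")), "single", 32)
warrior = Ideograph("warrior", "warrior",
                    (("h", "middle-long"), ("v", "middle"), ("h", "bottom")), "single", 33)
print(can_unify(earth, warrior))  # False
```

Passing `source_separation=False` models the later practice in which the source separation rule is no longer applied.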

In most cases, characters are unified if they differ only in how they look in the various writing styles of the Chinese script. For example, one radical is written in print with either one or two upper dots, whereas in regular script and handwriting it has only one dot everywhere. Similarly, another radical is still written in its older shape in the classical print style (Ming) but differently in handwriting and regular script. These differences arise because, after the script reforms in the People's Republic of China and Japan, attempts were made to bring printed forms closer to handwriting, whereas this was not done at all in Korea and only to a limited extent in Taiwan.

The following table shows, one character per row, how the same code point is rendered in different CJK fonts (Chinese without further specification, as displayed by the browser; Simplified Chinese as used in the People's Republic of China, Singapore and Malaysia; Traditional Chinese as used in the Republic of China (Taiwan), Hong Kong and Macau; Japanese; Korean). The differences stem from font-specific conventions and can concern the arrangement of the strokes, their number or their direction. For the table to display properly, the appropriate fonts must be installed and the browser must select the right one for each column. If this is not the case, the graphic version of the table can be consulted instead.

Code | Chinese (general) | Simplified Chinese | Traditional Chinese | Japanese | Korean
U+4E0E
U+4ECA
U+4EE4
U+514D
U+5165
U+5168
U+5177
U+5203
U+5316
U+5340
U+5916
U+60C5
U+624D
U+6B21
U+6D77
U+6F22
U+753B
U+76F4
U+771F
U+7A7A
U+7D00
U+8349
U+89D2
U+8ACB
U+9053
U+9913
U+9AA8
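Because each row of this table is a single code point, the regional difference has to come from font selection. In a browser, one common way to steer this is the HTML `lang` attribute. The following minimal sketch generates one table row per code point with a cell for each language tag. The helper name `row_html` and the choice of tags are assumptions made for illustration, and what is actually displayed still depends on the installed fonts.

```python
import unicodedata

# Language tags assumed here for the five glyph columns of the table.
LANG_TAGS = ["zh", "zh-Hans", "zh-Hant", "ja", "ko"]


def row_html(code_point: int) -> str:
    """Build one HTML table row showing the same code point under each language tag."""
    # A numeric character reference keeps the generated markup pure ASCII.
    ref = f"&#x{code_point:X};"
    cells = [f"<td>U+{code_point:04X}</td>"]
    cells += [f'<td lang="{tag}">{ref}</td>' for tag in LANG_TAGS]
    return "<tr>" + "".join(cells) + "</tr>"


for cp in (0x4E0E, 0x76F4, 0x9AA8):
    # The character name shows that each row is one unified ideograph.
    print(unicodedata.name(chr(cp)))
    print(row_html(cp))
```

Every cell of a row carries the same character reference; only the `lang` attribute differs, which is what allows the browser to pick a region-specific font.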

On the other hand, some character variants were encoded separately in Unicode, as the following table illustrates:

Code | Chinese (general) | Simplified Chinese | Traditional Chinese | Japanese | Korean
U+9AD8
U+9AD9
U+7D05
U+7EA2
U+4E1F
U+4E22
U+4E57
U+4E58
U+4FA3
U+4FB6
U+514C
U+5151
U+5167
U+5185
U+7522
U+7523
U+7A05
U+7A0E
U+4E80
U+9F9C
U+9F9F
U+5225
U+522B
U+4E21
U+4E24
U+5169
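A short check can make the consequence of separate encoding concrete: such variants are distinct code points, so software treats them as different characters, and Unicode normalization does not merge them. The sketch below uses a few rows of the table. The grouping into variant sets is an assumption based on the order of the rows, and `VARIANT_GROUPS` is a name introduced here.

```python
import unicodedata

# A few sets of separately encoded variants (grouping assumed from the row order).
VARIANT_GROUPS = [
    (0x9AD8, 0x9AD9),
    (0x7D05, 0x7EA2),
    (0x4E80, 0x9F9C, 0x9F9F),
    (0x4E21, 0x4E24, 0x5169),
]

for group in VARIANT_GROUPS:
    chars = [chr(cp) for cp in group]
    # Each variant is its own code point, so plain comparison tells them apart.
    all_distinct = len(set(chars)) == len(chars)
    # Compatibility normalization leaves every variant unchanged.
    unchanged = all(unicodedata.normalize("NFKC", c) == c for c in chars)
    names = ", ".join(unicodedata.name(c) for c in chars)
    print(f"{names}: distinct={all_distinct}, unchanged by NFKC={unchanged}")
```

A search or comparison that should treat such variants as equivalent therefore needs its own mapping table; the character encoding alone does not provide it.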

Criticism

In East Asia, Han unification is criticized mainly for cultural reasons, but also for technical ones.

Historically, neither Chinese nor Japanese drew a sharp distinction between glyph and character. When designing Unicode, the consortium had the choice of either introducing this distinction systematically or doing without it altogether and encoding every variation separately. The latter would have led to numerous variants of many semantically identical characters, including variants that cannot be clearly assigned to a language area (Classical Chinese, Simplified Chinese, Japanese, Korean) but can be distinguished only historically.

Today's Unicode standard represents a compromise. Complete unification based solely on semantic criteria was not carried out. This was done for practical reasons: the declared goal was that modern Chinese, Japanese and Korean can be distinguished in the same text without changing fonts. Classical texts can also be represented unambiguously at the semantic level in Unicode 3.1. Only the representation of historical variants, which can be of interest in a linguistic context, is not possible in Unicode 3.1.

Another problem was the inability to specify different variants of a character in a text without markup. This is particularly problematic in Japanese, where some place names and personal names still use the old radicals. For example, the first character of the Gion district (祇園) of Kyōto is written not with the usual form of its radical but with the older form, although other words containing 祇 are written with the usual form.

Unicode 3.2 addressed this problem with variation selectors. Standardized variants and historically used forms and characters have been, and continue to be, added, for example in the Unicode blocks CJK Unified Ideographs Extension A (Unicode 3.0), Extension B (Unicode 3.1), Extension C (Unicode 5.2), Extension D (Unicode 6.0), Extension E (Unicode 8.0) and Extension F (Unicode 10.0).
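How a variation selector travels with a character in plain text can be sketched as follows. The selector is an invisible code point placed directly after the base character, so the variant choice survives without markup. The function names are hypothetical, and pairing U+7947 (祇) with the selector U+E0100 is an assumption for illustration only: whether that sequence is registered, and which glyph a font shows for it, is not determined by this code.

```python
# Code point ranges of the two variation selector blocks.
VS_RANGES = ((0xFE00, 0xFE0F), (0xE0100, 0xE01EF))


def is_variation_selector(ch: str) -> bool:
    """Return True if the single character is a variation selector."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)


def with_variant(base: str, selector: int) -> str:
    """Append a variation selector to a single base character."""
    return base + chr(selector)


def strip_variants(text: str) -> str:
    """Drop all variation selectors, leaving only the base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


base = "\u7947"                          # first character of Gion
seq = with_variant(base, 0xE0100)        # selector chosen for illustration only
print([f"U+{ord(c):04X}" for c in seq])  # ['U+7947', 'U+E0100']
print(strip_variants(seq) == base)       # True
```

Software that does not support the sequence can fall back to the base character, which is what `strip_variants` imitates.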
