East Asian scripts in Unicode

Under the heading of East Asian scripts, Unicode groups a number of writing systems that originated in the East Asian cultural sphere and are used there. In addition to the largest and oldest group, the Chinese characters, these are scripts used in countries neighboring China and partly influenced by Chinese characters: the two Japanese syllabaries Hiragana and Katakana, the Korean alphabet Hangeul, and the Yi syllabary. The Chinese phonetic script Bopomofo is also encoded in Unicode.

Encoded characters

Chinese characters

Chinese characters are not used only for the Chinese language. Japanese was originally written exclusively with Chinese characters, called Kanji; today they are used together with Hiragana and Katakana. Korean also originally used Chinese characters, known as Hanja. Over time, the individual characters developed different forms and meanings in the individual languages. When they were added to Unicode, it therefore had to be decided whether the characters should be encoded separately for each language or only once for all languages together. The decision was to merge the language-specific variants into a single Unicode character. In this process, known as Han unification, the Chinese characters encoded in Unicode under the name CJK were drawn from various national standards.

The order in which the characters are encoded is essentially based on the Kangxi dictionary.

The following table lists the blocks that contain Chinese characters. The "Period" column indicates when the characters were proposed for encoding to the responsible working group. More Chinese characters will be added to Unicode in the future; it is assumed that more than half of all possible characters have now been encoded.

Block	Range	Plane	Assigned code points	Period	Usage
CJK Unified Ideographs	4E00-9FFF	0	20,941	until 1992, with later additions	common
CJK Unified Ideographs Extension A	3400-4DBF	0	6,582	1992-1998	rare
CJK Unified Ideographs Extension B	20000-2A6DF	2	42,711	1998-2002	historical
CJK Unified Ideographs Extension C	2A700-2B73F	2	4,149	2002-2006	historical
CJK Unified Ideographs Extension D	2B740-2B81F	2	222	2006-2009	fairly rare
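
As an illustration of how the ranges in the table can be used, the following minimal sketch looks up the block and plane of a character. The function names `cjk_block` and `plane` are hypothetical helpers introduced here; the ranges are copied from the table, and the block names follow the official English Unicode block names.

```python
# Ranges copied from the table above (first code point, last code point, block name).
CJK_IDEOGRAPH_BLOCKS = [
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
    (0x2A700, 0x2B73F, "CJK Unified Ideographs Extension C"),
    (0x2B740, 0x2B81F, "CJK Unified Ideographs Extension D"),
]


def cjk_block(ch):
    """Return the name of the CJK ideograph block containing ch, or None."""
    cp = ord(ch)
    for first, last, name in CJK_IDEOGRAPH_BLOCKS:
        if first <= cp <= last:
            return name
    return None


def plane(ch):
    """Return the Unicode plane number (each plane holds 0x10000 code points)."""
    return ord(ch) >> 16
```

A lookup of this kind only says which block a code point falls into; it does not tell whether that code point is actually assigned.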

In addition to these blocks, there are the two blocks CJK Compatibility Ideographs and CJK Compatibility Ideographs Supplement. With twelve exceptions, they contain compatibility characters that could have been unified with other characters but were assigned their own code points for compatibility with other standards.

Radicals are encoded in the blocks Kangxi Radicals and CJK Radicals Supplement.

The individual strokes from which the characters are built are encoded in the Unicode block CJK Strokes. They can be used, for example, in an index for Chinese dictionaries.

A character that has not yet been encoded in Unicode can be represented by an ideographic description sequence.

The Unicode block Ideographic Description Characters contains a number of characters that make it possible to describe characters that have not yet been encoded on the basis of their structure. To do this, the new character is broken down into two or three known characters. These are preceded by an ideographic description character that indicates how the components are to be combined. For example, in "⿰書史" the first character indicates a character that is divided by a vertical line into a left and a right half; the two following characters show what these halves look like. There are further description characters for other arrangements. For very complex missing characters, description sequences can also be nested, that is, one of the component characters can itself be described by such a sequence.

The Unicode block Kanbun contains some characters that are used in Japanese to annotate Chinese texts.
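
The prefix structure of such sequences, including nesting, can be sketched with a small recursive parser. This is a minimal illustration, not a validator for the full grammar in the standard: it assumes the twelve description characters U+2FF0 to U+2FFB, of which U+2FF2 and U+2FF3 take three components and the others two, and the names `parse_ids` and `parse_full` are hypothetical.

```python
# Assumption: the description characters are U+2FF0..U+2FFB.
# U+2FF2 and U+2FF3 combine three components, all others combine two.
IDC_FIRST, IDC_LAST = 0x2FF0, 0x2FFB
TERNARY = {0x2FF2, 0x2FF3}


def parse_ids(text, pos=0):
    """Parse one description sequence starting at pos.

    Returns (tree, next_pos). A plain character is returned as itself;
    a description character yields a tuple (operator, component, ...).
    """
    if pos >= len(text):
        raise ValueError("unexpected end of description sequence")
    ch = text[pos]
    cp = ord(ch)
    if IDC_FIRST <= cp <= IDC_LAST:
        arity = 3 if cp in TERNARY else 2
        pos += 1
        parts = []
        for _ in range(arity):
            node, pos = parse_ids(text, pos)  # components may be nested sequences
            parts.append(node)
        return (ch, *parts), pos
    return ch, pos + 1


def parse_full(text):
    """Parse text as exactly one description sequence."""
    tree, end = parse_ids(text)
    if end != len(text):
        raise ValueError("trailing characters after description sequence")
    return tree
```

Applied to the example from the text, `parse_full("⿰書史")` yields the operator followed by its two components, and a nested sequence produces a nested tuple in the position of the component it describes.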

Bopomofo

The phonetic script Bopomofo, also called Zhuyin, is encoded in the two blocks Bopomofo and Bopomofo Extended. Only the tone marks are missing from these blocks; they are located in the Unicode block Spacing Modifier Letters.

Hiragana and Katakana

The two main blocks for the Japanese syllabaries, the Unicode block Hiragana and the Unicode block Katakana, are structured in parallel and essentially follow the standard JIS X 0208.

Further Japanese characters can be found in the blocks Katakana Phonetic Extensions and Kana Supplement.
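
Because the two blocks are laid out in parallel, corresponding Hiragana and Katakana characters differ by a constant code point offset. The following sketch relies on that; the ranges U+3041 to U+3096 and U+30A1 to U+30F6 are an assumption about which characters have a counterpart in the other block, and the function names are hypothetical.

```python
# Offset between corresponding characters of the two blocks (0x60).
KANA_OFFSET = 0x30A1 - 0x3041


def hiragana_to_katakana(text):
    """Map Hiragana U+3041..U+3096 to Katakana; leave everything else unchanged."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if 0x3041 <= ord(ch) <= 0x3096 else ch
        for ch in text
    )


def katakana_to_hiragana(text):
    """Map Katakana U+30A1..U+30F6 to Hiragana; leave everything else unchanged."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )
```

Characters outside these ranges, such as the prolonged sound mark or Kanji, pass through unchanged.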

Hangeul

For the Korean script, Unicode provides individual jamo in the blocks Hangul Jamo, Hangul Jamo Extended-A, and Hangul Jamo Extended-B. When displayed, these are assembled into syllable blocks. The most important of these syllable blocks are also available as precomposed syllables in the Unicode block Hangul Syllables. The encoding order is chosen so that breaking the syllables down into individual jamo, and the reverse operation, can easily be carried out algorithmically, for example during normalization. For compatibility with the Korean standard KS X 1001, the Unicode block Hangul Compatibility Jamo also defines individual jamo, which, however, do not combine to form syllables.
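
The arithmetic behind this decomposition and composition can be sketched as follows. The constants are the ones commonly given for the Hangul syllable algorithm (syllables start at U+AC00; 19 leading consonants, 21 vowels, 27 trailing consonants plus "none"); the function names are hypothetical, and the sketch handles a single syllable only.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total


def decompose_syllable(syllable):
    """Split one precomposed syllable into its two or three jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return syllable  # not a precomposed Hangul syllable
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail


def compose_syllable(lead, vowel, trail=""):
    """Combine a leading consonant, a vowel and an optional trailing consonant."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)
```

For a single syllable the result of `decompose_syllable` should agree with `unicodedata.normalize("NFD", syllable)` from the Python standard library, which makes the sketch easy to check.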

Yi

The modern Yi syllabary is encoded in Unicode in two blocks. The Unicode block Yi Syllables contains the actual syllable characters, and the Unicode block Yi Radicals contains the radicals that make up the script. As with the Chinese radicals, these are primarily intended for use in indices.

More characters

In addition to the scripts themselves, Unicode encodes further characters that are derived from them or used together with them.

Punctuation marks and some symbols specific to East Asian scripts can be found in the Unicode block CJK Symbols and Punctuation. Additional symbols derived from or used with these scripts can be found in the blocks Enclosed CJK Letters and Months, Enclosed Ideographic Supplement, and CJK Compatibility. For compatibility with other standards, the blocks Vertical Forms (for GB 18030) and CJK Compatibility Forms (for CNS 11643) explicitly encode some punctuation marks in the form they take in vertical layout. Also for compatibility with CNS 11643, the Unicode block Small Form Variants encodes some punctuation marks in a small variant.

The Unicode block Halfwidth and Fullwidth Forms likewise exists for compatibility with older standards. Most character encodings for East Asian scripts use a single-byte character set based on ASCII alongside a multi-byte character set for the CJK characters. The number of bytes corresponds to the width of the characters: the single-byte characters are displayed at only half the width. Many of these character sets encode all characters of the ASCII range a second time with several bytes as full-width forms. Conversely, some characters, including Katakana, were also included in character sets as half-width forms. Unicode therefore also provides these doubly encoded characters a second time, as full-width or half-width characters.
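
The relationship between the ordinary and the width-variant characters can be sketched in Python. The sketch assumes that the full-width forms U+FF01 to U+FF5E mirror ASCII U+0021 to U+007E at a fixed offset, and it uses compatibility normalization (NFKC) from the standard `unicodedata` module to fold width variants back to their ordinary counterparts. The function names are hypothetical.

```python
import unicodedata

# Assumption: full-width forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E.
FULLWIDTH_OFFSET = 0xFF01 - 0x21


def to_fullwidth(text):
    """Replace printable ASCII characters (except space) by their full-width forms."""
    return "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if 0x21 <= ord(ch) <= 0x7E else ch
        for ch in text
    )


def fold_width(text):
    """Fold full-width and half-width forms to the ordinary characters.

    Note: NFKC also replaces other compatibility characters, not only
    width variants, so this is broader than a pure width conversion.
    """
    return unicodedata.normalize("NFKC", text)
```

Folding is useful, for example, before comparing or searching text that may mix full-width Latin letters or half-width Katakana with their ordinary forms.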

Presentation

The traditional writing direction of East Asian scripts is in columns from top to bottom, with the columns usually arranged from right to left. All characters have the same width and height. However, horizontal lines running from left to right and proportional fonts are now also used.

Some characters look different depending on whether they appear in vertical or horizontal text. This applies in particular to punctuation marks, but also to Latin letters, which are usually shown rotated by 90° in vertical text.

To decide which characters should be rotated in vertical layout and which characters should be set horizontally in a proportional font, the Unicode property East_Asian_Width can be used. It indicates whether a character is wide, that is, behaves like a Chinese character, or narrow and is treated like a Latin letter. For vertical layout, there is an alternative algorithm described in Unicode Technical Report #50, which is based on a property defined specifically for this algorithm.

This property can take one of four values for each character: U means that the character is also displayed upright in vertical layout; R identifies characters that are rotated 90° clockwise. There are two further values, Tu and Tr. Characters with these values have a special typographic variant for vertical text; only if that variant cannot be used for some reason is the value treated as U or R, respectively. The text is first segmented by the Unicode segmentation algorithm for grapheme clusters. The first character of a grapheme cluster determines its orientation, except for clusters with an enclosing combining mark, which are always displayed upright.

Unicode does not provide a dedicated mechanism for selecting the language-appropriate glyph variant of CJK characters. In most cases the reader will have the appropriate font set as the default font, and even if characters (such as Chinese quotations in Japanese text) do not appear in the expected form, they will still be legible. If the exact presentation matters, suitable meta information must be added to the text. One possibility is the language tag characters, which are now deprecated. A specific glyph variant can also be selected for individual characters using a variation selector. It is also possible to use higher-level protocols such as HTML to convey information about the language or the desired font.

Bopomofo is often used to annotate text written in Chinese characters. Depending on the writing direction, these annotations should be displayed vertically beside or horizontally above the annotated text. Techniques such as ruby or the interlinear annotation characters are useful here.
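
The East_Asian_Width property can be queried in Python through `unicodedata.east_asian_width`, which returns the short property value (for example `W` for wide, `F` for full-width, `H` for half-width, `Na` for narrow). The following sketch uses it to estimate how many columns a string occupies in a fixed-width layout; treating only `W` and `F` as double width and ignoring combining marks are simplifying assumptions, and the function names are hypothetical.

```python
import unicodedata


def column_width(ch):
    """Return 2 for wide and full-width characters, otherwise 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def display_width(text):
    """Estimate the number of fixed-width columns needed for text.

    Simplification: combining marks and ambiguous-width characters
    are counted as one column each.
    """
    return sum(column_width(ch) for ch in text)
```

How characters with the value `A` (ambiguous) are treated depends on the context; a real layout engine would have to make that choice explicitly.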
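
The fallback rule for the four values can be written down as a small function. This is only a sketch of the rule as described here: the Python standard library does not expose this property, so the value is passed in as a string, and both the function name and the `"T"` result, standing for "use the special vertical variant", are hypothetical.

```python
def resolve_orientation(value, has_vertical_variant):
    """Resolve a vertical orientation value to a rendering decision.

    value: one of "U", "R", "Tu", "Tr".
    has_vertical_variant: whether the font offers the special vertical
    form of the character (an input the caller has to determine).
    Returns "U" (upright), "R" (rotated 90 degrees clockwise) or
    "T" (hypothetical marker: use the special typographic variant).
    """
    if value in ("U", "R"):
        return value
    if value in ("Tu", "Tr"):
        if has_vertical_variant:
            return "T"
        # Fallback: Tu degrades to upright, Tr to rotated.
        return "U" if value == "Tu" else "R"
    raise ValueError(f"unknown orientation value: {value!r}")
```

In a complete implementation this function would be applied once per grapheme cluster, using the value of the cluster's first character.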

Sources

Julie D. Allen et al.: The Unicode Standard. Version 6.2 - Core Specification. The Unicode Consortium, Mountain View, CA, 2012. ISBN 978-1-936213-07-8. Chapter 12: East Asian Scripts.
Ken Lunde: Unicode Standard Annex #11: East Asian Width.
Koji Ishii: Unicode Technical Report #50: Unicode Vertical Text Layout.

References

FAQ: Chinese and Japanese. Retrieved February 18, 2013.

Web links

The secret life of Unicode: East Asian issues
