Indian scripts in Unicode

from Wikipedia, the free encyclopedia

The Indian scripts in Unicode include the Indian group of scripts (the Brahmic family), which covers not only a large part of the scripts used in India but also scripts used in Southeast Asia. Other Indian scripts that are not derived from the Brahmi script are also encoded in Unicode. Displaying these scripts correctly sometimes requires complex algorithms, which can be influenced by a few control characters.

Similarities

The Indian scripts are abugidas, and many of them have a very similar structure. Consonants can appear in two ways. A live consonant carries a vowel, which is either the inherent vowel or some other dependent vowel. A dead consonant carries no vowel. In addition to the dependent vowels, there are also independent vowels.

A consonant with a dependent vowel can be represented in different ways. In the simplest case, the vowel sign is attached to the consonant sign, comparable to a letter with a diacritical mark. The vowel sign can appear in different positions, including before the consonant. In some cases, the vowel sign consists of two separate parts. A separate symbol for the combination of consonant and vowel is also possible.
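The following minimal Python sketch illustrates two of these cases using the standard `unicodedata` module. The sample characters (Devanagari KA with vowel sign I, and the Bengali vowel sign O) are chosen here only for illustration and are not taken from the article. The sketch shows that a vowel sign drawn before the consonant is still stored after it, and that a two-part vowel sign can decompose into its two parts.

```python
import unicodedata

KA = "\u0915"                    # DEVANAGARI LETTER KA
VOWEL_SIGN_I = "\u093F"          # DEVANAGARI VOWEL SIGN I, drawn to the left of the consonant
BENGALI_VOWEL_SIGN_O = "\u09CB"  # a vowel sign made of two visual parts

# Stored (logical) order: the consonant always comes first,
# even though the rendered vowel sign appears before it.
syllable = KA + VOWEL_SIGN_I

def code_points(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Canonical decomposition (NFD) splits the two-part vowel sign
# into its left-hand and right-hand components.
two_parts = unicodedata.normalize("NFD", BENGALI_VOWEL_SIGN_O)

print(code_points(syllable))
print([unicodedata.name(ch) for ch in two_parts])
```

Reordering the vowel sign to the left of the consonant is left to the rendering engine and the font; the stored text is not affected.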

A dead consonant can also be represented in several ways. It often forms a ligature with the following consonant. Another possibility is to represent it in the so-called half-form, a form derived from the consonant sign that can be interpreted as the basic shape without the visual representation of the inherent vowel. A third possibility is to mark the dead consonant with an additional sign called the virama.

Unicode encodes the following characters separately for each Indian script: consonant characters and independent vowel characters are encoded as ordinary characters, and characters for dependent vowels as combining characters. The virama, which marks a consonant as a dead consonant, is also encoded as a combining character. This does not by itself determine how the dead consonant is displayed; in particular, not every combination of consonant and virama has to be rendered with a visible virama. Rather, each language has a set of rules that determine how sequences of dead and live consonants are to be rendered. The font used must contain the glyphs needed for correct display. Another combining character is the nukta.
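This split between ordinary and combining characters can be inspected with Python's standard `unicodedata` module. The sketch below uses a handful of Devanagari code points as samples; the labels and helper names are assumptions made for this illustration.

```python
import unicodedata

# Devanagari code points used only as illustrative samples.
SAMPLES = {
    "consonant KA": "\u0915",
    "independent vowel A": "\u0905",
    "dependent vowel sign I": "\u093F",
    "virama": "\u094D",
    "nukta": "\u093C",
}

def classify(ch):
    """Return (general category, canonical combining class) of a character."""
    return unicodedata.category(ch), unicodedata.combining(ch)

def is_mark(ch):
    """True if the character is a combining mark (general category M*)."""
    return unicodedata.category(ch).startswith("M")

for label, ch in SAMPLES.items():
    category, ccc = classify(ch)
    print(f"U+{ord(ch):04X} {label}: category={category}, combining class={ccc}")
```

Consonants and independent vowels report a letter category, while the dependent vowel sign, the virama, and the nukta report mark categories.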

To explicitly select a specific representation of a dead consonant, Unicode uses the two control characters ZWJ (zero-width joiner) and ZWNJ (zero-width non-joiner). If a dead consonant is followed by a ZWJ, it is shown in its half-form; if it is followed by a ZWNJ, a visible virama is used.
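As a rough illustration, the three character sequences below differ only in the control character placed after the virama. The consonant pair KA and SSA is an example chosen for this sketch, and whether the requested form actually appears depends on the rendering engine and the font.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
SSA = "\u0937"     # DEVANAGARI LETTER SSA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# Hypothetical labels; the sequences follow the rule described above.
SEQUENCES = {
    "default (renderer chooses, often a ligature)": KA + VIRAMA + SSA,
    "half-form requested": KA + VIRAMA + ZWJ + SSA,
    "visible virama requested": KA + VIRAMA + ZWNJ + SSA,
}

def describe(text):
    """Return the Unicode character names that make up text."""
    return [unicodedata.name(ch) for ch in text]

for label, text in SEQUENCES.items():
    print(label, "->", text, describe(text))
```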

Unicode thus follows the Indian standard ISCII-1988 both in the principle of the encoding and in the relative position of the individual characters. In addition, Unicode encodes further characters, in particular digits for the individual scripts.
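The shared relative positions can be seen in the code points themselves. The sketch below assumes the block start addresses listed in `BLOCK_STARTS` (they come from the Unicode code charts, not from this article) and checks that the letter KA sits at the same offset in each of the nine ISCII-derived blocks. It also shows that Python parses script-specific decimal digits.

```python
import unicodedata

# Start of each ISCII-derived block (assumed from the Unicode code charts).
BLOCK_STARTS = {
    "DEVANAGARI": 0x0900,
    "BENGALI": 0x0980,
    "GURMUKHI": 0x0A00,
    "GUJARATI": 0x0A80,
    "ORIYA": 0x0B00,
    "TAMIL": 0x0B80,
    "TELUGU": 0x0C00,
    "KANNADA": 0x0C80,
    "MALAYALAM": 0x0D00,
}

KA_OFFSET = 0x15  # position of the letter KA relative to the block start

def letter_at(script, offset):
    """Return the character at the given offset inside a script's block."""
    return chr(BLOCK_STARTS[script] + offset)

names = {script: unicodedata.name(letter_at(script, KA_OFFSET))
         for script in BLOCK_STARTS}

for script, name in names.items():
    print(f"U+{BLOCK_STARTS[script] + KA_OFFSET:04X} {name}")

# Script-specific digits are decimal digits in their own right.
devanagari_123 = "\u0967\u0968\u0969"  # DEVANAGARI DIGIT ONE, TWO, THREE
print(int(devanagari_123))
```

Not every offset is filled in every block, since each script only encodes the characters it actually uses.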

Encoded scripts

The following Indian scripts are also encoded in the ISCII-1988 standard, and all of them follow the rendering rules described above very closely.

Script              Unicode block(s)
Devanagari          Devanagari; Devanagari Extended; Vedic Extensions
Bengali script      Bengali
Gurmukhi script     Gurmukhi
Gujarati script     Gujarati
Oriya script        Oriya
Tamil script        Tamil
Telugu script       Telugu
Kannada script      Kannada
Malayalam script    Malayalam

The following scripts, which are or have been used in South Asia, are also derived from the Brahmi script. They are not encoded in the ISCII-1988 standard, and their rendering partly deviates from the rules described above.

Script              Unicode block(s)
Sinhala script      Sinhala
Tibetan script      Tibetan
Lepcha script       Lepcha
Phags-pa script     Phags-pa
Limbu script        Limbu
Sylheti Nagari      Syloti Nagri
Kaithi script       Kaithi
Saurashtra script   Saurashtra
Sharada script      Sharada
Takri script        Takri
Chakma script       Chakma
Meitei Mayek        Meetei Mayek; Meetei Mayek Extensions
Sorang Sompeng      Sora Sompeng
Brahmi script       Brahmi

Scripts from the Indian (Brahmic) group are also used outside South Asia:

Script              Unicode block(s)
Thai script         Thai
Lao script          Lao
Burmese script      Myanmar; Myanmar Extended-A; Myanmar Extended-B
Khmer script        Khmer; Khmer Symbols
Lanna script        Tai Tham
Cham script         Cham
Baybayin            Tagalog
Hanunó'o            Hanunoo
Buhid script        Buhid
Tagbanwa script     Tagbanwa
Lontara             Buginese
Balinese script     Balinese
Javanese script     Javanese
Rejang script       Rejang
Batak script        Batak
Sundanese script    Sundanese; Sundanese Supplement

Two Indian scripts fall outside this framework. One is Ol Chiki, encoded in the Unicode block Ol Chiki, which is an alphabet. The other is Kharoshthi, encoded in the Unicode block Kharoshthi, which is an abugida like the other scripts but is written from right to left.
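The writing direction is recorded in each character's bidirectional class, which can be queried with Python's standard `unicodedata` module. The three sample letters below are chosen for this sketch only.

```python
import unicodedata

# Sample letters chosen for illustration.
DEVANAGARI_LETTER_KA = "\u0915"
OL_CHIKI_LETTER_LA = "\u1C5A"
KHAROSHTHI_LETTER_KA = "\U00010A10"

def direction(ch):
    """Return the bidirectional class: 'L' is left-to-right, 'R' is right-to-left."""
    return unicodedata.bidirectional(ch)

for ch in (DEVANAGARI_LETTER_KA, OL_CHIKI_LETTER_LA, KHAROSHTHI_LETTER_KA):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {direction(ch)}")
```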

Criticism

The Unicode encoding of the Tamil script has been criticized by a number of organizations, including the government of Tamil Nadu. As an alternative, the TACE-16 encoding was proposed, which encodes the individual syllables instead of consonants and vowel signs. In particular, this encoding allows correct sorting without complex algorithms such as the Unicode Collation Algorithm. The Unicode standard was not changed, because such a change would contradict Unicode's stability criteria.

Sources

References

  1. FAQ: Tamil Language and Script, accessed February 19, 2013.
