Indic scripts in Unicode

The Indic scripts in Unicode comprise the Brahmic family of scripts, and thus not only a large part of the scripts used in India but also scripts used in Southeast Asia. Other scripts of India that are not derived from the Brahmi script are also encoded in Unicode. Displaying these scripts correctly sometimes requires complex algorithms, which can be influenced by a few control characters.

Similarities

The Indic scripts belong to the class of abugidas, and many of them have a very similar structure. Consonants can appear in two ways. A live consonant carries a vowel, which can be the inherent vowel or some other dependent vowel. A dead consonant carries no vowel. In addition to the dependent vowels, there are also independent vowels.

A consonant with a dependent vowel can be represented in different ways. In the simplest case, the vowel sign is attached to the consonant sign, comparable to letters with diacritical marks. The vowel sign can appear in different positions, including before the consonant. In some cases, the vowel sign consists of two separate parts. A separate glyph for the combination of consonant and vowel sign is also possible.
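The placement of vowel signs can be illustrated with a short Python sketch based on the standard unicodedata module. The characters used (Devanagari KA with vowel sign I, and Bengali vowel sign O) are examples chosen here and do not come from the text above. The sketch also assumes, following the Unicode Standard's treatment of these two scripts, that a vowel sign is stored after its consonant even when it is drawn to the left of it.

```python
import unicodedata

# Example 1: a vowel sign that is drawn BEFORE (to the left of) its consonant.
KA = "\u0915"            # DEVANAGARI LETTER KA
VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I
syllable_ki = KA + VOWEL_SIGN_I  # stored consonant first, vowel sign second

for ch in syllable_ki:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Example 2: a vowel sign made of two visual parts.
# Canonical decomposition (NFD) splits it into its two component signs.
BENGALI_VOWEL_SIGN_O = "\u09CB"
parts = unicodedata.normalize("NFD", BENGALI_VOWEL_SIGN_O)

for ch in parts:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

How the syllable is actually drawn depends on the font and the text-shaping engine, so the printed code points only show the stored order, not the visual one.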

A dead consonant can also be represented in several ways. It often forms a ligature with the following consonant. Another possibility is to represent it in the so-called half form, a form derived from the consonant sign that can be understood as its basic component without the visual representation of the inherent vowel. A third possibility is to mark the dead consonant with an additional sign called the virama.

For each of the Indic scripts, Unicode encodes the following characters separately: consonant signs and independent vowel signs are encoded as ordinary characters, and signs for dependent vowels as combining characters. The virama, which marks a consonant as a dead consonant, is also encoded as a combining character. This does not by itself determine how the dead consonant is displayed; in particular, not every combination of consonant and virama has to be displayed with a visible virama. Rather, each language has a set of rules that determine which sequences of dead and live consonants are displayed in which way. The font used must contain the glyphs needed for correct display. Another combining character is the nukta.
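The character classes described here can be inspected with Python's standard unicodedata module. The following is a minimal sketch; the Devanagari characters are examples chosen for illustration and are not prescribed by the text.

```python
import unicodedata

KA, SSA = "\u0915", "\u0937"   # consonant letters
LETTER_A = "\u0905"            # independent vowel
VOWEL_SIGN_I = "\u093F"        # dependent vowel sign
VIRAMA = "\u094D"              # marks the preceding consonant as dead
NUKTA = "\u093C"               # another combining sign

def describe(ch):
    """Return (name, general category, canonical combining class)."""
    return unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch)

for ch in (KA, LETTER_A, VOWEL_SIGN_I, VIRAMA, NUKTA):
    name, category, ccc = describe(ch)
    print(f"U+{ord(ch):04X} {name}: category={category}, combining class={ccc}")

# A dead consonant followed by a live one is stored as three code points.
# Whether this is drawn as a ligature, a half form or with a visible virama
# is decided by the font and the shaping rules, not by the encoding.
conjunct = KA + VIRAMA + SSA
print(conjunct, [f"U+{ord(c):04X}" for c in conjunct])
```

Letters report the general category Lo, while the vowel sign, virama and nukta report mark categories (Mc or Mn), which is how Unicode distinguishes ordinary characters from combining ones.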

To explicitly select a specific representation of a dead consonant, Unicode uses the two control characters ZWJ (zero width joiner) and ZWNJ (zero width non-joiner). If a dead consonant is followed by a ZWJ, it is shown in its half form; if it is followed by a ZWNJ, a visible virama is used.
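As a sketch of how these control characters appear in stored text, the following Python snippet builds the three variants of one consonant sequence. The helper name dead_consonant and the choice of Devanagari KA and SSA are assumptions made for this illustration.

```python
import unicodedata

KA, SSA = "\u0915", "\u0937"  # DEVANAGARI LETTER KA, DEVANAGARI LETTER SSA
VIRAMA = "\u094D"             # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200D"                # ZERO WIDTH JOINER
ZWNJ = "\u200C"               # ZERO WIDTH NON-JOINER

def dead_consonant(consonant, control=""):
    """Hypothetical helper: consonant + virama, optionally followed by ZWJ/ZWNJ."""
    return consonant + VIRAMA + control

variants = {
    "default (font and shaping rules decide)": dead_consonant(KA) + SSA,
    "half form requested (ZWJ)": dead_consonant(KA, ZWJ) + SSA,
    "visible virama requested (ZWNJ)": dead_consonant(KA, ZWNJ) + SSA,
}

for label, text in variants.items():
    codepoints = " ".join(f"U+{ord(c):04X}" for c in text)
    print(f"{label}: {text}  [{codepoints}]")
```

The three strings differ only by one invisible format character (general category Cf); the visible difference appears only when a font and shaping engine that support the script render them.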

Unicode thus follows the Indian standard ISCII-1988 both in the principle of the encoding and in the relative position of the individual characters. In addition, Unicode encodes further characters, in particular digits for the individual scripts.
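The shared relative positions can be made visible with a small Python sketch. The block start addresses and the three sample offsets below are taken from the Unicode code charts rather than from the text, and the names BLOCK_STARTS, OFFSETS and char_at are hypothetical.

```python
import unicodedata

# First code point of each block for the nine scripts covered by ISCII-1988.
BLOCK_STARTS = {
    "Devanagari": 0x0900, "Bengali": 0x0980, "Gurmukhi": 0x0A00,
    "Gujarati": 0x0A80, "Oriya": 0x0B00, "Tamil": 0x0B80,
    "Telugu": 0x0C00, "Kannada": 0x0C80, "Malayalam": 0x0D00,
}

# Offsets within a block at which corresponding characters are found.
OFFSETS = {"LETTER KA": 0x15, "SIGN VIRAMA": 0x4D, "DIGIT ZERO": 0x66}

def char_at(block_start, offset):
    """Return the character at the given offset from the start of a block."""
    return chr(block_start + offset)

for script, start in BLOCK_STARTS.items():
    names = [unicodedata.name(char_at(start, off)) for off in OFFSETS.values()]
    print(f"{script:<11}", " | ".join(names))
```

Not every offset is occupied in every block, since a script may lack a given letter, so a lookup of this kind should be treated as a convenience rather than a guaranteed mapping between scripts.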

Encoded scripts

The following Indic scripts are also encoded in the ISCII-1988 standard, and all of them follow the display rules described above very closely.

Script	Unicode block
Devanagari	Devanagari, Devanagari Extended, Vedic Extensions
Bengali script	Bengali
Gurmukhi script	Gurmukhi
Gujarati script	Gujarati
Oriya script	Oriya
Tamil script	Tamil
Telugu script	Telugu
Kannada script	Kannada
Malayalam script	Malayalam

The following scripts, which are or were used in South Asia, are likewise derived from the Brahmi script but are not encoded in the ISCII-1988 standard, and their display deviates in part from the rules described above.

Script	Unicode block
Sinhala script	Sinhala
Tibetan script	Tibetan
Lepcha script	Lepcha
Phagspa script	Phags-pa
Limbu script	Limbu
Sylheti Nagari	Syloti Nagri
Kaithi script	Kaithi
Saurashtra script	Saurashtra
Sharada script	Sharada
Takri script	Takri
Chakma script	Chakma
Meitei Mayek	Meetei Mayek, Meetei Mayek Extensions
Sorang Sompeng	Sora Sompeng
Brahmi script	Brahmi

Scripts of the Brahmic family are also used outside South Asia:

Script	Unicode block
Thai script	Thai
Lao script	Lao
Burmese script	Myanmar, Myanmar Extended-A, Myanmar Extended-B
Khmer script	Khmer, Khmer Symbols
Lanna script	Tai Tham
Cham script	Cham
Baybayin	Tagalog
Hanunó'o	Hanunoo
Buhid script	Buhid
Tagbanwa script	Tagbanwa
Lontara	Buginese
Balinese script	Balinese
Javanese script	Javanese
Rejang script	Rejang
Batak script	Batak
Sundanese script	Sundanese, Sundanese Supplement

Two Indic scripts fall outside this framework. The first is Ol Chiki, an alphabet, encoded in the Unicode block Ol Chiki. The second is Kharoshthi, encoded in the Unicode block Kharoshthi, which is an abugida like the other scripts but is written from right to left.

Criticism

The Unicode encoding of the Tamil script has been criticized by a number of organizations, including the government of Tamil Nadu. As an alternative, the TACE-16 encoding was proposed, which encodes whole syllables rather than separate consonants and vowel signs. In particular, this encoding allows correct sorting without complex algorithms such as the Unicode Collation Algorithm. The Unicode standard was not changed, however, because such a change would violate Unicode's stability policies.

Sources

Julie D. Allen et al.: The Unicode Standard. Version 6.2 - Core Specification. The Unicode Consortium, Mountain View, CA, 2012. ISBN 978-1-936213-07-8. Chapter 9: South Asian Scripts-I (PDF; 2.0 MB), Chapter 10: South Asian Scripts-II (PDF; 724 kB), Chapter 11: Southeast Asian Scripts (PDF; 674 kB).

References

FAQ: Tamil Language and Script, accessed February 19, 2013.

Web links

FAQ: Indic Scripts and Languages (English)
Richard Ishida: An Introduction to Indic Scripts (English; PDF; 340 kB)
