Writing systems in Unicode
In Unicode, a writing system (script) is a group of characters that are used together to write text. In most cases, writing systems roughly correspond to Unicode blocks, but some writing systems are distributed over several blocks, and some blocks contain characters from different writing systems.
Writing systems are independent of languages. In some cases a writing system is used for exactly one language, but many writing systems are used to write several different languages: the Latin alphabet serves as the script for German, English, French, Vietnamese and many other languages. Conversely, a language can use several scripts. Turkish was formerly written in Arabic script, whereas today the Latin alphabet is used.
It is not always possible to determine clearly whether two scripts belong to a common writing system. Unicode treats the Japanese kanji as a variant of the Chinese characters and combines the two in the course of Han unification. The Coptic alphabet was originally regarded as an extension of the Greek alphabet and was only later encoded in Unicode as an independent writing system. A total of 135 different writing systems are encoded in Unicode 9.0.
Formal definition
The writing system to which a character belongs is formally determined by two properties. In most cases the Script property provides the necessary information; it gives the English name of the writing system. There are 139 different values in total. Three of these values have a special meaning:
- Unknown identifies characters whose writing system cannot be determined. In addition to code points that have not yet been assigned, this also applies to characters from the private use areas.
- Inherited (564 characters) mainly denotes combining characters. These are encoded by appearance, not by use: the acute accent, for example, is used with both Latin and Greek letters. When determining the writing system, such characters take on the value of the preceding character.
- Common (7279 characters) denotes characters that can be used in several writing systems. While some of these characters are only used in a few related writing systems, punctuation marks and symbols can be used with all writing systems.
There is also one value for each of the 135 writing systems and one more for Braille characters. Although Braille characters are considered symbols, they have their own value for the Script property.
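The lookup described above can be sketched in Python. The standard library does not expose the Script property, so this sketch assumes a hand-abridged sample in the `range ; value` line format of Scripts.txt; the names `SAMPLE_SCRIPTS_TXT`, `parse_script_ranges`, `script_of` and `resolve_scripts` are hypothetical, and a real lookup would need the complete data file.

```python
# Hand-abridged sample in the "range ; value" line format of Scripts.txt.
# Assumption: only these few ranges are listed; the real file is far longer.
SAMPLE_SCRIPTS_TXT = """\
0020          ; Common
0041..005A    ; Latin
0061..007A    ; Latin
0300..036F    ; Inherited
038E..03A1    ; Greek
03A3..03E1    ; Greek
"""


def parse_script_ranges(text):
    """Parse 'first..last ; value' lines into (first, last, value) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        span, value = (part.strip() for part in line.split(";"))
        first, _, last = span.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), value))
    return ranges


def script_of(ch, ranges):
    """Return the Script value of one character, or 'Unknown' if not listed."""
    cp = ord(ch)
    for first, last, value in ranges:
        if first <= cp <= last:
            return value
    # In this sketch 'Unknown' only means "not in the sample".
    return "Unknown"


def resolve_scripts(text, ranges):
    """Give each character a script; Inherited takes the preceding value."""
    resolved = []
    for ch in text:
        script = script_of(ch, ranges)
        if script == "Inherited" and resolved:
            script = resolved[-1]
        resolved.append(script)
    return resolved


RANGES = parse_script_ranges(SAMPLE_SCRIPTS_TXT)
```

With this sample, `resolve_scripts("e\u0301 \u03b1\u0301", RANGES)` assigns the combining acute accent to Latin after the Latin letter and to Greek after the Greek letter, while the space stays Common.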
In some cases, the Script_Extensions property provides more precise information about the writing system. For characters with the value Inherited or Common that are only used in a few writing systems, it lists these writing systems.
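How the two properties work together can be illustrated with a minimal Python sketch. The two dictionaries are a tiny hand-picked subset rather than the real data files, the long script names are used instead of the four-letter codes found in ScriptExtensions.txt, and the function name `scripts_used_with` is hypothetical.

```python
# Script values for four code points (tiny illustrative subset).
SCRIPT = {
    0x304B: "Hiragana",   # HIRAGANA LETTER KA
    0x30AB: "Katakana",   # KATAKANA LETTER KA
    0x30FC: "Common",     # KATAKANA-HIRAGANA PROLONGED SOUND MARK
    0x3099: "Inherited",  # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
}

# Script_Extensions values for the Common/Inherited characters above.
SCRIPT_EXTENSIONS = {
    0x30FC: {"Hiragana", "Katakana"},
    0x3099: {"Hiragana", "Katakana"},
}


def scripts_used_with(ch):
    """Return the set of scripts a character is used with.

    Falls back to the single Script value when no explicit
    Script_Extensions entry exists for the character.
    """
    cp = ord(ch)
    if cp in SCRIPT_EXTENSIONS:
        return set(SCRIPT_EXTENSIONS[cp])
    return {SCRIPT.get(cp, "Unknown")}
```

For the prolonged sound mark U+30FC the function returns both Hiragana and Katakana, although the Script value of that character is only Common.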
Usage
The Script property can be used in several ways. It can be used to recognize the script in which a text is written or to find words from a specific script in a document. For this purpose, some regular expression implementations allow Unicode properties to be used in patterns.
Another use is the defense against spoofing attacks. Using this property, a browser can recognize that the о in www.unicоde.org is not a Latin but a Cyrillic letter and warn the user of a URL spoofing attempt.
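A mixed-script check of this kind might look as follows in Python. This is only a sketch under stated assumptions: the range table is a hand-picked subset of the Script data, the function names are hypothetical, and real browsers apply considerably more elaborate rules than "one script per label".

```python
# Hand-picked subset of Script ranges (assumption: enough for this example).
RANGES = [
    (0x002D, 0x002E, "Common"),    # hyphen-minus, full stop
    (0x0030, 0x0039, "Common"),    # digits
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0400, 0x0481, "Cyrillic"),
]


def script_of(ch):
    """Return the Script value of a character from the sample table."""
    cp = ord(ch)
    for first, last, value in RANGES:
        if first <= cp <= last:
            return value
    return "Unknown"


def scripts_in_label(label):
    """Collect the scripts of a host name label, ignoring Common/Inherited."""
    return {script_of(ch) for ch in label} - {"Common", "Inherited"}


def looks_spoofed(host):
    """Flag a host name if any dot-separated label mixes several scripts."""
    return any(len(scripts_in_label(label)) > 1 for label in host.split("."))


genuine = "www.unicode.org"
spoofed = "www.unic\u043ede.org"  # U+043E CYRILLIC SMALL LETTER O

print(looks_spoofed(genuine), looks_spoofed(spoofed))
```

The check only catches labels that mix scripts; a label written entirely in Cyrillic look-alike letters would pass, which is one reason why this simple rule is not sufficient on its own.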
List
The following list names all writing systems that are represented in Unicode 9.0 with at least 100 characters.
- Writing system: gives the name of the writing system
- Script: the name under which the writing system is known in Unicode
- Type: classifies the writing systems according to their structure. Unicode distinguishes the following types: alphabet, abjad, syllabary, abugida, logography
- Number: specifies the number of characters assigned to this writing system, including the characters that are used in this writing system according to the Script_Extensions property. In that case the breakdown is also given in parentheses.
- Unicode: refers to further information about this writing system in connection with Unicode.
Writing system | Script | Type | Number | Unicode |
---|---|---|---|---|
Latin alphabet | Latin | alphabet | 1370 (1350 + 20) | Latin characters in Unicode |
Greek alphabet | Greek | alphabet | 522 (518 + 4) | Greek and Coptic in Unicode |
Coptic script | Coptic | alphabet | 165 (137 + 28) | |
Cyrillic alphabet | Cyrillic | alphabet | 450 (443 + 7) | Cyrillic and Glagolitic in Unicode |
Glagolitic script | Glagolitic | alphabet | 136 (132 + 4) | |
Hebrew alphabet | Hebrew | Abjad | 133 | Unicode block Hebrew |
Arabic script | Arabic | Abjad | 1335 (1279 + 56) | Arabic and Syriac in Unicode |
Devanagari | Devanagari | Abugida | 222 (154 + 68) | Indian scripts in Unicode |
Bengali script | Bengali | Abugida | 108 (93 + 15) | |
Gurmukhi script | Gurmukhi | Abugida | 103 (79 + 24) | |
Gujarati script | Gujarati | Abugida | 109 (85 + 24) | |
Telugu script | Telugu | Abugida | 101 (96 + 5) | |
Kannada script | Kannada | Abugida | 100 (88 + 12) | |
Malayalam script | Malayalam | Abugida | 119 (114 + 5) | |
Sinhala script | Sinhala | Abugida | 112 (110 + 2) | |
Tibetan script | Tibetan | Abugida | 207 | |
Burmese script | Myanmar | Abugida | 234 (223 + 11) | |
Khmer script | Khmer | Abugida | 146 | |
Balinese script | Balinese | Abugida | 121 | |
Lanna script | Tai_Tham | Abugida | 127 | |
Brahmi script | Brahmi | Abugida | 109 | |
Sharada script | Sharada | Abugida | 100 (94 + 6) | |
Grantha script | Grantha | Abugida | 115 (85 + 30) | |
Georgian alphabet | Georgian | alphabet | 129 (127 + 2) | |
Korean alphabet | Hangul | Syllabary | 11775 (11739 + 36) | East Asian scripts in Unicode |
Hiragana | Hiragana | Syllabary | 143 (91 + 52) | |
Katakana | Katakana | Syllabary | 352 (300 + 52) | |
Zhuyin | Bopomofo | Syllabary | 110 (70 + 40) | |
Chinese characters | Han | Logography | 82013 (81734 + 279) | |
Yi script | Yi | Syllabary | 1246 (1220 + 26) | |
Tangut script | Tangut | Logography | 6881 | |
Ethiopian script | Ethiopic | Syllabary | 495 | |
Cherokee syllabary | Cherokee | Syllabary | 172 | |
Cree script | Canadian_Aboriginal | Syllabary | 710 | |
Mongolian script | Mongolian | alphabet | 169 (166 + 3) | |
Linear B | Linear_B | Syllabary | 268 (211 + 57) | Historical scripts in Unicode |
Linear A | Linear_A | Logography | 386 (341 + 45) | |
Cypriot script | Cypriot | Syllabary | 112 (55 + 57) | |
Cuneiform | Cuneiform | Logography | 1234 | |
Egyptian hieroglyphics | Egyptian_Hieroglyphs | Logography | 1071 | |
Braille | Braille | (Notation system) | 256 | Symbols in Unicode |
Vai script | Vai | Syllabary | 300 | |
Bamun script | Bamum | Syllabary | 657 | |
Pollard script | Miao | Syllabary | 133 | |
Duployé shorthand | Duployan | (Notation system) | 147 (143 + 4) | |
Pahawh Hmong | Pahawh_Hmong | alphabet | 127 | |
Mende script | Mende_Kikakui | Syllabary | 213 | |
Luwian hieroglyphs | Anatolian_Hieroglyphs | Logography | 583 | |
Old Hungarian script | Old_Hungarian | alphabet | 108 | |
SignWriting | SignWriting | (Notation system) | 672 | |
Sources
- Mark Davis, Ken Whistler: Unicode Standard Annex #24: Unicode Script Property. (online)
- Julie D. Allen et al.: The Unicode Standard. Version 6.2 - Core Specification. The Unicode Consortium, Mountain View, CA, 2012. ISBN 978-1-936213-07-8. Chapter 6.1: Writing Systems. (online, PDF)
- Scripts.txt, ScriptExtensions.txt (Unicode 9.0)
Web links
- Supported Scripts - all writing systems in Unicode with the time of their inclusion (English)
- Code Charts - all Unicode blocks, grouped according to writing systems (English)