Writing systems in Unicode

from Wikipedia, the free encyclopedia

In Unicode, a writing system (English: script) is a group of characters that are used together as a script. In most cases the writing systems roughly correspond to the Unicode blocks, but some writing systems are spread over several blocks, and some blocks contain characters from different writing systems.

Writing systems are independent of languages. Although in some cases a writing system is used for only one language, many writing systems are used to write several different languages: the Latin alphabet serves as the script for German, English, French, Vietnamese and many other languages. Conversely, a language can use several scripts. Turkish was formerly written in Arabic script, whereas today the Latin alphabet is used.

It is not always possible to determine clearly whether two scripts belong to a common writing system. Unicode treats the Japanese kanji as a mere variant of the Chinese characters and merges the two as part of Han unification. The Coptic alphabet was originally regarded as an extension of the Greek alphabet and was only later encoded in Unicode as an independent writing system. A total of 135 different writing systems are encoded in Unicode 9.0.

Formal definition

The writing system to which a character belongs is formally determined by two properties. In most cases the Script property provides the necessary information; its value is the English name of the writing system. There are 139 different values in total. Three of these values have a special meaning:

  • Unknown identifies characters whose writing system cannot be determined. Besides code points that have not yet been assigned, this also applies to characters in the private use areas.
  • Inherited (564 characters) mainly denotes combining characters. These are encoded by appearance, not by use: the acute accent, for example, is used with both Latin and Greek letters. When the writing system is determined, such characters take on the value of the preceding character.
  • Common (7279 characters), finally, denotes characters that can be used in several writing systems. While some of these characters are used only in a few related writing systems, punctuation marks and symbols can be used with all writing systems.

In addition, there is one value for each of the 135 writing systems and one more for Braille characters. Although Braille characters are regarded as symbols, they have their own value of the Script property.

In some cases the Script_Extensions property gives more precise information about the writing system. For characters with the value Inherited or Common that are used in only a few writing systems, it lists those writing systems.
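The lookup described in this section can be sketched in Python. Python's standard unicodedata module does not expose the Script property, so this sketch parses a few lines written in the format of the Scripts.txt data file. The excerpt is hand-copied with comments trimmed and is an assumption for illustration, not the full file. The function names are hypothetical, and anything outside the excerpt is reported as Unknown here only because the excerpt is incomplete.

```python
import bisect

# Hand-copied excerpt in the Scripts.txt line format (assumption: NOT the full file).
SCRIPTS_EXCERPT = """\
0020          ; Common
0041..005A    ; Latin
0061..007A    ; Latin
0300..036F    ; Inherited
03A3..03E1    ; Greek
0400..0481    ; Cyrillic
"""

def parse_scripts(text):
    """Parse 'XXXX..YYYY ; Script' lines into sorted (first, last, script) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        code_points, script = (field.strip() for field in line.split(";"))
        first, _, last = code_points.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), script))
    ranges.sort()
    return ranges

RANGES = parse_scripts(SCRIPTS_EXCERPT)
STARTS = [first for first, _, _ in RANGES]

def script_of(ch):
    """Return the Script value of one character according to the excerpt."""
    cp = ord(ch)
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0 and RANGES[i][0] <= cp <= RANGES[i][1]:
        return RANGES[i][2]
    return "Unknown"  # in this sketch: simply not covered by the excerpt

def resolved_scripts(text):
    """Per-character scripts, with Inherited taking the preceding character's value."""
    result = []
    for ch in text:
        script = script_of(ch)
        if script == "Inherited" and result:
            script = result[-1]
        result.append(script)
    return result

print(script_of("\u0301"))               # combining acute accent on its own
print(resolved_scripts("a\u0301"))       # Latin letter + combining acute accent
print(resolved_scripts("\u03b1\u0301"))  # Greek letter + combining acute accent
```

The same combining acute accent resolves to Latin after a Latin letter and to Greek after a Greek letter, which is the behaviour described for Inherited above.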

Use

The Script property can be used in several ways. It can be used to recognize the script in which a text is written, or to find words in a specific script within a document. For this purpose, some regular expression implementations allow Unicode properties to be used in patterns.
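Python's built-in re module is not one of the implementations with Unicode script properties, so the following sketch emulates a script class by building an explicit character class from code point ranges. The ranges are a partial, hand-picked subset (an assumption, not the complete Script data), and the function names are hypothetical.

```python
import re

# Partial, hand-picked code point ranges (assumption: not the complete Script data).
SCRIPT_RANGES = {
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x00D6),
              (0x00D8, 0x00F6), (0x00F8, 0x02B8)],
    "Cyrillic": [(0x0400, 0x0481), (0x048A, 0x052F)],
}

def script_class(script):
    """Build a regex character class covering the listed ranges of one script."""
    parts = [f"\\u{lo:04X}-\\u{hi:04X}" for lo, hi in SCRIPT_RANGES[script]]
    return "[" + "".join(parts) + "]"

def words_in_script(text, script):
    """Find maximal runs of characters that belong to the given script."""
    return re.findall(script_class(script) + "+", text)

def dominant_script(text):
    """Guess the script of a text by counting characters per script."""
    counts = {s: len(re.findall(script_class(s), text)) for s in SCRIPT_RANGES}
    return max(counts, key=counts.get)

sample = "Unicode поддерживает many письменности"
print(words_in_script(sample, "Cyrillic"))
print(words_in_script(sample, "Latin"))
print(dominant_script(sample))
```

In an engine that does support Unicode properties, the hand-built class would be replaced by the engine's own script syntax.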

Another use is defence against spoofing attacks. Using this property, a browser can recognize that the о in www.unicоde.org is not a Latin but a Cyrillic letter and warn the user of a URL spoofing attempt.
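A minimal sketch of such a mixed-script check follows. Because Python's standard library does not expose the Script property, the sketch approximates it by the first word of each letter's Unicode character name (for example LATIN or CYRILLIC). This heuristic is an assumption that works for Latin, Greek and Cyrillic letters but is not a general substitute for the real property. The function names are hypothetical, and real browsers apply considerably more elaborate rules.

```python
import unicodedata

def letter_scripts(label):
    """Approximate set of scripts used by the letters of one hostname label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Heuristic stand-in for the Script property: first word of the name.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def suspicious_labels(hostname):
    """Return the labels of a hostname whose letters mix more than one script."""
    return [label for label in hostname.split(".")
            if len(letter_scripts(label)) > 1]

spoofed = "www.unic\u043ede.org"  # U+043E CYRILLIC SMALL LETTER O in place of Latin o
genuine = "www.unicode.org"

print(suspicious_labels(spoofed))  # the label containing the Cyrillic letter
print(suspicious_labels(genuine))  # no mixed-script labels
```

Checking each label separately rather than the whole hostname avoids flagging names that legitimately combine, say, a Cyrillic label with a Latin top-level domain.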

List

The following list names all writing systems that are represented in Unicode 9.0 with at least 100 characters.

Script name
gives the common name of the script
Script
is the name under which the writing system is known in Unicode
Type
classifies the writing systems by the type of their structure. Unicode distinguishes the following types: alphabet, abjad, syllabary, abugida, logography
Count
gives the number of characters assigned to this writing system, including characters that are used in this writing system according to the Script_Extensions property. In that case the breakdown is also given in parentheses.
Unicode
refers to further information about this script in connection with Unicode.

Script name | Script | Type | Count | Unicode
----------- | ------ | ---- | ----- | -------
Latin alphabet | Latin | Alphabet | 1370 (1350 + 20) | Latin characters in Unicode
Greek alphabet | Greek | Alphabet | 522 (518 + 4) | Greek and Coptic in Unicode
Coptic script | Coptic | Alphabet | 165 (137 + 28) |
Cyrillic alphabet | Cyrillic | Alphabet | 450 (443 + 7) | Cyrillic and Glagolitic in Unicode
Glagolitic script | Glagolitic | Alphabet | 136 (132 + 4) |
Hebrew alphabet | Hebrew | Abjad | 133 | Unicode block Hebrew
Arabic script | Arabic | Abjad | 1335 (1279 + 56) | Arabic and Syriac in Unicode
Devanagari | Devanagari | Abugida | 212 (154 + 68) | Indian scripts in Unicode
Bengali script | Bengali | Abugida | 108 (93 + 15) |
Gurmukhi script | Gurmukhi | Abugida | 103 (79 + 24) |
Gujarati script | Gujarati | Abugida | 109 (85 + 24) |
Telugu script | Telugu | Abugida | 101 (96 + 5) |
Kannada script | Kannada | Abugida | 100 (88 + 12) |
Malayalam script | Malayalam | Abugida | 119 (114 + 5) |
Sinhala script | Sinhala | Abugida | 112 (110 + 2) |
Tibetan script | Tibetan | Abugida | 207 |
Burmese script | Myanmar | Abugida | 234 (223 + 11) |
Khmer script | Khmer | Abugida | 146 |
Balinese script | Balinese | Abugida | 121 |
Lanna script | Tai_Tham | Abugida | 127 |
Brahmi script | Brahmi | Abugida | 109 |
Sharada script | Sharada | Abugida | 100 (94 + 6) |
Grantha script | Grantha | Abugida | 115 (85 + 30) |
Georgian alphabet | Georgian | Alphabet | 129 (127 + 2) |
Korean alphabet | Hangul | Syllabary | 11775 (11739 + 36) | East Asian scripts in Unicode
Hiragana | Hiragana | Syllabary | 143 (91 + 52) |
Katakana | Katakana | Syllabary | 352 (300 + 52) |
Zhuyin | Bopomofo | Syllabary | 110 (70 + 40) |
Chinese characters | Han | Logography | 82013 (81734 + 279) |
Yi script | Yi | Syllabary | 1246 (1220 + 26) |
Tangut script | Tangut | Logography | 6881 |
Ethiopic script | Ethiopic | Syllabary | 495 |
Cherokee syllabary | Cherokee | Syllabary | 172 |
Cree script | Canadian_Aboriginal | Syllabary | 710 |
Mongolian script | Mongolian | Alphabet | 169 (166 + 3) |
Linear B | Linear_B | Syllabary | 268 (211 + 57) | Historical scripts in Unicode
Linear A | Linear_A | Logography | 386 (341 + 45) |
Cypriot script | Cypriot | Syllabary | 112 (55 + 57) |
Cuneiform | Cuneiform | Logography | 1234 |
Egyptian hieroglyphs | Egyptian_Hieroglyphs | Logography | 1071 |
Braille | Braille | (Notation system) | 256 | Symbols in Unicode
Vai script | Vai | Syllabary | 300 |
Bamum script | Bamum | Syllabary | 657 |
Pollard script | Miao | Syllabary | 133 |
Duployan shorthand | Duployan | (Notation system) | 147 (143 + 4) |
Pahawh Hmong | Pahawh_Hmong | Alphabet | 127 |
Mende script | Mende_Kikakui | Syllabary | 213 |
Luwian hieroglyphs | Anatolian_Hieroglyphs | Logography | 583 |
Old Hungarian script | Old_Hungarian | Alphabet | 108 |
SignWriting | SignWriting | (Notation system) | 672 |

Sources

  • Mark Davis, Ken Whistler: Unicode Standard Annex #24: Unicode Script Property. (online)
  • Julie D. Allen et al.: The Unicode Standard. Version 6.2 - Core Specification. The Unicode Consortium, Mountain View, CA, 2012. ISBN 978-1-936213-07-8. Chapter 6.1: Writing Systems. (online, PDF)
  • Scripts.txt, ScriptExtensions.txt (Unicode 9.0)

Web links

  • Supported Scripts - all writing systems in Unicode, with the version in which each was added
  • Code Charts - all Unicode blocks, grouped by writing system