Arabic and Syriac in Unicode

from Wikipedia, the free encyclopedia

The characters for Arabic and Syriac are encoded in eight different Unicode blocks. In addition to the individual characters, the Unicode standard also defines a number of algorithms for the correct display of Arabic and Syriac text.

Coded characters

The most important characters for Arabic are in the Arabic Unicode block. In addition to the letters of the Arabic alphabet, which correspond to ISO 8859-6 in scope and arrangement, it contains digits, special characters, and some punctuation marks that differ considerably from those used with the Latin script. Even though a letter can take different shapes depending on its position in the word, this block contains only one character for all variants.

The Arabic alphabet is also used for other languages, which add a few more characters to it. For example, the Persian alphabet has four additional letters. Such letters are found in the Arabic block itself (including the four Persian letters) and, together with characters that are no longer in use, in the Arabic Supplement and Arabic Extended-A blocks.

The two blocks Arabic Presentation Forms-A and Arabic Presentation Forms-B contain presentation variants and ligatures, mainly for compatibility with other standards.
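The relationship between a presentation form and its nominal letter can be inspected with Python's standard `unicodedata` module. The following is only a minimal sketch: the helper name `presentation_info` is hypothetical, and it relies on the compatibility decomposition recorded in the Unicode Character Database, which tags each form as `<isolated>`, `<final>`, `<initial>` or `<medial>`.

```python
import unicodedata

def presentation_info(ch):
    """Return (form_tag, nominal_text) for a compatibility presentation form.

    For a character without a tagged compatibility decomposition,
    return (None, ch) unchanged. Hypothetical helper for illustration.
    """
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        tag, *code_points = decomp.split()
        nominal = "".join(chr(int(cp, 16)) for cp in code_points)
        return tag.strip("<>"), nominal
    return None, ch

# U+FE91 is ARABIC LETTER BEH INITIAL FORM; its nominal letter is U+0628.
print(presentation_info("\uFE91"))
# NFKC normalization folds presentation forms back to nominal letters.
print(unicodedata.normalize("NFKC", "\uFE8F\uFE90") == "\u0628\u0628")
```

Normalizing to NFKC is one way to map text that uses presentation forms back onto the single coded character per letter; whether that is appropriate depends on the application.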

Finally, the Unicode block Arabic Mathematical Alphabetic Symbols contains Arabic letters for use in mathematical formulas.

The letters of the Syriac alphabet are in the Unicode block Syriac. In contrast to Arabic, no characters here are encoded multiple times in different presentation forms.
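As a rough illustration, the blocks named above can be written down as code point ranges and used to classify a character. This is a sketch only: it lists the seven blocks mentioned by name in this section, the ranges are typed in by hand rather than read from the Unicode data file `Blocks.txt`, and `block_of` is a hypothetical helper.

```python
# (start, end inclusive, block name) for the blocks named in the text.
BLOCKS = [
    (0x0600, 0x06FF, "Arabic"),
    (0x0700, 0x074F, "Syriac"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x08A0, 0x08FF, "Arabic Extended-A"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
    (0x1EE00, 0x1EEFF, "Arabic Mathematical Alphabetic Symbols"),
]

def block_of(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return None

print(block_of("\u0628"))  # ARABIC LETTER BEH
print(block_of("\u0710"))  # SYRIAC LETTER ALAPH
```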

In addition to these characters, the bidirectional control characters and the zero-width joiner and zero-width non-joiner play a role in digital Arabic and Syriac typography.

Writing direction

Arabic and Syriac are written from right to left; only numbers, regardless of the digits used, are written from left to right. Some punctuation marks, such as brackets, are displayed mirrored relative to their usual form. For correct display, the Unicode standard provides the Unicode bidirectional (bidi) algorithm, as it does for other right-to-left scripts.
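The character properties that the bidi algorithm works from can be queried with Python's standard `unicodedata` module. The sketch below does not implement the algorithm itself; it only looks up the bidirectional class and the mirroring flag of a few sample characters. The helper name `bidi_classes` is hypothetical.

```python
import unicodedata

def bidi_classes(text):
    """Return the Bidi_Class value of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Arabic letter BEH, Syriac letter ALAPH: class AL (Arabic letter, right to left).
print(bidi_classes("\u0628\u0710"))
# Arabic-Indic digit one: class AN; European digit one: class EN.
print(bidi_classes("\u06611"))
# Brackets carry the mirrored flag, letters do not.
print(unicodedata.mirrored("("), unicodedata.mirrored("\u0628"))
```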

Context-dependent letter forms

Different forms of Arabic letters:
iii) isolated form
iv) form connected to the right (final form)
v) form connected on both sides (medial form)
vi) form connected to the left (initial form)

Depending on its position in the word, an Arabic letter can appear in up to four different forms: as an isolated letter (e.g. in character tables), at the beginning of a word, where it connects to the following letter on its left, at the end of a word, where it connects to the preceding letter on its right, and in the middle of a word, where it connects to both neighbors. A font must therefore provide up to four different glyphs for a single character. The correct glyph is selected from the context by the following algorithm.

For this purpose, Unicode assigns each character a Joining_Type property. This property indicates whether and in which direction the character connects to its neighboring characters. There are six different values:

  • R for characters such as Alif or Dāl, which connect only to the right
  • L for characters that connect only to the left. No Arabic character has this value, but it is used in the Phags-pa script and for Manichaean.
  • D for characters such as Ba or Ta, which connect on both sides
  • C for characters such as the kashida (tatweel) or the zero-width joiner, which also cause a connection on both sides but remain unchanged themselves
  • U for characters that do not connect with their neighbors, such as all Latin letters or the zero-width non-joiner
  • T for characters such as combining characters, which are ignored when the algorithm is applied

This property is used to determine the form in which a character should be displayed according to a set of rules:

Characters of type R that are preceded by a character of type L, D or C (skipping any characters of type T) are shown in the form connected to the right. Analogously, characters of type L that are followed by a character of type R, D or C (again skipping characters of type T) are shown in the form connected to the left.

For characters of type D, both of these rules are applied. If there are suitable characters on both sides, the form connected on both sides is selected. If there is such a character on only one side, the form connected on that side is selected.

If none of the rules apply, the character is displayed in the unconnected form.
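The rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a complete shaping engine: Python's `unicodedata` module does not expose Joining_Type, so the small `JOINING_TYPE` table below is typed in by hand for a handful of characters (real implementations read the values from the Unicode data file `ArabicShaping.txt`), nonspacing marks are treated as type T, and everything else defaults to U. The names `joining_type`, `_neighbor` and `contextual_forms` are hypothetical.

```python
import unicodedata

# Hand-written excerpt of Joining_Type values (assumption: illustration only).
JOINING_TYPE = {
    "\u0627": "R",  # ALEF
    "\u062F": "R",  # DAL
    "\u0628": "D",  # BEH
    "\u062A": "D",  # TEH
    "\u0644": "D",  # LAM
    "\u0640": "C",  # TATWEEL (kashida)
    "\u200D": "C",  # ZERO WIDTH JOINER
    "\u200C": "U",  # ZERO WIDTH NON-JOINER
}

def joining_type(ch):
    """Look up the joining type; nonspacing marks count as T, the rest as U."""
    if ch in JOINING_TYPE:
        return JOINING_TYPE[ch]
    return "T" if unicodedata.category(ch) == "Mn" else "U"

def _neighbor(types, i, step):
    """Type of the nearest non-T character before (step=-1) or after (step=+1)."""
    j = i + step
    while 0 <= j < len(types) and types[j] == "T":
        j += step
    return types[j] if 0 <= j < len(types) else None

def contextual_forms(text):
    """Return one form name per character, with text in logical order."""
    types = [joining_type(ch) for ch in text]
    forms = []
    for i, t in enumerate(types):
        if t == "T":
            forms.append("transparent")
            continue
        # Joins the preceding character, which sits on its right.
        joins_prev = t in ("R", "D") and _neighbor(types, i, -1) in ("L", "D", "C")
        # Joins the following character, which sits on its left.
        joins_next = t in ("L", "D") and _neighbor(types, i, +1) in ("R", "D", "C")
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

# BEH, ALEF, BEH: the last BEH stays isolated because ALEF joins only to the right.
print(contextual_forms("\u0628\u0627\u0628"))
```

In this sketch the form names follow the usual terminology: "final" is the form connected to the right, "initial" the form connected to the left, and "medial" the form connected on both sides. Characters of type C and U never satisfy either rule and are therefore reported as "isolated", meaning simply that their shape does not change.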

This algorithm is also used for the Syriac script, with special additional rules applying to the Syriac letter Alaph.

Other writing systems in which this algorithm is used are N'Ko, Mongolian, Phags-pa, Manichaean and Psalter Pahlavi.

Ligatures

Another peculiarity of Arabic and Syriac is that certain ligatures differ significantly in appearance from the individual letters that make them up.

For the correct display of ligatures, the Unicode standard defines a further property, Joining_Group. It can take various values, each named after the base letter of the group. Thus Lām and all letters derived from it have the value Lam. If such a character is followed by a letter from the group Alef (to which Alif and characters derived from it belong), the two characters are represented by the Lām-Alif ligature.
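A simplified way to picture this rule is a lookup of Joining_Group values followed by a scan for a Lam character directly before an Alef character. The sketch below makes several assumptions: Python's `unicodedata` module does not expose Joining_Group, so the `JOINING_GROUP` table is a hand-written excerpt, only directly adjacent characters are considered, and `lam_alef_positions` is a hypothetical helper. The last line uses the compatibility decomposition of U+FEFB, which the Unicode Character Database does record, to show the two letters behind the isolated ligature.

```python
import unicodedata

# Hand-written excerpt of Joining_Group values (assumption: illustration only).
JOINING_GROUP = {
    "\u0644": "LAM",   # LAM
    "\u0627": "ALEF",  # ALEF
    "\u0622": "ALEF",  # ALEF WITH MADDA ABOVE
    "\u0623": "ALEF",  # ALEF WITH HAMZA ABOVE
    "\u0625": "ALEF",  # ALEF WITH HAMZA BELOW
}

def lam_alef_positions(text):
    """Return the indices of each Lam that is directly followed by an Alef."""
    return [
        i
        for i in range(len(text) - 1)
        if JOINING_GROUP.get(text[i]) == "LAM"
        and JOINING_GROUP.get(text[i + 1]) == "ALEF"
    ]

# SEEN, LAM, ALEF, MEEM: the ligature candidate starts at index 1.
print(lam_alef_positions("\u0633\u0644\u0627\u0645"))
# U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM decomposes to LAM + ALEF.
print(unicodedata.decomposition("\uFEFB"))
```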

Other special features

Syriac abbreviation mark

Some characters require special rendering, e.g. U+06DD, the end-of-āya sign. This character encloses all digits that immediately follow it. To recognize a character as a digit, computer systems can rely on the character's general category. The same applies to the characters at code points U+0600 to U+0603, which extend beneath the digits that follow them and mark general numbers, years, footnotes and page numbers. In Syriac there is the Syriac abbreviation mark (U+070F), which indicates the beginning of an abbreviation; the abbreviation is then marked with an overline decorated with individual dots. The accompanying example shows the first four letters of the Syriac alphabet, the last three of which are spanned by the Syriac abbreviation mark.
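The use of the general category to find the digits that such a sign applies to can be sketched as follows. This is a minimal illustration, not rendering code: the set `PREFIX_SIGNS` and the helper `spanned_digits` are hypothetical names, and a digit is taken to be any character whose general category is `Nd` (decimal number).

```python
import unicodedata

# U+0600..U+0603 and U+06DD, the signs discussed in the text.
PREFIX_SIGNS = {"\u0600", "\u0601", "\u0602", "\u0603", "\u06DD"}

def spanned_digits(text, i):
    """Return the run of digits directly after the sign at index i.

    Returns an empty string if text[i] is not one of the listed signs.
    """
    if text[i] not in PREFIX_SIGNS:
        return ""
    j = i + 1
    while j < len(text) and unicodedata.category(text[j]) == "Nd":
        j += 1
    return text[i + 1:j]

# End-of-aya sign followed by Arabic-Indic digits 1 and 2, a space, then digit 3.
sample = "\u06DD\u0661\u0662 \u0663"
print(spanned_digits(sample, 0) == "\u0661\u0662")
# The sign itself has general category Cf (format character).
print(unicodedata.category("\u06DD"))
```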

Sources

  • Julie D. Allen et al.: The Unicode Standard. Version 6.2 - Core Specification. The Unicode Consortium, Mountain View, CA, 2012. ISBN 978-1-936213-07-8. Chapter 8.2: Arabic, Chapter 8.3: Syriac. (online, PDF)