COMBINING CHARACTER

from Wikipedia, the free encyclopedia
y with breve

Combining characters (also called combining marks) are special characters in digital typography that are normally not displayed on their own but are combined with the preceding character to form a single glyph. They are mainly used to create arbitrary combinations of letters and diacritical marks. For example, the lowercase letter y followed by the combining breve (U+0306) yields y̆, a character that cannot be represented in Unicode without combining characters. Conceptually, combining characters are therefore comparable to the dead keys on a keyboard.

Formal basics

The scope and use of combining characters differ between character encodings. ISO 6937 defines a number of combining characters for diacritical marks but allows only certain combinations. For complete display, it is therefore sufficient for the font to provide a dedicated glyph for each of these combinations. Alternatively, the encoding can be understood as one in which simple letters are represented by a single code point, while letters with diacritical marks are represented by a sequence of two code points. In this standard, contrary to the usual behavior, the combining character is placed before the letter it combines with.

Combining characters are not used only for diacritical marks; the ISCII-1988 encodings for various Indian scripts also use combining characters for vowel signs.

Unicode offers the most extensive collection of combining characters, along with a number of rules for rendering them. Unicode allows any combination of base characters and combining characters, and several combining characters may follow one base character. For display it is therefore not enough for the font to contain a few additional glyphs; instead, information about the dimensions of the individual glyphs is needed in order to position the combining character on the base character. This is provided, for example, by OpenType.

In the Unicode standard, combining characters are identified by the general category M (Mark). This category is divided into three subclasses: Nonspacing Mark (Mn) for combining characters that usually do not take up space of their own (e.g. diacritical marks), Enclosing Mark (Me) for combining characters that completely enclose the base character, and Spacing Combining Mark (Mc) for combining characters that take up their own space (e.g. Indic vowel signs).
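The three subclasses can be inspected programmatically. The following is a minimal sketch using Python's standard unicodedata module; the helper name mark_kind and the choice of sample characters are illustrative assumptions, not part of the standard.

```python
import unicodedata

def mark_kind(ch: str) -> str:
    """Return a label for the mark subclass of ch (hypothetical helper)."""
    labels = {
        "Mn": "Nonspacing Mark",
        "Mc": "Spacing Combining Mark",
        "Me": "Enclosing Mark",
    }
    # unicodedata.category returns the two-letter general category.
    return labels.get(unicodedata.category(ch), "not a mark")

# Sample characters: combining breve, a Devanagari vowel sign,
# combining enclosing circle, and an ordinary letter.
for ch in ("\u0306", "\u093E", "\u20DD", "y"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mark_kind(ch)}")
```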

In addition, each character is assigned a Combining Class property. This is an integer between 0 and 255 that essentially indicates the position at which the combining character attaches to the base character. For example, combining characters that are placed above the base character have the value 230, and those placed below it have the value 220. For normal, non-combining characters the value is always 0, but some combining characters also have this value.
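As an illustration, the combining class can be read with unicodedata.combining from Python's standard library. This is only a sketch; the dictionary name samples and the selection of characters are assumptions made for the example.

```python
import unicodedata

# Sample characters (chosen for illustration): a mark above, a mark below,
# a spacing combining mark, and a normal letter.
samples = {
    "COMBINING CIRCUMFLEX ACCENT": "\u0302",
    "COMBINING DOT BELOW": "\u0323",
    "DEVANAGARI VOWEL SIGN AA": "\u093E",
    "LATIN SMALL LETTER A": "a",
}

for name, ch in samples.items():
    # unicodedata.combining returns the canonical combining class as an int.
    print(f"{name}: {unicodedata.combining(ch)}")
```

The Devanagari vowel sign is a combining character (category Mc) whose combining class is nevertheless 0, like that of the ordinary letter.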

Presentation

The Unicode standard makes only a few binding statements about how programs should display strings that contain combining characters. It does, however, make the following recommendations:

A Greek alpha with smooth breathing (spiritus lenis) and grave accent
  • If several combining characters follow a base character, they should be stacked one after the other from the inside out. For example, the sequence <Latin small letter a U+0061, combining circumflex U+0302, combining tilde U+0303> yields an a with a tilde above the circumflex (ẫ), whereas in <Latin small letter a U+0061, combining tilde U+0303, combining circumflex U+0302> the circumflex is above the tilde (ã̂). An important exception to this principle is accents in Greek: in the sequence <Greek small letter alpha U+03B1, combining comma above U+0313, combining grave accent U+0300> the grave accent should be placed not above the comma but next to it, after it (ἂ). A deviation from the usual stacking can also be forced with the special combining character Combining Grapheme Joiner.
  • If several combining characters follow one another that attach to the base character in different places (e.g. above and below; more precisely, this depends on the Combining Class property), their order must not matter: the result must look the same in either case. Thus <Latin small letter a U+0061, combining dot above U+0307, combining dot below U+0323> and <Latin small letter a U+0061, combining dot below U+0323, combining dot above U+0307> both yield an a with a dot above and a dot below (ạ̇).
  • Where typographic tradition places the diacritical mark in a different position, this is permitted. For example, a comma below a g is usually rendered as a turned comma above the g.
  • The dots of i, j, and other characters with the Soft_Dotted property are removed when a combining character is placed above them.
  • Ideally, a program positions combining characters according to the exact shape of the base letter, so an accent above a capital letter will normally sit higher than one above a lowercase letter. However, the standard makes clear that simply placing the mark at the same position is always acceptable.
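The first two recommendations have a counterpart at the encoding level: marks that attach in the same place are order-sensitive, while marks that attach in different places are not. The following sketch checks this by comparing strings after Unicode normalization (form NFD) with Python's standard unicodedata module. The helper name canonically_equal is an assumption made for the example, and the check concerns encoded sequences, not how any particular renderer draws them.

```python
import unicodedata

def canonically_equal(s: str, t: str) -> bool:
    """True if s and t have the same canonical decomposition (hypothetical helper)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

# Circumflex and tilde both attach above (combining class 230),
# so their order is significant.
circumflex_then_tilde = "a\u0302\u0303"
tilde_then_circumflex = "a\u0303\u0302"
print(canonically_equal(circumflex_then_tilde, tilde_then_circumflex))  # False

# Dot above (230) and dot below (220) attach in different places,
# so their order is not significant.
above_then_below = "a\u0307\u0323"
below_then_above = "a\u0323\u0307"
print(canonically_equal(above_then_below, below_then_above))  # True
```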

Unicode has special, extensive rules for rendering combining characters in the Indic scripts.

In some cases, a diacritical mark is needed that spans two or more base characters. There are two techniques for this:

First, there are so-called double combining characters, which extend not only over the preceding base character, as normal combining characters do, but also over the character that follows the double combining character. For example, <Latin small letter n U+006E, combining double tilde U+0360, Latin small letter g U+0067> yields n͠g, an ng spanned by a tilde.

Second, there are special combining half marks. The first half follows the first base character, and the second half follows the second. An ng with a tilde can thus also be represented by <Latin small letter n U+006E, combining double tilde left half U+FE22, Latin small letter g U+0067, combining double tilde right half U+FE23>, which gives n︢g︣.

To display a combining character on its own, it should be preceded by a no-break space (U+00A0). The earlier recommendation to use a normal space was dropped because of problems with the processing of such spaces in XML and in other contexts. For many diacritical marks there are also non-combining variants in the Unicode block Spacing Modifier Letters. In technical documentation, combining characters are often shown on a dotted circle (◌, U+25CC), which indicates the position at which the combining character attaches to the base character.
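A minimal sketch of these three ways of showing a mark in isolation, in Python. The constant names and the helper isolated are assumptions for the example; whether the output looks right depends on the font and renderer.

```python
NO_BREAK_SPACE = "\u00A0"
DOTTED_CIRCLE = "\u25CC"

def isolated(mark: str, carrier: str = NO_BREAK_SPACE) -> str:
    """Place a combining mark on a carrier character (hypothetical helper)."""
    return carrier + mark

tilde = "\u0303"  # COMBINING TILDE

print(isolated(tilde))                 # on a no-break space
print(isolated(tilde, DOTTED_CIRCLE))  # on a dotted circle, as in documentation
print("\u02DC")                        # SMALL TILDE, the non-combining variant
```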

Ambiguous representations

The concept of combining characters means that some characters can be encoded in several different ways. There are two reasons for this:

Two different representations for ñ, an n with a tilde

First, many common combinations of a base character and a diacritical mark have a separate, precomposed character. An ñ can be represented as <Latin small letter n U+006E, combining tilde U+0303>, but there is also a separate Latin small letter n with tilde at code point U+00F1.

Second, sequences of combining characters that do not interact with each other produce the same character regardless of their order.

Overall, the number of different representations can therefore be very large. For ậ, the small a with a circumflex and a dot below, there are the following possible representations:

  • <Latin small letter a with circumflex and dot below U+1EAD>
  • <Latin small letter a with circumflex U+00E2, combining dot below U+0323>
  • <Latin small letter a with dot below U+1EA1, combining circumflex U+0302>
  • <Latin small letter a U+0061, combining circumflex U+0302, combining dot below U+0323>
  • <Latin small letter a U+0061, combining dot below U+0323, combining circumflex U+0302>
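The five sequences above can be written down directly as Python strings. The sketch below, which uses only the standard unicodedata module, shows that they are five different code point sequences and that they collapse to a single string once normalized. The variable names are chosen for illustration.

```python
import unicodedata

representations = [
    "\u1EAD",         # a with circumflex and dot below
    "\u00E2\u0323",   # a with circumflex + combining dot below
    "\u1EA1\u0302",   # a with dot below + combining circumflex
    "a\u0302\u0323",  # a + combining circumflex + combining dot below
    "a\u0323\u0302",  # a + combining dot below + combining circumflex
]

# As raw strings, all five are different and have different lengths.
print(len(set(representations)))          # 5
print([len(s) for s in representations])  # [1, 2, 2, 3, 3]

# After normalization, only one form remains.
nfd_forms = {unicodedata.normalize("NFD", s) for s in representations}
nfc_forms = {unicodedata.normalize("NFC", s) for s in representations}
print(nfd_forms == {"a\u0323\u0302"})  # True
print(nfc_forms == {"\u1EAD"})         # True
```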

To obtain a unique representation (e.g. to determine whether two words are the same), there are several normalization forms. For this purpose, the standard specifies for each character whether it can be decomposed into a base character and combining characters, and if so, how. First, all characters are decomposed in the specified way; then sequences of combining characters that do not interact with each other are sorted by their Combining_Class property. The result is the canonical decomposition (NFD).
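The two steps can be sketched by hand with Python's standard unicodedata module. This is an illustrative approximation rather than a full implementation: the function names decompose and sketch_nfd are assumptions, and the sketch ignores the algorithmic decomposition of Hangul syllables, which unicodedata.decomposition does not report. For real use, unicodedata.normalize("NFD", text) is the appropriate call.

```python
import unicodedata

def decompose(ch: str) -> str:
    """Recursively apply the canonical decomposition mapping of one character."""
    mapping = unicodedata.decomposition(ch)
    # An empty mapping means "no decomposition"; a leading "<tag>" marks a
    # compatibility mapping, which NFD does not apply.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(decompose(chr(int(cp, 16))) for cp in mapping.split())

def sketch_nfd(text: str) -> str:
    """Step 1: decompose everything. Step 2: sort each run of marks by class."""
    out, run = [], []
    for c in (d for ch in text for d in decompose(ch)):
        if unicodedata.combining(c) == 0:
            # A class-0 character ends the current run of reorderable marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(c)
        else:
            run.append(c)
    # sorted() is stable, so marks of the same class keep their relative order.
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

for s in ("\u00E2\u0323", "\u1EAD", "a\u0303\u0302"):
    print(sketch_nfd(s) == unicodedata.normalize("NFD", s))
```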

Coded characters in Unicode

As of Unicode 7.0 (June 2014), the Unicode standard defines 1830 combining characters, which are spread over several blocks.
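A count of this kind can be reproduced for whichever Unicode version ships with a given Python interpreter. The sketch below tallies the general categories Mn, Mc, and Me with the standard unicodedata module; the result depends on unicodedata.unidata_version and will generally differ from the Unicode 7.0 figure, since later versions added more characters.

```python
import sys
import unicodedata
from collections import Counter

# Tally the general category of every code point in the Unicode range.
counts = Counter(
    unicodedata.category(chr(cp)) for cp in range(sys.maxunicode + 1)
)
total = counts["Mn"] + counts["Mc"] + counts["Me"]

print("Unicode version:", unicodedata.unidata_version)
print("Mn:", counts["Mn"], "Mc:", counts["Mc"], "Me:", counts["Me"])
print("Combining characters in total:", total)
```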

The three blocks Combining Diacritical Marks, Combining Diacritical Marks Supplement, and Combining Diacritical Marks Extended contain diacritical marks intended for letters of all alphabets.

The Unicode block Combining Diacritical Marks for Symbols also contains combining characters, but these are intended for use with symbols. Warning signs, for example, can be composed this way: <high voltage sign U+26A1, combining enclosing upward pointing triangle U+20E4> yields ⚡⃤.

The combining half marks are in the Unicode block Combining Half Marks.

Many other blocks also contain combining characters designed specifically for use with the other characters in that block. For example, the combining characters for the titlo and other Cyrillic diacritical marks are in the Cyrillic block.
