Combining grapheme joiner

from Wikipedia, the free encyclopedia

The combining grapheme joiner (CGJ for short) is the Unicode character at code point U+034F. The name is a misnomer: the character does not join graphemes, and it plays no role in the Unicode line breaking algorithm. For reasons of stability, however, the name can no longer be changed. Formally, the character is defined as a combining character, but in practice it functions as a control character that is used for several different purposes.
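The formal properties mentioned above can be inspected with Python's standard `unicodedata` module. This is only a small sketch; the variable names are illustrative, and the values come from whatever version of the Unicode Character Database ships with the interpreter.

```python
import unicodedata

CGJ = "\u034f"

# Official character name, which is fixed by Unicode's stability policy.
cgj_name = unicodedata.name(CGJ)

# General category "Mn" (nonspacing mark) is what makes it formally
# a combining character.
cgj_category = unicodedata.category(CGJ)

# Canonical combining class 0: normalization never moves other marks
# across the CGJ.
cgj_combining_class = unicodedata.combining(CGJ)

print(cgj_name, cgj_category, cgj_combining_class)
```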

Arrangement of complex diacritics

Representation without CGJ

The character has a visible effect when it is used together with diacritical marks that span several letters. The Unicode standard specifies that such double diacritics are rendered above all other diacritics. For example, in the sequence t, combining double inverted breve (U+0361), combining dot above (U+0307), s, the dot appears above the inverted breve: ṫ͡s

Representation with CGJ

If, on the other hand, the dot should appear below the breve, a CGJ must be placed in front of the dot: t͡͏̇s

The CGJ is also used in Hebrew to position certain diacritical marks.
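The behaviour in the example with t and s can be traced to canonical ordering during normalization, as far as the character data goes: the double inverted breve has a higher canonical combining class than the dot above, so the two marks are reordered unless a CGJ (class 0) stands between them. The following sketch, using only Python's standard `unicodedata` module and illustrative variable names, shows the stored sequences; how a font actually draws them is a separate matter.

```python
import unicodedata

BREVE = "\u0361"  # COMBINING DOUBLE INVERTED BREVE, combining class 234
DOT = "\u0307"    # COMBINING DOT ABOVE, combining class 230
CGJ = "\u034f"    # COMBINING GRAPHEME JOINER, combining class 0

without_cgj = "t" + BREVE + DOT + "s"
with_cgj = "t" + BREVE + CGJ + DOT + "s"

# Without CGJ, canonical ordering sorts the marks by combining class,
# so the dot (230) ends up before the double breve (234).
nfd_without = unicodedata.normalize("NFD", without_cgj)

# With CGJ, the class-0 character separates the two marks,
# so their order survives normalization.
nfd_with = unicodedata.normalize("NFD", with_cgj)

print([f"U+{ord(c):04X}" for c in nfd_without])
print([f"U+{ord(c):04X}" for c in nfd_with])
```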

Semantic differentiation of diacritical marks

In Unicode, diacritical marks are encoded not according to their function but only according to their appearance. Two diacritical marks that differ in meaning but look the same therefore cannot be distinguished directly. For example, there is only one two-dot mark (U+0308), which serves both as the umlaut and as the diaeresis (trema). It is therefore not possible to tell from an ä whether it is a German umlaut or an ordinary a with a diaeresis. One could encode the umlaut as the precomposed ä (U+00E4) and the a with diaeresis as the sequence <U+0061, U+0308>, but this distinction would be lost under normalization. To preserve the distinction, the combining character must be preceded by a CGJ, i.e. the sequence <U+0061, U+034F, U+0308>.
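A short check with Python's standard `unicodedata` module illustrates why the CGJ is needed here. The variable names are illustrative only, and the sketch assumes NFC as the normalization form; the other forms collapse the distinction in the same way.

```python
import unicodedata

precomposed = "\u00e4"    # ä as a single code point
decomposed = "a\u0308"    # a + COMBINING DIAERESIS
guarded = "a\u034f\u0308" # a + CGJ + COMBINING DIAERESIS

# NFC composes the plain sequence into the precomposed letter,
# so the two encodings can no longer be told apart.
nfc_decomposed = unicodedata.normalize("NFC", decomposed)

# The CGJ blocks composition, so the guarded sequence is left unchanged.
nfc_guarded = unicodedata.normalize("NFC", guarded)

print(nfc_decomposed == precomposed)
print(nfc_guarded == guarded)
```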

Sorting

In some languages, certain digraphs are treated as letters in their own right in alphabetical order, such as ch in Slovak. The Unicode Collation Algorithm can take this into account when it is tailored accordingly. If, as an exception, such a combination should be sorted not as a digraph but as a simple sequence of the two letters, the two characters can be separated with a CGJ.
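The effect can be illustrated with a toy sort key. This is not the Unicode Collation Algorithm, only a simplified stand-in: the alphabet (lowercase a to z with ch placed after h), the `RANK` table, and the `sort_key` function are all hypothetical and handle nothing beyond those letters and the CGJ.

```python
import string

CGJ = "\u034f"

# Hypothetical alphabet: a-z with the digraph "ch" ranked right after "h".
letters = list(string.ascii_lowercase)
letters.insert(letters.index("h") + 1, "ch")
RANK = {letter: i for i, letter in enumerate(letters)}

def sort_key(word):
    """Toy key: 'ch' counts as one letter unless a CGJ separates c and h."""
    key = []
    i = 0
    while i < len(word):
        if word[i] == CGJ:
            i += 1  # the CGJ has no weight; it only splits the digraph
        elif word[i:i + 2] == "ch":
            key.append(RANK["ch"])
            i += 2
        else:
            key.append(RANK[word[i]])  # raises KeyError outside a-z
            i += 1
    return tuple(key)

words = ["chata", "cena", "hora", "c" + CGJ + "hata"]
ordered = sorted(words, key=sort_key)
print(ordered)
```

In this sketch the word with the CGJ sorts among the words beginning with c, while the unmodified word sorts after all words beginning with h. A real application would use a tailored collation implementation instead.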

Sources

  • Julie D. Allen et al.: The Unicode Standard. Version 6.2 - Core Specification. The Unicode Consortium, Mountain View, CA, 2012. ISBN 978-1-936213-07-8. Chapter 16.2: Layout Controls.

References

  1. Asmus Freytag, Rick McGowan and Ken Whistler: UTN #27: Known Anomalies in Unicode Character Names. As of May 8, 2006.