Unicode casing algorithm


The Unicode standard includes several algorithms that deal with uppercase and lowercase letters, the Unicode casing algorithms. These algorithms make it possible to convert text into a different case form (for example, converting all letters to lower case), to determine whether a text is in a certain case form (e.g. entirely in upper case), and to compare two texts for equality regardless of case. The algorithms can be partially tailored to the language in use: in most languages the capital letter corresponding to the lowercase i is the ordinary capital I, whereas in Turkish it is the capital İ with a dot above.

Basics

Three case forms have special names: lower case, in which all letters are lowercase; upper case, in which all letters are uppercase; and title case, in which the first letter of each word is uppercase while the following letters are lowercase. In addition, there is the case-folded normal form.

For each Unicode character, the Unicode standard defines a number of properties that are used in the algorithms. These properties indicate whether a character is a lowercase or an uppercase letter and, if so, which is the associated uppercase or lowercase letter.

The mappings between lowercase and uppercase letters can be divided into three groups:

  • In the simple mappings, exactly one character is assigned to each letter, for example a as the lowercase letter for A.
  • In the complex mappings, a string of several characters is assigned to a letter. For example, ß becomes the two characters SS when converted to upper case. This also happens for letters with diacritical marks that are encoded in only one case form: the lowercase w with ring above (ẘ, U+1E98) must be represented by two characters when converted to upper case, a capital W and a combining ring above.
  • Finally, there are mappings that depend on the language or on a particular context. For example, the Greek capital sigma (Σ, U+03A3) becomes the ordinary lowercase sigma (σ, U+03C3) when converted to lower case, unless it stands at the end of a word, in which case it becomes the final form (ς, U+03C2). Comprehensive rules for different languages are available in the Common Locale Data Repository.
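The three groups can be observed with Python's built-in string methods, which apply the full mappings and the final-sigma context rule. This is only a small illustration; the variable names are chosen for this example.

```python
# Simple mapping: one character maps to exactly one character.
simple = "a".upper()                    # "A"

# Complex mapping: one character maps to several characters.
sharp_s = "\u00df".upper()              # ß -> "SS"
w_ring = "\u1e98".upper()               # ẘ -> "W" + U+030A COMBINING RING ABOVE

# Context-dependent mapping: capital sigma depends on its position.
isolated_sigma = "\u03a3".lower()       # Σ alone -> σ (U+03C3)
final_sigma = "\u0391\u03a3".lower()    # ΑΣ -> ας, ending in final ς (U+03C2)

print(simple, sharp_s, len(w_ring), isolated_sigma, final_sigma)
```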

Case conversion

To convert a text to lower case, each character is replaced by its corresponding lowercase letter. Both the simple and the complex mappings are used, and the context in which the character occurs must be taken into account. Conversion to upper case works the same way.

For conversion to title case, the word boundaries are first determined according to the corresponding Unicode text segmentation algorithm. In each word, the first character that has different case forms is located and replaced by its title-case form. The remaining characters up to the next word boundary are converted to lower case.
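The title-case procedure can be sketched in Python as follows. This is a rough approximation under stated assumptions: the regular expression `\w+` stands in for the Unicode word segmentation algorithm (which it does not fully implement), and the function name `to_titlecase` is invented for this example.

```python
import re

def to_titlecase(text):
    """Approximate title casing: title-case the first cased character
    of each word and lower-case the rest of the word."""
    def convert(match):
        word = match.group(0)
        for i, ch in enumerate(word):
            # A character counts as cased here if some mapping changes it.
            if ch.lower() != ch or ch.upper() != ch:
                # ch.title() applies the title-case mapping, which can
                # differ from the upper-case one (e.g. for the digraph ǆ).
                return word[:i] + ch.title() + word[i + 1:].lower()
        return word  # no cased character, e.g. "123"
    # Assumption: \w+ is only a stand-in for real word segmentation.
    return re.sub(r"\w+", convert, text)

print(to_titlecase("hello wORLD"))
print(to_titlecase("123abc"))
```

Characters without case, such as leading digits, are skipped, so the first letter after them receives the title-case form.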

These algorithms can be adapted by using different mappings for the individual characters, for example only the simple mappings or language-specific variants. One could also use the capital ẞ (U+1E9E) as the uppercase form of “ß”.

For example, to convert the word “Wikipedia” to upper case, each letter is simply replaced by the corresponding capital letter, giving “WIKIPEDIA”. If, on the other hand, the Turkish name of Wikipedia, “Vikipedi”, is converted to upper case, the mappings for Turkish should be used, which specify “İ” (U+0130) as the capital letter for “i”, resulting in “VİKİPEDİ”.

When “ΚΌΣΜΟΣ” is converted to lower case, the first sigma is in the middle of the word and therefore becomes “σ”, while the second sigma is at the end of the word and becomes “ς”. The result is “κόσμος”.
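These examples can be reproduced in Python. Its built-in case methods are not locale-aware, so the Turkish tailoring is applied by hand in this sketch; the helper `upper_turkish` and the table `TURKISH_UPPER` are hypothetical names, and only the dotted and dotless i are handled.

```python
# Hypothetical tailoring table: Turkish i -> İ (U+0130), ı (U+0131) -> I.
TURKISH_UPPER = {ord("i"): "\u0130", ord("\u0131"): "I"}

def upper_turkish(text):
    """Upper-case text with the Turkish mappings for i and ı applied first."""
    return text.translate(TURKISH_UPPER).upper()

print("Wikipedia".upper())           # default mappings
print(upper_turkish("Vikipedi"))     # Turkish mappings

# The Greek example, written with escapes to avoid encoding ambiguity.
kosmos = "\u039a\u038c\u03a3\u039c\u039f\u03a3"   # ΚΌΣΜΟΣ
print(kosmos.lower())                # middle sigma -> σ, final sigma -> ς
```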

Case-insensitive comparison

To check two texts for equality regardless of case, both are converted into a special normal form. This normal form, known as casefold, is essentially lower case. Here, too, each character is individually replaced by its Case_Folding equivalent.

Both texts should then be converted into the same Unicode normalization form before they are compared. In some rare cases it is in fact necessary to alternate between case folding and normalization several times.
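One way to combine case folding and normalization is the sequence NFD, case folding, NFD, which corresponds to what the Unicode Standard calls canonical caseless matching. The following is a minimal sketch using Python's `str.casefold()` and `unicodedata.normalize()`; the function names are invented for this example.

```python
import unicodedata

def caseless_key(text):
    """Return NFD(casefold(NFD(text))) as a comparison key."""
    decomposed = unicodedata.normalize("NFD", text)
    # Folding can produce text that is no longer normalized,
    # so the result is normalized once more.
    return unicodedata.normalize("NFD", decomposed.casefold())

def caseless_equal(a, b):
    """Compare two strings regardless of case and canonical form."""
    return caseless_key(a) == caseless_key(b)

print(caseless_equal("STRASSE", "Stra\u00dfe"))   # ß folds to "ss"
print(caseless_equal("\u00c9", "e\u0301"))        # É vs. e + combining acute
```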

A special variant is intended for case-insensitive identifiers in programming languages: all characters that have the Default_Ignorable_Code_Point property (e.g. formatting control characters) are removed, and the string is then case-folded and converted to the normalization form NFKC. To simplify matters, there is the property NFKC_Casefold, which allows a simple per-character conversion, so that in the end only the reordering of the combining characters and the conversion into the composed normal form have to be done separately.
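The identifier variant can be approximated in Python as shown below. This sketch does not use the NFKC_Casefold data itself; it chains `casefold()` and NFKC normalization instead. The set `SAMPLE_IGNORABLE` is a small illustrative subset of the default-ignorable characters, not the full property, and `identifier_key` is a hypothetical name.

```python
import unicodedata

# Illustrative subset only; the real Default_Ignorable_Code_Point set is larger.
SAMPLE_IGNORABLE = {0x00AD, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}

def identifier_key(name):
    """Approximate key for comparing identifiers regardless of case."""
    # Step 1: drop (a sample of) default-ignorable characters.
    kept = "".join(ch for ch in name if ord(ch) not in SAMPLE_IGNORABLE)
    # Step 2: alternate case folding and normalization.
    folded = unicodedata.normalize("NFD", kept).casefold()
    folded = unicodedata.normalize("NFKC", folded).casefold()
    # Step 3: finish in the composed compatibility form NFKC.
    return unicodedata.normalize("NFKC", folded)

print(identifier_key("FILE") == identifier_key("\ufb01le"))     # ligature ﬁ
print(identifier_key("file") == identifier_key("fi\u00adle"))   # soft hyphen
```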

For example, to check whether the two words “MASSE” and “Maße” match regardless of case, both can be converted to the case-folded normal form by looking up the Case_Folding property of each letter and replacing the letters accordingly. The first word becomes “masse”, since all capital letters are replaced by lowercase letters. The second word is also normalized to “masse”, since the folding “ss” is specified for the ß. The two words are therefore equal apart from case.

Determining the case

To determine whether a text is in a certain case form, it is converted into that form. If it does not change, it was already in that form. For simplicity, a property is available for each case form, including the casefold, that indicates whether a character changes under the conversion, so that only this property needs to be tested for each character. A text is in a certain case form exactly when none of its characters would change during the conversion.

According to this definition, texts that consist only of characters without case distinction, such as digits, are in every case form at once. To check whether a text is genuinely all lower case, it is therefore sensible to test not only whether it is in lower case according to this definition, but also whether there is some other case form that it is not in.

Both “UNICODE” and “123” are in upper case, but “Unicode” is not. This can be checked either by performing the conversions or by looking at the Changes_When_Uppercased property of all characters. While this property is false for all characters in the first two strings, the third string contains letters for which it is true, namely the “n” and all following letters. “UNICODE” is genuinely upper case because the word is not also in lower case; that would be “unicode”. “123”, on the other hand, is in every case form, as can be checked through the individual conversions or by means of the Changes_When_Casemapped property, which indicates whether a character changes under any conversion. It is always false for the digits, whereas it is true for Latin letters.
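The checks described in this section can be sketched in Python by performing the conversions and comparing the results. The function names are invented for this example, and the per-character test only imitates the Changes_When_Uppercased property by converting each character rather than reading the property data.

```python
import unicodedata

def is_uppercase(text):
    """True if upper-casing the (NFD-normalized) text changes nothing."""
    nfd = unicodedata.normalize("NFD", text)
    return nfd.upper() == nfd

def is_lowercase(text):
    """True if lower-casing the (NFD-normalized) text changes nothing."""
    nfd = unicodedata.normalize("NFD", text)
    return nfd.lower() == nfd

def is_genuinely_uppercase(text):
    """Upper case, and not simultaneously lower case (rules out "123")."""
    return is_uppercase(text) and not is_lowercase(text)

def changing_chars(text):
    """Characters that would change when converted to upper case."""
    return [ch for ch in text if ch.upper() != ch]

for sample in ("UNICODE", "123", "Unicode"):
    print(sample, is_uppercase(sample), is_genuinely_uppercase(sample),
          changing_chars(sample))
```

Python's own `str.isupper()` behaves like the stricter check: it additionally requires at least one cased character, so it is false for "123".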

Sources

  • Julie D. Allen et al.: The Unicode Standard. Version 6.2 - Core Specification. The Unicode Consortium, Mountain View, CA, 2012. ISBN 978-1-936213-07-8.
