Normalization (text)

Normalization of text refers to its conversion into another form in which only the information relevant to the desired context is retained. Depending on the application, normalization can take very different forms.

Examples

Some character sets, especially Unicode, allow a character to be represented in different ways. Applications, however, usually want only one of the possible forms, so normalization has to convert the text into that form. Unicode itself defines four normalization forms for this purpose: NFC, NFD, NFKC and NFKD.
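
As an illustration of the paragraph above, the following minimal sketch uses Python's standard unicodedata module to bring two different encodings of the same visible character into a common form. The helper name normalize_all and the sample strings are chosen here for illustration only.

```python
import unicodedata

# Two encodings of the same visible character, e with an acute accent.
precomposed = "\u00e9"   # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def normalize_all(text):
    """Return the text in each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


# The raw strings differ, but they agree once both are in the same form.
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# The compatibility forms (NFKC, NFKD) additionally fold characters such as
# the "fi" ligature into their plain equivalents.
ligature_forms = normalize_all("\ufb01")  # LATIN SMALL LIGATURE FI
```

Which form is appropriate depends on the application. The canonical forms (NFC, NFD) preserve distinctions such as ligatures, while the compatibility forms discard them.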

When building a search index, normalization must meet different requirements depending on what the user expects. Some options are:

Punctuation marks can be removed.
Accented characters can be replaced with their base letter. Likewise, ä can be replaced by ae and ß by ss.
All characters can be converted to capital letters.
Characters from other alphabets can be transliterated.

Some of these requirements can be met with the help of the Unicode Collation Algorithm.
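
The first three options listed above can be sketched with the Python standard library alone. This is one possible approach under stated assumptions, not an implementation of the Unicode Collation Algorithm: the function name search_key and the table GERMAN_EXPANSIONS are hypothetical, the entries for ö and ü are added by analogy with ä, and transliteration of other alphabets is left out.

```python
import unicodedata

# Hypothetical replacement table for the German special cases.
# Only ä and ß come from the text; ö and ü are assumed by analogy.
GERMAN_EXPANSIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "AE", "Ö": "OE", "Ü": "UE",
    "ß": "ss",
}


def search_key(text):
    """Reduce text to a simplified key for use in a search index."""
    # Compose first so that "a" + combining diaeresis matches the table.
    text = unicodedata.normalize("NFC", text)
    # Apply the German expansions before accents are stripped.
    text = "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in text)
    # Decompose, then drop combining marks to keep only base letters.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation (Unicode general categories starting with "P").
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    # Convert to capital letters and collapse runs of whitespace.
    return " ".join(text.upper().split())
```

The same function would be applied both to the indexed documents and to the user's query, so that a search for "cafe" finds "Café".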

To prevent spoofing, for example two users registering on an Internet forum with names that look identical, visually similar characters must be replaced with the same character during normalization. For instance, both the digit 1 and the lowercase letter l could be replaced by the uppercase letter I.
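
A minimal sketch of such a check follows. The table LOOKALIKES and the functions skeleton and is_name_available are hypothetical names introduced here. The table contains only the two replacements mentioned above, whereas a real forum would need a far larger set of look-alike characters.

```python
# Hypothetical table of look-alike characters, mapped to one representative.
# Only the two replacements mentioned in the text are included.
LOOKALIKES = {
    "1": "I",  # digit one
    "l": "I",  # lowercase letter L
}


def skeleton(name):
    """Replace every look-alike character with its representative."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in name)


def is_name_available(new_name, existing_names):
    """Reject a name whose skeleton collides with an existing user's."""
    taken = {skeleton(name) for name in existing_names}
    return skeleton(new_name) not in taken
```

The normalized skeleton is used only for comparison. The name is still stored and displayed as the user typed it.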

For speech synthesis, numbers, special characters and abbreviations have to be expanded, in part depending on the context, so that they are read out correctly.
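
The following sketch shows the idea in its simplest, context-free form. The lookup tables and the function expand_for_speech are hypothetical and deliberately tiny. Numbers are read digit by digit, and context-dependent cases, such as an abbreviation with more than one reading, are not handled.

```python
import re

# Hypothetical, deliberately small lookup tables for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "e.g.": "for example", "etc.": "et cetera"}
SYMBOLS = {"%": "percent", "&": "and"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def spell_digits(match):
    """Read a run of digits one digit at a time."""
    return " ".join(DIGITS[int(d)] for d in match.group())


def expand_for_speech(text):
    """Expand abbreviations, symbols and digits into speakable words."""
    for abbreviation, spoken in ABBREVIATIONS.items():
        text = text.replace(abbreviation, spoken)
    for symbol, spoken in SYMBOLS.items():
        text = text.replace(symbol, f" {spoken} ")
    text = re.sub(r"[0-9]+", spell_digits, text)
    # Collapse the extra spaces introduced by the replacements.
    return " ".join(text.split())
```

A production system would instead decide from the context whether, for example, 1984 is a year, a quantity or part of a telephone number.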
