Soundex

from Wikipedia, the free encyclopedia
Soundex algorithm

Soundex is a phonetic algorithm for indexing words and phrases according to their sound in the English language . Words that sound the same should be coded to form an identical character string .

However, the Soundex algorithm often also produces good results for the German language.

Soundex was developed by Robert Russell and Margaret Odell for the indexing of family names in the United States Census and patented in 1918 ( US Patent 1,261,167). The Soundex code for a word consists of its first letter, followed by three digits that represent the consonants of the word following the first letter . Similar sounds have the same code (e.g. B, F, P and V are all coded with the number “1”).

Basic rules

Each Soundex code consists of a letter followed by three digits, e.g. B. W-213 for Wikipedia. If the word to be coded has so many letters that you could generate more digits, you break off after the third digit. If the word has too few letters, the last digits are filled with zeros. So the Asian name Lee is encoded as L-000.

Letter codes

Digit Represented letters
1 B, F, P, V
2 C, G, J, K, Q, S, X, Z
3 D, T
4th L.
5 M, N
6th R.

The vowels A, E, I, O and U and the consonants H, W and Y are to be ignored except for the first character. The following can also be defined for the German language: The umlauts Ä, Ö and Ü are to be ignored, the "sharp S" ß is coded like the simple S.

If several consecutive letters in the original string have the same Soundex code, this only appears once in the result, so abfx becomes about A120 (a remains because first letter, b and f both result in the same code 1, x results in 2, at the end a zero is appended to get four characters).

When it comes to the practical application of the Soundex process, two main points are criticized: on the one hand, it is very much oriented towards the English language , on the other hand, it only offers a very rough analysis.

Nevertheless, it should be noted that the algorithm shown is probably the most frequently used one for phonetic searches . The fact that a corresponding PL / SQL standard command was implemented for the Oracle database at a very early stage certainly contributed to this .

Different variants were later developed specifically for other languages. For example, in addition to the standard Soundex process, the so-called " Cologne process " (or "Cologne phonetics") is also implemented under SAP for German purposes.

Lately the following example has established itself as a demonstration of the very rough analysis: According to the Soundex method, the terms Britney Spears and proven Super Bitch are phonetically identical:

Britney → BRTN → B635 proven → BRTN → B635
Spears → SPRS → S162 Super Zicke → SPRZCK → S16222 → S162

Soundex for the German language

The Soundex algorithm can also be used for the German language by adapting the letter codes.

Letter codes for the German language

Digit Represented letters
0 a, e, i, o, u, ä, ö, ü, y, j, H
1 b, p, f, v, w
2 c, g, k, q, x, s, z, ß
3 d, t
4th l
5 m, n
6th r
7th ch

Most obvious is that the German alphabet with the umlauts “ä”, “ö” and “ü” as well as the “ß” has letters that do not exist in English. Since the German umlauts are nothing more than vowels from a phonetic point of view, “ä”, “ö” and “ü” are treated in exactly the same way as the other vowels and are initially replaced by zeros during coding and later eliminated. The “sharp ß” is represented by the number 2 like the simple “s”.

Since the letter “ß” only exists as a lower case letter, the letters should be converted into lower case rather than upper case in the first step of the algorithm.

Another difference between German and English is the function of the letter “j”. While the “j” in English (as in “just” or “join”) is pronounced as a sibilant, which is represented in German by the letter sequence “tsch”, the “j” in German has the same function as the “ y ”(see“ Yes ”and“ Ja ”) and must therefore fall into the group of vowels and semi-vowels.

The letter “w” behaves similarly, which occurs in English as a half-vowel (as in “what”) or as a silent letter (as in “awesome”), but in German as a voiced counterpart to the letters “f” or “v” is needed. Therefore the “w” in German has to be coded with 1.

Another specialty of the German language is the sequence of letters “ch”, which represents sounds like “ich” or “ach”. Both sounds do not exist in the English language. Since “ch” does not fit into any of the existing categories, a seventh category is created.

As a further adaptation to the German language, it would have to be investigated whether the length of the code words, which is limited to three digits, is appropriate for the German language, since the word length in German tends to be longer than in English.

See also

Web links

Individual evidence

  1. Adaptation of the Soundex algorithm for the German language | Web, app and software development Bayreuth - groupXS. Retrieved on July 9, 2018 (German).