Cologne Phonetics
Cologne Phonetics (German: Kölner Phonetik, also known as the Cologne method) is a phonetic algorithm that assigns each word a sequence of digits, the phonetic code, based on how the word sounds. The aim is to give the same code to words that sound the same, so that search functions can offer a similarity search. This makes it possible, for example, to find an entry such as “Meier” in a list of names under other spellings such as “Maier”, “Mayer” or “Mayr”. Compared to the better-known Russell Soundex method, Cologne Phonetics is better adapted to the German language. It was published in 1969 by Hans Joachim Postel.
Basic rules
Cologne Phonetics maps each letter of a word to a digit between “0” and “8”, using at most one neighboring letter as context when selecting the digit. Some rules apply specifically to the beginning of the word (the initial sound). In this way, similar sounds are assigned the same code. For example, the letters “W” and “V” are both encoded with the digit “3”. The phonetic code for “Wikipedia” is 3412. In contrast to the Soundex code, the length of the phonetic code is not limited in Cologne Phonetics.
Letter codes
Letter | Context | Code |
---|---|---|
A, E, I, J, O, U, Y | | 0 |
H | | - |
B | | 1 |
P | not before H | 1 |
D, T | not before C, S, Z | 2 |
F, V, W | | 3 |
P | before H | 3 |
G, K, Q | | 4 |
C | initial, before A, H, K, L, O, Q, R, U, X | 4 |
C | before A, H, K, O, Q, U, X, except after S, Z | 4 |
X | not after C, K, Q | 48 |
L | | 5 |
M, N | | 6 |
R | | 7 |
S, Z | | 8 |
C | after S, Z | 8 |
C | initial, except before A, H, K, L, O, Q, R, U, X | 8 |
C | not before A, H, K, O, Q, U, X | 8 |
D, T | before C, S, Z | 8 |
X | after C, K, Q | 8 |
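The table can be read as a function of one letter and its two neighbors. The following Python sketch shows one possible way to write it; the names `letter_code`, `_norm` and `VOWELS` are illustrative and not taken from the original publication. It assumes that an empty string stands for "no neighbor" (word boundary), that H yields no digit at all, and that umlauts count as vowels and ß as S, which is a convention and not part of the table.

```python
# Umlauts are grouped with the vowels here (assumption, not in the table).
VOWELS = frozenset("AEIJOUYÄÖÜ")


def _norm(c: str) -> str:
    """Upper-case one character; treat ß like S (assumption, not in the table)."""
    return "S" if c == "ß" else c.upper()


def letter_code(prev: str, ch: str, nxt: str) -> str:
    """Return the code for `ch`, given the previous and next letter.

    `prev` and `nxt` are '' at a word boundary. The result is '' for H
    and for any character that the table does not cover.
    """
    prev, ch, nxt = _norm(prev), _norm(ch), _norm(nxt)
    if ch in VOWELS:
        return "0"
    if ch == "B":
        return "1"
    if ch == "P":
        return "3" if nxt == "H" else "1"
    if ch in ("D", "T"):
        return "8" if nxt in ("C", "S", "Z") else "2"
    if ch in ("F", "V", "W"):
        return "3"
    if ch in ("G", "K", "Q"):
        return "4"
    if ch == "C":
        if prev == "":  # initial sound
            return "4" if nxt in frozenset("AHKLOQRUX") else "8"
        if prev in ("S", "Z"):  # "after S, Z" takes priority
            return "8"
        return "4" if nxt in frozenset("AHKOQUX") else "8"
    if ch == "X":
        return "8" if prev in ("C", "K", "Q") else "48"
    if ch == "L":
        return "5"
    if ch in ("M", "N"):
        return "6"
    if ch == "R":
        return "7"
    if ch in ("S", "Z"):
        return "8"
    return ""  # H and everything not in the table
```

With this sketch, `letter_code("", "C", "L")` gives "4" (initial C before L), while `letter_code("Z", "C", "L")` gives "8" (C after Z).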
For the letter “C”, the rule for “SC” takes priority over the rule for “CH”. The table accounts for this through the addition “except after S, Z” in row 10. Although this priority is not explicitly mentioned in the original publication, it can be deduced from the examples given there (e.g. the code “17863” is given for “Breschnew”, the German spelling of Brezhnev).
Lowercase letters are encoded in the same way as uppercase letters; all other characters (e.g. hyphens) are ignored. The umlauts Ä, Ö, Ü and the letter ß are not covered by the conversion table; it is advisable to group the umlauts with the vowels (code "0") and ß with S, Z (code "8").
A word is converted in three steps:
- Encode the word letter by letter, from left to right, according to the conversion table.
- Remove consecutive duplicates, so that each run of identical adjacent digits is reduced to a single digit.
- Remove all codes "0" except at the beginning.
Example
The name Müller-Lüdenscheidt is coded as follows:
- Letter-wise coding: 60550750206880022
- Remove consecutive duplicate digits: 6050750206802
- Remove all codes "0": 65752682
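The second and third steps operate only on the digit string, so they can be sketched independently of the conversion table. The Python fragment below is a minimal illustration; the function names `collapse_runs` and `drop_zeros` are made up for this example. It starts from the letter-wise coding of Müller-Lüdenscheidt given above.

```python
from itertools import groupby


def collapse_runs(digits: str) -> str:
    """Step 2: reduce each run of identical adjacent digits to one digit."""
    return "".join(key for key, _ in groupby(digits))


def drop_zeros(digits: str) -> str:
    """Step 3: remove every '0' except one in the first position."""
    return digits[:1] + digits[1:].replace("0", "")


raw = "60550750206880022"           # letter-wise coding from step 1
collapsed = collapse_runs(raw)
final = drop_zeros(collapsed)
print(collapsed)  # 6050750206802
print(final)      # 65752682
```

The order matters: zeros are removed only after duplicates have been collapsed, so a vowel between two identical consonant codes still keeps them apart (as with the two 5s in 6050750206802).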
Because hyphens are ignored, the name Müller-Lüdenscheidt is treated as a single word. Word boundaries separated by spaces do matter, however. If "Heinz Classen" is encoded as a single word, ignoring the space, the result is 068586: the Z becomes 8, the C that follows it also becomes 8, and the second 8 is dropped as a duplicate. If the name is encoded as two words, the C is in initial position before L, becomes 4 and is retained, which gives the correct coding "068 4586".
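The difference comes down to which neighbors each letter sees. The following Python sketch makes this visible by listing the (previous, letter, next) context of every letter, with or without splitting at whitespace. The function `letter_contexts` and its `split_words` flag are hypothetical names chosen for this illustration; splitting on whitespace is an assumption about how an implementation might delimit words.

```python
from typing import Iterator, Tuple


def letter_contexts(
    text: str, split_words: bool = True
) -> Iterator[Tuple[str, str, str]]:
    """Yield (previous, letter, next) for every letter in `text`.

    Non-letters such as hyphens are skipped. With split_words=True,
    whitespace starts a new word, so the first letter of each word has
    '' as its previous letter. With split_words=False the whole text
    is treated as one word.
    """
    chunks = text.split() if split_words else [text]
    for chunk in chunks:
        letters = [c.upper() for c in chunk if c.isalpha()]
        for i, letter in enumerate(letters):
            prev = letters[i - 1] if i > 0 else ""
            nxt = letters[i + 1] if i + 1 < len(letters) else ""
            yield prev, letter, nxt


joined = list(letter_contexts("Heinz Classen", split_words=False))
split = list(letter_contexts("Heinz Classen", split_words=True))
print(("Z", "C", "L") in joined)  # True: C follows Z, so it is coded 8
print(("", "C", "L") in split)    # True: C is initial before L, so it is coded 4
```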
Literature
- Hans Joachim Postel: The Cologne Phonetics. A method of identifying personal names based on gestalt analysis. In: IBM-Nachrichten, Volume 19, 1969, pp. 925–931.
Web links
- Martin Wilz: Aspects of the coding of phonetic similarities in German proper names (archived July 1, 2007 in the Internet Archive; PDF, 502 kB). Master's thesis at the Philosophical Faculty of the University of Cologne, 2005; contains an implementation in the Perl programming language.
- Maroš Kollár: Perl implementation of Cologne Phonetics and similar procedures as free software in the CPAN (Comprehensive Perl Archive Network).
- Andy Theiler: PHP and Oracle PL/SQL implementation of Cologne Phonetics.
- Nicolas Zimmer: PHP implementation of Cologne Phonetics in a comment on the entry soundex in the PHP manual, 2008.
- Falk Meyer: Java implementation of Cologne Phonetics for Apache Commons Codec.