Coincidence index

The index of coincidence (ger .: Index of coincidence , abbreviation: IC ) obtained by statistical analysis of the frequency of single characters (ie most of the individual letters) one or two texts. With its help, encrypted or incomprehensible texts can be examined for linguistic properties. It is used specifically for the deciphering of historical written documents and generally in cryptanalysis . The method was developed by the American cryptanalyst William F. Friedman for cryptological purposes and published in 1922 in his groundbreaking work The index of coincidence and its applications in cryptography. (German: "The coincidence index and its applications in cryptography").

{\ displaystyle \ mathbf {IC} = {\ frac {\ sum _ {i = A} ^ {Z} n_ {i} (n_ {i} -1)} {N (N-1)}}}

In its basic form, the coincidence index is determined by counting the individual numbers of the different individual characters of a ciphertext , for example how often the letter A occurs, how often B, and so on. These are multiplied by the individual numbers reduced by 1 using the formula given above and added up for all letters (e.g. from A to Z). The sum is then divided by the total number N of letters in the text (i.e. the text length) and the text length reduced by 1. The result is Friedman's coincidence index IC. ${\ displaystyle n_ {i}}$

Natural languages each have their typical coincidence index.

Definitions

In general terms, four functions are summarized under the term coincidence index , which are usually designated with the Greek letters and (kappa, chi, psi and phi). It is often referred to as the coincidence index, although from a historical point of view it is more likely to be entitled to this name. ${\ displaystyle \ kappa, \ chi, \ psi \}$ ${\ displaystyle \ phi \}$ ${\ displaystyle \ phi \}$ ${\ displaystyle \ kappa \,}$

Given are two texts of equal length over the same alphabet. Then that is the two texts ${\ displaystyle T = x_ {1} x_ {2} \ ldots x_ {k}, T '= x' _ {1} x '_ {2} \ ldots x' _ {k}}$ ${\ displaystyle \ kappa \,}$

{\ displaystyle \ kappa (T, T ') = \ sum _ {i = 1} ^ {k} {\ frac {\ delta (x_ {i}, x' _ {i})} {k}},}

where the Kronecker delta denotes (i.e. , if and otherwise). ${\ displaystyle \ delta \,}$ ${\ displaystyle \ delta (x_ {i}, x '_ {i}) = 1}$ ${\ displaystyle x_ {i} = x '_ {i}}$ ${\ displaystyle 0}$

This is a number between 0 and 1, where 1 occurs exactly when both texts are the same. If the characters are chosen randomly with the same probability from an alphabet with characters, the expected value for is equal , since every summand is equal with probability (and otherwise equal to 0). ${\ displaystyle \ kappa \,}$ ${\ displaystyle n}$ ${\ displaystyle \ kappa \,}$ ${\ displaystyle {\ tfrac {1} {n}}}$ ${\ displaystyle {\ tfrac {1} {n}}}$ ${\ displaystyle {\ tfrac {1} {k}}}$

Since one often wants to squeeze a lot of information out of short texts in cryptanalysis, the function , which, like the following functions, goes back to Friedman's colleague Solomon Kullback (1935), is occasionally more meaningful: ${\ displaystyle \ chi \,}$

{\ displaystyle \ chi (T, T ') = \ sum _ {i = 1} ^ {k} \ sum _ {j = 1} ^ {k} {\ frac {\ delta (x_ {i}, x' _ {j})} {k ^ {2}}} = \ sum _ {\ ell = 1} ^ {n} {\ frac {m _ {\ ell} \ cdot m '_ {\ ell}} {k ^ {2}}},}

where indicates how often the -th character of the alphabet occurs in the text or . The function depends solely on the frequency of letters in the two texts. Now is ${\ displaystyle m _ {\ ell}, m '_ {\ ell}}$ ${\ displaystyle \ ell}$ ${\ displaystyle T}$ ${\ displaystyle T '}$ ${\ displaystyle \ chi \,}$

{\ displaystyle \ psi (T) = \ chi (T, T). \,}

While applied to two texts of random, equally distributed characters like has the expected value , this is no longer the case for, since it always equals 1. It is therefore sensible to exclude and define the characters in the same position during the summation ${\ displaystyle \ chi \,}$ ${\ displaystyle \ kappa \,}$ ${\ displaystyle {\ tfrac {1} {n}}}$ ${\ displaystyle \ psi \,}$ ${\ displaystyle \ delta (x_ {i}, x_ {i})}$

{\ displaystyle \ phi (T) = \ sum _ {1 \ leq i \ neq j \ leq k} {\ frac {\ delta (x_ {i}, x_ {j})} {k (k-1)} } = \ sum _ {\ ell = 1} ^ {n} {\ frac {m _ {\ ell} (m _ {\ ell} -1)} {k (k-1)}}.}

Just as can be calculated from the letter frequencies of the two texts alone, but for a text made of random characters the expected value is equal to , while it is greater (namely ). The difference is particularly striking for short texts. ${\ displaystyle \ psi \,}$ ${\ displaystyle \ phi \}$ ${\ displaystyle \ phi \}$ ${\ displaystyle {\ tfrac {1} {n}}}$ ${\ displaystyle \ psi \,}$ ${\ displaystyle {\ tfrac {n + k-1} {nk}}}$

meaning

If one moves from texts made up of more or less evenly distributed random characters to texts written in a certain natural language, the value of the coincidence index changes considerably. A rule of thumb is that it will be about twice as big. This applies not only to plain texts , but also to mono-alphabetically encrypted ciphertexts , since the frequencies of the individual letters do not change with these methods. That is, the coincidence index is invariant to these types of encryption.

For example, given the 26 letters of our familiar Latin alphabet (umlauts by ae, oe replaced ue, ß by double-s or sz, spaces and punctuation are ignored), so the value is for German language texts and around 0,078 or 7, 8%, while for English-language texts you get around 6.6%. In both cases, and also for all other languages, this is significantly higher than in the case of the uniform distribution of the letters. If all letters of the alphabet occur with the same frequency, as is the case for "randomly" generated letter sequences (" random texts ") or for "strongly encrypted" ciphertexts, the result is a value of around 1/26, i.e. around 3.8% . The higher value of the coincidence index for the German language compared to the English language primarily reflects the greater frequency of the dominant letter E in German (about 18%) compared to English (about 12%). ${\ displaystyle \ kappa, \ chi \,}$ ${\ displaystyle \ phi \,}$

The expected value of the coincidence index for a language can be derived from the letter frequencies according to the formula ${\ displaystyle E_ {S}}$ ${\ displaystyle S}$

{\ displaystyle E_ {S} = \ sum _ {i = 1} ^ {n} p_ {i} ^ {2}}

Calculate, with the probability of the -th character of the alphabet in texts of the corresponding language. ${\ displaystyle p_ {i}}$ ${\ displaystyle i}$

In related languages, the expected values are often similar , so that, in the case of unknown texts, the coincidence index can be used to make assumptions about the associated language area. ${\ displaystyle E_ {S}}$

Application for cryptanalysis

The essential property here is that with a simple monoalphabetic substitution encryption neither change nor change as long as the texts involved are encrypted in the same way. A linguistic assignment of sufficiently long texts is therefore only possible on a statistical basis. ${\ displaystyle \ kappa, \ chi, \ psi \,}$ ${\ displaystyle \ phi \,}$

In the case of periodic polyalphabetic substitution encryption, the coincidence index is even more valuable, because the key length of the encryption can be estimated using the following formula ( Friedman test ). For the language , the formula for the key length is ${\ displaystyle S}$ ${\ displaystyle h}$

{\ displaystyle h \ approx {\ frac {(E_ {S} - {\ frac {1} {n}}) k} {(k-1) \ phi (T) -k {\ frac {1} {n }} + E_ {S}}}.}

This formula can be derived theoretically under the assumption that all key characters are different. The value is particularly informative for short texts encrypted with short keys, especially in combination with the Kasiski test . If you have longer texts encrypted with longer keywords, the following procedure is more precise.

If you remove the first characters and the last characters from the text , you get two texts, which can be determined. If it is a multiple of the key length, it should be that the compared individual characters are encrypted with the same key. If, however, the key length is not a multiple, then a significantly lower value for is to be expected, since the characters compared are only rarely encrypted in the same way. Repeated sequences in the keyword, which can be used to outsmart the Kasiski test and the Friedman test, only have a rudimentary influence on the results and should usually be recognized. ${\ displaystyle T}$ ${\ displaystyle r}$ ${\ displaystyle r}$ ${\ displaystyle \ kappa \,}$ ${\ displaystyle r}$ ${\ displaystyle \ kappa \ approx E_ {S}}$ ${\ displaystyle r}$ ${\ displaystyle \ kappa \,}$

These methods can also be used profitably with non-periodic polyalphabetic encryption. In particular, in the case of two texts encrypted with the same one-time pad, this cryptographic sin can be recognized immediately by calculating and, for example, using the method of the probable word applied to one of the texts, one of the texts can try to generate plaintext sequences in the other text. ${\ displaystyle T, T '\,}$ ${\ displaystyle \ kappa (T, T ') \,}$

The coincidence index is therefore suitable for so-called ciphertext-only attacks (where nothing needs to be known about the content of the encrypted text) on substitution encryption, which means that these methods (except for a correctly used one-time pad, of course) must be viewed as extremely insecure.

literature

Friedrich L. Bauer : Deciphered Secrets. Methods and maxims of cryptology. 3rd, revised and expanded edition. Springer, Berlin a. a. 2000, ISBN 3-540-67931-6 .
William F. Friedman : The index of coincidence and its applications in cryptology. Riverbank Laboratories - Department of Ciphers, Geneva IL 1922 (Reprinted by Aegean Park Press, Laguna Hills CA 1987, ISBN 0-89412-137-5 [ A Cryptographic Series , 49]).

Web links

William Frederick Friedman: The index of coincidence and its applications in cryptology . WorldCat by