# Letter frequency

Letter frequency (grapheme frequency) is a statistical quantity that indicates how often a given letter occurs in a text or a collection of texts (corpus). It can be stated as an absolute count or relative to the total number of letters in the text. The frequency distribution of letters depends on the language. Earlier work assumed that the distribution of letter frequencies could be predicted by Zipf's law, but quantitative linguistics has shown that a number of other probability distributions must be considered. Counts of letter or sound frequencies in texts and corpora are documented since the early 19th century at the latest. For some purposes it is also of interest how often a letter occurs at the beginning or at the end of a word.
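
As a minimal sketch of the two ways of stating a frequency, the following Python snippet counts the letters of a short string and reports each letter's absolute count and its share of all letters. The function name `letter_frequencies` and the sample text are illustrative assumptions, not part of any cited study.

```python
from collections import Counter

def letter_frequencies(text):
    """Map each letter to (absolute count, share of all letters)."""
    # Ignore case, digits, punctuation and whitespace.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {ch: (n, n / total) for ch, n in counts.items()}

freqs = letter_frequencies("Hello, World!")
for letter, (absolute, relative) in sorted(freqs.items()):
    print(f"{letter}: {absolute} ({relative:.1%})")
```

Real counts are taken over large corpora; a text this short only illustrates the bookkeeping.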

## Application

Letter frequency is used in cryptanalysis to break substitution ciphers, as well as in data compression and coding. Simple ciphers such as the Caesar cipher can be broken by frequency analysis alone. The frequency of each character in the ciphertext is determined and compared with the character frequencies of plaintext in the assumed language. Each ciphertext letter is then replaced by the plaintext letter of matching frequency rank; the most common ciphertext letter, for example, is likely to correspond to the plaintext letter e. The method works best on longer ciphertexts, because the statistical deviation between the observed and the expected letter frequencies is smaller.
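
The attack on a Caesar cipher can be sketched in a few lines of Python. This is a simplified illustration, not a general substitution solver: it assumes that the most common ciphertext letter stands for plaintext e and that only the 26 unaccented letters are shifted. The helper names `shift_text` and `guess_caesar_shift` and the sample sentence are made up for this example.

```python
from collections import Counter
import string

def shift_text(text, shift):
    """Shift every ASCII letter by `shift` positions; keep other characters."""
    out = []
    for ch in text:
        if ch in string.ascii_letters:
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def guess_caesar_shift(ciphertext, most_common_plain="e"):
    """Guess the shift by assuming the top ciphertext letter encodes `most_common_plain`."""
    counts = Counter(ch for ch in ciphertext.lower() if ch in string.ascii_lowercase)
    if not counts:
        return 0
    top_letter = counts.most_common(1)[0][0]
    return (ord(top_letter) - ord(most_common_plain)) % 26

plaintext = "Meet me here when the evening bell strikes seven"
ciphertext = shift_text(plaintext, 3)        # encrypt with a shift of 3
guessed = guess_caesar_shift(ciphertext)     # recover the shift from frequencies
recovered = shift_text(ciphertext, -guessed)
print(guessed, recovered)
```

On short or unusual texts the most common letter may not be e, in which case the guess fails; comparing the whole frequency profile against a reference table is more robust.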

In typing lessons, the teacher should know the letter frequencies of the language and tailor the lesson content accordingly. Frequent letters such as E or I must be practised enough to achieve a high keystroke rate and good typing accuracy. Letter frequency also plays a major role in the design of ergonomic keyboard layouts. Manufacturers of letter games such as Boggle or Scrabble likewise take letter frequency into account when setting the number and, where applicable, the point value of the letters in each national edition.

One of the earliest applications was Morse code, which assigns short codes to common characters (for example, E = ·) and longer codes to rarely used characters (for example, Q = - - · -).
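
The relationship between frequency and code length can be illustrated with a small Python sketch. It pairs the International Morse codes of three frequent and three rare letters with the German relative frequencies given later in this article; the selection of letters and the variable names are choices made for this example.

```python
# International Morse codes for a few letters ("." = dot, "-" = dash).
MORSE = {"E": ".", "N": "-.", "I": "..", "Q": "--.-", "X": "-..-", "Y": "-.--"}

# Relative frequencies in German text (percent), taken from the table in this article.
GERMAN_PERCENT = {"E": 17.40, "N": 9.78, "I": 7.55, "Q": 0.02, "X": 0.03, "Y": 0.04}

# List the letters from most to least frequent with their code lengths.
for letter in sorted(MORSE, key=GERMAN_PERCENT.get, reverse=True):
    print(f"{letter}: {GERMAN_PERCENT[letter]:5.2f}%  {MORSE[letter]:<4}  length {len(MORSE[letter])}")

frequent_lengths = [len(MORSE[ch]) for ch in "ENI"]
rare_lengths = [len(MORSE[ch]) for ch in "QXY"]
```

Morse code was designed around English rather than German letter counts, so the match with the German figures is only approximate.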

## Continuation

Letter frequency extends naturally to the frequency of letter pairs and letter triples, to word frequency, and to the frequency of writing units that stand for a systematic sound unit (graphemes for phonemes). For spoken rather than written language, the frequency of sounds or phonemes can be surveyed in the same way.
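
Counting letter pairs and triples works like counting single letters, only with a sliding window. The following Python sketch counts n-grams inside words, so that no pair spans a word boundary; this boundary rule, the function name `ngram_counts` and the sample phrase are assumptions made for illustration.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count letter sequences of length n within each word of the text."""
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        # Slide a window of width n across the word.
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts

sample = "einen kleinen Stein"
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
print(bigrams.most_common(3))
print(trigrams.most_common(1))
```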

## Frequency of letters in German-language texts

The following table shows that the five most common letters account for around half, and the ten most common for about three quarters, of all letters in German-language texts. The umlauts ä, ö and ü were counted as ae, oe and ue; ß was counted as a separate character.

| Rank | Letter | Relative frequency |
|---|---|---|
| 1 | E | 17.40% |
| 2 | N | 9.78% |
| 3 | I | 7.55% |
| 4 | S | 7.27% |
| 5 | R | 7.00% |
| 6 | A | 6.51% |
| 7 | T | 6.15% |
| 8 | D | 5.08% |
| 9 | H | 4.76% |
| 10 | U | 4.35% |
| 11 | L | 3.44% |
| 12 | C | 3.06% |
| 13 | G | 3.01% |
| 14 | M | 2.53% |
| 15 | O | 2.51% |
| 16 | B | 1.89% |
| 17 | W | 1.89% |
| 18 | F | 1.66% |
| 19 | K | 1.21% |
| 20 | Z | 1.13% |
| 21 | P | 0.79% |
| 22 | V | 0.67% |
| 23 | ß | 0.31% |
| 24 | J | 0.27% |
| 25 | Y | 0.04% |
| 26 | X | 0.03% |
| 27 | Q | 0.02% |

For comparison: if the 27 letters were evenly distributed, each would have a relative frequency of 3.704%.
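
The coverage figures and the uniform baseline can be checked with a few lines of Python. The list below simply repeats the ten highest percentages from the table above; the variable names are illustrative.

```python
# Ten highest relative frequencies from the table above (percent):
# E, N, I, S, R, A, T, D, H, U
TOP_TEN_PERCENT = [17.40, 9.78, 7.55, 7.27, 7.00, 6.51, 6.15, 5.08, 4.76, 4.35]

top5_share = sum(TOP_TEN_PERCENT[:5])   # share covered by the five most common letters
top10_share = sum(TOP_TEN_PERCENT)      # share covered by the ten most common letters
uniform_share = 100 / 27                # share per letter if all 27 were equally likely

print(f"top 5:   {top5_share:.2f}%")
print(f"top 10:  {top10_share:.2f}%")
print(f"uniform: {uniform_share:.3f}%")
```

The sums come to 49.00% and 75.85%, which matches "around half" and "three quarters".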

For comparison, the following table is based on a file of 99,586 letters drawn from a mixed corpus of one person's correspondence (with authorities, friends, colleagues, broadcasters, publishers ...; body text only, i.e. without letterhead, salutation and closing; letters from 1996–2004). In contrast to the previous overview, the umlaut letters <ä>, <ö> and <ü> are counted separately.

| Rank | Letter | Absolute frequency | Relative frequency |
|---|---|---|---|
| 1 | E | 16,040 | 16.11% |
| 2 | N | 10,288 | 10.33% |
| 3 | I | 9,011 | 9.05% |
| 4 | R | 6,693 | 6.72% |
| 5 | T | 6,312 | 6.34% |
| 6 | S | 6,203 | 6.23% |
| 7 | A | 5,577 | 5.60% |
| 8 | H | 5,177 | 5.20% |
| 9 | D | 4,156 | 4.17% |
| 10 | U | 3,680 | 3.70% |
| 11 | C | 3,384 | 3.40% |
| 12 | L | 3,226 | 3.24% |
| 13 | G | 2,924 | 2.94% |
| 14 | M | 2,784 | 2.80% |
| 15 | O | 2,312 | 2.32% |
| 16 | B | 2,176 | 2.19% |
| 17 | F | 1,701 | 1.71% |
| 18 | W | 1,383 | 1.39% |
| 19 | Z | 1,351 | 1.36% |
| 20 | K | 1,329 | 1.33% |
| 21 | V | 912 | 0.92% |
| 22 | P | 841 | 0.84% |
| 23 | Ü | 636 | 0.64% |
| 24 | Ä | 511 | 0.51% |
| 25 | Ö | 363 | 0.36% |
| 26 | ß | 189 | 0.19% |
| 27 | J | 186 | 0.19% |
| 28 | X | 112 | 0.11% |
| 29 | Q | 73 | 0.07% |
| 30 | Y | 56 | 0.06% |

The Institute for German Language in Mannheim offers various character and letter frequency lists for download on its website. The statistics are based on a text sample of almost 180 billion characters from the German reference corpus (as of 2018).

Duden offers an overview of letter frequency in the form of a bar chart based on the Duden corpus, a full-text collection with over 2 billion word forms; the umlaut letters are also listed individually in this overview. The chart was revised for the 27th edition of the Duden spelling dictionary, using the Duden corpus with 4 billion word forms (as of spring 2017).

### Initial letters

The frequency of initial letters indicates how often a letter appears as the first letter of a word. It depends to a large extent on the type of text. The five most common initial letters for running text are:

| Rank | Letter | Relative frequency |
|---|---|---|
| 1 | D | 14.2% |
| 2 | | 10.8% |
| 3 | | 7.8% |
| 4 | | 7.1% |
| 5 | | 6.8% |

Lexicons show a different distribution. The letters D, E, I and W appear much less often at the beginning of an entry than at the beginning of a word in running text, and S is the most common initial letter by a clear margin:

| Rank | Letter | Relative frequency |
|---|---|---|
| 1 | S | 11.8% |
| 2 | K | 7.3% |
| 3 | A | 7.1% |
| 4 | P | 7.0% |
| 5 | B | 5.7% |
| 6 | M | 5.7% |

### Final letters

The frequency of final letters indicates how often a letter occurs as the last letter of a word. The figures below are based on the novel Effi Briest by Theodor Fontane, with ß always counted as ss. The text comprises all 36 chapters of the work, a total of 572,849 characters.

| Rank | Letter | Relative frequency |
|---|---|---|
| 1 | N | 21.0% |
| 2 | E | 15.1% |
| 3 | R | 13.0% |
| 4 | T | 10.3% |
| 5 | S | 9.6% |

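Initial and final letter frequencies can be counted with the same routine by looking only at the first or the last letter of each word. The following Python sketch is one possible way to do this; the function name `edge_letter_shares`, the word-splitting rule and the sample sentence are assumptions, and ß is replaced by ss as described for the Effi Briest count.

```python
from collections import Counter
import re

def edge_letter_shares(text, position):
    """Share of words starting (position=0) or ending (position=-1) with each letter."""
    normalized = text.lower().replace("ß", "ss")
    # A word is a maximal run of letters; digits and punctuation split words.
    words = re.findall(r"[^\W\d_]+", normalized)
    counts = Counter(word[position] for word in words)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.most_common()}

sample = "Der Hund und die Katze laufen nach Hause"
initial = edge_letter_shares(sample, 0)
final = edge_letter_shares(sample, -1)
print(initial)
print(final)
```

A single sentence gives no meaningful statistics; the shares only stabilize on long texts such as a complete novel.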
## Frequency of letters in selected languages

| Letter | German | English | French | Spanish | Esperanto | Italian | Swedish | Polish |
|---|---|---|---|---|---|---|---|---|
| a | 6.51% | 8.167% | 7.636% | 12.53% | 12.12% | 11.74% | 9.3% | 8.0% |
| b | 1.89% | 1.492% | 0.901% | 1.42% | 0.98% | 0.92% | 1.3% | 1.3% |
| c | 3.06% | 2.782% | 3.260% | 4.68% | 0.78% | 4.5% | 1.3% | 3.8% |
| d | 5.08% | 4.253% | 3.669% | 5.86% | 3.04% | 3.73% | 4.5% | 3.0% |
| e | 17.40% | 12.702% | 14.715% | 13.68% | 8.99% | 11.79% | 9.9% | 6.9% |
| f | 1.66% | 2.228% | 1.066% | 0.69% | 1.03% | 0.95% | 2.0% | 0.1% |
| g | 3.01% | 2.015% | 0.866% | 1.01% | 1.17% | 1.64% | 3.3% | 1.0% |
| h | 4.76% | 6.094% | 0.737% | 0.70% | 0.38% | 1.54% | 2.1% | 1.0% |
| i | 7.55% | 6.966% | 7.529% | 6.25% | 10.01% | 11.28% | 5.1% | 7.0% |
| j | 0.27% | 0.153% | 0.545% | 0.44% | 3.50% | 0.00% | 0.7% | 1.9% |
| k | 1.21% | 0.772% | 0.049% | 0.00% | 4.16% | 0.00% | 3.2% | 2.7% |
| l | 3.44% | 4.025% | 5.456% | 4.97% | 6.14% | 6.51% | 5.2% | 3.1% |
| m | 2.53% | 2.406% | 2.968% | 3.15% | 2.99% | 2.51% | 3.5% | 2.4% |
| n | 9.78% | 6.749% | 7.095% | 6.71% | 7.96% | 6.88% | 8.8% | 4.7% |
| o | 2.51% | 7.507% | 5.378% | 8.68% | 8.78% | 9.83% | 4.1% | 7.1% |
| p | 0.79% | 1.929% | 3.021% | 2.51% | 2.74% | 3.05% | 1.7% | 2.4% |
| q | 0.02% | 0.095% | 1.362% | 0.88% | 0.00% | 0.51% | 0.007% | 0.00% |
| r | 7.00% | 5.987% | 6.553% | 6.87% | 5.91% | 6.37% | 8.3% | 3.5% |
| s | 7.27% | 6.327% | 7.948% | 7.98% | 6.09% | 4.98% | 6.3% | 3.8% |
| t | 6.15% | 9.056% | 7.244% | 4.63% | 5.27% | 5.62% | 8.7% | 2.4% |
| u | 4.35% | 2.758% | 6.311% | 3.93% | 3.18% | 3.01% | 1.8% | 1.8% |
| v | 0.67% | 0.978% | 1.628% | 0.90% | 1.90% | 2.10% | 2.4% | 0.00% |
| w | 1.89% | 2.360% | 0.114% | 0.02% | 0.00% | 0.00% | 0.03% | 3.6% |
| x | 0.03% | 0.150% | 0.387% | 0.22% | 0.00% | 0.00% | 0.1% | 0.00% |
| y | 0.04% | 1.974% | 0.308% | 0.90% | 0.00% | 0.00% | 0.6% | 3.2% |
| z | 1.13% | 0.074% | 0.136% | 0.52% | 0.50% | 0.49% | 0.02% | 5.1% |
| œ | 0.00% | 0.00% | 0.018% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| ß | 0.31% | 0.00% | 0.000% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| à | 0.00% | 0.00% | 0.486% | 0.00% | 0.00% | see a | 0.00% | 0.00% |
| ą | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | see a |
| ç | 0.00% | 0.00% | 0.085% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| ĉ | 0.00% | 0.00% | 0.000% | 0.00% | 0.66% | 0.00% | 0.00% | 0.00% |
| ć | 0.00% | 0.00% | 0.000% | 0.00% | 0.00% | 0.00% | 0.00% | see c |
| è | 0.00% | 0.00% | 0.271% | 0.00% | 0.00% | see e | 0.00% | 0.00% |
| é | 0.01% | 0.00% | 1.904% | 0.00% | 0.00% | see e | 0.00% | 0.00% |
| ê | 0.00% | 0.00% | 0.225% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| ë | 0.00% | 0.00% | 0.000% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| ę | 0.00% | 0.00% | 0.000% | 0.00% | 0.00% | 0.00% | 0.00% | see e |
| ĝ | 0.00% | 0.00% | 0.000% | 0.00% | 0.69% | 0.00% | 0.00% | 0.00% |
| ĥ | 0.00% | 0.00% | 0.000% | 0.00% | 0.02% | 0.00% | 0.00% | 0.00% |
| î | 0.00% | 0.00% | 0.045% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| ì | 0.00% | 0.00% | 0.000% | 0.00% | 0.00% | see i | 0.00% | 0.00% |
| ï | 0.00% | 0.01% | 0.005% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| ĵ | 0.00% | 0.00% | 0.000% | 0.00% | 0.12% | 0.00% | 0.00% | 0.00% |
| ł | 0.00% | 0.00% | 0.000% | 0.00% | 0.00% | 0.00% | 0.00% | see l |
| ń | 0.00% | 0.00% | 0.000% | 0.00% | 0.00% | 0.00% | 0.00% | see n |
| ó | 0.00% | 0.00% | 0.000% | 0.00% | 0.00% | 0.00% | 0.00% | see o |
| ò | 0.00% | 0.00% | 0.000% | 0.00% | 0.00% | see o | 0.00% | 0.00% |
| ŝ | 0.00% | 0.00% | 0.000% | 0.00% | 0.38% | 0.00% | 0.00% | 0.00% |
| ś | 0.00% | 0.00% | 0.000% | 0.00% | 0.00% | 0.00% | 0.00% | see s |
| ù | 0.00% | 0.00% | 0.058% | 0.00% | 0.00% | see u | 0.00% | 0.00% |
| ŭ | 0.00% | 0.00% | 0.000% | 0.00% | 0.52% | 0.00% | 0.00% | 0.00% |
| ź | 0.00% | 0.00% | 0.000% | 0.00% | 0.00% | 0.00% | 0.00% | see z |
| ż | 0.00% | 0.00% | 0.000% | 0.00% | 0.00% | 0.00% | 0.00% | 0.7% |

Particularly noteworthy in the table is that the letter E is used significantly more often in German, and the letter O significantly less often, than in the Romance and Slavic languages.

The table shows letter frequencies only for texts and corpora of languages written in the Latin script. For letter frequencies in languages written in the Cyrillic script, see the description by Kempgen (1995) for Russian and the study by Grzybek & Kehlich (2005) for Ukrainian.

