Letter frequency

from Wikipedia, the free encyclopedia

Letter frequency (frequency of graphs) is a statistical quantity that indicates how often a given letter occurs in a text or a collection of texts (corpus). It can be given as an absolute count or relative to the total number of letters in the text. The frequency distribution of letters depends on the language in question. It was once assumed that the statistical distribution of letter frequencies could be predicted by Zipf's law, but quantitative linguistics has shown that a number of other probability distributions must be taken into account. Counts of the frequency of letters or sounds in texts or text corpora are documented from the early 19th century at the latest. For some purposes, how often a letter occurs at the beginning or at the end of a word is also of interest.
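The distinction between absolute and relative frequency can be illustrated with a minimal Python sketch. The function name letter_frequencies and the sample string are illustrative assumptions, not part of any cited study; upper and lower case are merged here, which is one possible counting convention among several.

```python
from collections import Counter

def letter_frequencies(text):
    """Map each letter to (absolute count, relative frequency).

    Assumption: upper and lower case are merged and every non-letter
    character (spaces, digits, punctuation) is ignored.
    """
    letters = [ch.lower() for ch in text if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    # most_common() orders the letters from most to least frequent.
    return {ch: (n, n / total) for ch, n in counts.most_common()}

freqs = letter_frequencies("Effi Briest")
print(freqs["e"])  # absolute count and share of the letter e
```

Other conventions, such as counting ä as ae or ß as ss, would be applied to the text before counting.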

Application

Letter frequency is used in cryptanalysis to break substitution ciphers, as well as in data compression and coding. With simple encryption methods such as the Caesar cipher, a ciphertext can be decrypted by frequency analysis alone. The frequencies of the individual characters in the ciphertext are determined and compared with the character frequencies in plaintext of the assumed language. Each ciphertext letter is then replaced by the plaintext letter of matching frequency. The most common letter of the ciphertext then corresponds, for example, to the plaintext letter e. The method works best on longer ciphertexts, because the statistical deviation between the observed and the expected letter frequencies is smaller.
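The simplest form of this attack on a Caesar cipher can be sketched in Python as follows. The sketch assumes that the most common ciphertext letter stands for the plaintext letter e and that only the letters a to z are shifted; the helper names caesar_shift and guess_shift and the sample sentence are illustrative. On short texts the assumption about e can easily fail.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase

def caesar_shift(text, shift):
    """Shift every letter a-z by `shift` positions; keep other characters."""
    out = []
    for ch in text.lower():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)

def guess_shift(ciphertext, most_common_plain="e"):
    """Guess the key by aligning the top ciphertext letter with `most_common_plain`.

    Assumes the ciphertext contains at least one letter a-z.
    """
    letters = [ch for ch in ciphertext.lower() if ch in ALPHABET]
    top_cipher = Counter(letters).most_common(1)[0][0]
    return (ALPHABET.index(top_cipher) - ALPHABET.index(most_common_plain)) % 26

plaintext = "meet me here when the evening bell echoes between the trees"
ciphertext = caesar_shift(plaintext, 3)
shift = guess_shift(ciphertext)
recovered = caesar_shift(ciphertext, -shift)
print(shift, recovered)
```

For a general monoalphabetic substitution, a single letter is not enough: the whole frequency ranking has to be matched, and the guess usually needs manual correction.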

In typing lessons, the teacher should know the letter frequencies of the language and tailor the lesson content accordingly. Frequent letters such as E or I have to be practised enough to reach the highest possible keystroke rate and reliable typing. Letter frequency also plays a major role in the design of ergonomic keyboard layouts. Manufacturers of letter games such as Boggle or Scrabble likewise take the frequency and, where applicable, the point value of the letters into account in the national variants.

One of the earliest applications was the Morse alphabet, which uses short codes for common characters (for example, E = ·) and longer codes for rarely used characters (for example, Q = - - · -).

Extensions

Letter frequency extends naturally to the frequency of letter pairs and triples, to word frequency, and to the frequency of writing units that stand for a systematic sound unit (graphemes for phonemes). If the spoken rather than the written language is studied, the frequency of sounds or phonemes can be surveyed as well.
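Counting letter pairs and triples works like counting single letters, only with a sliding window. The following Python sketch is an illustration under one stated assumption: sequences are counted within words only and do not span word boundaries. The function name ngram_counts and the sample sentence are hypothetical.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count letter sequences of length n inside each word.

    Assumption: sequences never span a word boundary; non-letters are dropped.
    """
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        # Slide a window of width n across the word.
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts

sample = "der Lehrer lernt gern"
pairs = ngram_counts(sample, 2)
triples = ngram_counts(sample, 3)
print(pairs.most_common(3))
```

Whether sequences may cross word boundaries, and whether the space counts as a character, are choices that differ between studies and change the resulting figures.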

Frequency of letters in German-language texts

The following table shows that the five most common letters account for about half, and the ten most common letters for about three quarters, of all letters in German-language texts. The umlauts ä, ö and ü were counted as ae, oe and ue; ß was counted as a separate character.

Rank  Letter  Relative frequency
1     E       17.40%
2     N       9.78%
3     I       7.55%
4     S       7.27%
5     R       7.00%
6     A       6.51%
7     T       6.15%
8     D       5.08%
9     H       4.76%
10    U       4.35%
11    L       3.44%
12    C       3.06%
13    G       3.01%
14    M       2.53%
15    O       2.51%
16    B       1.89%
17    W       1.89%
18    F       1.66%
19    K       1.21%
20    Z       1.13%
21    P       0.79%
22    V       0.67%
23    ß       0.31%
24    J       0.27%
25    Y       0.04%
26    X       0.03%
27    Q       0.02%

For comparison: if the 27 letters were evenly distributed, each would have a relative frequency of 3.704%.
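The coverage figures and the uniform share can be checked with a few lines of Python. The values are copied from the ten top rows of the table above; the names TOP_TEN and coverage are illustrative.

```python
# Relative frequencies in percent of the ten most common letters,
# copied from the table above.
TOP_TEN = [("E", 17.40), ("N", 9.78), ("I", 7.55), ("S", 7.27), ("R", 7.00),
           ("A", 6.51), ("T", 6.15), ("D", 5.08), ("H", 4.76), ("U", 4.35)]

def coverage(table, k):
    """Sum of the relative frequencies of the k most common letters."""
    return sum(share for _, share in table[:k])

top5 = round(coverage(TOP_TEN, 5), 2)     # about 49%, roughly half
top10 = round(coverage(TOP_TEN, 10), 2)   # about 76%, roughly three quarters
uniform = round(100 / 27, 3)              # share per letter if all 27 were equally likely

print(top5, top10, uniform)
```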

For comparison, the following table is based on a file containing 99,586 letters from a mixed corpus of one person's correspondence (letters to authorities, friends, colleagues, broadcasters, publishers, ...; body text only, that is, without letterhead, salutation and closing; letters from 1996–2004). In contrast to the previous overview, the umlaut letters <ä>, <ö> and <ü> are each counted separately.

Rank  Letter  Absolute frequency  Relative frequency
1     E       16,040              16.11%
2     N       10,288              10.33%
3     I       9,011               9.05%
4     R       6,693               6.72%
5     T       6,312               6.34%
6     S       6,203               6.23%
7     A       5,577               5.60%
8     H       5,177               5.20%
9     D       4,156               4.17%
10    U       3,680               3.70%
11    C       3,384               3.40%
12    L       3,226               3.24%
13    G       2,924               2.94%
14    M       2,784               2.80%
15    O       2,312               2.32%
16    B       2,176               2.19%
17    F       1,701               1.71%
18    W       1,383               1.39%
19    Z       1,351               1.36%
20    K       1,329               1.33%
21    V       912                 0.92%
22    P       841                 0.84%
23    Ü       636                 0.64%
24    Ä       511                 0.51%
25    Ö       363                 0.36%
26    ß       189                 0.19%
27    J       186                 0.19%
28    X       112                 0.11%
29    Q       73                  0.07%
30    Y       56                  0.06%

The Institute for German Language in Mannheim offers various character and letter frequency lists for download on its website. The statistics are based on a text sample of almost 180 billion characters from the German reference corpus (as of 2018).

Duden provides an overview of letter frequencies in the form of a bar chart based on the Duden corpus, a full-text collection with over 2 billion word forms; the umlaut letters are listed individually in this overview as well. The chart was revised for the 27th edition of the spelling Duden, now based on a Duden corpus of 4 billion word forms (as of spring 2017).

First letters

The frequency of initial letters indicates how often a letter appears as the first letter of a word. It depends to a large extent on the type of text. The five most common initial letters for running text are:

Rank  Letter  Relative frequency
1     D       14.2%
2     S       10.8%
3     E       7.8%
4     I       7.1%
5     W       6.8%

Dictionaries show a different distribution. The letters D, E, I and W appear much less often at the beginning of a word than in running text, and S is the most common initial letter by a clear margin:

Rank  Letter  Relative frequency
1     S       11.8%
2     K       7.3%
3     A       7.1%
4     P       7.0%
5     B       5.7%
6     M       5.7%

Final letters

The frequency of final letters indicates how often a letter occurs as the last letter of a word. (The example text is the novel Effi Briest by Theodor Fontane, with ß always counted as ss. It comprises all 36 chapters of the work, a total of 572,849 characters.)

Rank  Letter  Relative frequency
1     N       21.0%
2     E       15.1%
3     R       13.0%
4     T       10.3%
5     S       9.6%
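Initial and final letter frequencies are obtained by looking only at the first or last letter of each word. The following Python sketch shows one way to do this; it assumes that words are separated by whitespace and that punctuation attached to a word is ignored. The function name position_frequencies and the sample sentence are illustrative and unrelated to the corpora cited above.

```python
from collections import Counter

def position_frequencies(text):
    """Return (initial, final): relative frequencies of first and last letters.

    Assumptions: words are separated by whitespace, case is ignored,
    and non-letter characters attached to a word are dropped.
    """
    words = []
    for token in text.lower().split():
        letters = "".join(ch for ch in token if ch.isalpha())
        if letters:
            words.append(letters)
    total = len(words)
    initial = Counter(word[0] for word in words)
    final = Counter(word[-1] for word in words)
    return ({ch: n / total for ch, n in initial.items()},
            {ch: n / total for ch, n in final.items()})

sample = "Die Sonne und der Wind waren den ganzen Tag da."
initial, final = position_frequencies(sample)
print(initial["d"], final["n"])
```

Applied to a word list instead of running text, the same function counts each word once, which is why dictionaries and running text give different distributions.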

Frequency charts

Frequency of letters in selected languages

Letter  German    English   French    Spanish   Esperanto Italian   Swedish   Polish
a       6.51%     8.167%    7.636%    12.53%    12.12%    11.74%    9.3%      8.0%
b       1.89%     1.492%    0.901%    1.42%     0.98%     0.92%     1.3%      1.3%
c       3.06%     2.782%    3.260%    4.68%     0.78%     4.5%      1.3%      3.8%
d       5.08%     4.253%    3.669%    5.86%     3.04%     3.73%     4.5%      3.0%
e       17.40%    12.702%   14.715%   13.68%    8.99%     11.79%    9.9%      6.9%
f       1.66%     2.228%    1.066%    0.69%     1.03%     0.95%     2.0%      0.1%
g       3.01%     2.015%    0.866%    1.01%     1.17%     1.64%     3.3%      1.0%
h       4.76%     6.094%    0.737%    0.70%     0.38%     1.54%     2.1%      1.0%
i       7.55%     6.966%    7.529%    6.25%     10.01%    11.28%    5.1%      7.0%
j       0.27%     0.153%    0.545%    0.44%     3.50%     0.00%     0.7%      1.9%
k       1.21%     0.772%    0.049%    0.00%     4.16%     0.00%     3.2%      2.7%
l       3.44%     4.025%    5.456%    4.97%     6.14%     6.51%     5.2%      3.1%
m       2.53%     2.406%    2.968%    3.15%     2.99%     2.51%     3.5%      2.4%
n       9.78%     6.749%    7.095%    6.71%     7.96%     6.88%     8.8%      4.7%
o       2.51%     7.507%    5.378%    8.68%     8.78%     9.83%     4.1%      7.1%
p       0.79%     1.929%    3.021%    2.51%     2.74%     3.05%     1.7%      2.4%
q       0.02%     0.095%    1.362%    0.88%     0.00%     0.51%     0.007%    0.00%
r       7.00%     5.987%    6.553%    6.87%     5.91%     6.37%     8.3%      3.5%
s       7.27%     6.327%    7.948%    7.98%     6.09%     4.98%     6.3%      3.8%
t       6.15%     9.056%    7.244%    4.63%     5.27%     5.62%     8.7%      2.4%
u       4.35%     2.758%    6.311%    3.93%     3.18%     3.01%     1.8%      1.8%
v       0.67%     0.978%    1.628%    0.90%     1.90%     2.10%     2.4%      0.00%
w       1.89%     2.360%    0.114%    0.02%     0.00%     0.00%     0.03%     3.6%
x       0.03%     0.150%    0.387%    0.22%     0.00%     0.00%     0.1%      0.00%
y       0.04%     1.974%    0.308%    0.90%     0.00%     0.00%     0.6%      3.2%
z       1.13%     0.074%    0.136%    0.52%     0.50%     0.49%     0.02%     5.1%
œ       0.00%     0.00%     0.018%    0.00%     0.00%     0.00%     0.00%     0.00%
ß       0.31%     0.00%     0.000%    0.00%     0.00%     0.00%     0.00%     0.00%
à       0.00%     0.00%     0.486%    0.00%     0.00%     see a     0.00%     0.00%
ą       0.00%     0.00%     0.00%     0.00%     0.00%     0.00%     0.00%     see a
ç       0.00%     0.00%     0.085%    0.00%     0.00%     0.00%     0.00%     0.00%
ĉ       0.00%     0.00%     0.000%    0.00%     0.66%     0.00%     0.00%     0.00%
ć       0.00%     0.00%     0.000%    0.00%     0.00%     0.00%     0.00%     see c
è       0.00%     0.00%     0.271%    0.00%     0.00%     see e     0.00%     0.00%
é       0.01%     0.00%     1.904%    0.00%     0.00%     see e     0.00%     0.00%
ê       0.00%     0.00%     0.225%    0.00%     0.00%     0.00%     0.00%     0.00%
ë       0.00%     0.00%     0.000%    0.00%     0.00%     0.00%     0.00%     0.00%
ę       0.00%     0.00%     0.000%    0.00%     0.00%     0.00%     0.00%     see e
ĝ       0.00%     0.00%     0.000%    0.00%     0.69%     0.00%     0.00%     0.00%
ĥ       0.00%     0.00%     0.000%    0.00%     0.02%     0.00%     0.00%     0.00%
î       0.00%     0.00%     0.045%    0.00%     0.00%     0.00%     0.00%     0.00%
ì       0.00%     0.00%     0.000%    0.00%     0.00%     see i     0.00%     0.00%
ï       0.00%     0.01%     0.005%    0.00%     0.00%     0.00%     0.00%     0.00%
ĵ       0.00%     0.00%     0.000%    0.00%     0.12%     0.00%     0.00%     0.00%
ł       0.00%     0.00%     0.000%    0.00%     0.00%     0.00%     0.00%     see l
ń       0.00%     0.00%     0.000%    0.00%     0.00%     0.00%     0.00%     see n
ó       0.00%     0.00%     0.000%    0.00%     0.00%     0.00%     0.00%     see o
ò       0.00%     0.00%     0.000%    0.00%     0.00%     see o     0.00%     0.00%
ŝ       0.00%     0.00%     0.000%    0.00%     0.38%     0.00%     0.00%     0.00%
ś       0.00%     0.00%     0.000%    0.00%     0.00%     0.00%     0.00%     see s
ù       0.00%     0.00%     0.058%    0.00%     0.00%     see u     0.00%     0.00%
ŭ       0.00%     0.00%     0.000%    0.00%     0.52%     0.00%     0.00%     0.00%
ź       0.00%     0.00%     0.000%    0.00%     0.00%     0.00%     0.00%     see z
ż       0.00%     0.00%     0.000%    0.00%     0.00%     0.00%     0.00%     0.7%

Particularly noteworthy in the table is that the letter E is used significantly more often in German, and the letter O significantly less often, than in the Romance and Slavic languages.

The table only shows the frequencies of letters in texts and corpora of languages written in the Latin script. For letter frequencies in languages written in the Cyrillic script, see the description by Kempgen (1995) for Russian and the study by Grzybek & Kelih (2005) for Ukrainian.

Literature

  • Friedrich L. Bauer: Deciphered Secrets. Methods and maxims of cryptology. Springer, Berlin et al. 1995, ISBN 3-540-58118-9. (Contains letter frequencies in German and English with percentages on page 223.)
  • Karl-Heinz Best: On the frequency of letters, spaces and other characters in German texts. In: Glottometrics 11, 2005, ISSN 1617-8351, pages 9-31. (In addition to the frequency of letters, it also gives the proportions of other characters in German texts.)
  • Erich Mater: German verbs. 1. Alphabetical index. Bibliographisches Institut, Leipzig 1966. (Contains in the opening chapter an overview of the frequency of first letters in 6 different dictionaries as well as an overview. Unfortunately no page count.)
  • Helmut Meier: German language statistics. (= Olms Paperbacks. 31). 2nd, enlarged and improved edition. Olms, Hildesheim 1967. (Letter statistics for German, English and Spanish on page 334.)
  • Gustav Muthmann: Declining German dictionary. Handbook of word exits in German, taking into account the word and sound structure. (= German Linguistics series. 78). Niemeyer, Tübingen 1988, ISBN 3-484-31078-2. (Page 36 has a compilation of the frequencies of initial letters, page 65 of final letters.)
  • Gustav Muthmann: Phonological dictionary of the German language. (= German Linguistics series. 163). Niemeyer, Tübingen 1996, ISBN 3-484-31163-0, pages 35-37. (Frequency of graphemes and phonemes.)
  • Wolfgang Schönpflug: n-gram frequency in the German language. I. Monograms and digrams. In: Journal for Experimental and Applied Psychology. 16, 1969, ISSN 0044-2712, pages 157-183. (Page 162f. contains an overview of the frequency of letters in a text corpus of over 100,000 words, separated according to their position in the word.)
  • Katja Siekmann, Günther Thomé: The spelling mistake. 2nd, updated edition. isb-Verlag, Oldenburg 2018, ISBN 978-3-942122-07-8. (Contains detailed overviews on pages 239 to 247 of the frequency of letters and letter combinations from a more recent 100,000 count of phoneme-grapheme correspondences in German.)
  • Dorothea Thomé, Günther Thomé: Phonemes and graphemes in German: three diagrams. 1. The sounds of German (according to the standard wording), 2. Basic graphemes (basic characters for phonemes), 3. All basic and orthographemes (what is how often?). isb-Fachverlag, Oldenburg 2014, ISBN 978-3-942122-15-3.
  • Günther Thomé, Dorothea Thomé: German words structured according to phonetic and written units. isb-Fachverlag, Oldenburg 2016, ISBN 978-3-942122-21-4. (With numerous tables on the frequency of phonetic and written units in German.)

References

  1. See also: Archived copy (Memento of the original from April 7, 2015 in the Internet Archive). Letters, sounds and phonemes basically follow the same distributions.
  2. Karl-Heinz Best: phonetic and letter counting in the early 19th century. In: Glottometrics 20, 2010, pages 110-114.
  3. Albrecht Beutelspacher: Kryptologie. 7th edition. Vieweg Verlagsgesellschaft, Wiesbaden 2005, ISBN 3-8348-0014-7, page 10.
  4. Karl-Heinz Best: Letter frequencies in German and English. In: Naukovyj Visnyk Černivec'koho Universitetu. Vypusk 231, 2005, ZDB-ID 2390772-1, pages 119-127.
  5. Institute for German Language: Corpus Linguistics: Corpus-based character and letter frequency lists. Retrieved on March 20, 2018 (German).
  6. Duden - German Universal Dictionary. 7th, revised and expanded edition. Dudenverlag, Mannheim / Zurich 2011, ISBN 978-3-411-05507-4, page 2110.
  7. Duden. The German spelling. 27th, completely revised and expanded edition. Dudenverlag, Berlin 2017, ISBN 978-3-411-04017-9, pages 148, 158.
  8. Peter Vogelgesang: Frequency of letters. (Memento of February 9, 2006 in the Internet Archive) 2003.
  9. Robert Edward Lewand: Relative Frequencies of Letters in General English Plain text.
  10. CorpusDeThomasTempé. (Memento from February 13, 2008 in the Internet Archive)
  11. Fletcher Pratt: Secret and Urgent: the Story of Codes and Ciphers. Blue Ribbon Books, 1939, pp. 254-255.
  12. La Oftecoj de la Esperantaj Literoj. Retrieved September 14, 2007.
  13. Simon Singh: Codici e Segreti. RCS, 1999, ISBN 88-17-12539-3.
  14. Simon Singh, Margareta Brogren: Kodboken: konsten att skapa sekretess - från det gamla Egypt till kvantkryptering. Norstedt, Stockholm 1999, ISBN 91-1-300708-4.
  15. Wstęp do kryptologii. (MS Word; 300 kB) Retrieved April 30, 2012.
  16. Sebastian Kempgen: Russian language statistics. Systematic overview and bibliography. Verlag Otto Sagner, Munich 1995, ISBN 3-87690-617-2, pages 19-22.
  17. Peter Grzybek, Emmerich Kelih: Grapheme frequencies in Ukrainian. Part I: Without an apostrophe ('). In: Gabriel Altmann, Viktor Levickij, & Valentina Perebyinis (eds.): Problemy kvantytatyvnoi linhvistyky / Problems of Quantitative Linguistics: zbirnyk naukovych prac (pp. 159-179). Ruta, Cernivci 2005, ISBN 966-568-783-2.