Law of distribution of word lengths

from Wikipedia, the free encyclopedia

The law of the distribution of word lengths states that words of different lengths do not occur with arbitrary frequencies in texts and/or dictionaries, but are distributed according to a law.

Word length can be defined in different ways; most commonly it is given as the number of letters, sounds, morphs or syllables per word. Whichever definition is chosen, the frequencies with which words of each length occur in a text or in the lexicon are expected to follow a regular distribution. The law of the distribution of word lengths is one of many laws proposed in quantitative linguistics. It was derived more recently by Altmann, Wimmer and others; the earlier proposals for this law, made from the 1940s onwards by Sergei Tschebanow (1947), William Palin Elderton (1949) and Wilhelm Fucks (1955), are contained in this newer theory as special cases. A large number of tests in German and more than 50 other languages (over 4000 texts and some dictionaries) confirm the theory (Best 1997, 2001, 2003; Schmidt 1997). Word length is by far the most thoroughly researched quantity in language. For the history of the law since the 1940s and criticism of it, see Grzybek (2006). The Hyperpoisson distribution has proved to be a particularly frequently applicable model; depending on the language, author, period and text type, other models often have to be used.

The law applies analogously to other linguistic units such as morphs, rhythmic units, sentences and syllables (see law of distribution of morph lengths, law of distribution of rhythmic units of different length, law of distribution of sentence lengths, law of distribution of syllable lengths).

Investigations on word length distributions in German

The empirical finding for German is that monosyllabic words are always the most common, followed by two-syllable words, and so on. This holds from Old High German onwards, for all authors and in all text types. Almost 2000 texts were examined, always with the same result. All but 5 of these texts conform to the Hyperpoisson distribution.

An example of a word length distribution (measured as the number of syllables per word) in a letter by Kurt Tucholsky:

x    n(x)    NP(x)
1    522     521.4
2    250     247.56
3    87      92.69
4    32      28.64
5    7       7.53
6    2       2.18

(Here x is the number of syllables per word; n(x) is the number of words with x syllables observed in this text; NP(x) is the number of words with x syllables calculated when the Hyperpoisson distribution is fitted to the observed data. Result: the Hyperpoisson distribution is a good model for this text, with the test criterion P = 0.85; P is considered good if it is greater than or equal to 0.05. For more detailed explanations, see the literature cited.)
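
How the NP(x) column of the Tucholsky table arises can be illustrated with a minimal sketch. It assumes the 1-displaced Hyperpoisson form P(x) = a^(x-1) / (b^(x-1) · 1F1(1; b; a)) for x = 1, 2, ..., where b^(k) = b(b+1)...(b+k-1) is the rising factorial. The parameter values a = 1.771 and b = 3.730 are not given in this article; they were back-calculated from the ratios of the printed NP(x) values and serve only as an illustration. The function names are hypothetical, and the last class is assumed to collect the whole tail x ≥ 6.

```python
# Sketch only: 1-displaced Hyperpoisson expected frequencies.
# A_ASSUMED and B_ASSUMED are NOT from the source; they were back-calculated
# from the printed NP(x) ratios and are purely illustrative.

def hyperpoisson_terms(a, b, n_terms=200):
    """Unnormalised terms a^k / b^(k) for k = 0 .. n_terms - 1,
    where b^(k) = b (b + 1) ... (b + k - 1) is the rising factorial."""
    terms = [1.0]
    for k in range(1, n_terms):
        terms.append(terms[-1] * a / (b + k - 1))
    return terms

def expected_frequencies(a, b, total, n_classes):
    """Expected counts for x = 1 .. n_classes; the last class pools the tail."""
    terms = hyperpoisson_terms(a, b)
    norm = sum(terms)  # truncated series standing in for 1F1(1; b; a)
    head = [total * t / norm for t in terms[:n_classes - 1]]
    return head + [total - sum(head)]

observed = [522, 250, 87, 32, 7, 2]   # n(x) from the table
A_ASSUMED, B_ASSUMED = 1.771, 3.730   # illustrative assumption
np_x = expected_frequencies(A_ASSUMED, B_ASSUMED, sum(observed), len(observed))

for x, (n, e) in enumerate(zip(observed, np_x), start=1):
    print(f"{x}  {n:4d}  {e:8.2f}")
```

With these assumed parameters the computed values stay within a few hundredths of the printed NP(x) column; an actual fit would estimate a and b from the data with an iterative procedure.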

The word length distribution of this text is quite typical for German: words consisting of only one syllable are the most common, followed by two-syllable words, then three-syllable words, and so on. Irregularities occur only in the rare classes of long words.
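
The test criterion P reported for the Tucholsky text can be approximated from the printed columns alone. The following sketch assumes that P is the probability of a chi-square goodness-of-fit statistic, with degrees of freedom equal to the number of classes minus the number of fitted parameters (assumed here to be 2 for the Hyperpoisson distribution) minus 1. The helper names are hypothetical; the survival function uses the closed forms for integer degrees of freedom, so only the standard library is needed.

```python
import math

def chi2_sf(x, df):
    """P(X > x) for a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # exp(-x/2) * sum_{k=0}^{df/2-1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_k (x/2)^(k-1/2) / Gamma(k+1/2)
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (k + 0.5)
    return total

def chi_square(observed, expected):
    """Pearson chi-square statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [522, 250, 87, 32, 7, 2]                    # n(x)
expected = [521.4, 247.56, 92.69, 28.64, 7.53, 2.18]   # NP(x)
N_PARAMS = 2  # assumption: two fitted Hyperpoisson parameters

df = len(observed) - N_PARAMS - 1
x2 = chi_square(observed, expected)
p = chi2_sf(x2, df)
print(f"chi2 = {x2:.3f}, df = {df}, P = {p:.3f}")
```

Under these assumptions the result lies close to the reported P = 0.85; small differences are to be expected because the printed NP(x) values are rounded.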

Such irregularities disappear with very large data sets. As an example, one may refer to Kaeding's frequency dictionary, which is presented in the article on word length. The Hyperpoisson distribution can also be fitted to these data with a very good result.

Special case: length of compound words

The length of compound words can be viewed as a special case of word length. It can be determined by the number of lexemes of which a compound is composed. For compound words in a corpus of advertising texts, the following result was obtained:

x    n(x)    NP(x)
1    192     192.41
2    63      60.66
3    10      12.66
4    1       1.97
5    1       0.24
6    1       0.06

(Here x = 1 denotes a compound consisting of 2 lexemes, x = 2 a compound consisting of 3 lexemes, and so on; n(x) is the number of compounds in length class x observed in this text corpus; NP(x) is the number of compounds in length class x calculated when the Hyperpoisson distribution is fitted to the observed data. Result: the Hyperpoisson distribution is a good model for this text corpus, with the test criterion P = 0.34; P is considered good if it is greater than or equal to 0.05.)
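
In the compound table the last classes have very small expected frequencies, so a chi-square statistic computed over all six classes would be dominated by them. A common practice in goodness-of-fit testing is to pool such classes before testing. The article does not say whether this was done here; the sketch below simply assumes it, with a hypothetical threshold of 1 for the smallest admissible expected frequency and 2 fitted parameters. Function and variable names are illustrative.

```python
import math

def pool_small_classes(observed, expected, min_expected):
    """Merge trailing classes until the last expected count >= min_expected."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[-1] < min_expected:
        o, e = obs.pop(), exp.pop()
        obs[-1] += o
        exp[-1] += e
    return obs, exp

observed = [192, 63, 10, 1, 1, 1]                     # n(x)
expected = [192.41, 60.66, 12.66, 1.97, 0.24, 0.06]   # NP(x)
MIN_EXPECTED = 1.0  # hypothetical pooling threshold
N_PARAMS = 2        # assumption: two fitted Hyperpoisson parameters

obs_pooled, exp_pooled = pool_small_classes(observed, expected, MIN_EXPECTED)
x2 = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))
df = len(obs_pooled) - N_PARAMS - 1

# With one degree of freedom the chi-square tail probability reduces to erfc.
assert df == 1, "this shortcut is only valid for df = 1"
p = math.erfc(math.sqrt(x2 / 2.0))
print(obs_pooled, [round(e, 2) for e in exp_pooled])
print(f"chi2 = {x2:.3f}, df = {df}, P = {p:.3f}")
```

Under these assumptions the three rarest classes are merged into one, and the resulting P is in the neighbourhood of the reported 0.34. This is consistent with pooling having been applied, but it does not prove it.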

An investigation into the length of compound words in press texts from GeoEpoche and FAZ showed that the difference in frequency between compounds consisting of two lexemes and those consisting of three was much more pronounced than in the advertising texts. Here, too, a distribution could be fitted successfully. The result was confirmed by further investigations of German press texts. (For more detailed explanations, see the literature cited.)

Findings in other languages

In other languages, it is often not the monosyllabic words that are most common but the two- or even three-syllable words. This depends on the morphology of the language. Languages in which monosyllabic words are not the most frequent in texts include Finnish and Latin. Japanese is another example. Sanada examined a section of a dictionary of Japanese, measuring word length as the number of moras per word, and found that the 1-shifted binomial distribution is a good model for this phenomenon:

x    n(x)    NP(x)
1    6       9.06
2    109     129.36
3    661     615.47
4    954     976.10

(Here x is the number of moras per word; n(x) is the number of words with x moras observed in this data set; NP(x) is the number of words with x moras calculated when the 1-shifted binomial distribution is fitted to the observed data. Result: the 1-shifted binomial distribution is a good model for these data, with the test criterion C = 0.0047; C is considered good if it is less than or equal to 0.01. The test criterion C is preferred here because the total number of words is quite high; P is more suitable when the total number is considerably lower.)
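
The criterion C for the Japanese data can be recomputed from the printed columns. The sketch assumes that C is the discrepancy coefficient, that is, the Pearson chi-square statistic divided by the total number of observations N; the article itself does not define C, and the function name is hypothetical. Dividing by N removes the dependence on sample size, which matches the remark that C is preferred for large totals.

```python
def discrepancy_coefficient(observed, expected):
    """Assumed definition: C = chi-square / N, with N the sum of observed counts."""
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2 / sum(observed)

observed = [6, 109, 661, 954]                # n(x), moras per word
expected = [9.06, 129.36, 615.47, 976.10]    # NP(x), 1-shifted binomial

c = discrepancy_coefficient(observed, expected)
print(f"N = {sum(observed)}, C = {c:.4f}")
print("acceptable fit" if c <= 0.01 else "poor fit")
```

Under this assumed definition the printed values reproduce the reported C = 0.0047.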

Mohanty & Popescu present results for 13 Indian languages; for each language, two texts were examined and modelled with the Zipf-Alekseev function. Word lengths in 28 languages are provided by Popescu et al. (2013), using different models. Also significant in the same volume is Lu Wang's study of word lengths in Chinese, carried out separately for tokens and types, in which different distributions were tested successfully. In addition, it was possible to demonstrate that polysemy and word length are related: the longer a word, the lower its polysemy. Lu Wang thus confirms for Chinese a relationship that was demonstrated by Altmann, Beöthy and Best (1982) and Rothe (1983) for German, French, Portuguese, Slovak, Spanish and Hungarian.

Word lengths, determined by the number of their letters or phonemes

So far, word lengths have been measured by the number of syllables. Along with morphs, syllables can be regarded as direct constituents of words. If letters or phonemes, that is, the indirect constituents of words, are taken as the criterion for word length instead, the tables become much longer, since words with almost 70 letters can be found, even if not very often. A study of a number of languages found that a mathematical model, namely the Good distribution, can also be applied successfully in these cases.

Result and perspective

The very extensive findings on word length distributions in many different languages and language stages lend particular support to the general hypothesis of quantitative linguistics that theoretically justifiable laws hold in the language system and in language use as well as in language change.

A number of studies have since confirmed that there are regular dependencies between word length and other language properties within individual languages; compare in particular the article Linguistic Synergetics. For the dependence of word length on word frequency in particular, see the references given.

Literature

  • Karl-Heinz Best (Ed.): Glottometrika 16. The Distribution of Word and Sentence Length . Wissenschaftlicher Verlag Trier, Trier 1997. ISBN 3-88476-276-1 .
  • Karl-Heinz Best: Quantitative Linguistics. An approximation . 3rd, heavily revised and expanded edition. Peust & Gutschmidt, Göttingen 2006. ISBN 3-933043-17-4 .
  • Karl-Heinz Best: Word length. In: Reinhard Köhler, Gabriel Altmann, Rajmund G. Piotrowski (eds.): Quantitative Linguistik / Quantitative Linguistics. An International Handbook. de Gruyter, Berlin / New York 2005, pp. 260–273. ISBN 3-11-015578-8.
  • Karl-Heinz Best: Word lengths in German. In: Göttinger Contributions to Linguistics 13, 2006, pp. 23–49.
  • Peter Grzybek: History and Methodology of Word Length Studies. The State of the Art. In: Peter Grzybek (Ed.): Contributions to the Theory of Text and Language. Word Length Studies and Related Issues . Springer, Dordrecht (NL), 2006, pp. 15–90. ISBN 1-4020-4067-9 (HB)
  • Thomas Jahn, Annika Uckel: Distribution of word lengths in English spam e-mails. In: Glottometrics 17, 2008, pp. 1-7.
  • Ioan-Iovitz Popescu, et alii: Word length: aspects and languages. In: Reinhard Köhler, Gabriel Altmann (eds.): Issues in Quantitative Linguistics 3. Dedicated to Karl-Heinz Best on the occasion of his 70th birthday . Lüdenscheid: RAM-Verlag 2013, p. 224-281. ISBN 978-3-942303-12-5 .
  • Ioan-Iovitz Popescu, Karl-Heinz Best, Gabriel Altmann: Unified Modeling of Length in Language . RAM-Verlag, Lüdenscheid 2014. ISBN 978-3-942303-26-2 . (Chapter Word length, pages 14-86, Length of compounds, pages 87-88.)
  • Otto Rottmann: On Word Length in German and Polish. In: Glottometrics 42, 2018, pp. 13-20.
  • Peter Schmidt (Ed.): Glottometrika 15. Issues in General Linguistic Theory and the Theory of Word Length . Wissenschaftlicher Verlag Trier, Trier 1996, pp. 102–111. ISBN 3-88476-228-1
  • Gejza Wimmer, Gabriel Altmann : Thesaurus of univariate discrete probability distributions. Stamm, Essen 1999. ISBN 3-87773-025-6
  • Gejza Wimmer, Gabriel Altmann: Towards a Unified Derivation of Some Linguistic Laws . In: Peter Grzybek (ed.): Contributions to the Science of Text and Language: Word length studies and related issues . Springer, Dordrecht 2006, pp. 329–337. ISBN 1-4020-4067-9 (HB)
  • Gejza Wimmer, Viktor Witkovský, Gabriel Altmann: Modification of Probability Distributions Applied to Word Length Research. In: Journal of Quantitative Linguistics 6, 1999, 257-268.

Bibliography

  • Bibliography of Word Length. In: Glottometrics 34, 2016, pp. 84-89. (Bibliography on the law of the distribution of word lengths)

References

  1. Gejza Wimmer, Reinhard Köhler, Rüdiger Grotjahn, Gabriel Altmann: Towards a Theory of Word Length Distribution. In: Journal of Quantitative Linguistics 1, 1994, pp. 98-106; Gejza Wimmer, Gabriel Altmann: The Theory of Word Length Distribution: Some Results and Generalizations. In: Peter Schmidt (Ed.): Glottometrika 15. Wissenschaftlicher Verlag Trier, Trier 1996, pp. 112-133.
  2. Karl-Heinz Best, Sergej Viktorovič Čebanov: Biographical note: Sergej Grigor'evič Čebanov (1897-1966). In: Karl-Heinz Best (Ed.): Frequency distributions in texts. Peust & Gutschmidt, Göttingen 2001, pp. 281–283; Sergej Viktorovič Čebanov: O podčinenii rečevych ukladov 'indoevropejskoj' gruppy zakonu Puassona. In: Doklady Akademii Nauk SSSR. Tom 55/2, 1947, pp. 103-106 (= On conformity of language structures within the Indoeuropean family to Poisson's law); William P. Elderton: A few statistics on the length of English words. In: Journal of the Royal Statistical Society, Series A (General), Volume CXII, Part IV, 1949, pp. 436-445; Wilhelm Fucks: Theory of word formation. In: Mathematical-Physical Semester Reports. Vol. 4, 1955, pp. 195-212.
  3. Karl-Heinz Best: William Palin Elderton (1877-1962). In: Glottometrics 19, 2009, pp. 99-101.
  4. Stefan Ammermann: On the word length distribution in German letters over a period of 500 years. In: Karl-Heinz Best (Ed.): Frequency distributions in texts. Peust & Gutschmidt, Göttingen 2001, pp. 59–91, here p. 81.
  5. See Best (2006), page 41.
  6. Bernhard Sowinski: Advertisements and mailings. Oldenbourg, Munich 1979, page 110, ISBN 3-486-03931-8; Bernhard Sowinski: Advertising. Niemeyer, Tübingen 1998, page 67, ISBN 3-484-37104-8.
  7. Best 2006, page 47.
  8. Stefanie Poppe: The distribution of compound lengths in German journalistic texts . In: Göttinger Contributions to Linguistics , 15, 2007, pages 79–85; Popescu, Best, Altmann 2014, pp. 87–88.
  9. Karl-Heinz Best: Lengths of Compounds in German. In: Glottometrics 23, 2012, pp. 1–6.
  10. Haruko Sanada: Investigations in Japanese Historical Lexicology (Revised Edition) . Peust & Gutschmidt, Göttingen 2008, p. 96f. ISBN 978-3-933043-12-2 .
  11. Panchanan Mohanty, Ioan-Iovitz Popescu: Word length in Indian languages 1. In: Glottometrics 29, 2014, pp. 95-109.
  12. Ioan-Iovitz Popescu, et alii: Word length: aspects and languages. In: Reinhard Köhler, Gabriel Altmann (eds.): Issues in Quantitative Linguistics 3. Dedicated to Karl-Heinz Best on the occasion of his 70th birthday . Lüdenscheid: RAM-Verlag 2013, pp. 224–281. ISBN 978-3-942303-12-5 .
  13. Lu Wang: Word length in Chinese. In: Reinhard Köhler, Gabriel Altmann (eds.): Issues in Quantitative Linguistics 3. Dedicated to Karl-Heinz Best on the occasion of his 70th birthday. Lüdenscheid: RAM-Verlag 2013, pp. 39–53. ISBN 978-3-942303-12-5.
  14. G. Altmann, E. Beöthy and K.-H. Best: The set of meanings and Menzerath's law. In: Journal for Phonetics, Linguistics and Communication Research 35, 1982, pp. 537-543.
  15. U. Rothe: Word length and amount of meanings: An investigation into Menzerath's law in three Romance languages. In: R. Koehler, J. Boy (eds.): Glottometrika 5. Brockmeyer, Bochum 1983, pp. 101-112. ISBN 3-88339-307-X .
  16. See article: Word length : Shortest words - longest words .
  17. Mats Eeg-Olofsson: A word length regularity and its genesis. In: Glottometrics 19, 2009, pp. 49-69.
  18. Archived copy (memento of the original from February 15, 2015, in the Internet Archive), lql.uni-trier.de.