Frequency class

from Wikipedia, the free encyclopedia

In linguistics, a frequency class is a statistical measure of how frequently a word is used in a natural language or in a segment of a language. Frequency classes are calculated on the basis of Zipf's law, which, as a law of language, is of particular importance in quantitative linguistics. In corpus linguistics, frequency classes have also become established as an empirical measure of frequency.

Calculation

The calculation is based on a representative and sufficiently large collection of written sources from a language, called the corpus. The most common word in this corpus serves as the basis for comparison. In written German this is the word der ("the"), in English the, and in Swedish och ("and").

Zipf's law serves as the basis for the calculation. The value of the frequency class $N$ is calculated from the base-2 logarithm of the quotient of the frequency of the word under examination and the frequency of the most frequent word:

$$N = \left\lfloor 0.5 - \log_2\left(\frac{\text{frequency of the word}}{\text{frequency of the most frequent word}}\right) \right\rfloor$$

The Gaussian bracket $\lfloor\;\rfloor$ rounds the intermediate result down to a whole number. Together with the added value 0.5, the Gaussian bracket has the effect that the negated logarithm is rounded up or down to the nearest whole number (0.5 is rounded up to 1).

The frequency class $N$ calculated in this way is an integer that expresses how much more frequently the most frequent word occurs in the analyzed data than the examined word: the most frequent word occurs about $2^N$ times as often. The most common word itself belongs to frequency class 0 and is, in general, the only representative of this class. Words that occur about $2^{-N}$ times as often as the most common word are placed in frequency class $N$. This means that the smaller its frequency class, the more frequently a word occurs.
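The calculation described in this section can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the frequency class is the floor of 0.5 minus the base-2 logarithm of the ratio between a word's count and the count of the most frequent word. The function names `frequency_class` and `frequency_classes` and the toy sentence are made up for this example and do not come from any particular corpus tool.

```python
import math
from collections import Counter


def frequency_class(word_count, max_count):
    """Return floor(0.5 - log2(word_count / max_count)).

    word_count: occurrences of the word under examination
    max_count:  occurrences of the most frequent word in the same corpus
    """
    if word_count <= 0 or word_count > max_count:
        raise ValueError("need 0 < word_count <= max_count")
    return math.floor(0.5 - math.log2(word_count / max_count))


def frequency_classes(tokens):
    """Map every distinct token to its frequency class."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {word: frequency_class(c, max_count) for word, c in counts.items()}


# Toy input, far too small to be a representative corpus.
tokens = "the cat and the dog and the bird saw the cat".split()
classes = frequency_classes(tokens)
print(classes["the"], classes["cat"], classes["dog"])  # 0 1 2
```

In the toy sentence, "the" occurs four times, "cat" twice and "dog" once, so each halving of the count raises the class by one.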

Size of the frequency classes, rank

According to Zipf's law, one expects class $N$ to contain around $2^N$ words (types) and the sum of their occurrences (tokens) to be roughly the same in each class, although this approximation is least accurate for the top and bottom classes. In particular, according to Zipf's law, one expects that in any corpus approximately half of all words (types) appear only once.

As a first estimate, Zipf's law predicts that class 0 contains about 1 word, class 1 about 2 words, class 9 about 512 words, and so on. Taken together, all classes up to and including class 9 contain about 1000 words. The following frequency class 10 thus comprises around 1000 words with a frequency rank of around 1000 to 2000; however, these are only rough guide values.
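The rough guide values in this section follow from simple powers of two. The sketch below reproduces only that first estimate (about 2 to the power N words in class N), not measured corpus data; the helper names `expected_class_size` and `expected_rank_range` are hypothetical.

```python
def expected_class_size(n):
    """First estimate: class n holds about 2**n words (types)."""
    return 2 ** n


def expected_rank_range(n):
    """Approximate first and last frequency rank of the words in class n.

    Classes 0..n-1 together hold 2**n - 1 words under this estimate,
    so class n starts at rank 2**n and ends at rank 2**(n + 1) - 1.
    """
    return 2 ** n, 2 ** (n + 1) - 1


# Words in all classes up to and including class 9.
words_up_to_9 = sum(expected_class_size(k) for k in range(10))
print(words_up_to_9)            # 1023
print(expected_class_size(9))   # 512
print(expected_rank_range(10))  # (1024, 2047)
```

The exact values 1023, 1024 and 2047 correspond to the rounded figures of about 1000 and 2000 quoted in the text.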

Word forms and lexemes

Frequency classes can be considered on two linguistic levels: for a single word form (as shown above) or for an entire lexeme with its various word forms. The most frequent word, whose frequency serves as the benchmark when calculating the frequency class, should be determined on the same linguistic level: in written German, the most common word form is der, and the most common lexeme is the definite article (with the inflected forms der, die, das, des, dem, den).
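The difference between the two levels can be illustrated with a small sketch. All counts below are invented toy numbers, not real corpus figures, and the mapping `lexeme_of` from word forms to lexemes is a hypothetical stand-in for a proper lemmatizer. The sketch assumes the frequency class is the floor of 0.5 minus the base-2 logarithm of the ratio between a count and the largest count on the same level.

```python
import math
from collections import Counter

# Invented toy counts of word forms (NOT real corpus data).
form_counts = Counter(
    {"der": 100, "die": 95, "das": 40, "Haus": 10, "Häuser": 4, "Hauses": 2}
)

# Hypothetical mapping from word form to lexeme.
lexeme_of = {
    "der": "der/die/das", "die": "der/die/das", "das": "der/die/das",
    "Haus": "Haus", "Häuser": "Haus", "Hauses": "Haus",
}


def classes(counts):
    """Frequency class per entry, benchmarked on the largest count given."""
    top = max(counts.values())
    return {k: math.floor(0.5 - math.log2(c / top)) for k, c in counts.items()}


# Sum the counts of all word forms belonging to the same lexeme.
lexeme_counts = Counter()
for form, count in form_counts.items():
    lexeme_counts[lexeme_of[form]] += count

form_classes = classes(form_counts)      # benchmark: most frequent word form
lexeme_classes = classes(lexeme_counts)  # benchmark: most frequent lexeme

print(form_classes["Haus"], lexeme_classes["Haus"])  # 3 4
```

With these toy numbers the word form Haus falls into class 3, while the lexeme Haus falls into class 4, because the benchmark on the lexeme level (the summed article forms) is larger than the benchmark on the word-form level.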

Literature

  • Helmut Meier: Deutsche Sprachstatistik, 2nd edition, Olms, Hildesheim 1978, ISBN 9783487007359.

References

  1. This is more or less in line with practice: according to studies by the University of Leipzig, class 9 is assigned a log (number of words in HKL 9) of around 6.5, so there should be around 700 words in this class; see the graphic 'Number of words in the frequency classes' (archived from the original on March 5, 2016, in the Internet Archive) in the FAQ on vocabulary, University of Leipzig (archived from the original on November 12, 2015, in the Internet Archive).
  2. Can be used, for example, to better interpret the information from frequency queries.