Frequency class
In linguistics, a frequency class is a statistical measure of how frequently a word is used in a natural language or in a segment of a language. The frequency class is calculated on the basis of Zipf's law, which, as a law of language, is of particular importance in quantitative linguistics. In corpus linguistics, frequency classes have also become established as an empirical measure of frequency.
Calculation
A representative and sufficiently large collection of written sources from a language, called the corpus, is used as the basis for the calculation. The most common word in this corpus serves as the basis for comparison. In written German this is the word der, in English the, and in Swedish och ("and").
Zipf's law serves as the basis for the calculation. The frequency class N of a word is calculated from the base-2 logarithm of the quotient of the frequency of the word under examination and the frequency of the most frequent word:
$$N = \left\lfloor 0.5 - \log_2\left(\frac{\text{frequency of the word}}{\text{frequency of the most frequent word}}\right) \right\rfloor$$
The Gaussian bracket (floor function) rounds the intermediate result down to a whole number. Together with the added value 0.5, the Gaussian bracket rounds the negated logarithm up or down to the nearest whole number (0.5 is rounded up to 1).
The frequency class calculated in this way is an integer that expresses, on a base-2 logarithmic scale, how many times more frequently the most frequent word occurs in the analyzed data than the examined word. The most common word itself belongs to frequency class 0 and is, in general, the only representative of this class. Words that occur about 2^(-N) times as often as the most common word are placed in frequency class N. This means that the smaller its frequency class, the more frequently a word occurs.
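The calculation described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name frequency_class and the tiny sample sentence are assumptions made for this example, and a real calculation would use word counts from a large, representative corpus.

```python
import math
from collections import Counter


def frequency_class(word_count: int, max_count: int) -> int:
    """Return floor(0.5 + log2(max_count / word_count)).

    This equals floor(0.5 - log2(word_count / max_count)), i.e. the base-2
    logarithm of the frequency ratio rounded to the nearest whole number.
    """
    if word_count <= 0 or max_count < word_count:
        raise ValueError("expected 0 < word_count <= max_count")
    return math.floor(0.5 + math.log2(max_count / word_count))


# Toy "corpus" (made up for illustration; far too small to be representative).
tokens = "the cat and the dog and the bird".split()
counts = Counter(tokens)
max_count = max(counts.values())  # count of the most frequent word ("the")

# Frequency class of every word form in the toy corpus.
classes = {word: frequency_class(n, max_count) for word, n in counts.items()}
```

In this toy data, "the" (3 occurrences) falls into class 0, "and" (2 occurrences) into class 1, and each word that occurs once into class 2.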
Size of the frequency classes, rank
According to Zipf's law, one expects that class N contains around 2^N words (types) and that the sum of their occurrences (tokens) is roughly the same in each class, although this approximation is least accurate for the top and bottom classes. In particular, according to Zipf's law, one expects for each corpus that approximately half of all words (types) appear only once.
As a first estimate from Zipf's law, class 0 contains about 1 word, class 1 about 2 words, class 9 about 512 words, and so on. All classes up to and including class 9 together contain about 1000 words. The following frequency class 10 thus includes around 1000 words with a frequency rank of around 1000 to 2000; however, these are only rough guide values.
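The rough estimates in this section (about 1 word in class 0, about 2 in class 1, about 512 in class 9) follow a doubling pattern that can be reproduced with simple arithmetic. The sketch below assumes exactly this idealization; the function names are hypothetical, and real corpora deviate from these guide values.

```python
def estimated_class_size(n: int) -> int:
    """Rough guide value: class n holds about 2**n word types."""
    if n < 0:
        raise ValueError("frequency classes start at 0")
    return 2 ** n


def estimated_cumulative_size(n: int) -> int:
    """Word types in classes 0..n under the same estimate: 2**(n + 1) - 1."""
    return sum(estimated_class_size(k) for k in range(n + 1))


def estimated_rank_range(n: int) -> tuple:
    """First and last frequency rank expected in class n (idealized)."""
    first = estimated_cumulative_size(n - 1) + 1 if n > 0 else 1
    last = estimated_cumulative_size(n)
    return first, last


size_9 = estimated_class_size(9)          # 512
up_to_9 = estimated_cumulative_size(9)    # 1023, i.e. "about 1000 words"
ranks_10 = estimated_rank_range(10)       # (1024, 2047), i.e. "1000 to 2000"
```

Under this idealization, class 10 covers ranks 1024 to 2047, which corresponds to the "around 1000 to 2000" given in the text.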
Word forms and lexemes
Frequency classes can be considered on two linguistic levels: for a single word form (as shown above) or for an entire lexeme with its various word forms. The most frequently occurring word, whose frequency serves as the benchmark when calculating the frequency class, should be determined on the same linguistic level: in written German, the most common word form is the word der, and the most common lexeme is the definite article (with the inflected forms der, die, das, des, dem, den).
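The difference between the two levels can be illustrated with a small Python sketch. All counts below are invented for illustration and are not taken from any corpus; the lexeme labels and the helper names (frequency_class, classes_for) are likewise assumptions of this example.

```python
import math
from collections import defaultdict


def frequency_class(word_count: int, max_count: int) -> int:
    """Base-2 log of the frequency ratio, rounded to the nearest integer."""
    return math.floor(0.5 + math.log2(max_count / word_count))


def classes_for(counts: dict) -> dict:
    """Frequency class of each entry, benchmarked on the same level."""
    max_count = max(counts.values())
    return {key: frequency_class(n, max_count) for key, n in counts.items()}


# Invented word-form counts (illustration only).
form_counts = {
    "der": 1000, "die": 900, "das": 400, "den": 300, "dem": 250, "des": 150,
    "Haus": 40, "Hauses": 10, "Hause": 10,
}

# Hypothetical mapping from word form to lexeme.
lexeme_of = {form: "definite article"
             for form in ("der", "die", "das", "den", "dem", "des")}
lexeme_of.update({"Haus": "Haus", "Hauses": "Haus", "Hause": "Haus"})

# Sum the word-form counts of each lexeme.
lexeme_counts = defaultdict(int)
for form, n in form_counts.items():
    lexeme_counts[lexeme_of[form]] += n

form_classes = classes_for(form_counts)            # benchmark: "der" (1000)
lexeme_classes = classes_for(dict(lexeme_counts))  # benchmark: article (3000)
```

With these invented numbers, the word form Haus falls into class 5, while the lexeme Haus falls into class 6, because the benchmark on the lexeme level (the definite article with all its forms) is larger than the benchmark on the word-form level.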
Literature
- Helmut Meier: Deutsche Sprachstatistik, 2nd edition, Olms, Hildesheim 1978, ISBN 9783487007359.
Web links
- http://wortschatz.informatik.uni-leipzig.de - vocabulary lexicon of the University of Leipzig based on German sources with an indication of the frequency class
- DeReWo - corpus-based base form / word form lists of the Institute for the German Language with an indication of the frequency class
- Online calculator for frequency classes
References
- This is more or less in line with practice: according to studies by the University of Leipzig, class 9 is assigned a log (number of words in frequency class 9) of around 6.5, so there should be around 700 words in this class; see the graphic 'Number of words in the frequency classes' (memento of the original from March 5, 2016 in the Internet Archive) in the FAQ on vocabulary, University of Leipzig (memento of the original from November 12, 2015 in the Internet Archive).
- Can be used, for example, to better interpret the information from the frequency queries.