# Zipf's law

Zipf's law (named after George Kingsley Zipf, who formulated it in the 1930s) is a model that predicts the value of certain ranked quantities from their rank in a hierarchy. The law is frequently applied in linguistics, especially in corpus linguistics and quantitative linguistics, where it relates the frequency of words in a text to their rank. Zipf's law marked the beginning of quantitative linguistics.

It is based on a power law that is described mathematically by the Pareto distribution or the Zipf distribution.

## Simple Zipf distribution

The simplified statement of Zipf's law reads: if the elements of a set (for example, the words of a text) are ordered according to their frequency, the probability $p$ of their occurrence is inversely proportional to their rank $n$:

$$p(n) \sim \frac{1}{n}.$$

The normalization factor for $N$ elements is given by the harmonic number

$$H_N = \sum_{n=1}^{N} \frac{1}{n} \approx \ln(N) + 0.577 \approx \ln(1.78 \cdot N)$$

and exists only for finite sets. It follows that:

$$p(n) = \frac{1}{H_N} \cdot \frac{1}{n} \approx \frac{1}{n \cdot \ln(1.78 \cdot N)}.$$

## Probability distribution

Zipf's law has its origin in linguistics. It states that certain words appear much more frequently than others, and that the distribution resembles a hyperbola. In most languages, for example, the longer words are, the less often they appear. The rank $n$ can be interpreted as a cumulative quantity: the rank equals the number of all elements that are greater than or equal to the element in question. Rank 1 holds exactly one element, namely the largest; rank 2 holds two, namely the first and second elements; rank 3 three, and so on. Zipf assumed a simple inverse relationship to the rank: $y(\text{rank}) \sim \text{rank}^{-a}$. In its original form, Zipf's law is free of parameters: $a = 1$. The Zipf distribution corresponds exactly to the Pareto distribution with the ordinate and abscissa swapped:

$$y(x) \sim x^{-a} \ \text{(Zipf)} \quad \Leftrightarrow \quad x(y) \sim y^{-\frac{1}{a}} \ \text{(Pareto)}.$$

It is the inverse of the Pareto distribution. Like it, it is a cumulative distribution function that obeys a power law. The exponent $e$ of the corresponding distribution density function is accordingly

$$e = 1 + \frac{1}{a},$$

and for the simple case $a = 1$: $e = 2$.
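The simple Zipf distribution and the exponent relation can be illustrated with a short calculation (a minimal sketch in Python; the set size N = 1000 is an arbitrary assumption):

```python
# Simple Zipf distribution: p(n) = 1 / (H_N * n), normalized by the
# harmonic number H_N = sum_{k=1}^{N} 1/k.

def zipf_probability(n, N):
    """Probability of the element of rank n among N elements (a = 1)."""
    H_N = sum(1.0 / k for k in range(1, N + 1))
    return 1.0 / (H_N * n)

N = 1000  # assumed number of elements, e.g. a vocabulary size
probs = [zipf_probability(n, N) for n in range(1, N + 1)]

print(sum(probs))           # ~1.0: the distribution is normalized
print(probs[0] / probs[1])  # ~2.0: rank 1 is twice as likely as rank 2

# Exponent of the density function: e = 1 + 1/a, hence e = 2 for a = 1.
a = 1
e = 1 + 1 / a
print(e)
```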

## Examples

The distribution of word frequencies in a text (left graphic) corresponds qualitatively to a simple Zipf distribution.

Zipf's law prescribes the exponent $a$ of the cumulative distribution function: $a = 1$. The fitted value for the word frequencies is, however, $a = 0.83$, equivalent to a Pareto exponent of $a_{\text{pareto}} = 1.20$ and an exponent of the power-law density function of $e = 2.20$. The distribution of letter frequencies also resembles a Zipf distribution. However, statistics based on 20–30 letters are not sufficient to fit the curve with a power function.
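How such a fitted exponent is obtained can be sketched as a least-squares fit of log-frequency against log-rank. The word counts below are synthetic illustration values, not taken from a real corpus:

```python
# Estimating the Zipf exponent a from rank-frequency data via a
# linear fit of log(frequency) against log(rank).
import math

counts = [1000, 520, 330, 250, 195, 170, 145, 128, 115, 102]  # assumed data
ranks = range(1, len(counts) + 1)

xs = [math.log(r) for r in ranks]
ys = [math.log(c) for c in counts]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Ordinary least-squares slope of the log-log regression line.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)

a = -slope      # fitted Zipf exponent
e = 1 + 1 / a   # exponent of the corresponding density function
print(round(a, 2), round(e, 2))
```

For these near-hyperbolic counts the fitted exponent comes out close to 1, as the simple form of the law would predict.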

Another example from the Pareto distribution article concerns the size distribution of cities. Here, too, individual countries (e.g. Germany) show a relationship that appears to obey a power law, but with striking deviations. The graphic on the right compares the Zipf approximation with the measured values. The linear course in the logarithmic plot supports the assumption of a power law. Contrary to Zipf's conjecture, the exponent is not 1 but 0.77, corresponding to an exponent of the power-law density function of $e = 2.3$. This theory, according to which the populations of cities that develop independently of one another nevertheless follow a superordinate law, is also used to estimate the expected size of towns. The importance of the Zipf distribution lies in the rapid qualitative description of distributions from the most varied of areas, while the Pareto distribution refines the exponent of the distribution.

If, for example, only the populations of seven cities are given, the data set is too small for a fit. Zipf's law nevertheless provides an approximation:

| Rank n | City | Residents | 1/rank | p(n) | p(n) · Σ residents | Deviation in % |
|---|---|---|---|---|---|---|
| 1 | Berlin | 3522896 | 1.00 | 0.39 | 3531136.31 | −0.23 |
| 2 | Hamburg | 1626220 | 0.50 | 0.19 | 1765568.15 | −8.57 |
| 3 | Munich | 1206683 | 0.33 | 0.13 | 1177045.44 | 2.46 |
| 4 | Cologne | 946280 | 0.25 | 0.10 | 882784.08 | 6.71 |
| 5 | Frankfurt | 635150 | 0.20 | 0.08 | 706227.26 | −11.19 |
| 6 | Dortmund | 594058 | 0.17 | 0.06 | 588522.72 | 0.93 |
| 7 | Essen | 624445 | 0.14 | 0.06 | 504448.04 | 19.22 |
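The approximation above can be reproduced with a short calculation (a sketch; the city populations are taken from the table):

```python
# Zipf approximation for seven city populations:
# p(n) = (1/n) / H_N, estimated population = p(n) * total population.
cities = [
    ("Berlin", 3522896), ("Hamburg", 1626220), ("Munich", 1206683),
    ("Cologne", 946280), ("Frankfurt", 635150), ("Dortmund", 594058),
    ("Essen", 624445),
]

N = len(cities)
H_N = sum(1.0 / n for n in range(1, N + 1))   # harmonic number H_7
total = sum(pop for _, pop in cities)         # total population of the sample

for rank, (name, pop) in enumerate(cities, start=1):
    p = (1.0 / rank) / H_N                    # Zipf probability of this rank
    estimate = p * total                      # predicted population
    deviation = (pop - estimate) / pop * 100  # deviation in percent
    print(f"{rank}  {name:<10} {estimate:12.2f} {deviation:7.2f}")
```

The printed estimates match the table, e.g. about 3531136 residents for rank 1 (Berlin) with a deviation of −0.23 %.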

Reasons for the occurrence of power-law distributions are discussed under the keywords power law, scaling law, and self-organization.

## Literature

• Helmut Birkhan: The "Zipfian Law", the Weak Past Tense and the Germanic Sound Shift (= Austrian Academy of Sciences. Philosophical-Historical Class. Session Reports. 348). Publishing House of the Austrian Academy of Sciences, Vienna 1979, ISBN 3-700-10285-2.
• David Crystal: The Cambridge Encyclopedia of Language. Campus-Verlag, Frankfurt am Main et al. 1993, ISBN 3-593-34824-1.
• Xavier Gabaix: Zipf's Law for Cities: An Explanation. In: The Quarterly Journal of Economics. Vol. 114, No. 3, 1999, pp. 739–767, doi:10.1162/003355399556133.
• Henri Guiter, Michail V. Arapov (Eds.): Studies on Zipf's Law (= Quantitative Linguistics. Vol. 16). Studienverlag Brockmeyer, Bochum 1982, ISBN 3-88339-244-8.
• Matteo Marsili, Yi-Cheng Zhang: Interacting Individuals Leading to Zipf's Law. In: Physical Review Letters. Vol. 80, No. 12, 1998, pp. 2741–2744, doi:10.1103/PhysRevLett.80.2741.
• George Kingsley Zipf: The Psycho-Biology of Language. An Introduction to Dynamic Philology. Mifflin, Boston MA 1935 (reprint: The MIT Press, Cambridge MA 1968).
• George Kingsley Zipf: Human Behavior and the Principle of Least Effort. An Introduction to Human Ecology. Addison-Wesley, Cambridge MA 1949.