Benford's law

from Wikipedia, the free encyclopedia

Benford's law, also called the Newcomb-Benford law (NBL), describes a regularity in the distribution of the leading digits of numbers in empirical data sets whose underlying values are spread over a sufficiently wide range.

The law can be observed, for example, in data sets on the population of cities, amounts of money in accounting, natural constants, etc. In short, it says:

The lower the numerical value of a digit sequence of a given length at a given position in a number, the more likely it is to occur. For the leading digits of numbers in the decimal system, for example, numbers with the first digit 1 occur about 6.6 times as often as numbers with the first digit 9.

Discovery

The law was discovered in 1881 by the astronomer and mathematician Simon Newcomb and published in the American Journal of Mathematics. He had noticed that in the books of logarithm tables then in use, the pages listing numbers with a leading 1 were significantly dirtier than the other pages, apparently because they had been used more often. Newcomb's paper went unnoticed and had already been forgotten when the physicist Frank Benford (1883–1948) rediscovered the same law and published it again in 1938. Since then it has been named after him, although more recently the name "Newcomb-Benford law" (NBL) has also been used to credit the original discoverer. Few statisticians were aware of the law until the American mathematician Theodore Hill tried to make the Benford distribution useful for solving practical problems, which made it much better known.

Benford distribution

Benford's law

Benford's law states that, for randomly given numbers, the digit $d$ appears with probability

$$p(d) = \log_{10}(d + 1) - \log_{10}(d) = \log_{10}\left(1 + \frac{1}{d}\right)$$

as the first (non-zero) digit in the decimal representation of the numbers.

Graphic representation

In its simplest consequence, Benford's law states that the leading digits $d$ appear with the probabilities $\log_{10}(d + 1) - \log_{10}(d)$, or

Leading digit   Probability
1               30.1%
2               17.6%
3               12.5%
4               9.7%
5               7.9%
6               6.7%
7               5.8%
8               5.1%
9               4.6%
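
The table can be reproduced directly from the formula. The following is a minimal Python sketch; the function name `first_digit_probability` is chosen here for illustration only.

```python
import math

def first_digit_probability(d: int) -> float:
    """Benford probability that the leading decimal digit equals d (1..9)."""
    return math.log10(1 + 1 / d)

# Probabilities for the nine possible leading digits.
probabilities = {d: first_digit_probability(d) for d in range(1, 10)}

for d, p in probabilities.items():
    print(f"{d}  {100 * p:.1f}%")

# Ratio between the most and the least likely leading digit (roughly 6.6).
ratio_1_to_9 = probabilities[1] / probabilities[9]
```

The ratio of the probabilities for the digits 1 and 9 is the factor of about 6.6 mentioned in the introduction.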

Generalization

Suppose a set of numbers obeys Benford's law. Then the probability that the digit $d$ occurs in base $B$ at the $n$-th position (counting from the front, starting with 0) is

$$p_{B,n}(d) = \sum_{k = \lfloor B^{n-1} \rfloor}^{B^{n} - 1} \log_B\left(1 + \frac{1}{k B + d}\right),$$

where $\lfloor \cdot \rfloor$ denotes the Gaussian bracket (floor function).

For the first digit in particular, the formula simplifies to

$$p_{B,0}(d) = \log_B\left(1 + \frac{1}{d}\right).$$

It is easy to check that the probabilities of all the different digits at a given position sum to 1: after applying the logarithm rule already used for the first digit, $\log_B(1 + 1/m) = \log_B(m + 1) - \log_B(m)$, the sum becomes a telescoping sum.
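
The position formula can be evaluated numerically. The sketch below assumes the usual form of the generalized law, in which the probability of digit d at position n is the sum of log_B(1 + 1/(kB + d)) over all k from floor(B^(n-1)) to B^n - 1; the function name and the special handling of a leading zero are illustrative choices.

```python
import math

def digit_probability(d: int, n: int, base: int = 10) -> float:
    """Probability of digit d at position n (0 = leading digit) in the given base."""
    if n == 0:
        if d == 0:
            return 0.0  # the leading digit is never zero
        return math.log(1 + 1 / d, base)
    lower = base ** (n - 1)  # floor(B^(n-1)) is an integer for n >= 1
    upper = base ** n - 1
    return sum(math.log(1 + 1 / (k * base + d), base)
               for k in range(lower, upper + 1))

# Second-digit probabilities (n = 1) in the decimal system.
second_digit = [digit_probability(d, 1) for d in range(10)]
total = sum(second_digit)  # telescopes to 1 up to rounding error
```

The second-digit probabilities are already much flatter than the first-digit ones, ranging from about 12% for the digit 0 down to about 8.5% for the digit 9.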

Validity of the NBL

A data set is a Benford variable (that is, Benford's law applies to it) if the mantissas of the logarithms of its values are uniformly distributed between 0 and 1. This is generally the case when the variance within the data set does not fall below a certain minimum value, which depends on the class of distribution followed by the logarithms of the data.

For the Fibonacci numbers (each Fibonacci number is the sum of its two predecessors), the first digits of just the first 30 numbers already give a distribution that is remarkably close to a Benford distribution. The same holds for similar sequences with different starting values (e.g. the Lucas sequences). Many number sequences obey Benford's law, but many others do not and are therefore not Benford variables.
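
The Fibonacci claim is easy to check by counting. The sketch below assumes the sequence starts with 1, 1; the helper name `fibonacci` is illustrative.

```python
import math
from collections import Counter

def fibonacci(count: int):
    """Yield the first `count` Fibonacci numbers, starting 1, 1, 2, 3, ..."""
    a, b = 1, 1
    for _ in range(count):
        yield a
        a, b = b, a + b

fib = list(fibonacci(30))

# Count the leading decimal digit of each number.
observed = Counter(int(str(x)[0]) for x in fib)

for d in range(1, 10):
    expected = len(fib) * math.log10(1 + 1 / d)
    print(f"digit {d}: observed {observed[d]}, Benford expectation {expected:.2f}")
```

With only 30 values the counts are necessarily coarse, but the leading digit 1 appears 9 times, against a Benford expectation of about 9.03.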

Why many data sets follow the NBL

Relative frequencies of the initial digits 1 (red) and 9 (blue) among the numbers from 1 to n: only for n = 10^k − 1 do the two agree.

The NBL applies to real data sets (meaning here those that have not been manipulated) that are sufficiently extensive and cover a sufficiently wide range of orders of magnitude, i.e. data that are fairly widely dispersed. It says that the digit sequences in the numbers do not occur with equal probability but follow logarithmic laws. This means that a digit sequence is the more likely to occur the smaller its value and the further to the left it begins in the number. The most common is the initial sequence "1", with a theoretical frequency of 30.103%. The NBL is based on the uniform distribution of the mantissas of the logarithms of the values in the data set.

The reason for the astonishingly wide validity of the NBL is that many real data sets are log-normally distributed: it is not the data themselves but their orders of magnitude that follow a normal distribution. If the dispersion of the normally distributed logarithms is sufficiently broad (a standard deviation of at least approximately 0.74), the mantissas of the logarithms stably follow a uniform distribution. If the standard deviation is smaller, however, the mantissas are normally distributed and the NBL no longer applies, at least not in the simple form shown. With a standard deviation below 0.74 there is an effect that is not very common in statistics: even the mean of the normal distribution of the logarithms influences the frequency with which the digit sequences occur.
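
The role of the standard deviation can be illustrated with a small simulation. The sketch below draws values whose base-10 logarithms are normally distributed and measures how often the leading digit is 1. The sample size, the seed, the two means and the narrow standard deviation of 0.2 are assumptions chosen for illustration; only the value 0.74 comes from the text.

```python
import math
import random
from decimal import Decimal

def first_digit(x: float) -> int:
    """Leading non-zero decimal digit of a positive float (exact, via Decimal)."""
    return Decimal(x).as_tuple().digits[0]

def share_of_leading_ones(mean_log10: float, sd_log10: float,
                          size: int = 100_000, seed: int = 1) -> float:
    """Fraction of values 10**N(mean, sd) whose leading digit is 1."""
    rng = random.Random(seed)
    sample = (10 ** rng.gauss(mean_log10, sd_log10) for _ in range(size))
    return sum(first_digit(x) == 1 for x in sample) / size

# Wide dispersion: the share stays near log10(2), whatever the mean.
wide = [share_of_leading_ones(m, 0.74) for m in (0.0, 0.5)]

# Narrow dispersion: the share depends strongly on the mean.
narrow = [share_of_leading_ones(m, 0.20) for m in (0.0, 0.5)]

print("sd 0.74:", wide)
print("sd 0.20:", narrow)
```

With the wide dispersion both runs give roughly 30%, while with the narrow dispersion shifting the mean changes the share of leading ones considerably.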

If one starts from the NBL in its current form, there are numerous data sets that do not satisfy it. On the other hand, there is already a formulation of the NBL that all data sets satisfy.

Benford's law applies in particular to numerical material that is subject to natural growth processes, in which the numbers change over time and multiply. The first digit of the mantissa stays at 1 for about 30% of the time, at 2 for about 18% of the time, and so on. This corresponds to the logarithmic distribution that Benford's law predicts and is independent of how long the multiplication takes. Then the cycle starts again at 1. A snapshot of the prices in a supermarket will show this distribution, regardless of when the survey is carried out.
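
The time shares can be checked with a short calculation. The sketch below assumes idealized exponential growth over exactly one tenfold cycle, observed at equally spaced times; the number of observation times is an arbitrary choice.

```python
import math

STEPS = 100_000  # number of equally spaced observation times (assumed)

def leading_digit_time_shares(steps: int = STEPS) -> dict:
    """Share of time an exponentially growing value spends on each leading digit."""
    counts = {d: 0 for d in range(1, 10)}
    for i in range(steps):
        t = (i + 0.5) / steps    # time as a fraction of one tenfold cycle
        value = 10 ** t          # exponential growth from 1 up to 10
        counts[int(value)] += 1  # for 1 <= value < 10 the integer part is the leading digit
    return {d: c / steps for d, c in counts.items()}

shares = leading_digit_time_shares()
for d, share in shares.items():
    print(f"digit {d}: {100 * share:.1f}% of the time")
```

Because the value passes from 1 to 2 during the first log10(2) ≈ 30% of the cycle, the leading digit 1 is seen about 30% of the time, independently of how fast the growth is.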

Scale invariance

Data sets with Newcomb-Benford distributed initial digits are again Benford distributed after being multiplied by a constant. Multiplying the data by a constant corresponds to adding a constant to the logarithms. If the data are sufficiently widely dispersed, the distribution of the mantissas does not change.

This property directly explains why the Newcomb-Benford law applies to tax returns, balance sheets and, more generally, to data sets whose figures represent amounts of money. If there is a universally valid distribution of the initial digits in such data sets, then this distribution must be independent of the currency in which the data are given, and it must not change as a result of inflation. Both requirements mean that the distribution must be scale-invariant. Since the Newcomb-Benford distribution is the only one that satisfies this condition, it must be this distribution.
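
Scale invariance can be demonstrated numerically. The sketch below uses synthetic data whose logarithms are uniformly distributed over six decades, and an arbitrary conversion factor of 1.95583; both are assumptions made only for this illustration.

```python
import math
import random
from decimal import Decimal

def first_digit(x: float) -> int:
    """Leading non-zero decimal digit of a positive float (exact, via Decimal)."""
    return Decimal(x).as_tuple().digits[0]

def digit_frequencies(values: list) -> list:
    """Relative frequencies of the leading digits 1..9."""
    counts = [0] * 10
    for v in values:
        counts[first_digit(v)] += 1
    return [c / len(values) for c in counts[1:]]

random.seed(0)
data = [10 ** random.uniform(0, 6) for _ in range(200_000)]  # synthetic amounts
scaled = [1.95583 * v for v in data]                          # e.g. a currency conversion

original_freq = digit_frequencies(data)
scaled_freq = digit_frequencies(scaled)

for d in range(1, 10):
    print(d, f"{original_freq[d - 1]:.3f}", f"{scaled_freq[d - 1]:.3f}")
```

Up to sampling noise, the two columns agree with each other and with the Benford probabilities.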

Base invariance

A data set that satisfies Benford's law in base B_1 also satisfies Benford's law in base B_2. More specifically, a decimal data set that satisfies Benford's law still satisfies it when the numbers are converted into another number system (e.g. binary, octal or hexadecimal).
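
A numerical illustration of base invariance follows. The sketch assumes a synthetic, very widely dispersed log-normal sample (standard deviation 5 on the natural-log scale), and computes the leading digit in a base from the fractional part of the logarithm; the names are illustrative.

```python
import math
import random

def leading_digit(x: float, base: int) -> int:
    """Leading digit of a positive number x when written in the given base."""
    mantissa = math.log(x, base) % 1.0
    # min() guards against a floating-point result of exactly `base`.
    return min(int(base ** mantissa), base - 1)

def leading_digit_frequencies(values: list, base: int) -> list:
    """Relative frequencies of the leading digits 1..base-1."""
    counts = [0] * base
    for v in values:
        counts[leading_digit(v, base)] += 1
    return [c / len(values) for c in counts[1:]]

rng = random.Random(7)
data = [rng.lognormvariate(0.0, 5.0) for _ in range(200_000)]  # assumed test data

worst_deviation = {}
for base in (8, 10, 16):
    observed = leading_digit_frequencies(data, base)
    expected = [math.log(1 + 1 / d, base) for d in range(1, base)]
    worst_deviation[base] = max(abs(o - e) for o, e in zip(observed, expected))
    print(f"base {base}: largest deviation from Benford {worst_deviation[base]:.4f}")
```

For such widely dispersed data, the leading digits follow the Benford probabilities log_B(1 + 1/d) in octal, decimal and hexadecimal alike, apart from sampling noise.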

Applications

If a real data set does not comply with Benford's law even though it meets the requirements, in that the number of occurrences of a certain digit deviates significantly from the expectation given by Benford's law, an examiner will subject the records that begin with this digit to a more detailed analysis in order to find the cause(s) of the discrepancy. This quick procedure can lead to deeper insight into the particularities of the examined data set, or to the detection of manipulation during data creation.

Example

Distribution of the first digits of a table with 87 numbers (see text)

A table reports the harvest results from 2002. In the diagram, the blue bars indicate the frequency of the first digits of the 87 recorded numbers. The Benford distribution is shown as a red line. It reflects the observed distribution much better than a uniform distribution (green line). Despite the small sample, the preference for small values can be seen in the first digit, and there is a similar tendency in the second digit.

The table summarizes the results. The "1st digit" column shows how often each digit appears in the first position, and the "Benford" column next to it shows how often it is expected there according to the Benford distribution. The same applies to the second digit in the "2nd digit" column. The digit 1 thus appears in the first position 27 times, against an expected 26.19 times. The digit 4 appears first 17 times, whereas according to Benford it should appear 8.43 times on average.

As the place value of the digit decreases, the Benford distribution given above increasingly approaches a uniform distribution of the digits.

Digit   1st digit   Benford   2nd digit   Benford
0       -           -         9           10.41
1       27          26.19     17          9.91
2       15          15.32     9           9.47
3       7           10.87     11          9.08
4       17          8.43      5           8.73
5       4           6.89      9           8.41
6       5           5.82      7           8.12
7       4           5.05      8           7.86
8       5           4.45      7           7.62
9       3           3.98      5           7.39
Total   87                    87

In business

Benford's law is applied to detect fraud in the preparation of balance sheets, the falsification of accounts and, in general, to quickly find blatant irregularities in accounting. With the help of Benford's law, the remarkably "creative" accounting at Enron and Worldcom was uncovered, through which the management had cheated investors out of their investments (→ economic crime). Today accountants and tax investigators use methods based on Benford's law. These methods are an important part of the mathematical-statistical methods that have been used for several years to uncover falsified accounts, tax and investor fraud and, in general, data fraud. It has also been shown that the leading digits of market prices follow Benford's law. The manipulation of Greece's economic data was likewise demonstrated using Benford's law.

In research

Benford's Law can also help in uncovering data falsification in science. It was data sets from the natural sciences that led to Benford's law. Karl-Heinz Tödter from the Research Center of the Deutsche Bundesbank used the same law to review the results of 117 economic papers in a contribution to the German Economic Review .

Elections

With the help of Benford's law, political scientists examined the results of several German federal elections (1990–2005) at constituency level and found significant irregularities in the first vote only occasionally (4 cases in 1500 tests). For the second vote, i.e. the direct party vote, however, irregularities were observed in 51 of 190 tests.

There were also indications of possible fraud in the 2009 presidential election in Iran.

However, other experts see Benford's Law as completely unsuitable for examining elections.

Size of cities in Germany

Distribution of the size of major German cities

The figure shows the size distribution of German cities, based on the population of the 998 largest cities. A Benford analysis gives the following frequencies of the initial digits:

Digit   Measured   Expected
1       340        300.4
2       320        175.7
3       133        124.7
4       87         96.7
5       50         79.0
6       24         66.8
7       20         57.9
8       12         51.1
9       12         45.7

The frequencies of the digits 3 and 4 correspond to the expectation. In contrast, the digit 1 occurs more often than expected. The deviation is particularly pronounced for the digit 2, at the expense of the digits 7, 8 and 9, which are rarely observed in the first position.

This example again shows that data sets must meet certain requirements in order to satisfy the NBL; this data set does not. The reason is the restriction to cities; the distribution over all municipalities should give a closer match. In addition, there is a natural minimum settlement size, and mergers of municipalities also affect the distribution. Curiously, about 50% of the examples that Benford himself cited in his publication as evidence for the NBL belong to the class of data sets whose initial digits are not Benford distributed but only follow a roughly similar distribution.

Significance

How large the deviations of the observed distribution from the theoretically expected one must be before a suspicion of manipulation can be regarded as substantiated is determined with mathematical-statistical methods (e.g. the chi-square test or the Kolmogorov-Smirnov test, "KS test"). For the chi-square test for non-random deviations in the first digit, a sample of 109 numbers should suffice (the expected count n · p(d) is then about 5 or more for every digit d). If the samples are much smaller, the results of the chi-square test can be challenged, and the KS test may be too tolerant. In such a case, a very laborious but exact test based on the multinomial distribution can be used, for example. In addition, the values in the data set must be statistically independent of one another. (For this reason, number sequences such as the Fibonacci sequence cannot be tested for significance with the chi-square goodness-of-fit test, because the result becomes unreliable.)

Balance lists, invoice lists and similar statements behave in accordance with Benford's law because most such series of numbers are collections of values that have passed through a wide variety of arithmetic processes and therefore behave like quasi-random numbers. If the business and booking processes are allowed to run freely, the laws of chance take effect from a certain business size onwards, and Benford's law therefore applies as well. However, if these numbers are consistently influenced in the course of an accounting period, by frequently embellishing them, making certain numbers disappear or inventing them, or even by manipulating processes because of given authorization limits, then chance is noticeably disturbed. These disturbances show up as significant deviations from the theoretically expected digit distribution.

In practice, the conventional significance tests used in Benford analyses often turn out not to be entirely reliable. In addition, the values in a data set are sometimes not completely independent of each other, in which case the chi-square test, for example, must not be used. Work is underway on significance tests that are better adapted to the NBL.

Example: If an employee is allowed to place orders of up to EUR 1000 without the approval of the management and, when offers exceed EUR 1000, often splits the orders into several smaller items in order to avoid the trouble of approval, then the distribution of the initial digits of the order amounts shows significant deviations from the theoretical Benford expectation.

Significance test for deviation from the Benford distribution using the chi-square test

However, this example also shows that statistical methods cannot detect individual irregularities. A certain consistency in the manipulations is required. The larger the sample, the more sensitively a significance test generally reacts to manipulation.

Test for significant deviations

Benford analyses are considered to be among the simplest analyses in mathematical statistics. The example below is the result of counting the first digits of a sample of 109 sums from a list. The actual (observed) counts are compared with the counts expected for 109 initial digits and examined with a chi-square test to determine whether the discrepancies found can be random or can no longer be explained by chance alone. The decision criterion assumed in this example is that a deviation is regarded as non-random if the probability of the observed distribution, or one at least as improbable, occurring by chance is less than or equal to 5% (statistical test). Since in this example 52% of all distributions show deviations this large or larger, an examiner will not reject the hypothesis that the deviations were caused by chance.
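
The procedure can be sketched in a few lines of Python. Because the 109 counts of this example are not reproduced in the text, the sketch instead uses the first-digit counts from the two tables given earlier (the 87 harvest figures and the 998 German cities). The p-value uses the closed-form survival function of the chi-square distribution with 8 degrees of freedom (nine digit classes minus one); all function names are illustrative.

```python
import math

def benford_expected(total: int) -> list:
    """Expected first-digit counts for `total` numbers under Benford's law."""
    return [total * math.log10(1 + 1 / d) for d in range(1, 10)]

def chi_square_statistic(observed: list) -> float:
    """Pearson chi-square statistic of observed first-digit counts (digits 1..9)."""
    expected = benford_expected(sum(observed))
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_p_value_8_dof(statistic: float) -> float:
    """P(X >= statistic) for a chi-square variable with 8 degrees of freedom."""
    half = statistic / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(4))

harvest = [27, 15, 7, 17, 4, 5, 4, 5, 3]          # counts from the harvest table
cities = [340, 320, 133, 87, 50, 24, 20, 12, 12]  # counts from the city table

for name, counts in (("harvest", harvest), ("cities", cities)):
    stat = chi_square_statistic(counts)
    p = chi_square_p_value_8_dof(stat)
    verdict = "rejected" if p <= 0.05 else "not rejected"
    print(f"{name}: chi-square = {stat:.2f}, p = {p:.3g}, Benford hypothesis {verdict} at 5%")
```

With these counts the harvest data are not rejected at the 5% level, although with only 87 numbers the sample is below the size recommended above, while the city data deviate far beyond what chance would explain.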

In-depth Benford analyses

If very long lists with several thousand numbers are available, a Benford test need not be limited to the first digit. Such an abundance of data makes it possible to check the 2nd digit, the 3rd digit, the 1st and 2nd digits together, and possibly even the 1st, 2nd and 3rd digits together (for the latter, however, at least 11,500 numbers should be available, otherwise the chi-square test could give unreliable results). Benford distributions also exist for these tests, although they are somewhat more extensive. For example, the theoretical expectation for the initial digits 123… is 0.35166%, whereas only 0.13508% of all numbers have the initial digits 321….

As a rule, the lower the place value of a digit, the more closely it follows a uniform distribution. Cent amounts follow an almost exactly uniform distribution, so the logarithmic approach is generally unnecessary for them. For currencies whose fractional units are very small (e.g. the kopeck in Russia, the heller in the Czech Republic, the fillér in Hungary, the lipa in Croatia), tests for a uniform distribution of the fractional amounts are blurred, because amounts are often rounded in practice. Large currencies (US dollar, pound sterling, euro) usually allow such tests.
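
The two percentages follow from applying the first-digit formula to a whole leading digit sequence, read as an integer m: the probability is log10(1 + 1/m). A minimal sketch, with an illustrative function name:

```python
import math

def leading_sequence_probability(sequence: int) -> float:
    """Benford probability that a number starts with the digits of `sequence`."""
    if sequence < 1:
        raise ValueError("the leading digit sequence must be a positive integer")
    return math.log10(1 + 1 / sequence)

p_123 = leading_sequence_probability(123)
p_321 = leading_sequence_probability(321)
print(f"123...: {100 * p_123:.5f}%")
print(f"321...: {100 * p_321:.5f}%")

# The probabilities of all three-digit beginnings 100..999 add up to 1.
total_three_digit = sum(leading_sequence_probability(m) for m in range(100, 1000))
```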

Estimation and planning of company sales

Benford's law can also be used to estimate company turnover. It is assumed that the logarithms of all invoice amounts of a company approximately follow a normal distribution. The initial digits of the invoice amounts then follow the Benford distribution with an expected value of around 3.91. The distance between the logarithm of the smallest and the logarithm of the largest invoice amount corresponds to approximately 6 times the standard deviation of the normal distribution of the logarithms. Knowing the highest invoice amount and the number of valid invoices that make up the turnover, a useful estimate of the turnover is possible, as the following example from practice shows. The place value in the table indicates the integer part of the logarithm. Actual sales were 3.2 million currency units. However, sales estimates are not always this close to the actual result. If the assumption of a normal distribution for the orders of magnitude does not hold, an estimated distribution that is more similar to the real one must be chosen. In most cases, the orders of magnitude of the invoice amounts then follow a logarithmic normal distribution.

Total sales estimate

Although the actual distribution of the invoice amounts will coincide with the estimated one only by chance, the estimation errors per place value almost always cancel out to a rather small total.

This method can also be used when planning company sales to check the plausibility of planned figures, which are mostly the result of estimates and extrapolations of empirical values by sales-oriented departments. To do so, one determines how many invoices are expected to achieve the specified sales and how high the largest invoice amount will be. This analysis often shows that the estimates on which the planning is based cannot be relied on too heavily. The Benford analysis then gives the sales department feedback for adjusting its expectations to reality.

If one assumes that the logarithms of the individual sales are uniformly distributed, the sales are, so to speak, "logarithmically uniformly distributed". The density function of the sales then has a histogram which, given a suitable classification (e.g. nine classes, for comparison with the first digit), looks very similar to the Benford distribution.

Generation of Benford distributed initial digits

Generating practically random numbers with Benford distributed initial digits is quite easy on a computer.

Evenly distributed numbers

The function generates numbers with Benford distributed initial digits for . A random, uniformly distributed positive whole number is from a fixed interval, and is a uniformly distributed random number between 0 and 1.
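
A minimal Python sketch of such a generator. It assumes, purely as an illustration, a function of the form z = 10^(k + u), where k is a uniformly distributed positive whole number from a fixed interval and u is a uniformly distributed random number between 0 and 1. The interval bounds, the seed and all names are hypothetical.

```python
import math
import random
from decimal import Decimal

def first_digit(x: float) -> int:
    """Leading non-zero decimal digit of a positive float (exact, via Decimal)."""
    return Decimal(x).as_tuple().digits[0]

def benford_number(rng: random.Random, k_min: int = 1, k_max: int = 6) -> float:
    """Draw one number 10**(k + u); the form of the function is an assumption."""
    k = rng.randint(k_min, k_max)  # uniformly distributed whole number (order of magnitude)
    u = rng.random()               # uniformly distributed in [0, 1): the mantissa
    return 10 ** (k + u)

rng = random.Random(42)
numbers = [benford_number(rng) for _ in range(100_000)]

counts = [0] * 10
for z in numbers:
    counts[first_digit(z)] += 1
frequencies = [c / len(numbers) for c in counts[1:]]

for d, f in enumerate(frequencies, start=1):
    print(f"digit {d}: generated {f:.3f}, Benford {math.log10(1 + 1 / d):.3f}")
```

Because the mantissa u of the logarithm is uniformly distributed by construction, the initial digits follow the Benford distribution regardless of how k is chosen.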

Normally distributed numbers

The function generates for , with as uniformly distributed random variables, numbers with approximately normally distributed orders of magnitude of and Benford distributed initial digits. For practical purposes, relatively high should be chosen . As is , you can see that the distribution of the numbers resembles a lognormal distribution. Is , the generated initial digits of the numbers are generally no longer Benford distributed. For applications in practice, the wide spread of the orders of magnitude of that the square of the tangent function  generates - especially for large ones - is in many cases not optimal.

Literature

  • F. Benford: The Law of Anomalous Numbers. In: Proceedings of the American Philosophical Society (Proc. Amer. Phil. Soc.). Philadelphia 78.1938, pp. 551-572, ISSN 0003-049X.
  • Simon Newcomb: Note on the Frequency of the Use of different Digits in Natural Numbers. In: American Journal of Mathematics (Amer. J. Math.). Baltimore 4.1881, pp. 39-40, ISSN 0002-9327.
  • Mark J. Nigrini: The Detection of Income Tax Evasion Through an Analysis of Digital Frequencies. Dissertation. University of Cincinnati. UMI, Ann Arbor Mich 1992 (microfiche).
  • Ian Stewart: The Law of the First Digit. In: Spektrum der Wissenschaft, Heidelberg 1994.4 (Apr.), pp. 16 ff., ISSN 0170-2971.
  • H. Rafeld: Digital digit analysis with Benford's Law for the revision of fraudulent acts. Thesis. University of Cooperative Education Ravensburg, Ravensburg 2003.
  • Peter N. Posch: Digit analysis in theory and practice - test procedure for detecting forgeries with Benford's law. 2nd Edition. European Economy, Berlin 2005, ISBN 3-8322-4492-1.
  • Tarek el Sehity, Erik Hoelzl, Erich Kirchler: Price developments after a nominal shock, Benford's Law and psychological pricing after the euro introduction. In: International Journal of Research in Marketing. 22, Amsterdam 2005, No. 4, December 2005, pp. 471-480, doi:10.1016/j.ijresmar.2005.09.002, ISSN 0167-8116.
  • S. Günnel, K.-H. Tödter: Does Benford's law hold in economic research and forecasting? In: Deutsche Bundesbank Discussion Paper. Series 1. Economic Studies. Frankfurt am Main 32/2007.
  • H. Rafeld, F. Then Bergh: Digital digit analysis in German accounting data. In: Journal Interne Revision, Berlin 42.2007,1, pp. 26–33, ISSN 0044-3816.
  • Arno Berger, Theodore Hill: Benford's law strikes back: no simple explanation in sight for mathematical gem. In: Mathematical Intelligencer. 2011, No. 1, pp. 85-91.
  • Arno Berger, Theodore Hill: What is Benfords Law? AMS Notices, February 2017.
  • Ehrhard Behrends: Benford's Law or why the number 1 is more often at the beginning. In: The world. April 4, 2005.

Individual evidence

  1. Tarek el Sehity, Erik Hoelzl, Erich Kirchler: Price Developments after a nominal shock, Benford's Law and psychological pricing after the euro introduction. In: International Journal of Research in Marketing, 22, Amsterdam 2005, No. 4, December 2005, pp. 471-480, doi:10.1016/j.ijresmar.2005.09.002.
  2. TU Ilmenau proves Greece's cheating. Retrieved October 25, 2011.
  3. Hans Christian Müller: On the hunt for counterfeiters. In: Handelsblatt, November 30, 2009.
  4. Christian Breunig, Achim Goerres: Searching for electoral irregularities in an established democracy: Applying Benford's Law tests to Bundestag elections in Unified Germany. In: Electoral Studies (= Special Symposium on the Politics of Economic Crisis). Vol. 30, no. 3, September 1, 2011, pp. 534–545, doi:10.1016/j.electstud.2011.03.005 (achimgoerres.de [PDF; accessed on May 8, 2017]).
  5. Boudewijn F. Roukema: Benford's Law anomalies in the 2009 Iranian presidential election. arXiv:0906.2789v1.
  6. Joseph Deckert, Mikhail Myagkov and Peter C. Ordeshook: The Irrelevance of Benford's Law for Detecting Fraud in Elections. Caltech/MIT Voting Technology Project Working Paper No. 9, 2010.
  7. Charles R. Tolle, Joanne L. Budzien, Randall A. LaViolette: Do dynamical systems follow Benford's Law? In: Chaos, 10, 2, 2000, pp. 331-336, doi:10.1063/1.166498.
This version was added to the list of articles worth reading on July 3, 2005 .