Gini coefficient
The Gini coefficient, or Gini index, is a statistical measure developed by the Italian statistician Corrado Gini to represent inequality in a distribution. Such inequality coefficients can be calculated for any distribution. The Gini coefficient is used in economics, and also in geography, as a yardstick for the distribution of income and wealth within individual countries, and thus as an aid to classifying countries and their level of development.
The Gini coefficient is derived from the Lorenz curve and takes a value between 0 (a perfectly even distribution) and 1 (a single person receives the entire income, i.e. a maximally unequal distribution). "Even distribution" here does not mean a uniform distribution in the probabilistic sense, but a distribution with a variance of 0. In the most common use case, the distribution of income in a country, it means that every adult has the same income, not that the different income classes are equally frequent.
Applications
Economics
The Gini coefficient is used in particular in welfare economics to describe, for example, the degree of equality or inequality in the distribution of wealth or income. The coefficient is an alternative to the S80 / S20 income quintile ratio, which is used in EU statistics.
Information theory
In information theory, it is used as a measure of the "purity" or "impurity" of information.
Machine learning
In machine learning, the Gini index, or more precisely the change in the Gini index (also called "Gini gain"), can be used when building a decision tree as a criterion for selecting the decision rule whose child nodes are as "pure" as possible. The idea is that the tree is finished once every decision is "pure", which is why the change in the Gini index is a suitable measure.
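The selection criterion can be sketched in a few lines of Python. This is a minimal illustration, assuming the impurity definition 1 - sum(p_k^2) that is common in the decision-tree literature; the function names and the toy labels are hypothetical. Note that this impurity is a different quantity from the Lorenz-curve coefficient defined further below.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent, children):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * gini_impurity(child) for child in children)
    return gini_impurity(parent) - weighted


# Toy data (assumed for illustration): two classes, two candidate splits.
parent = ["a", "a", "b", "b"]
perfect_split = [["a", "a"], ["b", "b"]]   # both children are pure
useless_split = [["a", "b"], ["a", "b"]]   # children are as mixed as the parent

print(gini_gain(parent, perfect_split))
print(gini_gain(parent, useless_split))
```

A split with pure child nodes yields the largest possible gain (here the full parent impurity of 0.5), while a split that leaves the class mix unchanged yields a gain of 0.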
Banking
In banking, the Gini coefficient is used as a measure of how well a rating system separates good from bad customers (discriminatory power).
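One common convention in credit scoring, which the text does not spell out, expresses this Gini coefficient as 2 · AUC - 1, where AUC is the probability that a randomly chosen good customer receives a better score than a randomly chosen bad one. The following sketch assumes that convention and assumes that a higher score means a better customer; the function name and the scores are hypothetical.

```python
def rating_gini(scores_good, scores_bad):
    """Gini of a rating system as 2 * AUC - 1, with AUC taken over all good/bad pairs."""
    wins = 0.0
    for g in scores_good:
        for b in scores_bad:
            if g > b:
                wins += 1.0      # good customer ranked above bad customer
            elif g == b:
                wins += 0.5      # ties count half
    auc = wins / (len(scores_good) * len(scores_bad))
    return 2.0 * auc - 1.0


# Hypothetical scores: perfect separation versus no separation at all.
print(rating_gini([0.8, 0.9], [0.1, 0.2]))
print(rating_gini([0.5, 0.5], [0.5, 0.5]))
```

Under this convention a rating system that separates the two groups perfectly scores 1, and one that cannot distinguish them scores 0.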
Normalization
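Depending on the application, the scale of possible values runs from 0 to 1, from 0 to 100, or from 0 to 10000. Likewise depending on the application, either the smallest or the largest value stands for even distribution. The value of absolute inequality can generally only be reached asymptotically: for n units of which a single one holds everything, the coefficient is 1 - 1/n, which approaches 1 only as n grows. This can be avoided by renormalizing, for example by multiplying the coefficient by n/(n - 1).
Definition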
General case
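For a discretely distributed quantity x_i sorted in ascending order, x_1 \le x_2 \le \dots \le x_n (example: household income), the Lorenz curve is given by
L\left(\frac{k}{n}\right) = \frac{\sum_{i=1}^{k} x_i}{\sum_{i=1}^{n} x_i} = \frac{\sum_{i=1}^{k} x_i}{n \bar{x}}, \quad k = 0, 1, \dots, n
For the position k/n in the income distribution, the Lorenz curve therefore indicates the cumulative share of total income; \bar{x} denotes the arithmetic mean. With an even distribution, the area between the 45-degree line and the Lorenz curve would be 0, and it increases for more unequal distributions. From this consideration, and the goal of obtaining a measure normalized to the interval [0, 1], the Gini inequality coefficient results as
G = \frac{\text{area between the 45-degree line and the Lorenz curve}}{\text{area under the 45-degree line}}
By geometrically decomposing the area into trapezoids of width 1/n, and writing L_i = L(i/n) with L_0 = 0, one obtains:
G = 1 - \frac{1}{n} \sum_{i=1}^{n} (L_{i-1} + L_i)
For a real distribution, one can calculate the Gini coefficient directly from the sorted values as follows (using L_i = \sum_{j=1}^{i} x_j / (n \bar{x})):
G = \frac{2 \sum_{i=1}^{n} i \, x_i}{n \sum_{i=1}^{n} x_i} - \frac{n + 1}{n}
An alternative formulation that does not require the data to be sorted is based on the so-called relative mean absolute difference. The mean absolute difference is the mean difference over all pairs of observations in a population. It is set in relation to the average income. So that the Gini coefficient takes the desired range of values, the result is divided by 2:
G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2 n^2 \bar{x}}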
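The two computational routes described above, a closed form on ascending-sorted data and half the relative mean absolute difference, can be compared in a short Python sketch. The function names and the sample incomes are assumptions made for illustration; the sketch assumes non-negative values with a positive total.

```python
def gini_sorted(values):
    """Gini via the closed form 2*sum(i*x_i)/(n*sum(x_i)) - (n+1)/n on sorted data."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def gini_mean_difference(values):
    """Gini as half the relative mean absolute difference; no sorting required."""
    n = len(values)
    mean = sum(values) / n
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2.0 * n * n * mean)


incomes = [40, 10, 30, 20]   # hypothetical incomes, deliberately unsorted

print(gini_sorted(incomes))
print(gini_mean_difference(incomes))
```

Both functions return 0.25 for this sample. Four equal incomes give 0, and the list [0, 0, 0, 1] gives 0.75, which illustrates that with n = 4 units the coefficient cannot exceed 1 - 1/n. The pairwise version needs on the order of n² operations, so the sorted form is the more practical choice for large data sets.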
Calculation based on quantiles
A certain part of a set A is assigned to a part of another set B. This could be, for example, money (A) distributed among people (B), or electricity consumption (A) among cities (B). It is crucial that A is a homogeneous, easily divisible quantity. Ownership of motor vehicles, for example, would not be suitable, because motor vehicles are neither homogeneous (individual types differ considerably) nor divisible into small units.
The Gini coefficient is the area between the Lorenz curve of an even distribution and the Lorenz curve of the observed distribution, normalized by the area under the Lorenz curve of the even distribution:
GUK = \frac{A - B}{A}
with GUK as the Gini inequality coefficient, A the area under the Lorenz curve of an even distribution, and B the area under the Lorenz curve of the observed distribution. (In this formula, A and B denote areas, not the two sets named above.)
Example
A is distributed to B, for example the wealth (A) is distributed to the population (B).
- 50 percent of B (b_{1}) is assigned 2.5 percent of A (v_{1}).
- 40 percent of B (b_{2}) is assigned 47.5 percent of A (v_{2}).
- 9 percent of B (b_{3}) is assigned 27.0 percent of A (v_{3}).
- 1 percent of B (b_{4}) is assigned 23.0 percent of A (v_{4}).
In a first step, the data are displayed "normalized":
b_{1} = 0.50   v_{1} = 0.025   v_{1}/b_{1} = 0.05
b_{2} = 0.40   v_{2} = 0.475   v_{2}/b_{2} = 1.188
b_{3} = 0.09   v_{3} = 0.270   v_{3}/b_{3} = 3
b_{4} = 0.01   v_{4} = 0.230   v_{4}/b_{4} = 23
In the second step, the Gini coefficient is calculated.
The Gini inequality coefficient (GUK) is obtained by evaluating a Lorenz curve.
In order to actually produce a Lorenz curve, the above values may have to be rearranged. All value pairs (b_{i}, v_{i}) must first be sorted in such a way that:
\frac{v_1}{b_1} \le \frac{v_2}{b_2} \le \dots \le \frac{v_n}{b_n}
In the above example, the correct sorting is already in place, so that there is no need to re-sort.
The Lorenz curve you are looking for arises when you enter the (x_{i}, y_{i}) pairs as points in a Cartesian coordinate system and then connect neighboring points with straight lines. The (x_{i}, y_{i}) pairs result from the (b_{i}, v_{i}) pairs according to the following calculation rule:
x_i = \sum_{k=1}^{i} b_k, \qquad y_i = \sum_{k=1}^{i} v_k
In the second step, the following data is determined from the data of the first step by summation (with (0, 0) added as a fixed value at the beginning):
x_{0} = 0.00   y_{0} = 0
x_{1} = 0.50   y_{1} = 0.025
x_{2} = 0.90   y_{2} = 0.5 (since 0.5 + 0.4 = 0.9 and 0.025 + 0.475 = 0.5)
x_{3} = 0.99   y_{3} = 0.77
x_{4} = 1.00   y_{4} = 1
With a completely even distribution of wealth, the Lorenz curve is a straight line from the point (0, 0) to the point (1, 1).
To determine the Gini coefficient, two quantities are first determined, which correspond graphically to areas. The first is the area under the line of even distribution; call this quantity A. The second is the area under the Lorenz curve of the actual distribution; call this quantity B. With these two quantities, the Gini inequality coefficient is calculated as follows:
GUK = \frac{A - B}{A}
Calculating the y-values of the Lorenz curve of the actual distribution:
y_{0} = 0.000
y_{1} = v_{1} = 0.025
y_{2} = v_{1} + v_{2} = 0.500
y_{3} = v_{1} + v_{2} + v_{3} = 0.770
y_{4} = v_{1} + v_{2} + v_{3} + v_{4} = 1.000
Calculation of the area B under the Lorenz curve of the actual distribution (see below):
(y_{1} - 0.5 · v_{1}) · b_{1} = 0.00625
(y_{2} - 0.5 · v_{2}) · b_{2} = 0.105
(y_{3} - 0.5 · v_{3}) · b_{3} = 0.05715
(y_{4} - 0.5 · v_{4}) · b_{4} = 0.00885
B = 0.00625 + 0.105 + 0.05715 + 0.00885 = 0.17725
Since a normalized representation is used, the line of completely even distribution connects the corner points (0, 0) and (1, 1). The triangle under it therefore has the area A = 0.5. That is why the following applies to the Gini inequality coefficient:
GUK = \frac{0.5 - B}{0.5} = 1 - 2B = 1 - 2 \cdot 0.17725 = 0.6455
Viewed graphically, the Gini coefficient is the ratio of the area between the line of even distribution and the Lorenz curve (A - B) to the area below the line of even distribution (A).
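The worked example can be reproduced with a short Python sketch. It assumes that the population shares b_i and quantity shares v_i each sum to 1 and that every b_i is positive; the function and variable names are chosen for illustration only.

```python
def gini_from_shares(pop_shares, qty_shares):
    """Gini from group shares b_i (of the population) and v_i (of the quantity)."""
    # Sort the groups by v_i / b_i so that the Lorenz curve is built in ascending order.
    groups = sorted(zip(pop_shares, qty_shares), key=lambda g: g[1] / g[0])
    area_under_lorenz = 0.0   # area B
    y = 0.0                   # cumulative quantity share y_i
    for b, v in groups:
        y += v
        area_under_lorenz += (y - 0.5 * v) * b   # area under one Lorenz segment
    area_equal = 0.5          # area A under the line of even distribution
    return (area_equal - area_under_lorenz) / area_equal


b = [0.50, 0.40, 0.09, 0.01]
v = [0.025, 0.475, 0.270, 0.230]

print(round(gini_from_shares(b, v), 4))   # 0.6455
```

Because the function sorts the groups itself, the pairs may be passed in any order and the result stays the same.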
Explanation of the calculation
The entire Gini area is a rectangle with the sides x_{n} times y_{n}, which in the normalized representation is the unit square. The Gini area of an even distribution is half of this total area. To calculate the area B under the Lorenz curve, the areas of the individual segments are added. Take the second segment, written here as B_{2}, as an example. The rectangle with the height y_{1} and the width b_{2} (i.e. from x_{1} to x_{2}) is counted in full. Of the rectangle that extends from height y_{1} to height y_{2}, only half is counted, because the other half lies above the Lorenz curve and does not belong to the area. So
B_2 = (y_1 + 0.5 \cdot v_2) \cdot b_2
or, because y_1 = y_2 - v_2,
B_2 = (y_2 - 0.5 \cdot v_2) \cdot b_2
Alternative view of the area calculation: the individual area is the area of the rectangle bounded by the points (x_{1}, y_{0} = 0), (x_{2}, y_{0} = 0), (x_{2}, y_{2}), (x_{1}, y_{2}) (content: b_{2} · y_{2}), minus the area of the right-angled triangle bounded by the points (x_{1}, y_{1}), (x_{2}, y_{2}), (x_{1}, y_{2}) (content: 0.5 · b_{2} · v_{2}), with the same result.
Data reduction
The Gini coefficient is a statistical measure used to quantify the inequality of a distribution. Such measures generally reduce a more or less complex data set to a single key figure. This figure can lead to misinterpretation if it is not used properly.
For almost every Lorenz curve there is at least one other Lorenz curve with exactly the same Gini value. It is obtained by mirroring the original Lorenz curve on the line that runs through the points (0, 1) and (1, 0). If shares of 10% / 90% of the quantity are distributed over 50% / 50% of the feature carriers, this results in the same Gini coefficient as distributing shares of 50% / 50% over 90% / 10% of the feature carriers. These two Lorenz curves are shown in Figure 1. The only exceptions are Lorenz curves that are already symmetrical about this line.
Both curves yield the same Gini coefficient of 0.4. In fact, for any given Gini coefficient there are infinitely many possible Lorenz curves (except in the cases of absolutely equal or absolutely unequal distribution). In this respect the Gini coefficient is like any other measure derived by aggregating a large amount of data. Inequality indicators such as the Gini coefficient arise from aggregating data with the aim of reducing complexity, so the associated loss of information is not an unintended side effect. As with any reduction of complexity, it only becomes a disadvantage if one forgets how the figure was produced and what it maps.
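The value of 0.4 for the two mirrored distributions can be checked numerically. The sketch below assumes each Lorenz curve is given as a list of vertices from (0, 0) to (1, 1) connected by straight lines; the function names are hypothetical.

```python
def gini_from_lorenz(points):
    """Gini from Lorenz curve vertices running from (0, 0) to (1, 1)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid under one segment
    return 1.0 - 2.0 * area


def mirror(points):
    """Reflect the curve across the line through (0, 1) and (1, 0)."""
    return sorted((1.0 - y, 1.0 - x) for x, y in points)


# 50% of the feature carriers hold 10% of the quantity, the other 50% hold 90%.
curve = [(0.0, 0.0), (0.5, 0.1), (1.0, 1.0)]
mirrored = mirror(curve)   # 90% hold 50%, the remaining 10% hold 50%

print(round(gini_from_lorenz(curve), 4))
print(round(gini_from_lorenz(mirrored), 4))
```

Both calls give 0.4 although the two distributions differ: in the first the inequality sits in the lower half of the population, in the second it is concentrated in the top tenth.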
Source of error in comparisons
Statements that compare inequality coefficients with one another require a particularly critical review of how the individual coefficients were calculated. A correct comparison requires that the coefficients were calculated in the same way in all cases. For example, a different granularity of the input data leads to different results. A Gini coefficient calculated from a few quantiles usually shows a slightly lower inequality than one calculated from more quantiles: thanks to the higher measurement resolution, the latter also captures the inequality within the ranges (i.e. between the quantile boundaries), which remains unevaluated at the coarser resolution.
In simple terms: a higher resolution of the data (almost always) yields a higher measured inequality.
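The effect of granularity can be illustrated with a small Python sketch. The incomes are invented for the purpose, and the function name is hypothetical; the sketch assumes, as the quantile method does, that the quantity is spread evenly within each group.

```python
def gini_from_groups(groups):
    """Gini from (population, total quantity) pairs, assuming equality inside each group."""
    groups = sorted(groups, key=lambda g: g[1] / g[0])   # ascending per-capita quantity
    pop = sum(p for p, _ in groups)
    qty = sum(t for _, t in groups)
    area, y = 0.0, 0.0
    for p, t in groups:
        b, v = p / pop, t / qty          # shares b_i and v_i
        y += v
        area += (y - 0.5 * v) * b        # area under one Lorenz segment
    return 1.0 - 2.0 * area


incomes = [1, 2, 3, 4, 10, 20, 30, 130]                   # hypothetical data
fine = [(1, x) for x in incomes]                          # eight groups of one person
coarse = [(4, sum(incomes[:4])), (4, sum(incomes[4:]))]   # lower and upper half

print(gini_from_groups(fine))
print(gini_from_groups(coarse))
```

With these numbers the person-level calculation gives 0.6875, whereas the calculation from two halves gives only 0.45, because the inequality inside each half is no longer visible.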
See also
- List of countries according to income distribution
- List of countries by wealth distribution
- Theil index
- Hoover inequality
Web links
- Travis Hale, University of Texas Inequality Project: The Theoretical Basics of Popular Inequality Measures (theory with practical examples; MS Word; 1.6 MB), example 1B
- Calculator: online and downloadable scripts and macros (for Python, Lua and OpenOffice.org 2.0 Calc)
- E-learning video: Lorenz curve and Gini coefficient
- World Income Inequality Database of the United Nations University
References
- Eurostat website (archived December 4, 2016 in the Internet Archive).
- L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone: Classification and Regression Trees. Chapman and Hall, New York 1984.
- Series of guidelines on credit risk: Rating models and validation, Austrian National Bank and Financial Market Authority, 2004 (archived December 4, 2011 in the Internet Archive).
- P. J. Lambert (2001): The Distribution and Redistribution of Income. Manchester University Press, pp. 31ff.
- R. Ochmann and A. Peichl (2006): Measuring Distributional Effects of Fiscal Reforms. Financial scientific discussion contributions No. 06-9, financial scientific research institute at the University of Cologne.
- Online calculator: uneven distribution
- Comparison: www.umversorgung.de/rechner/?quantiles=50,10|50.90 (blue curve) and www.umversorgung.de/rechner/?quantiles=90.50|10.50 (red curve)