# Gini coefficient

Figure: Lorenz curve (red) of the real distribution used to calculate the Gini coefficient, and the ideal equal distribution (black).

The Gini coefficient or Gini index is a statistical measure developed by the Italian statistician Corrado Gini to quantify unequal distributions. Such inequality coefficients can be calculated for any distribution. For example, the Gini coefficient is used in economics, and also in geography, as a yardstick for the distribution of income and wealth in individual countries and thus as an aid to classifying countries and their level of development.

The Gini coefficient is derived from the Lorenz curve and takes a value between 0 (perfectly equal distribution) and 1 (one person receives the entire income, i.e. maximally unequal distribution). Here, "uniform distribution" does not mean a uniform distribution in the probabilistic sense, but a distribution with a variance of 0. In the most common use case, the distribution of income in a country, it means that every adult has the same income, not that the different income classes are equally frequent.

## Applications

### Economy

The Gini coefficient is used in particular in welfare economics to describe, for example, the degree of equality or inequality in the distribution of wealth or income. The coefficient is an alternative to the S80 / S20 income quintile ratio, which is used in EU statistics.

### Information theory

In information theory, it is used as a measure of the "purity" or "impurity" of information.

### Machine learning

In machine learning, when generating a decision tree, the Gini index, or more precisely the change in the Gini index (also called "Gini gain"), can be used as a criterion to select the decision rule whose child nodes are as "pure" as possible. The idea is that once a decision produces "pure" nodes, the tree is complete at that point, which is why the change in the Gini index is a suitable criterion.
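As a minimal sketch of this criterion, the snippet below assumes the usual definition of Gini impurity, one minus the sum of squared class shares, which the text above does not spell out. The function names `gini_impurity` and `gini_gain` and the sample labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels (assumed definition)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent, children):
    """Decrease in impurity when `parent` is split into `children`.

    `children` is a list of label lists that together partition `parent`.
    """
    n = len(parent)
    weighted = sum(len(child) / n * gini_impurity(child) for child in children)
    return gini_impurity(parent) - weighted


parent = ["a", "a", "b", "b"]              # hypothetical labels at a node
pure_split = [["a", "a"], ["b", "b"]]      # both child nodes are pure
mixed_split = [["a", "b"], ["a", "b"]]     # both child nodes stay mixed

print(gini_gain(parent, pure_split))   # 0.5
print(gini_gain(parent, mixed_split))  # 0.0
```

The split that yields pure child nodes has the larger gain, so a tree builder using this criterion would prefer it.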

### Banking

In banking, the Gini coefficient is used as a measure of how well a rating system can separate good from bad customers (selectivity).
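One common way to put this into practice, not stated in the text above and therefore an assumption here, is to compute the Gini coefficient of a rating system as 2 · AUC − 1, where AUC is the probability that a randomly chosen good customer scores higher than a randomly chosen bad one. The function name `rating_gini` and the scores are hypothetical.

```python
def rating_gini(scores_good, scores_bad):
    """Gini of a rating system as 2 * AUC - 1 (assumed convention).

    AUC is computed by brute-force pairwise comparison; a higher score
    is assumed to mean a better customer, and ties count as half a win.
    """
    wins = 0.0
    for good in scores_good:
        for bad in scores_bad:
            if good > bad:
                wins += 1.0
            elif good == bad:
                wins += 0.5
    auc = wins / (len(scores_good) * len(scores_bad))
    return 2.0 * auc - 1.0


print(rating_gini([0.9, 0.8, 0.7], [0.3, 0.2]))  # 1.0 (perfect separation)
print(rating_gini([0.5, 0.5], [0.5, 0.5]))       # 0.0 (no separation)
```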

## Normalization

The scale of possible values ranges from 0 to 1, from 0 to 100, or from 0 to 10000, depending on the application. Also depending on the application, either the smallest or the largest value stands for an equal distribution. The value of absolute inequality can generally only be approached asymptotically; renormalizing avoids this.
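The following sketch illustrates why the maximum is only approached asymptotically. It uses the rank formula from the Definition section, under which a population of n units where one unit holds everything gets the value (n − 1)/n. The rescaling by n/(n − 1) is one possible renormalization and is an assumption, not necessarily the one meant above; the function names are hypothetical.

```python
def gini_sorted(values):
    """Gini coefficient via the rank formula sum(x_i * (2i - n - 1)) / (n^2 * mu)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum(x * (2 * i - n - 1) for i, x in enumerate(xs, start=1))
    return weighted / (n * total)  # n^2 * mu equals n * total


def gini_renormalized(values):
    """Rescale by n / (n - 1) so that 'one unit holds everything' gives 1 (assumed convention)."""
    n = len(values)
    return gini_sorted(values) * n / (n - 1)


one_holds_all = [0, 0, 0, 0, 10]         # hypothetical population with n = 5
print(gini_sorted(one_holds_all))        # 0.8, i.e. (n - 1) / n
print(gini_renormalized(one_holds_all))  # 1.0
```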

## Definition

### General case

For an ascendingly sorted, discretely distributed quantity $x = x_1, \ldots, x_n$ (example: household income), the Lorenz curve $L(j, n)$ is given by

$$L(j, n) = \sum_{i=1}^{j} \frac{x_i}{\sum_{k=1}^{n} x_k} = \sum_{i=1}^{j} \frac{x_i}{n\mu} = \sum_{i=1}^{j} l_i.$$

For the position $j$ in the income distribution, the Lorenz curve therefore indicates the cumulative share of total income. $\mu = \frac{\sum_{i=1}^{n} x_i}{n}$ denotes the arithmetic mean. With an even distribution, the area $A$ between the 45 degree line and the Lorenz curve would be 0, and it increases for more unequal distributions. From this consideration and the goal of obtaining a measure normalized to the interval $[0; 1]$, the Gini inequality coefficient results as

$$\mathrm{GUK} = 2A.$$

By geometrically decomposing the area $A$ one obtains:

$$A = \frac{1}{2} \sum_{i=1}^{n} l_i \left( \frac{i-1}{n} + \frac{i}{n} \right) - \frac{1}{2} = \sum_{i=1}^{n} l_i \, \frac{2i-n-1}{2n}.$$

For a real distribution, one can calculate the Gini coefficient directly as follows (using $l_i = x_i / (n\mu)$):

$$\mathrm{GUK} = 2A = \sum_{i=1}^{n} l_i \, \frac{2i-n-1}{n} = \frac{1}{n^2 \mu} \sum_{i=1}^{n} x_i (2i-n-1).$$

An alternative formulation that does not require the data to be sorted is based on the so-called relative mean absolute difference. The mean absolute difference is the mean difference over all pairs of observations in a population. It is set in relation to the average income. So that the Gini coefficient takes the desired range of values, the result is divided by 2:

$$\mathrm{GUK} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2n \sum_{i=1}^{n} x_i}$$

### Calculation based on quantiles
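Before turning to grouped data, the two general-case formulas from the previous subsection can be checked against each other. The sketch below implements the rank formula (which needs sorted data) and the mean-absolute-difference formula (which does not). The function names and the sample incomes are hypothetical.

```python
def gini_rank(values):
    """GUK = sum(x_i * (2i - n - 1)) / (n^2 * mu), with x sorted ascending."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)  # equals n * mu
    return sum(x * (2 * i - n - 1) for i, x in enumerate(xs, start=1)) / (n * total)


def gini_mean_difference(values):
    """GUK = sum_i sum_j |x_i - x_j| / (2 * n * sum_i x_i); no sorting needed."""
    n = len(values)
    total = sum(values)
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2 * n * total)


incomes = [4, 1, 3, 2]  # hypothetical incomes, deliberately unsorted
print(gini_rank(incomes))             # 0.25
print(gini_mean_difference(incomes))  # 0.25
```

The pairwise version takes quadratic time in n, so the sorted rank formula is usually the better choice for large data sets.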

A certain part of a set A is assigned to a part of another set B. This can be, for example, money (A) distributed among people (B) or electricity consumption (A) distributed among cities (B). It is crucial that A is a homogeneous, easily divisible quantity. Ownership of motor vehicles, for example, would not be suitable, because motor vehicles are neither homogeneous (individual types differ considerably) nor divisible into small units.

The Gini coefficient is the area between the Lorenz curve of an equal distribution and the Lorenz curve of the observed distribution, normalized by the area under the Lorenz curve of the equal distribution:

$$\mathrm{GUK} = \frac{A_g - A_{ug}}{A_g}$$

with GUK as the Gini inequality coefficient, $A_g$ the area under the Lorenz curve of a uniform distribution, and $A_{ug}$ the area under the Lorenz curve of the observed distribution.

#### Example

A is distributed to B, for example the wealth (A) is distributed to the population (B).

- 50 percent of B (b1) is assigned 2.5 percent of A (v1).
- 40 percent of B (b2) is assigned 47.5 percent of A (v2).
- 9 percent of B (b3) is assigned 27.0 percent of A (v3).
- 1 percent of B (b4) is assigned 23.0 percent of A (v4).


In a first step, the data are expressed in "normalized" form, as fractions of 1:

| i | $b_i$ | $v_i$ | $v_i / b_i$ |
|---|-------|-------|-------------|
| 1 | 0.50 | 0.025 | 0.05 |
| 2 | 0.40 | 0.475 | 1.188 |
| 3 | 0.09 | 0.270 | 3 |
| 4 | 0.01 | 0.230 | 23 |


In the second step, the Gini coefficient is calculated.

The Gini unequal distribution coefficient (GUK) is obtained by evaluating a Lorenz curve.

In order to actually produce a Lorenz curve, the above values may have to be rearranged. All value pairs $(v_i, b_i)$ must first be sorted so that:

$$\frac{v_i}{b_i} \geq \frac{v_{i-1}}{b_{i-1}}$$

In the above example, the values are already in the correct order, so there is no need to re-sort.
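The sorting rule can be sketched in a few lines. The snippet below starts from a shuffled version of the example data, stored as hypothetical `(b_i, v_i)` tuples, and orders the groups by ascending ratio $v_i / b_i$; it assumes every $b_i$ is greater than 0.

```python
# (b_i, v_i) pairs from the example, deliberately shuffled
groups = [(0.09, 0.270), (0.50, 0.025), (0.01, 0.230), (0.40, 0.475)]

# Sort by the share of A per share of B, smallest ratio first
ordered = sorted(groups, key=lambda bv: bv[1] / bv[0])

print(ordered)
# [(0.5, 0.025), (0.4, 0.475), (0.09, 0.27), (0.01, 0.23)]
```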

The desired Lorenz curve is obtained by plotting the pairs $(x_i, y_i)$ as points in a Cartesian coordinate system and connecting neighboring points with straight lines. The $(x_i, y_i)$ pairs result from the $(v_i, b_i)$ pairs according to the following rule:

$$x_i = \sum_{j=1}^{i} b_j \quad \text{and} \quad y_i = \sum_{j=1}^{i} v_j.$$

Summing the data of the first step in this way (with (0, 0) added as a fixed starting point) gives:

| i | $x_i$ | $y_i$ |
|---|-------|-------|
| 0 | 0.00 | 0 |
| 1 | 0.50 | 0.025 |
| 2 | 0.90 | 0.5 |
| 3 | 0.99 | 0.77 |
| 4 | 1.00 | 1 |

For example, $x_2 = 0.5 + 0.4 = 0.9$ and $y_2 = 0.025 + 0.475 = 0.5$.
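The summation step amounts to taking running totals of the $b_i$ and $v_i$. A minimal sketch using `itertools.accumulate` from the Python standard library; the variable names are hypothetical:

```python
from itertools import accumulate

b = [0.50, 0.40, 0.09, 0.01]      # shares of B, already sorted by v_i / b_i
v = [0.025, 0.475, 0.270, 0.230]  # shares of A

# Running totals, with the fixed starting point (0, 0) prepended
xs = [0.0] + list(accumulate(b))
ys = [0.0] + list(accumulate(v))

for x, y in zip(xs, ys):
    print(f"{x:.2f}  {y:.3f}")
# 0.00  0.000
# 0.50  0.025
# 0.90  0.500
# 0.99  0.770
# 1.00  1.000
```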


With a totally equal distribution of wealth, the Lorenz curve is a straight line from the point (0, 0) to the point (1, 1).

To determine the Gini coefficient, two quantities are first determined, both of which are areas in the graph. The first is the area under the equal-distribution line; call this quantity A. The second is the area under the curve of the actual distribution; call this quantity B. (These area names are unrelated to the sets A and B above.) With these two quantities, the Gini inequality coefficient is calculated as follows:

$$\mathrm{GUK} = \frac{A - B}{A}$$

Calculating the y-values of the Lorenz curve of the actual distribution:

$$\begin{aligned}
y_0 &= 0.000 \\
y_1 &= v_1 = 0.025 \\
y_2 &= v_1 + v_2 = 0.500 \\
y_3 &= v_1 + v_2 + v_3 = 0.770 \\
y_4 &= v_1 + v_2 + v_3 + v_4 = 1.000
\end{aligned}$$


Calculation of the area B under the Lorenz curve of the actual distribution (see below):

$$\begin{aligned}
(y_1 - 0.5 \cdot v_1) \cdot b_1 &= 0.00625 \\
(y_2 - 0.5 \cdot v_2) \cdot b_2 &= 0.105 \\
(y_3 - 0.5 \cdot v_3) \cdot b_3 &= 0.05715 \\
(y_4 - 0.5 \cdot v_4) \cdot b_4 &= 0.00885
\end{aligned}$$

$$B = 0.17725$$
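The four partial areas and their sum can be reproduced with a short script. As a cross-check, the sketch also computes the same area with the trapezoid rule on the Lorenz points, which should agree because $y_i - 0.5 \cdot v_i$ is the mean of $y_{i-1}$ and $y_i$. The variable names are hypothetical.

```python
from itertools import accumulate

b = [0.50, 0.40, 0.09, 0.01]      # shares of B
v = [0.025, 0.475, 0.270, 0.230]  # shares of A
y = list(accumulate(v))           # cumulative shares y_1 .. y_4

# Partial areas (y_i - 0.5 * v_i) * b_i, as in the calculation above
pieces = [(yi - 0.5 * vi) * bi for yi, vi, bi in zip(y, v, b)]
area_b = sum(pieces)

# Cross-check: trapezoid rule with heights y_(i-1) and y_i over width b_i
y_prev = [0.0] + y[:-1]
area_trapezoid = sum((yp + yi) / 2 * bi for yp, yi, bi in zip(y_prev, y, b))

print(f"{area_b:.5f}")          # 0.17725
print(f"{area_trapezoid:.5f}")  # 0.17725
```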


Since a normalized representation is used, the line of the totally equal distribution connects the corner points (0, 0) and (1, 1). The triangle below it therefore has the area A = 0.5. That is why the following applies to the Gini inequality coefficient:

$$\mathrm{GUK} = \frac{A - B}{A} = \frac{0.5 - B}{0.5} = 1 - 2 \cdot B = 1 - 0.3545 = 0.6455$$

Viewed graphically, the Gini coefficient is the ratio of the area between the equal-distribution line and the Lorenz curve (A − B) to the area below the equal-distribution line (A).
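Putting the steps of the example together, a minimal sketch of the whole quantile-based calculation might look as follows. It assumes that the shares `b` and `v` each sum to 1 and that every `b_i` is greater than 0; the function name `gini_from_groups` is hypothetical.

```python
from itertools import accumulate


def gini_from_groups(b, v):
    """Gini coefficient from shares b_i of B and the shares v_i of A assigned to them.

    Assumes sum(b) == sum(v) == 1 and every b_i > 0.
    """
    # Step 1: sort the groups by ascending ratio v_i / b_i
    pairs = sorted(zip(b, v), key=lambda bv: bv[1] / bv[0])
    b_sorted = [bi for bi, _ in pairs]
    v_sorted = [vi for _, vi in pairs]

    # Step 2: cumulative shares y_i and area B under the Lorenz curve
    y = list(accumulate(v_sorted))
    area_b = sum((yi - 0.5 * vi) * bi for yi, vi, bi in zip(y, v_sorted, b_sorted))

    # Step 3: GUK = (A - B) / A with A = 0.5 for the equal-distribution line
    area_a = 0.5
    return (area_a - area_b) / area_a


guk = gini_from_groups([0.50, 0.40, 0.09, 0.01], [0.025, 0.475, 0.270, 0.230])
print(f"{guk:.4f}")  # 0.6455
```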

Explanation of the calculation

The entire Gini area is a rectangle with the sides $v_1 + v_2 + v_3 + v_4$ times $b_1 + b_2 + b_3 + b_4$. The Gini area of an even distribution is half of the total Gini area. To calculate the area under the curve, all individual areas are added. Take $b_2$ as an example. The rectangle with the height $y_1$ and the width $b_2$ (i.e. from $x_1$ to $x_2$) is counted in full. Only half of the rectangle that reaches from height $y_1$ to height $y_2$ is counted, because the other half lies above the Gini line and does not belong to the Gini area. So

$$\text{area} = y_1 \cdot b_2 + \frac{(y_2 - y_1) \cdot b_2}{2} = \frac{(y_2 + y_1) \cdot b_2}{2}$$

or

$$\text{area} = \left( y_2 - \frac{v_2}{2} \right) \cdot b_2.$$

Alternative view of the area calculation: the individual area over $b_2$ is the area of the rectangle bounded by the points $(x_1, y_0 = 0)$, $(x_2, y_0 = 0)$, $(x_2, y_2)$, $(x_1, y_2)$ (content: $b_2 \cdot y_2$), minus the area of the right-angled triangle bounded by the points $(x_1, y_1)$, $(x_2, y_2)$, $(x_1, y_2)$ (content: $\tfrac{b_2 \cdot v_2}{2}$), with the same result.

## Data reduction

The Gini coefficient is a statistical measure used to quantify unequal distribution. Such measures generally reduce a more or less complex data set to a single key figure. This figure can lead to misinterpretation if it is not used properly.

In the case of the Gini coefficient, for example, for almost every Lorenz curve there is at least one other Lorenz curve with exactly the same Gini value. It is obtained by mirroring the original Lorenz curve at the line that runs through the points (0, 1) and (1, 0). If shares of 10% and 90% of the quantity are distributed over 50% and 50% of the feature carriers, the result is the same Gini coefficient as when shares of 50% and 50% are distributed over 90% and 10% of the feature carriers. These two Lorenz curves are shown in Figure 1. The only exceptions are Lorenz curves that are already symmetrical about this line.

Both curves share a Gini coefficient of 0.4. In fact, any given Gini coefficient corresponds to infinitely many possible Lorenz curves (except in the case of absolutely equal or absolutely unequal distribution). In this respect, the Gini coefficient is no different from any other measure obtained by aggregating a large amount of data. Inequality indicators such as the Gini coefficient arise from aggregating data with the aim of reducing complexity, so the associated loss of information is not an unintended side effect. Reductions in complexity generally only become a disadvantage if one forgets how they were produced and what they map.
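The mirrored pair from this example can be checked numerically. The sketch below represents each Lorenz curve as a list of points, reflects it across the line through (0, 1) and (1, 0) via (x, y) → (1 − y, 1 − x), and computes the Gini coefficient from the trapezoid area under the curve. The function names are hypothetical.

```python
def gini_from_lorenz(points):
    """Gini = 1 - 2 * (trapezoid area under a piecewise-linear Lorenz curve)."""
    area = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    return 1 - 2 * area


def mirror(points):
    """Reflect across the line through (0, 1) and (1, 0): (x, y) -> (1 - y, 1 - x)."""
    return sorted((1 - y, 1 - x) for x, y in points)


# 50% of the carriers hold 10% of the quantity, the other 50% hold 90%
curve = [(0.0, 0.0), (0.5, 0.1), (1.0, 1.0)]
mirrored = mirror(curve)  # kink moves to (0.9, 0.5): 90% of the carriers hold 50%

print(round(gini_from_lorenz(curve), 10))     # 0.4
print(round(gini_from_lorenz(mirrored), 10))  # 0.4
```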

## Source of error in comparisons

Statements that compare inequality coefficients require a particularly critical review of how the individual coefficients were calculated. A valid comparison requires that all coefficients were calculated in the same way. For example, input data of different granularity lead to different results. A Gini coefficient calculated from a few quantiles usually shows slightly less inequality than one calculated from more quantiles: the higher measurement resolution captures the inequality within the ranges (i.e. between the quantile boundaries), which remains unevaluated at the coarser resolution.

In simple terms: a higher resolution of the data (almost always) yields a lower measured equality, i.e. a higher Gini coefficient.
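The effect of granularity can be illustrated with a small sketch. It pools a hypothetical population of ten incomes into 2, 5 and 10 equally sized quantile groups and computes the Gini coefficient from each resulting Lorenz curve. The function names and the data are assumptions for illustration, and the group count is assumed to divide the population size evenly.

```python
def gini_from_lorenz(points):
    """Gini = 1 - 2 * (trapezoid area under a piecewise-linear Lorenz curve)."""
    area = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    return 1 - 2 * area


def lorenz_points(values, groups):
    """Lorenz curve of `values` pooled into `groups` equally sized quantile groups."""
    xs = sorted(values)
    size = len(xs) // groups  # assumes len(values) is divisible by groups
    total = sum(xs)
    points = [(0.0, 0.0)]
    running = 0
    for g in range(groups):
        running += sum(xs[g * size:(g + 1) * size])
        points.append(((g + 1) / groups, running / total))
    return points


incomes = list(range(1, 11))  # hypothetical incomes 1, 2, ..., 10
for groups in (2, 5, 10):
    print(groups, round(gini_from_lorenz(lorenz_points(incomes, groups)), 4))
# 2 0.2273
# 5 0.2909
# 10 0.3
```

With these data, the coefficient rises as the number of quantile groups increases, which is why coefficients computed at different resolutions should not be compared directly.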