G-test

In statistics, the G-test is used to check whether the observed frequencies in a contingency table could have arisen by chance. The G-test has replaced the older chi-square test in many fields, especially in computational linguistics.

As with the chi-square test, the values of the characteristic $X$ are divided into $m$ categories, and one counts how often the characteristic falls into each of these categories.

The formula for calculating the test statistic G is as follows:

$$G = 2 \sum_{i=1}^{m} N_i \cdot \ln\left(\frac{N_i}{n_{0i}}\right)$$

$N_i$ is the observed frequency with which the characteristic falls into the $i$-th category, $n_{0i}$ is the expected frequency of the same cell under the null hypothesis, and $\ln$ is the natural logarithm. The summation runs over all $m$ categories. The test statistic $G$ is approximately chi-square distributed with $m-1$ degrees of freedom.
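The formula can be illustrated with a minimal Python sketch. The function name `g_statistic` and the sample counts are assumptions chosen for illustration and do not come from the article. Categories with zero observations are treated as contributing nothing to the sum, which is a common convention (the limit of $x \ln x$ as $x$ approaches 0).

```python
import math

def g_statistic(observed, expected):
    """Return G = 2 * sum(N_i * ln(N_i / n_0i)) over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    total = 0.0
    for n_obs, n_exp in zip(observed, expected):
        if n_exp <= 0:
            raise ValueError("expected frequencies must be positive")
        # Assumed convention: an empty category contributes 0 to the sum.
        if n_obs > 0:
            total += n_obs * math.log(n_obs / n_exp)
    return 2.0 * total

# Hypothetical data: 60 observations in m = 3 categories,
# with a uniform distribution as the null hypothesis.
observed = [10, 20, 30]
expected = [20.0, 20.0, 20.0]

g = g_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1  # m - 1
print(f"G = {g:.3f}, degrees of freedom = {degrees_of_freedom}")
```

For these made-up counts the sketch gives a G of about 10.465 with 2 degrees of freedom, which exceeds the 5% critical value of the chi-square distribution with 2 degrees of freedom (about 5.991), so the null hypothesis would be rejected at that level.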

Comparison with the chi-square test

Both tests address the same statistical problem, but the most complex calculation step in the chi-square test is a squaring, whereas the G-test requires a logarithm. The chi-square test owes its popularity to this simple calculation, which can easily be carried out by hand for small contingency tables. In addition, the chi-square test has always been covered in basic statistics textbooks.

The rule of thumb for chi-square tests is that the expected frequency in each cell should be at least 5. The G-test is more robust with small samples.
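To make the difference in calculation concrete, the following sketch computes both statistics for the same counts. The function names and the data are assumptions for illustration only. Both values are compared against the same chi-square distribution with $m-1$ degrees of freedom.

```python
import math

def pearson_chi_square(observed, expected):
    """Pearson statistic: sum of squared deviations scaled by the expected count."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def g_statistic(observed, expected):
    """G statistic: 2 * sum(N_i * ln(N_i / n_0i)); empty categories contribute 0."""
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )

# Hypothetical counts for m = 3 categories under a uniform null hypothesis.
observed = [10, 20, 30]
expected = [20.0, 20.0, 20.0]

chi2 = pearson_chi_square(observed, expected)
g = g_statistic(observed, expected)
print(f"chi-square = {chi2:.3f}, G = {g:.3f}")
```

With these made-up counts the two statistics are close (10.000 and about 10.465). How far they diverge for other tables depends on the data, so this single example should not be read as a general statement about either test.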

Literature

arXiv:1206.4881