Anscombe Quartet

Figure: These four sets of points look different but have almost the same simple statistical measures.

The Anscombe Quartet consists of four data sets that have almost identical simple statistical properties but look very different when plotted. Each set consists of eleven (x, y) points. The English statistician Francis Anscombe constructed the sets in 1973 to highlight the importance of graphical data analysis and to demonstrate the effect of outliers.

Presentation

The following properties hold for each of the four data sets:

Property                       Value
Mean of x                      9 (exact)
Variance of x                  11 (exact)
Mean of y                      7.50 (to 2 decimal places)
Variance of y                  4.122 or 4.127 (to 3 decimal places)
Correlation between x and y    0.816 (to 3 decimal places)
Linear regression line         y = 3.00 + 0.500 x (to 2 and 3 decimal places, respectively)
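
The measures in the table can be written out as follows. This is a sketch that assumes the usual sample definitions with n = 11 points per set; the symbols (\bar{x}, s_x^2, r, a, b) are notation introduced here and do not come from the source. The n - 1 denominator is the convention under which the variance of x comes out as exactly 11.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i,
&
s_x^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\\
r &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}},
\\
b &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
          {\sum_{i=1}^{n} (x_i - \bar{x})^2},
&
a &= \bar{y} - b\,\bar{x},
\qquad \hat{y} = a + b\,x .
\end{align*}
```

For the x values 4, 5, ..., 14 the squared deviations from the mean 9 sum to 110, so s_x^2 = 110 / 10 = 11. The fourth set (ten points at x = 8 and one at x = 19) gives the same sum, 10 · 1 + 100 = 110.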

The first scatter plot (top left) appears to show a simple linear relationship: the two variables are correlated. The second scatter plot (top right) shows a clear relationship between the variables, but it is not linear. In the third scatter plot (bottom left) the relationship appears to be linear, but there is an outlier. The fourth scatter plot (bottom right) also contains an outlier, while all other data points share the same x value. The Bravais-Pearson correlation coefficient, a measure of linear association, is 0.816 for all four data sets. However, it describes the relationship correctly only for the top-left scatter plot.
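
The effect of the outlier in the third set can be checked numerically. The following is a minimal sketch, not part of Anscombe's paper: it takes set III from the data table in this article, computes residuals from the common regression line y = 3.00 + 0.500 x, drops the point with the largest residual, and recomputes the correlation. The helper `pearson_r` and all variable names are illustrative choices.

```python
def pearson_r(x, y):
    """Bravais-Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


# Set III, transcribed from the data table in this article.
x3 = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
y3 = [5.39, 5.73, 6.08, 6.42, 6.77, 7.11, 7.46, 7.81, 8.15, 12.74, 8.84]

# Residuals from the regression line shared by all four sets.
residuals = [b - (3.0 + 0.5 * a) for a, b in zip(x3, y3)]

# Index of the point that lies farthest from the line.
outlier = max(range(len(x3)), key=lambda i: abs(residuals[i]))

# Remove that point and recompute the correlation.
x3_clean = x3[:outlier] + x3[outlier + 1:]
y3_clean = y3[:outlier] + y3[outlier + 1:]

r_all = pearson_r(x3, y3)
r_clean = pearson_r(x3_clean, y3_clean)
print(f"outlier at x = {x3[outlier]}")
print(f"set III: r = {r_all:.3f} with the outlier, {r_clean:.3f} without it")
```

Applying the same step to set IV would leave eleven minus one points that all have x = 8, so the variance of x would be zero and the correlation coefficient would be undefined. A single summary number hides both situations; the scatter plots reveal them at a glance.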

The Anscombe Quartet is used to show why data should be examined graphically before starting any analysis that rests on assumptions about their statistical properties. It also shows that simple statistical measures are not always sufficient to describe a data set.

The four sets of data points are summarized in the table below. The x values are the same for the first three sets.

The Anscombe Quartet
     I              II            III             IV
    x      y       x      y       x      y       x      y
  4.0   4.26     4.0   3.10     4.0   5.39     8.0   5.25
  5.0   5.68     5.0   4.74     5.0   5.73     8.0   5.56
  6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.76
  7.0   4.82     7.0   7.26     7.0   6.42     8.0   6.58
  8.0   6.95     8.0   8.14     8.0   6.77     8.0   6.89
  9.0   8.81     9.0   8.77     9.0   7.11     8.0   7.04
 10.0   8.04    10.0   9.14    10.0   7.46     8.0   7.71
 11.0   8.33    11.0   9.26    11.0   7.81     8.0   7.91
 12.0  10.84    12.0   9.13    12.0   8.15     8.0   8.47
 13.0   7.58    13.0   8.74    13.0  12.74     8.0   8.84
 14.0   9.96    14.0   8.10    14.0   8.84    19.0  12.50
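
The summary values listed under Presentation can be reproduced from this table. The sketch below uses only the Python standard library; the names `QUARTET` and `summarize` are illustrative, and the variance is assumed to be the sample variance (n - 1 denominator), which is what `statistics.variance` computes.

```python
from statistics import mean, variance

# Anscombe's four data sets, transcribed from the table above.
X123 = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]  # shared by sets I to III
QUARTET = {
    "I": (X123, [4.26, 5.68, 7.24, 4.82, 6.95, 8.81,
                 8.04, 8.33, 10.84, 7.58, 9.96]),
    "II": (X123, [3.10, 4.74, 6.13, 7.26, 8.14, 8.77,
                  9.14, 9.26, 9.13, 8.74, 8.10]),
    "III": (X123, [5.39, 5.73, 6.08, 6.42, 6.77, 7.11,
                   7.46, 7.81, 8.15, 12.74, 8.84]),
    "IV": ([8] * 10 + [19], [5.25, 5.56, 5.76, 6.58, 6.89, 7.04,
                             7.71, 7.91, 8.47, 8.84, 12.50]),
}


def summarize(x, y):
    """Return means, sample variances, correlation and least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": variance(x),
        "mean_y": my,
        "var_y": variance(y),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


for name, (x, y) in QUARTET.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four printed rows agree to the precision given in the property table, even though the underlying points differ in almost every row of the data table.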

Evolutionary algorithms can now be used to generate automatically data sets that share the most important summary statistics but look completely different when plotted.
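
The idea can be illustrated with a very small evolutionary scheme. The sketch below is an assumption-laden toy, not the procedure used in the literature: it runs a (1+1) evolution strategy that starts from random y values, mutates one value at a time, and keeps the child whenever it matches the target statistics at least as well as the parent. The targets are the rounded values from the property table; `TARGET`, `error`, `evolve`, the step count and the mutation width `sigma` are all hypothetical choices.

```python
import random

X = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]  # x values of sets I to III

# Rounded target statistics, taken from the property table (assumption:
# matching these three values is the only goal of the search).
TARGET = {"mean_y": 7.50, "var_y": 4.127, "r": 0.816}


def stats(x, y):
    """Mean of y, sample variance of y, and Pearson correlation with x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return {"mean_y": my, "var_y": syy / (n - 1), "r": sxy / (sxx * syy) ** 0.5}


def error(y):
    """Sum of squared deviations from the target statistics."""
    s = stats(X, y)
    return sum((s[key] - TARGET[key]) ** 2 for key in TARGET)


def evolve(steps=20000, sigma=0.1, seed=0):
    """(1+1) evolution strategy: one parent, one mutated child per step."""
    rng = random.Random(seed)
    parent = [rng.uniform(3.0, 13.0) for _ in X]
    start = best = error(parent)
    for _ in range(steps):
        child = parent[:]
        i = rng.randrange(len(child))
        child[i] += rng.gauss(0.0, sigma)   # mutate a single y value
        e = error(child)
        if e <= best:                       # keep the child if it is no worse
            parent, best = child, e
    return parent, start, best


y_new, start_error, final_error = evolve()
print("error before:", start_error, "after:", final_error)
print("statistics of the evolved set:", stats(X, y_new))
```

Because the child replaces the parent only when its error is no larger, the error never increases during the run. Different seeds give different point clouds with similar summary statistics; a practical generator would add a second objective that rewards a visibly different shape.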

References

  1. F. J. Anscombe: Graphs in Statistical Analysis. In: American Statistician. 27, No. 1, 1973, pp. 17-21.
  2. Glenn Elert: Linear Regression. In: The Physics Hypertextbook. Retrieved April 26, 2013.
  3. Philipp K. Janert: Data Analysis with Open Source Tools. O'Reilly Media, Inc., 2010, ISBN 0-596-80235-8, pp. 65-66.
  4. Samprit Chatterjee, Ali S. Hadi: Regression analysis by example. John Wiley and Sons, 2006, ISBN 0-471-74696-7, p. 91.
  5. David J. Saville, Graham R. Wood: Statistical methods: the geometric approach. Springer, 1991, ISBN 0-387-97517-9, p. 418.
  6. Edward R. Tufte: The Visual Display of Quantitative Information, 2nd edition, Graphics Press, Cheshire, CT 2001, ISBN 0-9613921-4-2.
  7. Sangeet Chatterjee, Aykut Firat: Generating Data with Identical Statistics but Dissimilar Graphics: A follow-up to the Anscombe dataset. In: American Statistician. 61, No. 3, 2007, pp. 248-254. doi:10.1198/000313007X220057.