Box plot

from Wikipedia, the free encyclopedia
A horizontal box plot over a number line

The box plot (also called a box-and-whisker plot; German: Kastengrafik) is a diagram used to display graphically the distribution of a variable measured on at least an ordinal scale. It combines several robust measures of dispersion and location in a single picture. A box plot is meant to give a quick impression of the range in which the data lie and how they are distributed over that range. To this end it shows all values of the so-called five-number summary: the median, the two quartiles and the two extreme values.

Construction

A box plot always consists of a rectangle, the box, and two lines that extend from it. These lines are called whiskers (less commonly "antennae" or "feelers"), and each ends in a short crossbar. A line inside the box usually marks the median of the distribution.

Box

The box covers the range in which the middle 50% of the data lie. It is bounded by the lower and upper quartiles, so its length equals the interquartile range (IQR), a measure of the spread of the data defined as the difference between the upper and lower quartile. The median is drawn as a solid line across the box; it splits the data into two halves, each containing 50% of the values. Its position within the box gives an impression of the skewness of the underlying distribution: if the median lies in the left part of the box, the distribution is skewed to the right, and vice versa.
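The quantities that define the box can be computed directly. The following is a minimal Python sketch rather than a reference implementation: it assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data (one of several common quartile conventions), and the function names `quartiles` and `median_position` as well as the sample values are made up for illustration.

```python
from statistics import median


def quartiles(data):
    """Return (q1, med, q3) using the median-of-halves convention.

    Assumption: with an odd number of values the middle value is left out
    of both halves. Other quartile conventions can give slightly
    different results.
    """
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]
    upper = xs[len(xs) - half:]
    return median(lower), median(xs), median(upper)


def median_position(q1, med, q3):
    """Relative position of the median in the box (0 = lower edge, 1 = upper edge).

    Assumes q3 > q1, i.e. the box has non-zero length.
    """
    return (med - q1) / (q3 - q1)


sample = [2, 3, 3, 4, 5, 9, 12, 20]    # illustrative values only
q1, med, q3 = quartiles(sample)
iqr = q3 - q1                           # length of the box
pos = median_position(q1, med, q3)
print(q1, med, q3, iqr, pos)
```

A position below 0.5 means the median sits in the lower (left) part of the box, which points to a right-skewed distribution; a position above 0.5 points to a left-skewed one.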

Whiskers

Box plot with whiskers measuring 1.5 × IQR
Box plot of the same data with whiskers from the minimum to the maximum of the data

The whiskers represent the values outside the box. Unlike the box, the whiskers are not defined uniformly.

One possible definition, due to John W. Tukey, limits the length of each whisker to at most 1.5 times the interquartile range (1.5 × IQR), measured from the edge of the box. The whisker does not end exactly at this limit but at the most extreme data value that still lies within it. The whisker length is therefore determined by the data and not by the interquartile range alone, which is also why the two whiskers need not be equally long. If no values lie beyond the 1.5 × IQR limit, the whiskers end at the minimum and the maximum. Otherwise, the values beyond the whiskers are plotted individually. These values can then be treated as suspected outliers, or are referred to directly as outliers.

Outliers lying between 1.5 × IQR and 3 × IQR from the box are often called "mild" outliers, and values more than 3 × IQR from the box "extreme" outliers. The two kinds are then usually marked with different symbols in the diagram.
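The Tukey-style rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: the quartiles are passed in as already computed (here by the median-of-halves convention), the function names `tukey_whiskers` and `classify` are hypothetical, and the data values are invented for the example.

```python
def tukey_whiskers(data, q1, q3, k=1.5):
    """Whisker ends and flagged points for whiskers limited to k * IQR."""
    iqr = q3 - q1
    lo_limit, hi_limit = q1 - k * iqr, q3 + k * iqr
    within = [x for x in data if lo_limit <= x <= hi_limit]
    # Whiskers end at actual data values, not at the limits themselves.
    whisker_lo, whisker_hi = min(within), max(within)
    beyond = sorted(x for x in data if x < lo_limit or x > hi_limit)
    return whisker_lo, whisker_hi, beyond


def classify(x, q1, q3):
    """Label a value 'within', 'mild' or 'extreme' by its distance from the box."""
    iqr = q3 - q1
    dist = max(q1 - x, x - q3, 0)       # 0 for values inside the box
    if dist <= 1.5 * iqr:
        return "within"
    return "mild" if dist <= 3 * iqr else "extreme"


values = [2, 10, 11, 12, 13, 14, 15, 16, 22, 40]   # illustrative values only
q1, q3 = 11, 16                                    # median-of-halves quartiles of `values`
lo, hi, outside = tukey_whiskers(values, q1, q3)
print(lo, hi, outside)
print([classify(x, q1, q3) for x in outside])
```

In this sketch the whiskers stop at 10 and 22, the last data values inside the 1.5 × IQR limits, and the two remaining values are reported separately with their labels.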

Another possible definition lets the whiskers extend to the smallest and largest values in the data. In this representation outliers can no longer be recognized, since the box together with the whiskers covers the entire range of the data.

In another variant, the lower whisker ends at the 2.5% quantile and the upper whisker at the 97.5% quantile, so that 95% of all observed values lie within the whisker limits. With this representation, once the sample is large enough (how large depends on the quantile definition), some points are always plotted individually; these should not automatically be interpreted as outliers.
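A short Python sketch of the quantile-based variant follows. It assumes linear interpolation between data points (the "inclusive" method of the standard library function `statistics.quantiles`, available from Python 3.8); the function name `percentile_whiskers` and the data are illustrative only.

```python
from statistics import quantiles


def percentile_whiskers(data):
    """Return the 2.5% and 97.5% quantiles as whisker ends.

    Assumption: linear interpolation (method="inclusive"); other quantile
    definitions shift the limits slightly.
    """
    cuts = quantiles(data, n=40, method="inclusive")   # 39 cut points, 2.5% apart
    return cuts[0], cuts[-1]


values = list(range(1, 101))            # illustrative data: the integers 1..100
lo, hi = percentile_whiskers(values)
outside = [x for x in values if x < lo or x > hi]
print(lo, hi, outside)
```

With 100 evenly spaced values, three points fall outside each whisker even though none of them is unusual, which is why such points should not be read as outliers automatically.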

Modifications

Notched box plot representing the size of each state in the United States.

One modification is to mark the arithmetic mean in the box plot, usually with a star. Since the box plot otherwise contains only robust measures of dispersion and location, the arithmetic mean, as a non-robust measure of location, strictly speaking does not belong in a box plot.

In a notched box plot, confidence intervals for the median are shown as well.
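The text does not say how the notch limits are computed. One widely used rule of thumb, assumed here purely for illustration, places the notches at the median ± 1.57 × IQR / √n, where n is the sample size. The function name `median_notch` and the numbers passed to it are hypothetical.

```python
from math import sqrt


def median_notch(med, q1, q3, n, c=1.57):
    """Approximate notch limits around the median.

    Assumption: the rule median +/- c * IQR / sqrt(n) with c = 1.57.
    Plotting packages may use a slightly different constant or a
    different method altogether.
    """
    half_width = c * (q3 - q1) / sqrt(n)
    return med - half_width, med + half_width


# Illustrative values only: median 50, quartiles 40 and 60, 100 observations.
lo, hi = median_notch(med=50.0, q1=40.0, q3=60.0, n=100)
print(round(lo, 2), round(hi, 2))
```

Under this rule the notch narrows as the sample grows, because the half-width is divided by the square root of n.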

Summary of the characteristic values

The advantage of a box plot is that certain characteristic values of a distribution can be read directly from the graphical representation.

Characteristic value | Description | Location in the box plot
Minimum | Smallest value in the data set | End of a whisker or the most distant outlier
Lower quartile | 25% of the data values are less than or equal to this value | Beginning of the box
Median | 50% of the data values are less than or equal to this value | Line inside the box
Upper quartile | 75% of the data values are less than or equal to this value | End of the box
Maximum | Largest value in the data set | End of a whisker or the most distant outlier
Range | Entire range of values in the data set | Length of the whole box plot (including outliers)
Interquartile range | Range containing the middle 50% of the data (between the 0.25 and 0.75 quantiles) | Length of the box

Application

Because of their simple structure, box plots are used mainly to get a quick overview of existing data. The distribution that the data follow need not be known. The box shows the range containing the middle 50% of the data, and the box together with the whiskers shows the range containing most of the data. The position of the median within the box indicates whether the distribution is symmetric or skewed. The box plot is less suitable for bimodal or multimodal distributions; to reveal such features, histograms or plots of kernel density estimates are recommended.

Box plots whose whiskers are at most one and a half times the interquartile range long are also suitable for identifying possible outliers, and they can indicate whether the data follow a particular distribution. If the box plot is highly asymmetric, contains an unusually large number of outliers, or shows outliers far from the box, this suggests, for example, that the data are not normally distributed.
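To judge what counts as an unusually large number of outliers, it helps to know how many points a normal distribution is expected to place beyond the 1.5 × IQR limits. The following Python sketch computes that share for a standard normal distribution using the standard library class `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()               # mean 0, standard deviation 1
q3 = std_normal.inv_cdf(0.75)           # upper quartile, about 0.674
q1 = -q3                                # lower quartile, by symmetry
iqr = q3 - q1
upper_limit = q3 + 1.5 * iqr            # about 2.698 standard deviations
# Share of the distribution beyond either limit (both tails together).
share_outside = 2 * (1 - std_normal.cdf(upper_limit))
print(f"{share_outside:.2%}")
```

The result is roughly 0.7%, so for normally distributed data only about 7 in 1000 observations would be expected to fall beyond the whiskers. A sample in which a noticeably larger share is flagged therefore fits a normal distribution poorly.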

The main advantage of the box plot is that it allows a quick comparison of the distributions in different subgroups. While a histogram extends in two dimensions, a box plot is essentially one-dimensional, so several data sets can easily be drawn side by side (or one above the other if drawn horizontally) on the same scale and compared.

Example

Example of a box plot

This example is based on a series of measurements with the following 20 data points:

No.         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
(unsorted)  9   6   7   7   3   9  10   1   8   7   9   9   8  10   5  10  10   9  10   8
(sorted)    1   3   5   6   7   7   7   8   8   8   9   9   9   9   9  10  10  10  10  10

A box plot gives a quick overview of these data. It shows at a glance that the median (solid line) is exactly 8.5 and that the lower and upper quartiles are 7 and 9.5, since these are the edges of the box, which contains the middle 50% of the measured values. Consequently the interquartile range, which equals the length of the box, is exactly 2.5.

This box plot was drawn with whiskers of at most 1.5 times the interquartile range, so they can be at most 3.75 units long. However, a whisker only ever extends to a data value that lies within these 3.75 units of the box. The upper whisker therefore ends at 10, because the data contain no larger value, and the lower whisker ends at 5, because the next smaller value, 3, is more than 3.75 units from the lower edge of the box at 7.

The values 1 and 3 are marked as outliers in the box plot because they lie neither in the box nor within the whiskers. Such values should be checked to see whether they are genuine outliers, typing errors or otherwise conspicuous values.
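The figures in this example can be checked with a short Python script. It is a sketch under one assumption: the quartiles are taken as the medians of the lower and upper halves of the sorted data, which reproduces the values 7 and 9.5 quoted here, while other quartile conventions may give a slightly different upper quartile. The variable names are illustrative.

```python
from statistics import median

data = [9, 6, 7, 7, 3, 9, 10, 1, 8, 7, 9, 9, 8, 10, 5, 10, 10, 9, 10, 8]

xs = sorted(data)
half = len(xs) // 2
q1 = median(xs[:half])                  # lower quartile (median of lower half)
med = median(xs)
q3 = median(xs[len(xs) - half:])        # upper quartile (median of upper half)
iqr = q3 - q1

max_whisker = 1.5 * iqr                 # longest allowed whisker
within = [x for x in xs if q1 - max_whisker <= x <= q3 + max_whisker]
whisker_lo, whisker_hi = within[0], within[-1]   # whiskers end at data values
outliers = [x for x in xs if x < whisker_lo or x > whisker_hi]

print("quartiles and median:", q1, med, q3)
print("IQR:", iqr, "maximum whisker length:", max_whisker)
print("whiskers:", whisker_lo, whisker_hi)
print("outliers:", outliers)
```

Both outliers lie less than 3 × IQR = 7.5 units below the box, so under the mild/extreme convention they would count as mild outliers.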

Since the median lies slightly to the right within the box, one can also conclude that the underlying distribution of the measurements is skewed to the left. Moreover, the distribution is probably not normal, since the box plot is asymmetric and contains a comparatively large number of outliers.

See also

  • Variation fan, a circular graph that shows the same information about dispersion as a box plot.

Literature

  • John W. Tukey: Exploratory data analysis. Addison-Wesley, 1977, ISBN 0-201-07616-0.
  • Falk et al.: Foundations of statistical analysis and applications with SAS. Birkhäuser, 2002.


References

  1. Franz Kronthaler: Statistics applied. Data analysis is (not) an art. Springer-Verlag, Berlin/Heidelberg 2014, ISBN 978-3-642-53739-4, p. 38.
  2. Karl Mosler, Friedrich Schmid: Descriptive statistics and economic statistics. 3rd edition. Springer-Verlag, Berlin/Heidelberg 2006, ISBN 978-3-540-37459-6, p. 33.
  3. "Simple box plot - the distribution of a feature that is at least ordinally scaled is shown." Quoted from the statistics glossary.