# Runaway

An outlier reading. The blue regression line was created without including the outlier, the purple one with.
The box plot is shown over a number line.

In statistics , one speaks of an outlier when a measured value or finding does not fit into an expected series of measurements or does not generally meet expectations. The "expectation" is usually defined as the range of variation around the expected value in which most of all measured values ​​come to lie, e.g. B. the interquartile range Q 75 - Q 25 . Values ​​that are more than 1.5 times the interquartile range outside this interval are (mostly arbitrarily) referred to as outliers. In the box plot particularly high outliers are presented separately. The robust statistics deal with the outlier problem. In data mining , too , one deals with the detection of outliers. Influential observations have to be distinguished from outliers .

## Check for measurement errors

It is then crucial to check whether the outlier is actually a reliable and genuine result, or whether there is a measurement error .

Example: The ozone hole over the Antarctic has already been measured for a number of years, but the measured values were assessed as obviously incorrectly measured (ie interpreted as “outliers” and ignored) and their scope was therefore not recognized.

### Outlier tests

Another approach has been u. a. proposed by Ferguson in 1961. It is then assumed that the observations come from a hypothetical distribution . Outliers are then observations that do not come from the hypothetical distribution. The following outlier tests all assume that the hypothetical distribution is a normal distribution and check whether one or more of the extreme values ​​do not come from the normal distribution:

The Walsh outlier test , however, is not based on the assumption of a specific distribution of the data. As part of the time series analysis , time series for which an outlier is suspected can be tested for it and then modeled with an outlier model.

### Differences to extreme values

A popular approach is to use the box plot to identify “outliers”. The observations outside the whiskers are arbitrarily referred to as outliers. For the normal distribution one can easily calculate that almost 0.7% of the mass of the distribution lies outside the whiskers. From a sample size of from one would expect (on average) at least one observation outside of the whiskers (or observations outside of the whiskers for ). It therefore makes more sense initially to speak of extreme values ​​instead of outliers. ${\ displaystyle n> 143}$${\ displaystyle k}$${\ displaystyle n> 143.3362 * k}$

## Multivariate outliers

Outliers in the bottom right of the scatter plot and box plots for each individual variable.

The situation becomes even more complicated in several dimensions. In the graphic on the right, the outlier in the lower right corner cannot be identified by inspecting every single variable; it is not visible in the box plots. Nevertheless, it will significantly influence a linear regression .

### Andrews curves

Andrew's curves with differently colored data.

Andrews (1972) suggested that each multivariate observation be represented by a curve: ${\ displaystyle (x_ {i1}, x_ {i2}, \ dotsc, x_ {ip})}$

${\ displaystyle f_ {i} (t) = {\ frac {x_ {i1}} {\ sqrt {2}}} + x_ {i2} \ sin (t) + x_ {i3} \ cos (t) + x_ {i4} \ sin (2t) + x_ {i5} \ cos (2t) + \ dotsb}$

This means that every multivariate observation is mapped onto a two-dimensional curve in the interval . Because of the sine and cosine terms, the function repeats itself outside the interval . ${\ displaystyle [- \ pi; \ pi]}$${\ displaystyle f_ {i} (t)}$${\ displaystyle [- \ pi; \ pi]}$

For every two observations and the following applies: ${\ displaystyle i}$${\ displaystyle j}$

${\ displaystyle \ underbrace {\ int _ {- \ pi} ^ {\ pi} \ left (f_ {i} (t) -f_ {j} (t) \ right) ^ {2} dt} _ {(1 )} = \ underbrace {\ pi \ sum _ {k = 1} ^ {p} (x_ {ik} -x_ {jk}) ^ {2}} _ {(2)}}$

The formula (1) to the left of the equal sign corresponds (at least approximately) to the area between the two curves, and the formula (2) to the right is (at least approximately) the multivariate Euclidean distance between the two data points.

So if the distance between two data points is small, then the area between the curves must also be small; i.e., the curves and must be close together. However, if the distance between two data points is large, the area between the curves must also be large; i.e., the curves and must run very differently. A multivariate outlier would be visible as a curve that clearly differs in its course from all other curves. ${\ displaystyle f_ {i} (t)}$${\ displaystyle f_ {j} (t)}$${\ displaystyle f_ {i} (t)}$${\ displaystyle f_ {j} (t)}$

• If the outlier is visible in exactly one variable, the further ahead this variable appears, the better the person perceives the different curves. Ideally, it should be the variable . In other words, it makes sense to sort the variables, e.g. B. becomes the variable with the greatest variance, or one takes the first principal component .${\ displaystyle x _ {\ bullet 1}}$${\ displaystyle x _ {\ bullet 1}}$
• If one has many observations, many curves have to be drawn so that the course of a single curve is no longer visible.

### Stahel-Donoho Outlyingness

Stahel (1981) and David Leigh Donoho (1982) defined the so-called outlyingness in order to obtain a measure of how far an observation value is from the bulk of the data. By calculating all possible linear combinations , i. H. the projection of the data point onto the vector , with the outlyingness ${\ displaystyle \ alpha _ {1} x_ {i1} + \ alpha _ {2} x_ {i2} + \ dotsb + \ alpha _ {p} x_ {ip} = \ alpha ^ {T} x_ {i}}$${\ displaystyle \ alpha}$${\ displaystyle \ sum _ {k = 1} ^ {p} \ alpha _ {i} ^ {2} = 1}$

${\ displaystyle \ operatorname {out} (x_ {i}) = \ sup _ {\ alpha} \ left (\ operatorname {out} (x_ {i}, \ alpha) \ right) = \ sup _ {\ alpha} \ left ({\ frac {\ alpha ^ {T} x_ {i} - \ operatorname {median} (\ alpha ^ {T} x)} {\ operatorname {mad} (\ alpha ^ {T} x)}} \ right)}$,

where the median of the projected points and the mean absolute deviation of the projected points as a robust measure of dispersion. The median serves as a robust position, the mean absolute deviation as a robust measure of dispersion. is a normalization. ${\ displaystyle \ operatorname {median} (\ alpha ^ {T} x)}$${\ displaystyle \ operatorname {mad} (\ alpha ^ {T} x)}$${\ displaystyle \ operatorname {out} (x_ {i}, \ alpha)}$

In practice, the Outlyingness is calculated by determining the maximum for several hundred or thousands of randomly selected projection directions . ${\ displaystyle \ alpha}$${\ displaystyle \ operatorname {out} (x_ {i}, \ alpha)}$

## Outlier detection in data mining

The English term outlier detection refers to the sub-area of data mining , which is about identifying atypical and conspicuous data sets. Application for this is, for example, the detection of (potentially) fraudulent credit card transactions in the large number of valid transactions. The first algorithms for outlier detection were closely based on the statistical models mentioned here, but due to calculation and, above all, runtime considerations, the algorithms have since moved away from them. An important method for this is the density-based local outlier factor .

## literature

• R. Khattree, DN Naik: Andrews Plots for Multivariate Data: Some New Suggestions and Applications . In: Journal of Statistical Planning and Inference . tape 100 , no. 2 , 2002, p. 411-425 , doi : 10.1016 / S0378-3758 (01) 00150-1 .