Local outlier factor

The Local Outlier Factor ( LOF ) is an algorithm for the detection of density-based outliers . It was proposed by Markus M. Breunig, Hans-Peter Kriegel , Raymond T. Ng and Jörg Sander in 2000. The core idea of LOF is to compare the density of a point with the densities of its neighbors. A point that is “denser” than its neighbors is in a cluster. A point with a significantly lower density than its neighbors, on the other hand, is an outlier.

LOF has many concepts in common with the cluster analysis algorithms DBSCAN and OPTICS .

Basic principle

The core idea of LOF: compare the local density of a point with that of its neighbors.

LOF defines the “local environment” of a point via its closest neighbors. The distance to these is used to estimate a local density . In a second step, the quotient is formed from the local densities of its neighbors and the local density of the point itself. This value moves close to when a point is in an area of uniform density. For objects that are isolated from such an area, the value is significantly higher and thus indicates outliers. ${\ displaystyle k}$ ${\ displaystyle 1.0}$

Formally

This is the distance between the object and its k -nearest neighbor. This set can contain more than k objects if there are several objects with the same distance. We denote this “ k- neighborhood” here . ${\ displaystyle {\ mbox {k-distance}} (A)}$ ${\ displaystyle A}$ ${\ displaystyle N_ {k} (A)}$

Illustration of the reachability distance. Objects B and C have the same reachable distance (k = 3), while D is not a k nearest neighbor

The "reachability distance" is defined based on this distance:

${\ displaystyle {\ mbox {Accessibility distance}} _ {k} (A, B) = \ max \ {{\ mbox {k-distance}} (B), d (A, B) \}}$

The reachable distance of an object from is either the true distance, but at least that of . Objects that belong to the k -nearest neighbors of are therefore considered to be equidistant. The motivation for this distance definition is to get more stable results. (However, this is no longer a distance function in the mathematical sense, since it is not symmetrical.) ${\ displaystyle A}$ ${\ displaystyle B}$ ${\ displaystyle {\ mbox {k-distance}}}$ ${\ displaystyle B}$ ${\ displaystyle B}$

The local accessibility density ( " local reachability density ", " lrd ") of an object is defined as ${\ displaystyle A}$

${\ displaystyle {\ mbox {lrd}} _ {k} (A): = 1 / \ left ({\ frac {\ sum _ {B \ in N_ {k} (A)} {\ mbox {Accessibility distance }} _ {k} (A, B)} {| N_ {k} (A) |}} \ right)}$

This density is therefore the reciprocal of the average reachable distance of the object from its neighbors, not the other way round the average reachable distance of the neighbors from , which would be by definition . If there are duplicates in a data record, the reachability of these objects becomes infinite. ${\ displaystyle A}$ ${\ displaystyle A}$ ${\ displaystyle {\ mbox {k-distance}} (A)}$ ${\ displaystyle k + 1}$

The local accessibility density is now compared with that of the neighbors:

${\ displaystyle {\ mbox {LOF}} _ {k} (A): = {\ frac {\ sum _ {B \ in N_ {k} (A)} {\ frac {{\ mbox {lrd}} ( B)} {{\ mbox {lrd}} (A)}}} {| N_ {k} (A) |}} = {\ frac {\ sum _ {B \ in N_ {k} (A)} { \ mbox {lrd}} (B)} {| N_ {k} (A) |}} / {\ mbox {lrd}} (A)}$

The “Local Outlier Factor” (LOF) is the “average accessibility density of the neighbors” divided by the accessibility density of the object itself. A value of approximately means that the object has a density comparable to its neighbors (ie is not an outlier). A value less than even means a denser region (which would be a so-called “inlier”), while significantly higher values indicate an outlier. ${\ displaystyle 1}$ ${\ displaystyle 1}$ ${\ displaystyle 1}$

advantages

LOF values visualized with ELKI . Although the upper right cluster has a density comparable to the lower left outliers, they are classified correctly.

In contrast to many global outlier detection methods, LOF can deal with areas of different density in the same data set. Points with a “medium” density in a “high” environment are classified as outliers by the LOF, while a point with a “medium” density in a “thin” environment is explicitly not recognized as such.

While the geometric intuition of LOF only makes sense in low-dimensional vector spaces, the method can be applied to any data on which a “dissimilarity” can be defined. It does not have to be a distance function in the strict mathematical sense; the triangle inequality , for example, is not required. The algorithm was used successfully on a wide variety of data sets, for example to detect attacks in computer networks , where it provided better detection rates than the comparison methods.

disadvantage

An important disadvantage of LOF is that the result values are difficult to interpret. Values around and less are certainly not outliers, but there is no clear rule as to the value from which a point is a significant outlier. In a very even data set, values of are conspicuous; in a data set with strong density fluctuations, a value can be like a completely normal data point. In the worst case, such differences even occur in different parts of the same data set. ${\ displaystyle 1}$ ${\ displaystyle 1.1}$ ${\ displaystyle 2}$

Extensions

Feature Bagging for Outlier Detection applies LOF in multiple projections and combines the results to get better results in high-dimensional data.

Local Outlier Probability (LoOP) is a method derived from LOF that statistically estimates the density in order to become less dependent on the exact value of k . In addition, the result is statistically normalized in the value range in order to provide values that are easier to interpret.

{\ displaystyle [0,1]}

Interpreting and Unifying Outlier Scores introduces a normalization for LOF and other procedures that statistically normalizes the scores in the interval in order to improve the usability of the result. ${\ displaystyle [0,1]}$

Availability

A reference implementation is available in the ELKI software package , including implementations of comparison methods.

Individual evidence

^ MM Breunig, Hans-Peter Kriegel , RT Ng, J. Sander: LOF: Identifying Density-based Local Outliers . In: ACM SIGMOD Record . No. 29 , 2000, pp. 93 , doi : 10.1145 / 335191.335388 ( online PDF).

↑ Ar Lazarevic, Aysel Ozgur, Levent Ertoz, Jaideep Srivastava, Vipin Kumar: A comparative study of anomaly detection schemes in network intrusion detection . In: Proc. 3rd SIAM International Conference on Data Mining . 2003, p. 25-36 ( online PDF). Online ( Memento of the original from July 17, 2013 in the Internet Archive ) Info: The archive link was inserted automatically and has not yet been checked. Please check the original and archive link according to the instructions and then remove this notice. @1@ 2
^ A. Lazarevic, V. Kumar: Feature bagging for outlier detection . In: Proc. 11th ACM SIGKDD international conference on Knowledge Discovery in Data Mining . 2005, p. 157-166 , doi : 10.1145 / 1081870.1081891 .
↑ Hans-Peter Kriegel , P. Kröger, E. Schubert, A. Zimek: LoOP: Local Outlier Probabilities . In: Proc. 18th ACM Conference on Information and Knowledge Management (CIKM) . 2009, p. 1649 , doi : 10.1145 / 1645953.1646195 ( online PDF).
↑ Hans-Peter Kriegel , Peer Kröger, Erich Schubert, Arthur Zimek: Interpreting and Unifying Outlier Scores . In: Proc. 11th SIAM International Conference on Data Mining . 2011 ( online PDF).

[1] MM Breunig, Hans-Peter Kriegel , RT Ng, J. Sander: LOF: Identifying Density-based Local Outliers . In: ACM SIGMOD Record . No. 29 , 2000, pp. 93 , doi : 10.1145 / 335191.335388 ( online PDF).