DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise, roughly: density-based spatial cluster analysis with noise) is a data mining algorithm for cluster analysis developed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu. It is one of the most cited algorithms in this field. The algorithm is density-based and is able to detect multiple clusters. Noise points are ignored and returned separately.

Overview

The points at A are core points. Points B and C are density-reachable from A and therefore density-connected and belong to the same cluster. Point N is neither a core point nor density-reachable, i.e. noise. (MinPts = 3 or MinPts = 4)

The basic idea of the algorithm is the notion of density-connectedness. Two objects are considered density-connected if there is a chain of dense objects (core objects, i.e. objects with at least MinPts neighbors within distance ε) that links these points. The objects linked by the same core objects form a single cluster. Objects that are not part of any density-connected cluster are referred to as noise.

There are three types of points in DBSCAN:

  • Core objects, which are themselves dense.
  • Density-reachable objects. These are objects that can be reached from a core object of the cluster but are not dense themselves. Intuitively, they form the edge of a cluster.
  • Noise points, which are neither dense nor density-reachable.

The algorithm has two parameters: ε and MinPts. ε defines the neighborhood size of a point: a point can be reached from another point if their distance is less than ε. MinPts defines when an object is dense (i.e. a core object): when it has at least MinPts ε-reachable neighbors.
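
These two definitions can be stated very compactly in code. The following is only a minimal sketch in Python; the point set and the eps and min_pts values are made up for illustration, and a non-strict distance comparison is used:

import numpy as np

def region_query(data, p, eps):
    # all points within distance eps of point p (including p itself)
    distances = np.linalg.norm(data - data[p], axis=1)
    return np.where(distances <= eps)[0]

def is_core_object(data, p, eps, min_pts):
    # a point is dense (a core object) if its eps-neighborhood contains at least min_pts points
    return len(region_query(data, p, eps)) >= min_pts

data = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0]])
print(is_core_object(data, 0, eps=0.5, min_pts=3))  # True: three points lie close together
print(is_core_object(data, 3, eps=0.5, min_pts=3))  # False: the isolated point is not dense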

Density-reachable points can be density-reachable from more than one cluster. Such points are assigned non-deterministically to one of the possible clusters by the algorithm. This also implies that density-connectedness is not transitive; density-reachability is not symmetric.

Important properties

DBSCAN is exact with respect to the definitions of density-connectedness and noise. That is, two density-connected objects are guaranteed to end up in the same cluster, while noise objects are reliably classified as noise. The algorithm is not exact for objects that are only density-reachable; these are assigned to just one of the possible clusters, not to all of them.

In contrast to the k-means algorithm, for example, the number of clusters does not have to be known in advance.

The algorithm can recognize clusters of any shape (e.g. not just spherical).

DBSCAN is largely deterministic and order-independent: regardless of the order in which objects are stored or processed in the database, the same clusters are produced (with the exception of non-core objects that are only density-reachable, and of the cluster numbering).

The algorithm can be used with arbitrary distance functions and similarity measures. In contrast to the k-means algorithm, no geometric space is required, since no centroid has to be computed.
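
For instance, any dissimilarity can be supplied as a precomputed distance matrix. A minimal sketch using the scikit-learn implementation with the Manhattan distance instead of the Euclidean distance (the data and the eps / min_samples values are made up):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],
              [6.0, 6.0], [6.1, 6.2], [5.9, 6.0]])
D = cdist(X, X, metric='cityblock')  # Manhattan distance matrix instead of Euclidean

labels = DBSCAN(eps=0.6, min_samples=3, metric='precomputed').fit_predict(D)
print(labels)  # e.g. [0 0 0 1 1 1]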

DBSCAN itself has linear complexity: essentially, each point is visited only once. However, computing the neighborhood of a point is in general not possible in constant time (without corresponding precomputations). Without precomputed data or a suitable index structure, the algorithm is therefore of quadratic complexity.
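
As a sketch of how an index structure helps, the ε-neighborhoods of all points can be answered by a tree-based index instead of a linear scan per query. scikit-learn's NearestNeighbors is used here only as an example of such an index; the data and the eps value are made up:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((1000, 2))
eps = 0.05

# naive approach: one full pass over the data per query, quadratic in total
def naive_neighbors(X, i, eps):
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

# index-based approach: the tree answers each radius query much faster on average
index = NearestNeighbors(radius=eps).fit(X)
neighborhoods = index.radius_neighbors(X, return_distance=False)  # one neighbor list per point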

DBSCAN algorithm

The original version of DBSCAN can be described by the following pseudocode:

DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      N = D.regionQuery(P, eps)
      if sizeof(N) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, N, C, eps, MinPts)

expandCluster(P, N, C, eps, MinPts)
   add P to cluster C
   for each point P' in N
      if P' is not visited
         mark P' as visited
         N' = D.regionQuery(P', eps)
         if sizeof(N') >= MinPts
            N = N joined with N'
      if P' is not yet member of any cluster
         add P' to cluster C
         unmark P' as NOISE if necessary

regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
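
A possible translation of this pseudocode into Python is sketched below; the brute-force region query and the example data are made up for illustration only (in practice an index structure would answer the neighborhood queries):

import numpy as np

NOISE = -1

def dbscan(data, eps, min_pts):
    # set-based DBSCAN following the pseudocode above; labels are 0, 1, ... and NOISE (-1)
    n = len(data)
    labels = [None] * n          # None = not yet member of any cluster
    visited = [False] * n
    c = -1                       # current cluster id

    def region_query(p):
        # all points within the eps-neighborhood of p (including p itself)
        return list(np.where(np.linalg.norm(data - data[p], axis=1) <= eps)[0])

    def expand_cluster(p, neighbors, c):
        labels[p] = c
        seeds = list(neighbors)
        i = 0
        while i < len(seeds):    # "N = N joined with N'" realized as a growing work list
            q = seeds[i]
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(q)
                if len(q_neighbors) >= min_pts:
                    seeds.extend(q_neighbors)
            if labels[q] is None or labels[q] == NOISE:
                labels[q] = c    # also re-labels former noise points ("unmark P' as NOISE")
            i += 1

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = region_query(p)
        if len(neighbors) < min_pts:
            labels[p] = NOISE
        else:
            c += 1
            expand_cluster(p, neighbors, c)
    return labels

data = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
                 [8.0, 8.0], [8.1, 8.1], [7.9, 8.0],
                 [4.5, 4.5]])
print(dbscan(data, eps=0.5, min_pts=3))  # e.g. [0, 0, 0, 1, 1, 1, -1]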

Alternatively, DBSCAN could also be implemented recursively (with a recursive call instead of the join of N), but this offers no significant advantages.

DBSCAN (recursive formulation)

The recursive formulation shows more clearly how DBSCAN works. However, since the recursion depth can become very large, the set-based formulation above is preferable for an actual implementation.

DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      N = getNeighbors(P, eps)
      if sizeof(N) < MinPts
         mark P as NOISE
      else
         C = next cluster
         add P to cluster C
         for P' in N
           if P' is not yet member of any cluster
             recursiveExpandCluster(P', C, eps, MinPts)

recursiveExpandCluster(P, C, eps, MinPts)
   add P to cluster C
   if P is not visited
     mark P as visited
     N = getNeighbors(P, eps)
     if sizeof(N) >= MinPts
       for P' in N
         if P' is not yet member of any cluster
           recursiveExpandCluster(P', C, eps, MinPts)

Generalized DBSCAN

The generalized version of DBSCAN, GDBSCAN, abstracts away from the concrete neighborhood definition and density criterion. These are replaced by a getNeighbors predicate and an isCorePoint predicate.

GDBSCAN(D, getNeighbors, isCorePoint)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      N = getNeighbors(P)
      if isCorePoint(P, N)
         C = next cluster
         expandCluster(P, N, C)
      else
         mark P as NOISE

expandCluster(P, N, C)
   add P to cluster C
   for each point P' in N
      if P' is not visited
         mark P' as visited
         N' = getNeighbors(P')
         if isCorePoint(P', N')
            N = N joined with N'
      if P' is not yet member of any cluster
         add P' to cluster C

If one uses a range query as getNeighbors and the test sizeof(N) >= MinPts as the isCorePoint predicate, one obviously obtains the original DBSCAN algorithm.
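
The following is a minimal sketch of this generalization in Python; the two callbacks are passed in, and plugging in an eps-range query together with a minimum-neighborhood-size test recovers plain DBSCAN (all names and example values are made up):

import numpy as np

NOISE = -1

def gdbscan(n_points, get_neighbors, is_core_point):
    # generic GDBSCAN skeleton: clustering driven by two user-supplied predicates
    labels = [None] * n_points
    visited = [False] * n_points
    c = -1
    for p in range(n_points):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = get_neighbors(p)
        if not is_core_point(p, neighbors):
            labels[p] = NOISE
            continue
        c += 1
        labels[p] = c
        seeds = list(neighbors)
        i = 0
        while i < len(seeds):
            q = seeds[i]
            if not visited[q]:
                visited[q] = True
                q_neighbors = get_neighbors(q)
                if is_core_point(q, q_neighbors):
                    seeds.extend(q_neighbors)
            if labels[q] is None or labels[q] == NOISE:
                labels[q] = c
            i += 1
    return labels

# recovering plain DBSCAN: eps-range query plus the test len(neighbors) >= min_pts
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [9.0, 9.0]])
eps, min_pts = 0.5, 3
get_neighbors = lambda p: list(np.where(np.linalg.norm(data - data[p], axis=1) <= eps)[0])
is_core_point = lambda p, neighbors: len(neighbors) >= min_pts
print(gdbscan(len(data), get_neighbors, is_core_point))  # e.g. [0, 0, 0, -1]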

Extensions of DBSCAN

Building on this algorithm, the following extensions have been developed, among others:

  • OPTICS - Ordering Points To Identify the Clustering Structure
  • Shared-Nearest-Neighbor-Clustering - Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data
  • PreDeCon - Density Connected Clustering with Local Subspace Preferences
  • SubClu - Density connected Subspace Clustering for High Dimensional Data
  • 4C - Computing Clusters of Correlation Connected Objects
  • ERiC - Exploring Complex Relationships of Correlation Clusters
  • HDBSCAN - Hierarchical Density Based Clustering

Use of DBSCAN

The DBSCAN algorithm is included in

  • ELKI (with flexible indexing and numerous variants)
  • Scikit-learn (with index support for common metrics; see the usage sketch after this list)
  • Weka (implemented without index support and therefore inefficient)
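
As an example, a minimal usage sketch of the scikit-learn implementation (the data and the eps / min_samples values are made up):

import numpy as np
from sklearn.cluster import DBSCAN

# made-up example data: two dense groups and one outlier
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [8.0, 8.0], [8.1, 8.1], [7.9, 8.0],
              [4.5, 4.5]])

clustering = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(clustering.labels_)  # e.g. [0 0 0 1 1 1 -1]; the label -1 marks noise points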

References

  1. Microsoft Academic Search: Most Cited Data Mining Articles. (No longer available online.) Archived from the original on April 21, 2010; accessed on May 10, 2010 (DBSCAN at approximately rank 20-25).
  2. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Evangelos Simoudis, Jiawei Han, Usama M. Fayyad (Eds.): Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, 1996, ISBN 1-57735-004-9, pp. 226-231 (online PDF).
  3. Jörg Sander, Martin Ester, Hans-Peter Kriegel, Xiaowei Xu: Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. In: Data Mining and Knowledge Discovery. Vol. 2, No. 2. Springer, Berlin 1998, doi:10.1023/A:1009745219419.
  4. Jörg Sander: Generalized Density-Based Clustering for Spatial Data Mining. Herbert Utz Verlag, Munich 1998, ISBN 3-89675-469-6.
  5. Ricardo J. G. B. Campello, Davoud Moulavi, Joerg Sander: Density-Based Clustering Based on Hierarchical Density Estimates. In: Advances in Knowledge Discovery and Data Mining. Springer, Berlin/Heidelberg 2013, ISBN 978-3-642-37455-5, pp. 160-172, doi:10.1007/978-3-642-37456-2_14 (springer.com [accessed August 1, 2018]).