DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise, roughly: density-based spatial cluster analysis with noise) is a data mining algorithm for cluster analysis developed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu. It is one of the most cited algorithms in this field. The algorithm is density-based and is able to detect multiple clusters. Noise points are ignored and returned separately.

Overview

The points at A are core points. Points B and C are density-reachable from A and therefore density-connected and belong to the same cluster. Point N is neither a core point nor density-reachable, i.e. noise. (MinPts = 3 or MinPts = 4)

The basic idea of the algorithm is the notion of density-connectedness. Two objects are considered density-connected if there is a chain of dense objects (core objects, i.e. objects with at least MinPts neighbors within distance ε) that links these points. The objects linked by the same core objects form a single cluster. Objects that are not part of any density-connected cluster are referred to as noise.

There are three types of points in DBSCAN:

  • Core objects, which are themselves dense.
  • Density-reachable objects. These are objects that can be reached from a core object of the cluster but are not dense themselves. Intuitively, they form the edge of a cluster.
  • Noise points, which are neither dense nor density-reachable.

The algorithm has two parameters: ε and MinPts. ε defines the neighborhood size of a point: a point can be reached from another point if their distance is less than ε. MinPts defines when an object is dense (i.e. a core object): when it has at least MinPts ε-reachable neighbors.
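
These two definitions can be stated very compactly in code. The following is only a minimal sketch in Python; the point set and the eps and min_pts values are made up for illustration, and a non-strict distance comparison is used:

import numpy as np

def region_query(data, p, eps):
    # all points within distance eps of point p (including p itself)
    distances = np.linalg.norm(data - data[p], axis=1)
    return np.where(distances <= eps)[0]

def is_core_object(data, p, eps, min_pts):
    # a point is dense (a core object) if its eps-neighborhood contains at least min_pts points
    return len(region_query(data, p, eps)) >= min_pts

data = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0]])
print(is_core_object(data, 0, eps=0.5, min_pts=3))  # True: three points lie close together
print(is_core_object(data, 3, eps=0.5, min_pts=3))  # False: the isolated point is not dense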

Density-reachable points can be density-reachable from more than one cluster. Such points are assigned non-deterministically to one of the possible clusters by the algorithm. This also implies that density-connectedness is not transitive; density-reachability is not symmetric.

Important properties

DBSCAN is exact with respect to the definitions of density-connectedness and noise. That is, two density-connected objects are guaranteed to end up in the same cluster, while noise objects are reliably classified as noise. The algorithm is not exact for objects that are only density-reachable; these are assigned to just one of the possible clusters, not to all of them.

In contrast to the k-means algorithm, for example, the number of clusters does not have to be known in advance.

The algorithm can recognize clusters of any shape (e.g. not just spherical).

DBSCAN is largely deterministic and order-independent: regardless of the order in which objects are stored or processed in the database, the same clusters are produced (with the exception of non-core objects that are only density-reachable, and of the cluster numbering).

The algorithm can be used with arbitrary distance functions and similarity measures. In contrast to the k-means algorithm, no geometric space is required, since no centroid has to be computed.
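
For instance, any dissimilarity can be supplied as a precomputed distance matrix. A minimal sketch using the scikit-learn implementation with the Manhattan distance instead of the Euclidean distance (the data and the eps / min_samples values are made up):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],
              [6.0, 6.0], [6.1, 6.2], [5.9, 6.0]])
D = cdist(X, X, metric='cityblock')  # Manhattan distance matrix instead of Euclidean

labels = DBSCAN(eps=0.6, min_samples=3, metric='precomputed').fit_predict(D)
print(labels)  # e.g. [0 0 0 1 1 1]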

DBSCAN itself has linear complexity: essentially, each point is visited only once. However, computing the neighborhood of a point is in general not possible in constant time (without corresponding precomputations). Without precomputed data or a suitable index structure, the algorithm is therefore of quadratic complexity.
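
As a sketch of how an index structure helps, the ε-neighborhoods of all points can be answered by a tree-based index instead of a linear scan per query. scikit-learn's NearestNeighbors is used here only as an example of such an index; the data and the eps value are made up:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((1000, 2))
eps = 0.05

# naive approach: one full pass over the data per query, quadratic in total
def naive_neighbors(X, i, eps):
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

# index-based approach: the tree answers each radius query much faster on average
index = NearestNeighbors(radius=eps).fit(X)
neighborhoods = index.radius_neighbors(X, return_distance=False)  # one neighbor list per point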

DBSCAN algorithm

The original version of DBSCAN can be described by the following pseudocode:

DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      N = D.regionQuery(P, eps)
      if sizeof(N) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, N, C, eps, MinPts)

expandCluster(P, N, C, eps, MinPts)
   add P to cluster C
   for each point P' in N
      if P' is not visited
         mark P' as visited
         N' = D.regionQuery(P', eps)
         if sizeof(N') >= MinPts
            N = N joined with N'
      if P' is not yet member of any cluster
         add P' to cluster C
         unmark P' as NOISE if necessary

regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
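
A possible translation of this pseudocode into Python is sketched below; the brute-force region query and the example data are made up for illustration only (in practice an index structure would answer the neighborhood queries):

import numpy as np

NOISE = -1

def dbscan(data, eps, min_pts):
    # set-based DBSCAN following the pseudocode above; labels are 0, 1, ... and NOISE (-1)
    n = len(data)
    labels = [None] * n          # None = not yet member of any cluster
    visited = [False] * n
    c = -1                       # current cluster id

    def region_query(p):
        # all points within the eps-neighborhood of p (including p itself)
        return list(np.where(np.linalg.norm(data - data[p], axis=1) <= eps)[0])

    def expand_cluster(p, neighbors, c):
        labels[p] = c
        seeds = list(neighbors)
        i = 0
        while i < len(seeds):    # "N = N joined with N'" realized as a growing work list
            q = seeds[i]
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(q)
                if len(q_neighbors) >= min_pts:
                    seeds.extend(q_neighbors)
            if labels[q] is None or labels[q] == NOISE:
                labels[q] = c    # also re-labels former noise points ("unmark P' as NOISE")
            i += 1

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = region_query(p)
        if len(neighbors) < min_pts:
            labels[p] = NOISE
        else:
            c += 1
            expand_cluster(p, neighbors, c)
    return labels

data = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
                 [8.0, 8.0], [8.1, 8.1], [7.9, 8.0],
                 [4.5, 4.5]])
print(dbscan(data, eps=0.5, min_pts=3))  # e.g. [0, 0, 0, 1, 1, 1, -1]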

Alternatively, DBSCAN could also be implemented recursively (with a recursive call instead of the join of N), but this offers no significant advantages.

DBSCAN (recursive formulation)

The recursive formulation shows more clearly how DBSCAN works. However, since the recursion depth can become very large, the set-based formulation above is preferable for an actual implementation.

DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      N = getNeighbors(P, eps)
      if sizeof(N) < MinPts
         mark P as NOISE
      else
         C = next cluster
         add P to cluster C
         for P' in N
           if P' is not yet member of any cluster
             recursiveExpandCluster(P', C, eps, MinPts)

recursiveExpandCluster(P, C, eps, MinPts)
   add P to cluster C
   if P is not visited
     mark P as visited
     N = getNeighbors(P, eps)
     if sizeof(N) >= MinPts
       for P' in N
         if P' is not yet member of any cluster
           recursiveExpandCluster(P', C, eps, MinPts)

Generalized DBSCAN

The generalized version of DBSCAN, GDBSCAN, abstracts away from the concrete neighborhood definition and density criterion. These are replaced by a getNeighbors predicate and an isCorePoint predicate.

GDBSCAN(D, getNeighbors, isCorePoint)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      N = getNeighbors(P)
      if isCorePoint(P, N)
         C = next cluster
         expandCluster(P, N, C)
      else
         mark P as NOISE

expandCluster(P, N, C)
   add P to cluster C
   for each point P' in N
      if P' is not visited
         mark P' as visited
         N' = getNeighbors(P')
         if isCorePoint(P', N')
            N = N joined with N'
      if P' is not yet member of any cluster
         add P' to cluster C

If one uses a range query as getNeighbors and the test sizeof(N) >= MinPts as the isCorePoint predicate, one obviously obtains the original DBSCAN algorithm.
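
The following is a minimal sketch of this generalization in Python; the two callbacks are passed in, and plugging in an eps-range query together with a minimum-neighborhood-size test recovers plain DBSCAN (all names and example values are made up):

import numpy as np

NOISE = -1

def gdbscan(n_points, get_neighbors, is_core_point):
    # generic GDBSCAN skeleton: clustering driven by two user-supplied predicates
    labels = [None] * n_points
    visited = [False] * n_points
    c = -1
    for p in range(n_points):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = get_neighbors(p)
        if not is_core_point(p, neighbors):
            labels[p] = NOISE
            continue
        c += 1
        labels[p] = c
        seeds = list(neighbors)
        i = 0
        while i < len(seeds):
            q = seeds[i]
            if not visited[q]:
                visited[q] = True
                q_neighbors = get_neighbors(q)
                if is_core_point(q, q_neighbors):
                    seeds.extend(q_neighbors)
            if labels[q] is None or labels[q] == NOISE:
                labels[q] = c
            i += 1
    return labels

# recovering plain DBSCAN: eps-range query plus the test len(neighbors) >= min_pts
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [9.0, 9.0]])
eps, min_pts = 0.5, 3
get_neighbors = lambda p: list(np.where(np.linalg.norm(data - data[p], axis=1) <= eps)[0])
is_core_point = lambda p, neighbors: len(neighbors) >= min_pts
print(gdbscan(len(data), get_neighbors, is_core_point))  # e.g. [0, 0, 0, -1]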

Extensions of DBSCAN

Building on this algorithm, the following extensions have been developed, among others:

  • OPTICS - Ordering Points To Identify the Clustering Structure
  • Shared-Nearest-Neighbor-Clustering - Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data
  • PreDeCon - Density Connected Clustering with Local Subspace Preferences
  • SubClu - Density connected Subspace Clustering for High Dimensional Data
  • 4C - Computing Clusters of Correlation Connected Objects
  • ERiC - Exploring Complex Relationships of Correlation Clusters
  • HDBSCAN - Hierarchical Density Based Clustering

Use of DBSCAN

The DBSCAN algorithm is included in

  • ELKI (with flexible indexing and numerous variants)
  • Scikit-learn (with index support for common metrics; see the usage sketch after this list)
  • Weka (implemented without index support and therefore inefficient)
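
As an example, a minimal usage sketch of the scikit-learn implementation (the data and the eps / min_samples values are made up):

import numpy as np
from sklearn.cluster import DBSCAN

# made-up example data: two dense groups and one outlier
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [8.0, 8.0], [8.1, 8.1], [7.9, 8.0],
              [4.5, 4.5]])

clustering = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(clustering.labels_)  # e.g. [0 0 0 1 1 1 -1]; the label -1 marks noise points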

References

  1. Microsoft Academic Search: Most Cited Data Mining Articles. (No longer available online.) Archived from the original on April 21, 2010; accessed on May 10, 2010 (DBSCAN at approximately rank 20-25).
  2. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Evangelos Simoudis, Jiawei Han, Usama M. Fayyad (Eds.): Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, 1996, ISBN 1-57735-004-9, pp. 226-231 (online PDF).
  3. Jörg Sander, Martin Ester, Hans-Peter Kriegel, Xiaowei Xu: Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. In: Data Mining and Knowledge Discovery. Vol. 2, No. 2. Springer, Berlin 1998, doi:10.1023/A:1009745219419.
  4. Jörg Sander: Generalized Density-Based Clustering for Spatial Data Mining. Herbert Utz Verlag, Munich 1998, ISBN 3-89675-469-6.
  5. Ricardo J. G. B. Campello, Davoud Moulavi, Joerg Sander: Density-Based Clustering Based on Hierarchical Density Estimates. In: Advances in Knowledge Discovery and Data Mining. Springer, Berlin/Heidelberg 2013, ISBN 978-3-642-37455-5, pp. 160-172, doi:10.1007/978-3-642-37456-2_14 (springer.com [accessed August 1, 2018]).