Cluster (data analysis)
As cluster (sometimes clusters ) is known in the computer science and statistics , a group of data objects with similar characteristics. The set of clusters found in a data set is called clustering , and the method for calculating such a grouping is called cluster analysis . Not to a cluster associated data objects is referred to as outliers ( English outlier ) or noise ( English noise ).
The core idea of a cluster is that objects in the same cluster have "similar" properties and are different from objects that are not in the same cluster.
There are already different formulations when it comes to cluster membership.
- With hard clustering , each data object belongs entirely or not at all to a cluster.
- With soft clustering , each data object belongs to a certain extent to a cluster.
Furthermore one can distinguish:
- With strictly partitioning clustering , each data object belongs to exactly one cluster.
- In the case of a strictly partitioning clustering with outliers , a data object cannot belong to any cluster (in the case of soft clustering, the proportions can add up to less than 1).
- In the case of an overlapping clustering , an object can also belong to several clusters (in the case of soft clustering, the proportions can add up to more than 1).
Even within clusters, there can be subgroups that are more similar to each other than the rest of the larger group. If one has such a structure, one speaks of hierarchical clusters or hierarchical clustering . Methods that can find hierarchical clusters are, for example, hierarchical cluster analysis , OPTICS and BIRCH .
Models of clusters
Different cluster analysis algorithms often use different terms for clusters. This often leads to problems of understanding, since the results of one method do not have to be similar in the sense of another method.
The k-means algorithm describes clusters by their center points (or the Voronoi cells resulting from them ), the EM algorithm describes clusters by center points and a covariance matrix, while DBSCAN calculates "dense-connected" sets of any shape as a cluster.
Depending on the cluster term used, different structures may or may not be found. In the example shown here, the existing clusters cannot be found accurately by the k-means algorithm using its cluster model. The more complex model of the EM algorithm, on the other hand, is ideally suited to describe this data, since it was generated from a normal distribution.
A subspace cluster is a cluster that is not conspicuous in all attributes or attribute combinations. Only when the data is appropriately projected can one recognize the greater similarity of the cluster objects compared to the others.
In the case of subspace clusters, a distinction can be made between axially parallel clusters (based on a selection of attributes) and any oriented correlation clusters .
Methods for subspace cluster methods are for example CLIQUE, ORCLUS, SubClu, PreDeCon, PROCLUS, HiSC, HiCO, 4C, ERiC and CASH.
Calculation of clusters
There are numerous methods (so-called cluster analysis algorithms) for calculating clusters. These differ significantly in what models they use for clusters. In many classic methods such as the k-means algorithm , the EM algorithm , hierarchical cluster analysis and DBSCAN , the cluster model is in the foreground, and there are sometimes several specific algorithms that allow an (at least locally) optimal solution for this model Find. Many newer methods, however, no longer have a correspondingly clearly defined model.
Evaluation of clusters
The evaluation of found clusters is not a simple problem, especially if the clusters come from different methods. There is a risk of overfitting if the valuation method is too similar to one of the methods used - that is, one is ultimately investigating which method is most similar to the valuation method.
An internal evaluation is used when no additional information is used for the evaluation, but only the objects of the data set are used for the evaluation. Typically, distance measurements are used for this, for example the average distance between two cluster objects. Internal scoring usually favors clustering results built according to the same model. For example , clusters found by -Means naturally have shorter average distances than DBSCAN clusters. This type of evaluation is therefore particularly useful if you want to evaluate different results of the same procedure, for example from several runs of a randomized procedure such as the -Means algorithm. The silhouette coefficient is an internal measure for evaluating distance-based clusterings that is independent of the number of clusters . It is particularly suitable for comparing results of -Means with different values of , since it is independent of the number of clusters .
The external assessment adds information that was not used during the cluster analysis. For example , if the data is classified into classes , then the agreement of the cluster with a class can be used for evaluation. The problems with this approach are that, on the one hand, suitable information is not always available and, on the other hand, the aim of the cluster analysis is precisely to discover a new structure, and the evaluation based on a known structure therefore makes only limited sense. In addition, several overlapping structures can exist in the data. Because it is linked to the existing classification, this evaluation prefers informed methods from the area of machine learning over uninformed methods from (real) cluster analysis .
- I. Färber, S. Günnemann, H.-P. Kriegel , P. Kröger, E. Müller, E. Schubert, T. Seidl, A. Zimek: On Using Class-Labels in Evaluation of Clusterings . In: MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with KDD 2010, Washington, DC . 2010 ( uni-muenchen.de [PDF]).