Cluster analysis
Under cluster analysis ( clustering algorithms , occasionally: urban analysis ) refers to method for discovering similar structures in (usually relatively large) databases. The groups of "similar" objects found in this way are referred to as clusters , the group assignment as clustering . The similarity groups found can be graph-theoretical, hierarchical, partitioning or optimizing. Cluster analysis is an important discipline in data mining , the analysis step of the knowledge discovery in databases process. The goal of cluster analysis is to identify new groups in the data (as opposed to classification , which assigns data to existing classes). One speaks of an “uninformed procedure” because it does not depend on prior class knowledge. These new groups can then be used, for example, for automated classification , for the recognition of patterns in image processing or for market segmentation (or in any other method that relies on such prior knowledge).
The numerous algorithms differ mainly in their concept of similarity and group, their cluster model , their algorithmic approach (and thus their complexity) and the tolerance towards disruptions in the data. As a rule, however, only an expert can assess whether the “knowledge” generated by such an algorithm is useful. A clustering algorithm can, under certain circumstances, reproduce existing knowledge (for example, subdivide personal data into the known groups “male” and “female”) or generate groups that are not helpful for the application. The groups found cannot often be described verbally (e.g. “male persons”); common characteristics are usually only identified through a subsequent analysis. When using cluster analysis, it is therefore often necessary to try different methods and different parameters, to preprocess the data and, for example, to select or omit attributes.
example
Applied to a data set of vehicles, a clustering algorithm (and a subsequent analysis of the groups found) could provide the following structure, for example:
vehicles | |||||||||||||||||||||||||||||||
Cycles | |||||||||||||||||||||||||||||||
truck | Car | Rickshaws | Scooter mobiles | ||||||||||||||||||||||||||||
Please note the following:
- The algorithm itself does not provide an interpretation ("LKW") of the groups found. A separate analysis of the groups is necessary for this.
- A human would view a cycle rickshaw as a subset of bicycles. For a clustering algorithm, however, 3 wheels are often a significant difference that they share with a three-wheel scooter mobile .
- The groups are often not "pure", so there can be small trucks in the group of cars, for example.
- There are often additional groups that were not expected (“police cars”, “convertibles”, “red cars”, “cars with xenon headlights”).
- Some groups are not found, for example "motorcycles" or "recumbent bikes".
- Which groups are found depends heavily on the algorithm, parameters and object attributes used.
- Often nothing (meaningful) is found.
- In this example only known knowledge was (re-) found - as a method for “knowledge discovery”, the cluster analysis failed here.
history
Historically, the process comes from taxonomy in biology , where an order of living beings is determined through a clustering of related species. However, no automatic calculation methods were originally used there. In the meantime, to determine the relationship of organisms, among other things, their gene sequences can be compared ( see also : cladistics ). The method for correlation analysis was later introduced in the social sciences because it is particularly suitable because of the generally low scale level of the data in these disciplines.
Principle / functionality
Mathematical modeling
The properties of the objects to be examined are mathematically interpreted as random variables . They are usually represented in the form of vectors as points in a vector space , the dimensions of which form the characteristics of the object. Areas in which points accumulate ( point cloud ) are called clusters. In scatter diagrams , the distances between the points or the variance within a cluster serve as so-called proximity measures, which express the similarity or difference between the objects.
A cluster can also be defined as a group of objects that have a minimum distance sum with respect to a calculated center of gravity . This requires the selection of a distance measure . In certain cases, the distances (or, conversely, the similarities ) between the objects are known directly, so that they do not have to be determined from the representation in vector space.
Basic approach
In a total set / group with different objects, the objects that are similar are combined into groups (clusters).
The following music analysis is given as an example. The values result from the proportion of pieces of music that the user buys online per month.
Person 1 | Person 2 | Person 3 | |
---|---|---|---|
pop | 2 | 1 | 8th |
skirt | 10 | 8th | 1 |
jazz | 3 | 3 | 1 |
In this example one would intuitively divide people into two groups. Group 1 consists of person 1 & 2 and group 2 consists of person 3. Most cluster algorithms would do that too. This example is so clear only because of the selected values, at the latest with values that are closer together and more variables (here music styles) and objects (here people) it is easy to imagine that the division into groups is no longer so trivial.
To put it somewhat more precisely and in a more abstract way: the objects of a heterogeneous (described by different values of their variables) total are summarized with the help of the cluster analysis into subgroups (clusters / segments) that are as homogeneous as possible (the differences of the variables as small as possible). In our music example, the group of all music listeners (a very heterogeneous group) can be divided into the groups of jazz listeners, rock listeners, pop listeners, etc. (each relatively homogeneous) or those listeners with similar preferences are combined into the corresponding group.
A cluster analysis (e.g. for market segmentation) takes place in the following steps:
Steps to cluster formation:
- Variable selection: Selection (and collection) of the variables suitable for the investigation. If the variables of the objects / elements are not yet known / specified, all the variables that are important for the investigation must be determined and then determined.
- Proximity determination: Selection of a suitable measure of proximity and determination of the distance or similarity values (depending on the measure of proximity) between the objects using the measure of proximity. Depending on the type of variable or the scale type of the variable, a corresponding distance function is used to determine the distance (distance) between two elements or a similarity function is used to determine the similarity. The variables are first compared individually and the total distance (or similarity) is calculated from the distance between the individual variables. The function for determining distance or similarity is also called a measure of proximity. The distance or similarity value determined by a measure of proximity is called proximity. If all objects are compared with one another, a proximity matrix results which assigns a proximity to two objects.
- Cluster formation: Determination and implementation of a suitable cluster procedure (s) in order to subsequently be able to form groups / clusters with the help of this procedure (this reduces the proximity matrix). As a rule, several methods are combined, e.g. B .:
- Finding outliers using the single linkage method (hierarchical method)
- Determination of the number of clusters by the Ward method (hierarchical method)
- Determination of the cluster composition by the exchange procedure (partitioning procedure)
Further steps of the cluster analysis:
- Determination of the number of clusters by considering the variance within and between the groups. Here it is determined how many groups are actually formed, because there is no termination condition for the clustering itself. For example, in an agglomerative process, amalgamation is therefore carried out until only one entire group is left.
- Interpretation of the clusters (depending on the results, e.g. by t-value)
- Assessment of the quality of the cluster solution (determination of the selectivity of the variables, group stability)
Different measures of proximity / scales
The scales indicate the range of values that the observed variable (s) of the object can assume. Depending on the type of scale, you have to use a suitable measure of proximity. There are three main categories of scales:
- Binary scales
the variable can have two values, e.g. B. 0 or 1, male or female.
Proximity measures used (examples):- Jaccard coefficient (measure of similarity)
- Lance Williams measure (distance measure)
- Nominal scales
the variable can have different values in order to make a qualitative distinction, e.g. B. whether pop, rock or jazz is preferred.
Proximity measures used (examples):- Chi-square measure (distance measure)
- Phi square measure (distance measure)
- Metric scales
the variable takes a value on a predetermined scale to make a quantitative statement, e.g. B. How much a person likes to hear pop on a scale of 1 to 10.
Proximity measures used (examples):- Pearson correlation coefficient (measure of similarity)
- Euclidean metric (distance measure)
- Minkowski metric (distance measure)
Forms of group formation (group membership)
Three different forms of group formation (group membership) are possible. In the case of non-overlapping groups, each object is assigned to only one group (segment, cluster), in the case of overlapping groups, an object can be assigned to several groups, and in the case of fuzzy groups, one element belongs to each group with a certain degree of applicability.
Non-overlapping | |||||||||||||
Forms of group formation | Overlapping | ||||||||||||
Fuzzy | |||||||||||||
A distinction is made between “hard” and “soft” clustering algorithms. Hard methods (eg. K-means , Spectral Clustering , core-based principal component analysis ( kernel principal component analysis , in short, kernel PCA )) assign each data point exactly one cluster to, whereas in soft methods (eg EM algorithm. Gaussian mixed models ( Gaussian Mixture model , in short: GMM )) to each data point for each cluster is assigned a degree with which this data point can be assigned to this cluster. Soft methods are particularly useful when the data points are distributed relatively homogeneously in space and the clusters only appear as regions with increased data point density, i.e. i.e. if there is e.g. B. smooth transitions between the clusters or background noise (hard methods are useless in this case).
Differentiation of the cluster procedures
Cluster processes can be divided into graph-theoretical, hierarchical, partitioning and optimizing processes as well as further sub-processes.
- Partitioning methods
use a given partitioning and rearrange the elements through exchange functions until the objective function used reaches an optimum. However, additional groups cannot be formed because the number of clusters is determined at the beginning (see hierarchical procedure). - Hierarchical procedures
are based on the finest (agglomerative or bottom-up ) or coarsest (divisive or top-down ) partition (see top-down and bottom-up ). The coarsest partition corresponds to the totality of all elements and the finest partition contains only one element or each element forms its own group / partition. Clusters can then be formed by splitting or combining. Once groups have been formed, they can no longer be dissolved or individual elements exchanged (see partitioning methods). Agglomerative processes are much more common in practice (e.g. in market segmentation in marketing).
graph theoretical | |||||||||||||||||||||
divisive | |||||||||||||||||||||
hierarchical | |||||||||||||||||||||
agglomerative | |||||||||||||||||||||
Cluster process | |||||||||||||||||||||
Exchange procedure | |||||||||||||||||||||
partitioning | |||||||||||||||||||||
iterated minimum distance method | |||||||||||||||||||||
optimizing | |||||||||||||||||||||
Note that you can still differ various other methods and algorithms, including supervised ( supervised ) and non-supervised ( unsupervised ) algorithms or model-based algorithms in which an assumption is made about the underlying distribution of the data (eg. B. mixture-of-Gaussians model).
Procedure
A large number of clustering methods have been developed in a wide variety of application areas. One can differentiate between the following types of procedures:
- (center-based) partitioning methods,
- hierarchical procedures,
- density-based method and
- combined procedures.
The first two types of procedures are the classic cluster procedures, while the other procedures are more recent.
Partitioning clustering methods
What the partitioning methods have in common is that the number of clusters must first be determined (disadvantage). Then, to the assignment of observations determined cluster centers and moved as long as this iteratively, to which a predetermined error function is minimized does not change cluster centers. One advantage is that objects can change their cluster membership while the cluster centers are being moved.
- K-means algorithm
- The cluster centers are determined randomly and the sum of the squared Euclidean distances of the objects to their nearest cluster center is minimized. The cluster centers are updated by averaging all objects in a cluster.
- k-means ++
- Objects are also randomly selected as cluster centers so that they are approximately uniformly distributed in the space of the objects. This leads to a faster algorithm.
- k median algorithm
- Here the sum of the Manhattan distances between the objects and their nearest cluster center is minimized. The cluster centers are updated by calculating the median of all objects in a cluster. This means that outliers in the data have less influence.
- k-Medoids or Partitioning Around Medoids (PAM)
- The cluster centers are always objects here. By shifting cluster centers to a neighboring object, the sum of the distances to the nearest cluster center is minimized. In contrast to the k-means method, only the distances between the objects are required and not the coordinates of the objects.
- Fuzzy c-means algorithm
- A degree of membership in a cluster is calculated for each object, often from the real-valued interval [0.1] (degree of membership = 1: object belongs completely to a cluster, degree of membership = 0: object does not belong to the cluster). The following applies: the further away an object is from the cluster center, the smaller its degree of membership in this cluster. As in the k-median method, the cluster centers are then shifted, but objects that are far away (small degree of membership) have less influence on the shift than objects that are close. This also enables a soft cluster assignment: Every object belongs to every cluster with a corresponding degree of membership.
- EM clustering
- The clusters are modeled as multivariate normal distributions. With the help of the EM algorithm , the unknown parameters ( with ) the normal distributions are estimated iteratively. In contrast to k-means, a soft cluster assignment is achieved: with a certain probability every object belongs to every cluster and every object thus influences the parameters of every cluster.
- Affinity propagation
- Affinity propagation is a deterministic message passing algorithm that finds the optimal cluster centers. Depending on the type of data, e.g. B. the negative Euclidean distance can be used. The distances between the objects and their nearest cluster center are minimized. The cluster centers are updated by calculating the responsibility and availability as well as their sum for all objects.
Hierarchical cluster procedures
A specific family of distance-based methods for cluster analysis is called hierarchical cluster analysis. In this case, clusters consist of objects that are closer to one another (or, conversely, are more similar) than to objects in other clusters. A hierarchy of clusters is built up: on the one hand a cluster that contains all objects, and on the other hand as many clusters as there are objects, i.e. In other words, each cluster contains exactly one object. There are two important types of procedures:
- the divisive cluster process, in which all objects are initially considered to belong to a cluster and then the clusters are gradually divided into smaller and smaller clusters until each cluster consists of only one object (also: "top-down process")
- Divisive Analysis Clustering (DIANA)
- You start with a cluster that contains all objects. In the cluster with the largest diameter , the object is sought that has the greatest mean distance or dissimilarity to the other objects in the cluster. This is the core of the splinter cluster. Any object that is close enough to the splinter cluster is iteratively added to it. The entire process is repeated until each cluster consists of only one object.
- the agglomerative clustering process, in which each object first forms a cluster and then the clusters are gradually combined into ever larger clusters until all objects belong to one cluster (also: "bottom-up process"). The procedures in this family differentiate, on the one hand, according to the distance or similarity measures used (between objects, but also between entire clusters) and, mostly more importantly, according to their merging rule , which clusters are combined in one step. The fusing methods differ in the way in which the distance of the fused cluster to all other clusters is calculated. Important merger methods are:
- Single linkage
- The clusters whose closest objects have the smallest distance or dissimilarity are merged.
- Ward method
- The clusters that have the smallest increase in the total variance are merged.
The following applies to both methods: once clusters have been formed, they can no longer be changed or individual objects can be exchanged. The structure is either only refined (“divisive”) or only generalized (“agglomerative”), so that a strict cluster hierarchy is created. You can no longer tell from the resulting hierarchy how it was calculated.
Density-based methods
In density-based clustering, clusters are viewed as objects in a d-dimensional space that are close together, separated by areas of lower density.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Objects that have at least other objects at a given distance are core objects. Two core objects whose distance is less than belong to the same cluster. Non-core objects that are close to a cluster are added to it as edge objects. Objects that are neither core objects nor edge objects are noise objects.
- OPTICS (Ordering Points To Identify the Clustering Structure)
- The algorithm extends DBSCAN so that clusters of different densities can be recognized. The choice of the parameter is no longer so crucial to find the cluster structure of the objects.
- Maximum margin clustering
- (Empty) areas are searched for in the space of the objects that lie between two clusters. From this, cluster boundaries are determined and thus also the clusters. The technology is closely tied to support vector machines .
Grid-based methods
With grid-based clustering methods, the data space is divided into a finite number of cells regardless of the data. The greatest advantage of this approach is the low asymtotic complexity in the low-dimensional, since the running time depends on the number of grid cells. However, as the number of dimensions increases, the number of grid cells increases exponentially. Representatives are STING and CLIQUE . In addition, grids can be used to accelerate other algorithms, for example to approximate k-means or to calculate DBSCAN (GriDBSCAN).
- The STING algorithm (Statistical Information Grid-based Clustering) divides the data space recursively into rectangular cells. Statistical information for each cell is precomputed at the lowest recursion level. Relevant cells are then calculated using a top-down approach and returned.
- CLIQUE (CLustering In QUEst) works in two phases: First, the data space is partitioned into densely populated d-dimensional cells. The second phase determines larger clusters from these cells by using a greedy strategy to identify the largest possible regions of densely populated cells.
Combined procedures
In practice, combinations of methods are often used. One example is to first carry out a hierarchical cluster analysis in order to determine a suitable number of clusters, and then another k-means clustering in order to improve the result of the clustering. Often, additional information can be used in special situations so that e.g. B. the dimension or the number of objects to be clustered is reduced.
- Spectral clustering
- To cluster objects can one as nodes Graphs be construed and the weighted edges give distance or dissimilarity again. The Laplace matrix, a special transform of the adjacency matrix (matrix of the similarity between all objects), has the eigenvalue zero with the multiplicity of connected components (clusters) . Therefore one examines the smallest eigenvalues of the Laplace matrix and the corresponding -dimensional eigenspace ( ). Instead of in a high-dimensional space is now in the low-dimensional eigenspace, z. B. with the k-means method, clustered.
- Multiview clustering
- It is assumed here that several distance or similarity matrices (so-called views ) of the objects can be generated. One example is websites as objects to be clustered: a distance matrix can be calculated on the basis of the commonly used words, a second distance matrix on the basis of the links. Then a clustering (or a clustering step) is carried out with one distance matrix and the result is used as input for a clustering (or a clustering step) with the other distance matrix. This is repeated until the cluster membership of the objects stabilizes.
- Balanced iterative reducing and clustering using hierarchies (BIRCH)
- For (very) large data sets, a preclustering is carried out first. The clusters obtained in this way (no longer the objects) are then z. B. further clustered with a hierarchical cluster analysis. This is the basis of the two-step clustering developed especially for SPSS and used there .
Biclustering
Biclustering, co-clustering or two-mode clustering is a technique that enables the simultaneous clustering of rows and columns of a matrix. Numerous biclustering algorithms have been developed for bioinformatics, including: Block Clustering, CTWC (Coupled Two-Way Clustering), ITWC (Interrelated Two-Way Clustering), δ-Bicluster, δ-pCluster, δ-Pattern, FLOC, OPC, Plaid Model, OPSMs (Order-preserving Submatrices), Gibbs, SAMBA (Statistical-Algorithmic Method for Bicluster Analysis), Robust Biclustering Algorithm (RoBA), Crossing Minimization, cMonkey, PRMs, DCC, LEB (Localize and Extract Biclusters), QUBIC (QUalitative BIClustering), BCCA (Bi-Correlation Clustering Algorithm) and FABIA (Factor Analysis for Bicluster Acquisition). Biclustering algorithms are also proposed and used in other application areas. There they can be found under the names co-clustering, bidimensional clustering and subspace clustering.
Evaluation
As with all machine learning processes, the results of each cluster process must be subjected to an evaluation. The quality of a fully calculated division of the data points into groups can be assessed using various metrics. The selection of a suitable metric must always be made with regard to the data used, the given question and the selected clustering method.
The silhouette coefficient introduced by Peter J. Rousseeuw in 1987 is one of the evaluation indicators that are frequently used in practice . This calculates a value for each data point that indicates how well the assignment to the selected group has been made compared to all of its groups. With a run-time complexity of O (n²) the silhouette coefficient is slower than many clustering methods, and so cannot be used on large data.
The silhouette coefficient is a procedure that does not require external information for evaluation. These internal validity metrics have the advantage that there is no need to know about the correct assignment in order to evaluate the result. In addition to internal key figures, science distinguishes between external and relative metrics.
literature
Basics and procedures
- Martin Ester, Jörg Sander: Knowledge Discovery in Databases. Techniques and Applications. Springer, Berlin 2000, ISBN 3-540-67328-8 .
- K. Backhaus, B. Erichson, W. Plinke, R. Weiber: Multivariate analysis methods. An application-oriented introduction. Springer, 2003, ISBN 3-540-00491-2 .
- S. Bickel, T. Scheffer: Multi-View Clustering. In: Proceedings of the IEEE International Conference on Data Mining. 2004
- J. Shi, J. Malik: Normalized Cuts and Image Segmentation. In: Proc. of IEEE Conf. on Comp. Vision and Pattern Recognition. Puerto Rico 1997.
- Otto Schlosser : Introduction to the social science context analysis . Rowohlt, Reinbek near Hamburg 1976, ISBN 3-499-21089-4 .
application
- J. Bacher, A. Pöge, K. Wenzig: Cluster analysis - application- oriented introduction to classification methods. 3. Edition. Oldenbourg, Munich 2010, ISBN 978-3-486-58457-8 .
- J. Bortz: Statistics for social scientists. Springer, Berlin 1999, chap. 16 Cluster Analysis
- W. Härdle, L. Simar: Applied Multivariate Statistical Analysis. Springer, New York 2003
- C. Homburg, H. Krohmer: Marketing Management: Strategy - Instruments - Implementation - Corporate Management. 3. Edition. Gabler, Wiesbaden 2009, Chapter 8.2.2
- H. Moosbrugger, D. Frank: Cluster analytical methods in personality research. An application-oriented introduction to taxometric classification methods. Huber, Bern 1992, ISBN 3-456-82320-7 .
Individual evidence
- ↑ cf. among others Otto Schlosser : Introduction to the social science context analysis . Rowohlt, Reinbek near Hamburg 1976, ISBN 3-499-21089-4 .
- ↑ D. Arthur, S. Vassilvitskii: k-means ++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007, pp. 1027-1035.
- ^ S. Vinod: Integer programming and the theory of grouping . In: Journal of the American Statistical Association . tape 64 , 1969, p. 506-517 , doi : 10.1080 / 01621459.1969.10500990 , JSTOR : 2283635 .
- ^ JC Bezdek: Pattern recognition with fuzzy objective function algorithms . Plenum Press, New York 1981.
- ^ AP Dempster, NM Laird, DB Rubin: Maximum Likelihood from Incomplete Data via the EM algorithm. In: Journal of the Royal Statistical Society. Series B, 39 (1), 1977, pp. 1-38, doi : 10.1111 / j.2517-6161.1977.tb01600.x .
- ↑ L. Kaufman, PJ Roussew: Finding Groups in Data - An Introduction to Cluster Analysis. John Wiley & Sons 1990.
- ↑ K. Florek, J. Łukasiewicz, J. Perkal, H. Steinhaus, S. Zubrzycki: Taksonomia wrocławska. In: Przegląd Antropol. 17, 1951, pp. 193-211.
- ↑ K. Florek, J. Łukaszewicz, J. Perkal, H. Steinhaus, S. Zubrzycki: Sur la liaison et la division des points d'un ensemble fini. In: Colloquium Mathematicae. Vol. 2, No. 3-4, Institute of Mathematics Polish Academy of Sciences 1951, pp. 282-285.
- ↑ LL McQuitty: Elementary linkage analysis for isolating orthogonal and oblique types and typal relevancies. In: Educational and Psychological Measurement. 1957, pp. 207-229, doi : 10.1177 / 001316445701700204 .
- ^ PH Sneath: The application of computers to taxonomy. In: Journal of General Microbiology. 17 (1), 1957, pp. 201-226, doi : 10.1099 / 00221287-17-1-201 .
- ↑ JH Ward Jr: Hierarchical grouping to optimize an objective function In: Journal of the American Statistical Association ,. 58 (301), 1963, pp. 236-244, doi : 10.1080 / 01621459.1963.10500845 JSTOR 2282967 .
- ↑ M. Ester, HP Kriegel, J. Sander, X. Xu: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD-96 Proceedings. Vol. 96, 1996, pp. 226-231.
- ↑ Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel , Jörg Sander: OPTICS: Ordering Points To Identify the Clustering Structure . In: ACM SIGMOD international conference on Management of data . ACM Press, 1999, pp. 49-60 ( CiteSeerX ).
- ↑ L. Xu, J. Neufeld, B. Larson, D. Schuurmans: Maximum margin clustering. In: Advances in neural information processing systems. 2004, pp. 1537-1544.
- ↑ STING | Proceedings of the 23rd International Conference on Very Large Data Bases. Retrieved July 6, 2020 .
- ↑ Automatic subspace clustering of high dimensional data for data mining applications | Proceedings of the 1998 ACM SIGMOD international conference on Management of data. Retrieved July 6, 2020 .
- ^ WE Donath, AJ Hoffman: Lower bounds for the partitioning of graphs In: IBM Journal of Research and Development. 17 (5), 1973, pp. 420-425, doi : 10.1147 / around 175.0420 .
- ^ M. Fiedler: Algebraic connectivity of graphs. In: Czechoslovak Mathematical Journal ,. 23 (2), 1973, pp. 298-305.
- ↑ S. Bickel, T. Scheffer: Multi-View Clustering. In: ICDM. Vol. 4, Nov 2004, pp. 19-26, doi : 10.1109 / ICDM.2004.10095 .
- ↑ Peter J. Rousseeuw: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis . In: Journal of Computational and Applied Mathematics . tape 20 . Elsevier, November 1987, pp. 53-65 , doi : 10.1016 / 0377-0427 (87) 90125-7 .
- ↑ Zaki, Mohammed J .; Meira, Wagner .: Data mining and analysis: fundamental concepts and algorithms . Cambridge University Press, 2014, ISBN 978-0-511-81011-4 , pp. 425 ff .