Fuzzy c-means algorithm

From Wikipedia, the free encyclopedia

In computer science, the fuzzy c-means algorithm (also called the algorithm of the c fuzzy mean values) is an unsupervised clustering algorithm that extends the k-means clustering algorithm. It was presented in a generalized form by Bezdek (1981).

Basic idea

In c-means clustering, the number of clusters c is determined first (in k-means clustering, the number of clusters is designated with k instead of c). In the first step, cluster centers (circles at the bottom of the graphic) are chosen randomly. In the second step, each object (rectangles at the bottom of the graphic) is assigned to its nearest cluster center. Then the (squared) distances between each object and its assigned cluster center are calculated and summed over all observations (J). The goal is to make the value of J as small as possible, i.e. to find positions for the cluster centers such that the distance between each object and its associated cluster center is small. In the third step, the cluster centers are therefore recalculated from the objects that belong to each cluster. In the fourth step, each object is again assigned to its nearest cluster center. This process is iterated until a stable solution is found. As the following graphic shows, objects can be assigned to different clusters in the course of the iteration process; compare the graphic for step 2 and step 4.
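The k-means iteration described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation; the function name, the fixed iteration cap, and the random seeding are choices of this sketch, not part of the original algorithm description.

```python
import numpy as np

def kmeans(X, c, max_iter=100, seed=0):
    """Minimal k-means sketch: random centers, assign, recompute, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick c distinct data points as random starting centers.
    centers = X[rng.choice(len(X), c, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest cluster center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each center from the objects assigned to it.
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(c)])
        # Step 4: stop once the solution is stable.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Note that the hard assignment in step 2 is exactly what fuzzy c-means later replaces with graded membership weights.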

The disadvantage of k-means clustering is that each object is uniquely assigned to one cluster center in each step. As a result, the final solution can depend heavily on the initial positions of the cluster centers. Naturally, one is interested in a unique solution that is as independent as possible of the initial positions of the cluster centers.

In the fuzzy c-means algorithm, each object is therefore not uniquely assigned to one cluster center; instead, each object is assigned a set of weights that indicate how strongly it belongs to each cluster. For example, for the red object in step 2, the weights could be

  • 0.1 for the blue cluster,
  • 0.2 for the green cluster and
  • 0.7 for the red cluster.

These weights are then used to calculate the weighted distance to all cluster centers. In the end, objects that are close to a certain cluster center will have large weights for that cluster. The blue object near the blue cluster center in step 4 could, for example, have the weights 0.8, 0.1 and 0.1. The two blue objects near the border to the green cluster could then have, for example, the weights 0.45, 0.45 and 0.1.

The weights for each object represent so-called fuzzy numbers. In general, the weights do not have to add up to one for each object (as was done in this section for easier understanding). The name fuzzy c-means reflects the method's derivation from k-means clustering.

Mathematical description of the algorithm

The term fuzzy describes a method of cluster analysis that allows an object to be assigned to more than one cluster. This is made possible by using a degree of membership u_ik of the object x_k to each cluster i. Each u_ik lies in the interval [0, 1]. The larger u_ik, the stronger the affiliation of x_k to cluster i.

The objective function of the fuzzy c-means algorithm is:

J(U, V) = Σ_{i=1}^C Σ_{k=1}^N (u_ik)^m · d²(v_i, x_k)

Here d²(v_i, x_k) = ‖x_k − v_i‖² gives the squared (Euclidean) distances between the points x_k and the cluster centers (prototypes) v_i from the matrix V. The partition matrix U contains the membership degrees u_ik. C is the number of clusters and N is the size of the data set. The "fuzzifier" m (> 1) determines how sharply the objects are assigned to the clusters. If one lets m run towards infinity, the memberships approach the value 1/C, i.e. the membership of the points is the same for all clusters. If m is close to 1, the clustering is sharp, i.e. the memberships are closer to 0 or 1. In practice, values between 1 and 2.5 have proven suitable for m (cf. Stutz (1999)). The values u_ik and v_i are determined by minimizing the objective function J: the objects are assigned to the clusters in such a way that the sum of the squared distances is minimal. The optimization takes place under the following constraints:

  1. For each point, the sum of the memberships to all clusters equals 1, i.e. Σ_{i=1}^C u_ik = 1 for all k = 1, …, N,
  2. the clusters are non-empty, i.e. Σ_{k=1}^N u_ik > 0 for all i = 1, …, C.
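The objective function above can be sketched directly in code. This is a minimal illustration of J(U, V); the function name and argument layout (U as a C × N array, points as rows of X) are choices of this sketch.

```python
import numpy as np

def fcm_objective(X, V, U, m=2.0):
    """J(U, V) = sum_i sum_k (u_ik)^m * ||x_k - v_i||^2.

    X: (N, p) data points, V: (C, p) prototypes, U: (C, N) partition matrix.
    """
    # d2[i, k] = squared Euclidean distance between prototype v_i and point x_k.
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    return (U ** m * d2).sum()
```

For a sharp partition (all u_ik either 0 or 1), this reduces to the usual k-means sum of squared distances.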

The Lagrange method is used to solve the minimization problem. The Lagrangian

L = Σ_{i=1}^C Σ_{k=1}^N (u_ik)^m · d²(v_i, x_k) + Σ_{k=1}^N λ_k (1 − Σ_{i=1}^C u_ik)

is differentiated with respect to u, v and λ. The solution is:

v_i = ( Σ_{k=1}^N (u_ik)^m x_k ) / ( Σ_{k=1}^N (u_ik)^m )

and

u_ik = 1 / Σ_{j=1}^C ( d²(v_i, x_k) / d²(v_j, x_k) )^(1/(m−1))
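The two update formulas can be sketched as a pair of NumPy functions. This is an illustrative sketch, not library code; the small floor on the distances (to guard against a point coinciding exactly with a prototype) is an implementation choice of this sketch, not part of the derivation.

```python
import numpy as np

def update_prototypes(X, U, m=2.0):
    """v_i = sum_k (u_ik)^m x_k / sum_k (u_ik)^m."""
    W = U ** m                                   # (C, N) weighted memberships
    return (W @ X) / W.sum(axis=1, keepdims=True)

def update_memberships(X, V, m=2.0):
    """u_ik = 1 / sum_j (d2_ik / d2_jk)^(1/(m-1))."""
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # (C, N)
    d2 = np.fmax(d2, 1e-12)        # guard: point exactly on a prototype
    inv = d2 ** (-1.0 / (m - 1))
    # Normalizing the inverse distances enforces sum_i u_ik = 1 per point.
    return inv / inv.sum(axis=0, keepdims=True)
```

Note that the membership formula is equivalent to normalizing d2^(−1/(m−1)) over the clusters, which is the form used in the code.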

The algorithm then consists of the following steps:

  1. Initialize a starting partition matrix U^(0),
  2. compute the prototypes v_i^(r) in iteration step r,
  3. compute the partition matrix U^(r) in iteration step r,
  4. if ‖U^(r) − U^(r−1)‖ < ε, stop; otherwise go back to step 2.

Here ε denotes a small threshold value.
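The four steps above can be put together into a complete sketch of the algorithm. As before this is an illustration under stated assumptions: the random initialization of U^(0), the iteration cap, and the Frobenius norm as the stopping criterion are choices of this sketch.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy c-means: alternate the two update formulas until U stabilizes."""
    rng = np.random.default_rng(seed)
    # Step 1: random starting partition matrix with columns summing to 1.
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        # Step 2: prototypes v_i from the current memberships.
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        # Step 3: new partition matrix from the current prototypes.
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        inv = np.fmax(d2, 1e-12) ** (-1.0 / (m - 1))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        # Step 4: stop when the partition matrix no longer changes much.
        if np.linalg.norm(U_new - U) < eps:
            return U_new, V
        U = U_new
    return U, V
```

A hard clustering can be read off afterwards by assigning each point to the cluster with its largest membership, e.g. `U.argmax(axis=0)`.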

Example

1000 Swiss Francs banknote.

The Swiss banknote data set consists of 100 genuine and 100 counterfeit Swiss 1000 franc banknotes. Six variables were recorded for each banknote:

  • The width of the banknote (WIDTH),
  • the height of the banknote measured on the left (LEFT),
  • the height of the banknote measured on the right (RIGHT),
  • the distance between the colored print and the upper edge of the banknote (UPPER),
  • the distance between the colored print and the lower edge of the banknote (LOWER) and
  • the diagonal (bottom left to top right) of the colored print on the banknote (DIAGONAL).

The two graphics below show the result of the k-means cluster analysis (left) and the fuzzy c-means cluster analysis (right) on the first two principal components of the Swiss banknote data. The two cluster centers are marked with a circled cross in both graphics. The compact cluster on the right contains the genuine banknotes; the rest are the counterfeit banknotes.

k-means clustering
In k-means clustering, the genuine and counterfeit banknotes have been classified almost correctly. Only one counterfeit banknote was assigned to the blue cluster. Note that the clustering was performed with all six variables, while the graphic shows only two dimensions.
Fuzzy c-means clustering
Here even more observations (in the lower middle) have been assigned to the cluster with the genuine banknotes. At first glance, fuzzy c-means clustering therefore appears to be worse than k-means clustering. However, the size of a data point in the graphic indicates the value of its membership function: the larger the data point, the more clearly it is assigned to a cluster; the smaller the data point, the more uncertain the algorithm is about the assignment. Looking at the data points in the lower center, one can see that they are very small in both the red and the blue cluster, i.e. the values of the membership function for both clusters are approximately 0.5. So the fuzzy c-means algorithm is actually very uncertain about which cluster these data points should be assigned to.
In fact, the genuine banknotes were printed from a single template (printing plate), while the counterfeit banknotes come from different sources and thus probably also from different forged printing plates.
Result of the k-means cluster analysis (left) and the fuzzy c-means cluster analysis (right) on the Swiss banknote data.

Literature

  • Christiane Stutz: Application-specific fuzzy cluster methods (dissertation on artificial intelligence, TU Munich). Infix, Sankt Augustin 1999.

References

  1. ^ JC Bezdek: Pattern recognition with fuzzy objective function algorithms . Plenum Press, New York 1981.
  2. ^ Bernhard Flury, Hans Riedwyl: Multivariate Statistics: A Practical Approach . 1st edition. Chapman & Hall, London 1988, ISBN 978-0-412-30030-1 .