EM algorithm

EM clustering for old faithful corruption data. The random model ( the spheres appear flat and wide due to the scaling of the axes ) is adapted to the observed data. The two typical modes of the geyser are clearly recognizable . The model changes significantly in the first iterations, then the changes become smaller and approach a limit value . Visualization generated with ELKI .

The expectation-maximization algorithm ( English expectation-maximization algorithm , hence also expectation-maximization algorithm , rarely also estimation-maximization algorithm , EM algorithm for short ) is an algorithm of mathematical statistics . The core idea of the EM algorithm is to start with a randomly selected model and alternately assign the data to the individual parts of the model ( expectation step , in short: E step ) and the parameters of the model to the latest assignment ( maximization step , short: M-step ).

The quality of the result is improved in both steps: In the E step, the points are assigned better, in the M step the model is changed so that it fits the data better. If there is no longer any significant improvement, the process is ended.

The procedure typically only finds “local” optima. As a result, it is often necessary to call the procedure several times and select the best result found in this way.

Mathematical formulation

functionality

There are objects with a certain number of properties. The properties take on random values. Some properties can be measured, but others cannot. From a formal point of view, the objects are instances of a multidimensional random variable that is subject to an unknown probability distribution ; some dimensions are "observable", others are "hidden". The aim is to determine the unknown probability distribution. ${\ displaystyle X}$ ${\ displaystyle p (x)}$

First of all, it is assumed that a few parameters are known. Usually one chooses as a mixed distribution , so as a weighted sum of so many standard distributions as there are observable characteristics. For example , if the normal distribution is selected as the standard distribution , the unknown parameters are the expected value and the variance . The aim is now to determine the parameters of the assumed probability distribution. If you choose a mixed distribution, the weighting factors also include the sum. ${\ displaystyle p (x)}$ ${\ displaystyle p}$ ${\ displaystyle \ mu}$ ${\ displaystyle \ sigma ^ {2}}$ ${\ displaystyle \ Theta}$ ${\ displaystyle \ Theta}$

Usually, such a problem is addressed using the maximum likelihood method . However, this is not possible here, as the parameters you are looking for depend on the hidden properties - and these are unknown. In order to still achieve the goal, a method is required that estimates the hidden properties with the end parameters sought. The EM algorithm fulfills this task.

The EM algorithm works iteratively and performs two steps in each pass:

E-Step: Appreciate hidden properties . First, the hidden properties are estimated from the final parameters determined in the previous run and the observed data . For this purpose, the so-called Q function is used, which calculates a preliminary expected value of the distribution sought. ${\ displaystyle Y}$ ${\ displaystyle \ Theta}$ ${\ displaystyle X}$
M step: Determine the final parameters. Now that the hidden properties have been estimated, the maximum likelihood method is used to determine the parameters actually sought. ${\ displaystyle \ Theta}$

The algorithm ends when the specific parameters no longer change significantly.

It has been proven that the sequence of certain converges , that is, the algorithm terminates in any case. However, the result usually only forms a local optimum, which is good, but not necessarily optimal. It is therefore necessary to execute the algorithm with many different starting values in order to find an end result that is as close as possible to the optimum. ${\ displaystyle \ Theta}$

Formulation as a random experiment

In purely formal terms, the EM algorithm assumes that the values of the observed stochastic quantity come about in the following way: First select one of the incoming random variables and use its value as the final result. This means that exactly one weight takes on the value one and all others are zero. However, this is normally no longer the case when the weights are approximated by the EM algorithm. The probability density of a target value can be in normal distribution assumption and constant variance of the random variables represented as:

{\ displaystyle p (y_ {i} \ mid h) = {\ frac {1} {\ sqrt {2 \ pi \ sigma ^ {2}}}} \ operatorname {exp} \ left (- {\ frac {1 } {2 \ sigma ^ {2}}} \ sum _ {j = 1} ^ {n} w_ {ij} (y_ {i} - \ mu _ {j}) ^ {2} \ right)}

The terms used have the following meanings:

${\ displaystyle w_ {ij}}$ : Weight of the -th random variable for the -th value of the target variable ${\ displaystyle j}$ ${\ displaystyle i}$
${\ displaystyle n}$ : Number of weights
${\ displaystyle y_ {i}}$ : Value of the -th target variable ${\ displaystyle i}$
${\ displaystyle h}$ : Sample
${\ displaystyle \ mu _ {j}}$ : Expected value of the -th random variable ${\ displaystyle j}$
${\ displaystyle \ sigma ^ {2}}$ : Variance

EM clustering

The EM clustering is a method of cluster analysis , the data with a Gaussian mixture model ( English gaussian mixture model , in short: GMM ) - so as a superposition of normal distributions - represented. This model is initialized randomly or heuristically and then refined using the general EM principle.

EM clustering consists of several iterations of the steps of expectation and maximization .

This must be freely selected in the initialization step . Assume for this that exactly one random variable (exactly one random training instance ) corresponds to this expected value , i. H. put . ${\ displaystyle \ mu}$ ${\ displaystyle y_ {k}}$ ${\ displaystyle \ mu}$ ${\ displaystyle \ mu _ {1, \ mathrm {appr}} = y_ {k}}$
In the expectation step, the expected values of the weights are calculated under the assumption that the expected values of the incoming random variables correspond to those calculated in the maximization step. However, this is only possible if it is not the first iteration.
The expected values can be represented as ${\ displaystyle \ operatorname {E} [w_ {ij}] = p (X_ {j} = y_ {i} | \ mu _ {j} = \ mu _ {j, \ mathrm {appr}}) / \ sum _ {k = 1} ^ {n} p (X_ {k} = y_ {i} \ mid \ mu _ {k} = \ mu _ {k, \ mathrm {appr}})}$
In the maximization step, the expected values of the probability distributions of the individual random variables are determined, for which the probability of the occurrence of the sample result is maximized. This is done under the assumption that the exact values of the weights correspond to their expected values ( maximum likelihood algorithm ). The expected values estimated in this way result from assumption of normal distribution through ${\ displaystyle \ mu _ {j, \ mathrm {appr}} = {\ frac {1} {N}} \ sum _ {i = 1} ^ {N} \ operatorname {E} [w_ {ij}] y_ {i}}$

where denotes the size of the training data set.

{\ displaystyle N}

By repeating the expectation and maximization steps, the parameters are brought closer to their actual values.

Advantages of EM clustering

Comparison of k-Means and EM on an artificial data set, visualized with ELKI . By using variances, EM can accurately describe the different normal distributions, while k-means divides the data into unfavorable Voronoi cells. The cluster centers are indicated by larger, paler symbols.

Mathematical model of the data with normal distributions
Allows clusters of different sizes (in the sense of dispersion - k-means ignores the variance of the clusters).
Can recognize and represent correlated clusters through the covariance matrix
Can handle incomplete observations

K-Means as an EM method

The k-means algorithm can also be understood as an EM algorithm that models the data as Voronoi cells .

An EM variant of the k-means algorithm looks like this:

In the initialization step, the cluster centers are selected randomly or with a heuristic. ${\ displaystyle \ mu _ {i}}$
In the expectation step, data objects are assigned to their next cluster center, i. H.

{\ displaystyle \ operatorname {E} [w_ {ij}] = {\ begin {cases} 1, & {\ text {falls}} \ quad || \ mu _ {i} -X_ {j} || = \ min _ {i} {|| \ mu _ {i} -X_ {j} ||} \\ 0, & {\ text {otherwise}} \ end {cases}}}

In the maximization step, the cluster centers are recalculated as the mean of the data points assigned to them:
${\ displaystyle \ mu _ {j} = {\ frac {1} {\ sum _ {i} ^ {n} \ operatorname {E} [w_ {ij}]}} \ sum _ {i = 1} ^ { n} \ operatorname {E} [w_ {ij}] y_ {i}}$

Further instances of the EM algorithm

Estimating the parameters of a mixed distribution
Baum-Welch algorithm for estimating the parameters of a hidden Markov model
Learning a Bayesian network with hidden variables

Proof for the maximization step assuming normal distribution

The aim is to prove: ${\ displaystyle {\ underset {\ mu _ {k}} {\ mathrm {arg \, max}}} \ prod _ {i = 1} ^ {m} P \ left (y \ mid h \ right) = { \ frac {1} {m}} \ sum _ {i = 1} ^ {m} {\ frac {1} {2 \ sigma ^ {2}}} \ operatorname {E} [w_ {ik}] y_ { i}}$

Instead of maximizing, one can also maximize, since the logarithm is a strictly monotonically increasing function. ${\ displaystyle P (y \ mid h)}$ ${\ displaystyle \ ln P (y \ mid h)}$

${\ displaystyle {\ underset {\ mu} {\ mathrm {arg \, max}}} \ ln \ prod _ {i = 1} ^ {m} P (y \ mid h)}$	${\ displaystyle = {\ underset {\ mu} {\ mathrm {arg \, max}}} \ left (\ sum _ {i = 1} ^ {m} \ left (\ ln {\ frac {1} {\ sqrt {2 \ pi \ sigma ^ {2}}}} - \ sum _ {j = 1} ^ {n} {\ frac {1} {2 \ sigma ^ {2}}} \ operatorname {E} [w_ {ij}] (y_ {i} - \ mu _ {j}) ^ {2} \ right) \ right)}$
	${\ displaystyle = {\ underset {\ mu} {\ mathrm {arg \, max}}} \ left (m \ ln {\ frac {1} {\ sqrt {2 \ pi \ sigma ^ {2}}}} - {\ frac {1} {2 \ sigma ^ {2}}} \ sum _ {i = 1} ^ {m} \ sum _ {j = 1} ^ {n} \ operatorname {E} [w_ {ij }] (y_ {i} - \ mu _ {j}) ^ {2} \ right)}$ .

The minuend is a constant, so it is sufficient to minimize the subtrahend:

{\ displaystyle {\ underset {\ mu} {\ mathrm {arg \, min}}} \ left ({\ frac {1} {2 \ sigma ^ {2}}} \ sum _ {i = 1} ^ { m} \ sum _ {j = 1} ^ {n} \ operatorname {E} \ left [w_ {ij} \ right] \ left (y_ {i} - \ mu _ {j} \ right) ^ {2} \ right) =}

{\ displaystyle {\ underset {\ mu} {\ mathrm {arg \, min}}} \ left (\ sum _ {i = 1} ^ {m} \ sum _ {j = 1} ^ {n} \ operatorname {E} [w_ {ij}] (y_ {i} - \ mu _ {j}) ^ {2} \ right)}

.

A quadratic sum always gives a non-negative value. Therefore the absolute minimum is bounded by 0 and thus also a local minimum. Set the 1st derivative for this term after each to 0: ${\ displaystyle t}$ ${\ displaystyle \ mu _ {k}}$

{\ displaystyle {\ frac {\ mathrm {d} t} {\ mathrm {d} \ mu _ {k}}} = - 2 \ sum _ {i = 1} ^ {m} \ operatorname {E} \ left [w_ {ik} \ right] \ left (y_ {i} - \ mu _ {k} \ right) = 0}

,

because the derivation of all summands of the sum over are , except for . Hence: ${\ displaystyle j}$ ${\ displaystyle 0}$ ${\ displaystyle j = k}$

{\ displaystyle \ sum _ {i = 1} ^ {m} \ operatorname {E} [w_ {ik}] y_ {i} = m \ mu _ {k} \ quad \ Rightarrow \ quad \ mu _ {k} = {\ frac {1} {m}} \ sum _ {i = 1} ^ {m} \ operatorname {E} \ left [w_ {ik} \ right] y_ {i}}

.

q. e. d.

literature

AP Dempster , NM Laird, DB Rubin: Maximum-Likelihood from incomplete data via the EM algorithm. In: Journal of the Royal Statistical Society. 1977, pp. 1-38 JSTOR 2984875 .
Tom M. Mitchell: Machine Learning . The Mc-Graw-Hill Companies, 1997.
Richard O. Duda et al .: Pattern Classification. 2nd Edition. John Wiley & Sons, New York 2012.
Stuart Russell , Peter Norvig : Artificial Intelligence: A Modern Approach . 2nd Edition. Prentice Hall, 2002.
Stuart Russell, Peter Norvig: Artificial Intelligence: A Modern Approach . Pearson Studies, 2004, ISBN 3-8273-7089-2 . (German translation of the 2nd edition)