# Maximum likelihood method

The maximum likelihood method, ML method for short (from English *maximum likelihood*, "greatest plausibility"), also called maximum likelihood estimation, is a parametric estimation method in statistics. Put simply, that parameter is selected as the estimate under whose distribution the realization of the observed data appears most plausible.

In the case of a probability function depending on a parameter $\vartheta$,

$$\rho \colon \Omega \to [0;1], \quad x \mapsto \rho(x \mid \vartheta),$$

the following likelihood function is considered for an observed outcome $x$ as the parameter varies:

$$L \colon \Theta \to [0;1], \quad \vartheta \mapsto \rho(x \mid \vartheta).$$

Here $\Omega$ denotes the outcome space and $\Theta$ the parameter space (the space of all possible parameter values).

For a given value of the parameter $\vartheta$, the likelihood function corresponds to the probability of observing the outcome $x$. The maximum likelihood estimate is that $\vartheta$ for which the likelihood function attains its maximum. For continuous distributions an analogous definition applies, with the probability function replaced by the associated density function. In general, maximum likelihood methods can be defined for arbitrary statistical models as long as the corresponding distribution class is a dominated distribution class.

## Motivation

Put simply, the maximum likelihood method means the following: when you conduct statistical analyses, you usually examine a sample of a certain number of objects from a population. Since investigating the entire population is impossible in most cases because of cost and effort, the important parameters of the population are unknown. Such parameters are, for example, the expected value or the standard deviation. Since you need these parameters for the statistical calculations you want to carry out, you have to estimate the unknown parameters of the population from the known sample.

The maximum likelihood method is used in situations in which the elements of the population can be interpreted as realizations of a random experiment that depends on an unknown parameter but is otherwise uniquely determined and known. The characteristic values of interest thus depend exclusively on this unknown parameter and can therefore be represented as a function of it. The maximum likelihood estimator is the parameter value that maximizes the probability of obtaining the observed sample.

Because of its advantages over other estimation methods (for example the least squares method and the moment method ), the maximum likelihood method is the most important principle for obtaining estimation functions for the parameters of a distribution.

## A heuristic derivation

Consider the following example: there is an urn with a large number of balls, each either black or red. Since examining all the balls seems practically impossible, a sample of ten balls is drawn (say, with replacement). In this sample there are one red and nine black balls. On the basis of this one sample, the true probability of drawing a red ball in the total population (urn) is now to be estimated.

*Figure: Three likelihood functions for the parameter $p$ of a binomial distribution, for different numbers $k$ of red balls in a sample of $n = 10$ balls.*

The maximum likelihood method constructs this estimate so that the occurrence of our sample becomes most likely. To this end, one can try out for which estimated value the probability of our sample result becomes maximal.

If you try, for example, $p = 0.2$ as an estimate for the probability of a red ball, you can use the binomial distribution to calculate the probability of the observed result (exactly one red ball): the result is $B(10;\,0.2;\,1) = 0.2684$.

If instead you try $p = 0.1$ as an estimate, i.e. you calculate the probability that exactly one red ball is drawn, the result is $B(10;\,0.1;\,1) = 0.3874$.

With $0.3874$ for $p = 0.1$, the probability that the observed result (exactly one red ball in the sample) was caused by a population probability of $p = 0.1$ for red balls is greater than with $p = 0.2$. The maximum likelihood method would therefore regard $0.1$ as the better estimate of the proportion $p$ of red balls in the population. It turns out that for $p = 0.1$ (see the red line for $k = 1$ in the graph) the probability of the observed result is greatest. Therefore $0.1$ is the maximum likelihood estimate of $p$. It can be shown that, in general, with $k$ red balls in the sample, the maximum likelihood estimate of $p$ is $k/10$.
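
This trial-and-error search can be sketched in a few lines of Python (an illustrative addition, not part of the article; the grid step of 0.01 is an arbitrary choice):

```python
from math import comb

# Binomial likelihood of exactly k = 1 red ball in n = 10 draws,
# as a function of the candidate probability p.
def likelihood(p, n=10, k=1):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(round(likelihood(0.2), 4))   # 0.2684
print(round(likelihood(0.1), 4))   # 0.3874

# Grid search over candidate values of p: the maximum sits at k/n = 0.1.
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=likelihood)
print(p_hat)   # 0.1
```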

## Definition

The maximum likelihood method is based on a random variable $X$ whose density or probability function $f$ depends on an unknown parameter $\vartheta$. If a simple random sample with $n$ realizations $x_1, \dotsc, x_n$ of $n$ independently and identically distributed random variables $X_1, \dotsc, X_n$ is given, the joint density or probability function can be factored as follows:

$$f(x_1, x_2, \dotsc, x_n; \vartheta) = \prod_{i=1}^{n} f(x_i; \vartheta).$$

Instead of evaluating the density for arbitrary values $x_1, \dotsc, x_n$ with the parameter $\vartheta$ held fixed, one can conversely, for observed and thus fixed realizations $x_1, \dotsc, x_n$, interpret the joint density as a function of $\vartheta$. This leads to the likelihood function

$$L(\vartheta) = \prod_{i=1}^{n} f_{\vartheta}(x_i).$$

The likelihood function is algebraically identical to the joint density $f(x_1, x_2, \dotsc, x_n; \vartheta)$. If this function is maximized as a function of $\vartheta$,

$$\hat{\vartheta}_{\text{ML}} = \underset{\vartheta \in \Theta}{\arg\max}\, L(\vartheta),$$

this yields the maximum likelihood estimate for the unknown parameter $\vartheta$. So the value of $\vartheta$ is sought at which the sample values $x_1, \dotsc, x_n$ have the greatest density or probability. It is natural to regard a parameter value as the more plausible the higher its likelihood; in this sense, the maximum likelihood estimate is the most plausible parameter value for the realizations $x_1, \dotsc, x_n$ of the random variables $X$. If $L(\cdot)$ is differentiable, the maximum can be determined by forming the first derivative with respect to $\vartheta$ and setting it equal to zero. Since this can be very laborious for density functions with complicated exponent expressions, the logarithmic likelihood function (log-likelihood function for short) is often used: because of the monotonicity of the logarithm, it attains its maximum at the same point as the non-logarithmized density function, but is easier to compute:

$$\ell(\vartheta) = \log\left(\prod_{i=1}^{n} f_{\vartheta}(x_i)\right) = \sum_{i=1}^{n} \underbrace{\log f_{\vartheta}(x_i)}_{=\,\ell_i(\vartheta)} = \sum_{i=1}^{n} \ell_i(\vartheta),$$

where $\ell_i(\vartheta)$ are the individual contributions to the log-likelihood function.

## Examples

### Discrete distribution, continuous parameter space

The number of calls received by two operators in one hour in a call center can be modeled by Poisson distributions

$$X_1 \sim \mathcal{P}(\lambda) \quad \text{and} \quad X_2 \sim \mathcal{P}(\lambda).$$

Independently of one another, the first operator receives three calls in the hour and the second five. The likelihood function for the unknown parameter $\lambda$ is given as

$$L(\lambda) = P(\{X_1 = 3\} \cap \{X_2 = 5\}) = P(X_1 = 3) \cdot P(X_2 = 5).$$

*Figure: the likelihood function $L(\lambda)$ in this example.*

Inserting the values into the probability function

$$P(X = x) = \frac{\lambda^x}{x!} \exp(-\lambda), \quad x = 0, 1, 2, \ldots,$$

yields

$$L(\lambda) = \frac{\lambda^3}{3!} \exp(-\lambda) \cdot \frac{\lambda^5}{5!} \exp(-\lambda) = \frac{\lambda^8}{3!\,5!} \exp(-2\lambda).$$

The first derivative of the likelihood function is given by

$$\begin{aligned} \left. \frac{\mathrm{d}}{\mathrm{d}\lambda} L(\lambda) \right|_{\hat{\lambda}} &= \frac{1}{3!\,5!} \left( 8\lambda^{7} \exp(-2\lambda) - 2\lambda^{8} \exp(-2\lambda) \right) \\ &= \frac{2\lambda^{7} \exp(-2\lambda)}{3!\,5!} \, (4 - \lambda) = 0 \end{aligned}$$

with zeros at $\hat{\lambda} = 0$ and $\hat{\lambda} = 4$. The likelihood function has a maximum only at $\hat{\lambda} = 4$, and this is the maximum likelihood estimate $\hat{\lambda}_{\text{ML}} = 4$.

In the general case, with $n$ operators each receiving $x_i$ calls per hour, the likelihood function results as

$$L(\lambda) = \frac{1}{\prod_{i=1}^{n} x_i!} \, \lambda^{\sum_{i=1}^{n} x_i} \exp(-n\lambda)$$

and the log-likelihood function as

$$\ell(\lambda) = \log(L(\lambda)) = \sum_{i=1}^{n} x_i \log(\lambda) - \log\left(\prod_{i=1}^{n} x_i!\right) - n\lambda.$$

Differentiating with respect to $\lambda$ gives

$$\left. \frac{\mathrm{d}}{\mathrm{d}\lambda} \ell(\lambda) \right|_{\hat{\lambda}_{\text{ML}}} = \frac{\sum_{i=1}^{n} x_i}{\hat{\lambda}_{\text{ML}}} - n \overset{!}{=} 0$$

and after rearranging, the maximum likelihood estimate results as

$$\hat{\lambda}_{\text{ML}} = \frac{1}{n} \sum_{i=1}^{n} x_i = \overline{x}$$

and the associated estimator as

$$\Lambda = \frac{1}{n} \sum_{i=1}^{n} X_i = \overline{X}.$$
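
As a quick numerical cross-check (an added illustration; the grid resolution is an arbitrary choice), maximizing the log-likelihood for the two observed counts 3 and 5 over a grid reproduces the sample mean:

```python
import math

# Observed calls per hour for the two operators in the example above.
x = [3, 5]
n = len(x)

# Poisson log-likelihood: sum_i [x_i*log(lam) - log(x_i!) - lam]
def log_likelihood(lam):
    return sum(xi * math.log(lam) - math.lgamma(xi + 1) - lam for xi in x)

# Maximize over a grid of candidate values; the closed-form derivation
# says the maximum lies at the sample mean.
grid = [i / 100 for i in range(1, 2001)]   # lambda in (0, 20]
lam_hat = max(grid, key=log_likelihood)
print(lam_hat, sum(x) / n)   # 4.0 4.0
```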

### Discrete distribution, finite parameter space

An urn contains $N = 8$ balls that are either red or black. The exact number $M \in \{0, 1, \dotsc, 8\}$ of red balls is not known. $n = 4$ balls are drawn one after another and put back into the urn each time. Observed are $x_1 = 1$ (first ball red), $x_2 = 1$ (second ball red), $x_3 = 0$ (third ball black) and $x_4 = 1$ (fourth ball red).

We are now looking for the most plausible composition of the balls in the urn according to the maximum likelihood principle.

In every draw the probability of drawing a red ball is $\frac{M}{N}$. Because the draws are independent, the probability of the observed result, and thus the associated likelihood function as a function of the unknown parameter $M$, is given by

$$L(M) = \left(\frac{M}{N}\right)^3 \left(1 - \frac{M}{N}\right) = \frac{1}{N^4} M^3 (N - M) = \frac{1}{8^4} M^3 (8 - M).$$

The following function values ​​result:

| $M$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| $L(M)$ | 0 | 0.002 | 0.012 | 0.033 | 0.063 | 0.092 | 0.105 | 0.084 | 0 |

It follows that the likelihood function $L(M)$ is maximal for $M = 6$. Thus $M = 6$ is the most plausible parameter value for the realization of three red balls in four draws, and hence the estimate according to the maximum likelihood method.
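
Because the parameter space is finite, the maximization can be carried out by simple enumeration; the following sketch (an added illustration) reproduces the table values and the maximizer:

```python
# Likelihood L(M) = (M/N)^3 * (1 - M/N) for the urn with N = 8 balls,
# after observing three red draws and one black draw (with replacement).
N = 8

def likelihood(M):
    return (M / N) ** 3 * (1 - M / N)

table = {M: round(likelihood(M), 3) for M in range(N + 1)}
M_hat = max(range(N + 1), key=likelihood)
print(table[6], M_hat)   # 0.105 6
```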

### Continuous distribution, continuous parameter space

Let $x_1, \dotsc, x_n$ be realizations of a random sample $X_1, \dotsc, X_n$ from a normal distribution $\mathcal{N}(\mu, \sigma^2)$ with unknown expected value $\mu \in (-\infty, \infty)$ and unknown variance $\sigma^2 > 0$. The density function of each individual realization is then given by

$$f\left(x_i \mid \mu, \sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right).$$

Then

$$L(\vartheta) = \prod_{i=1}^{n} f_{\vartheta}\left(x_i\right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right) = \frac{1}{\left(2\pi\sigma^2\right)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n}(x_i - \mu)^2\right)$$

is the likelihood function of $\vartheta = (\mu, \sigma^2) \in \Theta = (-\infty, \infty) \times (0, \infty)$. The log-likelihood function (also called the logarithmic plausibility function) results as

$$\ell(\vartheta) = \log L(\vartheta) = -\frac{n}{2} \log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n}(x_i - \mu)^2.$$

If one forms the partial derivatives of $\ell(\vartheta)$ with respect to $\mu$ and $\sigma^2$ (that is, the score functions) and sets both expressions equal to zero, one obtains the two likelihood equations

$$\left. \frac{\partial}{\partial\mu} \ell(\vartheta) \right|_{\hat{\mu}_{\text{ML}}} = -\frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \hat{\mu}_{\text{ML}}) \cdot (-1) \overset{!}{=} 0$$

and

$$\left. \frac{\partial}{\partial\sigma^2} \ell(\vartheta) \right|_{\hat{\sigma}^2_{\text{ML}}} = -\frac{n}{2\hat{\sigma}^2_{\text{ML}}} + \frac{1}{2(\hat{\sigma}^2_{\text{ML}})^2} \sum_{i=1}^{n} (x_i - \hat{\mu}_{\text{ML}})^2 \overset{!}{=} 0.$$

Solving for $\hat{\mu}_{\text{ML}}$ and $\hat{\sigma}^2_{\text{ML}}$, one obtains the two maximum likelihood estimates

$$\hat{\mu}_{\text{ML}} = \frac{1}{n} \sum_{i=1}^{n} x_i = \overline{x}$$

and

$$\hat{\sigma}^2_{\text{ML}} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x})^2.$$

If one starts from the random variables $X_1, \ldots, X_n$ rather than from their realizations $x_1, \ldots, x_n$, one obtains the sample mean

$$\hat{\mu}_{\text{ML}} = \overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

and the sample variance

$$\hat{\sigma}^2_{\text{ML}} = \tilde{S}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \overline{X})^2$$

as the maximum likelihood estimators.

In fact, the function $L(\vartheta)$ attains its maximum at this point (see the estimation of the population variance).
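
As an added sanity check with simulated data (the sample of size 200 from $\mathcal{N}(2, 1.5^2)$ is a hypothetical assumption, not from the article), the closed-form estimates beat nearby parameter values in log-likelihood:

```python
import math, random

random.seed(0)
# Hypothetical sample from N(mu=2, sigma=1.5); all values are assumptions.
x = [random.gauss(2.0, 1.5) for _ in range(200)]
n = len(x)

def log_likelihood(mu, var):
    return sum(-0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
               for xi in x)

# Closed-form ML estimates: sample mean and the (biased) 1/n variance.
mu_hat = sum(x) / n
var_hat = sum((xi - mu_hat) ** 2 for xi in x) / n

# Perturbing either estimate can only lower the log-likelihood.
best = log_likelihood(mu_hat, var_hat)
for dmu, dvar in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert log_likelihood(mu_hat + dmu, var_hat + dvar) < best
print(round(mu_hat, 2), round(var_hat, 2))
```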

For the expected value of $\hat{\mu}_{\text{ML}}$ one obtains

$$\operatorname{E}(\hat{\mu}_{\text{ML}}) = \mu,$$

that is, the maximum likelihood estimator $\hat{\mu}_{\text{ML}}$ is unbiased for the unknown parameter $\mu$.

One can show that for the expected value of $\hat{\sigma}^2_{\text{ML}}$

$$\operatorname{E}(\hat{\sigma}^2_{\text{ML}}) = \frac{n-1}{n} \sigma^2$$

holds (see estimation with unknown expected value). The maximum likelihood estimator $\hat{\sigma}^2_{\text{ML}}$ for the unknown variance $\sigma^2$ is therefore biased. However, one can show that $\hat{\sigma}^2_{\text{ML}}$ is asymptotically unbiased for $\sigma^2$.
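
A small Monte Carlo sketch (an added illustration; sample size, variance and trial count are arbitrary choices) makes the bias factor $(n-1)/n$ visible:

```python
import random

random.seed(1)
n, sigma2, trials = 5, 4.0, 20000

# Average the ML variance estimator over many simulated samples from N(0, sigma2).
acc = 0.0
for _ in range(trials):
    x = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    m = sum(x) / n
    acc += sum((xi - m) ** 2 for xi in x) / n
mean_est = acc / trials

print(round(mean_est, 1))            # close to (n-1)/n * sigma2
print((n - 1) / n * sigma2)          # 3.2
```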

## Historical development

The maximum likelihood method goes back to Ronald Aylmer Fisher, who developed it, initially in relative ignorance of Gauss's preliminary work, in papers from 1912, 1921 and finally 1922, where it received its now-familiar name. The main results had also been derived by Francis Ysidro Edgeworth in 1908.

## Maximum likelihood estimation

In statistics, a maximum likelihood estimate is a parameter estimate calculated by the maximum likelihood method. In the English specialist literature, the abbreviation MLE (for maximum likelihood estimation or maximum likelihood estimator) is very common. An estimate that incorporates prior knowledge in the form of an a priori probability is called a maximum a posteriori estimate (MAP for short).

## Properties of maximum likelihood estimators

The special quality of maximum likelihood estimators lies in the fact that they usually constitute the most efficient way of estimating certain parameters.

### Existence

Under certain regularity conditions it can be proven that maximum likelihood estimators exist; this is not obvious given their implicit definition as the unique maximum point of an otherwise unspecified probability function. The prerequisites for this proof consist essentially of assumptions about the interchangeability of integration and differentiation, which is satisfied in most of the models considered.

### Asymptotic normality

If maximum likelihood estimators exist, then they are asymptotically normally distributed. Formally, let $\hat{\vartheta}_{\text{ML}}$ be the maximum likelihood estimator for a parameter $\vartheta$ and $I^*(\vartheta) = \operatorname{E}(I(\vartheta))$ the expected Fisher information. Then

$$\sqrt{I^*(\vartheta)}\, (\hat{\vartheta}_{\text{ML}} - \vartheta) \overset{a}{\sim} \mathcal{N}(0, 1)$$

or, equivalently,

$$\hat{\vartheta}_{\text{ML}} \overset{a}{\sim} \mathcal{N}\left(\vartheta, (I^*(\vartheta))^{-1}\right).$$
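
The following simulation sketch (an added illustration; the Poisson model, its parameter and the sampler are assumptions) checks that the standardized ML estimator is approximately standard normal. For $n$ i.i.d. Poisson observations the expected Fisher information is $I^*(\lambda) = n/\lambda$:

```python
import math, random

random.seed(2)

def poisson(lam):
    # Knuth's multiplication sampler (adequate for small lambda)
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

lam, n, reps = 3.0, 50, 3000
zs = []
for _ in range(reps):
    sample = [poisson(lam) for _ in range(n)]
    lam_hat = sum(sample) / n                      # ML estimate = sample mean
    # standardize with sqrt(I*(lam)) = sqrt(n/lam)
    zs.append(math.sqrt(n / lam) * (lam_hat - lam))

mean_z = sum(zs) / reps
std_z = (sum((z - mean_z) ** 2 for z in zs) / reps) ** 0.5
print(round(mean_z, 2), round(std_z, 2))   # approximately 0 and 1
```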

### General tests

*Figure: how the three tests work in the context of the maximum likelihood method.*

The convergence of the maximum likelihood estimator $\hat{\vartheta}_{\text{ML}}$ to a normal distribution allows the derivation of general tests for checking models and coefficients:

The graphic on the right shows how the tests work: the likelihood-ratio test compares the values of the likelihood functions with one another, the Wald test checks the distance between the estimated parameter and the specified parameter, and the score test determines whether the derivative of the likelihood function is zero.

Since these tests are only asymptotically valid, there are often tests with better optimality properties for “small” sample sizes .

#### Likelihood-ratio test

The likelihood-ratio test checks whether two hierarchically nested models differ significantly from each other. If $\vartheta$ is a parameter vector, $\Theta_0 \subset \Theta_1$ are two parameter spaces ($\Theta_0$: reduced model, $\Theta_1$: full model) and $L(\vartheta)$ is the likelihood function, then under the null hypothesis ($H_0\colon \vartheta \in \Theta_0$ vs. $H_1\colon \vartheta \in \Theta_1$)

$$LR = -2 \log\left(\frac{\max_{\Theta_0} L(\vartheta)}{\max_{\Theta_1} L(\vartheta)}\right) \overset{a}{\sim} \chi^2\left(\dim(\Theta_1) - \dim(\Theta_0)\right).$$

Rejection of the null hypothesis means that the “full model” (the model under the alternative hypothesis ) provides a significantly better explanation than the “reduced model” (the model under the null hypothesis or null model ).
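
A minimal worked example for a Poisson model (added here; the call counts and the null value $\lambda_0 = 2$ are hypothetical assumptions):

```python
import math
from math import lgamma

# Hypothetical hourly call counts; H0: lambda = 2 (reduced model Theta_0)
# is tested against H1: lambda free (full model Theta_1).
x = [3, 5, 4, 6, 2, 4, 5, 3]
n = len(x)

def log_lik(lam):
    return sum(xi * math.log(lam) - lgamma(xi + 1) - lam for xi in x)

lam0 = 2.0               # the single point of Theta_0
lam_hat = sum(x) / n     # unrestricted MLE, maximizes over Theta_1

LR = -2 * (log_lik(lam0) - log_lik(lam_hat))
# Under H0, LR is asymptotically chi^2 with dim(Theta_1) - dim(Theta_0) = 1
# degree of freedom; the 95% quantile of chi^2(1) is about 3.84.
print(round(LR, 2), LR > 3.84)   # 12.36 True -> reject H0
```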

#### Wald test

While the likelihood-ratio test compares models, the Wald test targets individual coefficients (univariate) or groups of coefficients (multivariate). Asymptotically, under the null hypothesis $H_0$, it follows that

$$W = \sqrt{I(\hat{\vartheta}_{\text{ML}})}\, (\hat{\vartheta}_{\text{ML}} - \vartheta_0) \overset{a, H_0}{\sim} \mathcal{N}(0, 1).$$

That is, under the above assumptions the Wald test statistic is asymptotically standard normally distributed. Here $I(\cdot)$ denotes the Fisher information.
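
The same kind of hypothetical Poisson data can illustrate the Wald statistic (an added sketch; counts and null value are assumptions, and $I(\lambda) = n/\lambda$ is the Fisher information of $n$ i.i.d. Poisson observations):

```python
import math

# Hypothetical counts; Wald test of H0: lambda = 2 for a Poisson model.
x = [3, 5, 4, 6, 2, 4, 5, 3]
n = len(x)
lam_hat = sum(x) / n      # ML estimate (here 4.0)
lam0 = 2.0

# Fisher information evaluated at the MLE: I(lam_hat) = n / lam_hat.
W = math.sqrt(n / lam_hat) * (lam_hat - lam0)
print(round(W, 2), abs(W) > 1.96)   # 2.83 True -> reject H0 at the 5% level
```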

### Akaike information criterion

The maximum likelihood method is also closely linked to the Akaike information criterion (AIC). Hirotsugu Akaike showed that the maximum of the likelihood function is a biased estimator of the Kullback-Leibler divergence, the distance between the true model and the maximum likelihood model. The larger the value of the likelihood function, the closer the model is to the true model; the model with the lowest AIC value is selected. The asymptotic bias is precisely the number of parameters to be estimated. In contrast to the likelihood-ratio, Wald and score tests, the Akaike information criterion can also be used to compare non-nested ML models.
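
A brief sketch of an AIC comparison between two non-nested one-parameter models (added here; the count data and the choice of a geometric alternative are assumptions):

```python
import math
from math import lgamma

# Hypothetical count data; two non-nested one-parameter models compared by
# AIC = 2k - 2*log-likelihood, with k the number of estimated parameters.
x = [3, 5, 4, 6, 2, 4, 5, 3]
n = len(x)
xbar = sum(x) / n

# Model A: Poisson(lambda) with MLE lambda = xbar
ll_pois = sum(xi * math.log(xbar) - lgamma(xi + 1) - xbar for xi in x)
# Model B: geometric on {0,1,...}, P(X=x) = p(1-p)^x, with MLE p = 1/(1+xbar)
p = 1 / (1 + xbar)
ll_geom = sum(math.log(p) + xi * math.log(1 - p) for xi in x)

aic_pois = 2 * 1 - 2 * ll_pois
aic_geom = 2 * 1 - 2 * ll_geom
print(aic_pois < aic_geom)   # True: the Poisson model has the lower AIC here
```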

The desirable properties of the maximum likelihood approach rest on the decisive assumption about the data-generating process, that is, on the assumed density function of the random variable under investigation. The disadvantage of the maximum likelihood method is that a concrete assumption must be made about the entire distribution of the random variable. If that assumption is violated, the maximum likelihood estimators may be inconsistent.

Only in some cases is it irrelevant whether the random variable actually follows the assumed distribution; in general this is not the case. Estimators obtained by maximum likelihood that remain consistent even when the underlying distribution assumption is violated are called pseudo-maximum-likelihood estimators.

Maximum likelihood estimators can have efficiency problems and systematic errors in small samples.

If the data are not random, better parameters can often be determined with other methods. This can play a role, for example, in quasi-Monte Carlo analyses, or if the data have already been averaged.

## Application example: maximum likelihood in molecular phylogeny

The maximum likelihood criterion is considered one of the standard methods for calculating phylogenetic trees in order to investigate relationships between organisms, mostly on the basis of DNA or protein sequences. As an explicit method, maximum likelihood makes it possible to use different evolution models, which enter the tree calculations in the form of substitution matrices. Either empirical models are used (protein sequences), or the probabilities of point mutations between the different nucleotides are estimated from the data set and optimized with respect to the likelihood value $-\ln L$ (DNA sequences). In general, ML is considered the most reliable and least artifact-prone among phylogenetic tree construction methods. However, it requires careful taxon sampling and usually a complex evolution model.

## Literature

• Jochen Schwarze: *Basics of Statistics – Volume 2: Probability Calculation and Inductive Statistics.* 6th edition. Verlag Neue Wirtschaftsbriefe, Berlin/Herne 1997.
• Volker Blobel, Erich Lohrmann: *Statistical and Numerical Methods of Data Analysis.* Teubner Study Books, Stuttgart/Leipzig 1998, ISBN 978-3-519-03243-4.

## Individual evidence

1. Alice Zheng, Amanda Casari: *Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists.*
2. In a 1954 circular, the German Standards Committee proposed the cumbersome term "method of maximum likelihood in the Gauss-Fisher sense".
3. George G. Judge, R. Carter Hill, W. Griffiths, Helmut Lütkepohl, T. C. Lee: *Introduction to the Theory and Practice of Econometrics.* 2nd edition. John Wiley & Sons, New York/Chichester/Brisbane/Toronto/Singapore 1988, ISBN 0-471-62414-4, p. 64.
4. Leonhard Held, Daniel Sabanés Bové: *Applied Statistical Inference: Likelihood and Bayes.* Springer, Heidelberg/New York/Dordrecht/London 2014, ISBN 978-3-642-37886-7, p. 14.
5. R. A. Fisher: An absolute criterion for fitting frequency curves. In: *Messenger of Mathematics.* No. 41, 1912, p. 155. JSTOR 2246266.
6. John Aldrich: R. A. Fisher and the Making of Maximum Likelihood 1912-1922. In: *Statistical Science.* Volume 12, No. 3, 1997, pp. 162-176, doi:10.1214/ss/1030037906, JSTOR 2246367.