# Classic linear model of normal regression

In statistics, classical normal regression is a regression which, in addition to the Gauss-Markov assumptions, includes the assumption that the disturbance variables are normally distributed. The associated model is called the classical linear model of normal regression. The assumption of normally distributed disturbance variables is required in order to carry out statistical inference, i.e. it is needed to compute confidence intervals and to test general linear hypotheses. In addition, further properties of the least squares estimator can be derived from the normality assumption.

## Starting position

As a starting point, we consider a typical multiple linear regression model with given data ${\displaystyle \{y_{t},x_{tk}\}_{t=1,\dots,T,\;k=1,\dots,K}}$ for ${\displaystyle T}$ statistical units. The relationship between the dependent variable and the independent variables can be represented as follows:

${\displaystyle y_{t}=x_{t1}\beta_{1}+x_{t2}\beta_{2}+\ldots+x_{tK}\beta_{K}+\varepsilon_{t}=\mathbf{x}_{t}^{\top}{\boldsymbol{\beta}}+\varepsilon_{t},\quad t=1,2,\dotsc,T}$.

or, in matrix notation,

${\displaystyle {\begin{pmatrix}y_{1}\\y_{2}\\\vdots\\y_{T}\end{pmatrix}}_{(T\times 1)}\;=\;{\begin{pmatrix}x_{11}&x_{12}&\cdots&x_{1k}&\cdots&x_{1K}\\x_{21}&x_{22}&\cdots&x_{2k}&\cdots&x_{2K}\\\vdots&\vdots&\ddots&\vdots&\ddots&\vdots\\x_{T1}&x_{T2}&\cdots&x_{Tk}&\cdots&x_{TK}\end{pmatrix}}_{(T\times K)}\;\cdot\;{\begin{pmatrix}\beta_{1}\\\beta_{2}\\\vdots\\\beta_{K}\end{pmatrix}}_{(K\times 1)}\;+\;{\begin{pmatrix}\varepsilon_{1}\\\varepsilon_{2}\\\vdots\\\varepsilon_{T}\end{pmatrix}}_{(T\times 1)}}$

or in compact notation

${\displaystyle \mathbf{y}=\mathbf{X}{\boldsymbol{\beta}}+{\boldsymbol{\varepsilon}}}$.

Here, ${\displaystyle {\boldsymbol{\beta}}}$ represents a vector of unknown parameters that must be estimated from the data.

## Classic linear model

The multiple linear regression model

${\displaystyle \mathbf{y}=\mathbf{X}{\boldsymbol{\beta}}+{\boldsymbol{\varepsilon}}}$

is called "classic" if the following assumptions hold:

• A1: The disturbance variables have an expected value of zero, ${\displaystyle \operatorname{E}({\boldsymbol{\varepsilon}})=\mathbf{0}}$, which means that we can assume that our model is correct on average.
• A2: The disturbance variables are uncorrelated, ${\displaystyle \operatorname{Cov}(\varepsilon_{i},\varepsilon_{j})=\operatorname{E}[(\varepsilon_{i}-\operatorname{E}(\varepsilon_{i}))(\varepsilon_{j}-\operatorname{E}(\varepsilon_{j}))]=\operatorname{E}(\varepsilon_{i}\varepsilon_{j})=0\quad\forall i\neq j,\;i,j=1,\ldots,T}$, and have homogeneous variance. Both together result in ${\displaystyle \operatorname{Cov}({\boldsymbol{\varepsilon}})=\sigma^{2}\mathbf{I}_{T}}$.
• A3: The data matrix ${\displaystyle \mathbf{X}}$ is non-stochastic and has full column rank, ${\displaystyle \operatorname{Rank}(\mathbf{X})=K}$.

The assumptions A1–A3 can be summarized as ${\displaystyle {\boldsymbol{\varepsilon}}\sim(\mathbf{0},\sigma^{2}\mathbf{I}_{T})}$. Instead of considering the variances and covariances of the disturbance variables individually, they are collected in the following variance-covariance matrix:

${\displaystyle {\begin{aligned}\operatorname{Cov}({\boldsymbol{\varepsilon}})&=\operatorname{E}\left(({\boldsymbol{\varepsilon}}-\underbrace{\operatorname{E}({\boldsymbol{\varepsilon}})}_{=\mathbf{0}\;{\text{by A1}}})({\boldsymbol{\varepsilon}}-\underbrace{\operatorname{E}({\boldsymbol{\varepsilon}})}_{=\mathbf{0}\;{\text{by A1}}})^{\top}\right)=\operatorname{E}({\boldsymbol{\varepsilon}}{\boldsymbol{\varepsilon}}^{\top})={\begin{pmatrix}\operatorname{Var}(\varepsilon_{1})&\operatorname{Cov}(\varepsilon_{1},\varepsilon_{2})&\cdots&\operatorname{Cov}(\varepsilon_{1},\varepsilon_{T})\\\operatorname{Cov}(\varepsilon_{2},\varepsilon_{1})&\operatorname{Var}(\varepsilon_{2})&\cdots&\operatorname{Cov}(\varepsilon_{2},\varepsilon_{T})\\\vdots&\vdots&\ddots&\vdots\\\operatorname{Cov}(\varepsilon_{T},\varepsilon_{1})&\operatorname{Cov}(\varepsilon_{T},\varepsilon_{2})&\cdots&\operatorname{Var}(\varepsilon_{T})\end{pmatrix}}\\&{\stackrel{\text{by A2}}{=}}\;\sigma^{2}{\begin{pmatrix}1&0&\cdots&0\\0&1&\ddots&\vdots\\\vdots&\ddots&\ddots&0\\0&\cdots&0&1\end{pmatrix}}_{(T\times T)}=\sigma^{2}\mathbf{I}_{T}\end{aligned}}}$

Thus, for ${\displaystyle \mathbf{y}}$:

${\displaystyle \operatorname{E}(\mathbf{y})=\mathbf{X}{\boldsymbol{\beta}}\quad}$ with ${\displaystyle \quad\operatorname{Cov}(\mathbf{y})=\sigma^{2}\mathbf{I}_{T}}$.

If, in addition to the assumptions of the above classical linear regression model (also called the standard model of linear regression), the disturbance variables are assumed to be normally distributed, one speaks of the classical linear model of normal regression. It is then given by

${\displaystyle \mathbf{y}=\mathbf{X}{\boldsymbol{\beta}}+{\boldsymbol{\varepsilon}}\;}$ with ${\displaystyle \;{\boldsymbol{\varepsilon}}\sim{\mathcal{N}}\left(\mathbf{0},\sigma^{2}\mathbf{I}_{T}\right)}$.
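The model above can be illustrated with a short simulation; a minimal numpy sketch, in which all dimensions, parameter values, and variable names are illustrative assumptions:

```python
import numpy as np

# Simulate data from the classical linear model of normal regression,
# y = X @ beta + eps with eps ~ N(0, sigma2 * I_T).
rng = np.random.default_rng(0)

T, K = 100, 3                      # T observations, K regressors (hypothetical)
beta = np.array([2.0, -1.0, 0.5])  # hypothetical true parameter vector
sigma2 = 4.0                       # true disturbance variance

# Design matrix with an intercept column; non-stochastic in the model,
# drawn once here purely for illustration (assumption A3: full column rank).
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
eps = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=T)  # normal disturbances
y = X @ beta + eps

# A1 and A2 hold by construction: E(eps) = 0 and Cov(eps) = sigma2 * I_T.
print(X.shape, y.shape)
```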

## Maximum likelihood estimation

### Estimation of the slope parameter

The unknown variance parameter and the slope parameters of the normal linear model can be estimated by maximum likelihood. To do this, the individual probability density of the normally distributed error term is first required. It is:

${\displaystyle f(\varepsilon_{t}\mid\sigma^{2})={\frac{1}{\sqrt{2\pi\sigma^{2}}}}\exp\left\{-{\frac{\varepsilon_{t}^{2}}{2\sigma^{2}}}\right\}}$, where ${\displaystyle \sigma^{2}=\sigma_{\varepsilon}^{2}}$.

Since the disturbance variable can also be represented as ${\displaystyle \varepsilon_{t}=y_{t}-\mathbf{x}_{t}^{\top}{\boldsymbol{\beta}}}$, the individual density can also be written as

${\displaystyle f(y_{t}\mid\mathbf{x}_{t}^{\top},{\boldsymbol{\beta}},\sigma^{2})={\frac{1}{\sqrt{2\pi\sigma^{2}}}}\exp\left\{-{\frac{\left(y_{t}-\mathbf{x}_{t}^{\top}{\boldsymbol{\beta}}\right)^{2}}{2\sigma^{2}}}\right\}}$.

Due to the assumption of stochastic independence, the joint probability density ${\displaystyle f}$ can be represented as the product of the individual marginal densities ${\displaystyle f_{1},\ldots,f_{T}}$:

${\displaystyle f(y_{1},y_{2},\ldots,y_{T}\mid\mathbf{X},{\boldsymbol{\beta}},\sigma^{2})=f(y_{1}\mid\mathbf{x}_{1}^{\top},{\boldsymbol{\beta}},\sigma^{2})\cdot f(y_{2}\mid\mathbf{x}_{2}^{\top},{\boldsymbol{\beta}},\sigma^{2})\cdot\ldots\cdot f(y_{T}\mid\mathbf{x}_{T}^{\top},{\boldsymbol{\beta}},\sigma^{2})}$

${\displaystyle {\begin{aligned}f(y_{1},y_{2},\dotsc,y_{T}\mid\mathbf{X},{\boldsymbol{\beta}},\sigma^{2})&=\prod_{t=1}^{T}f_{t}(y_{t}\mid\mathbf{x}_{t},{\boldsymbol{\beta}},\sigma^{2})\\&={\frac{1}{\sqrt{2\pi\sigma^{2}}}}\exp\left\{-{\frac{\left(y_{1}-\mathbf{x}_{1}^{\top}{\boldsymbol{\beta}}\right)^{2}}{2\sigma^{2}}}\right\}\cdot\ldots\cdot{\frac{1}{\sqrt{2\pi\sigma^{2}}}}\exp\left\{-{\frac{\left(y_{T}-\mathbf{x}_{T}^{\top}{\boldsymbol{\beta}}\right)^{2}}{2\sigma^{2}}}\right\}\\&=(2\pi\sigma^{2})^{-{\frac{T}{2}}}\exp\left\{-{\frac{\left(\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}}\right)^{\top}\left(\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}}\right)}{2\sigma^{2}}}\right\}\end{aligned}}}$

The joint density can also be written as:

${\displaystyle f(\mathbf{y}\mid\mathbf{X},{\boldsymbol{\beta}},\sigma^{2})=(2\pi\sigma^{2})^{-{\frac{T}{2}}}|\mathbf{I}_{T}|^{-{\frac{1}{2}}}\exp\left\{-{\frac{\left(\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}}\right)^{\top}\mathbf{I}_{T}\left(\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}}\right)}{2\sigma^{2}}}\right\}}$
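The equality of the product form and the compact matrix form can be checked numerically; a minimal numpy sketch, where the data and parameter values are illustrative assumptions:

```python
import numpy as np

# Joint density of y: once as the product of the individual normal densities,
# once in the compact form (2*pi*sigma2)^(-T/2) * exp(-(y-Xb)'(y-Xb)/(2*sigma2)).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # illustrative design matrix
y = np.array([0.2, 1.1, 2.1])                       # illustrative observations
beta = np.array([0.0, 1.0])                         # illustrative parameters
sigma2 = 0.5
T = len(y)

resid = y - X @ beta
# Product of the T marginal densities f_t(y_t | x_t, beta, sigma2)
individual = np.exp(-resid**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
product_form = individual.prod()
# Compact matrix form of the same joint density
matrix_form = (2 * np.pi * sigma2) ** (-T / 2) * np.exp(-(resid @ resid) / (2 * sigma2))
print(product_form, matrix_form)
```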

Since we are not interested in the probability of a specific outcome for given parameters, but rather seek the parameters that best fit our data, i.e. those with the greatest plausibility of corresponding to the true parameters, the likelihood function can now be formulated as the joint probability density viewed as a function of the parameters:

${\displaystyle L({\boldsymbol{\beta}},\sigma^{2};\mathbf{y},\mathbf{X})=(2\pi\sigma^{2})^{-{\frac{T}{2}}}\exp\left\{-{\frac{\left(\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}}\right)^{\top}\left(\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}}\right)}{2\sigma^{2}}}\right\}}$

Taking the logarithm of the likelihood function yields the log-likelihood function (also called the logarithmic plausibility function) as a function of the parameters:

${\displaystyle \ell({\boldsymbol{\beta}},\sigma^{2};\mathbf{y},\mathbf{X})=\ln\left(L({\boldsymbol{\beta}},\sigma^{2};\mathbf{y},\mathbf{X})\right)=-{\frac{T}{2}}\cdot\ln(2\pi)-{\frac{T}{2}}\cdot\ln(\sigma^{2})-{\frac{\left(\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}}\right)^{\top}\left(\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}}\right)}{2\sigma^{2}}}}$

This function must now be maximized with respect to the parameters. The following maximization problems arise:

${\displaystyle {\tilde{\sigma}}^{2}={\underset{\sigma^{2}}{\operatorname{arg\,max}}}\;\ell({\boldsymbol{\beta}},\sigma^{2}\mid\mathbf{y},\mathbf{X})}$
${\displaystyle {\tilde{\boldsymbol{\beta}}}={\underset{\boldsymbol{\beta}}{\operatorname{arg\,max}}}\;\ell({\boldsymbol{\beta}},\sigma^{2}\mid\mathbf{y},\mathbf{X})}$

The two score functions are:

${\displaystyle \left.{\frac{\partial\,\ell({\boldsymbol{\beta}},\sigma^{2};\mathbf{y},\mathbf{X})}{\partial\,{\boldsymbol{\beta}}}}\right|_{{\boldsymbol{\beta}}={\tilde{\boldsymbol{\beta}}},\,\sigma^{2}={\tilde{\sigma}}^{2}}=-{\frac{1}{2\sigma^{2}}}\cdot\underbrace{\frac{\partial\,((\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}})^{\top}(\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}}))}{\partial\,{\boldsymbol{\beta}}}}_{2\left(\mathbf{X}^{\top}\mathbf{X}{\boldsymbol{\beta}}-\mathbf{X}^{\top}\mathbf{y}\right)}\;{\overset{!}{=}}\;0}$
${\displaystyle \left.{\frac{\partial\,\ell({\boldsymbol{\beta}},\sigma^{2};\mathbf{y},\mathbf{X})}{\partial\,\sigma^{2}}}\right|_{{\boldsymbol{\beta}}={\tilde{\boldsymbol{\beta}}},\,\sigma^{2}={\tilde{\sigma}}^{2}}=-{\frac{T}{2\sigma^{2}}}+{\frac{1}{2\sigma^{4}}}\cdot(\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}})^{\top}(\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}})\;{\overset{!}{=}}\;0}$

Partial differentiation shows that the expression

${\displaystyle {\frac{\partial\,((\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}})^{\top}(\mathbf{y}-\mathbf{X}{\boldsymbol{\beta}}))}{\partial\,{\boldsymbol{\beta}}}}=-2\mathbf{X}^{\top}\mathbf{y}+2\mathbf{X}^{\top}\mathbf{X}{\boldsymbol{\beta}}}$

is already known from the derivation of the least squares estimator (estimation of the parameter vector by least squares). The maximum likelihood optimization problem thus reduces to the least squares optimization problem. It follows that the least squares estimator corresponds to the maximum likelihood (ML) estimator:

${\displaystyle {\tilde{\boldsymbol{\beta}}}=\mathbf{b}=(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}}$

The additional normality assumption thus makes no difference for the estimation of the parameter vector. If the disturbances are normally distributed, ${\displaystyle \mathbf{b}}$ is the maximum likelihood estimator and, by the theorem of Lehmann-Scheffé, the best unbiased estimator (BUE). As a consequence of the equality of the least squares and maximum likelihood estimators, the least squares and ML residuals must also be equal:

${\displaystyle {\tilde{\boldsymbol{\varepsilon}}}=\left(\mathbf{y}-\mathbf{X}{\tilde{\boldsymbol{\beta}}}\right)=\left(\mathbf{y}-\mathbf{X}\mathbf{b}\right)={\hat{\boldsymbol{\varepsilon}}}}$
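The coincidence of the ML and least squares estimators can be verified numerically: at the closed-form solution, the score with respect to the parameter vector vanishes. A minimal numpy sketch, where the simulated data and parameter values are illustrative assumptions:

```python
import numpy as np

# Illustrative data from a simple linear model with normal disturbances.
rng = np.random.default_rng(1)
T = 50
X = np.column_stack([np.ones(T), rng.normal(size=T)])  # intercept + one regressor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=T)

# Closed-form least squares / ML estimator b = (X'X)^{-1} X'y,
# computed by solving the normal equations X'X b = X'y.
b = np.linalg.solve(X.T @ X, X.T @ y)

# The score for beta is proportional to -(X'X b - X'y); it vanishes at b
# (up to floating point) for any sigma^2 > 0.
score_beta = -(X.T @ X @ b - X.T @ y)
print(b, np.abs(score_beta).max())
```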

### Estimation of the variance parameter

The maximum likelihood estimator for the variance results from the second score function and the fact that ${\displaystyle {\hat{\sigma}}^{2}={\frac{{\hat{\boldsymbol{\varepsilon}}}^{\top}{\hat{\boldsymbol{\varepsilon}}}}{T-K}}\Leftrightarrow{\hat{\sigma}}^{2}(T-K)={\hat{\boldsymbol{\varepsilon}}}^{\top}{\hat{\boldsymbol{\varepsilon}}}}$. It is:

${\displaystyle {\tilde{\sigma}}^{2}={\frac{(\mathbf{y}-\mathbf{X}{\tilde{\boldsymbol{\beta}}})^{\top}(\mathbf{y}-\mathbf{X}{\tilde{\boldsymbol{\beta}}})}{T}}={\frac{{\tilde{\boldsymbol{\varepsilon}}}^{\top}{\tilde{\boldsymbol{\varepsilon}}}}{T}}={\frac{{\hat{\boldsymbol{\varepsilon}}}^{\top}{\hat{\boldsymbol{\varepsilon}}}}{T}}={\frac{{\hat{\sigma}}^{2}(T-K)}{T}}}$

The ML estimator is thus the average residual sum of squares. However, this estimator does not meet common quality criteria for point estimators, because it is not an unbiased estimate of the variance of the disturbance variables. The value of the log-likelihood function, evaluated at the estimated parameters, is:

${\displaystyle \ell(\mathbf{b},{\tilde{\sigma}}^{2};\mathbf{y},\mathbf{X})=\ln\left(L(\mathbf{b},{\tilde{\sigma}}^{2};\mathbf{y},\mathbf{X})\right)=-{\frac{T}{2}}\cdot\ln(2\pi)-{\frac{T}{2}}\cdot\ln({\tilde{\sigma}}^{2})-{\frac{\left(\mathbf{y}-\mathbf{X}\mathbf{b}\right)^{\top}\left(\mathbf{y}-\mathbf{X}\mathbf{b}\right)}{2{\tilde{\sigma}}^{2}}}}$
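The relationship between the biased ML variance estimator (division by ${\displaystyle T}$) and the unbiased estimator (division by ${\displaystyle T-K}$) can be sketched as follows; a minimal numpy example, where the simulated data and parameter values are illustrative assumptions:

```python
import numpy as np

# Illustrative data from a linear model with normal disturbances.
rng = np.random.default_rng(2)
T, K = 40, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=T)

b = np.linalg.solve(X.T @ X, X.T @ y)  # least squares / ML estimator
resid = y - X @ b
ssr = resid @ resid                    # residual sum of squares

sigma2_ml = ssr / T                    # ML estimator: biased (too small)
sigma2_unbiased = ssr / (T - K)        # unbiased variance estimator
# They are related by sigma2_ml = sigma2_unbiased * (T - K) / T.
print(sigma2_ml, sigma2_unbiased)
```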

## Generalization

While the classical linear model of normal regression assumes that the disturbance variable (the unobservable random component) is normally distributed, in generalized linear models the disturbance variable may follow any distribution from the class of the exponential family.
