Standard error of regression

The (estimated) standard error of regression (SER for short), also called the standard error of the estimate or the root mean squared error (RMSE), is a measure of the accuracy of a regression in statistics, and especially in regression analysis. It is defined as the square root of the unbiased estimator of the unknown variance of the disturbance variables (the residual variance) and can be interpreted as the square root of the "average residual square" that arises when the fitted regression line is used to predict the target variable. It measures the average distance between the data points and the regression line. The standard error of the regression can be used to estimate the variances of the regression parameter estimators, since these depend on the unknown standard deviation $\sigma$.

The standard error of regression and the coefficient of determination are the most commonly used measures in regression analysis, but they follow different philosophies. Whereas the coefficient of determination quantifies the explanatory power of the model, the standard error of the regression gives an estimate of the standard deviation of the unobservable effects that affect the outcome (or, equivalently, an estimate of the standard deviation of the outcome after the effects of the explanatory variables have been removed).

The standard error of the regression is usually denoted by $\hat{\sigma}$ or $\text{SER}$; occasionally $s$ is also used.

Introduction to the problem

The "quality" of a regression can be judged by the estimated standard error of the residuals (residual standard error), which is part of the standard output of most statistical software packages. The estimated standard error of the residuals indicates how closely the residuals $\hat{\varepsilon}_i$ approximate the true disturbance variables $\varepsilon_i$. The residuals are thus an approximation of the disturbance variables, $\varepsilon_i \approx \hat{\varepsilon}_i$. The estimated standard error of the residuals is comparable to the coefficient of determination and the adjusted coefficient of determination and is to be interpreted similarly. The estimated residual standard error is defined by

$$\tilde{s} = \sqrt{\tfrac{1}{n} \sum\nolimits_{i=1}^{n} \hat{\varepsilon}_i^2}.$$

It should be noted, however, that $\tilde{s}^2$ is a biased estimate of the true variance of the disturbance variables, $\sigma_{\varepsilon}^2 = \sigma^2$, because the variance estimator used is not unbiased. Estimating the two regression parameters $\beta_0$ and $\beta_1$ costs two degrees of freedom. If one therefore divides by the number of degrees of freedom $(n-2)$ instead of by the sample size $n$, one obtains the "mean residual square" (German: Mittlere Quadratsumme der Residuen, MQR for short), $MQR = SQR/(n-2)$, where $SQR$ is the residual sum of squares, and thus an unbiased representation. The square root of this unbiased representation is known as the standard error of regression.
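The difference between the two divisors can be illustrated with a minimal sketch. The data values and the helper name `fit_line` are hypothetical and chosen only for illustration; the code simply evaluates the two formulas above on a small sample.

```python
import math

def fit_line(x, y):
    """Least squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical toy data (assumption, not taken from the article).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_line(x, y)
residuals = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
sqr = sum(e ** 2 for e in residuals)  # residual sum of squares (SQR)
n = len(x)

s_tilde = math.sqrt(sqr / n)      # divisor n: biased variant
ser = math.sqrt(sqr / (n - 2))    # divisor n - 2: standard error of regression

print(f"SQR = {sqr:.4f}, s_tilde = {s_tilde:.4f}, SER = {ser:.4f}")
```

Because the same residual sum of squares is divided by a smaller number, the standard error of regression is always larger than $\tilde{s}$; the gap shrinks as $n$ grows.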

Definition

The standard error of the regression is defined as the square root of the unbiased estimate of the variance of the disturbance variables, the so-called residual variance:

$$\hat{\sigma} = +\sqrt{\hat{\sigma}^2}.$$

The standard error of the regression has the same unit as the target variable. It is usually smaller than the standard error of the $y$ values. Note that, for a given sample, the standard error of the regression can either decrease or increase when another explanatory variable is added to the regression model. This is because the residual sum of squares never increases (and usually decreases) when an explanatory variable is added, while the degrees of freedom also fall by one (or by $p$ when $p$ variables are added). Since the residual sum of squares is in the numerator and the number of degrees of freedom is in the denominator, one cannot predict which effect will dominate.

The derivation of the standard error of the regression usually assumes that the disturbance variables are uncorrelated, have an expected value of zero and a homogeneous variance (Gauss-Markov assumptions). If at least one of these assumptions is violated, the standard error of the regression calculated according to the above formula will not estimate the true value on average, i.e. it will be a biased estimate of the unknown standard deviation.
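The trade-off between a smaller residual sum of squares and fewer degrees of freedom can be made concrete with a small sketch. The residual sums of squares below are hypothetical numbers, not the result of an actual fit, and the helper `ser` assumes the divisor $n-k-1$ for a model with $k$ explanatory variables and an intercept.

```python
import math

def ser(sqr, n, k):
    """Standard error of regression for residual sum of squares `sqr`,
    sample size `n` and `k` explanatory variables plus an intercept."""
    return math.sqrt(sqr / (n - k - 1))

n = 10

# Baseline model with one explanatory variable (hypothetical SQR).
base = ser(16.0, n, k=1)

# Adding a weak regressor: SQR falls only slightly, so the SER rises.
weak = ser(15.0, n, k=2)

# Adding a strong regressor: SQR falls a lot, so the SER falls.
strong = ser(10.0, n, k=2)

print(f"baseline: {base:.3f}, weak addition: {weak:.3f}, strong addition: {strong:.3f}")
```

In this setting the standard error of regression falls only if the residual sum of squares shrinks by a larger factor than the degrees of freedom do, here by more than the factor 7/8.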

Simple linear regression

In simple linear regression, the standard error of the regression is defined by

$$\hat{\sigma} = +\sqrt{SQR/(n-2)} = +\sqrt{\frac{1}{n-2} \sum\limits_{i=1}^{n} \hat{\varepsilon}_i^2} = +\sqrt{\frac{1}{n-2} \sum\limits_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2},$$

with the least squares estimators

$$b_1 = \frac{\sum\nolimits_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sum\nolimits_{i=1}^{n} (x_i - \overline{x})^2} \quad \text{and} \quad b_0 = \overline{y} - b_1 \overline{x}$$

for the slope $\beta_1$ and the intercept $\beta_0$.

The representation is unbiased: because the degrees of freedom of the variance estimator are taken into account, it is unbiased under the Gauss-Markov assumptions, $\mathbb{E}(\hat{\sigma}^2) = \sigma^2$ (see also estimators for the variance of the disturbance variables). The standard error of the regression is calculated as the square root of the mean residual square and is a goodness-of-fit measure in its own right. It indicates how large the average deviation of the measured values from the regression line is. The larger the standard error of the regression, the worse the regression line describes the distribution of the measured values. The standard error of the regression is usually smaller than the standard error of the target variable, $\hat{\sigma}_y$. The coefficient of determination is reported more often than the standard error of the residuals, although the standard error of the residuals may be more useful in assessing goodness of fit. If $\hat{\sigma}^2$, the square of the standard error of the regression in simple linear regression, is inserted into the variance formulas for the estimators of $\beta_0$ and $\beta_1$, one obtains unbiased estimates of $\sigma_{\hat{\beta}_0}^2$ and $\sigma_{\hat{\beta}_1}^2$:

$$\hat{\sigma}_{\hat{\beta}_0}^2 = \hat{\sigma}^2 \frac{\sum\nolimits_{i=1}^{n} x_i^2}{n \sum\nolimits_{i=1}^{n} (x_i - \overline{x})^2} \quad \text{and} \quad \hat{\sigma}_{\hat{\beta}_1}^2 = \hat{\sigma}^2 \frac{1}{\sum\nolimits_{i=1}^{n} (x_i - \overline{x})^2}.$$

Furthermore, using the standard error of the residuals, confidence intervals can be constructed.
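The estimated variances of the intercept and slope estimators can be computed directly from these formulas. The following is a minimal sketch on hypothetical toy data; the variable names are illustrative assumptions. Building a confidence interval would additionally require a quantile of the t-distribution with $n-2$ degrees of freedom, which is not computed here.

```python
import math

# Hypothetical toy data (assumption, for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Least squares estimates of slope and intercept.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Unbiased estimate of the disturbance variance (squared standard error of regression).
sigma2_hat = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)

# Estimated variances of the intercept and slope estimators.
var_b0 = sigma2_hat * sum(xi ** 2 for xi in x) / (n * sxx)
var_b1 = sigma2_hat / sxx

se_b0 = math.sqrt(var_b0)
se_b1 = math.sqrt(var_b1)

print(f"SE(b0) = {se_b0:.4f}, SE(b1) = {se_b1:.4f}")
```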

Multiple linear regression

In multiple linear regression, the standard error of the regression is defined by

$$\hat{\sigma} = +\sqrt{MQR} = +\sqrt{SQR/(n-k-1)} = +\sqrt{\frac{\hat{\boldsymbol{\varepsilon}}^{\top} \hat{\boldsymbol{\varepsilon}}}{n-k-1}} = +\sqrt{\frac{\left(\mathbf{y} - \mathbf{X}\mathbf{b}\right)^{\top} \left(\mathbf{y} - \mathbf{X}\mathbf{b}\right)}{n-k-1}}$$

with the least squares estimator

$$\mathbf{b} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}.$$

An alternative representation of the standard error of the regression results from the fact that the residual sum of squares can also be written by means of the residual-generating matrix $\mathbf{Q}$ as $SQR = \hat{\boldsymbol{\varepsilon}}^{\top} \hat{\boldsymbol{\varepsilon}} = \boldsymbol{\varepsilon}^{\top} \mathbf{Q} \boldsymbol{\varepsilon}$. With $p = k+1$ denoting the number of estimated parameters, this gives the standard error of the regression

$$\hat{\sigma} = \sqrt{\frac{\mathbf{y}^{\top} \mathbf{y} - \mathbf{b}^{\top} \mathbf{X}^{\top} \mathbf{y}}{n-p}} = \sqrt{\frac{\mathbf{y}^{\top} \mathbf{Q} \mathbf{y}}{n-p}} = \sqrt{\frac{\boldsymbol{\varepsilon}^{\top} \mathbf{Q} \boldsymbol{\varepsilon}}{n-p}}.$$
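The agreement between the residual-based formula and the first alternative representation can be checked numerically. The sketch below is deliberately dependency-free: `solve` is a hypothetical helper that solves the normal equations by Gauss-Jordan elimination, and the data are invented for illustration. In practice a linear algebra library would normally be used instead.

```python
import math

def solve(a, rhs):
    """Solve the linear system a z = rhs by Gauss-Jordan elimination
    with partial pivoting (hypothetical helper for this sketch)."""
    m = len(rhs)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [u - factor * v for u, v in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]

# Hypothetical data: k = 2 explanatory variables, n = 6 observations.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9]

X = [[1.0, u, v] for u, v in zip(x1, x2)]  # first column is the intercept
n, p = len(X), len(X[0])                   # p = k + 1 estimated parameters

# Normal equations X'X b = X'y.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
b = solve(xtx, xty)

# Residual-based form: SQR = (y - Xb)'(y - Xb).
residuals = [yi - sum(bj * xij for bj, xij in zip(b, row)) for row, yi in zip(X, y)]
sqr = sum(e ** 2 for e in residuals)
ser_residuals = math.sqrt(sqr / (n - p))

# Alternative form: SQR = y'y - b'X'y.
sqr_alt = sum(yi ** 2 for yi in y) - sum(bj * cj for bj, cj in zip(b, xty))
ser_alt = math.sqrt(sqr_alt / (n - p))

print(f"SER from residuals: {ser_residuals:.6f}, from y'y - b'X'y: {ser_alt:.6f}")
```

The second form avoids computing the residuals explicitly, but it subtracts two large numbers of similar size and is therefore more sensitive to rounding error when the fit is very close.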

If one replaces the unknown $\sigma$ with the known estimate $\hat{\sigma}$ in the standard deviation $\sqrt{\operatorname{Var}(b_j)}$ of the respective parameter estimator, the standard error of the regression coefficient $b_j$ results from

$$\operatorname{SE}(b_j) = \sqrt{\frac{\tfrac{1}{n-p} \sum\nolimits_{i=1}^{n} \hat{\varepsilon}_i^2}{(1 - R_j^2) \sum\nolimits_{i=1}^{n} (x_{ij} - \overline{x}_j)^2}}.$$

The size of the standard errors of the estimated regression parameters therefore depends on the residual variance, the interdependence of the explanatory variables and the scatter of the respective explanatory variables.
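How the interdependence of the explanatory variables inflates a coefficient's standard error can be sketched as follows. The sketch assumes that $R_j^2$ is the coefficient of determination from regressing the $j$-th explanatory variable on the remaining ones; with only two regressors this equals their squared sample correlation. The data, the value of $\hat{\sigma}$ and the function names are hypothetical.

```python
import math

def r_squared_on_other(xj, xother):
    """R^2 from regressing xj on one other regressor plus an intercept,
    which equals the squared sample correlation in this two-regressor case."""
    n = len(xj)
    mj = sum(xj) / n
    mo = sum(xother) / n
    sjo = sum((a - mj) * (c - mo) for a, c in zip(xj, xother))
    sjj = sum((a - mj) ** 2 for a in xj)
    soo = sum((c - mo) ** 2 for c in xother)
    return sjo ** 2 / (sjj * soo)

def se_coefficient(sigma_hat, r2_j, xj):
    """SE(b_j) = sigma_hat / sqrt((1 - R_j^2) * sum_i (x_ij - mean_j)^2)."""
    mj = sum(xj) / len(xj)
    scatter = sum((a - mj) ** 2 for a in xj)
    return sigma_hat / math.sqrt((1.0 - r2_j) * scatter)

# Hypothetical regressors and a hypothetical standard error of regression.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
sigma_hat = 1.0

r2_1 = r_squared_on_other(x1, x2)
se_correlated = se_coefficient(sigma_hat, r2_1, x1)  # x1 correlated with x2
se_uncorrelated = se_coefficient(sigma_hat, 0.0, x1)  # benchmark with R_1^2 = 0

print(f"R_1^2 = {r2_1:.4f}")
print(f"SE(b_1) with correlation: {se_correlated:.4f}, without: {se_uncorrelated:.4f}")
```

For a fixed residual variance and a fixed scatter of $x_1$, the standard error grows without bound as $R_1^2$ approaches one.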

