Pseudo-coefficient of determination

from Wikipedia, the free encyclopedia

In linear regression, the coefficient of determination $R^2$ describes the proportion of the variability (variance) of a dependent variable $Y$ that is explained by a statistical model. When $Y$ is measured on a nominal or ordinal scale, however, there is no equivalent, since the variance, and thus an $R^2$, cannot be calculated. More general regression models can nevertheless be estimated by maximum likelihood estimation, and various pseudo-coefficients of determination (pseudo-$R^2$) have been proposed for them.

The pseudo-coefficient of determination

Pseudo-coefficients of determination are constructed so that they satisfy one of the various interpretations of the coefficient of determination (e.g. explained variance, improvement over the null model, or square of the correlation). They resemble it in that their values also lie in the interval between 0 and 1, and a higher value corresponds to a better fit of the model to the data.

Likelihood-based measures

Maddala's / Cox & Snell's pseudo-R²

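$$R^2 = 1 - \left(\frac{L_0}{L_1}\right)^{2/n},$$

with

$L_0$: likelihood of the null model,
$L_1$: likelihood of the model with explanatory variables,
$n$: number of observations.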

The measure compares the value of the likelihood function under the assumption that all variables are completely independent (null model or empty model), $L_0$, with the value of the likelihood function that takes the relationship between the explanatory variables and the dependent variable into account (full regression model), $L_1$. The smaller the ratio $L_0 / L_1$, the greater the improvement of the full model over the null model. Maddala's $R^2$ can never reach the value 1, even with a perfect prediction: for a discrete dependent variable $L_1 \le 1$, so the measure is bounded above by $1 - L_0^{2/n} < 1$.
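The upper bound can be illustrated with a minimal Python sketch. It assumes the usual form of the measure, 1 - (L0/L1)^(2/n), and works with log-likelihoods to avoid numerical underflow. The function name `cox_snell_r2` and the sample of 100 observations are hypothetical and serve only as illustration.

```python
import math

def cox_snell_r2(loglik_null: float, loglik_full: float, n: int) -> float:
    """Maddala / Cox & Snell pseudo-R^2 computed from log-likelihoods.

    Uses (L0 / L1)^(2/n) = exp((2/n) * (ln L0 - ln L1)).
    """
    return 1.0 - math.exp((2.0 / n) * (loglik_null - loglik_full))

# Hypothetical binary sample: 100 observations, half of them with y = 1.
n = 100
ll_null = n * math.log(0.5)   # null model predicts probability 0.5 for everyone
ll_perfect = 0.0              # perfect prediction: L1 = 1, hence ln L1 = 0

# Even the perfect model only reaches 1 - L0^(2/n) = 1 - 0.25 = 0.75.
print(round(cox_snell_r2(ll_null, ll_perfect, n), 6))
```

With these assumed numbers the measure stops at 0.75 although the prediction is perfect, which is the limitation the rescaled variants try to remove.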

Nagelkerke's / Cragg & Uhler's pseudo-R²

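$$R^2 = \frac{1 - \left(\frac{L_0}{L_1}\right)^{2/n}}{1 - L_0^{2/n}},$$

with

$L_0$: likelihood of the null model,
$L_1$: likelihood of the model with explanatory variables,
$n$: number of observations.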

Nagelkerke's pseudo-$R^2$ extends Maddala's pseudo-$R^2$ by rescaling it, so that the value 1 can be reached when the full model makes a perfect prediction, i.e. predicts every observed outcome with probability 1.
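The rescaling can be sketched in Python as follows. The sketch assumes the usual definition, in which Maddala's measure is divided by its maximum attainable value 1 - L0^(2/n). The function name `nagelkerke_r2` and the input numbers are hypothetical.

```python
import math

def nagelkerke_r2(loglik_null: float, loglik_full: float, n: int) -> float:
    """Nagelkerke / Cragg & Uhler pseudo-R^2 computed from log-likelihoods."""
    # Maddala / Cox & Snell measure: 1 - (L0 / L1)^(2/n)
    cox_snell = 1.0 - math.exp((2.0 / n) * (loglik_null - loglik_full))
    # Its largest possible value, reached when L1 = 1: 1 - L0^(2/n)
    max_cox_snell = 1.0 - math.exp((2.0 / n) * loglik_null)
    return cox_snell / max_cox_snell

# Hypothetical binary sample: 100 observations, half of them with y = 1.
n = 100
ll_null = n * math.log(0.5)

print(nagelkerke_r2(ll_null, 0.0, n))    # perfect prediction (ln L1 = 0) gives 1.0
print(nagelkerke_r2(ll_null, -50.0, n))  # an intermediate, made-up log-likelihood
```

Dividing by the maximum keeps the ordering of models unchanged and only stretches the scale to the full interval from 0 to 1.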

Nagelkerke also gave general conditions for a pseudo-coefficient of determination:

  1. A pseudo-coefficient of determination should match the coefficient of determination if both can be calculated.
  2. It should be maximized by the maximum likelihood estimate of the model.
  3. It should be, at least asymptotically, independent of the sample size.
  4. It should be interpretable as the proportion of the variability of $Y$ explained by the model.
  5. It should lie between zero and one. If the value is zero, the model should explain none of the variability of $Y$; if the value is one, the model should fully explain the variability of $Y$.
  6. It should not have a unit of measurement.

Log-likelihood-based measures

McFadden's R²

$$R^2 = 1 - \frac{\ln L_1}{\ln L_0},$$

with

$L_0$: likelihood of the null model,
$L_1$: likelihood of the model with explanatory variables.

The ratio of the log-likelihoods $\ln L_1$ and $\ln L_0$ reflects the degree of improvement of the full model with predictors over the null model. A model with a larger McFadden's $R^2$ fits better than a model with a smaller value.

Rule of thumb: values of $R^2$ between 0.2 and 0.4 already represent a particularly good fit of the model.
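A minimal Python sketch of McFadden's measure, 1 - ln(L1)/ln(L0), is given below. The function name `mcfadden_r2` and the two log-likelihood values are hypothetical and chosen only to produce round numbers.

```python
def mcfadden_r2(loglik_null: float, loglik_full: float) -> float:
    """McFadden's pseudo-R^2: one minus the ratio of the log-likelihoods."""
    return 1.0 - loglik_full / loglik_null

# Made-up log-likelihoods of a null model and a model with predictors.
ll_null = -100.0
ll_full = -70.0

print(round(mcfadden_r2(ll_null, ll_full), 3))  # 1 - 70/100 = 0.3
```

Because both log-likelihoods are negative for discrete outcomes, the ratio lies between 0 and 1, and the measure grows as the full model's log-likelihood approaches zero.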

McFadden's adjusted R²

McFadden's adjusted $R^2$ takes the number of predictors into account when assessing the goodness of fit of a model. As with the adjusted coefficient of determination, too many predictors that do not contribute enough to the model lower its efficiency and reduce the adjusted measure. As a result, values below 0 are possible.
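The following Python sketch shows how a negative value can arise. It assumes the commonly used form of the adjustment, 1 - (ln(L1) - K)/ln(L0) with K the number of predictors. This form is an assumption, since no formula is stated here, and the function name and input numbers are hypothetical.

```python
def mcfadden_adjusted_r2(loglik_null: float, loglik_full: float, k: int) -> float:
    """Adjusted McFadden pseudo-R^2 (assumed form): penalizes k predictors."""
    return 1.0 - (loglik_full - k) / loglik_null

# Made-up case: five predictors that barely improve the log-likelihood.
ll_null = -100.0
ll_full = -98.0

print(round(1.0 - ll_full / ll_null, 3))                      # unadjusted: 0.02
print(round(mcfadden_adjusted_r2(ll_null, ll_full, k=5), 3))  # adjusted: -0.03
```

In this made-up case the penalty of 5 outweighs the log-likelihood gain of 2, so the adjusted value drops below zero.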

Aldrich / Nelson's R²

with the constant c = 1 for the probit model and c = 3.29 for the logit model.

Aldrich / Nelson's $R^2$ is based on the likelihood ratio, which relates the likelihood of the null model to that of the alternative model for the observed outcomes. Its upper limit lies well below 1.

Correlation-based measures

Lave / Efron's R²

Lave / Efron's $R^2$ can be interpreted both as the square of a correlation and as explained variability, similar to the ordinary coefficient of determination. The squared residuals $(y_i - \hat{\pi}_i)^2$ are summed, where $\hat{\pi}_i$ is the probability predicted by the model for $y_i = 1$; it replaces the discrete dependent variable with a continuous prediction (note: $y_i$ can only take the values 0 and 1).
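A minimal Python sketch of this construction follows. It assumes the usual form of the measure, one minus the sum of squared residuals divided by the sum of squared deviations of the outcomes from their mean. The function name `efron_r2`, the outcomes and the predicted probabilities are hypothetical.

```python
def efron_r2(y: list[int], p_hat: list[float]) -> float:
    """Lave / Efron pseudo-R^2 from binary outcomes and predicted probabilities."""
    y_mean = sum(y) / len(y)
    # Sum of squared residuals between outcomes (0 or 1) and predicted probabilities
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, p_hat))
    # Total sum of squares around the mean outcome
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Made-up outcomes and model predictions for four observations.
y = [0, 0, 1, 1]
p_hat = [0.2, 0.4, 0.6, 0.8]

print(round(efron_r2(y, p_hat), 3))  # 1 - 0.4 / 1.0 = 0.6
```

The structure mirrors the ordinary coefficient of determination, with predicted probabilities taking the place of fitted values.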

Based on the explained variation

McKelvey & Zavoina's R²

McKelvey & Zavoina's $R^2$ is structurally modeled on the ordinary coefficient of determination. The estimated explained sum of squares of the regression is divided by the estimated total, i.e. the sum of the explained sum of squares of the regression and the unexplained sum of squares of the error.
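The ratio can be sketched in Python under the latent-variable formulation that is commonly used for this measure: the variance of the fitted linear predictor divided by that variance plus the error variance, which is 1 for the probit model and π²/3 ≈ 3.29 for the logit model. The function name, the use of the population variance and the example values are assumptions made for illustration.

```python
import math
import statistics

# Error variance of the latent variable: standard normal (probit), logistic (logit)
ERROR_VARIANCE = {"probit": 1.0, "logit": math.pi ** 2 / 3.0}

def mckelvey_zavoina_r2(linear_predictor: list[float], link: str = "logit") -> float:
    """McKelvey & Zavoina pseudo-R^2 from fitted values of the latent variable."""
    # Explained part: variance of the fitted linear predictor (population variance assumed)
    explained = statistics.pvariance(linear_predictor)
    return explained / (explained + ERROR_VARIANCE[link])

# Made-up fitted linear predictor values for five observations (variance 2.0).
xb = [-2.0, -1.0, 0.0, 1.0, 2.0]

print(round(mckelvey_zavoina_r2(xb, "probit"), 4))  # 2 / (2 + 1)
print(round(mckelvey_zavoina_r2(xb, "logit"), 4))   # 2 / (2 + pi^2 / 3)
```

Statistical packages may divide by n - 1 rather than n when estimating the variance, so small differences from this sketch are to be expected.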

Comparability

The values of the various pseudo-coefficients of determination can differ widely for one and the same model. This means that different measures cannot be compared across different data sets, nor interpreted in isolation. McKelvey & Zavoina's $R^2$ has proven to be the best approximation; the measures of Lave, McFadden and Nagelkerke strongly underestimate the "true" $R^2$ of a least squares estimation for a model with latent variables.

Example

A clothespin manufacturer would like to bring its new clothespins onto the market and therefore wants to calculate the probability of a purchase in advance. He consults his business partner, who has a statistics program. The partner assumes that the purchase depends on only one attribute, the price. The aggregate influence on the purchase decision is assumed to be a linear function of this attribute, also called the logit. The manufacturer, however, believes that the purchase intention depends on price, color and size. Using market research data, the computer iteratively determined the maximum likelihood estimates of the regression parameters of both models. The manufacturer now wonders which model hypothesis reflects reality better and should serve as the basis for further considerations. Various pseudo-coefficients of determination are to be used to assess the goodness of fit of the assumed models to the available data. The two business partners can obtain these from the statistics program.

Goodness-of-fit measure        Model 1 (price)    Model 2 (price, color, size)
McFadden R²                    0.307              0.445
McFadden adjusted R²           0.273              0.389
Cragg-Uhler (Nagelkerke) R²    0.436              0.578
McKelvey & Zavoina R²          0.519              0.643
Efron / Lave R²                0.330              0.472

Since the pseudo-coefficients of determination for model 2 are consistently higher, i.e. this model represents the market research data better, the partners decide in favor of it and use it to estimate the purchase probability and the possible market share.
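The decision rule used in this example, choosing the model for which every reported measure is higher, can be written as a short Python check. The values are those of the table in this example; the dictionary layout and variable names are hypothetical.

```python
# Pseudo-R^2 values from the example: (model 1, model 2)
measures = {
    "McFadden": (0.307, 0.445),
    "McFadden adjusted": (0.273, 0.389),
    "Cragg-Uhler (Nagelkerke)": (0.436, 0.578),
    "McKelvey & Zavoina": (0.519, 0.643),
    "Efron / Lave": (0.330, 0.472),
}

# Model 2 is preferred only if it is higher on every single measure.
model_2_always_higher = all(m2 > m1 for m1, m2 in measures.values())
print(model_2_always_higher)  # True

# The measures are compared within one row only, never across rows.
for name, (m1, m2) in measures.items():
    print(f"{name}: improvement of {m2 - m1:.3f}")
```

Each comparison stays within one measure, since values of different pseudo-coefficients of determination are not comparable with one another.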

Literature

  • Cragg, J. G., Uhler, R. (1970), "The Demand for Automobiles", Canadian Journal of Economics 3, pp. 386-406, JSTOR 133656.
  • Hagle, T. M., Mitchell II, G. E. (1992), "Goodness-of-Fit Measures for Probit and Logit", American Journal of Political Science 36, pp. 762-784, JSTOR 2111590.
  • McFadden, D. (1973), "Conditional Logit Analysis of Qualitative Choice Behavior", in: P. Zarembka (ed.), Frontiers in Econometrics, Academic Press: New York, ISBN 0-12-776150-0, pp. 105-142.
  • McKelvey, R., Zavoina, W. (1975), "A Statistical Model for the Analysis of Ordinal Level Dependent Variables", Journal of Mathematical Sociology 4, pp. 103-120, doi:10.1080/0022250X.1975.9989847.
  • Nagelkerke, N. J. D. (1991), "A Note on a General Definition of the Coefficient of Determination", Biometrika 78, No. 3, pp. 691-692, doi:10.1093/biomet/78.3.691.
  • Veall, M. R., Zimmermann, K. F. (1996), "Pseudo-R² Measures for Some Common Limited Dependent Variable Models", Collaborative Research Center 386, Paper 18, doi:10.5282/ubm/epub.1421.

References

  1. Veall, Zimmermann (1996), "Pseudo-R² Measures for Some Common Limited Dependent Variable Models", Collaborative Research Center 386, Paper 18.