Panel data analysis


Panel data analysis is the statistical analysis of panel data within panel research. Panel data combine the two dimensions of a cross section and a time series. The essential point of the analysis lies in controlling for the unobserved heterogeneity of the individuals.

Depending on the model chosen, a distinction is made between cohort, period and age effects. The larger number of observations increases the number of degrees of freedom and reduces collinearity, making the estimators more efficient. Compared with several independent cross-sectional regressions, panel data lead to better results when estimating exogenous variables. By using an individual-specific constant, the influence of constant, unmodelled variables can be captured; this makes the estimators more robust against incomplete model specification.

The gold standard of empirical research is the randomized controlled trial , which allows an analysis of causal relationships between the observed variables. Although a panel is still observational (there is no intervention), a key goal is to get as close as possible to causal analysis.

Static linear models

Static models do not take the development of the dependent variable over time into account. Their use makes sense if the reaction of the individual depends only on the exogenous variables, but not on past values of the variable under consideration. They include the pooled model and panel data models with random or fixed effects.

Pooled model

In the pooled model, the heterogeneity of the observations in both the time and the cross-sectional dimension is neglected: as in the usual linear regression model, all coefficients are treated as non-stochastic and identical for all observations. The estimators are more efficient than those from T separate cross-sectional regressions, and the standard errors of the coefficients decrease as the number of observations increases, provided that the coefficients do not differ significantly; heterogeneity, however, leads to biased estimates. It is also questionable whether the observations are independent when the same individuals are surveyed repeatedly ("serial correlation").
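
For illustration, a minimal sketch of the pooled estimator in Python (the data are simulated and all names are hypothetical): the N·T observations are simply stacked and a single least squares regression is run.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 5                      # individuals, periods (hypothetical sizes)

# one regressor with a common coefficient for all observations
x = rng.normal(size=(N, T))
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=(N, T))

# pooled model: stack all N*T observations and run a single OLS regression
X = np.column_stack([np.ones(N * T), x.ravel()])
beta_pooled, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
print("pooled OLS estimates (intercept, slope):", beta_pooled)
```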

Model with random effects

In the panel data model with random effects, more precisely the model with a random intercept (English: random intercept model), an individual-specific intercept is introduced which, for each individual, is a realization of a random variable that is identically distributed across all individuals:

$y_{it} = x_{it}^{\top}\beta + \alpha_i + u_{it}$, with $\alpha_i \sim \mathrm{IID}(0, \sigma_\alpha^2)$.

Here $y_{it}$ represents the value of the variable to be explained, $x_{it}$ the vector of the explanatory variables and $\beta$ the vector of the regression coefficients. The combined error term is made up of the individual-specific intercept $\alpha_i$ and the idiosyncratic (time-varying, unsystematic) error $u_{it}$.
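
The random effects model is usually estimated by (feasible) generalized least squares. A minimal Python sketch, assuming for simplicity that the variance components $\sigma_\alpha^2$ and $\sigma_u^2$ are known (in practice they are estimated first), uses the quasi-demeaning transformation:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 200, 6
sigma_alpha, sigma_u = 1.0, 0.5          # variance components, here treated as known

alpha = rng.normal(scale=sigma_alpha, size=(N, 1))     # individual random intercepts
x = rng.normal(size=(N, T))
y = 0.5 + 2.0 * x + alpha + rng.normal(scale=sigma_u, size=(N, T))

# random-effects GLS via quasi-demeaning: subtract theta times the individual
# mean from every variable, then run OLS on the transformed data
theta = 1.0 - np.sqrt(sigma_u**2 / (sigma_u**2 + T * sigma_alpha**2))
y_q = y - theta * y.mean(axis=1, keepdims=True)
x_q = x - theta * x.mean(axis=1, keepdims=True)

X = np.column_stack([np.full(N * T, 1.0 - theta), x_q.ravel()])  # transformed constant
beta_re, *_ = np.linalg.lstsq(X, y_q.ravel(), rcond=None)
print("RE (quasi-demeaned GLS) estimates (intercept, slope):", beta_re)
```

For $\theta = 1$ the transformation reduces to the within transformation of the fixed effects model, for $\theta = 0$ to pooled OLS.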

Fixed effects model

In the panel data model with fixed effects, on the other hand, the intercept $\alpha_i$ varies systematically across individuals, while the coefficient vector $\beta$ remains the same for all individuals. The $\alpha_i$ are therefore parameters to be estimated and, as in the RE model, capture the heterogeneity of the individuals only through a level shift, i.e. through different intercepts. The influence of the explanatory variables is assumed to be the same for all individuals. This approach thus explains why an observation deviates from the individual mean, but not the differences between the (mean) values of different individuals. Time-constant variables are therefore not identified in the model with fixed effects.

Examples:
  • the unobservable skills of management influence the profit situation of companies
  • Training influences the salary situation of employees
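
A minimal Python sketch of the within transformation that underlies the fixed effects model (simulated data, hypothetical parameters): subtracting the individual means removes $\alpha_i$ and all time-constant variables.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 200, 6

alpha = 2.0 * rng.normal(size=(N, 1))          # fixed individual-specific intercepts
x = rng.normal(size=(N, T)) + alpha            # regressor correlated with alpha_i
y = 2.0 * x + alpha + rng.normal(scale=0.5, size=(N, T))

# within transformation: subtract each individual's mean; alpha_i and all
# time-constant variables drop out of the transformed equation
y_w = y - y.mean(axis=1, keepdims=True)
x_w = x - x.mean(axis=1, keepdims=True)

beta_fe = (x_w.ravel() @ y_w.ravel()) / (x_w.ravel() @ x_w.ravel())
print("within (FE) estimate of the slope:", beta_fe)   # close to 2 despite correlated alpha_i
```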

Comparison of the models

In general, models with random effects should be preferred if the characteristics of a population are to be inferred from a few individuals. Models with fixed effects are particularly useful when predictions (inferences) are to be made only for the sample under consideration; however, they should also be used in the former case if $x_{it}$ and $\alpha_i$ are correlated, since the model with random effects then leads to inconsistent and biased estimates. One argument against FE models is the loss of degrees of freedom, since an additional parameter has to be estimated for each individual. If the variance of the values within an individual (within variance) is much smaller than the variance between the individuals (between variance), the FE model is disadvantageous: part of the information is ignored, and it is assumed that the individual means say nothing about the relationship between the variables.

The model with time (period) effects is likewise based on static methods, but uses a time-dependent constant that applies to all individuals to capture level differences in the various periods. The coefficient vector can be estimated in the same way as in an FE or RE model. Since the time-dependent constant has to be re-estimated for each period, this model is not suitable for forecasting.
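
A sketch of this idea under the assumption that the period effects enter as simple level shifts, implemented with time dummies (simulated data, hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 150, 4
lam = np.array([0.0, 0.5, 1.0, 1.5])          # hypothetical period-specific level shifts

x = rng.normal(size=(N, T))
y = 1.0 + 2.0 * x + lam + rng.normal(scale=0.5, size=(N, T))

# period effects via time dummies, with the first period as the baseline
period = np.tile(np.arange(T), N)                          # period index per stacked row
D = (period[:, None] == np.arange(1, T)).astype(float)     # (N*T, T-1) dummy columns
X = np.column_stack([np.ones(N * T), x.ravel(), D])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
print("slope:", coef[1], "period effects relative to period 0:", coef[2:])
```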

Another way to take changes over time into account is the use of so-called distributed lag models (models with distributed delays), which spread the effect of a change in an independent variable on the explained variable over an (in principle) infinite time horizon. Such a construction accounts for effects that are delayed for psychological, technological or institutional reasons. In these models, particular attention must be paid to multicollinearity. In addition, problems arise in choosing the correct number of lags, and observation values are lost: as the number of parameters increases, the number of usable observations decreases.

The Hausman specification test is a test procedure for deciding whether a model with fixed effects (FE model) or a model with random effects (RE model) is appropriate.
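
A sketch of the test statistic, assuming the FE and RE coefficient vectors and their covariance matrices have already been estimated (only coefficients identified in both models are compared); under the null hypothesis the statistic is asymptotically chi-squared distributed with as many degrees of freedom as compared coefficients:

```python
import numpy as np
from scipy.stats import chi2

def hausman_test(b_fe, b_re, cov_fe, cov_re):
    """Hausman statistic H = (b_FE - b_RE)' [Var(b_FE) - Var(b_RE)]^{-1} (b_FE - b_RE).

    b_fe, b_re     -- coefficient vectors from the FE and the RE estimation
                      (only the coefficients identified in both models)
    cov_fe, cov_re -- their estimated covariance matrices
    Returns the statistic and its asymptotic chi-squared p-value.
    """
    diff = np.asarray(b_fe) - np.asarray(b_re)
    var_diff = np.asarray(cov_fe) - np.asarray(cov_re)   # valid under H0 (RE is efficient)
    H = float(diff @ np.linalg.solve(var_diff, diff))
    p_value = chi2.sf(H, df=diff.size)
    return H, p_value
```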

Dynamic models

Dynamic models contain a lagged endogenous variable, either implicitly via the error term (autoregressive models) or explicitly (LDV = "lagged dependent variable"), for example $y_{i,t-1}$ when $y_{it}$ is to be explained. This approach implements the intuitively plausible idea that the previous period's level is a primitive forecast of the current value. The dynamic LDV model is:

$y_{it} = \rho\, y_{i,t-1} + x_{it}^{\top}\beta + \alpha_i + u_{it}$, with $u_{it} \sim \mathrm{IID}(0, \sigma_u^2)$, i.e. all error terms are independently and identically distributed (iid) with expected value $0$ and variance $\sigma_u^2$.

The coefficient $\rho$ of the lagged dependent variable cannot be interpreted causally (as in the static model); rather, it describes the speed at which the dynamic adjustment takes place.
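
A small simulation of a purely autoregressive LDV model (hypothetical parameters) also previews the estimation problem discussed in the section on estimation procedures: for short panels the within estimate of $\rho$ is noticeably biased downwards.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, rho = 500, 5, 0.5                         # short panel, true rho = 0.5

alpha = rng.normal(size=(N, 1))                 # individual effects
y = np.zeros((N, T + 1))
y[:, [0]] = alpha                               # arbitrary starting values
for t in range(1, T + 1):
    y[:, [t]] = rho * y[:, [t - 1]] + alpha + rng.normal(scale=0.5, size=(N, 1))

y_lag, y_cur = y[:, :-1], y[:, 1:]              # y_{i,t-1} and y_{i,t}

# within (FE) estimate of rho: demean per individual, then regress
yl = y_lag - y_lag.mean(axis=1, keepdims=True)
yc = y_cur - y_cur.mean(axis=1, keepdims=True)
rho_within = (yl.ravel() @ yc.ravel()) / (yl.ravel() @ yl.ravel())
print("within estimate of rho:", rho_within)    # noticeably below 0.5 for small T
```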

Estimation procedure

Estimation methods in static models

For static models, the pooled least squares estimator is used in the pooled model, the LSDV estimator (LSDV for least squares dummy variable, i.e. a least squares estimator with dummy variables) in the fixed effects model, and the feasible generalized least squares (FGLS) estimator in the model with random effects.
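
A minimal sketch of the LSDV estimator (simulated data, hypothetical names): each individual receives its own dummy variable, and the slope coefficient coincides with the within estimate.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 50, 4

alpha = rng.normal(size=(N, 1))
x = rng.normal(size=(N, T)) + alpha
y = 2.0 * x + alpha + rng.normal(scale=0.5, size=(N, T))

# LSDV: one dummy column per individual, then ordinary least squares
D = np.kron(np.eye(N), np.ones((T, 1)))        # (N*T, N) matrix of individual dummies
X = np.column_stack([x.ravel(), D])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
print("LSDV slope estimate:", coef[0])          # identical to the within (FE) estimate
```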

Estimation methods in dynamic models

In dynamic models, the lagged endogenous variable $y_{i,t-1}$ depends on the individual effect $\alpha_i$: the error terms transformed to individual means and the lagged variables are correlated with one another, regardless of whether the individual effects are treated as fixed or random. Least squares (OLS) estimators are therefore biased and inconsistent for finite time horizons T; even for moderately large T the bias is still clearly visible, and the asymptotic bias is of order $O(1/T)$. This Landau symbol simply means that the bias decreases at most as fast as $1/T$.

An alternative is therefore offered by certain generalized method of moments (GMM) estimators, a generic term for many linear and non-linear estimation methods that includes least squares estimation as well as the instrumental variables (IV) methods discussed next. Such methods do not require assumptions about the distribution of the error terms, allow heteroscedasticity and can be solved (numerically) even when no analytical solution is available. If an explanatory variable is correlated with the error term, IV estimators lead to consistent estimates, provided no other conditions are violated. As in the present case, this correlation can be caused by endogenous variables, but also by omitted explanatory variables, self-selection (individuals take part in the survey only if their opinion is positive) or measurement errors.

In the IV method, the correlation between the regressor and the error term is eliminated, at least asymptotically, by replacing the problematic regressor with instruments that are closely related to it (i.e. relevant) but neither correlate with the error term nor represent a linear combination of other explanatory variables, and are therefore valid. If the number of instruments equals the number of explanatory variables, one speaks of the (simple) IV model, in which exogenous variables can serve as their own instruments; if there are more instruments than explanatory variables, the model is over-identified and one obtains the GIVE, the "generalized instrumental variables estimator", which is more efficient but possibly more biased in finite samples. In the exactly identified case, the estimator is $\hat{\beta}_{\mathrm{IV}} = (Z^{\top}X)^{-1} Z^{\top} y$, where $Z$ is the matrix of available instruments. This equation can also be derived from the GIVE:

$\hat{\beta}_{\mathrm{GIVE}} = \left(X^{\top} Z\, W_N\, Z^{\top} X\right)^{-1} X^{\top} Z\, W_N\, Z^{\top} y$, which reduces to the simple IV estimator if the number of instruments equals the number of explanatory variables, since $Z^{\top}X$ is then square and invertible and the weight matrix $W_N$ cancels.

This estimator results from the minimization of a quadratic function of the sample moments. As long as the weight matrix $W_N$ is positive definite, the estimators are consistent, since the quadratic form to be minimized can only take positive values and tends towards zero as N increases. Since every scalar multiple of the inverse covariance matrix of the sample moments leads to efficient estimators, the optimal weight matrix under the assumption of homoscedastic, serially uncorrelated errors is:

$W_N^{\mathrm{opt}} = \left(\tfrac{1}{N}\, Z^{\top} Z\right)^{-1}$.

The resulting GIVE is also called the two-stage least squares estimator (English: two stage least squares estimator, 2SLS estimator for short), since it can also be obtained from two consecutive least squares regressions.
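
A generic sketch of this two-stage procedure (names and shapes are placeholders): the endogenous regressors are first regressed on the instruments, and the fitted values are then used in the second regression.

```python
import numpy as np

def two_stage_least_squares(y, X, Z):
    """2SLS / GIVE via two consecutive least squares regressions.

    y -- (n,)   dependent variable
    X -- (n, k) explanatory variables (possibly endogenous)
    Z -- (n, r) instruments, r >= k
    """
    # stage 1: project the regressors onto the column space of the instruments
    first_stage, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_hat = Z @ first_stage
    # stage 2: OLS of y on the fitted values from stage 1
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return beta
```

With the weight matrix $W_N = \left(\tfrac{1}{N} Z^{\top}Z\right)^{-1}$ from above, the GIVE formula yields exactly the same coefficients.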

Simulation studies have shown that the variances of IV estimators are often quite large for small to medium-sized samples. This is true in particular in comparison with least squares estimators, and it is aggravated by a weak correlation between the endogenous regressor and the instrument, since the estimators are then inconsistent even when the correlation between the instrument and the error term is small. The number of observations required depends on the model context. Another problem is the choice of instruments: in the simplest case, for example, exogenous variables from earlier periods or differences of these can be used; the further back in time they lie, the weaker they are likely to be. There are also computational limits: an IV estimator proposed by Ahn/Schmidt with additional moment conditions reaches 2,250 columns for 15 periods and 10 explanatory variables, dimensions that many programs cannot handle even today. The assumptions made in the moment conditions cannot be tested statistically. Only if there are more conditions than necessary (over-identification) can it be checked whether moment conditions are superfluous, but not which ones. If the instruments are valid, more moment conditions lead to more efficient estimators. The Arellano-Bond estimator (AB estimator) increases the number of these conditions by taking lagged levels of the dependent and predetermined variables as well as changes in the exogenous variables into account; the number of conditions depends on whether the model contains:

  • one lagged variable and no exogenous variables,
  • one lagged variable and K strictly exogenous variables,
  • one lagged variable and additional predetermined variables; in contrast to strictly exogenous variables, these may depend on previous realizations of the error term: $E(x_{it} u_{is}) \neq 0$ for $s < t$ and zero otherwise.

In general, this results in the following estimator:

$\hat{\delta} = \left(\Delta X^{\top} Z\, W_N\, Z^{\top} \Delta X\right)^{-1} \Delta X^{\top} Z\, W_N\, Z^{\top} \Delta y$,

with $Z$ the matrix of instruments implied by the moment conditions, $W_N$ the weight matrix and $\Delta y$, $\Delta X$ the changes in the explained and explanatory variables, $\Delta y_{it} = y_{it} - y_{i,t-1}$ and $\Delta x_{it} = x_{it} - x_{i,t-1}$. However, the method assumes uncorrelated error terms. In subsequent tests, it should be noted that the standard errors are biased downwards, which can lead to an unjustified neglect of an explanatory variable. With minor adjustments, this procedure can also be used for unbalanced panel data.
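
As a simplified illustration of the underlying idea (not the full Arellano-Bond GMM estimator), the following Python sketch estimates $\rho$ in a purely autoregressive panel model by first differencing and using the level $y_{i,t-2}$ as instrument for $\Delta y_{i,t-1}$, as proposed by Anderson and Hsiao; all names are hypothetical.

```python
import numpy as np

def anderson_hsiao(y):
    """First-difference IV estimate of rho in y_it = rho*y_{i,t-1} + alpha_i + u_it.

    First differencing removes alpha_i; the lagged difference Delta y_{i,t-1}
    is then instrumented with the level y_{i,t-2}, which is relevant but
    uncorrelated with Delta u_it if the errors are serially uncorrelated.
    (The Arellano-Bond estimator extends this idea by using all available
    lagged levels as instruments within a GMM framework.)

    y -- (N, T) array of the dependent variable
    """
    dy  = y[:, 2:] - y[:, 1:-1]     # Delta y_{i,t}     for t = 2, ..., T-1
    dyl = y[:, 1:-1] - y[:, :-2]    # Delta y_{i,t-1}
    z   = y[:, :-2]                 # instrument: the level y_{i,t-2}
    return (z.ravel() @ dy.ravel()) / (z.ravel() @ dyl.ravel())
```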

Literature

  • Badi H. Baltagi: Econometric Analysis of Panel Data . 5th edition. John Wiley & Sons, 2013, ISBN 978-1-118-69922-5 .
  • U. Engel, J. Reinecke: Panel analysis: basics, techniques, examples. de Gruyter, Berlin 1994, ISBN 3-11-013570-1 .
  • Edward W. Frees: Longitudinal and Panel Data - analysis and applications in the social sciences. Cambridge University Press, Cambridge et al. 2004.
  • M. Giesselmann, M. Windzio: Regression models for the analysis of panel data. Springer VS, Wiesbaden 2012, ISBN 978-3-531-18694-8 .
  • B. O. Muthén: Latent Variable Analysis: Growth mixture modeling and related techniques for longitudinal data. In: David Kaplan (Ed.): The Sage handbook of quantitative methodology for the social sciences. Sage, Thousand Oaks 2004, ISBN 0-7619-2359-4, pp. 345-368.
  • Jeffrey M. Wooldridge: Econometric analysis of cross section and panel data. 2nd Edition. MIT Press, Cambridge 2010, ISBN 978-0-262-23258-6.
