Panel Data Analysis II

In this section we estimate our first set of statistical models using panel data: Pooled OLS and Between Effects. We show some examples of how to estimate and interpret these models, and reflect on the conditions under which the models are appropriate.

What we can relax about

In the sessions demonstrating how to quantitatively analyse panel data, we will cast aside the following concerns:

  • Missing data

  • Weights

  • Attrition

  • Multicollinearity

All of these issues impinge on the estimation of panel data models but are not necessary to address for the purposes of learning about said models. We encourage you to consult the reading list for suggestions of resources that cover these topics.

Defining our statistical model

Now we arrive at the interesting bit: estimating statistical models.

Let’s return to our panel data on charities and define a statistical model for predicting a charity’s annual gross income as a function of its age, the scale of its charitable activities, where it is located, what type of charity it is, and the number of sources of income it has, and the share of its income provided by government.

\[ \text{y}_{it} = \beta_0 + \beta_1x_{1it} + \beta_2x_{2i} + \beta_3x_{3i} + \beta_4x_{4i} + \beta_5x_{5it} + \beta_6x_{6it} + \epsilon_{it} \tag{1.6} \]

Where:

\(\text{y}_{it}\) is log income for charity i at time t

\(\beta_0\) is the constant term, which is our prediction for log income when the values of all other variables in the model are set to 0

\(\text{x}_{1it}\) captures the age of charity i at time t, and \(\beta_1\) is the effect of this variable on the outcome

\(\text{x}_{2i}\) is a dummy variable identifying charities that operate at a local level

\(\text{x}_{3i}\) is a dummy variable identifying charities registered in Westminster

\(\text{x}_{4i}\) is a dummy variable identifying general charities

\(\text{x}_{5it}\) captures the number of sources of income for charity i at time t

\(\text{x}_{6it}\) captures the share its income charity i derives from government sources at time t

\(\epsilon_{it}\) captures the residual for charity i at time t (\(\text{y}_{it} - \hat{y}_{it}\))

Understanding sources of variation

Remember to keep in mind the two sources of variation that exist in panel data (Gould, n.d.):

  1. Cross-section information on differences between units

  2. Time series information on differences over time within units

Data exploration

Let’s spend a little bit of time exploring the key variables in our statistical model.

use "./data/charity-panel-analysis-2020-09-10.dta", clear
(Contains annual accounts of charities in E&W for financial years 2006-2017)

Histogram of Log Income

Histogram of Government Funding Share

sum orgage, detail
                  Age of charity - in years
-------------------------------------------------------------
      Percentiles      Smallest
 1%            4              0
 5%            7              1
10%           10              1       Obs              23,826
25%           16              1       Sum of Wgt.      23,826
50%           27                      Mean           39.20129
                        Largest       Std. Dev.       42.4661
75%           48            496
90%           82            497       Variance       1803.369
95%          112            498       Skewness       4.595531
99%          180            499       Kurtosis       37.17673
sum nsources, detail
      Number of income sources where income >= £1,000
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              0
 5%            2              0
10%            2              1       Obs              23,826
25%            3              1       Sum of Wgt.      23,826
50%            4                      Mean           3.806724
                        Largest       Std. Dev.       1.24789
75%            5              6
90%            5              6       Variance       1.557228
95%            6              6       Skewness      -.1130695
99%            6              6       Kurtosis       2.425233
tab1 localc socser
-> tabulation of localc  
      Local |
    charity |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      8,756       36.75       36.75
          1 |     15,070       63.25      100.00
------------+-----------------------------------
      Total |     23,826      100.00
-> tabulation of socser  
     Social |
    service |
    charity |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     20,449       85.83       85.83
          1 |      3,377       14.17      100.00
------------+-----------------------------------
      Total |     23,826      100.00

Pooled OLS Model

The starting point for any statistical modelling of panel data is to estimate a Pooled OLS model (fancy way of saying linear regression).

The observations are “pooled”, which just means we ignore the nested nature of panel data. In other words we assume that each observation (i.e., row within a long format data set) is independent of other observations (Gayle and Lambert, 2018).

Fundamental problem of pooling observations (Gayle & Lambert, 2018, p. 58):

The model does not recognise that there are multiple contributions of data from the same individuals, and therefore, it estimates results as if there are many individuals who shared the same characteristics. This impacts upon the estimate of measures such as variances and standard errors.

regress linc orgage localc west genchar nsources govern_share
est store pols
      Source |       SS           df       MS      Number of obs   =    23,826
-------------+----------------------------------   F(6, 23819)     =    410.54
       Model |   2225.8864         6  370.981066   Prob > F        =    0.0000
    Residual |  21523.6961    23,819  .903635591   R-squared       =    0.0937
-------------+----------------------------------   Adj R-squared   =    0.0935
       Total |  23749.5825    23,825  .996834524   Root MSE        =     .9506
------------------------------------------------------------------------------
        linc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      orgage |   .0036028     .00015    24.01   0.000     .0033087    .0038969
      localc |  -.3302434   .0130224   -25.36   0.000    -.3557682   -.3047187
        west |   .1121865   .0253139     4.43   0.000     .0625697    .1618033
     genchar |  -.3170303   .0139082   -22.79   0.000    -.3442913   -.2897693
    nsources |   .1053884   .0050963    20.68   0.000     .0953993    .1153774
govern_share |    .000644   .0002035     3.16   0.002     .0002451    .0010429
       _cons |   14.96317   .0236406   632.94   0.000     14.91683    15.00951
------------------------------------------------------------------------------

Conditions where Pooled OLS is suitable

Pooled OLS can produce consistent estimates of the explanatory variables if:

  • The model is correctly specified

  • The explanatory variables are uncorrelated with the error term (Cameron and Trivedi, 2010)

TASK: Do you think our statistical model is correctly specified, and there is no correlation between error term and explanatory variables?

In our statistical model of charity income, it is unlikely that the interpretation of the coefficients would change drastically if we addressed the under-estimation of the standard errors (the sample size is very large).

We’ll cover the various tests and checks we can perform to examine whether Pooled OLS model violates the independence of errors assumption in a later section.

Between Effects Model

Once again estimate a cross-sectional model (Pooled OLS). However this time we transform the data so that there is one observation per unit. As a result we end up modelling the mean of Y on the mean of our X variables.

xtreg linc orgage localc west genchar nsources govern_share, be
est store beff
Between regression (regression on group means)  Number of obs     =     23,826
Group variable: regno                           Number of groups  =      2,166
R-sq:                                           Obs per group:
     within  = 0.0063                                         min =         11
     between = 0.1042                                         avg =       11.0
     overall = 0.0925                                         max =         11
                                                F(6,2159)         =      41.86
sd(u_i + avg(e_i.))=  .9109813                  Prob > F          =     0.0000
------------------------------------------------------------------------------
        linc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      orgage |   .0035048   .0004791     7.32   0.000     .0025652    .0044443
      localc |  -.3282906   .0414364    -7.92   0.000      -.40955   -.2470312
        west |   .1167392   .0805284     1.45   0.147     -.041182    .2746604
     genchar |  -.3210918   .0449127    -7.15   0.000    -.4091685    -.233015
    nsources |   .1384596   .0193749     7.15   0.000     .1004643    .1764549
govern_share |   .0002749   .0007478     0.37   0.713    -.0011916    .0017415
       _cons |   14.85058   .0835091   177.83   0.000     14.68681    15.01435
------------------------------------------------------------------------------

Estimating a Between Effects model is equivalent to collapsing the data and estimating your regression model on the resulting observations:

preserve
    collapse (mean) linc orgage localc west genchar nsources govern_share, by(regno)
    regress linc orgage localc west genchar nsources govern_share
    est store coll
restore
      Source |       SS           df       MS      Number of obs   =     2,166
-------------+----------------------------------   F(6, 2159)      =     41.86
       Model |  208.427331         6  34.7378885   Prob > F        =    0.0000
    Residual |  1791.72593     2,159  .829886952   R-squared       =    0.1042
-------------+----------------------------------   Adj R-squared   =    0.1017
       Total |  2000.15326     2,165  .923858319   Root MSE        =    .91098
------------------------------------------------------------------------------
        linc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      orgage |   .0035048   .0004791     7.32   0.000     .0025652    .0044443
      localc |  -.3282906   .0414364    -7.92   0.000      -.40955   -.2470313
        west |   .1167393   .0805284     1.45   0.147    -.0411819    .2746605
     genchar |  -.3210918   .0449127    -7.15   0.000    -.4091685    -.233015
    nsources |   .1384596   .0193749     7.15   0.000     .1004643     .176455
govern_share |   .0002749   .0007478     0.37   0.713    -.0011916    .0017415
       _cons |   14.85058   .0835091   177.83   0.000     14.68681    15.01435
------------------------------------------------------------------------------

est table pols beff coll
-----------------------------------------------------
    Variable |    pols         beff         coll     
-------------+---------------------------------------
      orgage |  .00360282    .00350476    .00350476  
      localc | -.33024344    -.3282906   -.32829062  
        west |  .11218649    .11673923    .11673928  
     genchar | -.31703032   -.32109178   -.32109176  
    nsources |  .10538836     .1384596    .13845961  
govern_share |  .00064402    .00027494    .00027494  
       _cons |  14.963168    14.850581    14.850581  
-----------------------------------------------------

Benefits of Between Effects

  • Sidesteps the problem of interdependence of observations in the original panel data.

  • Smooths the effect of anomalous time periods (e.g., excess deaths calculation).

  • Controls for omitted variables that change over time but are constant between units (e.g., national policies).

Limitations of Between Effects

What might the limitations of this approach be?

  • Cannot estimate observed variables that change over time but are constant between units (e.g., national policies).

  • Discard a lot of information by examining mean outcomes and inputs e.g., change over time.

  • Cannot control for unobserved explanatory variables that are constant within but vary between units e.g., organisational culture.

The limitations of the Between Effects model far outweigh the benefits in most cases, and thus it is not widely used in practice (Mehmetoglu and Jakobsen, 2016). However it plays a crucial role in the estimation of another panel data model — Random Effects model — and thus it is important to understand how it works and what it offers.

Summary

Both the Pooled OLS and Between Effects models provide useful information on the association between an outcome Y and a set of explanatory variables X.

However both can be suboptimal from a substantive perspective (no change over time).

More concerningly, they offer no ability to control for residual heterogeniety in the form of unobserved time-invariant explanatory variables.