Panel Data Analysis II

In this section we estimate our first set of statistical models using panel data: Pooled OLS and Between Effects. We show some examples of how to estimate and interpret these models, and reflect on the conditions under which the models are appropriate.

What we can relax about

In the sessions demonstrating how to quantitatively analyse panel data, we will cast aside the following concerns:

Missing data
Weights
Attrition
Multicollinearity

All of these issues impinge on the estimation of panel data models but are not necessary to address for the purposes of learning about said models. We encourage you to consult the reading list for suggestions of resources that cover these topics.

Defining our statistical model

Now we arrive at the interesting bit: estimating statistical models.

Let’s return to our panel data on charities and define a statistical model for predicting a charity’s annual gross income as a function of its age, the scale of its charitable activities, where it is located, what type of charity it is, the number of sources of income it has, and the share of its income provided by government.

\[ \text{y}_{it} = \beta_0 + \beta_1x_{1it} + \beta_2x_{2i} + \beta_3x_{3i} + \beta_4x_{4i} + \beta_5x_{5it} + \beta_6x_{6it} + \epsilon_{it} \tag{1.6} \]

Where:

\(\text{y}_{it}\) is log income for charity i at time t

\(\beta_0\) is the constant term, which is our prediction for log income when the values of all other variables in the model are set to 0

\(\text{x}_{1it}\) captures the age of charity i at time t, and \(\beta_1\) is the effect of this variable on the outcome

\(\text{x}_{2i}\) is a dummy variable identifying charities that operate at a local level

\(\text{x}_{3i}\) is a dummy variable identifying charities registered in Westminster

\(\text{x}_{4i}\) is a dummy variable identifying general charities

\(\text{x}_{5it}\) captures the number of sources of income for charity i at time t

\(\text{x}_{6it}\) captures the share of its income that charity i derives from government sources at time t

\(\epsilon_{it}\) captures the residual for charity i at time t (\(\text{y}_{it} - \hat{y}_{it}\))

Understanding sources of variation

Remember to keep in mind the two sources of variation that exist in panel data (Gould, n.d.):

Cross-section information on differences between units
Time series information on differences over time within units
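One way to see these two sources of variation in practice is Stata's `xtsum` command, which splits the standard deviation of each variable into a between-unit and a within-unit component. The following is a minimal sketch rather than part of the original analysis. It assumes that `regno`, the charity identifier reported as the group variable in the model output later in this section, is the panel identifier.

```stata
* Load the panel and declare the charity identifier as the panel variable
* (regno is assumed to identify charities; no time variable is needed for xtsum)
use "./data/charity-panel-analysis-2020-09-10.dta", clear
xtset regno

* Split each variable's variation into between-charity and within-charity parts
xtsum linc orgage nsources govern_share
```

In the resulting table, the "between" row describes how much charity-level means differ from one another, and the "within" row describes how much each charity deviates from its own mean over time. A variable that never changes within a charity has a within standard deviation of zero.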

Data exploration

Let’s spend a little bit of time exploring the key variables in our statistical model.

use "./data/charity-panel-analysis-2020-09-10.dta", clear

(Contains annual accounts of charities in E&W for financial years 2006-2017)

Histogram of Log Income

Histogram of Government Funding Share

sum orgage, detail

                  Age of charity - in years
-------------------------------------------------------------
      Percentiles      Smallest
 1%            4              0
 5%            7              1
10%           10              1       Obs              23,826
25%           16              1       Sum of Wgt.      23,826
50%           27                      Mean           39.20129
                        Largest       Std. Dev.       42.4661
75%           48            496
90%           82            497       Variance       1803.369
95%          112            498       Skewness       4.595531
99%          180            499       Kurtosis       37.17673

sum nsources, detail

      Number of income sources where income >= £1,000
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              0
 5%            2              0
10%            2              1       Obs              23,826
25%            3              1       Sum of Wgt.      23,826
50%            4                      Mean           3.806724
                        Largest       Std. Dev.       1.24789
75%            5              6
90%            5              6       Variance       1.557228
95%            6              6       Skewness      -.1130695
99%            6              6       Kurtosis       2.425233

tab1 localc socser

-> tabulation of localc

      Local |
    charity |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      8,756       36.75       36.75
          1 |     15,070       63.25      100.00
------------+-----------------------------------
      Total |     23,826      100.00

-> tabulation of socser

     Social |
    service |
    charity |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     20,449       85.83       85.83
          1 |      3,377       14.17      100.00
------------+-----------------------------------
      Total |     23,826      100.00

Pooled OLS Model

The starting point for any statistical modelling of panel data is to estimate a Pooled OLS model (a fancy way of saying linear regression).

The observations are “pooled”, which just means we ignore the nested nature of panel data. In other words, we assume that each observation (i.e., each row in a long format data set) is independent of every other observation (Gayle and Lambert, 2018).

The fundamental problem with pooling observations (Gayle and Lambert, 2018, p. 58):

The model does not recognise that there are multiple contributions of data from the same individuals, and therefore, it estimates results as if there are many individuals who shared the same characteristics. This impacts upon the estimate of measures such as variances and standard errors.

regress linc orgage localc west genchar nsources govern_share
est store pols

      Source |       SS           df       MS      Number of obs   =    23,826
-------------+----------------------------------   F(6, 23819)     =    410.54
       Model |   2225.8864         6  370.981066   Prob > F        =    0.0000
    Residual |  21523.6961    23,819  .903635591   R-squared       =    0.0937
-------------+----------------------------------   Adj R-squared   =    0.0935
       Total |  23749.5825    23,825  .996834524   Root MSE        =     .9506
------------------------------------------------------------------------------
        linc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      orgage |   .0036028     .00015    24.01   0.000     .0033087    .0038969
      localc |  -.3302434   .0130224   -25.36   0.000    -.3557682   -.3047187
        west |   .1121865   .0253139     4.43   0.000     .0625697    .1618033
     genchar |  -.3170303   .0139082   -22.79   0.000    -.3442913   -.2897693
    nsources |   .1053884   .0050963    20.68   0.000     .0953993    .1153774
govern_share |    .000644   .0002035     3.16   0.002     .0002451    .0010429
       _cons |   14.96317   .0236406   632.94   0.000     14.91683    15.00951
------------------------------------------------------------------------------
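Because the outcome is log income, each coefficient is approximately the proportional change in income associated with a one-unit change in the explanatory variable, and the exact percentage change is 100 × (exp(β) − 1). The following is a minimal sketch of that conversion, assuming the charity panel is still in memory. It re-runs the model quietly so that `_b[]` refers to the Pooled OLS coefficients.

```stata
* Re-run the Pooled OLS model so that _b[] refers to its coefficients
quietly regress linc orgage localc west genchar nsources govern_share

* Percentage difference in income for local versus non-local charities
display "Local vs non-local charities (%): " 100 * (exp(_b[localc]) - 1)

* Percentage change in income associated with one additional year of age
display "One extra year of age (%): " 100 * (exp(_b[orgage]) - 1)
```

With the estimates above, the coefficient of -.3302434 on localc corresponds to income that is roughly 28% lower for local charities, holding the other variables in the model constant.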

Conditions where Pooled OLS is suitable

Pooled OLS can produce consistent estimates of the coefficients on the explanatory variables if:

The model is correctly specified
The explanatory variables are uncorrelated with the error term (Cameron and Trivedi, 2010)

TASK: Do you think our statistical model is correctly specified, and there is no correlation between error term and explanatory variables?

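Pooled OLS treats all 23,826 charity-year observations as independent, even though repeated observations of the same charity are likely to be correlated, so its standard errors are probably under-estimated. In our statistical model of charity income, however, it is unlikely that the interpretation of the coefficients would change drastically if we addressed this under-estimation, because the sample size is very large.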
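One common way to address under-estimated standard errors, which is not the approach taken in this section, is to keep the Pooled OLS specification but request cluster-robust standard errors. The sketch below assumes the charity panel is in memory and that `regno` identifies charities. The stored name `pols_cl` is made up for illustration.

```stata
* Pooled OLS with conventional standard errors (as estimated above)
quietly regress linc orgage localc west genchar nsources govern_share
est store pols

* Same specification, but with standard errors clustered on charity
regress linc orgage localc west genchar nsources govern_share, vce(cluster regno)
est store pols_cl

* Coefficients are identical across the two; only the standard errors differ
est table pols pols_cl, se
```

Clustering leaves the coefficient estimates unchanged and only alters the standard errors, so comparing the two columns shows directly how much the independence assumption matters for inference.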

We’ll cover the various tests and checks we can perform to examine whether the Pooled OLS model violates the independence of errors assumption in a later section.

Between Effects Model

Once again we estimate a cross-sectional model, as with Pooled OLS. However, this time we transform the data so that there is one observation per unit: each variable is replaced by its mean over time for that unit. As a result, we end up regressing the unit-level mean of Y on the unit-level means of our X variables.
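In terms of equation 1.6, averaging each charity's observations over time gives something like the following. This is a sketch in the notation used above; the bar notation and \(T_i\), the number of time periods observed for charity i, are introduced here for illustration.

\[ \bar{\text{y}}_{i} = \beta_0 + \beta_1\bar{x}_{1i} + \beta_2x_{2i} + \beta_3x_{3i} + \beta_4x_{4i} + \beta_5\bar{x}_{5i} + \beta_6\bar{x}_{6i} + \bar{\epsilon}_{i}, \qquad \bar{\text{y}}_{i} = \frac{1}{T_i}\sum_{t=1}^{T_i}\text{y}_{it} \]

The time-constant dummy variables keep their original values, because the mean over time of a constant is the constant itself. Only the time-varying variables (age, number of income sources, government funding share) and the error term are averaged.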

xtreg linc orgage localc west genchar nsources govern_share, be
est store beff

Between regression (regression on group means)  Number of obs     =     23,826
Group variable: regno                           Number of groups  =      2,166

R-sq:                                           Obs per group:
     within  = 0.0063                                         min =         11
     between = 0.1042                                         avg =       11.0
     overall = 0.0925                                         max =         11

                                                F(6,2159)         =      41.86
sd(u_i + avg(e_i.))=  .9109813                  Prob > F          =     0.0000
------------------------------------------------------------------------------
        linc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      orgage |   .0035048   .0004791     7.32   0.000     .0025652    .0044443
      localc |  -.3282906   .0414364    -7.92   0.000      -.40955   -.2470312
        west |   .1167392   .0805284     1.45   0.147     -.041182    .2746604
     genchar |  -.3210918   .0449127    -7.15   0.000    -.4091685    -.233015
    nsources |   .1384596   .0193749     7.15   0.000     .1004643    .1764549
govern_share |   .0002749   .0007478     0.37   0.713    -.0011916    .0017415
       _cons |   14.85058   .0835091   177.83   0.000     14.68681    15.01435
------------------------------------------------------------------------------

Estimating a Between Effects model is equivalent to collapsing the data and estimating your regression model on the resulting observations:

preserve
    collapse (mean) linc orgage localc west genchar nsources govern_share, by(regno)
    regress linc orgage localc west genchar nsources govern_share
    est store coll
restore

      Source |       SS           df       MS      Number of obs   =     2,166
-------------+----------------------------------   F(6, 2159)      =     41.86
       Model |  208.427331         6  34.7378885   Prob > F        =    0.0000
    Residual |  1791.72593     2,159  .829886952   R-squared       =    0.1042
-------------+----------------------------------   Adj R-squared   =    0.1017
       Total |  2000.15326     2,165  .923858319   Root MSE        =    .91098
------------------------------------------------------------------------------
        linc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      orgage |   .0035048   .0004791     7.32   0.000     .0025652    .0044443
      localc |  -.3282906   .0414364    -7.92   0.000      -.40955   -.2470313
        west |   .1167393   .0805284     1.45   0.147    -.0411819    .2746605
     genchar |  -.3210918   .0449127    -7.15   0.000    -.4091685    -.233015
    nsources |   .1384596   .0193749     7.15   0.000     .1004643     .176455
govern_share |   .0002749   .0007478     0.37   0.713    -.0011916    .0017415
       _cons |   14.85058   .0835091   177.83   0.000     14.68681    15.01435
------------------------------------------------------------------------------

est table pols beff coll

-----------------------------------------------------
    Variable |    pols         beff         coll
-------------+---------------------------------------
      orgage |  .00360282    .00350476    .00350476
      localc | -.33024344    -.3282906   -.32829062
        west |  .11218649    .11673923    .11673928
     genchar | -.31703032   -.32109178   -.32109176
    nsources |  .10538836     .1384596    .13845961
govern_share |  .00064402    .00027494    .00027494
       _cons |  14.963168    14.850581    14.850581
-----------------------------------------------------
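The comparison table reports coefficients only, but the three models differ most in their standard errors. As a small sketch, assuming the stored estimates `pols`, `beff` and `coll` are still in memory, the same command can print standard errors and sample sizes alongside the coefficients:

```stata
* Coefficients with standard errors beneath, plus the sample size for each model
est table pols beff coll, b(%9.4f) se(%9.4f) stats(N)
```

From the output shown earlier, the standard error on orgage is .00015 under Pooled OLS but .0004791 under Between Effects, roughly three times larger, which reflects the move from 23,826 pooled rows to 2,166 charity-level means.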

Benefits of Between Effects

Sidesteps the problem of interdependence of observations in the original panel data.
Smooths the effect of anomalous time periods (e.g., excess deaths calculation).
Controls for omitted variables that change over time but are constant between units (e.g., national policies).

Limitations of Between Effects

What might the limitations of this approach be?

Cannot estimate the effects of observed variables that change over time but are constant between units (e.g., national policies).
Discards a lot of information (e.g., change over time) by examining only mean outcomes and inputs.
Cannot control for unobserved explanatory variables that are constant within units but vary between them (e.g., organisational culture).

The limitations of the Between Effects model far outweigh the benefits in most cases, and thus it is not widely used in practice (Mehmetoglu and Jakobsen, 2016). However, it plays a crucial role in the estimation of another panel data model, the Random Effects model, and thus it is important to understand how it works and what it offers.

Summary

Both the Pooled OLS and Between Effects models provide useful information on the association between an outcome Y and a set of explanatory variables X.

However, both can be suboptimal from a substantive perspective, because neither model describes change over time.

More concerningly, they offer no ability to control for residual heterogeneity in the form of unobserved time-invariant explanatory variables.

Longitudinal Data Analysis for Social Scientists