Introduction to Longitudinal Data¶
This section draws heavily on the work of Professor Vernon Gayle: Longitudinal Data Analysis for Social Scientists
What are longitudinal data?¶
At its simplest, longitudinal data contain a temporal dimension. This may be as simple as the data set containing variables that define the beginning and end of a social process (e.g., how long did somebody remain unemployed?). More often when we speak of longitudinal data we refer to data sets containing multiple observations of the same individuals.
Types of longitudinal study designs¶
Repeated cross-sectional studies¶
Repeat samples of the same population over time:
National Surveys of Sexual Attitudes and Lifestyles (NATSAL)
British Social Attitudes Survey
Repeated cross-sectional studies allow analysis of change over time at the aggregate / macro level. For example, the mean number of opposite-sex sexual partners has increased over time in the UK for both men and women:
Figure 1.1.
Credit: Wellcome Trust/Paulo Estriga
Panel study¶
Groups of entities are repeatedly studied over time:
UK Household Longitudinal Study (UKHLS)
Panel Study of Income Dynamics (PSID)
English Longitudinal Study of Ageing (ELSA)
Panel studies collect data on the same respondents over time, and thus are known as repeated contacts data. For example, PSID has included a module examining the charitable giving of US households since 2000; this information is collected biennially and allows us to understand how the same households alter their giving behaviour over time (see figure 1.2 below).
Figure 1.2.
Credit: Changes to the Giving Landscape
Cohort study¶
Following a particular group of entities over time:
Millennium Cohort Study
Growing Up in Scotland
Whitehall II Study
The Millennium Cohort Study is a multi-wave survey of almost 20,000 children born in the UK during 2000/01, and is a representative sample of all children born during this period (Rafferty et al., 2015). It collects data at different periods (waves) of the children’s lives, thus providing longitudinal information on the development and life histories of these children.
Figure 1.3.
Credit: Child overweight and obesity: Initial findings from the Millennium Cohort Study Age 14 Survey
Why use longitudinal data?¶
The UK has an unparalleled collection of longitudinal data resources.
These resources are critical for analysing social change (and social stability).
However, they are costly to collect, clean, and share, so a strong justification is needed for their use.
Answering research questions¶
For many social science research projects cross-sectional data will be sufficient.
For example, if we are interested in understanding regional inequalities, it is sufficient to take a cross-section of data for these regions (e.g., a single census year) and describe variation in some measure of inequality. One of my recent research projects examined the distribution of charities across local authorities in England and Wales:
Figure 1.4.
The map displays the mean number of charities per 5000 residents across 326 local authorities in England and Wales. In essence, I combined five census years to produce a cross-section of charity density between 1971 and 2011; that is, I ignored the longitudinal component of my data and focused instead on making comparisons between local authorities.
Most social research projects can be improved by the analysis of longitudinal data.
Figure 1.5.
Figure 1.5 presents the temporal variation in the association between charity density and the level of deprivation in a local authority. Not only can we make comparisons between local authorities in a given year, but we can also examine change over time, adding much more detail to our understanding of the relationship between density and deprivation.
Some research questions require longitudinal data.
Figure 1.6.
Figure 1.6 displays the results of a change score model that links changes in the values of a set of explanatory variables to changes in the values of the outcome. For example, a local authority becoming more deprived between census years is associated with a small increase in charity density. Such an analysis would not be possible without data on the same local authorities at multiple time periods.
Research questions that require longitudinal data:
Flows into and out of poverty.
The effects of family migration on women's subsequent employment activities.
The impact of Covid-19 on long-term health outcomes of individuals.
Evaluating policy, health, educational interventions.
Methodological benefits¶
Micro-level social processes¶
Repeated cross-sectional data can reveal macro-level trends and patterns of substantive interest but mask micro-level change. For example, repeated cross-sectional analysis of the British Household Panel Survey (precursor to Understanding Society) showed that poverty rates stabilised in the 1990s. However longitudinal analysis uncovered substantial turnover / churn in terms of which individuals remained in or exited poverty (the poor are not always poor!).
In my own research area, cross-sectional analysis of the Scottish Household Survey reveals that the proportion of individuals volunteering remained stable between 2007 and 2017 (Volunteer Scotland, 2019). However, this pattern masks substantial micro-level variation in volunteering behaviour: that is, it is not the same individuals volunteering every year, with people dipping in and out of this activity throughout the lifecourse.
Temporal ordering of events¶
Longitudinal data give us a better sense of the timing of events and hence the direction of influence. Remember that a necessary (but insufficient) condition for causal analysis is the appropriate temporal ordering of the cause and effect: X cannot cause Y if it does not occur before Y.
Understanding, and having the ability to identify, the temporal ordering of events helps to address a pervasive issue in quantitative social science: simultaneity bias. For example, it is difficult to untangle whether poor health causes unemployment, unemployment causes poor health, or both, without some form of longitudinal data.
Improving control for residual heterogeneity¶
Now we arrive at one of the major methodological appeals of longitudinal data: the ability to control for residual heterogeneity. As Gayle (2018) concisely states:
The possibility of substantial variation between similar individuals due to unmeasured, and possibly immeasurable, variables is known as ‘residual heterogeneity’.
You may have heard residual heterogeneity referred to as omitted variable bias or unobserved heterogeneity. We'll spend much more time on this benefit in the next section.
Improving control for state dependence¶
Longitudinal data provide important information on the initial or current state an entity is in, and the trajectory of said entity across different or the same states over time. As Nobel Prize winner J.J. Heckman summarises:
A frequently noted empirical regularity in the analysis of employment data is that those who were unemployed in the past or have worked in the past are more likely to be unemployed (or working) in the future.
In essence, much of human behaviour is influenced by previous behaviour and outcomes. Think back to the example we showed from the Millennium Cohort Study: both boys and girls were most likely to remain at the same weight (whether normal or excess) at age 14 as they were at age 11.
A note of caution¶
Longitudinal data are not a panacea:
For missing data
For measurement error
For lack of sample representativeness
For poorly specified statistical models
Etc
See the excellent summaries of the strengths and weaknesses of longitudinal data produced by CLOSER.
In summary¶
Longitudinal data enhance our ability to investigate complicated processes in the social world!
What does longitudinal data look like?¶
Let's get our hands dirty working with some real-world longitudinal data (strictly speaking, I'll get my hands dirty, as the data set we're using has some restrictions on sharing). We will explore a data set containing a representative sample of UK charities; a version of this data set is available through the UK Data Service: SN 853257
First, let’s start with a simple, fabricated example of a longitudinal data set.
import delimited using "./data/lda-simple-example-2020-08-28.csv", clear varn(1)
l
(5 vars, 20 obs)
+--------------------------------------+
| pid year sex age income |
|--------------------------------------|
1. | 10001 2015 male 22 20000 |
2. | 10001 2016 male 23 20000 |
3. | 10001 2017 male 24 22000 |
4. | 10001 2018 male 25 24000 |
5. | 10002 2015 female 45 29000 |
|--------------------------------------|
6. | 10002 2016 female 46 29000 |
7. | 10002 2017 female 47 29000 |
8. | 10002 2018 female 48 29500 |
9. | 10003 2015 female 31 41500 |
10. | 10003 2016 female 32 42400 |
|--------------------------------------|
11. | 10003 2017 female 33 43800 |
12. | 10003 2018 female 34 45000 |
13. | 10004 2015 male 65 25000 |
14. | 10004 2016 male 66 10000 |
15. | 10004 2017 male 67 10000 |
|--------------------------------------|
16. | 10004 2018 male 68 10000 |
17. | 10005 2015 female 18 14000 |
18. | 10005 2016 female 19 15000 |
19. | 10005 2017 female 20 15000 |
20. | 10005 2018 female 21 18000 |
+--------------------------------------+
Here we have five individuals (units) observed across four years (time periods), with three variables capturing attributes in each year (sex, age, income).
This is an example of a balanced panel: the same number of observations is captured for each unit.
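As a quick check (a minimal sketch, assuming the example data above are still in memory), we can declare the panel structure with xtset and let Stata confirm the balance:

* Declare pid as the panel identifier and year as the time variable;
* Stata reports "strongly balanced" when every unit appears in every period
xtset pid year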
Now let’s look at a different example:
import delimited using "./data/lda-simple-example-ub-2020-08-28.csv", clear varn(1)
l
(5 vars, 16 obs)
+--------------------------------------+
| pid year sex age income |
|--------------------------------------|
1. | 10001 2015 male 22 20000 |
2. | 10001 2016 male 23 20000 |
3. | 10001 2017 male 24 22000 |
4. | 10001 2018 male 25 24000 |
5. | 10002 2015 female 45 29000 |
|--------------------------------------|
6. | 10002 2016 female 46 29000 |
7. | 10003 2015 female 31 41500 |
8. | 10003 2016 female 32 42400 |
9. | 10003 2017 female 33 43800 |
10. | 10004 2015 male 65 25000 |
|--------------------------------------|
11. | 10004 2016 male 66 10000 |
12. | 10004 2017 male 67 10000 |
13. | 10004 2018 male 68 10000 |
14. | 10005 2015 female 18 14000 |
15. | 10005 2016 female 19 15000 |
|--------------------------------------|
16. | 10005 2017 female 20 15000 |
+--------------------------------------+
Here we have the same units and time span, but this time there are gaps within units: individual 10002 is only observed twice, and 10003 and 10005 three times each.
This is an example of an unbalanced panel: the same number of observations is not captured for each unit.
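One way to make the gaps visible (a sketch, assuming the unbalanced example above is in memory; nobs is just an illustrative name) is to count the rows per individual:

* Count the number of yearly observations for each individual
bysort pid: gen nobs = _N
* Tabulate the counts: a balanced panel would show a single value
tabulate nobs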
Working with a balanced panel is preferable for a number of reasons, which we'll explore in due course. However, the methods of analysis we will cover also apply to unbalanced panels (Mehmetoglu & Jakobsen, 2016).
The classic panel consists of a large number of units of analysis (i) observed over a small number of periods (t).
Charity data¶
use "./data/charity-panel-2020-09-10.dta", clear
desc
(Contains annual accounts of charities in E&W for financial years 2006-2017)
Contains data from ./data/charity-panel-2020-09-10.dta
obs: 68,818 Contains annual accounts of
charities in E&W for financial
years 2006-2017
vars: 31 9 Sep 2020 08:41
size: 8,326,978 (_dta has notes)
--------------------------------------------------------------------------------
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------
regno long %12.0g Charity number (unique id)
fin_year byte %8.0g fin_year Financial year
etotal double %12.0g Total expenditure
itotal double %12.0g Total income
aob_classified str19 %19s Geographical scale of activity
i.e. local, national
sampling_strata byte %12.0g sampling_strata_lab
Income categories used to sample
organisations
large_samplin~a byte %12.0g large_sampling_strata_lab
Income categories used to sample
large organisations (£500k+)
orgsize byte %12.0g orgsize_lab
Size of charity - in categories of
total annual gross income
orgsize_large byte %12.0g orgsize_large_lab
Organisation size by income bands,
for large charities (> £500k)
orgsize_alt byte %13.0g orgsize_alt_lab
Organisation size by income bands,
alternative banding
fundraised float %9.0g Income derived from donations from
individuals
ind_fees float %9.0g Income derived from fees for
charitable activities from
individuals
govern float %9.0g Income derived from government
grants or contracts
volsector float %9.0g Income derived from voluntary
sector grants or contracts
internal float %9.0g Income derived from investments
and trading subsidiaries
business_other float %9.0g Income derived from other sources
e.g. business sector
fundraised_sh~e float %9.0g Share of income derived from
donations from individuals
business_othe~e float %9.0g Share of income derived from other
sources e.g. business sector
internal_share float %9.0g Share of income derived from
investments and trading
subsidiaries
volsector_share float %9.0g Share of income derived from
voluntary sector grants or
contracts
govern_share float %9.0g Share of income derived from
government grants or contracts
ind_fees_share float %9.0g Share of income derived from fees
for charitable activities from
individuals
nsources byte %9.0g Number of income sources where
income >= £1,000
inc_diverse float %9.0g Index of revenue diversification:
0 (less diversified) to 1 (more
diversified)
maxyear byte %9.0g Most recent year charity appears
in the dataset
orgage int %9.0g Age of charity - in years
linc float %9.0g Total income (log)
genchar float %9.0g General charity
socser float %9.0g Social service charity
west float %9.0g Charity registered in Westminster
localc float %9.0g Local charity
--------------------------------------------------------------------------------
Sorted by: regno
Let’s perform a couple of quick tasks in order to get familiar with the data.
First, we need to tell Stata we are dealing with panel data, as this allows us to access some useful time-series operators:
xtset regno fin_year
panel variable: regno (unbalanced)
time variable: fin_year, 1 to 11, but with gaps
delta: 1 unit
The xtset command takes two arguments: a variable representing the unique identifier of the panel units (regno) and a variable capturing the unique identifier for the time period (fin_year). This combination of variables must uniquely identify every observation (row) in the data: we can check whether this is the case using the isid command; if no error message is returned, then those variables uniquely identify every observation:
isid regno fin_year
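With the panel declared, the time-series operators mentioned above become available. As a minimal sketch (lag_income is an illustrative name, not a variable in the data set), we could refer to the previous financial year's total income for the same charity:

* L. retrieves the previous period's value within each panel unit;
* observations with no preceding financial year are set to missing
gen lag_income = L.itotal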
Second, we can use xtdescribe to learn more about the patterns of observations in our panel:
xtdescribe
regno: 200048, 200051, ..., 1166968 n = 11193
fin_year: 1, 2, ..., 11 T = 11
Delta(fin_year) = 1 unit
Span(fin_year) = 11 periods
(regno*fin_year uniquely identifies each observation)
Distribution of T_i: min 5% 25% 50% 75% 95% max
1 1 3 6 10 11 11
Freq. Percent Cum. | Pattern
---------------------------+-------------
2166 19.35 19.35 | 11111111111
476 4.25 23.60 | ..111111111
434 3.88 27.48 | ....1.1.1.1
388 3.47 30.95 | ........1.1
381 3.40 34.35 | ......1.1.1
247 2.21 36.56 | ....1......
212 1.89 38.45 | .......1.1.
211 1.89 40.34 | ......1....
181 1.62 41.95 | 1111.......
6497 58.05 100.00 | (other patterns)
---------------------------+-------------
11193 100.00 | XXXXXXXXXXX
Let’s unpack these results:
There are 11,193 panel units (n) and 11 time periods (T).
The time period variable (fin_year) changes by 1 unit (Delta(fin_year)).
50% of panel units are observed at least 6 times (Distribution of T_i).
2,166 panel units are observed in every time period, 181 are observed only in the first four periods, and so on (see the frequency table).
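We can cross-check these patterns by counting the number of times each charity appears in the data (the file is already sorted by regno) and tabulating the result with xttab: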
by regno: gen numobs = _N
xttab numobs
Overall Between Within
numobs | Freq. Percent Freq. Percent Percent
----------+-----------------------------------------------------
1 | 1318 1.92 1318 11.78 100.00
2 | 2838 4.12 1419 12.68 100.00
3 | 3069 4.46 1023 9.14 100.00
4 | 3812 5.54 953 8.51 100.00
5 | 2895 4.21 579 5.17 100.00
6 | 3624 5.27 604 5.40 100.00
7 | 4081 5.93 583 5.21 100.00
8 | 4552 6.61 569 5.08 100.00
9 | 8883 12.91 987 8.82 100.00
10 | 9920 14.41 992 8.86 100.00
11 | 23826 34.62 2166 19.35 100.00
----------+-----------------------------------------------------
Total | 68818 100.00 11193 100.00 100.00
(n = 11193)
Now we have a better sense of the number of times we observe our panel units in the data. Let’s also create a variable that identifies charities that appear in every year in the data, and drop all charities that do not meet this criterion:
gen balpan = (numobs==11)
keep if balpan
(44,992 observations deleted)
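As a quick sanity check (a sketch using the numobs variable created above), every remaining charity should now contribute eleven financial years:

* assert exits with an error if any observation violates the condition
assert numobs == 11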
That will do for now; we'll examine the variables when we start estimating statistical models in the next section. We'll save the changes to the data set:
sav "./data/charity-panel-analysis-2020-09-10.dta", replace
file ./data/charity-panel-analysis-2020-09-10.dta saved
Summary¶
Longitudinal data offer a number of substantive and methodological benefits.
There are a number of study designs, each with its own strengths and weaknesses.
Longitudinal data are not a panacea.