Introduction to Longitudinal Data

This section draws heavily on the work of Professor Vernon Gayle: Longitudinal Data Analysis for Social Scientists

What are longitudinal data?

At its simplest, longitudinal data contain a temporal dimension. This may be as simple as the data set containing variables that define the beginning and end of a social process (e.g., how long did somebody remain unemployed?). More often when we speak of longitudinal data we refer to data sets containing multiple observations of the same individuals.

Types of longitudinal study designs

Repeated cross-sectional studies

Repeat samples of the same population over time:

  • National Surveys of Sexual Attitudes and Lifestyles (NATSAL)

  • British Social Attitudes Survey

Repeated cross-sectional studies allow analysis of change over time at the aggregate / macro level. For example, the mean number of opposite-sex sexual partners has increased over time in the UK for both men and women:

Figure 1.1.

NATSAL Sexual Partners

Credit: Wellcome Trust/Paulo Estriga

Panel study

Groups of entities are repeatedly studied over time:

  • UK Household Longitudinal Study (UKHLS)

  • Panel Study of Income Dynamics (PSID)

  • English Longitudinal Study of Aging (ELSA)

Panel studies collect data on the same respondents over time, and thus are known as repeated contacts data. For example, PSID has a module examining charitable giving of US households since 2000; this information is collected biennially and allows us to understand how the same households alter their giving behaviour over time (see figure 3.2 below).

Figure 1.2

PSID Charitable Giving

Credit: Changes to the Giving Landscape

Cohort study

Following a particular group of entities over time:

  • Millennium Cohort Study

  • Growing Up in Scotland

  • Whitehall Study II

The Millennium Cohort Study is a multi-wave survey of almost 20,000 children born in the UK during 2000/01, and is a representative sample of all children born during this period (Rafferty et al., 2015). It collects data at different periods (waves) of the children’s lives, thus providing longitudinal information on the development and life histories of these children.

Figure 1.3.

MCS

Credit: Child overweight and obesity: Initial findings from the Millennium Cohort Study Age 14 Survey

Why use longitudinal data?

  • UK has an unparalleled collection of longitudinal data resources.

  • These resources are critical for analysing social change (and social stability).

  • However they are costly to collect, clean and share, therefore strong justification needed.

Answering research questions

For many social science research projects cross-sectional data will be sufficient.

For example, if we are interested in understanding regional inequalities, it is sufficient to take a cross-section of data for these regions (e.g., a single census year) and describe variation in some measure of inequality. One of my recent research projects examined the distribution of charities across local authorities in England and Wales:

Figure 1.4.

Charity Density Map

The map displays the mean number of charities per 5000 residents across 326 local authorities in England and Wales. In essence I combined five census years to produce a cross-section of charity density between 1971 and 2011; that is I ignored the longitudinal component of my data and focused instead on making comparisons between local authorities.

Most social research projects can be improved by the analysis of longitudinal data.

Figure 1.5.

Charity Density and Deprivation

Figure 3.5 presents the temporal variation in the association between charity density and the level of deprivation in a local authority. Not only can we make comparisons between local authorities in a given year, we can now examine change over time, adding much more detail to our understanding of the relationship between density and deprivation.

Some research questions require longitudinal data.

Figure 1.6.

Charity Density Change Score Model

Figure 3.3 displays the results of a change score model that links changes in the values of a set of explanatory variables to changes in the values of the outcome. For example, a local authority becoming more deprived between census years is associated with a small increase in charity density. Such an analysis is not possible if we did not have data on the same local authorities at multiple time periods.

Research questions that require longitudinal data:

  • Flows into and out of poverty.

  • The effects of family migration on the woman’s subsequent employment activities.

  • The impact of Covid-19 on long-term health outcomes of individuals.

  • Evaluating policy, health, educational interventions.

Methodological benefits

Micro-level social processes

Repeated cross-sectional data can reveal macro-level trends and patterns of substantive interest but mask micro-level change. For example, repeated cross-sectional analysis of the British Household Panel Survey (precursor to Understanding Society) showed that poverty rates stabilised in the 1990s. However longitudinal analysis uncovered substantial turnover / churn in terms of which individuals remained in or exited poverty (the poor are not always poor!).

In my own research area, cross-sectional analysis of the Scottish Household Survey reveals the proportion of individuals volunteering has remained stable between 2007-2017 (Volunteer Scotland, 2019). However this pattern masks the substantial micro-level variation in volunteering behaviour: that is, it is not the same individuals volunteering every year, with people dipping in and out of this activity throughout the lifecourse.

Temporal ordering of events

Longitudinal data give us a better sense of the timing of events and hence the direction of influence. Remember that a necessary (but insufficient) condition for causal analysis is the appropriate temporal ordering of the cause and effect: X cannot cause Y if it does not occur before Y.

Understanding — and having the ability to identify — the temporal ordering of events helps to address a pervading issue in quantitative social science analysis: simultaneity bias. For example, it is difficult to untangle whether poor health causes unemployment, unemployment causes poor health (or both) without some form of longitudinal data.

Improving control for residual heterogeneity

Now we arrive at one of the major methodological appeals of longitudinal data: the ability to control for residual heterogeneity. As Gayle (2018) concisely states:

The possibility of substantial variation between similar individuals due to unmeasured, and possibly immeasurable, variables is known as ‘residual heterogeneity’.

You may have heard residual heterogeneity referred to as omitted variable bias or unobserved hetereogeneity. We’ll spend much more time on this benefit in the next section.

Improving control for state dependence

Longitudinal data provide important information on the initial or current state an entity is in, and the trajectory of said entity across different or the same states over time. As Nobel Prize winner J.J. Heckman summarises:

A frequently notes empirical regularity in the analysis of employment data is that those who were unemployed in the past or have worked in the past are more likely to be unemployed (or working) in the future.

In essence, much of human behaviour is influenced by previous behaviour and outcomes. Think back to the example we showed from the Millennium Cohort Study: both boys and girls were most likely to remain at the same weight (whether normal or excess) at age 14 as they were at age 11.

A note of caution

Longitudinal data are not a panacea:

  • For missing data

  • For measurement error

  • For lack of sample representativeness

  • For poorly specified statistical models

  • Etc

See the excellent summaries of the strengths and weaknesses of longitudinal data produced by CLOSER.

In summary

Longitudinal data enhance our ability to investigate complicated processes in the social world!

What does longitudinal data look like?

Let’s get our hands dirty working with some real-world longitudinal data: strictly speaking I’ll get my hands dirty as the data set we’re using has some restrictions on sharing. We will explore a data set containing a representative sample of UK charities: a version of this data set is available through the UK Data Service: SN 853257

First, let’s start with a simple, fabricated example of a longitudinal data set.

import delimited using "./data/lda-simple-example-2020-08-28.csv", clear varn(1)
l
(5 vars, 20 obs)
     +--------------------------------------+
     |   pid   year      sex   age   income |
     |--------------------------------------|
  1. | 10001   2015     male    22    20000 |
  2. | 10001   2016     male    23    20000 |
  3. | 10001   2017     male    24    22000 |
  4. | 10001   2018     male    25    24000 |
  5. | 10002   2015   female    45    29000 |
     |--------------------------------------|
  6. | 10002   2016   female    46    29000 |
  7. | 10002   2017   female    47    29000 |
  8. | 10002   2018   female    48    29500 |
  9. | 10003   2015   female    31    41500 |
 10. | 10003   2016   female    32    42400 |
     |--------------------------------------|
 11. | 10003   2017   female    33    43800 |
 12. | 10003   2018   female    34    45000 |
 13. | 10004   2015     male    65    25000 |
 14. | 10004   2016     male    66    10000 |
 15. | 10004   2017     male    67    10000 |
     |--------------------------------------|
 16. | 10004   2018     male    68    10000 |
 17. | 10005   2015   female    18    14000 |
 18. | 10005   2016   female    19    15000 |
 19. | 10005   2017   female    20    15000 |
 20. | 10005   2018   female    21    18000 |
     +--------------------------------------+

Here we have five individuals (units) observed across four years (time periods), with three variables capturing attributes in each year (sex, age, income).

This is an example of a balanced panel: the same number of observations is captured for each unit.

Now let’s look at a different example:

import delimited using "./data/lda-simple-example-ub-2020-08-28.csv", clear varn(1)
l
(5 vars, 16 obs)
     +--------------------------------------+
     |   pid   year      sex   age   income |
     |--------------------------------------|
  1. | 10001   2015     male    22    20000 |
  2. | 10001   2016     male    23    20000 |
  3. | 10001   2017     male    24    22000 |
  4. | 10001   2018     male    25    24000 |
  5. | 10002   2015   female    45    29000 |
     |--------------------------------------|
  6. | 10002   2016   female    46    29000 |
  7. | 10003   2015   female    31    41500 |
  8. | 10003   2016   female    32    42400 |
  9. | 10003   2017   female    33    43800 |
 10. | 10004   2015     male    65    25000 |
     |--------------------------------------|
 11. | 10004   2016     male    66    10000 |
 12. | 10004   2017     male    67    10000 |
 13. | 10004   2018     male    68    10000 |
 14. | 10005   2015   female    18    14000 |
 15. | 10005   2016   female    19    15000 |
     |--------------------------------------|
 16. | 10005   2017   female    20    15000 |
     +--------------------------------------+

Here we have the same units and time span but this time there are gaps within units: individual 10002 is only observed twice, and 10003 and 10005 three times.

This is an example of an unbalanced panel: the same number of observations is not captured for each unit.

Working with a balanced panel is preferrable for a number of reasons, which we’ll explore in due course. However the methods of analysis we will cover apply to unbalanced panels also (Mehmetoglu & Jakobsen, 2016).

The classic panel consists of a large number of units of analysis (i) observed over a small number of periods (t).

Charity data

use "./data/charity-panel-2020-09-10.dta", clear
desc
(Contains annual accounts of charities in E&W for financial years 2006-2017)
Contains data from ./data/charity-panel-2020-09-10.dta
  obs:        68,818                          Contains annual accounts of
                                                charities in E&W for financial
                                                years 2006-2017
 vars:            31                          9 Sep 2020 08:41
 size:     8,326,978                          (_dta has notes)
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
regno           long    %12.0g                Charity number (unique id)
fin_year        byte    %8.0g      fin_year   Financial year
etotal          double  %12.0g                Total expenditure
itotal          double  %12.0g                Total income
aob_classified  str19   %19s                  Geographical scale of activity
                                                i.e. local, national
sampling_strata byte    %12.0g     sampling_strata_lab
                                              Income categories used to sample
                                                organisations
large_samplin~a byte    %12.0g     large_sampling_strata_lab
                                              Income categories used to sample
                                                large organisations (£500k+)
orgsize         byte    %12.0g     orgsize_lab
                                              Size of charity - in categories of
                                                total annual gross income
orgsize_large   byte    %12.0g     orgsize_large_lab
                                              Organisation size by income bands,
                                                for large charities (> £500k)
orgsize_alt     byte    %13.0g     orgsize_alt_lab
                                              Organisation size by income bands,
                                                alternative banding
fundraised      float   %9.0g                 Income derived from donations from
                                                individuals
ind_fees        float   %9.0g                 Income derived from fees for
                                                charitable activities from
                                                individuals
govern          float   %9.0g                 Income derived from government
                                                grants or contracts
volsector       float   %9.0g                 Income derived from voluntary
                                                sector grants or contracts
internal        float   %9.0g                 Income derived from investments
                                                and trading subsidiaries
business_other  float   %9.0g                 Income derived from other sources
                                                e.g. business sector
fundraised_sh~e float   %9.0g                 Share of income derived from
                                                donations from individuals
business_othe~e float   %9.0g                 Share of income derived from other
                                                sources e.g. business sector
internal_share  float   %9.0g                 Share of income derived from
                                                investments and trading
                                                subsidiaries
volsector_share float   %9.0g                 Share of income derived from
                                                voluntary sector grants or
                                                contracts
govern_share    float   %9.0g                 Share of income derived from
                                                government grants or contracts
ind_fees_share  float   %9.0g                 Share of income derived from fees
                                                for charitable activities from
                                                individuals
nsources        byte    %9.0g                 Number of income sources where
                                                income >= £1,000
inc_diverse     float   %9.0g                 Index of revenue diversification:
                                                0 (less diversified) to 1 (more
                                                diversified)
maxyear         byte    %9.0g                 Most recent year charity appears
                                                in the dataset
orgage          int     %9.0g                 Age of charity - in years
linc            float   %9.0g                 Total income (log)
genchar         float   %9.0g                 General charity
socser          float   %9.0g                 Social service charity
west            float   %9.0g                 Charity registered in Westminster
localc          float   %9.0g                 Local charity
--------------------------------------------------------------------------------
Sorted by: regno

Let’s perform a couple of quick tasks in order to get familiar with the data.

First, we need to tell Stata we are dealing with panel data, as this allows us to access some time-series operators that are useful:

xtset regno fin_year
       panel variable:  regno (unbalanced)
        time variable:  fin_year, 1 to 11, but with gaps
                delta:  1 unit

The xtset command takes two arguments: a variable representing the unique identifier of the panel units (regno) and a variable capturing the unique identifier for the time period (fin_year). This combination of variables must uniquely identify every observation (row) in the data: we can check whether this is the case using the isid command - if no error message is returned, then those variables uniquely identify an every observation:

isid regno fin_year

Second, we can use xtdescribe to learn more about the patterns of observations in our panel:

xtdescribe
   regno:  200048, 200051, ..., 1166968                      n =      11193
fin_year:  1, 2, ..., 11                                     T =         11
           Delta(fin_year) = 1 unit
           Span(fin_year)  = 11 periods
           (regno*fin_year uniquely identifies each observation)
Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       3         6        10      11      11
     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-------------
     2166     19.35   19.35 |  11111111111
      476      4.25   23.60 |  ..111111111
      434      3.88   27.48 |  ....1.1.1.1
      388      3.47   30.95 |  ........1.1
      381      3.40   34.35 |  ......1.1.1
      247      2.21   36.56 |  ....1......
      212      1.89   38.45 |  .......1.1.
      211      1.89   40.34 |  ......1....
      181      1.62   41.95 |  1111.......
     6497     58.05  100.00 | (other patterns)
 ---------------------------+-------------
    11193    100.00         |  XXXXXXXXXXX

Let’s unpack these results:

  • There are 11,193 panel units (n) and 11 time periods (T).

  • The time period variable (fin_year) changes by 1 unit (Delta(fin_year)).

  • 50% of panel units are observed at least 6 times (Distribution of T_i).

  • 2,166 panel units are observed in every time period, 181 are observed only in the first 4 periods etc (see frequency table).

by regno: gen numobs = _N
xttab numobs
                  Overall             Between            Within
   numobs |    Freq.  Percent      Freq.  Percent        Percent
----------+-----------------------------------------------------
        1 |    1318      1.92      1318     11.78         100.00
        2 |    2838      4.12      1419     12.68         100.00
        3 |    3069      4.46      1023      9.14         100.00
        4 |    3812      5.54       953      8.51         100.00
        5 |    2895      4.21       579      5.17         100.00
        6 |    3624      5.27       604      5.40         100.00
        7 |    4081      5.93       583      5.21         100.00
        8 |    4552      6.61       569      5.08         100.00
        9 |    8883     12.91       987      8.82         100.00
       10 |    9920     14.41       992      8.86         100.00
       11 |   23826     34.62      2166     19.35         100.00
----------+-----------------------------------------------------
    Total |   68818    100.00     11193    100.00         100.00
                             (n = 11193)

Now we have a better sense of the number of times we observe our panel units in the data. Let’s also create a variable that identifies charities that appear in every year in the data, and drop all charities that do not meet this criterion:

gen balpan = (numobs==11)
keep if balpan
(44,992 observations deleted)

That will do for now, we’ll examine the variables when we start estimating statistical models in the next section. We’ll save the changes to the data set:

sav "./data/charity-panel-analysis-2020-09-10.dta", replace
file ./data/charity-panel-analysis-2020-09-10.dta saved

Summary

Longitudinal data offer a number of substantive and methodological benefits.

There a number of study designs, each with strengths and weaknesses.

Longitudinal data are not a panacea.