SCS: Workshops: Longitudinal and Hierarchical Data Analysis with Mixed Models 2007

From MathWiki

Course web page:

Table of contents


This course uses classical repeated-measures analysis, univariate and multivariate, as a point of departure for studying methods for the analysis of longitudinal and hierarchical data using mixed models. Mixed models allow the analysis of repeated-measures data when the data are ‘unbalanced’ and classical models do not work, e.g. subjects are observed at different times or time-varying covariates are included in the model. The ability to analyze a wider range of data comes at a price. Not only do you need to learn new techniques, you also need to become aware of concepts that are not as salient in the analysis of ‘balanced’ data.

The course will emphasize the visualization of the basic concepts to help you develop a good understanding of the strengths and limitations of these methods. The proposed list of topics includes:

  • Classical univariate and multivariate repeated measures models and their extension to mixed models
  • The structure of the linear mixed model: fixed effects, random effects, variance and covariance components
  • How mixed models are used to fit longitudinal data
  • Statistical control with observational data
  • Borrowing strength, shrinkage and bias in random effects models
  • Contextual versus compositional effects
  • Model building and diagnostics
  • Consequences of measurement error and approaches to adjustment
  • Modelling correlation
  • Missing data patterns and modelling panel attrition
  • Logistic regression for binary responses
  • Non-linear models for binary and categorical responses

Two multilevel statistics books are available free online. The Internet Edition of Harvey Goldstein’s extensive but challenging Multilevel Statistical Models can be downloaded from

Applied Multilevel Analysis by Joop Hox is available at

What is a Mixed Model?

The last two decades have seen rapid growth in the development and use of models suitable for multilevel or longitudinal data. We can identify at least four broad approaches: the use of derived variables, econometric models, latent trajectory models using structural equations models and mixed (fixed and random effects) models. This course focuses on the use of mixed models for multilevel and longitudinal data. Mixed models have a wide potential for applications to data that are otherwise awkward or impossible to model. Some key applications are to nested data structures, e.g. students within classes within schools within school boards, and to longitudinal data, especially ‘messy’ data where measurements are taken at irregular times, e.g. clinical data from patients measured at irregular times or panel data with changing membership. Another application is to panel data with time-varying covariates, e.g. age where each cohort has a variety of ages and the researcher is interested in studying age effects.

Newer methods are needed for:

  1. Balanced data with missing occasions
  2. Observations at irregular times, e.g. recovery of IQ after brain injury (Pine trees, comas and migraines)
  3. Time-varying covariates, e.g. schizophrenic symptoms over time (p. 18)
  4. Non-normal outcomes, e.g. a daily subject log recording the presence or absence of migraine (Pine trees, comas and migraines, p. 93ff)

Classical methods vs. mixed models: mixed models give greater flexibility (missing data, time-varying covariates, different times for different subjects) at the cost of having to understand the possible consequences. See UCLA: Longitudinal Data Analysis Using Multilevel Models.

Review of Regression

We begin by reviewing a few concepts from regression and multivariate data analysis. Our emphasis is on concepts that are frequently omitted, or, at least, not well explored in typical courses.

The meaning of Regression Coefficients

Is β1 the effect of X1? Consider the familiar regression equation:

E(Y) = \beta_0  + \beta_1 X_1 + \beta_2 X_2 +  \cdots +\beta_p X_p

To make this more concrete, suppose we are studying the relationship between Y = Health, X1 = Weight and X2 = Height.

E(Health) = β0 + β1Weight + β2Height

so that β1 is the ‘effect’ of changing Weight. What if we are really interested in the ‘effect’ of ExcessWeight? Maybe we should replace Weight with ExcessWeight. Let’s suppose that

ExcessWeight = Weight − (φ0 + φ1Height)

where φ0 + φ1Height is the 'normal' Weight for a given Height. What happens if we fit the model?

E(Health) = γ0 + γ1ExcessWeight + γ2Height

where we now expect γ1 to be the effect of ExcessWeight. If we substitute the original definition for ExcessWeight into the model we get:

E(Health) = γ0 + γ1ExcessWeight + γ2Height
E(Health) = γ0 + γ1(Weight − (φ0 + φ1Height)) + γ2Height
E(Health) = (γ0 − γ1φ0) + γ1Weight + (γ2 − γ1φ1)Height

Using the fact that identical linear functions have identical coefficients:

γ0 = β0 + β1φ0
γ1 = β1
γ2 = β2 + β1φ1

The coefficient we expected to change is the only one that stays the same! This illustrates the importance of thinking of β1 as the effect of changing Weight keeping Height constant, which turns out to be the same thing as changing ExcessWeight keeping Height constant as long as ExcessWeight can be defined as a linear function of Weight and Height. In multiple regressions, the other variables in the model are as important in defining the meaning of a coefficient as the variable to which the coefficient is attached. Often, the major role of the target variable is that of providing a measuring stick for units. This fact will play an important role when we consider controlling for ‘contextual variables.’
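The invariance of the target coefficient under a linear reparametrization is easy to check numerically. The following Python/NumPy sketch uses made-up data and made-up values of φ0 and φ1 (nothing here comes from real data; it only verifies the algebra above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
height = rng.normal(170, 10, n)
weight = rng.normal(70, 12, n)
# Hypothetical 'normal weight for height' line (phi0, phi1 chosen for illustration)
phi0, phi1 = -80.0, 0.9
excess = weight - (phi0 + phi1 * height)
health = 50 - 0.3 * weight + 0.2 * height + rng.normal(0, 2, n)

def ols(X, y):
    # add an intercept column and solve by least squares
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

beta = ols(np.column_stack([weight, height]), health)   # beta0, beta1, beta2
gamma = ols(np.column_stack([excess, height]), health)  # gamma0, gamma1, gamma2

# the slope attached to Weight/ExcessWeight is unchanged:
assert np.allclose(beta[1], gamma[1])
# the intercept and Height coefficients absorb the reparametrization:
assert np.allclose(gamma[0], beta[0] + beta[1] * phi0)
assert np.allclose(gamma[2], beta[2] + beta[1] * phi1)
```

Only the coefficients on the intercept and Height change, exactly as the substitution predicts.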

Visualizing Regression with Ellipses

  1. The bivariate normal and its contours: iso-density or contour ellipses. Script: visualizing normal contour ellipses
  2. Visualizing Multiple Regression R scripts
    1. Visualizing multiple regression with R part 1
    2. Visualizing multiple regression with R part 2 and Visualizing multiple regression part 2a
    3. Visualizing multiple regression with R part 3
  3. Yet more properties: Statistics: Ellipses
  4. Even more properties: Deviation Ellipse and Precision Ellipse
  5. Review of regression with SAS: UCLA SAS seminar

Confidence regions and intervals for β


This sketch uses the confidence ellipse for β1 and β2 to summarize the relationship among the estimated coefficients and the three linear models one can fit with two predictor variables. Remember that the shape of the confidence ellipse is that of the 90 degree rotation of the data ellipse for the two predictors -- provided all plots are isometric.

The following are two views of the data showing the multiple regression plane and the plane for the simple regression of 'Heart' on 'Coffee'.

  1. Confidence ellipses and intervals in regression: Statistics: Ellipses of regression

Visualizing multivariate variance

Understanding multivariate variance is important to unravel some mysteries of random coefficient models. With univariate random variables, it is generally easier to think in terms of the standard deviation SD = \sqrt{VAR}. What is the equivalent of the SD for a bivariate random vector? The easiest way of thinking about this uses the standard 'concentration ellipse', which may also be known as the 'dispersion ellipse' or the 'variance ellipse', for the bivariate normal. With a sample, we can define a standard sample data ellipse with the following properties:

  • The centre of the 'standard ellipse' is at the point of means.
  • The shadow of the ellipse onto either axis gives the mean \pm 1 SD of the corresponding variable.
  • The shadow of the ellipse onto ANY axis gives mean \pm 1 SD for a variable represented by projection onto the axis, i.e. any linear combination of X1 and X2.
  • The vertical slice through the centre of the ellipse gives the mean \pm 1 residual SD of X_2 given X_1. If X_1 and X_2 have a bivariate normal distribution, this is also the conditional SD of X_2 given X_1.
  • Mutatis mutandis for X1 given X2.
  • The line through the points of vertical tangency is the regression line of X2 on X1.
  • Mutatis mutandis for X1 on X2.

The fact that the regression line goes through the points of vertical tangency and NOT the corners of the rectangle containing the ellipse (NOR the principal axis of the ellipse) is the essence of the regression paradox. Indeed, the very word regression comes from the notion of 'regression to mediocrity' embodied in the fact that the regression line is flatter than the 'SD line'.
  • The bivariate SD: take any pair of conjugate axes of the ellipse as shown in the figure above. Make them the column vectors of a 2×2 matrix and you have a 'square root' of the variance matrix, i.e. an SD for the bivariate random vector. The choice of conjugate axes is not unique, and neither is the 'square root' of a matrix. Depending on your choice of conjugate axes, you can get various factorizations of the variance matrix: upper or lower Cholesky, the symmetric non-negative definite 'square root', the principal components factorization (singular value decomposition), etc.
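The non-uniqueness of the 'square root' can be checked numerically. A minimal Python/NumPy sketch, using an arbitrary illustrative variance matrix:

```python
import numpy as np

# a bivariate variance (covariance) matrix, values chosen for illustration
Sigma = np.array([[4.0, 2.4],
                  [2.4, 9.0]])

# lower Cholesky factor: one 'square root' of Sigma
L = np.linalg.cholesky(Sigma)

# symmetric square root via the eigendecomposition: another factorization
vals, vecs = np.linalg.eigh(Sigma)
S = vecs @ np.diag(np.sqrt(vals)) @ vecs.T

# both reproduce Sigma, yet they are different matrices;
# their columns are different pairs of conjugate axes of the ellipse
assert np.allclose(L @ L.T, Sigma)
assert np.allclose(S @ S.T, Sigma)
```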

Matrix formulation of regression

Y = X\beta + \varepsilon is such a universal and convenient shorthand that we need to spell out what it means and how it is used.

Consider the equation for a single observation assuming p predictor variables:

Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i

When we combine the equations for all n observations in the data set (with p slope coefficients, the regression intercept β0, and n data points, where n \geq p + 1), we can construct the following vectors and matrices:

\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix}

Note that the first column of 1s corresponds to the intercept β0.

or, in short-hand:

\ y =  \mathbf{X}\cdot\beta + \varepsilon.\,

In multilevel models, say J schools indexed by j = 1, 2, \dots, J with the jth school having n_j students, we block students of the same school together. We can then write for the jth school:

\begin{bmatrix} y_{1j} \\ y_{2j} \\ \vdots \\ y_{n_{j}j} \end{bmatrix} = \begin{bmatrix} 1 & x_{11j} & x_{12j} & \dots & x_{1pj} \\ 1 & x_{21j} & x_{22j} & \dots & x_{2pj} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n_{j}1j} & x_{n_{j}2j} & \dots & x_{n_{j}pj} \end{bmatrix} \begin{bmatrix} \beta_{0j} \\ \beta_{1j} \\ \vdots \\ \beta_{pj} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1j}\\ \varepsilon_{2j}\\ \vdots\\ \varepsilon_{n_{j}j} \end{bmatrix}

Note that the number of rows carries a school subscript because each school can have a different number of data points.

Or, in short hand:

\ Y_j =  \mathbf{X_j}\cdot\beta_j + \varepsilon_j.\,

We can then stack schools on top of each other. If all schools are assumed to have the same value for \mathbf{\beta_j} = \mathbf{\beta}, then we can stack the Xs vertically:

\begin{bmatrix} \mathbf{Y}_{1} \\ \mathbf{Y}_{j} \\ \vdots \\ \mathbf{Y}_{J} \end{bmatrix} = \begin{bmatrix} \mathbf{X}_{1}\\ \mathbf{X}_{j}\\ \vdots\\ \mathbf{X}_{J} \end{bmatrix} \mathbf{\beta} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_j\\ \vdots\\ \varepsilon_J \end{bmatrix}

Or, in short hand:

\ \mathbf{Y} =  \mathbf{X}\cdot\beta + \varepsilon.\,

This is essentially the least squares formula, except that the \varepsilon are not independent.

If the βjs are different, we need to stack the Xjs diagonally:

\begin{bmatrix} \mathbf{Y}_{1} \\ \vdots\\ \mathbf{Y}_{j} \\ \vdots \\ \mathbf{Y}_{J} \end{bmatrix} = \begin{bmatrix} \mathbf{X}_{1} & \dots & 0 & \dots & 0 \\ \vdots && \vdots && \vdots \\0 & \dots & \mathbf{X}_{j} & \dots & 0 \\ \vdots && \vdots && \vdots \\ 0 & \dots & 0 & \dots & \mathbf{X}_{J}\end{bmatrix} \begin{bmatrix} \mathbf{\beta}_1 \\ \vdots \\ \mathbf{\beta}_j \\ \vdots \\ \mathbf{\beta}_J \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \vdots\\ \varepsilon_j\\ \vdots\\ \varepsilon_J \end{bmatrix}

If \varepsilon \sim N(0,\sigma^2 I),

the estimated values of the parameters can be given as
\widehat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T {Y}

If the components of \varepsilon are not iid but \varepsilon \sim N(0,\Sigma), where Σ is a known variance matrix (or, at least, known up to a proportionality factor), then the GLS (generalized least squares) estimator is

\widehat{\beta}=(\mathbf{X}^T\mathbf{\Sigma}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{\Sigma}^{-1} {Y}
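The two estimators can be compared numerically. A Python/NumPy sketch with simulated data and an arbitrary AR(1)-type choice of Σ (the course itself works in R and SAS; this only illustrates the two formulas above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])

# an illustrative known error covariance (AR(1)-type, chosen arbitrarily)
rho = 0.7
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
y = X @ beta_true + np.linalg.cholesky(Sigma) @ rng.normal(size=n)

# OLS: (X'X)^{-1} X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# GLS: (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
Sinv = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Sinv @ X, X.T @ Sinv @ y)

# sanity check: when Sigma is the identity, GLS reduces to OLS
I = np.eye(n)
beta_gls_id = np.linalg.solve(X.T @ I @ X, X.T @ I @ y)
assert np.allclose(beta_gls_id, beta_ols)
```

With correlated errors the two estimates differ; GLS is the efficient one when Σ is known.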

A motivating example for multilevel modeling

The data

For multilevel modeling we will use a subset of a data set presented at length in Bryk and Raudenbush (1992). The major variables are math achievement (mathach) and socio-economic status (ses) measured for each student in a sample of high school students in each of 40 selected schools in the U. S. Each school belongs to one of two sectors: 25 are Public schools and 15 are Catholic schools. There are 1,720 students in the sample. The sample size from each school ranges from 14 students to 61 students.

Text file versions of the partial (used in the notes) and full (for you to play with) High School Math Achievement data sets can be obtained as follows:

 > hs <- read.csv("")
 > hsfull <- read.csv("")
See a summary description of the variables.

Let’s consider just the relationship between math achievement, Y, and SES, X. Consider three ways (we’ll see more later) of estimating the relationship between X and Y:

  1. An ecological regression of the individual school means of Y on the individual school means of X.
  2. A pooled regression of Y on X ignoring school entirely. This estimates what is sometimes called the marginal relationship between Y and X.
  3. A regression of Y on X adjusting for school, i.e. including school as a categorical variable or, equivalently, first ‘partialing out’ the effect of schools. This estimates the conditional relationship between Y and X.

Robinson’s paradox refers to the fact that 1) and 3) can have opposite signs. Simpson’s paradox refers to the fact that 2) and 3) can also have opposite signs. In fact, estimate 2) will lie between 1) and 3). How much closer it is to 1) than to 3) depends on the variance of X and Y within schools.
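The three estimates, and the paradoxes, can be reproduced with simulated data. In this Python/NumPy sketch the between-school trend is built to be positive while the within-school slope is negative (all values are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
J, n_j = 10, 30
school_mean_x = np.linspace(-2, 2, J)

x, y, school = [], [], []
for j, mx_j in enumerate(school_mean_x):
    xj = mx_j + rng.normal(0, 0.5, n_j)
    # between-school trend positive (+3), within-school slope negative (-1)
    yj = 3.0 * mx_j - 1.0 * (xj - mx_j) + rng.normal(0, 0.2, n_j)
    x.append(xj); y.append(yj); school.append(np.full(n_j, j))
x, y, school = np.concatenate(x), np.concatenate(y), np.concatenate(school)

def slope(u, v):
    # simple-regression slope of v on u
    U = np.column_stack([np.ones(len(u)), u])
    return np.linalg.lstsq(U, v, rcond=None)[0][1]

mx = np.array([x[school == j].mean() for j in range(J)])
my = np.array([y[school == j].mean() for j in range(J)])

ecological = slope(mx, my)                           # 1) school means on school means
pooled = slope(x, y)                                 # 2) ignoring school
conditional = slope(x - mx[school], y - my[school])  # 3) partialing out school

assert ecological > 0 and conditional < 0            # Robinson: opposite signs
assert conditional < pooled < ecological             # 2) lies between 1) and 3)
```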

Looking at school 4458

The estimated coefficients in "beta space"

  • Shadows are ordinary 95% confidence intervals.
  • Shadow on the horizontal axis is a 95% confidence interval for the true slope.
  • Note that the interval includes β1 = 0, so we would accept H0: β1 = 0

Model to compare two schools

We suppose that the relationship between math achievement and SES might be different in two schools. We can index βs by school:

Yij = β0j + β1jXij + rij

where j = 1,2 and i = 1,...,nj where nj is the number of students measured in school j. Again, we assume rij to be iid N(0,σ2).

We can fit this model and ask various questions:

  1. Do the two schools have the same structure, i.e. are the βs the same?
  2. Perhaps the intercepts are different with one school "better" than the other but the slopes are the same.
  3. Perhaps the slopes are different but one school is better than the other over the relevant range of X.
  4. Maybe there is an essential interaction with one school better for high X and the other for low X - i.e. the two lines cross over the observed range of Xs.
  5. We could also allow different σ2 and test equality of variances (conditional given X).
  6. We should also be ready to question the linearity implied by the model.

The following regression output allows us to answer some of these questions. If it seems reasonable we could refit a model without an interaction term. Note that inferences based on ordinary regression only generalize to new samples of students from the same schools. They do not justify statements comparing the two sectors.

summary(lm(MathAch ~ SES * Sector, dataset))

              Value  Std. Error  t value  Pr(>|t|)
(Intercept)  6.9992      1.1618   6.0244    0.0000
SES          1.1318      0.9454   1.1972    0.2348
Sector      11.3997      1.6096   7.0823    0.0000
SES:Sector   0.7225      1.5340   0.4710    0.6390

Residual standard error: 4.188 on 79 degrees of freedom
Multiple R-Squared: 0.7418
F-statistic: 75.67 on 3 and 79 degrees of freedom, the p-value is 0

Correlation of Coefficients:
            (Intercept)      SES   Sector
SES              0.8540
Sector          -0.7218  -0.6164
SES:Sector      -0.5263  -0.6163  -0.0410

Sector coding: Public = 0, Catholic = 1.

Comparing two Sectors

One possible approach: two-stage analysis or 'derived variables':

Stage 1: Perform a regression on each school to get a βj for each school

Stage 2: Analyze the βjs from each school using multivariate methods to estimate the mean line in each sector. Use multivariate methods to compare the two estimated mean lines.

The figure on the left shows the mean lines from each sector. The figure on the right shows each estimated intercept and slope from each school (i.e. the βjs). The large ellipses are the sample dispersion ellipses for the βjs from each sector. The smaller ellipses are 95% confidence ellipses for the mean lines from each sector.
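A minimal simulation of the two-stage strategy (Python/NumPy; the values of γ, T and σ are invented, and the unweighted stage-2 mean is exactly the feature examined in the next section):

```python
import numpy as np

rng = np.random.default_rng(3)
gamma = np.array([12.0, 2.0])            # sector mean intercept and slope (made up)
T = np.array([[4.0, 0.5],
              [0.5, 1.0]])               # between-school variance of the true lines
sigma, J = 6.0, 15

beta_hats = []
for j in range(J):
    beta_j = rng.multivariate_normal(gamma, T)   # school's true (unobserved) line
    n_j = int(rng.integers(14, 62))              # school sizes vary, as in the data
    ses = rng.normal(0, 1, n_j)
    mathach = beta_j[0] + beta_j[1] * ses + rng.normal(0, sigma, n_j)
    X = np.column_stack([np.ones(n_j), ses])
    # stage 1: ordinary least squares within each school
    beta_hats.append(np.linalg.lstsq(X, mathach, rcond=None)[0])

beta_hats = np.array(beta_hats)
# stage 2: the unweighted mean of the per-school estimates is the sector line
sector_line = beta_hats.mean(axis=0)
```

Note that every school gets equal weight here regardless of its sample size or SES spread.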

What’s wrong with That?

Essentially, it gives the least-squares line of each school equal weight in estimating the mean line for each sector. If every school sample had:

  1. the same number of subjects
  2. the same set of values of SES
  3. the same σ2 (we’ve been tacitly assuming this anyways)

Then giving each least-squares line the same weight would make sense.

In this case, the shape of the confidence ellipses in beta space would be identical for every school and the 2 stage procedure would give the same result as a classical two-level ANOVA.

These data have different N’s and different sets of values for SES in each school. The two-stage process ignores the fact that some \hat{\beta_j} s are measured with greater precision than others (e.g. if the jth school has a large N or a large spread in SES, its \hat{\beta_j} is measured with more precision).

We could attempt to correct this problem by using weights proportional to

Var(\hat{\beta_j}|\beta_j) = \sigma^2(\mathbf{X_j^T}\mathbf{X_j})^{-1}

but this is the variance of \hat{\beta_j} as an estimate of the true βj for the jth school, NOT as an estimate of the underlying β for the sector.

We would need to use weights proportional to:

Var(\hat{\beta_j}) = \mathbf{V_j} = \sigma^2(\mathbf{X_j^T}\mathbf{X_j})^{-1}+ \mathbf{T}

where \mathbf{T} (the Greek letter tau) denotes the between-school variability,

but we don’t know \mathbf{V_j} without knowing σ2 and \mathbf{T}.

Another approach would be to use ordinary least-squares with School as a categorical predictor. This has other drawbacks. Comparing sectors is difficult. We can’t just include a “Sector” effect because it would be collinear with Schools. We could construct a Sector contrast but the confidence intervals and tests would not generalize to new samples of schools, only to new samples of students from the same sample of schools.

The Hierarchical Model

We develop the ideas for mixed and multilevel modeling in two stages:

  1. Multilevel models as presented in Bryk and Raudenbush (1992) in which the unobserved parameters at the lower level are modeled at the higher level. This is the representation used in HLM, the software developed by Bryk and Raudenbush and, to a limited extent in MLwiN.
  2. Mixed model in which the levels are combined into a combined equation with two parts: one for ‘fixed effects’ and the other for ‘random effects.’ This is the form used in SAS and in many other packages.

Although the former is more complex, it seems more natural and provides a more intuitive approach to the models.

We will use the high school Math Achievement data mentioned above as an extensive example. We think of our data as structured in two levels: students within schools and between schools.

We also have two types of predictor variables:

  1. within-school variables: individual student variables: SES, female, minority status. These variables are also known by many other names, e.g. inner variables, micro variables, level-1 variables, or time-varying variables in the longitudinal context.
  2. between-school variables: Sector: Catholic or Public, meanses, size. These variables are also known as outer variables, macro variables, level-2 variables, or time-invariant variables in a longitudinal context. A between-school variable can be created from a within-school variable by taking the average of the within-school variable within each school. Such a derived between-school variable is known as a ‘contextual’ variable. Of course, such a variable is useful only if the average differs from school to school. Balanced data in which the set of values of within-school variables would be the same in each school does not give rise to contextual variables.

In some hierarchical modeling traditions the numbering of levels is reversed going from the top down instead of going from the bottom up. One needs to check which approach an author is using.

Basic structure of the model:

  1. Each school has a true regression line that is not directly observed
  2. The observations from each school are generated by taking random observations generated with the school’s true regression line
  3. The true regression lines come from a population or populations of regression lines

Within School model

For school j: (For now we suppose all schools come from the same population, e.g. only one Sector)

  1. True but unknown \beta_j = \begin{bmatrix} \beta_{0j} \\ \beta_{SESj} \\  \end{bmatrix} = \begin{bmatrix} \beta_{0j} \\ \beta_{1j}\\ \end{bmatrix} for each school
  2. The data are generated as \mathbf{Y_{ij}} = \beta_{0j} + \beta_{1j}X_{ij} + \varepsilon_{ij} with
\varepsilon_{ij} \sim N(0,\sigma^2) independent of the βjs

Between School model

We start by supposing that the \beta_j = \begin{bmatrix} \beta_{0j} \\ \beta_{SESj} \\  \end{bmatrix} = \begin{bmatrix} \beta_{0j} \\ \beta_{1j}\\ \end{bmatrix} are sampled from a single population of schools. In vector notation this is

\beta_j = \gamma + \mathbf{u_j}
\mathbf{u_j} \sim N(0,\mathbf{T})
\mathbf{T} = \begin{bmatrix} \tau_{00} & \tau_{10} \\ \tau_{10} & \tau_{11}\\ \end{bmatrix}

Where \mathbf{T} is a variance matrix.

Writing out the elements of the vectors:

\beta_j = \begin{bmatrix} \beta_{0j} \\ \beta_{1j} \end{bmatrix} = \begin{bmatrix} \gamma_0 \\ \gamma_1 \end{bmatrix} + \begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix}  ,  \begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N(\begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{bmatrix})


Var(\beta_{0j}) = \tau_{00}
Var(\beta_{1j}) = \tau_{11}
Cov(\beta_{0j}, \beta_{1j}) = \tau_{10} = \tau_{01}

By combining the two models above...we essentially get the mixed model!
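Written out, the combination is just a substitution of the between-school model into the within-school model:

```latex
\begin{aligned}
Y_{ij} &= \beta_{0j} + \beta_{1j}X_{ij} + \varepsilon_{ij} \\
       &= (\gamma_0 + u_{0j}) + (\gamma_1 + u_{1j})X_{ij} + \varepsilon_{ij} \\
       &= \underbrace{\gamma_0 + \gamma_1 X_{ij}}_{\text{fixed effects}}
        + \underbrace{u_{0j} + u_{1j}X_{ij} + \varepsilon_{ij}}_{\text{random effects}}
\end{aligned}
```

The first bracket contains only fixed, unknown constants; the second contains everything random.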

Between-School Model: What γ means

Combined Model (Mixed Models)

From the multilevel to the combined form

Note the following:

  1. The fixed model contains both outer variables and inner variables as well as an interaction between an inner and an outer variable. This kind of interaction is called a ‘cross-level’ interaction. It allows the effect of X to be different in each Sector.
  2. The random effects model contains only an intercept and an inner variable. There are very arcane situations in which it might make sense to include an outer variable in the random effects portion of the model, which we will consider briefly later.

Understanding the connection between the multilevel model and the combined model is useful because some packages require the model to be specified in its multilevel form (e.g. MLWin) while others require the model to be specified in its combined form as two models: the fixed effects model and the random effects model (e.g. SAS PROC MIXED, R and S-Plus lme() and nlme()).

GLS form of the model

Matrix form

Notational Babel

Mixed models were simultaneously and semi-independently developed by researchers in many different disciplines, each developing its own notation. The notation we are using here is that of Bryk and Raudenbush (1992), which has been very influential in social research. Many publications use this notation. It differs from the notation used in SAS documentation, whose development was more influenced by seminal statistical work in animal husbandry. It is, of course, perfectly normal to fit models in SAS but to report findings using the notation in common use in the subject matter area. A short bilingual dictionary follows. Fortunately, Y, X and Z are used with the same meaning.

The GLS fit

Simple Mixed Models

We have now built up the notation and some theory for a fairly general form of the linear mixed model with both inner and outer variables and a random effects model with a random intercept and a random slope. We will now consider the interpretation of simpler models in which we keep only some components of the more general model. Even when we are interested in the larger model, it is important to understand the simple ‘sub-models’ because they are used for hypothesis testing in the larger model. We will also consider some extensions of the concepts we have seen so far in the context of some of these simpler models.

One-way ANOVA with random effects

Estimating the one-way ANOVA model

Mixed model approach


Complex Mixed Models

Means as outcomes regression

One-way ANCOVA with random effects

Random coefficients model

Intercepts and Slopes as outcomes

Nonrandom slopes

Asking questions: CONTRAST and ESTIMATE statement

We will look more deeply into multilevel models to see what our estimators are really estimating. We will learn about the potential bias of estimators when between cluster and within cluster effects are different and we will see the role of contextual variables (e.g. cluster averages of inner variables) and within cluster residuals (i.e. inner variables centered by cluster means).

What is a mixed model really estimating

Var(Y|X) and T

There are two ways of visualizing T: directly as Var(uj) or through its effect on the shape of the variance of Y as a function of the predictor X. This relationship is more than a mathematical curiosity. It is important for an understanding of the interpretation of the components of T and for understanding the meaning of hypotheses involving T. We will consider the case of a single predictor and the case of two predictors.

Random slope model

Two random predictors

Interpreting Chol(T)

Fitting a multilevel model: contextual effects


Longitudinal Data

The structure of longitudinal data shares much with that of multilevel data. Turn the school into a subject and the students in the school into times or ‘occasions’ on which the subject is measured and, except for the fact that measurements are ordered in time, the structures are essentially the same. Mixed models offer only one of a number of ways of approaching the analysis of longitudinal data, but mixed models are extensions of some of the traditional ways and provide more flexibility. We will briefly review traditional methods and then discuss the use of mixed models for longitudinal data.

The best method depends on a number of factors:

  • The number of subjects: if the number is very small it might not be feasible to treat subjects as random. The analysis then focuses on an ‘anecdotal’ description of the observed subjects and does not generalize to the population of subjects. Statistical models play a role in analyzing the structure of responses within subjects, especially if the subjects were measured on a large number of occasions. This situation is typical of research in psychotherapy where a few subjects are observed over a long time. With long series and few subjects, time series analysis can be an appropriate method.
  • The number of occasions and whether the timing of occasions is fixed or variable from subject to subject. With a small number (<10?) of fixed occasions and with the same ‘within subject’ design for each subject, the traditional methods of repeated measures analysis can be used. There are two common repeated measures models: univariate repeated measures and multivariate repeated measures. In univariate repeated measures models the random variability from subject to subject is (almost) equivalent to a random intercept mixed model. In multivariate repeated measures, no constraints are placed on the intercorrelations between occasions. These two models can be thought of as extreme cases: a model with only one covariance parameter and a model with all possible covariance parameters. They turn out to be special cases of mixed models, which provide the ability to specify a whole range of models between these extremes: models with simpler covariance structures based on the structure of the data and on the fact that, having been observed in time, observations that are close in time are often expected to be more strongly correlated than observations that are farther apart in time.
  • Mixed model methods are particularly well suited to data with many subjects and variable within-subject designs, either due to variable occasions or to time-varying covariates.

We will look at four data examples to illustrate the breadth of applications of mixed models to longitudinal data:

  • The classical Potthoff and Roy (1964) data of jaw sizes of 27 children (16 boys and 11 girls) at ages 8, 10, 12 and 14. This is a simple yet interesting example of longitudinal data and helps understand the major concepts in a simple context.
  • PSID: Panel Study of Income Dynamics: An excerpt from the PSID conducted by ISR at the University of Michigan at Ann Arbor will illustrate the application of methods on a survey data set with fixed occasions and time-varying covariates.
  • IQ recovery of post-coma patients: Long-term non-linear model with highly variable timing for occasions.
  • Migraine diaries: A dichotomous dependent variable with a non-linear long-term response model including seasonal periodic components and long-term non-stationarity.

The four data sets can be downloaded as self-extracting files from the course web site at Longitudinal Data Link.

After downloading the files into a convenient directory in Microsoft Windows, double click on the file icon to extract the SAS ssd file. Then, from SAS, define the directory as a library. In the following code, it is assumed that the name of the SAS library is ‘MIXED’. The SAS data set names for the four data sets above are: PR, PSID, IQ and MIGRAINE, respectively.

The basic model

We can think of longitudinal data as multilevel data with t (a time index) replacing the within-cluster index i. We could keep j for subjects but, to be consistent with many authors such as Snijders and Bosker (1999), we will switch from j to i. We also switch the order of the indices: the cluster index comes first, so Yit denotes subject (cluster) i observed on occasion t. The other difference is that at least one of the X variables will be time or a function of time. A simple model for the Potthoff and Roy data with a linear trend in time that varies randomly from child to child would look just like a multilevel model with the following level 1 model:

Y_{it} = \beta_{0i} + \beta_{1i}X_{it} + \varepsilon_{it}

where i ranges over the 27 subjects, t is the time index: t=1,2,3,4, and Xit takes on the actual years: 8,10,12,14.

Now we come to the main departure from multilevel models: what to assume about \varepsilon_{it}. It no longer seems reasonable to unquestioningly assume that \varepsilon_{it}s are independent as t varies for a given i. Observations close in time might be more strongly related than observations that are farther apart. (Note that this relationship does not necessarily lead to a positive correlation.)

The Laird-Ware form of the model for a single child is:

\mathbf{Y_{i}} = \mathbf{X_i}\mathbf{\beta_i} + \mathbf{\varepsilon_i}
= \mathbf{X_i}\gamma + \mathbf{Z_i}\mathbf{u_i} + \varepsilon_i
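The Laird-Ware form can be made concrete with a small simulation (Python/NumPy; here Z_i = X_i gives a random intercept and slope, and all parameter values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
ages = np.array([8.0, 10.0, 12.0, 14.0])   # the Potthoff-Roy occasions
gamma = np.array([17.0, 0.5])              # fixed effects (illustrative values)
T = np.array([[3.0, 0.1],
              [0.1, 0.02]])                # Var(u_i): between-child variability
sigma = 1.2                                # within-child SD (illustrative)
n_children = 27

X_i = np.column_stack([np.ones(4), ages])  # fixed-effects design for one child
Z_i = X_i                                  # random intercept and slope: Z_i = X_i

Y = np.empty((n_children, 4))
for i in range(n_children):
    u_i = rng.multivariate_normal(np.zeros(2), T)
    Y[i] = X_i @ gamma + Z_i @ u_i + rng.normal(0, sigma, 4)

# marginal covariance of Y_i implied by the model: Z_i T Z_i' + sigma^2 I
V_i = Z_i @ T @ Z_i.T + sigma**2 * np.eye(4)
```

The marginal covariance V_i is no longer a multiple of the identity: the random effects induce correlation between a child's occasions.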

Analyzing longitudinal data

Classical or Mixed models

Potthoff and Roy

Univariate ANOVA

MANOVA repeated measures

Random Intercept Model with Autocorrelation

Comparing Different Covariance Models

Using a REPEATED statement with missing occasions

The REPEATED statement defines a model for the Σ matrix. The default choice, \Sigma = \sigma^2 I, does not require the same number of occasions for each subject. More complex choices, e.g. UN (unstructured), FA0(T), AR(1), etc., are not inherently defined if the number of occasions changes from subject to subject. One can think of a Σ matrix defined for the largest number of occasions. For subjects who do not have the maximum number of occasions, it is necessary to determine how the recorded occasions correspond to the rows of the Σ matrix. For example, if the Σ matrix is 5 x 5 and a subject with incomplete data has data for only 3 occasions, it is necessary to know whether these three occasions correspond to the 1st, 2nd and 3rd rows of Σ, or the 1st, 3rd and 4th, etc. This problem suggests that,
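The correspondence problem can be made concrete with a small sketch (Python/NumPy, with arbitrary ρ and σ²): the covariances among a subject's observations depend on which rows of the full Σ the observed occasions occupy, not merely on how many occasions there are.

```python
import numpy as np

rho, sigma2, t_max = 0.6, 2.0, 5
times = np.arange(t_max)
# AR(1) covariance for the full set of 5 occasions
Sigma = sigma2 * rho ** np.abs(np.subtract.outer(times, times))

# a subject observed only on the 1st, 3rd and 4th occasions (0-based: 0, 2, 3)
obs = np.array([0, 2, 3])
Sigma_i = Sigma[np.ix_(obs, obs)]

# adjacent recorded observations are NOT equally correlated:
# occasions 0 and 2 are two steps apart, occasions 2 and 3 only one
assert np.isclose(Sigma_i[0, 1], sigma2 * rho ** 2)
assert np.isclose(Sigma_i[1, 2], sigma2 * rho)
```

Had the same three observations been treated as the 1st, 2nd and 3rd rows, both covariances would have been σ²ρ, a different (and wrong) model.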

Links and other finds


Interesting links

  • Links for longitudinal data analysis
  • Peter Lane (2004) ( "Comparing two strategies for primary analysis of longitudinal trials with missing data" Powerpoint presentation.
Interesting comparison of different approaches to dealing with missing data including mixed models and multiple imputation. The paper might have missed the point that mixed models will work for MAR data if the predictors with respect to which the data are MAR are included in the 'target model'. Multiple imputation, on the other hand, only needs these variables in the imputation model, not necessarily in the target model.
  • Georges Monette's home page (


Day 3

R scripts


  • R versions of the partial (used in the notes) and full (for you to play with) High School Math Achievement data sets can be obtained as follows:
 > hs <- read.csv("")
 > hsfull <- read.csv("")
See a summary description of the variables.