Statistics: Validation

From MathWiki

Good source: Harrell, F. E. (2001). Regression Modeling Strategies. Springer, pp. 99ff.

Purpose of validation

To assess how well the model is likely to fit new data from the same population. We can tell how well it fits the data set being analyzed, but too complex a model may just be fitting random aspects of that particular data set. In other words, we want to fit the signal, not the noise.

We need a measure of predictive accuracy that is not biased by overfitting.

3 causes of failure to validate:

  1. overfitting
  2. changes in measurement methods or in definitions of variables
  3. changes in subject inclusion criteria

Modes of model validation:

  1. internal: use analyzed data
  2. external: use new data or data splitting
See the functions val.prob and val.surv in the Design package for R.
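A minimal sketch of split-sample (external) validation, in plain Python rather than the Design functions: fit on one half of the data, then compare apparent R² on the training half against R² on the held-out half. The function names and the simulated data are my own.

```python
import random
import statistics

def fit_ols(x, y):
    """Least-squares intercept and slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

def r_squared(x, y, a, b):
    """R^2 of the predictions a + b*x against y."""
    my = statistics.fmean(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

random.seed(2)
# One sample of 100, split in half; the model sees only the first half.
x = [random.gauss(0, 1) for _ in range(100)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]   # genuine but weak signal
a, b = fit_ols(x[:50], y[:50])
print(round(r_squared(x[:50], y[:50], a, b), 3))  # apparent (training) R^2
print(round(r_squared(x[50:], y[50:], a, b), 3))  # held-out R^2, usually lower
```

Splitting wastes data and gives a noisy estimate, which is why resampling-based internal validation is often preferred, but it makes the apparent-vs-new-data contrast concrete.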


Simple indexes of fit:

  1. R2: good, but biased upward by overfitting
  2. adjusted R2: unbiased, but only if the model was prespecified; not valid for selected models. That is, it is okay if p is honest and counts all variables ever examined (formally or informally)
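The adjusted-R² correction above can be sketched in a few lines (plain Python; the function name and example numbers are my own):

```python
def r2_adjusted(r2, n, p):
    """Adjusted R^2: 1 - (1 - R^2)(n - 1)/(n - p - 1).

    Unbiased only when the p predictors were prespecified,
    not arrived at by variable selection."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# An apparent R^2 of 0.30 with n = 50 observations and p = 10
# predictors shrinks considerably after the penalty:
print(round(r2_adjusted(0.30, 50, 10), 3))  # 0.121
```

The penalty grows with p relative to n, which is why an honest count of every variable examined matters.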

Two aspects of predictive accuracy that need to be assessed:

  1. Calibration (reliability): the ability of the model to make unbiased estimates of outcomes
  2. Discrimination: the ability of the model to separate subjects' outcomes

First aspect (Section 4.5 of Harrell): let Beta(Y, Yhat) denote the slope from regressing Y on the predictions Yhat. Then Beta(Y, Yhat) = 1 on the data used to fit the model, but Beta(Y, Yhat) < 1 on new data.

Example: 10 samples of size 50 from iid N(0.5, 1).
Beta(Y, Yhat) = 1 on the fitted data but Beta(Y, Yhat) = 0 on new data, since with Y iid the fitted predictions carry no real information.
Let \gamma be the shrinkage factor. The heuristic estimate of van Houwelingen and le Cessie is \hat{\gamma}=\frac{\mbox{model } \chi^2 - p}{\mbox{model } \chi^2}.

For the ordinary linear model \hat{\gamma}=\frac{n-p-1}{n-1}\frac{R^2_{adj}}{R^2},

where 1 - R^2_{adj} = (1 - R^2) \frac{n-1}{n-p-1}, leading to the shrunken estimates \hat{\beta}^s_0=(1-\hat{\gamma})\bar{Y} + \hat{\gamma}\hat{\beta}_0 and \hat{\beta}^s_j=\hat{\gamma}\hat{\beta}_j,\; j = 1,\dots,p.
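The shrinkage calculation for the linear-model case can be sketched as follows (plain Python; the function names and the example values R² = 0.4, n = 100, p = 10 are my own):

```python
def shrinkage_linear(r2, n, p):
    """Heuristic shrinkage estimate for the ordinary linear model:
    gamma_hat = (n - p - 1)/(n - 1) * R^2_adj / R^2."""
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return (n - p - 1) / (n - 1) * r2_adj / r2

def shrink_coefficients(gamma, b0, betas, ybar):
    """Shrunken estimates: slopes scaled by gamma, intercept
    pulled toward the overall mean of Y."""
    b0_s = (1 - gamma) * ybar + gamma * b0
    return b0_s, [gamma * b for b in betas]

# With R^2 = 0.4, n = 100 and p = 10 predictors, about a quarter
# of the apparent signal is expected to be overfitting:
gamma = shrinkage_linear(0.4, 100, 10)
print(round(gamma, 3))  # 0.747
```

Multiplying every slope by \hat{\gamma} while recentering the intercept on \bar{Y} leaves the mean prediction unchanged but flattens the predictions, pre-compensating for the calibration slope being below 1 on new data.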

Data for example by ? ?