Statistics: Validation


Good source: Harrell, F. E. (2001). Regression Modeling Strategies. Springer, pp. 99ff.

Some links

http://www.quantlet.com/mdstat/scripts/csa/html/node123.html

Purpose of validation

To assess how well the model is likely to fit new data (i.e., the population). We can tell how well it fits the data set being analyzed, but too complex a model may just be fitting random aspects of that particular data set. In other words, we want to fit the signal, not the noise.

We need a measure of predictive accuracy that is not biased by overfitting.
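
A minimal sketch of the problem (assuming pure-noise predictors and an OLS fit; the sizes and seed are illustrative): the apparent R^2 on the analyzed data can be sizable even when there is no signal at all, while on new data from the same population it collapses.

 # Overfitting demo: apparent vs. new-data R^2 with pure-noise predictors
 import numpy as np

 rng = np.random.default_rng(0)
 n, p = 50, 20                      # small n, many candidate predictors

 def r2(y, yhat):
     return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

 X = rng.normal(size=(n, p))        # predictors carry no signal
 y = rng.normal(size=n)             # outcome is pure noise
 X1 = np.column_stack([np.ones(n), X])
 beta = np.linalg.lstsq(X1, y, rcond=None)[0]

 X_new = rng.normal(size=(n, p))    # fresh data, same population
 y_new = rng.normal(size=n)
 X1_new = np.column_stack([np.ones(n), X_new])

 print("apparent R^2:", r2(y, X1 @ beta))          # sizable, from noise alone
 print("new-data R^2:", r2(y_new, X1_new @ beta))  # near zero or negative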

3 causes of failure to validate:

  1. overfitting
  2. changes in measurement methods, definitions of variables
  3. changes in subject inclusion criteria

Modes of model validation:

  1. internal: use the analyzed data itself
  2. external: use new data, or data splitting

See val.prob and val.surv in the Design package.
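
For illustration, a rough sketch of bootstrap (optimism-corrected) internal validation, the idea behind the validation functions in Design; the OLS model, the R^2 index, and 200 resamples are choices made for this example, not the package's actual implementation.

 # Optimism bootstrap: corrected R^2 = apparent R^2 - average optimism
 import numpy as np

 rng = np.random.default_rng(1)

 def fit(X, y):
     X1 = np.column_stack([np.ones(len(y)), X])
     return np.linalg.lstsq(X1, y, rcond=None)[0]

 def r2(X, y, beta):
     X1 = np.column_stack([np.ones(len(y)), X])
     resid = y - X1 @ beta
     return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

 n, p = 100, 5
 X = rng.normal(size=(n, p))
 y = X[:, 0] + rng.normal(size=n)   # one real predictor, four noise ones

 beta = fit(X, y)
 apparent = r2(X, y, beta)

 optimism = []
 for _ in range(200):               # resample the analyzed data itself
     idx = rng.integers(0, n, n)
     b = fit(X[idx], y[idx])
     # optimism = performance on the bootstrap sample minus performance
     # of the same fit evaluated on the original data
     optimism.append(r2(X[idx], y[idx], b) - r2(X, y, b))

 print("apparent R^2 :", apparent)
 print("corrected R^2:", apparent - np.mean(optimism))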

Measures:

  1. R2: useful, but optimistically biased
  2. adjusted R2: unbiased, but only if the model was prespecified -- not valid for models arrived at by variable selection; i.e., okay only if p is honest and counts all variables ever examined (formally or informally)
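
A quick numeric check of the penalty, using the formula 1 - R^2_{adj} = (1 - R^2)\frac{n-1}{n-p-1} given below: with n = 50 and an apparent R^2 of 0.30, the adjusted value depends heavily on how many predictors p were examined.

 # Adjusted R^2 for a fixed apparent R^2 and an increasing (honest) p
 n, R2 = 50, 0.30
 for p in (2, 10, 20):
     R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
     print(p, round(R2_adj, 3))    # 2 -> 0.270, 10 -> 0.121, 20 -> -0.183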

Two aspects of predictive accuracy that need to be assessed:

  1. Calibration (reliability): the ability of the model to make unbiased estimates of the outcome
  2. Discrimination: the ability of the model to separate subjects' outcomes
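
A sketch of how the two aspects might be computed on held-out data; regressing observed on predicted gives the calibration intercept and slope, and Spearman rank correlation is assumed here as one convenient discrimination index (other indexes are common).

 # Calibration (intercept/slope of observed vs. predicted) and a simple
 # rank-based discrimination index
 import numpy as np

 def calibration(y, yhat):
     # slope and intercept from regressing observed Y on predicted Y;
     # ideal values are intercept 0 and slope 1
     slope, intercept = np.polyfit(yhat, y, 1)
     return intercept, slope

 def discrimination(y, yhat):
     # Spearman rank correlation: can the predictions order the outcomes?
     ry = np.argsort(np.argsort(y))
     rp = np.argsort(np.argsort(yhat))
     return np.corrcoef(ry, rp)[0, 1]

 y = np.array([1.2, 0.3, 2.5, 1.9, 0.7])
 yhat = np.array([1.0, 0.5, 2.0, 1.8, 0.6])
 print(calibration(y, yhat), discrimination(y, yhat))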

First aspect (sec. 4.5 of Harrell): the slope \beta(Y, \hat{Y}) from regressing Y on \hat{Y} equals 1 on the fitted data, but \beta(Y, \hat{Y}) < 1 on new data.

Example: 10 samples of size 50, all i.i.d. N(0.5, 1): \beta(Y, \hat{Y}) = 1 on the fitted data but \beta(Y, \hat{Y}) = 0 on new data, since the samples differ only by chance.
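
A simulation of this example (my reconstruction of the setup, with the fitted values taken to be the 10 sample means):

 # 10 groups of 50 i.i.d. N(0.5, 1) draws, "model" = group means.
 # The slope of Y on Yhat is 1 on the fitted data by construction, but
 # about 0 on new draws, because the groups differ only by chance.
 import numpy as np

 rng = np.random.default_rng(2)
 y = rng.normal(0.5, 1, size=(10, 50))       # 10 samples of size 50
 yhat = np.repeat(y.mean(axis=1), 50)        # fitted value = sample mean
 y_new = rng.normal(0.5, 1, size=(10, 50))   # new data, same population

 print("slope on fitted data:", np.polyfit(yhat, y.ravel(), 1)[0])      # = 1
 print("slope on new data   :", np.polyfit(yhat, y_new.ravel(), 1)[0])  # ~ 0
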
Let \gamma be the shrinkage factor.

van Houwelingen and le Cessie: \hat{\gamma} = \frac{\text{model } \chi^2 - p}{\text{model } \chi^2}

For the ordinary linear model, \hat{\gamma} = \frac{n-p-1}{n-1} \frac{R^2_{adj}}{R^2},

where 1 - R^2_{adj} = (1 - R^2) \frac{n-1}{n-p-1}, leading to the shrunken estimates \hat{\beta}^s_0 = (1-\hat{\gamma})\bar{Y} + \hat{\gamma}\hat{\beta}_0 and \hat{\beta}^s_j = \hat{\gamma}\hat{\beta}_j, j = 1, \ldots, p.
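
Applying these formulas in a short sketch (simulated data; uses the linear-model form of \hat{\gamma} above):

 # Heuristic shrinkage for OLS: gamma-hat from R^2 and R^2_adj, then
 # shrink the intercept toward the grand mean and scale the slopes
 import numpy as np

 rng = np.random.default_rng(3)
 n, p = 100, 5
 X = rng.normal(size=(n, p))
 y = 0.5 * X[:, 0] + rng.normal(size=n)

 X1 = np.column_stack([np.ones(n), X])
 beta = np.linalg.lstsq(X1, y, rcond=None)[0]
 resid = y - X1 @ beta
 R2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
 R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)

 gamma = (n - p - 1) / (n - 1) * (R2_adj / R2)
 b0_s = (1 - gamma) * y.mean() + gamma * beta[0]   # shrunken intercept
 b_s = gamma * beta[1:]                            # shrunken slopes
 print("gamma-hat:", gamma)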

Data for the example by ? ?

http://cdiac.ornl.gov/ftp/ndp070