# Statistics: Validation

### From MathWiki

Good source: Harrell, F. E. (2001). *Regression Modeling Strategies.* Springer, pp. 99ff.


## Some links

http://www.quantlet.com/mdstat/scripts/csa/html/node123.html


## Purpose of validation

To assess how well the model is likely to fit new data from the same population. We can tell how well it fits the data set being analyzed, but too complex a model may simply be fitting random aspects of that particular data set. In other words, we want to fit signal, not noise.

We need a measure of *predictive accuracy* that is not biased by overfitting.
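A minimal sketch of this bias (illustrative, not from Harrell): fit an ordinary linear model with many noise predictors to a small sample and compare the apparent R² on the training data with the R² achieved on new data from the same population. All names and constants here are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20                      # small sample, many candidate predictors

def r2(y, yhat):
    """Fraction of variance explained."""
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

X = rng.normal(size=(n, p))
y = rng.normal(0.5, 1.0, size=n)   # outcome independent of all predictors
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

# Apparent fit: optimistically high even though there is no signal at all.
r2_train = r2(y, Xd @ beta)

# New data from the same population: the model predicts nothing.
Xnew = rng.normal(size=(n, p))
ynew = rng.normal(0.5, 1.0, size=n)
r2_new = r2(ynew, np.column_stack([np.ones(n), Xnew]) @ beta)

print(f"apparent R^2 = {r2_train:.2f}, new-data R^2 = {r2_new:.2f}")
```

The apparent R² is substantial (roughly p/(n−1) is expected by chance) while the new-data R² hovers around zero: the training-sample measure is biased by overfitting.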

Three causes of failure to validate:

- overfitting
- changes in measurement methods, definitions of variables
- changes in subject inclusion criteria

Modes of model validation:

- internal: uses the data being analyzed
- external: uses new data, or data splitting
- see `val.prob` and `val.surv` in the Design package
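Internal validation is usually done by resampling. The sketch below hand-rolls the bootstrap optimism method (the idea behind `validate()` in Harrell's Design/rms package, though this is not that function): refit the model on bootstrap resamples and estimate how optimistic the apparent R² is.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)   # one real predictor, four noise ones
Xd = np.column_stack([np.ones(n), X])

def fit(Xd, y):
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

beta = fit(Xd, y)
apparent = r2(y, Xd @ beta)

optimism = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)          # bootstrap resample
    b = fit(Xd[idx], y[idx])
    # performance of the bootstrap model on its own sample vs. the original
    optimism.append(r2(y[idx], Xd[idx] @ b) - r2(y, Xd @ b))

corrected = apparent - np.mean(np.array(optimism))
print(f"apparent R^2 = {apparent:.2f}, optimism-corrected = {corrected:.2f}")
```

The corrected estimate is the apparent R² minus the average optimism, and is what internal validation reports.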

Measures:

- R2: good, but biased (optimistic) on the fitted data
- adjusted R2: unbiased, but only if the model was prespecified -- not for selected models; i.e. okay only if *p* is honest and counts all variables ever examined (formally or informally)
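A sketch of why *p* must be honest (illustrative, not from the text): screen 20 pure-noise predictors, keep only the single best, and compute adjusted R² pretending p = 1. Counting all 20 examined variables gives a very different answer.

```python
import numpy as np

rng = np.random.default_rng(2)
n, cand = 50, 20
X = rng.normal(size=(n, cand))
y = rng.normal(0.5, 1.0, size=n)   # no predictor carries any signal

def r2_adj(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Apparent R^2 of each single-predictor model is the squared correlation.
r2s = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(cand)])
best = r2s.max()

print(f"selected-model adj R^2 (p=1):  {r2_adj(best, n, 1):.3f}")
print(f"honest adj R^2 (p=20):         {r2_adj(best, n, cand):.3f}")
```

With a dishonest p = 1 the adjusted R² stays misleadingly positive; charging for all 20 candidates pushes it toward (or below) zero, which is the truth here.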

Two aspects of predictive accuracy that need to be assessed:

- Calibration (reliability): the ability of the model to make unbiased estimates of outcomes
- Discrimination: the ability of the model to *separate* subjects' outcomes
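The two aspects are distinct: the same predictions can discriminate perfectly yet be badly calibrated. A sketch (illustrative numbers, not from the text) where predictions are twice the true signal, so the ordering of outcomes is preserved but the estimates are biased:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
truth = rng.normal(size=n)
y = truth + 0.1 * rng.normal(size=n)
yhat = 2 * truth                     # right ordering, wrong scale

# Calibration: slope from regressing Y on Yhat (1.0 means unbiased).
slope = np.polyfit(yhat, y, 1)[0]

# Discrimination: rank (Spearman) correlation between Yhat and Y.
def ranks(a):
    r = np.empty(len(a))
    r[np.argsort(a)] = np.arange(len(a))
    return r

rho = np.corrcoef(ranks(yhat), ranks(y))[0, 1]

print(f"calibration slope = {slope:.2f}, rank correlation = {rho:.2f}")
```

Here discrimination is nearly perfect while the calibration slope is about 0.5: the model separates outcomes well but its estimates are systematically too extreme.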

First aspect (Section 4.5 of Harrell): regressing Y on Yhat gives slope 1 on the fitted data, but the slope is typically < 1 on new data (overfitted predictions are too extreme and must be shrunk toward the mean).

- Example: 10 samples of size 50, with Y iid N(0.5, 1) and independent of the candidate predictors; the slope of Y on Yhat is 1 on the fitted data but 0 on new data.
- Let γ denote this shrinkage (calibration) slope needed to recalibrate the predictions on new data.
- van Houwelingen and le Cessie's heuristic shrinkage estimate: γ̂ = (model χ² − p) / model χ², where p is the number of candidate parameters.
- For the ordinary linear model, model χ² = −n log(1 − R²), leading to γ̂ ≈ R²_adj / R².
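The heuristic above can be checked numerically. A sketch (illustrative setup, not from the text) computing γ̂ = (LR − p)/LR with LR = −n log(1 − R²) for an ordinary linear model, compared with the empirical calibration slope from regressing new-data Y on Yhat:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.r_[0.3 * np.ones(3), np.zeros(p - 3)]  # 3 real predictors
y = X @ beta_true + rng.normal(size=n)
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

yhat = Xd @ beta
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

LR = -n * np.log(1 - r2)              # likelihood-ratio chi-square for OLS
gamma_heur = (LR - p) / LR            # heuristic shrinkage estimate
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Empirical calibration slope: regress new-data outcomes on predictions.
Xnew = rng.normal(size=(n, p))
ynew = Xnew @ beta_true + rng.normal(size=n)
pred = np.column_stack([np.ones(n), Xnew]) @ beta
slope = np.polyfit(pred, ynew, 1)[0]

print(f"gamma-hat = {gamma_heur:.2f}, R2_adj/R2 = {r2_adj / r2:.2f}, "
      f"new-data slope = {slope:.2f}")
```

The heuristic γ̂ and the ratio R²_adj/R² come out close to each other, and both anticipate the below-1 slope seen on the new data.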
