# Statistics: Validation

Good source: Harrell, F. E. (2001). *Regression Modeling Strategies*. Springer, pp. 99ff.

## Purpose of validation

To assess how well the model is likely to fit new data from the same population. We can tell how well it fits the data set being analyzed, but too complex a model may just be fitting random aspects of that particular data set. In other words, we want to fit signal, not noise.

We need a measure of predictive accuracy that is not biased by overfitting.
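As a quick illustration of how overfitting inflates apparent accuracy, here is a minimal numpy sketch (the sample sizes and seed are my own choices, not from the text): fit 20 pure-noise predictors to 50 observations, then compare in-sample $R^2$ with $R^2$ on fresh data from the same process.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20                       # 50 observations, 20 pure-noise predictors

X_train = rng.normal(size=(n, p))
y_train = rng.normal(size=n)        # outcome is unrelated to every predictor

A = np.column_stack([np.ones(n), X_train])          # add an intercept column
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)  # ordinary least squares

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

r2_train = r2(y_train, A @ coef)

# Fresh draws from the same (null) process
X_new = rng.normal(size=(n, p))
y_new = rng.normal(size=n)
r2_new = r2(y_new, np.column_stack([np.ones(n), X_new]) @ coef)
```

In-sample $R^2$ comes out sizeable even though every predictor is noise; on new data it collapses to about zero or below. That gap is exactly the overfitting bias a validation measure must remove.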

3 causes of failure to validate:

1. overfitting
2. changes in measurement methods or in definitions of variables
3. changes in subject inclusion criteria

Modes of model validation:

1. internal: use the data being analyzed
2. external: use new data or data splitting

See `val.prob` and `val.surv` in the Design package.
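Data splitting can be sketched in a few lines of numpy; the data-generating model and the 50/50 split below are invented for illustration. The model is fit on one half and assessed only on the held-out half.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)   # true signal plus unit-variance noise

# Hold out half the data: fit on one half, assess on the other
idx = rng.permutation(n)
train, test = idx[:n // 2], idx[n // 2:]

b1, b0 = np.polyfit(x[train], y[train], 1)   # least-squares line on training half

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

r2_test = r2(y[test], b0 + b1 * x[test])     # honest estimate of predictive accuracy
```

Because the model never saw the held-out half, `r2_test` is not inflated by overfitting; here it should sit near the true $R^2$ of 0.8.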

Measures:

1. $R^2$: useful but biased upward by overfitting
2. adjusted $R^2$: unbiased, but only if the model was prespecified -- not for selected models; i.e., it is honest only if $p$ counts all variables ever examined (formally or informally)
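The adjustment uses the identity $1 - R^2_{adj} = (1 - R^2)\frac{n-1}{n-p-1}$ stated at the end of this section. A small helper (my own naming) makes the penalty concrete:

```python
def adjusted_r2(r2, n, p):
    """1 - R^2_adj = (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# An apparent R^2 of 0.40 from p = 20 candidate predictors and n = 50
# observations adjusts to roughly zero:
penalized = adjusted_r2(0.40, n=50, p=20)
```

`penalized` comes out slightly negative (about -0.014), which is what 20 noise predictors in 50 observations deserve -- but only if $p = 20$ honestly counts every variable that was ever tried.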

Two aspects of predictive accuracy that need to be assessed:

1. Calibration (reliability): the ability of the model to make unbiased estimates of outcomes
2. Discrimination: the ability of the model to separate subjects' outcomes

First aspect (Section 4.5 of Harrell): the slope $\beta$ from regressing $Y$ on $\hat{Y}$ equals 1 on the fitted data but is $< 1$ on new data.

Example: 10 samples of size 50 from i.i.d. $N(0.5, 1)$. The slope of $Y$ on $\hat{Y}$ is 1 on the fitted data but 0 on new data, because the fit is pure noise.
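The slope phenomenon can be reproduced with numpy (sample sizes and seed below are arbitrary choices of mine): for an OLS fit with an intercept, regressing $Y$ on $\hat{Y}$ gives slope exactly 1 on the training data, while on new data from the same null process the slope collapses toward 0.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 10
X = rng.normal(loc=0.5, size=(n, p))
y = rng.normal(loc=0.5, size=n)     # outcome independent of all predictors

A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def slope(y, yhat):
    """Slope from regressing y on yhat -- the calibration slope."""
    return np.polyfit(yhat, y, 1)[0]

s_train = slope(y, A @ coef)        # exactly 1 for OLS with an intercept

# New draws from the same null process: the slope shrinks toward 0
X_new = rng.normal(loc=0.5, size=(n, p))
y_new = rng.normal(loc=0.5, size=n)
s_new = slope(y_new, np.column_stack([np.ones(n), X_new]) @ coef)
```

The training slope of 1 is a mechanical property of least squares, so it says nothing about predictive value; only the new-data slope does.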
Let $\gamma$ denote the shrinkage factor. van Houwelingen and le Cessie estimate it as $\hat{\gamma}=\frac{\text{model }\chi^2 - p}{\text{model }\chi^2}$

For the ordinary linear model: $\hat{\gamma}=\frac{n-p-1}{n-1}\,\frac{R^2_{adj}}{R^2}$

where $1 - R^2_{adj} = (1 - R^2)\,\frac{n-1}{n-p-1}$, leading to the shrunken estimates $\hat{\beta}^s_0=(1-\hat{\gamma})\bar{Y} + \hat{\gamma}\hat{\beta}_0$ and $\hat{\beta}^s_j=\hat{\gamma}\hat{\beta}_j$, $j = 1,\ldots,p$.
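Putting the linear-model shrinkage formulas together in one helper (the fitted values passed in below are hypothetical, just to exercise the arithmetic):

```python
import numpy as np

def heuristic_shrinkage(beta0, beta, ybar, r2, n):
    """van Houwelingen-le Cessie heuristic shrinkage for an OLS fit.

    gamma_hat = (n - p - 1)/(n - 1) * R2_adj / R2, with
    1 - R2_adj = (1 - R2) * (n - 1)/(n - p - 1).
    Intercept is pulled toward the mean of Y; slopes are scaled by gamma_hat.
    """
    p = len(beta)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    gamma = (n - p - 1) / (n - 1) * r2_adj / r2
    beta0_s = (1 - gamma) * ybar + gamma * beta0
    return gamma, beta0_s, gamma * np.asarray(beta)

# Hypothetical fit: n = 100, p = 5 slopes, apparent R^2 = 0.30, ybar = 2.0
gamma, b0s, bs = heuristic_shrinkage(beta0=1.0,
                                     beta=[0.5, -0.2, 0.1, 0.3, -0.4],
                                     ybar=2.0, r2=0.30, n=100)
```

With these numbers $\hat{\gamma}\approx 0.83$: every slope is pulled 17% of the way toward 0, and the intercept is pulled the same fraction of the way toward $\bar{Y}$, compensating for the optimism of the apparent fit.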