Prac07 Pearson Assignment 3

Individual Assignments #1

_____________________________________________________________________________

Coffee, Stress and Health Example

(Original) Question

Look at the data set http://www.math.yorku.ca/~georges/Data/coffee.csv. It has three relevant variables, 'Heart', which is a measure of heart condition -- the higher the less healthy; 'Coffee', a measure of coffee consumption, and finally, 'Stress', measure of occupational stress. How could you use this data to address the question whether coffee consumption is harmful to the heart? Discuss assumptions needed to get anywhere with the data and discuss the nature of various assumptions that might lead to different interpretations, if relevant.

More details are presented in the discussion part.

The variables "Coffee", "Stress" and "Heart" are assumed to be continuous and independent. Continuity seems a reasonable assumption because each variable takes many values over a substantial range. Independence is meant here in the sense that the variables measure quite distinct phenomena (coffee consumption, occupational stress and heart condition); it does not mean that they are uncorrelated in the data. We will fit linear regression models under the assumption that the errors are i.i.d. N(0, constant variance).

First, we will display the data in a table and do a scatterplot matrix in order to check the data and to get a sense of its structure and possible patterns of note.

Table Display of Data

ID  Coffee  Stress  Heart
 1      23      14      6
 2      35      34      9
 3      38      58     41
 4      48      50     31
 5      52      86     63
 6      56      73     44
 7      61      82     69
 8      62      87     80
 9      64      74     63
10      71      80     72
11      74      87     83
12      76      92     58
13      87     128    113
14      97     115     88
15     100     123     92
16     104     117     92
17     107     148    144
18     124     146    103
19     141     175    145
20     154     197    162

Scatterplot Matrix of Heart, Coffee and Stress

Please note that Coffee and Stress are highly confounded: Stress versus Coffee closely follows a straight line, as confirmed by the Coffee~Stress fit below. There do not appear to be any outliers of concern. We also note that Heart versus Coffee and Heart versus Stress both closely follow a straight-line (simple linear) fit.

We can also compute the estimated covariance matrix of these three variables from this data set.

$Cov(Coffee, Stress, Heart) = \begin{pmatrix}1244.116 & 1573.989 & 1378.705\\ 1573.989& 2103.484 &1878.768\\ 1378.705& 1878.768 &1785.147\\ \end{pmatrix}\!$
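The covariance matrix above can be reproduced directly from the table. The following is a minimal Python sketch using only the standard library; it is not the original computation (the output elsewhere on this page comes from R), and the names `DATA`, `sample_cov` and `cov_matrix` are illustrative.

```python
# (Coffee, Stress, Heart) for the 20 subjects, copied from the table above.
DATA = [
    (23, 14, 6), (35, 34, 9), (38, 58, 41), (48, 50, 31), (52, 86, 63),
    (56, 73, 44), (61, 82, 69), (62, 87, 80), (64, 74, 63), (71, 80, 72),
    (74, 87, 83), (76, 92, 58), (87, 128, 113), (97, 115, 88), (100, 123, 92),
    (104, 117, 92), (107, 148, 144), (124, 146, 103), (141, 175, 145),
    (154, 197, 162),
]


def sample_cov(x, y):
    """Unbiased sample covariance (divisor n - 1)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


coffee, stress, heart = (list(column) for column in zip(*DATA))
columns = [coffee, stress, heart]

# Rows and columns are ordered Coffee, Stress, Heart, as in the matrix above.
cov_matrix = [[sample_cov(u, v) for v in columns] for u in columns]

for row in cov_matrix:
    print("  ".join(f"{value:10.3f}" for value in row))
```

The off-diagonal entry for Coffee and Stress is large relative to the two variances, which is the numerical counterpart of the near-straight-line pattern seen in the scatterplot matrix.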

We fit the following linear models below:

a) Heart~Coffee+Stress (the conditional model), b) Heart~Coffee (the marginal model), c) Heart~Stress and d) Coffee~Stress

lm(formula = Heart ~ Coffee + Stress)

Residuals:
     Min       1Q   Median       3Q      Max
-13.5744  -7.5225  -0.4664   6.8669  18.0733

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -7.7943     5.7927  -1.346    0.196
Coffee       -0.4091     0.2918  -1.402    0.179
Stress        1.1993     0.2244   5.345 5.36e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.36 on 17 degrees of freedom
Multiple R-Squared: 0.9462, Adjusted R-squared: 0.9399
F-statistic: 149.6 on 2 and 17 DF, p-value: 1.620e-11

We note that Stress is the only coefficient that is significant at conventional levels, with the very low value Pr(>|t|) = 5.36e-05; the Coffee coefficient is not significant (Pr(>|t|) = 0.179). Stress and Coffee are highly correlated, but Stress appears to have the stronger linear relationship with Heart, and has thereby ‘edged’ Coffee out from appearing significant.
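The coefficient estimates of the Heart~Coffee+Stress model can be checked by solving the least-squares normal equations by hand. This is a small Python sketch under that approach, not the R call that produced the output; `fit_two_predictors` and `centered_cross_product` are hypothetical helper names.

```python
# (Coffee, Stress, Heart) for the 20 subjects, copied from the table above.
DATA = [
    (23, 14, 6), (35, 34, 9), (38, 58, 41), (48, 50, 31), (52, 86, 63),
    (56, 73, 44), (61, 82, 69), (62, 87, 80), (64, 74, 63), (71, 80, 72),
    (74, 87, 83), (76, 92, 58), (87, 128, 113), (97, 115, 88), (100, 123, 92),
    (104, 117, 92), (107, 148, 144), (124, 146, 103), (141, 175, 145),
    (154, 197, 162),
]


def centered_cross_product(x, y):
    """Sum of (x - mean(x)) * (y - mean(y))."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))


def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the 2x2 normal equations."""
    s11 = centered_cross_product(x1, x1)
    s22 = centered_cross_product(x2, x2)
    s12 = centered_cross_product(x1, x2)
    s1y = centered_cross_product(x1, y)
    s2y = centered_cross_product(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    n = len(y)
    b0 = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    return b0, b1, b2


coffee, stress, heart = (list(column) for column in zip(*DATA))
b0, b_coffee, b_stress = fit_two_predictors(coffee, stress, heart)
print(f"intercept={b0:.4f}  coffee={b_coffee:.4f}  stress={b_stress:.4f}")
```

The determinant `det` is small relative to the products that form it because Coffee and Stress are so highly correlated, which is one way to see why the individual coefficients are estimated imprecisely.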

Note that all the confidence intervals mentioned below are 95% confidence intervals.

By simple calculation, we can easily get the correlation matrix of the estimated $\beta's\!$ (ordered as Intercept, Coffee, Stress) in the conditional model, i.e. $Corr(\beta) = \begin{pmatrix} 1.00000000 & -0.25909905 & 0.04907943\\ -0.25909905 & 1.00000000 & -0.97297525\\ 0.04907943 & -0.97297525 & 1.00000000 \end{pmatrix}\!$

In addition, we have $cov(\beta) = \begin{pmatrix}33.55490528 & -0.43788275 & 0.06378995\\ -0.43788275 & 0.08511929 & -0.06369283\\ 0.06378995 & -0.06369283 & 0.05034421\\ \end{pmatrix}\!$

It is not hard to see that, in the conditional model, the estimated coefficients of Coffee and Stress have a strong negative linear relationship (correlation about -0.97), which is also very clear from the $\beta\!$ ellipse below.
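The strong negative correlation between the two slope estimates mirrors the strong positive correlation between the predictors. In a regression with an intercept and two predictors, the covariance matrix of the slopes is proportional to the inverse of the centered cross-product matrix, so their correlation is minus the sample correlation of Coffee and Stress. A minimal Python sketch of that calculation (variable names are illustrative):

```python
from math import sqrt

# (Coffee, Stress) pairs copied from the data table above; Heart is not needed
# because the correlation of the slope estimates depends only on the predictors.
PREDICTORS = [
    (23, 14), (35, 34), (38, 58), (48, 50), (52, 86), (56, 73), (61, 82),
    (62, 87), (64, 74), (71, 80), (74, 87), (76, 92), (87, 128), (97, 115),
    (100, 123), (104, 117), (107, 148), (124, 146), (141, 175), (154, 197),
]

coffee, stress = (list(column) for column in zip(*PREDICTORS))
mean_c = sum(coffee) / len(coffee)
mean_s = sum(stress) / len(stress)

s_cc = sum((c - mean_c) ** 2 for c in coffee)
s_ss = sum((s - mean_s) ** 2 for s in stress)
s_cs = sum((c - mean_c) * (s - mean_s) for c, s in zip(coffee, stress))

# Sample correlation between the two predictors.
r_predictors = s_cs / sqrt(s_cc * s_ss)

# Inverse of [[s_cc, s_cs], [s_cs, s_ss]]; cov(slopes) = sigma^2 * this inverse,
# and sigma^2 cancels when the covariance is turned into a correlation.
det = s_cc * s_ss - s_cs ** 2
inverse = [[s_ss / det, -s_cs / det], [-s_cs / det, s_cc / det]]
corr_slopes = inverse[0][1] / sqrt(inverse[0][0] * inverse[1][1])

print(f"corr(Coffee, Stress)     = {r_predictors:.5f}")
print(f"corr(b_Coffee, b_Stress) = {corr_slopes:.5f}")
```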

Taken at face value, the above model, in which the coefficient of Coffee is $-0.4091\!$, suggests that Coffee has a negative effect on Heart, which would mean that Coffee helps reduce heart disease; this is also visible in the green surface in the 3d plot. However, from the Heart ~ Coffee model below, Coffee, whose coefficient is 1.1082, turns out to have a positive effect on Heart, which would mean that Coffee increases heart disease, and the purple surface gives a clear picture of that relationship. One thing worth mentioning here is that the purple surface for the marginal model is perpendicular to the Heart-Coffee plane. Drawing both conclusions would be contradictory, since we are dealing with the same data set. This apparent conflict calls for further analysis of the data.

The shadows of the green ellipse form Scheffe confidence intervals, which are usually wider than Bonferroni confidence intervals when the number of parameters is small. The Bonferroni confidence intervals are formed by the shadows of the red ellipse and incorporate a penalty for examining several coefficients simultaneously.

The red confidence intervals are the Bonferroni confidence intervals, which are very familiar to us. The Bonferroni confidence interval in the conditional model for the coefficient of Coffee is $[-1.024942, 0.20646]\!$, which contains 0, so the coefficient is not significant. However, the Bonferroni confidence interval in the conditional model for the coefficient of Stress is $[0.7258632, 1.6726436]\!$, which excludes 0, so that coefficient is highly significant. Meanwhile, the Scheffe confidence interval for the coefficient of Coffee is $[-1.1909827, 0.3728805]\!$ and the interval for the coefficient of Stress is $[0.5979003, 1.8006065]\!$; as expected, each contains the corresponding Bonferroni interval. (As reported, the red intervals have half-widths of about 2.11 standard errors, which equals the ordinary one-at-a-time $t_{0.975,17}\!$ multiplier.) In this case, either kind of interval gives the same conclusion. However, if we move the ellipses slightly upward, for example by introducing measurement error on Stress, we can make $\beta_c\!$ significant according to the Bonferroni confidence interval while it remains insignificant according to the Scheffe confidence interval.
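One way to see how much wider the green intervals are is to recover the multiplier each interval implies, that is, its half-width divided by the standard error of the coefficient. The sketch below uses only numbers quoted on this page; the dictionary and function names are illustrative. For reference, a Scheffe interval for a two-dimensional region uses the multiplier $\sqrt{2 F_{0.95;2,17}}\!$, and an unadjusted t interval uses $t_{0.975,17}\!$, which is approximately 2.110.

```python
from math import sqrt

# Standard errors are the square roots of the diagonal entries of cov(beta) above.
std_error = {"Coffee": sqrt(0.08511929), "Stress": sqrt(0.05034421)}

# Intervals copied from the text: shadows of the red and green ellipses.
red_interval = {
    "Coffee": (-1.024942, 0.20646),
    "Stress": (0.7258632, 1.6726436),
}
green_interval = {
    "Coffee": (-1.1909827, 0.3728805),
    "Stress": (0.5979003, 1.8006065),
}


def implied_multiplier(interval, se):
    """Half-width of the interval expressed in standard errors."""
    low, high = interval
    return (high - low) / (2 * se)


for name in ("Coffee", "Stress"):
    red = implied_multiplier(red_interval[name], std_error[name])
    green = implied_multiplier(green_interval[name], std_error[name])
    # For the Scheffe interval, multiplier**2 / 2 recovers the F quantile used.
    print(f"{name}: red {red:.3f}, green {green:.3f}, implied F {green ** 2 / 2:.3f}")
```

Both coefficients give the same pair of multipliers, as they should, since each ellipse is projected onto both axes with a single scaling factor.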

Now let us look at the confidence region. Neither of the two ellipses includes the origin, which means that the vector $\begin{pmatrix}\beta_c \\ \beta_s \end{pmatrix}$ is significantly different from zero when we consider the two coefficients together instead of separately. From test statistics such as the p-value and the correlation, we can conclude that this model is still satisfactory.

The regression line in the confidence ellipse is $\beta_{Stress} = 0.8759 - 0.7904 \beta_{Coffee}\!$. Setting $\beta_{Stress} = 0\!$, we get $\beta_{Coffee} = 0.8759/0.7904 = 1.1082\!$, which is the coefficient for Coffee in the marginal model. Moreover, the blue part is the 95% confidence interval for Coffee in the following marginal model. We will talk about this later.
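The line through the ellipse can be recovered from the estimates and from $cov(\beta)\!$ given earlier: it is the regression of $\beta_{Coffee}\!$ on $\beta_{Stress}\!$ within the ellipse. The following Python sketch uses only the rounded values quoted on this page, so it agrees with the text to about four decimals; the variable names are illustrative.

```python
# Estimates from the Heart ~ Coffee + Stress output.
b_coffee, b_stress = -0.4091, 1.1993

# Entries of cov(beta) quoted above.
var_b_stress = 0.05034421
cov_b_coffee_stress = -0.06369283

# Change in beta_Coffee per unit change in beta_Stress along the line.
slope_coffee_on_stress = cov_b_coffee_stress / var_b_stress


def coffee_on_line(beta_stress):
    """Value of beta_Coffee on the line for a given beta_Stress."""
    return b_coffee + slope_coffee_on_stress * (beta_stress - b_stress)


# Where the line crosses beta_Stress = 0 (Stress dropped from the model).
marginal_coffee = coffee_on_line(0.0)

# The same line solved for beta_Stress = line_intercept + line_slope * beta_Coffee.
line_slope = 1.0 / slope_coffee_on_stress
line_intercept = b_stress - b_coffee * line_slope

print(f"beta_Coffee at beta_Stress = 0: {marginal_coffee:.4f}")
print(f"beta_Stress = {line_intercept:.4f} + ({line_slope:.4f}) * beta_Coffee")
```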

lm(formula = Heart ~ Coffee)

Residuals:
     Min       1Q   Median       3Q      Max
-25.1006 -10.8545  -0.6428  10.4100  34.7385

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -9.3138     9.2055  -1.012    0.325
Coffee        1.1082     0.1072  10.339 5.34e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.48 on 18 degrees of freedom
Multiple R-Squared: 0.8559, Adjusted R-squared: 0.8479
F-statistic: 106.9 on 1 and 18 DF, p-value: 5.337e-09

We try Coffee alone in the model and find that, not surprisingly, its coefficient is highly significant, with the very low value Pr(>|t|) = 5.34e-09. Moreover, the $95\% \!$ confidence interval for the coefficient of Coffee in the above model is [0.883, 1.333], which does not include 0, in agreement with the significance test.

Let us see what can be learned by comparing this model with the conditional model.

Recall the result $E(X_1|X_2 =x_2) = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)\!$, where $\mu_1\!$ and $\mu_2 \!$ are the mean vectors of $X_1\!$ and $X_2\!$ respectively, $\Sigma_{12}\!$ is the covariance matrix between $X_1\!$ and $X_2\!$, and $\Sigma_{22}\!$ is the covariance matrix of $X_2\!$.

By differentiating the conditional model with respect to Coffee, we get

$\frac{\partial Heart}{\partial Coffee} = -0.4091 + 1.1993 \frac{\partial E(Stress|Coffee)}{\partial Coffee} = -0.4091 + 1.1993 \times \frac{1573.989}{1244.116} = 1.1082 \!$, which is exactly the coefficient of Coffee in the marginal model above. The two must agree: if we differentiate the marginal model with respect to Coffee in the same way, we obtain $\frac{\partial Heart}{\partial Coffee} = 1.1082\!$, and there is no reason for $\frac{\partial Heart}{\partial Coffee}\!$ to have two different values. Moreover, the confidence interval for $\beta_{Coffee}\!$ is $[0.8829868, 1.333375]\!$, which agrees with the results from the model summary and shows that the coefficient is highly significant.
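The agreement is not a numerical coincidence; it follows from the least-squares normal equations. As a short derivation, write $s_{CC}\!$ for the sample variance of Coffee, $s_{CS}\!$ and $s_{CH}\!$ for its sample covariances with Stress and Heart, and $\hat\beta_C\!$, $\hat\beta_S\!$ for the estimates in the conditional model (this notation is introduced here only for illustration). The normal equation associated with Coffee in the fit Heart ~ Coffee + Stress is

$s_{CH} = \hat\beta_C \, s_{CC} + \hat\beta_S \, s_{CS}\!$

Dividing both sides by $s_{CC}\!$ gives

$\frac{s_{CH}}{s_{CC}} = \hat\beta_C + \hat\beta_S \, \frac{s_{CS}}{s_{CC}}\!$

The left-hand side is the least-squares slope in Heart ~ Coffee, and $s_{CS}/s_{CC}\!$ is the least-squares slope of Stress regressed on Coffee. With the values from the covariance matrix above,

$\frac{1378.705}{1244.116} = 1.1082 = -0.4091 + 1.1993 \times \frac{1573.989}{1244.116} = -0.4091 + 1.1993 \times 1.2651\!$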

lm(formula = Heart ~ Stress)

Residuals:
    Min      1Q  Median      3Q     Max
-17.504  -5.602  -2.004   7.246  21.709

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.89858    5.74297  -1.724    0.102
Stress       0.89317    0.05318  16.795 1.92e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.63 on 18 degrees of freedom
Multiple R-Squared: 0.94, Adjusted R-squared: 0.9367
F-statistic: 282.1 on 1 and 18 DF, p-value: 1.918e-12

We try Stress alone in the model and find that, not surprisingly, its coefficient is significant, with Pr(>|t|) = 1.92e-12, which is even lower than the corresponding value for Coffee in the Coffee-only simple linear regression above.

lm(formula = Coffee ~ Stress)

Residuals:
    Min      1Q  Median      3Q     Max
-17.496  -6.188   3.150   5.532  11.307

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.14434    4.51999   1.138     0.27
Stress       0.74828    0.04186  17.877 6.62e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.368 on 18 degrees of freedom
Multiple R-Squared: 0.9467, Adjusted R-squared: 0.9437
F-statistic: 319.6 on 1 and 18 DF, p-value: 6.621e-13

As stated earlier, Coffee and Stress have a strong linear relationship, which this fit further confirms.

We could now examine residual plots to check our i.i.d. and Normality assumptions. However, I do not think that these plots would substantially affect our concern that Coffee and Stress are highly confounded.
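As a starting point for such a check, the residuals of the Heart~Coffee+Stress model can be recomputed from the data table and the reported coefficients. This is a rough Python sketch, not a replacement for residual plots; it uses the rounded estimates printed above, so its results match the R summary only to about two decimals.

```python
from math import sqrt

# (Coffee, Stress, Heart) for the 20 subjects, copied from the table above.
DATA = [
    (23, 14, 6), (35, 34, 9), (38, 58, 41), (48, 50, 31), (52, 86, 63),
    (56, 73, 44), (61, 82, 69), (62, 87, 80), (64, 74, 63), (71, 80, 72),
    (74, 87, 83), (76, 92, 58), (87, 128, 113), (97, 115, 88), (100, 123, 92),
    (104, 117, 92), (107, 148, 144), (124, 146, 103), (141, 175, 145),
    (154, 197, 162),
]

# Rounded estimates copied from the Heart ~ Coffee + Stress output.
INTERCEPT, B_COFFEE, B_STRESS = -7.7943, -0.4091, 1.1993
N_COEFFICIENTS = 3

# Residual = observed Heart minus fitted Heart.
residuals = [
    heart - (INTERCEPT + B_COFFEE * coffee + B_STRESS * stress)
    for coffee, stress, heart in DATA
]

# Residual standard error on n - 3 degrees of freedom.
rse = sqrt(sum(r * r for r in residuals) / (len(DATA) - N_COEFFICIENTS))
largest = max(residuals, key=abs)

print(f"min={min(residuals):.2f}  max={max(residuals):.2f}  rse={rse:.2f}")
print(f"largest residual is {largest / rse:.2f} residual standard errors from zero")
```

Plotting `residuals` against the fitted values or against each predictor would be the natural next step, using whatever plotting tool is at hand.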

In the scatterplots and in the Heart~Coffee and Heart~Stress models, we note that Heart appears to increase (become less healthy) in a linear fashion as Coffee or Stress increases. While Heart can be predicted well by Coffee, the relationship may be spurious: coffee consumption may not be a cause of heart condition. For example, we can speculate that occupational stress may be the underlying cause of both coffee consumption and heart condition, in which case the latter two variables may not have any substantial direct relationship. It is also possible that some other variable, not measured, is a cause of all three of these variables, so that the three measured variables may have no substantial relationship other than being highly correlated in this study.

In practice, it would be good to consult the relevant medical literature and health researchers to put this study in context. It may be that occupational stress is already established as having a notable effect on heart condition. It would also be helpful to have a larger survey that selects people at random, so that it could include some participants with low occupational stress but high coffee consumption, and vice versa. Observations from such participants, if gathered at random and in a proper manner along with the rest of the observations, could help distinguish between any potential effects of Coffee and Stress on Heart.