---
title: "Assignment 4"
author: '108048110'
date: "2022-11-12"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Problem 1

```{r include=FALSE}
library(GGally)
library(lmtest)
require(ellipse)
```

```{r include=FALSE}
data1 = read.table('./datasets/wastes.txt', header=TRUE)
```

This is experimental data from five laboratory measurements used to explain the total oxygen demand in dairy wastes. The data were collected on samples kept in suspension in water in a laboratory for 220 days, and we assume that all observations are independent.

| Item | Variable | Description                  | Unit               |
|------|----------|------------------------------|--------------------|
| 1    | y        | $log(oxygen\ demand)$        | mg oxygen per unit |
| 2    | x1       | $Biological\ oxygen\ demand$ | mg/liter           |
| 3    | x2       | $Total\ Kjeldahl\ nitrogen$  | mg/liter           |
| 4    | x3       | $Total\ solids$              | mg/liter           |
| 5    | x4       | $Total\ volatile\ solids$    | mg/liter           |
| 6    | x5       | $Chemical\ oxygen\ demand$   | mg/liter           |

: Total oxygen demand in dairy wastes

```{r echo=FALSE}
attach(data1)
# note: summary.data.frame has no 'cor' argument; the pairwise
# correlations are shown in the ggpairs plot below
summary(data1)
ggpairs(data1[,-1])
```

- $x2, x3, x5$ and $y$ are right-skewed, while $x4$ is left-skewed.

- From the coefficient correlation table in the model summary below, the estimator $\hat\beta_4$ has a very strong negative correlation with the intercept estimator $\hat\beta_0$, which might result from the range of $x4$ being relatively small compared to the other variables.

- As we can observe from the plot, the correlation coefficients between $x1$ and $x3$, $x1$ and $x4$, $x1$ and $x5$, $x3$ and $x5$, and $x4$ and $x5$ are very high. In addition, the response variable $y$ appears to be positively correlated with $x1, x3, x4$ and $x5$.
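
As a quick numeric check of these visual impressions, the pairwise correlations can be printed directly (a minimal sketch; like the pairs plot above, it skips the first column of `data1`):

```{r}
# pairwise Pearson correlations among the predictors and the response
round(cor(data1[,-1]), 2)
```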
| 50 | + |
| 51 | +### Fit a multiple regression model $y=\beta_0+\beta_1*x1+\beta_2*x2+\beta_3*x3+\beta_4*x4+\beta_5*x5+\epsilon$, using $y$ as the dependent variable and all $x_j’s$ as the independent variables. |
| 52 | + |
| 53 | +Below is the result of fitting a multiple regression model. |
| 54 | + |
| 55 | +```{r echo=FALSE} |
| 56 | +model1 = lm(y~x1+x2+x3+x4+x5) |
| 57 | +smodel1 = summary(model1, cor=T) |
| 58 | +smodel1 |
| 59 | +``` |
| 60 | + |
#### a. Form a 95% confidence interval for $\beta_3$ and again for $\beta_5$.

- 95% critical value, with $df_\Omega=n-p=20-6=14$:

```{r echo=FALSE}
qt(0.975, 14)
```

- General form: $(A\hat\beta - A\beta)^{T} (A(X^{T}X)^{-1}A^{T})^{-1}(A\hat\beta-A\beta) / (d\,\hat\sigma^2) \le F_{d,\,n-p}(\alpha)$
- 95% confidence interval for $\beta_3$ and $\beta_5$.
  - If $d=1$:

$$\frac{(A\hat\beta-A\beta)^2}{(A(X^TX)^{-1}A^T)\,\hat\sigma^2} \le F_{1,\,n-p}(\alpha) \iff \frac{|A\hat\beta-A\beta|}{\sqrt{A(X^TX)^{-1}A^T}\;\hat\sigma} \le t_{n-p}(\alpha/2)$$
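
The equivalence of the two forms rests on the identity $t_{n-p}(\alpha/2)^2 = F_{1,\,n-p}(\alpha)$, which is easy to verify numerically:

```{r}
# for d = 1, the squared t critical value equals the F critical value
qt(0.975, 14)^2
qf(0.95, 1, 14)
```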

- In this case, $n=20, p=6, \alpha=0.05, t_{14}(0.025)=2.145$.

- For $\beta_3$ take $A_3 = (0,0,0,1,0,0)$ so that $A_3\hat\beta = \hat\beta_3$; for $\beta_5$ take $A_5 = (0,0,0,0,0,1)$ so that $A_5\hat\beta = \hat\beta_5$:

  - $\hat\beta_3 \pm t_{14}(0.025)*s.e.(\hat\beta_3) = 1.278*10^{-4}\pm2.145*7.69*10^{-5}$

  - $\hat\beta_5 \pm t_{14}(0.025)*s.e.(\hat\beta_5) = 1.417*10^{-4} \pm 2.145 * 7.375*10^{-5}$

$confidence\ interval = estimate \pm critical\ value * s.e.\ of\ estimate$
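
The same two intervals can be assembled by hand from the coefficient table (a small sketch; it reads the estimates and standard errors stored in `smodel1`, whose rows 4 and 6 correspond to $x3$ and $x5$):

```{r}
# estimate +/- t_{14}(0.025) * s.e. for beta3 and beta5
est = coef(smodel1)[c(4, 6), 1]
se  = coef(smodel1)[c(4, 6), 2]
cbind(lower = est - qt(0.975, 14)*se, upper = est + qt(0.975, 14)*se)
```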

```{r echo=FALSE}
knitr::kable(confint(model1)[c(4,6),])
```

- Both intervals contain 0, which indicates that the null hypotheses $H_0: \beta_3=0$ and $H_0: \beta_5=0$ would not be rejected at the 5% significance level. The summary above confirms this: the p-values for $\beta_3$ and $\beta_5$ are 11.88% and 7.54%, respectively.

#### b. Form a confidence interval for $\beta_3+2*\beta_5$.

- In this case, $A = (0,0,0,1,0,2)$ and $A\hat\beta = \hat\beta_3+2*\hat\beta_5$.

```{r echo=FALSE}
# s.e.(A beta-hat) = sqrt(A (X'X)^{-1} A') * sigma-hat
A = c(0,0,0,1,0,2)
x = model.matrix(model1)
xtxi = solve(t(x)%*%x)
se = sqrt(t(A)%*%xtxi%*%A)*summary(model1)$sigma

se
```

- 95% C.I. for $\beta_3+2*\beta_5$: $(\hat\beta_3+2*\hat\beta_5) \pm 2.145 * 1.642*10^{-4}$
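
Plugging the standard error into the general form gives the numeric endpoints (a sketch reusing `A` and `se` from the chunk above):

```{r}
# point estimate and 95% confidence limits for beta3 + 2*beta5
est = as.numeric(t(A) %*% coef(model1))
est + c(-1, 1) * qt(0.975, 14) * as.numeric(se)
```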
| 105 | + |
| 106 | +#### c. Show a 95% C.R. graph for $\beta_3\ and\ \beta_5$. Plot the origin. State the hypothesis test and its outcome. |
| 107 | + |
| 108 | +```{r echo=FALSE} |
| 109 | +plot(ellipse(model1, c(4, 6)), type='l') |
| 110 | +points(coef(model1)[4], coef(model1)[6], pch=19) |
| 111 | +abline(v = confint(model1)[4,], lty=2) |
| 112 | +abline(h = confint(model1)[6,], lty=2) |
| 113 | +points(0, 0, pch=19, col='red') |
| 114 | +``` |

```{r echo=FALSE}
# correlation between beta3-hat and beta5-hat
summary(model1, cor=T)$cor[4, 6]
```

- The axes of the ellipse are roughly parallel to the coordinate axes, which suggests that the two estimators $\hat\beta_3$ and $\hat\beta_5$ are nearly uncorrelated.

- We can deduce from the plot that the joint hypothesis $H_0: \beta_3=\beta_5=0$ is not rejected, because the origin lies well inside the ellipse.

- The individual hypotheses $H_0: \beta_3=0$ and $H_0: \beta_5=0$ are not rejected either, because 0 lies between the vertical dashed lines and between the horizontal dashed lines, which mark the C.I.s of $\beta_3$ and $\beta_5$ respectively.

- To double-check, we can test the significance of $\beta_5$ when $x3$ is excluded from the model, and the significance of $\beta_3$ when $x5$ is excluded, to make sure that $x3$ and $x5$ remain non-significant under the assumed full model $y=\beta_0+\beta_1*x1+\beta_2*x2+\beta_3*x3+\beta_4*x4+\beta_5*x5+\epsilon$.

```{r echo=FALSE}
# partial F-tests: x5 given (x1, x2, x4), and x3 given (x1, x2, x4)
model2 = lm(y~x1+x2+x4+x5)
model3 = lm(y~x1+x2+x3+x4)
model4 = lm(y~x1+x2+x4)

anova(model2, model4)
anova(model3, model4)
```

- In conclusion, as we can observe from the above tests and the visualization, none of our hypotheses is rejected, which indicates that neither $x3$ nor $x5$ contributes significantly to explaining the response variable $y$ at the 5% significance level.

#### d. If a 95% joint confidence region was computed for $\beta_1, \beta_2, \beta_3, \beta_4, \beta_5$, would the origin lie inside or outside the region? Explain.

```{r echo=FALSE}
knitr::kable(confint(model1))
knitr::kable(confint(model1)[,1]<0 & confint(model1)[,2]>0)
```

- As we can observe from the table, none of the predictors is significant when measured individually; however, combining univariate tests to judge a joint relationship can be misleading, so we compute the joint confidence region test directly.

```{r echo=FALSE}
# constraint matrix selecting (beta1, ..., beta5): a zero column for the
# intercept followed by the 5x5 identity
A = cbind(rep(0, 5), diag(5))

# design matrix, including the intercept column
X = as.matrix(cbind(rep(1, 20), data1[,c(-1, -7)]))

betahat = matrix(model1$coefficients, nrow=6, ncol=1)
beta = matrix(0L, nrow=6, ncol=1)
# rejection threshold: d * sigma-hat^2 * F_{d, n-p}(0.95), with d = 5
critical_value = 5*(smodel1$sigma^2)*qf(0.95, 5, 14)
```

**Joint Effect**

```{r echo=FALSE}
# quadratic form (A betahat - A beta)' (A (X'X)^{-1} A')^{-1} (A betahat - A beta)
t(A%*%betahat-A%*%beta)%*%solve(A%*%(solve(t(X)%*%X))%*%t(A))%*%(A%*%betahat-A%*%beta)
```

**Critical Value**

```{r echo=FALSE}
critical_value
```

- Testing the joint effect of $\beta_1, \beta_2, \beta_3, \beta_4, \beta_5$, the computed quadratic form is far larger than the corresponding critical value $5*\hat\sigma^2*F_{5,14}(0.95)$, which implies that the origin $(0,0,0,0,0)$ lies outside the region.

- Thus, we can reject the null hypothesis $H_0: \beta_1=\beta_2=\beta_3=\beta_4=\beta_5=0$.
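
The same conclusion follows from the overall F-test of the regression, which compares the intercept-only model with the full model:

```{r}
# overall F-test of H0: beta1 = ... = beta5 = 0
anova(lm(y ~ 1), model1)
```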

#### e. Suppose non-volatile solids have no linear effect on the response. State a hypothesis that reflects this suspicion, and test it using a C.I. in your answer to one of the above questions. Explain why the chosen confidence interval can be used to do this work.

- Start from our previous model: $y=\beta_0+\beta_1*x1+\beta_2*x2+\beta_3*x3+\beta_4*x4+\beta_5*x5+\epsilon$

- Since total solids $-$ volatile solids $=$ non-volatile solids, rewrite the model in terms of the non-volatile solids $(x3-x4)$: $y=\beta_0+\beta_1*x1+\beta_2*x2+\beta_3*(x3-x4)+\beta_4*x4+\beta_5*x5+\epsilon$

- Expanding gives $y=\beta_0+\beta_1*x1+\beta_2*x2+\beta_3*x3+(\beta_4-\beta_3)*x4+\beta_5*x5+\epsilon$,

- i.e. $y=\beta_0+\beta_1*x1+\beta_2*x2+\beta_3*x3+\beta_4^{'}*x4+\beta_5*x5+\epsilon$, where $\beta_4^{'} = \beta_4-\beta_3$.

- In the original parametrization $x4 \subset x3$: $\beta_3$ multiplies both the volatile and the non-volatile parts of the total solids, so the effect of $x4$ is shared between $\beta_3$ and $\beta_4^{'}$. After the rearrangement, $\beta_3$ becomes the coefficient of the non-volatile solids alone, which is exactly the quantity our suspicion concerns.

- Hence the suspicion corresponds to $H_0: \beta_3=0$. Since the rearrangement is only a linear reparametrization that changes neither the value nor the standard error of $\hat\beta_3$, we can apply the C.I. for $\beta_3$ calculated in question **a** to test this hypothesis.

    - Set $\beta_3=0$

    ```{r echo=FALSE}
    beta_3 = 0
    ```

    - Introduce the C.I. for $\beta_3$ from model1

    ```{r echo=FALSE}
    CI3 = confint(model1)[4,]
    CI3
    ```

    - Check whether 0 lies inside the C.I. (TRUE means we do not reject $H_0$)

    ```{r echo=FALSE}
    (beta_3 > CI3[1])&&(beta_3 < CI3[2])
    ```

<!-- -->

- Since 0 lies in the C.I. of $\beta_3$, we do not reject the null hypothesis; that is, we find no evidence that the non-volatile solids have a linear effect on the response. The reparametrized model can also be refitted directly, as sketched below.
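
As a direct check (a sketch; the model name `model_nv` is ours), `I(x3 - x4)` encodes the non-volatile solids, and the fourth row of `confint` should reproduce the interval for $\beta_3$ used above:

```{r}
# beta3 is now explicitly the coefficient of the non-volatile solids
model_nv = lm(y ~ x1 + x2 + I(x3 - x4) + x4 + x5)
confint(model_nv)[4,]
```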

```{r echo=FALSE}
detach(data1)
```

## Problem 2

```{r echo=FALSE}
data2 = read.table('http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5410/data/houseprices.txt', header=TRUE)
```

These data are a random sample of home sales from Spring 1993 in Albuquerque.

| Item | Variable | Description | Unit |
|------|----------|-------------------------------------|------------|
| 1 | Price | Selling price | $\$100$ |
| 2 | SQFT | Living space | $feet^2$ |
| 3 | Age | Age of home | $year$ |
| 4 | Features | Number out of 11 features (dish washer, refrigerator, microwave, disposer, washer, intercom, skylight(s), compactor, dryer, handicap fit, cable TV access) | count |
| 5 | NE | Located in northeast sector of city | $1\ or\ 0$ |
| 6 | Corner | Corner location | $1\ or\ 0$ |
| 7 | Taxes | Annual taxes | $\$$ |

: Home sales data

#### a. There are a large number of missing values in the $Age$ variable. We could either exclude $Age$ from our models for the selling price or we could keep $Age$ and exclude the cases that have missing values for $Age$. Which choice is better for this data? Explain your reasoning.

- Check how many missing values are in the data.

```{r echo=FALSE}
summary(data2)
```

- As we can observe from the summary, there are 49 NA values in the $Age$ variable and 10 in the $Tax$ variable.

- Furthermore, although the description of $Features$ says the count runs over 11 possible features, the observed maximum in our sample is only 8.
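
A quick tabulation of the observed $Features$ counts makes this visible:

```{r}
# observed distribution of the Features count (no home exceeds 8)
table(data2$Features)
```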

```{r echo=FALSE}
nrow(data2)
```

- However, there are only 117 samples in this dataset; removing every case with a missing $Age$ would discard nearly half of the observations, which is too many to give up. Therefore, excluding the $Age$ variable from the dataset is the more economical and reasonable option.

```{r echo=FALSE}
# drop Age (column 3), then remove the rows with a missing Tax
knitr::kable(head(data2[,-3], 6))
data2 = data2[,-3]
data2 = data2[complete.cases(data2),]
knitr::kable(summary(data2))
attach(data2)
```

- New data after excluding $Age$ from the dataset and removing the NA values in $Tax$.

```{r echo=FALSE}
ggpairs(data2)
```

- House $Price$ appears to be positively correlated with $SQFT$ and $Tax$, which matches our intuition, while its relationships with the other variables are less clear.

#### b. Fit a model with $Price$ as the response and $SQFT, Features, NE, Corner, Tax$ as predictors. Form 95% and 99% C.I. for their coefficients. Explain how the p-value for the parameter for $Corner$ relates to whether zero falls in the two corresponding C.I..

$Price = \beta_0+\beta_1*SQFT+\beta_2*Features+\beta_3*NE+\beta_4*Corner+\beta_5*Tax+\epsilon$

```{r echo=FALSE}
model1 = lm(Price~SQFT+factor(Features)+factor(NE)+factor(Corner)+Tax)
model2 = lm(Price~SQFT+Features+NE+Corner+Tax)
model3 = lm(Price~SQFT+Features+factor(NE)+factor(Corner)+Tax)
smodel1 = summary(model1, cor=T)
smodel2 = summary(model2, cor=T)
smodel3 = summary(model3, cor=T)

smodel1
#smodel2
#smodel3
```

- **Note.** $Features$, $NE$ and $Corner$ are qualitative variables: their numeric values are category labels, so they are entered as factors.
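
A glance at the design matrix shows how the factors are expanded into indicator columns (a sketch; the lowest observed level of each factor is absorbed into the intercept as the baseline):

```{r}
# first rows of the design matrix: one dummy column per non-baseline level
head(model.matrix(model1), 3)
```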

```{r echo=FALSE}
knitr::kable(confint(model1, level=0.95))
knitr::kable(confint(model1, level=0.99))
```

- The coefficient of $Corner$ is not significant at either the 5% or the 1% level: its p-value exceeds both thresholds, which is equivalent to zero falling inside both the 95% and the 99% confidence intervals.
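
This duality can be checked directly: the two-sided p-value from the coefficient table exceeds 0.05 exactly when zero lies inside the 95% interval (a sketch; `factor(Corner)1` is the coefficient name produced by the factor coding):

```{r}
# p-value for Corner from the coefficient table
coef(smodel1)["factor(Corner)1", "Pr(>|t|)"]
# does zero fall inside the 95% C.I.? (TRUE exactly when the p-value > 0.05)
ci = confint(model1)["factor(Corner)1", ]
ci[1] < 0 & ci[2] > 0
```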

#### c. Predict the $Price$ of a specific house with $SQFT=2500, Features=5, NE=1, Corner=1, Tax=1200$. Give an appropriate 95% C.I..

$$\widehat{Price} = 154.548+0.253*SQFT-34.356*F_1-44.180*F_2-31.409*F_3-44.753*F_4+8.071*F_5+35.905*F_6+98.549*F_7+118.434*F_8-4.563*NE-83.752*Corner+0.690*Tax$$

- **Note.** If $Features=x$, then $F_x = 1$, for $x\in\{1,\dots,8\}$ (the lowest level of $Features$ is the baseline).

```{r echo=FALSE}
predict(model1, data.frame(SQFT=2500, Features=5, NE=1, Corner=1, Tax=1200), interval="confidence")
```

- This is an interpolation, and the interval above is a confidence interval for the mean response at these covariate values.
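
Since the question concerns one specific house rather than the average of all such houses, a prediction interval, which adds the variance of a single new observation, is arguably the more appropriate companion (a sketch with the same covariate values):

```{r}
# 95% prediction interval for a single new house; wider than the mean-response C.I.
predict(model1, data.frame(SQFT=2500, Features=5, NE=1, Corner=1, Tax=1200),
        interval="prediction")
```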

#### d. Suppose you are only told that $SQFT=2500$. Predict the $Price$ and 95% C.I..

##### When all variables have low correlation:

```{r}
model2 = lm(Price~., data2)
# fill in the unknown predictors with typical values: the median Features
# count, the modal NE and Corner levels, and the mean Tax; note that
# which.max(table(.)) returns a position, not a level, so we take the
# level's name and convert it back to a number
x = data.frame(SQFT = 2500,
               Features = median(Features),
               NE = as.numeric(names(which.max(table(NE)))),
               Corner = as.numeric(names(which.max(table(Corner)))),
               Tax = mean(Tax))

predict(model2, newdata=x, interval="predict")
```

- This approximates the conditional prediction of $Price$ given $SQFT=2500$ by plugging typical values into the conditional distribution of the remaining predictors.
| 349 | + |
| 350 | +##### Other method: |
| 351 | + |
| 352 | +```{r} |
| 353 | +fit2.3 = lm(Price ~ SQFT, data2) |
| 354 | +predict(fit2.3, newdata=data.frame(SQFT=2500),interval="predict") |
| 355 | +``` |