Commit c33aeba

Authored Feb 6, 2023
Add files via upload

1 parent 15f582c commit c33aeba

7 files changed: +3109 −0 lines changed

Assignment 1 teenage gambling and production.Rmd

+402
Large diffs are not rendered by default.

Assignment 2.Rmd

+497
Large diffs are not rendered by default.

Assignment 3.Rmd

+423
Large diffs are not rendered by default.

Assignment-4.Rmd

+355
@@ -0,0 +1,355 @@
---
title: "Assignment 4"
author: '108048110'
date: "2022-11-12"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Problem 1

```{r include=FALSE}
library(GGally)
library(lmtest)
require(ellipse)
```

```{r include=FALSE}
data1 = read.table('./datasets/wastes.txt', header=TRUE)
```

These are experimental data from five laboratory measurements used to explain the total oxygen demand in dairy wastes. Data were collected on samples kept in suspension in water in a laboratory for 220 days, and we assume that all observations are independent.

| Item | Variable | Description                  | Unit               |
|------|----------|------------------------------|--------------------|
| 1    | y        | $log(oxygen \ demand)$       | mg oxygen per unit |
| 2    | x1       | $Biological\ oxygen\ demand$ | mg/liter           |
| 3    | x2       | $Total\ Kjeldahl\ nitrogen$  | mg/liter           |
| 4    | x3       | $Total\ solids$              | mg/liter           |
| 5    | x4       | $Total\ volatile\ solids$    | mg/liter           |
| 6    | x5       | $Chemical\ oxygen\ demand$   | mg/liter           |

: Total oxygen demand in dairy wastes

```{r echo=FALSE}
attach(data1)
summary(data1)
# summary() ignores cor=TRUE for data frames, so print the correlation table explicitly
round(cor(data1[,-1]), 3)
ggpairs(data1[,-1])
```

- $x2, x3, x5\ and\ y$ are right-skewed, while $x4$ is left-skewed.

- From the correlation table of the fitted coefficients we find that the estimate of the $x4$ coefficient, $\hat\beta_4$, has a very high negative correlation with the intercept $(\hat\beta_0)$, which may result from the range of $x4$ being relatively small compared with the other variables.

- As the plot shows, the correlation coefficients between $x1$ and $x3$, $x1$ and $x4$, $x1$ and $x5$, $x3$ and $x5$, and $x4$ and $x5$ are very high; in addition, the response variable $y$ appears to be positively correlated with $x1, x3, x4$ and $x5$.

### Fit a multiple regression model $y=\beta_0+\beta_1*x1+\beta_2*x2+\beta_3*x3+\beta_4*x4+\beta_5*x5+\epsilon$, using $y$ as the dependent variable and all $x_j$'s as the independent variables.

Below is the result of fitting the multiple regression model.

```{r echo=FALSE}
model1 = lm(y~x1+x2+x3+x4+x5)
smodel1 = summary(model1, cor=T)
smodel1
```

#### a. Form a 95% confidence interval for $\beta_3$ and again for $\beta_5$.

- 95% critical value, with $df_\Omega=n-p=20-6=14$:

```{r echo=FALSE}
qt(0.975, 14)
```

- General form: $(A\hat\beta - A\beta)^{T} (A(X^{T}X)^{-1}A^{T})^{-1}(A\hat\beta-A\beta) / (d * \hat\sigma^2) \le F_{d, n-p}(\alpha)$
- 95% confidence interval for $\beta_3$ and $\beta_5$.
- If $d=1$:

$$\frac{(A\hat\beta-A\beta)^2}{(A(X^TX)^{-1}A^T)\hat\sigma^2} \le F_{1, n-p}(\alpha) \iff \frac{|A\hat\beta-A\beta|}{\sqrt{A(X^TX)^{-1}A^T}\ \hat\sigma} \le t_{n-p}(\alpha/2)$$

- In this case, $n=20, p=6, \alpha=0.05, t_{14, 0.025}=2.145$.

- For $\beta_3$ take $A_3 = (0,0,0,1,0,0)$ so that $A_3\hat\beta = \hat\beta_3$; for $\beta_5$ take $A_5 = (0,0,0,0,0,1)$ so that $A_5\hat\beta = \hat\beta_5$:

- $\hat\beta_3 \pm t_{14}(0.025)*s.e.(\hat\beta_3) = 1.278*10^{-4}\pm2.145*7.69*10^{-5}$

- $\hat\beta_5 \pm t_{14}(0.025)*s.e.(\hat\beta_5) = 1.417*10^{-4} \pm 2.145 * 7.375*10^{-5}$

$confidence\ interval = estimate \pm critical\ value * s.e.\ of\ estimate$

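To connect the formula with R's output, the interval for $\beta_3$ can also be assembled by hand; a minimal sketch, which should agree with the `confint` table below:

```{r}
# 95% CI for beta_3 from the formula above
bhat3 = coef(model1)[4]                     # estimate of beta_3
se3   = summary(model1)$coefficients[4, 2]  # s.e. of the estimate
bhat3 + c(-1, 1) * qt(0.975, 14) * se3      # estimate +/- t * s.e.
```
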
```{r echo=FALSE}
knitr::kable(confint(model1)[c(4,6),])
```

- Both intervals contain 0, which indicates that the null hypotheses $H_0: \beta_3=0$ and $H_0: \beta_5=0$ would not be rejected at the 5% significance level. The summary confirms this: the p-values for $\beta_3$ and $\beta_5$ are 11.88% and 7.54%, respectively.

#### b. Form a confidence interval for $\beta_3+2*\beta_5$.

- In this case, $A = (0,0,0,1,0,2)$ and $A\hat\beta = \hat\beta_3+2*\hat\beta_5$.

```{r echo=FALSE}
A = c(0,0,0,1,0,2)
x = model.matrix(model1)
xtxi = solve(t(x)%*%x)
se = sqrt(t(A)%*%xtxi%*%A)*summary(model1)$sigma

se
```

- 95% C.I. for $\beta_3+2*\beta_5$: $(\hat\beta_3+2*\hat\beta_5) \pm 2.145 * 1.642*10^{-4}$

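The numeric endpoints can be computed directly from the quantities above; a minimal sketch:

```{r}
# Endpoints of the 95% CI for beta_3 + 2*beta_5
est = sum(A * coef(model1))           # A %*% beta-hat
c(lower = est - qt(0.975, 14) * se,   # se was computed in the previous chunk
  upper = est + qt(0.975, 14) * se)
```
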
#### c. Show a 95% C.R. graph for $\beta_3$ and $\beta_5$. Plot the origin. State the hypothesis test and its outcome.

```{r echo=FALSE}
plot(ellipse(model1, c(4, 6)), type='l')
points(coef(model1)[4], coef(model1)[6], pch=19)
abline(v = confint(model1)[4,], lty=2)
abline(h = confint(model1)[6,], lty=2)
points(0, 0, pch=19, col='red')
```

```{r echo=FALSE}
summary(model1, cor=T)$cor[4, 6]
```

- The major axis of the ellipse is roughly parallel to the y-axis and the minor axis to the x-axis, which suggests that the two estimators $\hat\beta_3$ and $\hat\beta_5$ are nearly uncorrelated.

- We can deduce from the plot that the joint hypothesis $H_0: \beta_3=\beta_5=0$ is not rejected, because the origin lies inside the ellipse.

- The individual hypotheses $H_0: \beta_3=0$ and $H_0: \beta_5=0$ are not rejected either, because 0 lies within the vertical and horizontal dashed lines, which represent the C.I.s of $x3$ and $x5$ respectively.

- We can further test the significance of $\beta_5$ when $x3$ is excluded from the model, and of $\beta_3$ when $x5$ is excluded, to check that neither variable is needed when the true model is assumed to be $y=\beta_0+\beta_1*x1+\beta_2*x2+\beta_3*x3+\beta_4*x4+\beta_5*x5+\epsilon$.

```{r echo=FALSE}
model2 = lm(y~x1+x2+x4+x5)
model3 = lm(y~x1+x2+x3+x4)
model4 = lm(y~x1+x2+x4)

anova(model2, model4)  # tests x5 with x3 excluded
anova(model3, model4)  # tests x3 with x5 excluded
```

- In conclusion, none of the above hypotheses is rejected, which indicates that neither $\beta_3$ nor $\beta_5$ is significant enough to be indispensable for the response variable $y$ at the 5% significance level.

#### d. If a 95% joint confidence region was computed for $\beta_1, \beta_2, \beta_3, \beta_4, \beta_5$, would the origin lie inside or outside the region? Explain.

```{r echo=FALSE}
knitr::kable(confint(model1))
knitr::kable(confint(model1)[,1]<0 & confint(model1)[,2]>0)
```

- As the table shows, none of the predictors is significant when measured individually; however, combining univariate tests to judge a joint relationship may give a biased result, so we compute the boundary of the joint confidence region directly.

```{r echo=FALSE}
# A selects the five slope coefficients, so A beta = 0 is H0: beta_1 = ... = beta_5 = 0
A = cbind(rep(0, 5), diag(5))

X = as.matrix(cbind(rep(1, 20), data1[,c(-1, -7)]))

betahat = matrix(model1$coefficients, nrow=6, ncol=1)
beta = matrix(0L, nrow=6, ncol=1)
# Critical value d * sigma-hat^2 * F_{d, n-p}(alpha), with d = 5 and n - p = 14
critical_value = 5*(smodel1$sigma^2)*qf(0.05, 5, 14, lower.tail = FALSE)
```

**Joint Effect**

```{r echo=FALSE}
t(A%*%betahat-A%*%beta)%*%solve(A%*%(solve(t(X)%*%X))%*%t(A))%*%(A%*%betahat-A%*%beta)
```

**Critical Value**

```{r echo=FALSE}
critical_value
```

- After testing the joint effect of $\beta_1, \beta_2, \beta_3, \beta_4, \beta_5$, we find that the calculated statistic is much larger than the corresponding critical value $5*\hat\sigma^2* F_{5, 14}(0.05)$, which implies that the origin $(0,0,0,0,0)$ lies outside the region.

- Thus, we can reject the null hypothesis $H_0: \beta_1=\beta_2=\beta_3=\beta_4=\beta_5=0$.

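As a cross-check, this joint hypothesis is exactly what the overall F-statistic of the regression tests, so the conclusion can be verified from the fitted summary; a minimal sketch:

```{r}
# Overall F-test of H0: beta_1 = ... = beta_5 = 0
fstat = smodel1$fstatistic
c(F = fstat["value"],
  p = pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE))
```
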
#### e. Suppose non-volatile solids have no linear effect on the response. State a hypothesis that reflects this suspicion, and test it using a C.I. in your answer to one of the above questions. Explain why the chosen confidence interval can be used to do this work.

- Rearrange our previous model: $y=\beta_0+\beta_1*x1+\beta_2*x2+\beta_3*x3+\beta_4*x4+\beta_5*x5+\epsilon$

- Total solids $-$ volatile solids $=$ non-volatile solids: $y=\beta_0+\beta_1*x1+\beta_2*x2+\beta_3*(x3-x4)+\beta_4*x4+\beta_5*x5+\epsilon$

- Expanding: $y=\beta_0+\beta_1*x1+\beta_2*x2+\beta_3*x3+(\beta_4-\beta_3)*x4+\beta_5*x5+\epsilon$

- $y=\beta_0+\beta_1*x1+\beta_2*x2+\beta_3*x3+\beta_4^{'}*x4+\beta_5*x5+\epsilon,\ where\ \beta_4^{'} = \beta_4-\beta_3$

- In the original model $x4 \subset x3$: the total solids include the volatile solids, so $\beta_3$ acts on both the volatile and non-volatile parts while $\beta_4$ captures the additional effect of the volatile part. After the rearrangement, $\beta_3$ becomes the coefficient of the non-volatile solids $(x3-x4)$ alone, separated from the effect of $x4$.

- Hence, we hypothesize $H_0: \beta_3=0$. Since the rearrangement is only a linear reparameterization that does not change the value of $\beta_3$, the C.I. calculated in question **a** can be used to test this hypothesis (see the sanity check after this part).

- Set $\beta_3=0$

```{r echo=FALSE}
beta_3 = 0
```

- Introduce the C.I. for $\beta_3$ from model1

```{r echo=FALSE}
CI3 = confint(model1)[4,]
CI3
```

- Check whether 0 falls inside the C.I., i.e. whether we fail to reject the null hypothesis

```{r echo=FALSE}
(beta_3 > CI3[1])&&(beta_3 < CI3[2])
```

- Since 0 lies in the C.I. of $\beta_3$, we do not reject the null hypothesis; that is, the data are consistent with the non-volatile solids having no linear effect on the response.

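As a sanity check, we can refit the model in the rearranged parameterization; the coefficient of the non-volatile term should equal $\hat\beta_3$ from the original fit, which is why the interval from part **a** applies. A minimal sketch:

```{r}
# Refit with non-volatile solids (x3 - x4) entered directly
model_nv = lm(y ~ x1 + x2 + I(x3 - x4) + x4 + x5, data = data1)
# The I(x3 - x4) coefficient matches coef(model1)["x3"]
cbind(rearranged = coef(model_nv)[4], original = coef(model1)[4])
```
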
```{r echo=FALSE}
detach(data1)
```

## Problem 2

```{r echo=FALSE}
data2 = read.table('http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5410/data/houseprices.txt', header=TRUE)
```

These data are a random sample of home sales from Spring 1993 in Albuquerque.

+------------+------------+-------------------------------------+-------------------+
| Item       | Variable   | Description                         | Unit              |
+============+============+=====================================+===================+
| 1          | Price      | Selling price                       | $\$100$           |
+------------+------------+-------------------------------------+-------------------+
| 2          | SQFT       | Living space                        | $feet^2$          |
+------------+------------+-------------------------------------+-------------------+
| 3          | Age        | Age of home                         | $year$            |
+------------+------------+-------------------------------------+-------------------+
| 4          | Features   | Number out of 11 features           | (dish washer,     |
|            |            |                                     | refrigerator,     |
|            |            |                                     | microwave,        |
|            |            |                                     | disposer,         |
|            |            |                                     | washer, intercom, |
|            |            |                                     | skylight(s),      |
|            |            |                                     | compactor,        |
|            |            |                                     | dryer,            |
|            |            |                                     | handicap fit,     |
|            |            |                                     | cable TV access)  |
+------------+------------+-------------------------------------+-------------------+
| 5          | NE         | Located in northeast sector of city | $1\ or\ 0$        |
+------------+------------+-------------------------------------+-------------------+
| 6          | Corner     | Corner location                     | $1\ or\ 0$        |
+------------+------------+-------------------------------------+-------------------+
| 7          | Taxes      | Annual taxes                        | $\$$              |
+------------+------------+-------------------------------------+-------------------+

: Home sales data

#### a. There are a large number of missing values in the $Age$ variable. We could either exclude $Age$ from our models for the selling price, or we could keep $Age$ and exclude the cases that have missing values for $Age$. Which choice is better for this data? Explain your reasoning.

- Check how many missing values are in the data.

```{r echo=FALSE}
summary(data2)
```

- As we can observe from the summary, there are 49 NA values in the variable $Age$ and 10 in the variable $Tax$.

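The NA counts can also be tallied directly, column by column; a small sketch:

```{r}
# Number of missing values in each column
colSums(is.na(data2))
```
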
- Furthermore, although the $Features$ variable description indicates that there are 11 possible features, the summary shows that the homes in our sample have at most 8 of them.

```{r echo=FALSE}
nrow(data2)
```

- However, there are only 117 samples in this dataset, so removing all the cases with missing $Age$ would mean dropping nearly half of the observations, which is too many to give up; therefore, excluding the variable $Age$ from the dataset is the more economical and reasonable option.

```{r echo=FALSE}
knitr::kable(head(data2[,-3], 6))
data2 = data2[,-3]
data2 = data2[complete.cases(data2),]
knitr::kable(summary(data2))
attach(data2)
```

- New data after excluding $Age$ from the dataset and removing the NA values from the variable $Tax$.

```{r echo=FALSE}
ggpairs(data2)
```

- House $Price$ seems to be positively correlated with $SQFT$ and $Tax$, which matches our intuition, while the relationships between $Price$ and the other variables are unclear.

#### b. Fit a model with $Price$ as the response and $SQFT, Features, NE, Corner, Tax$ as predictors. Form 95% and 99% C.I.s for their coefficients. Explain how the p-value for the parameter for $Corner$ relates to whether zero falls in the two corresponding C.I.s.

$Price = \beta_0+\beta_1*SQFT+\beta_2*Features+\beta_3*NE+\beta_4*Corner+\beta_5*Tax+\epsilon$

```{r echo=FALSE}
model1 = lm(Price~SQFT+factor(Features)+factor(NE)+factor(Corner)+Tax)
model2 = lm(Price~SQFT+Features+NE+Corner+Tax)
model3 = lm(Price~SQFT+Features+factor(NE)+factor(Corner)+Tax)
smodel1 = summary(model1, cor=T)
smodel2 = summary(model2, cor=T)
smodel3 = summary(model3, cor=T)

smodel1
#smodel2
#smodel3
```

- **Note.** $Features$, $Corner$ and $NE$ are qualitative data; their numeric values are categorical labels.

```{r echo=FALSE}
knitr::kable(confint(model1, level=0.95))
knitr::kable(confint(model1, level=0.99))
```

- The parameter for $Corner$ is not significant at either the 5% or the 1% level, and correspondingly zero falls inside both the 95% and the 99% confidence intervals: a p-value above $\alpha$ is equivalent to zero lying within the $(1-\alpha)$ C.I.

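Since the numeric-coded fit `model2` is nested in the factor-coded `model1`, we can also check whether treating $Features$ as categorical improves the fit; a small sketch using the models defined above:

```{r}
# F-test: linear Features coding (model2) vs factor coding (model1)
anova(model2, model1)
```
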
#### c. Predict the $Price$ of a specific house with $SQFT=2500, Features=5, NE=1, Corner=1, Tax=1200$. Give an appropriate 95% C.I..

$$Price = 154.548+0.253*SQFT-34.356*F_1-44.180*F_2-31.409*F_3-44.753*F_4+8.071*F_5+35.905*F_6+98.549*F_7+118.434*F_8-4.563*NE-83.752*Corner+0.690*Tax+\epsilon$$

- **Note.** If $Features=x$ then $F_x = 1$, for $x\in\{1,\dots,8\}$.

```{r echo=FALSE}
predict(model1, data.frame(SQFT=2500, Features=5, NE=1, Corner=1, Tax=1200), interval="confidence")
```

- This is an interpolation prediction, computing a C.I. for the mean response.

#### d. Suppose you are only told that $SQFT=2500$. Predict the $Price$ and 95% C.I..

##### When all variables have low correlation:

```{r}
model2 = lm(Price~., data2)
# Fill in the unknown predictors with representative values: median / mode / mean.
# which.max(table(.)) returns the index of the modal level, so take its name for the value
x = data.frame(SQFT = 2500,
               Features = median(Features),
               NE = as.numeric(names(which.max(table(NE)))),
               Corner = as.numeric(names(which.max(table(Corner)))),
               Tax = mean(Tax))

predict(model2, newdata=x, interval="predict")
```

- This amounts to predicting conditionally on representative values of the remaining predictors.

##### Other method: regress on $SQFT$ alone

```{r}
fit2.3 = lm(Price ~ SQFT, data2)
predict(fit2.3, newdata=data.frame(SQFT=2500), interval="predict")
```

Assignment-5-GLS.Rmd

+403
@@ -0,0 +1,403 @@
---
title: "Assignment 5"
author: '108048110'
date: "2022-11-27"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r echo=FALSE}
library(GGally)
library(nlme)
```

## Q1. Predicting the height of the son from the height of the father.

```{r echo=FALSE}
height = read.table('datasets/height.txt', header=T)
summary(height)
```

- The fathers' heights were rounded to the nearest inch.

- For each father height, the average height of the corresponding sons is given.

- The number of fathers in each category is given.

```{r echo=FALSE}
attach(height)
x = cbind(rep(1, nrow(height)), FatherH)
y = SonmeanH
```

### *i.* Construct a linear regression model for predicting the height of the son from the height of the father, in the best manner given the information available.

- There are several observations underlying each single value of $x$; therefore, I will use generalized least squares, computing a weight for each observation, to estimate the parameters for the rest of the questions.

- Since the observed responses $y_i$ are actually averages of several observations (`number of fathers`), we set the weight equal to the number of observations, as derived below.

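The derivation behind the weights: averaging $n_i$ independent sons' heights shrinks the error variance by $n_i$, so

$$var(y_i) = var\Big(\frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}\Big) = \frac{\sigma^2}{n_i} \quad\Rightarrow\quad w_i = n_i.$$
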
#### Hard-coding version

- Construct $\Sigma$ and $\Sigma^{-1}$ first.

```{r echo=FALSE}
Sig = diag(12)
Sig[Sig==1] = 1/NumofFather
Sigi = solve(Sig)
```

```{r echo=FALSE}
Sig
```

- Since $w_i$ is proportional to $1/var(\epsilon_i)$:

```{r echo=FALSE}
Sigi
```

- $S=diag(1/{\sqrt w_1}, 1/{\sqrt w_2}, ...)$, then $\Sigma = SS^T$.

```{r echo=FALSE}
S = chol(Sig)
Si = solve(S)
sx = Si%*%x
sy = Si%*%y
```

- Then we can use OLS to regress $S^{-1}Y$ on $S^{-1}X$.

```{r}
g = lm(sy~sx-1)
summary(g, cor=T)
```

#### Weighted least squares version

- $w_i = n_i$, $S^{-1} = diag(\sqrt{w_i})$

```{r}
Sigi = sqrt(NumofFather)
g1 = lm(SonmeanH~FatherH, weights = NumofFather)
summary(g1, cor=T)
```

- Though the two fits look the same, the latter method keeps the intercept in the model, so its reported $R^2$ is more rigorous than that of the hard-coding approach, where the no-intercept fit `lm(sy~sx-1)` computes $R^2$ without centering.

```{r echo=FALSE}
plot(FatherH, g1$res, cex=2, main='residual plot')
abline(0,0)
```

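The two routes are algebraically identical, so as a quick check the coefficient estimates should coincide; a minimal sketch:

```{r}
# Hard-coded GLS vs weighted lm(): same estimates up to naming
rbind(GLS = coef(g), WLS = coef(g1))
```
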
### *ii.* Can the model be simplified to $Son\ height =\ father\ height+\epsilon$?

- That is, can we predict the son's height directly as the father's height?
- $H_0$: the simplified model is adequate; $H_1$: it is not.
- The simplified model fixes the coefficient of father height at 1 with no intercept, so we treat the father's height as an offset and keep the same weights.

```{r}
g2 = lm(SonmeanH~offset(FatherH)-1, weights = NumofFather)
summary(g2)
```

- Comparing the two fits through their residuals:

  - Left: the weighted model with estimated coefficients.

  - Right: the simplified model with the height of the father as the only predictor (slope fixed at 1).

```{r echo=FALSE}
par(mfrow=c(1,2))
plot(FatherH, g1$res, main="full model residuals")
abline(0,0)
plot(FatherH, g2$res, main="simplified model residuals")
abline(0,0)

anova(g1, g2)
```

- We compare their residual variances to test whether the simplified model is good enough for prediction.
- The F-statistic has the null hypothesis "there is not much difference between the two models" against the alternative "there is a significant difference between the two models."

```{r echo=FALSE}
ts = 10*1.783^2
ts

1-pchisq(ts, 10)
```

- The p-value is far smaller than the significance level, indicating that we reject the null hypothesis; that is, the model cannot be simplified.

```{r echo=FALSE}
detach(height)
```

## Q2. Ultrasonic measurements of the depths of defects in the Alaska pipeline, in the field and in the lab.

```{r echo=FALSE}
data1 = read.table('http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5410/data/pipeline.txt', header=T)
attach(data1)
```

- Take a look at the data.

```{r echo=FALSE}
summary(data1)
```

```{r echo=FALSE}
knitr::kable(table(Batch))
```

- The batch effect is not significant and can be ignored.

- Lab measurements are more accurate than those taken in the field.

### *i.* Fit a regression model Lab\~Field, and check for non-constant variance.

```{r}
g = lm(Lab~Field)
summary(g)
```

```{r echo=FALSE}
plot(Field, Lab)
```

- It is obvious from the plot that the variance is not constant.

- Also, there are replicate measurements in this dataset.

```{r echo=FALSE}
knitr::kable(data1[Field==18,])
```

```{r echo=FALSE}
plot(g, which=1)
```

- We can see that the variability around 0 increases as we move further to the right, toward bigger fitted values.

- Hence, we can confidently conclude that the data do not have constant variance.

We can also use the ***NCV test*** to examine the property of non-constant variance.

$H_0: error\ terms\ have\ constant\ variance\ ;\ H_1: error\ terms\ have\ non-constant\ variance.$

```{r echo=FALSE}
library(car)
ncvTest(g)
```

- As we can observe from the result, the p-value of the NCV test is small; at the $\alpha=0.05$ significance level we reject $H_0$, that is, the model has non-constant variance.

### *ii.* Use weights to account for the non-constant variance.

> Suppose we assume that the variance in the response is linked to the predictor: $var(Lab) = \alpha_0*Field^{\alpha_1}$.

> Regress $log(var(Lab))$ on $log(mean(Field))$ to estimate $\alpha_0, \alpha_1$.

> Use this to determine weights in a WLS fit of `Lab` on `Field`. Show the regression summary.

- Computation: split the range of `Field` into 12 groups of size 9, except for the last group, which has only 8 values.

```{r include=FALSE}
i <- order(Field)
npipe = data1[i,]

ff <- gl(12,9)[-108]  # 12 groups with 9 values each (the last keeps 8)

# mean of field within each group
meanfield <- unlist(lapply(split(npipe$Field,ff),mean))

# variance of lab within each group
varlab <- unlist(lapply(split(npipe$Lab,ff),var))

# removing the last group
meanfield = meanfield[-12]
varlab = varlab[-12]
```

- $log(var(Lab)) = log(\alpha_0)+\alpha_1*log(Field)$

```{r}
g4 = lm(log(varlab)~log(meanfield))
summary(g4)
```

- $\alpha_0 = e^{-1.9352} = 0.1444,\ \alpha_1 = 1.6707$

- $var(Lab)=0.1444*Field^{1.6707}$

- $\sigma = \sqrt{0.1444} = 0.38,\ w_i = \frac{1}{\alpha_0*Field^{\alpha_1}}$ (the constant $\alpha_0$ can be dropped, since weights only matter up to proportionality)

```{r}
w = 1/Field^(1.6707)
g1 = lm(Lab~Field, weights = w)
summary(g1)
```

```{r echo=FALSE}
par(mfrow = c(1,2), mar = c(5,4,3,1))
plot(g1, which = 1)
plot(g1$fitted.values, g1$residuals*sqrt(w), main = "Residuals*sqrt(w) vs Fitted", xlab = "Fitted values", ylab = "Residuals * sqrt(w)")
abline(a = 0, b = 0, lty = "dashed", col = "#c4c4c4")
```

- As we can observe from the residual plot on the left, the raw residuals of the new model still become larger as the fitted values grow.

- Multiplying the residuals by $\sqrt{w}$ gives the plot on the right: the weighted residuals no longer increase with the fitted values, so the problem of non-constant variance is solved.

- Check with the NCV test:

```{r}
ncvTest(g1)
```

```{r echo=FALSE}
detach(data1)
```

## Q3. This dataset provides data on the outside diameter of crankpins produced by an industrial process.

```{r echo=FALSE}
data2 = read.table('http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5410/data/crank.txt', header=T)
attach(data2)
```

- All the crankpins should be between 0.7425 and 0.7430 inches.

- The numbers given in the table are in units of 0.00001 inches of deviation from 0.742 inches.

```{r}
summary(data2)
```

- So the actual outside diameters of the crankpins range from 0.74272 to 0.743 inches.

- Under control, the average size of the crankpins produced should...

1. fall near the middle of the specified range
2. not depend on the time.

- Fit an appropriate model to see if the process is under control, and test for lack of fit in the model.

```{r}
plot(day,diameter,xlab="day",ylab="diameter")
abline(h=(50+100)/2,col=8)       # middle of the spec range, in coded units
abline(h=c(50,100),col=8,lty=2)  # spec limits 0.7425 and 0.7430, in coded units
```
### Under control test

- $H_0$: the process is under control; $H_1$: it is not.

```{r echo=FALSE}
g = lm(diameter~1)
```

- This is a dataset with replicate values, which show a pattern of unequal variance.
- Calculate the per-day sample variance and mean.

```{r echo=FALSE}
dayvar = c()
daymean = c()

for(i in seq(1,22,3)){dayvar = c(dayvar, var(data2[day==i,]$diameter))}

for(i in seq(1,22,3)){daymean = c(daymean, mean(data2[day==i,]$diameter))}
```

- The data clearly show inconsistent variation.

- Hence, we calculate the weights as 1/(sample variance).

```{r echo=FALSE}
w = 1/dayvar  # reciprocal of the per-day sample variance, matching the text above
knitr::kable(rbind(unique(day), dayvar, daymean, w))
data2['var'] = rep(dayvar, each=5)
```

- Construct $\Sigma = diag(dayvar)$, take its Cholesky decomposition, and fit the weighted model on the day means.

```{r}
uday = unique(day)
Sig = diag(dayvar)  # per-day error covariance on the means (up to a constant)
S = chol(Sig)
Si = solve(S)

g1 = lm(daymean~uday, weights=w)
summary(g1, cor=T)
```

```{r echo=FALSE}
plot(day, diameter, main='H1: Not Under control', cex=2, ylim=c(70, 102))
abline(g1)
abline(h=(50+100)/2, col='red')
```

- If the process is under control, the average diameter produced should fall near the middle of the specified range, $75 = (50+100)/2$ in coded units, and should not depend on the predictor, day.

```{r echo=FALSE}
# Diameter against observation order, with the mid-spec line for reference
plot(seq_len(nrow(data2)), data2$diameter, xlab='observation', ylab='diameter')
abline(h=75, col='blue')
```


```{r echo=FALSE}
pred = predict(g1, data.frame(uday=unique(day)))
knitr::kable(pred)
plot(day, diameter, ylim=c(50, 100))
abline(h=75, col='blue')
points(unique(day), pred, pch=17, cex=1)
points(unique(day), daymean, col='red', pch=19, cex=1)
```

- The red dots indicate the mean of the replicate observations on each day, and the black triangles represent the fitted means from the weighted regression on day.

```{r echo=FALSE}
g = lm(daymean~1, weights=w)  # under-control model: constant mean, same weights
anova(g, g1)
```

- Since there is a significant difference between the null-hypothesis model and the alternative-hypothesis model, we reject the null hypothesis; that is, we do not have enough evidence to conclude that the process is under control.

### Test for lack of fit

- Saturated model

```{r}
g2 = lm(diameter~factor(day))
summary(g2)
```

- The saturated model has 8 parameters.

- The residual standard error is 5.16, which is the pure-error estimate of the true $\sigma$.

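The pure-error estimate can also be computed directly as the pooled within-day standard deviation; a minimal sketch that should reproduce the value above:

```{r}
# Pooled within-day SD: sqrt(SS_pure-error / (n - number of days))
sqrt(sum((diameter - ave(diameter, day))^2) / (length(diameter) - length(unique(day))))
```
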
```{r echo=FALSE}
plot(day, diameter)
abline(g1, col='red')
points(data2$day, g2$fitted.values, pch=17, cex=1)
grid <- seq(1, 22, length=8); lines(grid, predict(g2, data.frame(day=grid)), lty=2)
```

```{r echo=FALSE}
# Lack-of-fit test: linear-in-day model vs the saturated model, both fitted
# to the raw observations so that the F-test compares like with like
g3 = lm(diameter~day)
anova(g3, g2)
```

- The p-value is above 0.05, so we do not reject the null hypothesis that there is no lack of fit.

```{r}
detach(data2)
```

Assignment-6 Diagnostic.Rmd

+553
Large diffs are not rendered by default.

Assignment-7 Polynomial regression.Rmd

+476
Large diffs are not rendered by default.
