Commit c4f1b3a
Add files via upload
1 parent de612f6, 30 files changed: +5337 -0 lines changed

@@ -0,0 +1,22 @@
- the examiner gives a broad topic/keyword and you have to know the main formulas and describe it (but not every single one, around 20% should suffice)
- you can take the exam 3 times; after that it becomes a commissional exam
- you have 2 years to pass the exams; after that you have to repeat the course exercises
- oral mode:
    - 30 min slots, 2-3 students, 2-3 questions
    - blank page for formulas
- written mode:
    - 30 min slots, 3-5 questions, 4-6 min per question
    - all students in the exam hall
    - questions are dictated and answered on paper
    - questions are more specific, so you don't have much to write

*formulas*

- the 1-2 most important formulas
- objective functions, cost functions, minimization criteria, solution

*concepts*

- explain "what it does", "how", and what the involved variables are
- explain quantities, optimization problems, side conditions, the solution
- compare to other or similar methods that were covered

amrc - advanced methods for regression and classification 105.707/exams.md

+340 lines changed (large diffs are not rendered by default)

@@ -0,0 +1,67 @@
# this container just serves to install the `klaR` package
#
# usage:
#
# $ docker compose up -d
# $ docker compose exec main Rscript -e "rmarkdown::render('ex7.rmd', output_format = 'pdf_document')"
#
# or alternatively, on macOS (check the gfortran path with `which gfortran`):
#
# $ brew install gcc
# $ mkdir -p ~/.R
# $ printf 'FC = /opt/homebrew/bin/gfortran\nF77 = /opt/homebrew/bin/gfortran\nFLIBS = -L/opt/homebrew/lib\n' > ~/.R/Makevars
#
# and then just:
#
# ```R
# Sys.setenv(PATH=paste("/opt/homebrew/bin", Sys.getenv("PATH"), sep=":"))
# install.packages(c("classInt", "questionr", "klaR"), type="source")
# ```

services:
  main:
    container_name: main
    volumes:
      - type: bind
        source: .
        target: /workspace
    working_dir: /workspace
    ports:
      - '8888:8888'
    build:
      context: .
      dockerfile_inline: |
        FROM --platform=linux/amd64 ubuntu:20.04

        ARG DEBIAN_FRONTEND=noninteractive
        RUN apt-get clean && rm -rf /var/lib/apt/lists/* && apt-get update
        RUN apt-get install -y ca-certificates
        RUN update-ca-certificates

        # pandoc and latex (for rendering rmd to pdf)
        RUN apt-get install -y --no-install-recommends pandoc texlive-latex-base texlive-fonts-recommended
        RUN apt-get install -y --no-install-recommends texlive-latex-extra texlive-xetex texlive-lang-japanese
        RUN apt-get install -y --no-install-recommends texlive-fonts-extra texlive-lang-cjk texlive-latex-recommended texlive-bibtex-extra
        RUN apt-get install -y --no-install-recommends texlive-science texlive-pictures biber libfreetype6-dev libfontconfig1-dev
        RUN apt-get install -y --no-install-recommends lmodern

        # fortran 9 (needed to compile klaR's dependencies from source)
        RUN apt-get install -y gfortran-9
        RUN update-alternatives --install /usr/bin/gfortran gfortran /usr/bin/gfortran-9 100

        # r
        RUN apt-get clean
        RUN apt-get autoremove
        RUN apt-get update
        RUN apt-get install -y --no-install-recommends r-base r-base-dev
        RUN apt-get install -y automake build-essential zlib1g-dev r-cran-httpuv

        # rmd
        RUN Rscript -e 'pkgs <- c("rmarkdown", "ISLR", "IRkernel", "knitr", "kableExtra"); for (p in pkgs) if (!requireNamespace(p, quietly = TRUE)) install.packages(p, repos = "https://cran.rstudio.com")'

        # klaR
        RUN Rscript -e 'pkgs <- c("klaR", "e1071", "MASS", "ggplot2", "dplyr", "tidyr", "readr", "stringr"); for (p in pkgs) if (!requireNamespace(p, quietly = TRUE)) install.packages(p, repos = "https://cran.rstudio.com")'

        CMD ["tail", "-f", "/dev/null"]
@@ -0,0 +1,277 @@
---
title: "Exercise 1"
subtitle: "105.707 Advanced Methods for Regression and Classification, 2024W"
author: "11912007 - Yahya Jabary"
date: "`r format(Sys.time(), '%d.%m.%Y')`"
output:
  pdf_document:
    number_sections: true
    toc: true
    toc_depth: 3
documentclass: article
papersize: a4
fontsize: 8pt # find a way to go lower
geometry:
  - top=10mm
  - bottom=15mm
  - left=10mm
  - right=10mm
header-includes:
  # smaller code
  - \usepackage{fancyvrb}
  - \fvset{fontsize=\scriptsize}
  - \usepackage{listings}
  - \lstset{basicstyle=\scriptsize\ttfamily}
  - \RecustomVerbatimEnvironment{verbatim}{Verbatim}{fontsize=\scriptsize}
---

\newpage

```{r, echo=FALSE, message=FALSE, warning=FALSE}
deps <- c("ISLR", "Metrics", "ggplot2", "tidyr", "microbenchmark", "dplyr", "patchwork", "gridExtra")
for (p in deps) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p, repos = "https://cran.rstudio.com")
  }
  library(p, character.only = TRUE)
}

set.seed(42)
options(scipen = 999)
```

# Preprocessing

**Question:** *Is any preprocessing necessary or useful? Argue why a log-transformation of the response variable can be useful.*

We don't have any missing values to handle.

But a log-transformation of the response variable `Apps` can be useful for:

- Fixing skewed data: The summary statistics show that `Apps` is highly right-skewed, with a median (1558) much lower than the mean (3002) and a very large maximum (48094) compared to the minimum (81). A log-transformation reduces this right-skewness and makes the distribution more symmetric.
- Scale compression: Log-transformation compresses the scale of variables that span several orders of magnitude, as is the case with `Apps` (from 81 to 48094).
- Variance stabilization and outlier impact reduction: By pulling very large values closer together and spreading out very small ones, the transformation stabilizes the variance and reduces the influence of extreme observations.
- Linearization of relationships: Log-transforming the response can make relationships with the predictors more linear, which improves the fit of linear regression models.
- Interpretation in terms of relative changes: With a log-transformed response, coefficients can be interpreted as (approximate) percentage changes rather than absolute changes, which is sometimes more intuitive.

There are some caveats to keep in mind, however:

- It doesn't make sense for all variables.
- After log-transforming, results need to be interpreted differently (e.g., transformed back to the original scale).
- Zero or negative values can't be log-transformed directly (see the sketch below).
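
On the last caveat, a minimal sketch of a common workaround, assuming zeros could occur (they don't in `Apps`, whose minimum is 81):

```r
# log1p(x) = log(1 + x) is defined at zero, unlike log(x)
x <- c(0, 10, 100, 1000)
log1p(x)
expm1(log1p(x)) # expm1 is the exact inverse, recovering x
```
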
```{r, echo=TRUE, fig.width=15, fig.height=4}
data(College)

# no missing values to drop
cat("num of missing vals: ", sum(is.na(College)), "\n")

# log transform fixes skewness
before <- ggplot(College, aes(Apps, Grad.Rate)) + geom_point() + theme_minimal() + labs(title = "Applications vs. Graduation Rate")
after <- ggplot(College, aes(log(Apps), Grad.Rate)) + geom_point() + theme_minimal() + labs(title = "Log(Applications) vs. Graduation Rate")
print(before + after) # patchwork: side by side
summary(College$Apps)
```

# Training Full model

**Question:** *Interpret the outcome of `summary(res)`, where `res` is the output from the `lm()` function. Which variables contribute to explaining the response variable?*

The significance codes, denoted by the stars next to the variable names, indicate how much each variable contributes to explaining the response `log(Apps)` in the model. Some variables are highly significant, with p-values as low as 2e-16, while others are significant with p-values below 0.01.

The model has an Adjusted R-squared of 0.7155, indicating that it explains about 71.55% of the variance in `log(Apps)`.
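
For reference, the adjusted R-squared penalizes the ordinary $R^2$ for the number of predictors $p$ (with $n$ observations):

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}$$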

**Question:** *Look at diagnostics plots with `plot(res)`. Are the model assumptions fulfilled?*

- Residuals vs Fitted: There's a slight curve in the red line, suggesting some non-linearity might be present.
- Q-Q Residuals: The residuals mostly follow the diagonal line, indicating approximate normality, with some deviation at the tails.
- Scale-Location: There's a slight upward trend, suggesting some heteroscedasticity (i.e., the spread of the residuals is not constant across the fitted values).
- Residuals vs Leverage: No points fall outside Cook's distance (dotted lines), indicating no highly influential points (outliers) in the data.

While not perfect, the model assumptions are reasonably well met, with some minor concerns about non-linearity and heteroscedasticity.

**Question:** *How is R handling binary variables (Private), and how can you interpret the corresponding regression coefficient?*

R handles the binary variable `Private` via dummy coding. The coefficient for `PrivateYes` represents the average difference in `log(Apps)` between private and public institutions, holding the other variables constant. Private institutions receive approximately $\exp(-0.630) \approx 0.53$ times as many applications as public institutions, i.e. about $47\%$ fewer, all else being equal.
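
A minimal sketch of the dummy coding and the coefficient interpretation (the value $-0.630$ is taken from the model summary above; the toy factor is made up for illustration):

```r
# R expands a two-level factor into an intercept plus a single 0/1 indicator column
x <- factor(c("No", "Yes", "No", "Yes"))
model.matrix(~ x) # the column "xYes" is 1 for "Yes" and 0 for "No"

# converting the log-scale coefficient into a relative change in applications
(exp(-0.630) - 1) * 100 # roughly -47, i.e. about 47% fewer applications
```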

**Question:** *Compare the resulting coefficients with those obtained from `lm()`. Do you get the same result?*

The test passes with a tolerance of 1e-10.

Note: We get the warning "names for current but not for target", which simply means that `coef(full_model)` carries names while our analytically computed vector doesn't; this has no effect on the numeric comparison.
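
For reference, the analytic solution computed in the chunk below is the ordinary least-squares estimator from the normal equations:

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$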

**Question:** *What do you think about the prediction performance of your model (using a graphical representation)?*

The scatter plots of observed vs. predicted applications show a strong positive correlation between observed and predicted values for both training and test data.

We can also see that the model tends to underpredict very high numbers of applications, especially in the test data. Prediction accuracy seems slightly better for the training data than for the test data, which is expected.

**Question:** *Compute the RMSE separately for training and test data, and compare the values. What do you conclude?*

The model didn't generalize well to the test data, with a difference of $\approx 911$ between the training and test RMSE values.
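
For reference, the RMSE reported below (via `Metrics::rmse`) is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$

computed here on the back-transformed predictions $\exp(\hat{y})$, so the errors are on the original application scale.
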
```{r, echo=TRUE, fig.width=15, fig.height=8}
data(College)

# holdout 2/3 split
train_index <- sample(1:nrow(College), 2/3 * nrow(College))
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

# fit model
full_model <- lm(log(Apps) ~ . - Accept - Enroll, data = train_data)
summary(full_model)
par(mfrow = c(2, 2))
plot(full_model)

# solve analytically
X <- model.matrix(~ . - Accept - Enroll - Apps, data = train_data) # model matrix X
y <- log(train_data$Apps) # response vector y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y # LS estimate of beta via the normal equations
all.equal(as.vector(beta_hat), coef(full_model), tolerance = 1e-10) # compare coefficients (warning about names is expected)

# get rmse error on the original application scale
train_error <- rmse(train_data$Apps, exp(predict(full_model)))
test_error <- rmse(test_data$Apps, exp(predict(full_model, newdata = test_data)))
cat("train RMSE: ", train_error, "vs. test RMSE: ", test_error, "\n")
```

```{r, echo=TRUE, fig.width=15, fig.height=3}
# plot observed vs predicted
train_plot_data <- data.frame(Observed = train_data$Apps, Predicted = exp(predict(full_model, newdata = train_data)))
train_plot <- ggplot(train_plot_data, aes(x = Observed, y = Predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  scale_x_log10() + scale_y_log10() +
  labs(title = "Training Data: Observed vs Predicted (Log Scale)", x = "Observed Apps (log scale)", y = "Predicted Apps (log scale)") +
  theme_minimal()
test_plot_data <- data.frame(Observed = test_data$Apps, Predicted = exp(predict(full_model, newdata = test_data)))
test_plot <- ggplot(test_plot_data, aes(x = Observed, y = Predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  scale_x_log10() + scale_y_log10() +
  labs(title = "Test Data: Observed vs Predicted (Log Scale)", x = "Observed Apps (log scale)", y = "Predicted Apps (log scale)") +
  theme_minimal()
combined_plot <- train_plot + test_plot + plot_layout(ncol = 2) + plot_annotation(title = "Observed vs Predicted Applications")
print(combined_plot)
```

# Training Reduced model

**Question:** *Are now all input variables significant in the model? Why is this not to be expected in general?*

In the reduced model, all input variables are now significant at the 0.05 level or better.

However, this is not generally expected, for several reasons:

- Variable selection bias: By selecting only significant variables from the full model, we introduce a selection bias that can artificially inflate the significance of the remaining variables.
- Overfitting: The reduced model may be overfitting to the specific dataset, capturing noise rather than true underlying relationships.
- Multicollinearity (i.e., some variables being highly correlated with each other): Removing variables can change the relationships between the remaining ones, potentially altering their significance levels.

**Question:** *Compute the RMSE for the new model. What would we expect?*

The reduced model's RMSE values are 7549.269 (train) and 8459.517 (test). These values are higher than those of the full model, 4332.683 (train) and 2288.898 (test).

This increase in RMSE is expected because (1) the reduced model has fewer predictors, potentially losing some explanatory power, and (2) the full model may have been capturing some important relationships that are now omitted.

However, given its lower complexity, the reduced model could potentially generalize better to new data, despite the higher RMSE on the current test set.

**Question:** *Compare the two models with `anova()`. What can you conclude?*

The ANOVA test comparing the reduced and full models yields a highly significant result (p-value < 0.001). This indicates a statistically significant difference between the two models; in other words, the full model explains significantly more variance in the data than the reduced model.

However, it's important to note that statistical significance doesn't always translate to practical significance.

In conclusion, the reduced model, while explaining less variance and performing worse on the test set, might be more interpretable and generalizable. The choice between the two models depends on the specific goals of the analysis, balancing model complexity, interpretability and predictive power.
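
For reference, `anova()` compares the two nested models with an F-test on their residual sums of squares, where $df$ denotes residual degrees of freedom:

$$F = \frac{(\mathrm{RSS}_{\mathrm{red}} - \mathrm{RSS}_{\mathrm{full}}) \,/\, (df_{\mathrm{red}} - df_{\mathrm{full}})}{\mathrm{RSS}_{\mathrm{full}} \,/\, df_{\mathrm{full}}}$$
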
```{r, echo=TRUE, fig.width=15, fig.height=4}
data(College)

# holdout 2/3 split
train_index <- sample(1:nrow(College), 2/3 * nrow(College))
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

# keep coefficients with p-values < 0.05
full_model <- lm(log(Apps) ~ . - Accept - Enroll, data = train_data)
summary_full <- summary(full_model)
coef_summary <- summary_full$coefficients
significant_vars <- rownames(coef_summary)[coef_summary[, "Pr(>|t|)"] < 0.05]
significant_vars <- significant_vars[!grepl("^\\(Intercept\\)|Yes$", significant_vars)] # drop intercept and factor dummy names (e.g. PrivateYes), which are not valid formula terms
reduced_formula <- as.formula(paste("log(Apps) ~", paste(significant_vars, collapse = " + ")))

# fit reduced model
reduced_model <- lm(reduced_formula, data = train_data)
summary(reduced_model)

# plot observed vs predicted
train_data_plot <- data.frame(Observed = train_data$Apps, Predicted = exp(predict(reduced_model, train_data)))
p1 <- ggplot(train_data_plot, aes(x = Observed, y = Predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  scale_x_log10() + scale_y_log10() +
  labs(title = "Training Data: Observed vs Predicted (Log Scale)", x = "Observed Apps (log scale)", y = "Predicted Apps (log scale)") +
  theme_minimal()
test_data_plot <- data.frame(Observed = test_data$Apps, Predicted = exp(predict(reduced_model, test_data)))
p2 <- ggplot(test_data_plot, aes(x = Observed, y = Predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  scale_x_log10() + scale_y_log10() +
  labs(title = "Test Data: Observed vs Predicted (Log Scale)", x = "Observed Apps (log scale)", y = "Predicted Apps (log scale)") +
  theme_minimal()
grid.arrange(p1, p2, ncol = 2)

# get rmse error on the original application scale
train_error <- rmse(train_data$Apps, exp(predict(reduced_model)))
test_error <- rmse(test_data$Apps, exp(predict(reduced_model, newdata = test_data)))
cat("train RMSE: ", train_error, "vs. test RMSE: ", test_error, "\n")

# anova: F-test of the nested models
anova_result <- anova(reduced_model, full_model)
print(anova_result)
```

# Variable selection

Finally, we perform forward and backward selection to find the best model and visualize the results.
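
For reference, `step()` adds or drops terms greedily based on AIC; for a linear model, `extractAIC()` computes it (up to an additive constant) as

$$\mathrm{AIC} = n \log(\mathrm{RSS}/n) + 2\,\mathrm{edf}$$

where edf is the equivalent degrees of freedom of the fit.
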
```{r, echo=TRUE, fig.width=15, fig.height=8}
data(College)
College <- na.omit(College)

# holdout 2/3 split
train_index <- sample(1:nrow(College), 2/3 * nrow(College))
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

# forward / backward selection
empty_model <- lm(log(Apps) ~ 1, data = train_data)
full_model <- lm(log(Apps) ~ . - Accept - Enroll, data = train_data)
forward_model <- step(empty_model, direction = "forward", scope = formula(full_model), trace = 0)
backward_model <- step(full_model, direction = "backward", trace = 0)

# rmse on the original application scale
calculate_rmse <- function(model, data) {
  predictions <- exp(predict(model, newdata = data))
  actual <- data$Apps
  sqrt(mean((actual - predictions)^2))
}
cat("forward selection RMSE:", calculate_rmse(forward_model, train_data), "(train) vs.", calculate_rmse(forward_model, test_data), "(test)\n")
cat("backward selection RMSE:", calculate_rmse(backward_model, train_data), "(train) vs.", calculate_rmse(backward_model, test_data), "(test)\n")

# plot observed vs predicted
create_plot <- function(model, data, title) {
  predictions <- exp(predict(model, newdata = data))
  ggplot(data.frame(observed = data$Apps, predicted = predictions), aes(x = observed, y = predicted)) +
    geom_point(alpha = 0.5, color = "black") +
    geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
    scale_x_log10() + scale_y_log10() +
    labs(title = title, x = "Observed (log scale)", y = "Predicted (log scale)") +
    theme_minimal() +
    theme(plot.title = element_text(size = 10, face = "bold"))
}
p1 <- create_plot(forward_model, train_data, "Forward Selection (Train)")
p2 <- create_plot(forward_model, test_data, "Forward Selection (Test)")
p3 <- create_plot(backward_model, train_data, "Backward Selection (Train)")
p4 <- create_plot(backward_model, test_data, "Backward Selection (Test)")
combined_plot <- (p1 + p2) / (p3 + p4)
print(combined_plot)
```
