---
title: "Exercise 1"
subtitle: "105.707 Advanced Methods for Regression and Classification, 2024W"
author: "11912007 - Yahya Jabary"
date: "`r format(Sys.time(), '%d.%m.%Y')`"
output:
  pdf_document:
    number_sections: true
    toc: true
    toc_depth: 3
documentclass: article
papersize: a4
fontsize: 8pt # find a way to go lower
geometry:
  - top=10mm
  - bottom=15mm
  - left=10mm
  - right=10mm
header-includes:
  # smaller code
  - \usepackage{fancyvrb}
  - \fvset{fontsize=\scriptsize}
  - \usepackage{listings}
  - \lstset{basicstyle=\scriptsize\ttfamily}
  - \RecustomVerbatimEnvironment{verbatim}{Verbatim}{fontsize=\scriptsize}
---

\newpage

```{r, echo=FALSE, message=FALSE, warning=FALSE}
# install any missing dependencies, then attach them
deps <- c("ISLR", "Metrics", "ggplot2", "tidyr", "microbenchmark", "dplyr", "patchwork", "gridExtra")
for (p in deps) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p, repos = "https://cran.rstudio.com")
  }
  library(p, character.only = TRUE)
}

set.seed(42)
options(scipen = 999) # avoid scientific notation in printed output
```

# Preprocessing

**Question:** *Is any preprocessing necessary or useful? Argue why a log-transformation of the response variable can be useful.*

There are no missing values to handle.

A log-transformation of the response variable `Apps` can nevertheless be useful, for several reasons:

- Fixing skewed data: `Apps` is strongly right-skewed. The summary statistics show a median (1558) far below the mean (3002) and a maximum (48094) that dwarfs the minimum (81). A log-transformation reduces this right-skewness and makes the distribution more symmetric.
- Scale compression: the log compresses variables that span several orders of magnitude, as is the case for `Apps` (81 to 48094).
- Variance stabilization and reduced outlier impact: the transformation pulls very large values closer together and spreads out small ones, so extreme observations dominate the fit less and the residual variance becomes more homogeneous.
- Linearization of relationships: relationships that are multiplicative on the original scale often become approximately linear after taking logs, which improves the fit of a linear regression model.
- Interpretation in terms of relative changes: with a log-transformed response, coefficients can be read as (approximate) percentage changes rather than absolute changes, which is often more intuitive.

There are some caveats to keep in mind, however:

- It does not make sense for all variables.
- After log-transforming, results must be interpreted on the log scale (or transformed back to the original scale).
- Zero or negative values cannot be log-transformed directly (see the short sketch below; this is not an issue here, since the minimum of `Apps` is 81).

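As a minimal sketch of the last caveat (purely illustrative, since `Apps` contains no zeros), the shifted transformation `log1p(x) = log(1 + x)` and its inverse `expm1()` handle zero values gracefully:

```{r}
x <- c(0, 81, 1558, 48094)
log(x)            # -Inf for the zero entry
log1p(x)          # finite everywhere
expm1(log1p(x))   # back-transformation recovers the original values
```
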
```{r, echo=TRUE, fig.width=15, fig.height=4}
data(College)

# no missing values to drop
cat("num of missing vals: ", sum(is.na(College)), "\n")

# the log-transformation reduces the skewness of Apps
before <- ggplot(College, aes(Apps, Grad.Rate)) + geom_point() + theme_minimal() + labs(title = "Applications vs. Graduation Rate")
after <- ggplot(College, aes(log(Apps), Grad.Rate)) + geom_point() + theme_minimal() + labs(title = "Log(Applications) vs. Graduation Rate")
print(before + after)
summary(College$Apps)
```

# Training the full model

**Question:** *Interpret the outcome of `summary(res)`, where `res` is the output from the `lm()` function. Which variables contribute to explaining the response variable?*

The significance codes (the stars printed next to each coefficient) indicate which variables contribute to explaining the response `log(Apps)`. Several predictors are highly significant, with p-values as small as 2e-16, while others are significant with p-values below 0.01.

The model has an adjusted R-squared of 0.7155, i.e. it explains about 71.55% of the variance in `log(Apps)`.

**Question:** *Look at diagnostics plots with `plot(res)`. Are the model assumptions fulfilled?*

- Residuals vs Fitted: there is a slight curve in the red line, suggesting some non-linearity might be present.
- Q-Q Residuals: the residuals mostly follow the diagonal line, indicating approximate normality, with some deviation at the tails.
- Scale-Location: there is a slight upward trend, suggesting some heteroscedasticity (the spread of the residuals is not constant across fitted values).
- Residuals vs Leverage: no points fall outside Cook's distance (dotted lines), so there are no highly influential observations.

While not perfect, the model assumptions are reasonably well met, with minor concerns about non-linearity and heteroscedasticity.

**Question:** *How is R handling binary variables (Private), and how can you interpret the corresponding regression coefficient?*

R handles the binary variable `Private` using dummy coding: the factor level "No" serves as the baseline, and the dummy `PrivateYes` equals 1 for private institutions. The coefficient for `PrivateYes` is the average difference in `log(Apps)` between private and public institutions, holding the other variables constant. With a coefficient of about $-0.630$, private institutions receive approximately $\exp(-0.630) \approx 0.53$ times as many applications as comparable public institutions, i.e. roughly 47% fewer, all else being equal.

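A quick illustration of the dummy coding and of turning the log-scale coefficient into a percentage change (the value $-0.630$ is the coefficient reported in the model summary below):

```{r}
head(model.matrix(~ Private, data = College), 3)  # "No" is the baseline, PrivateYes is the 0/1 dummy
contrasts(College$Private)
(exp(-0.630) - 1) * 100  # approximate percentage change in Apps for private vs. public
```
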
**Question:** *Compare the resulting coefficients with those obtained from `lm()`. Do you get the same result?*

Yes: the `all.equal()` check passes with a tolerance of 1e-10, so the analytically computed least-squares estimate matches the `lm()` coefficients.

Note: we get the message "names for current but not for target" because `as.vector()` strips the coefficient names from the analytic solution while `coef()` returns a named vector. The order and the values are identical, so this does not affect the comparison.

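A small optional tweak (using the `beta_hat` and `full_model` objects created in the code chunk at the end of this section) that keeps the coefficient names and therefore avoids the message:

```{r, eval=FALSE}
# drop() turns the n x 1 matrix into a named vector, so both sides carry names
all.equal(drop(beta_hat), coef(full_model), tolerance = 1e-10)
```
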
**Question:** *What do you think about the prediction performance of your model (using a graphical representation)?*

The scatter plots of observed vs. predicted applications show a strong positive association for both training and test data.

The model tends to underpredict for very high numbers of applications, especially in the test data. Prediction accuracy appears slightly better on the training data than on the test data, which is expected.

**Question:** *Compute the RMSE separately for training and test data, and compare the values. What do you conclude?*

The model does not generalize particularly well to the test data: the test RMSE is roughly 911 higher than the training RMSE (both computed on the original scale of `Apps` after back-transforming the predictions with `exp()`).

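For reference, the error measure used below is the root mean squared error on the original scale of the response,

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$

where $\hat{y}_i = \exp\bigl(\widehat{\log(\mathrm{Apps})}_i\bigr)$ is the back-transformed prediction.
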
```{r, echo=TRUE, fig.width=15, fig.height=8}
data(College)

# holdout 2/3 split
train_index <- sample(1:nrow(College), 2/3 * nrow(College))
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

# fit model
full_model <- lm(log(Apps) ~ . - Accept - Enroll, data = train_data)
summary(full_model)
par(mfrow = c(2, 2))
plot(full_model)

# solve analytically
X <- model.matrix(~ . - Accept - Enroll - Apps, data = train_data) # model matrix X
y <- log(train_data$Apps) # response vector y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y # LS estimate of beta analytically
all.equal(as.vector(beta_hat), coef(full_model), tolerance = 1e-10) # compare coeffs

# get rmse error (on the original scale of Apps)
train_error <- rmse(train_data$Apps, exp(predict(full_model)))
test_error <- rmse(test_data$Apps, exp(predict(full_model, newdata = test_data)))
cat("train RMSE: ", train_error, "vs. test RMSE: ", test_error, "\n")
```
```{r, echo=TRUE, fig.width=15, fig.height=3}
# plot observed vs predicted
train_plot_data <- data.frame(Observed = train_data$Apps, Predicted = exp(predict(full_model, newdata = train_data)))
train_plot <- ggplot(train_plot_data, aes(x = Observed, y = Predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  scale_x_log10() + scale_y_log10() +
  labs(title = "Training Data: Observed vs Predicted (Log Scale)", x = "Observed Apps (log scale)", y = "Predicted Apps (log scale)") +
  theme_minimal()
test_plot_data <- data.frame(Observed = test_data$Apps, Predicted = exp(predict(full_model, newdata = test_data)))
test_plot <- ggplot(test_plot_data, aes(x = Observed, y = Predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  scale_x_log10() + scale_y_log10() +
  labs(title = "Test Data: Observed vs Predicted (Log Scale)", x = "Observed Apps (log scale)", y = "Predicted Apps (log scale)") +
  theme_minimal()
combined_plot <- train_plot + test_plot + plot_layout(ncol = 2) + plot_annotation(title = "Observed vs Predicted Applications")
print(combined_plot)
```

# Training the reduced model

**Question:** *Are now all input variables significant in the model? Why is this not to be expected in general?*

In the reduced model, all input variables are now significant at the 0.05 level or better.

However, this is not to be expected in general, for several reasons:

- Variable selection bias: by selecting only the significant variables from the full model, we introduce a selection bias that can artificially inflate the apparent significance of the remaining variables.
- Overfitting: the reduced model may adapt to the specific training set, capturing noise rather than true underlying relationships.
- Multicollinearity (some predictors are highly correlated with each other): removing variables changes the relationships among the remaining ones, which can alter their significance levels; the sketch below shows some strongly correlated predictor pairs.

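A quick, illustrative check for multicollinearity among the numeric predictors (pairs with an absolute correlation above 0.9):

```{r}
num_preds <- College[, sapply(College, is.numeric)]
num_preds$Apps <- NULL                                   # exclude the response
cors <- cor(num_preds)
idx <- which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[idx[, 1]], var2 = colnames(cors)[idx[, 2]], cor = round(cors[idx], 2))
```
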
**Question:** *Compute the RMSE for the new model. What would we expect?*

The reduced model's RMSE values are 7549.269 (train) and 8459.517 (test), which are higher than those of the full model, 4332.683 (train) and 2288.898 (test).

This increase in RMSE is expected because (1) the reduced model has fewer predictors and therefore potentially less explanatory power, and (2) the full model may have been capturing relevant relationships that are now omitted.

However, given its lower complexity, the reduced model could still generalize better to new data, despite its higher RMSE on this particular test set.

**Question:** *Compare the two models with `anova()`. What can you conclude?*

The ANOVA comparison of the reduced and full models yields a highly significant result (p-value < 0.001). This indicates a statistically significant difference between the two models: the full model explains significantly more variance in the data than the reduced model.

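The test behind this comparison is the partial F-test for nested models. With $\mathrm{RSS}_r$ and $\mathrm{RSS}_f$ the residual sums of squares of the reduced and full model, $p_r < p_f$ their numbers of estimated parameters, and $n$ the number of training observations,

$$F = \frac{(\mathrm{RSS}_r - \mathrm{RSS}_f)/(p_f - p_r)}{\mathrm{RSS}_f/(n - p_f)},$$

which under the null hypothesis (the extra terms have zero coefficients) follows an $F_{p_f - p_r,\, n - p_f}$ distribution.
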
However, statistical significance does not always translate into practical significance.

In conclusion, the reduced model, while explaining less variance and performing worse on the test set, might be more interpretable and generalize better. The choice between the two models depends on the goals of the analysis, balancing model complexity, interpretability and predictive power.

```{r, echo=TRUE, fig.width=15, fig.height=4}
data(College)

# holdout 2/3 split
train_index <- sample(1:nrow(College), 2/3 * nrow(College))
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

# get coeffs and p-values < 0.05
full_model <- lm(log(Apps) ~ . - Accept - Enroll, data = train_data)
summary_full <- summary(full_model)
coef_summary <- summary_full$coefficients
significant_vars <- rownames(coef_summary)[coef_summary[, "Pr(>|t|)"] < 0.05]
significant_vars <- significant_vars[!grepl("^\\(Intercept\\)|Yes$", significant_vars)] # drop the intercept and the PrivateYes dummy (its underlying data column is Private)
reduced_formula <- as.formula(paste("log(Apps) ~", paste(significant_vars, collapse = " + ")))
# fit reduced model
reduced_model <- lm(reduced_formula, data = train_data)
summary(reduced_model)

# plot observed vs predicted
train_data_plot <- data.frame(Observed = train_data$Apps, Predicted = exp(predict(reduced_model, train_data)))
p1 <- ggplot(train_data_plot, aes(x = Observed, y = Predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  scale_x_log10() + scale_y_log10() +
  labs(title = "Training Data: Observed vs Predicted (Log Scale)", x = "Observed Apps (log scale)", y = "Predicted Apps (log scale)") +
  theme_minimal()
test_data_plot <- data.frame(Observed = test_data$Apps, Predicted = exp(predict(reduced_model, test_data)))
p2 <- ggplot(test_data_plot, aes(x = Observed, y = Predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  scale_x_log10() + scale_y_log10() +
  labs(title = "Test Data: Observed vs Predicted (Log Scale)", x = "Observed Apps (log scale)", y = "Predicted Apps (log scale)") +
  theme_minimal()
grid.arrange(p1, p2, ncol = 2)

# get rmse error (on the original scale of Apps)
train_error <- rmse(train_data$Apps, exp(predict(reduced_model)))
test_error <- rmse(test_data$Apps, exp(predict(reduced_model, newdata = test_data)))
cat("train RMSE: ", train_error, "vs. test RMSE: ", test_error, "\n")

# anova: partial F-test comparing the nested models
anova_result <- anova(reduced_model, full_model)
print(anova_result)
```

# Variable selection

Finally, we perform forward and backward stepwise selection with `step()`, which by default uses the AIC as selection criterion, and visualize the resulting models.

```{r, echo=TRUE, fig.width=15, fig.height=8}
data(College)
College <- na.omit(College)

# holdout 2/3 split
train_index <- sample(1:nrow(College), 2/3 * nrow(College))
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

# forward / backward selection
empty_model <- lm(log(Apps) ~ 1, data = train_data)
full_model <- lm(log(Apps) ~ . - Accept - Enroll, data = train_data)
forward_model <- step(empty_model, direction = "forward", scope = formula(full_model), trace = 0)
backward_model <- step(full_model, direction = "backward", trace = 0)

# rmse on the original scale of Apps
calculate_rmse <- function(model, data) {
  predictions <- exp(predict(model, newdata = data))
  actual <- data$Apps
  rmse <- sqrt(mean((actual - predictions)^2))
  return(rmse)
}
cat("forward selection RMSE:", calculate_rmse(forward_model, train_data), "(train) vs.", calculate_rmse(forward_model, test_data), "(test)\n")
cat("backward selection RMSE:", calculate_rmse(backward_model, train_data), "(train) vs.", calculate_rmse(backward_model, test_data), "(test)\n")

# plot observed vs predicted
create_plot <- function(model, data, title) {
  predictions <- exp(predict(model, newdata = data))
  ggplot(data.frame(observed = data$Apps, predicted = predictions), aes(x = observed, y = predicted)) +
    geom_point(alpha = 0.5, color = "black") +
    geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") + scale_x_log10() + scale_y_log10() +
    labs(title = title, x = "Observed (log scale)", y = "Predicted (log scale)") +
    theme_minimal() +
    theme(plot.title = element_text(size = 10, face = "bold"))
}
p1 <- create_plot(forward_model, train_data, "Forward Selection (Train)")
p2 <- create_plot(forward_model, test_data, "Forward Selection (Test)")
p3 <- create_plot(backward_model, train_data, "Backward Selection (Train)")
p4 <- create_plot(backward_model, test_data, "Backward Selection (Test)")
combined_plot <- (p1 + p2) / (p3 + p4)
print(combined_plot)
```
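
To make the selected models easier to compare, a short follow-up (reusing the objects fitted above) lists which predictors each procedure retained:

```{r}
forward_terms <- attr(terms(forward_model), "term.labels")
backward_terms <- attr(terms(backward_model), "term.labels")
forward_terms
backward_terms
setdiff(union(forward_terms, backward_terms), intersect(forward_terms, backward_terms)) # terms kept by only one of the two procedures
```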