# Walkthrough 8: Predicting students' final grades using supervised machine learning {#c14}
**Abstract**
This chapter introduces a different type of statistical model that is increasingly common -- machine learning. Specifically, we focus on supervised machine learning, which involves, first, identifying an outcome (another name for a dependent variable), and then building (or training) a model to predict that outcome. Some supervised machine learning models are highly complex, while others are simple. To illustrate and gain experience training and interpreting a supervised machine learning model, this chapter involves predicting whether students will pass a class using a simple model -- a generalized linear model. The Open University Learning Analytics Dataset (OULAD) is used as an example of the type of data used in learning analytics. The tidymodels collection of packages is used to carry out each of the principal supervised machine learning steps. At the conclusion, ways to build more complex models are discussed.
## Topics emphasized
- Processing data
- Modeling data
## Functions introduced
- initial_split()
- training()
- testing()
- recipe()
- logistic_reg()
- set_engine()
- set_mode()
- workflow()
- collect_predictions()
- collect_metrics()
## Vocabulary
- supervised machine learning
- training data
- testing data
- logistic regression
- classification
- predictions
- metrics
## Chapter overview
### Background
In a face-to-face classroom, teachers rely on cues from students to help them respond to and engage their students. But, online, educators do not have ready access to such cues---at least, not the same cues.
For example, in a face-to-face class, cues like students seeming distracted during a lecture can change what teachers do next. Many online educators find themselves looking for ways to understand and support students online in the same way that face-to-face instructors would. One affordance of technology is that it provides new methods of collecting and storing data that can, potentially, serve as the basis for different kinds of cues.
Online Learning Management Systems (LMSs) often automatically track several types of student interactions with the system---and feed that data back to the course instructor. The collection of this data is often met with mixed reactions from educators. Some are concerned that data collection in this manner is intrusive, but others see a new opportunity to support students in online contexts in new ways. As long as data are collected and utilized responsibly, data collection can support student success.
In this walkthrough, we examine the question, *How well can we predict students who are at risk of dropping a course?* To answer this question, we use data that is typical of learning analytics---student records, assessment outcomes, course outcomes, and, critically, measures of students' interactions with the course.
Here is the key shift with supervised machine learning: we focus on *predicting* an outcome--whether a student passes a course--more than on *explaining* how variables relate to an outcome, such as how the amount of time students spend on the course relates to their final grade. We do so with a relatively simple model, a generalized linear model (specifically, a logistic regression), though we conclude with a nod to other, more sophisticated models.
### Data sources
We'll be using a data set that is widely used in the learning analytics field and emblematic of this unique type of data: the Open University Learning Analytics Dataset (OULAD). The OULAD was created by learning analytics researchers at the United Kingdom-based Open University [@kuzilek2017]. It includes data from post-secondary learners' participation in one of several Massive Open Online Courses (called *modules* in the OULAD).
Many students successfully complete these courses, but not all do, highlighting the importance of identifying those who may be at risk.
### Methods
#### Predictive analytics and supervised machine learning
A buzzword in education software spheres these days is "predictive analytics". Administrators and educators alike are interested in applying the methods long utilized by marketers and other business professionals to try to determine what a person will want, need, or do next. "Predictive analytics" is a blanket term that can be used to describe any statistical approach that yields a prediction. We could ask a predictive model: "What is the likelihood that my cat will sit on my keyboard today?" and, given enough past information about your cat's computer-sitting behavior, the model could give you a probability of that computer-sitting happening today. Under the hood, some predictive models are not very complex.
If we have an outcome with two possibilities, a logistic regression model can be fit to the data to help us answer the cat-keyboard question. That is the kind of model we use in this chapter: a relatively simple logistic regression; at the end of the chapter, we point to more complex models (a random forest and a boosted tree) you can swap in. We want to fit as simple a model as possible for our data.
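To make that concrete, here is a minimal sketch (not run, and using entirely made-up data) of fitting such a logistic regression in base R:
```{r, eval = FALSE}
# entirely made-up data: did the cat sit on the keyboard (1 = yes, 0 = no),
# given how many minutes the laptop was left open?
cat_data <- data.frame(
  minutes_open = c(5, 10, 15, 20, 30, 45, 60, 90),
  sat_on_keyboard = c(0, 0, 1, 0, 0, 1, 1, 1)
)

# a logistic regression is a generalized linear model with a binomial family
cat_model <- glm(sat_on_keyboard ~ minutes_open, data = cat_data, family = binomial)

# predicted probability of a keyboard-sitting incident after 40 minutes
predict(cat_model, newdata = data.frame(minutes_open = 40), type = "response")
```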
There is an adage: "garbage in, garbage out". This holds true here. If we do not feel confident that the data we collected are accurate, we will not be able to be confident in our conclusions no matter what model we build. To collect good data, we must first clarify what it is that we want to know (i.e., what question are we really asking?) and what information we would need in order to effectively answer that question. Sometimes, people approach analysis from the opposite direction---they might look at the data they have and ask what questions could be answered based on that data. That approach is okay, as long as you are willing to acknowledge that sometimes the pre-existing dataset may *not* contain all the information you need, and you might need to go out and find additional information to add to your dataset to truly answer your question.
When people talk about "machine learning", you might get the image in your head of a desktop computer learning how to spell. You might picture your favorite social media site showing you advertisements that are just a little too accurate. At its core, machine learning is the process of "showing" your statistical model only some of the data at once and training the model to predict accurately on that training dataset (this is the "learning" part of machine learning). Then, the model as developed on the training data is shown new data---data you had all along, but hid from your computer initially---and you see how well the model that you developed on the training data performs on this new testing data. Eventually, you might use the model on entirely new data.
Let's dive into the analysis, starting with something you're familiar with---loading packages!
## Load packages
As always, if you have not installed any of these packages before, do so first using the `install.packages()` function. For a description of packages and their installation, review the [Packages](#c06p) section of the [Foundational Skills](#c06) chapter.
```{r, message = FALSE, warning = FALSE}
# load the packages
library(tidyverse)
library(tidymodels)
```
### Reading CSV Data
To begin, we will load student-level data using the `read_csv()` function. This dataset has undergone minimal preprocessing to help streamline our analysis -- it integrates information from three sources: `studentInfo`, `courses`, and `studentRegistration`, all of which relate to students (and courses they took).
Additionally, we will load an assessments dataset. This provides data on students' performance on various assessments throughout their courses.
```{r}
students <- read_csv("data/ml/oulad-students.csv")
assessments <- read_csv("data/ml/oulad-assessments.csv")
```
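Though not strictly necessary, it can help to take a quick look at what we just loaded; `glimpse()` shows each variable and its type:
```{r}
# a quick look at the variables in each data set
glimpse(students)
glimpse(assessments)
```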
### Preprocessing and Feature Engineering
Our goal is to build a model that predicts whether a student is at risk of dropping out. We will handle some feature engineering directly, below. Later, we will show you how to do similar steps in what is called the "recipe" in the supervised machine learning workflow. How do we decide when to take these steps? It depends in part on how straightforward it is to do what you want within the recipe. Another reason to take these steps within the recipe is to avoid an issue called "data leakage", which can bias a model. For simple steps, this may not be an issue, but for many others it is something to be aware of.
#### Step 1: Creating Outcome and Predictor Variables Outside the Recipe
To begin, we create the outcome variable (`pass`) and a factor variable for `disability` directly using `mutate()`:
```{r}
students <- students %>%
mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>%
mutate(pass = as.factor(pass),
disability = as.factor(disability))
```
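It's often worth pausing to check a newly created variable; counting the values of `pass` shows how many students passed and how many did not:
```{r}
# how many students passed (1) versus did not (0)?
students %>% 
  count(pass)
```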
We will also summarize the assessment data to create a new predictor based on students' performance on assessments submitted early in the course. Specifically, we will calculate the mean weighted score of assessments submitted before the midpoint (median) of each module's assessment dates.
```{r}
# find the midpoint (median) assessment date for each module and presentation
code_module_dates <- assessments %>%
  group_by(code_module, code_presentation) %>%
  summarize(quantile_cutoff_date = quantile(date, probs = .5, na.rm = TRUE))

# keep only assessments from before that cutoff date, weight each score,
# and calculate each student's mean weighted score
assessments_joined <- assessments %>%
  left_join(code_module_dates) %>%
  filter(date < quantile_cutoff_date) %>%
  mutate(weighted_score = score * weight) %>%
  group_by(id_student) %>%
  summarize(mean_weighted_score = mean(weighted_score, na.rm = TRUE))
```
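As a quick check on this new predictor, we can look at its distribution:
```{r}
# the distribution of students' mean weighted scores
summary(assessments_joined$mean_weighted_score)
```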
Last, we will convert the socioeconomic status variable (`imd_band`) into a factor with its levels in a sensible order, again outside the recipe:
```{r}
students <- students %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>%
mutate(imd_band = as.factor(imd_band))
```
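We can confirm that the levels are now in the intended order:
```{r}
# check the order of the imd_band levels
levels(students$imd_band)
```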
Next, we'll load a new file --- one with *interactions* (or log-trace) data that is the most granular data in the OULAD. In the OULAD documentation, this is called the VLE (virtual learning environment) data source. It's a large file---even after we've taken a few steps to reduce its size (namely, only including interactions for the first one-third of the class).
Let's import this dataset:
```{r}
interactions <- read_csv("data/ml/oulad-interactions-filtered.csv") # need to unzip in the repository
```
Let's get a handle on the data by exploring it a bit.
*First*, let's `count()` the `activity_type` variable and *sort* the resulting output by frequency.
```{r}
interactions %>%
count(activity_type, sort = TRUE)
```
We can see there are a range of activities students interact with. You may wish to explore this data in other ways---there are many ways we can include this data in our model, and we are just scratching the surface with how we do so here.
How can we create a feature with `sum_click`? We have *many* options for working with such time series data. Perhaps the simplest is to count the clicks. Let's summarize the number of clicks for each student (specific to a single course). This means we need to group the data by `id_student`, `code_module`, and `code_presentation`, and then create a summary variable.
```{r}
interactions_summarized <- interactions %>%
group_by(id_student, code_module, code_presentation) %>%
summarize(sum_clicks = sum(sum_click))
interactions_summarized
```
How many times did students click? Let's create a histogram to see, using {ggplot2} and `geom_histogram()` to visualize the distribution of the `sum_clicks` variable we just created.
```{r}
interactions_summarized %>%
ggplot(aes(x = sum_clicks)) +
geom_histogram()
```
This is a good start: we've created our first feature based on the log data, `sum_clicks`! What are some other features we can add? An affordance of the `summarize()` function in R is that we can create multiple summary statistics at once. Let's also calculate the standard deviation and the mean of the number of clicks, copying the code from above and adding these two additional summary statistics.
```{r}
interactions_summarized <- interactions %>%
group_by(id_student, code_module, code_presentation) %>%
summarize(sum_clicks = sum(sum_click),
sd_clicks = sd(sum_click),
mean_clicks = mean(sum_click))
```
Let's join together *all* of the data we'll use for our modeling: `students`, `assessments_joined`, and `interactions_summarized`. We use `left_join()` twice, assigning the resulting output the name `students_and_interactions`.
Lots of joining! Sometimes, the hardest part of complex analyses lies in the preparation (and joining) of the data.
```{r}
students_and_interactions <- students %>%
left_join(assessments_joined) %>%
left_join(interactions_summarized)
```
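It can be worth checking what all of this joining produced, such as how many rows we have and how many values are missing for the new predictors (we will impute these in the recipe below):
```{r}
# dimensions of the joined data set
dim(students_and_interactions)

# how many values are missing for the new predictors?
students_and_interactions %>% 
  summarize(across(c(mean_weighted_score, sum_clicks), ~ sum(is.na(.x))))
```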
#### Step 2: Splitting the Data
As suggested above, a key step in supervised machine learning is splitting our data into "training" and "testing" data sets. We use the training data set to train---build---the model. Then we use the test data set to evaluate the performance. More specifically, after training the model, we use that model with the test data: we use the predictor variables to predict the test set outcome (here, passing the class). Since we haven't used the test data set to train the model, we can evaluate how well the model would predict the outcome in data that has not been used before to develop the model. In other words, we can evaluate how well the model would do with new data.
We split the dataset into training and testing sets using an 80-20 split. Generally, a split such as this is appropriate with a larger data set; as your data set becomes smaller, something closer to a 50-50 split may be more appropriate. We also conduct a stratified sample using the outcome variable --- here, `pass`; this is generally a good practice [@boehmke2019hands].
```{r}
set.seed(2025) # as this step involves a random sample, this ensures the same result for pedagogical purposes
# specify the proportion for the split
train_test_split <- initial_split(students_and_interactions, prop = 0.8, strata = "pass")
data_train <- training(train_test_split) # create the training data
data_test <- testing(train_test_split) # create the testing data
```
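If you would like to check the split, you can count the rows in each set and confirm that the outcome is distributed similarly in both (the purpose of stratifying):
```{r}
# size of the training and testing sets
nrow(data_train)
nrow(data_test)

# the outcome should be distributed similarly in each set
data_train %>% count(pass)
data_test %>% count(pass)
```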
#### Step 3: Creating a Recipe for Selected Preprocessing Steps
Here is the recipe step we mentioned earlier. We do two things here.
First, we specify which predictor variables predict the outcome. For those familiar with the `lm()` function in R, this behaves similarly; the outcome goes on the left side of the `~`, and predictors go on the right. As you can see, the data here is the training data.
Second, we use `step_()` functions for preprocessing; each step is described in a comment below.
```{r}
my_rec <- recipe(pass ~ disability + imd_band + mean_weighted_score +
num_of_prev_attempts + gender + region + highest_education +
sum_clicks +
sd_clicks +
mean_clicks,
data = data_train) %>%
# this step is to impute missing values
step_impute_mean(mean_weighted_score, sum_clicks, sd_clicks, mean_clicks) %>%
# same, for categorical/factor variables
step_impute_mode(imd_band) %>%
# center and scale these variables
step_center(mean_weighted_score, num_of_prev_attempts) %>%
step_scale(mean_weighted_score, num_of_prev_attempts) %>%
# dummy code all categorical/factor predictors
step_dummy(all_nominal_predictors(), -all_outcomes())
```
We can inspect the recipe to verify the steps we have specified:
```{r}
my_rec
```
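If you are curious what the preprocessed training data will look like after these steps are applied, you can `prep()` the recipe and then `bake()` it (this is optional; the workflow below takes care of these steps for us):
```{r, eval = FALSE}
# prep() estimates the preprocessing steps from the training data;
# bake(new_data = NULL) returns the preprocessed training data
my_rec %>% 
  prep() %>% 
  bake(new_data = NULL)
```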
#### Step 4: Specifying the Model and Workflow
Next, we specify a logistic regression model and bundle the recipe and model into a workflow.
This step has a lot of pieces, but they are fairly boilerplate. The first part is to specify the model:
```{r}
my_mod <- logistic_reg() %>% # specifies the type of model
set_engine("glm") %>% # specifies the package we use to estimate the model
set_mode("classification") # specifies whether we are classifying a dichotomous, categorical, or factor variable
```
The next is to specify the workflow, which basically stitches all of the parts together --- the recipe and the model we just specified.
```{r}
my_wf <- workflow() %>% # this initiates the workflow
add_recipe(my_rec) %>%
add_model(my_mod)
```
Almost there!
#### Step 5: Fitting the Model
Now we fit the model to the training data. First, we need to specify which _metrics_ we want to calculate---statistics that help us understand how good our model is at making predictions. We do this with `metric_set()`, specifying the familiar accuracy as well as precision (of the students predicted to pass, how many actually did) and recall (of the students who actually passed, how many the model identified).
```{r}
my_metrics <- metric_set(accuracy, precision, recall)
```
Finally, we fit the model. We do this by calling the `last_fit()` function on the workflow and the _split_ specification of the data. We also pass along the metrics we just specified.
We note that there are other fitting functions you might come across, or use. We use `last_fit` as a bit of a handy way of doing the two critical steps at once: fitting the model to the training data, and then using the test data as the basis for evaluating its performance. So, there is a lot happening in this one line of code!
```{r}
final_fit <- last_fit(my_wf, train_test_split, metrics = my_metrics)
```
Other times, you may use what is called _cross-validation_, which is where you split the training data many times and fit the model to each of these splits---only examining the performance of the model with the test data after you have trained your last model. For brevity, we do not do that here; consider reading chapter 2 in [@boehmke2019hands] for more on this.
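If you are curious what that might look like, here is a minimal sketch (not run here) using `vfold_cv()` and `fit_resamples()`:
```{r, eval = FALSE}
# a minimal sketch of 10-fold cross-validation (not run here)
my_folds <- vfold_cv(data_train, v = 10, strata = "pass")

cv_results <- fit_resamples(my_wf, resamples = my_folds, metrics = my_metrics)

collect_metrics(cv_results)
```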
#### Step 6: Evaluating Model Performance
Finally, we evaluate the model’s performance using the test set. The tidymodels package makes this easy:
```{r}
metrics <- final_fit %>%
collect_metrics()
metrics
```
So, how did the model do? Let's focus on accuracy for now: we can see that the model correctly predicted whether students passed around 64% of the time (accuracy is simply the percentage of predictions that were correct).
But the model seemed to perform differently for students who passed versus those who did not. We can see this by looking at the precision and recall metrics: the recall value of .925 tells us that when students passed the course, the model correctly predicted they did so around 92% of the time. The precision, though, tells us that when the model predicted a student passed the course, it was correct around 65% of the time, meaning that the model regularly made _false positive_ predictions. Herein lies the value of metrics other than accuracy: they can help us understand how the model is performing for different outcomes. False positives or false negatives may matter more or less depending on the context, and your knowledge as the analyst and researcher is critical here for determining whether the model is "good enough" for your purposes.
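One way to see those false positives directly is with a confusion matrix, built from the model's individual predictions via `collect_predictions()`:
```{r}
# individual predictions for the test set
test_predictions <- final_fit %>% 
  collect_predictions()

# rows are the model's predictions; columns are what actually happened
test_predictions %>% 
  conf_mat(truth = pass, estimate = .pred_class)
```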
On that note, how could we improve the model? One affordance of tidymodels is that we can readily switch out the model and engine. Try one of the modifications below (a random forest and a boosted tree model, respectively), and then re-run the subsequent steps to see how much the predictive performance improves. We'll leave that to you to figure out!
```{r, eval = FALSE}
my_mod <- rand_forest() %>%
set_engine("ranger") %>% # install.packages("ranger") needed first
set_mode("classification")
my_mod <- boost_tree() %>% # install.packages("xgboost") needed first
set_engine("xgboost") %>%
set_mode("classification")
```
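Note that after swapping in a new model specification, the workflow needs to be re-created and the model re-fit; a sketch of that re-run might look like this:
```{r, eval = FALSE}
# re-create the workflow with the new model, then re-fit and evaluate it
my_wf <- workflow() %>% 
  add_recipe(my_rec) %>% 
  add_model(my_mod)

final_fit <- last_fit(my_wf, train_test_split, metrics = my_metrics)

final_fit %>% 
  collect_metrics()
```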
## Conclusion
Though we focus on this relatively simple model, or algorithm, many of the ideas explored in this chapter will likely extend and prove useful for other machine learning methods. Our goal is for you to finish this final walkthrough with the confidence to explore using machine learning to answer a question or to solve a problem of your own with respect to teaching, learning, and educational systems.
In this chapter, we introduced general machine learning ideas, like training and test datasets and evaluating the importance of specific variables, in the context of predicting students' passing a course. Like many of the topics in this book---but, perhaps *particularly* so for machine learning---there is much more to discover on the topic, and we encourage you to consult the books and resources in the [Learning More chapter](#c17) to learn about further applications of machine learning methods.