Multi-class Classification #16
Comments
I haven't been working on adding multi-class classification capabilities to the existing code. In practice, I often split the multi-class problem into a collection of binary classification problems. Say you have 3 classes (A, B, and C): you could fit binary classifiers for A vs. B or C, B vs. A or C, and C vs. A or B, then combine the results to make a classification based on the highest probability estimate. The probability estimates are not correct because they are not constrained to sum to 1, but the approach does allow flexibility in the classifier for the different categories. Here is a quick example:
## multi-class classification
library(SuperLearner)
set.seed(843)
N <- 100
# outcome
Y <- sample(c("A", "B", "C"), size = N, replace = TRUE, prob = c(.1, .5, .4))
# variables
X1 <- rnorm(n = N, mean = (as.numeric(Y == "A") + .5*(as.numeric(Y == "C"))), sd = 1)
X2 <- rnorm(n = N, mean = (as.numeric(Y == "B")), sd = 1)
X3 <- rnorm(n = N, mean = (-1*as.numeric(Y == "B" | Y == "C")), sd = 1)
X4 <- rnorm(n = N, mean = X2, sd = 1)
X5 <- rnorm(n = N, mean = (X1*as.numeric(Y == "A") + as.numeric(Y == "A" | Y == "C")), sd = 1)
DAT <- data.frame(X1, X2, X3, X4, X5)
# test Data
# outcome
M <- 10000
Y_test <- sample(c("A", "B", "C"), size = M, replace = TRUE, prob = c(.1, .5, .4))
# variables
X1_test <- rnorm(n = M, mean = (as.numeric(Y_test == "A") + .5*(as.numeric(Y_test == "C"))), sd = 1)
X2_test <- rnorm(n = M, mean = (as.numeric(Y_test == "B")), sd = 1)
X3_test <- rnorm(n = M, mean = (-1*as.numeric(Y_test == "B" | Y_test == "C")), sd = 1)
X4_test <- rnorm(n = M, mean = X2_test, sd = 1)
X5_test <- rnorm(n = M, mean = (X1_test*as.numeric(Y_test == "A") + as.numeric(Y_test == "A" | Y_test == "C")), sd = 1)
DAT_test <- data.frame(X1 = X1_test, X2 = X2_test, X3 = X3_test, X4 = X4_test, X5 = X5_test)
# figure
# library(GGally)
# DAT2 <- data.frame(Y, DAT)
# ggpairs(DAT2, color = "Y")
# create the 3 binary variables
Y_A <- as.numeric(Y == "A")
Y_B <- as.numeric(Y == "B")
Y_C <- as.numeric(Y == "C")
# simple library, should include more classifiers
SL.library <- c("SL.gbm", "SL.glmnet", "SL.glm", "SL.knn", "SL.gam", "SL.mean")
# least squares loss function
fit_A <- SuperLearner(Y = Y_A, X = DAT, newX = DAT_test, SL.library = SL.library, verbose = FALSE, method = "method.NNLS", family = binomial(), cvControl = list(stratifyCV = TRUE))
fit_B <- SuperLearner(Y = Y_B, X = DAT, newX = DAT_test, SL.library = SL.library, verbose = FALSE, method = "method.NNLS", family = binomial(), cvControl = list(stratifyCV = TRUE))
fit_C <- SuperLearner(Y = Y_C, X = DAT, newX = DAT_test, SL.library = SL.library, verbose = FALSE, method = "method.NNLS", family = binomial(), cvControl = list(stratifyCV = TRUE))
SL_pred <- data.frame(pred_A = fit_A$SL.predict[, 1], pred_B = fit_B$SL.predict[, 1], pred_C = fit_C$SL.predict[, 1])
Classify <- apply(SL_pred, 1, function(xx) c("A", "B", "C")[unname(which.max(xx))])
table(Classify, Y_test)
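As noted above, the three one-vs-rest estimates are not constrained to sum to 1. If a proper probability vector is needed, one simple option is to rescale each row by its sum. This is only a rough sketch: the numbers below are hypothetical stand-ins for the columns of `SL_pred`, and rescaling does not by itself make the estimates well calibrated.

```r
# Hypothetical one-vs-rest probabilities for four observations (rows);
# in the example above these would be the columns of SL_pred.
ovr_pred <- cbind(
  pred_A = c(0.20, 0.05, 0.60, 0.10),
  pred_B = c(0.70, 0.30, 0.30, 0.40),
  pred_C = c(0.30, 0.50, 0.40, 0.50)
)

# Rescale each row so the three estimates sum to 1.
ovr_norm <- ovr_pred / rowSums(ovr_pred)

# Dividing a row by a positive constant does not change which column is
# largest, so the predicted class is the same before and after rescaling.
classes <- c("A", "B", "C")
class_raw  <- classes[max.col(ovr_pred, ties.method = "first")]
class_norm <- classes[max.col(ovr_norm, ties.method = "first")]
```

Because the rescaling is applied row by row, the `table(Classify, Y_test)` comparison is unaffected; only the reported probabilities change.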
Multi-class classification is something I have thought about adding. A reasonable way to implement this is using multiple response linear regression (MLR). Details are in this paper: https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume10/ting99a.pdf
You should be able to optimize the weights of the different models given the multi-class log loss function, right?
Yes, if each base learner in the library outputs a vector of predicted probabilities for the classes, you could define a convex combination of the predicted probabilities by minimizing the V-fold cross-validated multi-class log loss estimate. Can you suggest some examples of base learners that return probability vectors?
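To make the convex-combination idea concrete, here is a minimal base-R sketch. It is not part of SuperLearner: the matrices in `Z` are simulated stand-ins for cross-validated class probabilities, and `make_pred`, `log_loss`, and `softmax` are hypothetical helper names. The softmax reparameterization is just one convenient way to keep the weights non-negative and summing to 1.

```r
set.seed(1)
n <- 300; K <- 3; L <- 2   # observations, classes, base learners
y <- sample(seq_len(K), n, replace = TRUE)
idx <- cbind(seq_len(n), y)  # (row, true class) index pairs

# Simulated stand-ins for cross-validated class probabilities (n x K per
# learner). A larger `signal` puts more mass on the true class.
make_pred <- function(signal) {
  p <- matrix(runif(n * K), n, K)
  p[idx] <- p[idx] + signal
  p / rowSums(p)
}
Z <- list(make_pred(2), make_pred(0.1))

# Multi-class log loss of the weighted average of the learners' probabilities.
log_loss <- function(w) {
  p <- Reduce(`+`, Map(`*`, Z, w))
  -mean(log(pmax(p[idx], 1e-15)))
}

# Softmax maps unconstrained parameters to weights on the simplex, so an
# unconstrained optimizer can be used.
softmax <- function(a) exp(a - max(a)) / sum(exp(a - max(a)))
opt <- optim(rep(0, L), function(a) log_loss(softmax(a)), method = "BFGS")
w_hat <- softmax(opt$par)
```

In a real implementation, each element of `Z` would hold the held-out fold predictions from one base learner, and `w_hat` would then be used to combine the learners' predictions on new data.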
Sorry for the long delay. Here are a couple. randomForest is probably easiest.
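For instance, a short sketch with randomForest, assuming the package is installed and using the built-in `iris` data purely for illustration. It shows only that the learner returns a probability vector per observation, not how it would be wrapped for SuperLearner.

```r
library(randomForest)

set.seed(843)
# iris has a three-level factor outcome, so randomForest fits a classifier.
train_idx <- sample(nrow(iris), 100)
rf_fit <- randomForest(Species ~ ., data = iris[train_idx, ], ntree = 200)

# type = "prob" returns one column of class probabilities per factor level.
rf_prob <- predict(rf_fit, newdata = iris[-train_idx, ], type = "prob")
head(rf_prob)
```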
Polymars was also designed specifically for multiple classification (http://projecteuclid.org/euclid.aos/1031594728, part 6 on "polyclass").
Looks like the code for a bunch of wrappers already exists! We just need to integrate it:
Was this ever implemented? I keep running into errors when trying it out with SL.glmnet. The links ck37 posted are unfortunately down. |
Same question; would be great if this were an option |
Is there any work on adding multi-class classification capabilities? Maybe we could start something with gbm.