
SuperLearner underperforming best learner? #154

Open
DanielPark-MGH opened this issue Nov 13, 2024 · 5 comments

Comments

@DanielPark-MGH
Hi, I'm getting a result that seems unintuitive. Is this a plausible result, or could there be a bug?

I trained a CV.SuperLearner:

num_cores <- RhpcBLASctl::get_num_cores()
num_folds_cvSL <- 10
options(mc.cores = num_cores - 1)
set.seed(1, "L'Ecuyer-CMRG")

# enet <- create.Learner()

sl_lib <- c(
  "SL.mean",
  "SL.lm",
  "SL.glmnet",
  "SL.ranger"
  )

cv_sl <- CV.SuperLearner(
  Y = y.train,
  X = X.train,
  obsWeights = wts_obs,
  cvControl = list(
    V = num_folds_cvSL
    ),
  parallel = "multicore",
  family = gaussian(),
  SL.library = sl_lib,
  verbose = TRUE
  )

Then summary(cv_sl) shows:

> summary(cv_sl)

Call:  
CV.SuperLearner(Y = y.train, X = X.train, family = gaussian(), SL.library = sl_lib, verbose = TRUE,  
    cvControl = list(V = num_folds_cvSL), obsWeights = wts_obs, parallel = "multicore") 

Risk is based on: Mean Squared Error

All risk estimates are based on V =  10 

     Algorithm       Ave         se      Min       Max
 Super Learner 0.0751603 1.3361e-04 0.070365 0.0799643
   Discrete SL 0.2240814 3.1295e-04 0.220039 0.2296513
   SL.mean_All 0.2473175 1.4485e-05 0.241507 0.2526675
     SL.lm_All 0.2246898 3.4062e-04 0.220455 0.2304481
 SL.glmnet_All 0.2240814 3.1295e-04 0.220039 0.2296513
 SL.ranger_All 0.0080468 1.5909e-04 0.007421 0.0087511

However, this doesn't seem to agree with the results of

library(magrittr)  # for the %>% pipe

lapply(
  cv_sl$AllSL,
  function(sl) {sl$cvRisk}
  ) %>% 
  do.call(rbind, .) %>%
  colMeans(.)
SL.mean_All     SL.lm_All SL.glmnet_All SL.ranger_All 
    0.2474629     0.2231949     0.2219471     0.4422828

Question 1: How can the average risk of SL.ranger_All be 0.008 in summary(cv_sl) when the empirical average across all the Super Learners in cv_sl is 0.442?
Question 2: If the average risk of SL.ranger_All really is 0.008, why does the Super Learner from cv_sl have a higher average risk than its supposedly best-performing learner?

@ecpolley
Owner

That does look a bit odd, but can you provide a few additional details? For example, I noticed you have weights (wts_obs), so instead of colMeans() you'll need to use something like weighted.mean(). With that in mind, what is the distribution of the weights? And what is the distribution of the outcome: does it have outliers? Can you also share cv_sl$whichDiscreteSL and cv_sl$coef?
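For illustration, a minimal sketch (toy numbers, not the objects from this issue) of how much weighted.mean() can diverge from a plain mean() when the weights are bimodal:

```r
# Toy illustration: the same squared errors averaged with and without
# bimodal observation weights. All objects here are made up.
err2 <- c(0.2, 0.01, 0.5, 0.02)   # per-observation squared errors
w    <- c(97.7, 0.5, 97.7, 0.5)   # bimodal weights, echoing this thread

mean(err2)                # unweighted: 0.1825
weighted.mean(err2, w)    # weighted: ~0.348, dominated by the upweighted rows
```

With weights this skewed, a handful of heavily upweighted observations can drive the risk estimate almost entirely.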

@DanielPark-MGH
Author

Sure @ecpolley. This is the distribution of the observation weights; there are two extremely imbalanced classes.

> table(wts_obs) / length(wts_obs)
wts_obs
0.502571330206137  97.7259414225941 
      0.994883651       0.005116349

This is the distribution of the outcome. I'm predicting the residuals from a regression decomposition, but the original outcome was binary {0, 1}.

> summary(y.train)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-0.0223121 -0.0079120 -0.0033595  0.0000077 -0.0011750  0.9996352

Here's some more info from the cv_sl object:

> table(simplify2array(cv_sl$whichDiscreteSL))

SL.glmnet_All 
           10 

> cv_sl$coef
   SL.mean_All SL.lm_All SL.glmnet_All SL.ranger_All
1   0.04736922         0     0.4753922     0.4772386
2   0.05113086         0     0.5053073     0.4435618
3   0.04805716         0     0.5086328     0.4433101
4   0.04778061         0     0.5032869     0.4489325
5   0.02751555         0     0.5285373     0.4439472
6   0.03267183         0     0.5001860     0.4671422
7   0.04865446         0     0.4960450     0.4553005
8   0.03250135         0     0.5108070     0.4566917
9   0.05804477         0     0.4880781     0.4538771
10  0.03789261         0     0.4847758     0.4773316

So SL.glmnet was selected as the discrete Super Learner in all 10 external CV iterations, right? That's why I'm confused about the SL.ranger average risk.

Regarding your comment about colMeans() vs weighted.mean(), I thought I could take the simple average of each learner's risk across external CV iterations?

@ecpolley
Owner

No, the risk is estimated as the weighted average of the loss function. If you set the weights to 1 for everyone, do you still get an odd result? I suspect something is going on with the weights being so bimodal and disparate.
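As a quick sanity check of that point (again toy numbers, not the thread's objects): with unit weights the weighted average loss reduces exactly to the simple average, which is why rerunning without obsWeights is a clean comparison.

```r
# With weights of 1 for everyone, the weighted average loss equals the
# plain average, so any remaining oddity cannot be due to the weights.
err2 <- c(0.2, 0.01, 0.5, 0.02)   # toy squared errors
stopifnot(isTRUE(all.equal(weighted.mean(err2, rep(1, 4)), mean(err2))))
```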

@DanielPark-MGH
Author

I retrained the CV.SuperLearner without the obsWeights argument. These summaries look more like what I would expect.

> summary(cv_sl)

Call:  
CV.SuperLearner(Y = y.train, X = X.train, family = gaussian(), SL.library = sl_lib, verbose = TRUE,  
    cvControl = list(V = num_folds_cvSL), parallel = "multicore") 

Risk is based on: Mean Squared Error

All risk estimates are based on V =  10 

     Algorithm       Ave         se       Min       Max
 Super Learner 0.0050505 0.00016085 0.0040707 0.0061188
   Discrete SL 0.0050523 0.00016100 0.0040711 0.0061221
   SL.mean_All 0.0050623 0.00016162 0.0040797 0.0061381
     SL.lm_All 0.0050527 0.00016095 0.0040717 0.0061215
 SL.glmnet_All 0.0050523 0.00016100 0.0040711 0.0061221
 SL.ranger_All 0.0051483 0.00015971 0.0041885 0.0062019

> table(simplify2array(cv_sl$whichDiscreteSL))

SL.glmnet_All 
           10 

> cv_sl$coef
   SL.mean_All  SL.lm_All SL.glmnet_All SL.ranger_All
1            0 0.17401399     0.7316625    0.09432355
2            0 0.12414772     0.7573379    0.11851441
3            0 0.08320617     0.8036499    0.11314388
4            0 0.00000000     0.9186788    0.08132120
5            0 0.12820404     0.7721348    0.09966117
6            0 0.00000000     0.8957660    0.10423398
7            0 0.00000000     0.8815002    0.11849980
8            0 0.00000000     0.9239722    0.07602782
9            0 0.00000000     0.8954353    0.10456469
10           0 0.00000000     0.8819275    0.11807253

@DanielPark-MGH
Author

@ecpolley I'll take a look at the source code to see how the obsWeights parameter is used; it may explain the summary in my first post.
