
SuperLearner underperforming best learner? #154

Open
DanielPark-MGH opened this issue Nov 13, 2024 · 5 comments

Comments

@DanielPark-MGH
Hi, I'm getting a result that seems unintuitive. Is this a plausible result, or could there be a bug?

I trained a CV.SuperLearner:

num_cores <- RhpcBLASctl::get_num_cores()
num_folds_cvSL <- 10
options(mc.cores = num_cores - 1)
set.seed(1, "L'Ecuyer-CMRG")

# enet <- create.Learner()

sl_lib <- c(
  "SL.mean",
  "SL.lm",
  "SL.glmnet",
  "SL.ranger"
  )

cv_sl <- CV.SuperLearner(
  Y = y.train,
  X = X.train,
  obsWeights = wts_obs,
  cvControl = list(
    V = num_folds_cvSL
    ),
  parallel = "multicore",
  family = gaussian(),
  SL.library = sl_lib,
  verbose = TRUE
  )

Then summary(cv_sl) shows:

> summary(cv_sl)

Call:  
CV.SuperLearner(Y = y.train, X = X.train, family = gaussian(), SL.library = sl_lib, verbose = TRUE,  
    cvControl = list(V = num_folds_cvSL), obsWeights = wts_obs, parallel = "multicore") 

Risk is based on: Mean Squared Error

All risk estimates are based on V =  10 

     Algorithm       Ave         se      Min       Max
 Super Learner 0.0751603 1.3361e-04 0.070365 0.0799643
   Discrete SL 0.2240814 3.1295e-04 0.220039 0.2296513
   SL.mean_All 0.2473175 1.4485e-05 0.241507 0.2526675
     SL.lm_All 0.2246898 3.4062e-04 0.220455 0.2304481
 SL.glmnet_All 0.2240814 3.1295e-04 0.220039 0.2296513
 SL.ranger_All 0.0080468 1.5909e-04 0.007421 0.0087511

However, this doesn't seem to agree with the results of

library(magrittr)  # for the %>% pipe

lapply(
  cv_sl$AllSL,
  function(sl) {sl$cvRisk}
  ) %>% 
  do.call(rbind, .) %>%
  colMeans(.)
SL.mean_All     SL.lm_All SL.glmnet_All SL.ranger_All 
    0.2474629     0.2231949     0.2219471     0.4422828

Question 1: How can the average risk of SL.ranger_All be 0.008 in summary(cv_sl) when the empirical average across all the Super Learners in cv_sl is 0.442?
Question 2: If the average risk of SL.ranger_All really is 0.008, why does the Super Learner from cv_sl have a higher average risk than its supposedly best-performing learner?

@ecpolley
Owner

That does look a bit odd, but can you provide a few additional details? For example, I noticed you have weights (wts_obs), so instead of colMeans() you'll need to use something like weighted.mean(). With that in mind, what is the distribution of the weights? And what is the distribution of the outcome: does it have outliers? Can you also share cv_sl$whichDiscreteSL and cv_sl$coef?
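For illustration, a minimal sketch (toy numbers, not the objects from this issue) of how much weighted.mean() can diverge from a plain mean() when the weights are bimodal:

```r
# Toy illustration: the same squared errors averaged with and without
# bimodal observation weights. All objects here are made up.
err2 <- c(0.2, 0.01, 0.5, 0.02)   # per-observation squared errors
w    <- c(97.7, 0.5, 97.7, 0.5)   # bimodal weights, echoing this thread

mean(err2)                # unweighted: 0.1825
weighted.mean(err2, w)    # weighted: ~0.348, dominated by the upweighted rows
```

With weights this skewed, a handful of heavily upweighted observations can drive the risk estimate almost entirely.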

@DanielPark-MGH
Author

Sure @ecpolley. This is the distribution of the observation weights; there are two extremely imbalanced classes.

> table(wts_obs) / length(wts_obs)
wts_obs
0.502571330206137  97.7259414225941 
      0.994883651       0.005116349

This is the distribution of the outcome. I'm predicting the residuals from a regression decomposition, but the original outcome was binary {0, 1}.

> summary(y.train)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-0.0223121 -0.0079120 -0.0033595  0.0000077 -0.0011750  0.9996352

Here's some more info from the cv_sl object:

> table(simplify2array(cv_sl$whichDiscreteSL))

SL.glmnet_All 
           10 

> cv_sl$coef
   SL.mean_All SL.lm_All SL.glmnet_All SL.ranger_All
1   0.04736922         0     0.4753922     0.4772386
2   0.05113086         0     0.5053073     0.4435618
3   0.04805716         0     0.5086328     0.4433101
4   0.04778061         0     0.5032869     0.4489325
5   0.02751555         0     0.5285373     0.4439472
6   0.03267183         0     0.5001860     0.4671422
7   0.04865446         0     0.4960450     0.4553005
8   0.03250135         0     0.5108070     0.4566917
9   0.05804477         0     0.4880781     0.4538771
10  0.03789261         0     0.4847758     0.4773316

So SL.glmnet was selected as the discrete Super Learner in all 10 external CV iterations, right? That's why I'm confused about the SL.ranger average risk.

Regarding your comment about colMeans() vs weighted.mean(), I thought I could take the simple average of each learner's risk across external CV iterations?

@ecpolley
Owner

No, the risk is estimated as the weighted average of the loss function. If you set the weights to 1 for everyone, do you still get an odd result? I suspect something is going on with the weights being so bimodal and disparate.
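As a quick sanity check of that point (again toy numbers, not the thread's objects): with unit weights the weighted average loss reduces exactly to the simple average, which is why rerunning without obsWeights is a clean comparison.

```r
# With weights of 1 for everyone, the weighted average loss equals the
# plain average, so any remaining oddity cannot be due to the weights.
err2 <- c(0.2, 0.01, 0.5, 0.02)   # toy squared errors
stopifnot(isTRUE(all.equal(weighted.mean(err2, rep(1, 4)), mean(err2))))
```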

@DanielPark-MGH
Author

I retrained the CV.SuperLearner without the obsWeights argument. These summaries look more like what I would expect.

> summary(cv_sl)

Call:  
CV.SuperLearner(Y = y.train, X = X.train, family = gaussian(), SL.library = sl_lib, verbose = TRUE,  
    cvControl = list(V = num_folds_cvSL), parallel = "multicore") 

Risk is based on: Mean Squared Error

All risk estimates are based on V =  10 

     Algorithm       Ave         se       Min       Max
 Super Learner 0.0050505 0.00016085 0.0040707 0.0061188
   Discrete SL 0.0050523 0.00016100 0.0040711 0.0061221
   SL.mean_All 0.0050623 0.00016162 0.0040797 0.0061381
     SL.lm_All 0.0050527 0.00016095 0.0040717 0.0061215
 SL.glmnet_All 0.0050523 0.00016100 0.0040711 0.0061221
 SL.ranger_All 0.0051483 0.00015971 0.0041885 0.0062019

> table(simplify2array(cv_sl$whichDiscreteSL))

SL.glmnet_All 
           10 

> cv_sl$coef
   SL.mean_All  SL.lm_All SL.glmnet_All SL.ranger_All
1            0 0.17401399     0.7316625    0.09432355
2            0 0.12414772     0.7573379    0.11851441
3            0 0.08320617     0.8036499    0.11314388
4            0 0.00000000     0.9186788    0.08132120
5            0 0.12820404     0.7721348    0.09966117
6            0 0.00000000     0.8957660    0.10423398
7            0 0.00000000     0.8815002    0.11849980
8            0 0.00000000     0.9239722    0.07602782
9            0 0.00000000     0.8954353    0.10456469
10           0 0.00000000     0.8819275    0.11807253

@DanielPark-MGH
Author

@ecpolley I'll take a look at the source code to see how the obsWeights parameter is used; it may explain the summary in my first post.
