reduce the object size by trimming the split elements #930

topepo · 2024-08-15T00:27:02Z

I had a conversation at conf with someone who mentioned an issue I’ve had.

When you have a large data set or a workflow set with many different workflows, the resulting object can be very large in memory and on disk. Even though the tune_results object keeps only one copy of the original data, that can still be excessive (especially in a workflow map, where there is one tune_results object per workflow).

I want to test out an option in our control functions called trim_split (or similar) that replaces the data slot in the split objects with a zero-row slice and also sets the integer indices to integer(0). That should significantly reduce the size (barring a lot of out-of-sample predictions that might be saved). The split column stays a split column, and no classes are dropped from it (or from the tune_results object).
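A minimal sketch of what the trimming step could look like, to make the idea concrete. The helper name `trim_one_split()` is hypothetical (it is not part of rsample or tune), and the sketch assumes that an rsplit is a classed list with `data`, `in_id`, and `out_id` elements:

```r
library(rsample)

# Hypothetical helper, not an rsample or tune function: strip the data and
# the row indices from one rsplit while keeping its classes and other fields.
trim_one_split <- function(split) {
  # Zero-row slice: column names and types are kept, the rows are dropped
  split$data <- split$data[0, , drop = FALSE]
  split$in_id <- integer(0)
  # Some split types store NA in `out_id`; leave that marker alone
  if (!all(is.na(split$out_id))) {
    split$out_id <- integer(0)
  }
  split
}

set.seed(6735)
folds <- vfold_cv(mtcars, v = 5)

trimmed_splits <- lapply(folds$splits, trim_one_split)

class(trimmed_splits[[1]])       # classes are unchanged
nrow(trimmed_splits[[1]]$data)   # no rows left
names(trimmed_splits[[1]]$data)  # columns are still there
```

Whether something like this should live in rsample or in tune is the open question; the sketch only shows that the split keeps its structure and class after trimming.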

This means that users would be unable to do anything meaningful with the split objects, but it is very unlikely that they would want to. Also, since the results object copies the original rset, they could undo this by replacing the altered split column with the one from the rset.
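The workaround could look roughly like this. This is only a sketch: a plain data frame named `res` stands in for the tune_results object, and the splits are trimmed by hand so that the example does not rely on the proposed option. The key step is the last assignment, which matches on the `id` column of the original rset:

```r
library(rsample)

set.seed(6735)
folds <- vfold_cv(mtcars, v = 5)

# Stand-in for a results object whose splits were trimmed (assumption:
# the real object has the same `id` and `splits` columns as the rset)
res <- data.frame(id = folds$id)
res$splits <- lapply(folds$splits, function(s) {
  s$data <- s$data[0, , drop = FALSE]
  s$in_id <- integer(0)
  s
})

# Restore the full splits from the original rset, matching rows by id
res$splits <- folds$splits[match(res$id, folds$id)]

# The restored splits can be used as usual again
nrow(analysis(res$splits[[1]]))
nrow(assessment(res$splits[[1]]))
```

Matching on `id` rather than on row position keeps this safe if the rows of the results were reordered. It does require that the original rset is still available, for example because it was saved separately.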

I don't see much downside.

Should the code to clean the split objects go into rsample?


topepo · 2024-08-15T04:32:38Z

A quick example from an initial implementation:

library(tidymodels)

set.seed(6735)
folds <- vfold_cv(mtcars, v = 5)

spline_rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_ns(disp) %>%
  step_ns(wt)

lin_mod <- linear_reg() %>%
  set_engine("lm")

control <- control_resamples(save_pred = TRUE, trim_splits = TRUE, save_workflow = TRUE)

spline_res <- fit_resamples(lin_mod, spline_rec, folds, control = control)

spline_res
#> # Resampling results
#> # 5-fold cross-validation 
#> # A tibble: 5 × 5
#>   splits        id    .metrics         .notes           .predictions    
#>   <list>        <chr> <list>           <list>           <list>          
#> 1 <split [0/0]> Fold1 <tibble [2 × 4]> <tibble [0 × 4]> <tibble [7 × 4]>
#> 2 <split [0/0]> Fold2 <tibble [2 × 4]> <tibble [0 × 4]> <tibble [7 × 4]>
#> 3 <split [0/0]> Fold3 <tibble [2 × 4]> <tibble [0 × 4]> <tibble [6 × 4]>
#> 4 <split [0/0]> Fold4 <tibble [2 × 4]> <tibble [0 × 4]> <tibble [6 × 4]>
#> 5 <split [0/0]> Fold5 <tibble [2 × 4]> <tibble [0 × 4]> <tibble [6 × 4]>

# etc etc
collect_metrics(spline_res)
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   3.11      5   0.168 Preprocessor1_Model1
#> 2 rsq     standard   0.651     5   0.135 Preprocessor1_Model1

# However: 

fit_best(spline_res)
#> Error in `was_split_trimmed()` at tune/R/fit_best.R:148:3:
#> ! The split contains no `data` object. Was `trim_splits` set to `TRUE`
#>   in the control function?

Created on 2024-08-14 with reprex v2.1.0

jrosell · 2024-10-09T08:31:18Z

In fact, when saving tune results, I sometimes want to be able to reconstruct the split elements as a resample again. See #947.

So I feel there are two cases: sometimes the splits need to be trimmed, and sometimes one wants to reuse them.

topepo added the "feature" label (a feature request or enhancement) on Aug 15, 2024
