Missing values in importance weights #150

Open
koenniem opened this issue Apr 15, 2024 · 1 comment
Labels
Question Question about behaviour of package


koenniem commented Apr 15, 2024

The problem

I'm working with a dataset where I use importance weights to specify the misclassification costs of instances. Because the target class in the dataset is severely imbalanced, I would like to use a resampling method (e.g. SMOTE) to mitigate this. However, step_smote() and friends do not impute the importance weights for the rows they generate, and I cannot impute them afterwards because the step_impute_*() steps do not allow it.

I can understand that generating new weights should not be the default, as this might lead to unexpected behaviour, but I do not see why the algorithms in this package could not do this at all.

Reproducible example

Here is an example using hpc_data:

library(tidymodels)
library(themis)

# First, get rid of the nominal predictors as these cannot be used by `step_smote`
hpc_data <- select(hpc_data, -c(protocol, day))

# Now specify the importance weights, for example input_fields
hpc_data <- mutate(hpc_data, input_fields = importance_weights(input_fields))

# Specify a simple recipe to use with `step_smote`
rec <- recipe(class ~ ., data = hpc_data) |> 
    step_smote(class) 

# Now prep and bake as training data to see the result
rec |> 
    prep() |> 
    bake(NULL)
#> # A tibble: 8,844 × 6
#>    compounds input_fields iterations num_pending  hour class
#>        <dbl>    <imp_wts>      <dbl>       <dbl> <dbl> <fct>
#>  1       997          137         20           0  14   F    
#>  2        97          103         20           0  13.8 VF   
#>  3       101           75         10           0  13.8 VF   
#>  4        93           76         20           0  10.1 VF   
#>  5       100           82         20           0  10.4 VF   
#>  6       100           82         20           0  16.5 VF   
#>  7       105           88         20           0  16.4 VF   
#>  8        98           95         20           0  16.7 VF   
#>  9       101           91         20           0  16.2 VF   
#> 10        95           92         20           0  10.8 VF   
#> # ℹ 8,834 more rows

# This would leave us with 8844 rows, but there are many missing values in input_fields
rec |> 
    prep() |> 
    bake(NULL) |> 
    drop_na(input_fields) # Only 4331 rows left, the same number as in the original dataset
#> # A tibble: 4,331 × 6
#>    compounds input_fields iterations num_pending  hour class
#>        <dbl>    <imp_wts>      <dbl>       <dbl> <dbl> <fct>
#>  1       997          137         20           0  14   F    
#>  2        97          103         20           0  13.8 VF   
#>  3       101           75         10           0  13.8 VF   
#>  4        93           76         20           0  10.1 VF   
#>  5       100           82         20           0  10.4 VF   
#>  6       100           82         20           0  16.5 VF   
#>  7       105           88         20           0  16.4 VF   
#>  8        98           95         20           0  16.7 VF   
#>  9       101           91         20           0  16.2 VF   
#> 10        95           92         20           0  10.8 VF   
#> # ℹ 4,321 more rows

# On the other hand, `step_upsample()` does work
rec <- recipe(class ~ ., data = hpc_data) |> 
    step_upsample(class)

rec |> 
    prep() |> 
    bake(NULL) |> 
    drop_na(input_fields)
#> # A tibble: 8,844 × 6
#>    compounds input_fields iterations num_pending  hour class
#>        <dbl>    <imp_wts>      <dbl>       <dbl> <dbl> <fct>
#>  1        97          103         20           0 13.8  VF   
#>  2       101           75         10           0 13.8  VF   
#>  3        93           76         20           0 10.1  VF   
#>  4       100           82         20           0 10.4  VF   
#>  5       100           82         20           0 16.5  VF   
#>  6       105           88         20           0 16.4  VF   
#>  7        98           95         20           0 16.7  VF   
#>  8       101           91         20           0 16.2  VF   
#>  9        95           92         20           0 10.8  VF   
#> 10       102           96         20           0  9.97 VF   
#> # ℹ 8,834 more rows

Created on 2024-04-15 with reprex v2.1.0

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.3 (2024-02-29 ucrt)
#>  os       Windows 10 x64 (build 19045)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United Kingdom.utf8
#>  ctype    English_United Kingdom.utf8
#>  tz       Europe/Brussels
#>  date     2024-04-15
#>  pandoc   3.1.1 @ C:/Workdir/MyApps/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  backports      1.4.1      2021-12-13 [1] CRAN (R 4.1.2)
#>  broom        * 1.0.5      2023-06-09 [1] CRAN (R 4.3.1)
#>  class          7.3-22     2023-05-03 [1] CRAN (R 4.3.0)
#>  cli            3.6.2      2023-12-11 [1] CRAN (R 4.3.2)
#>  codetools      0.2-19     2023-02-01 [1] CRAN (R 4.2.2)
#>  colorspace     2.1-0      2023-01-23 [1] CRAN (R 4.2.2)
#>  data.table     1.15.0     2024-01-30 [1] CRAN (R 4.3.2)
#>  dials        * 1.2.1      2024-02-22 [1] CRAN (R 4.3.2)
#>  DiceDesign     1.10       2023-12-07 [1] CRAN (R 4.3.2)
#>  digest         0.6.34     2024-01-11 [1] CRAN (R 4.3.2)
#>  dplyr        * 1.1.4      2023-11-17 [1] CRAN (R 4.3.2)
#>  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.0.5)
#>  evaluate       0.23       2023-11-01 [1] CRAN (R 4.3.2)
#>  fansi          1.0.6      2023-12-08 [1] CRAN (R 4.3.2)
#>  fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.2.2)
#>  foreach        1.5.2      2022-02-02 [1] CRAN (R 4.1.3)
#>  fs             1.6.3      2023-07-20 [1] CRAN (R 4.3.1)
#>  furrr          0.3.1      2022-08-15 [1] CRAN (R 4.2.1)
#>  future         1.33.1     2023-12-22 [1] CRAN (R 4.3.2)
#>  future.apply   1.11.1     2023-12-21 [1] CRAN (R 4.3.2)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.2.1)
#>  ggplot2      * 3.5.0      2024-02-23 [1] CRAN (R 4.3.2)
#>  globals        0.16.2     2022-11-21 [1] CRAN (R 4.2.2)
#>  glue           1.7.0      2024-01-09 [1] CRAN (R 4.3.2)
#>  gower          1.0.1      2022-12-22 [1] CRAN (R 4.2.2)
#>  GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.0.0)
#>  gtable         0.3.4      2023-08-21 [1] CRAN (R 4.3.1)
#>  hardhat        1.3.1      2024-02-02 [1] CRAN (R 4.3.2)
#>  htmltools      0.5.7      2023-11-03 [1] CRAN (R 4.3.2)
#>  infer        * 1.0.6      2024-01-31 [1] CRAN (R 4.3.2)
#>  ipred          0.9-14     2023-03-09 [1] CRAN (R 4.2.2)
#>  iterators      1.0.14     2022-02-05 [1] CRAN (R 4.1.3)
#>  knitr          1.45       2023-10-30 [1] CRAN (R 4.3.2)
#>  lattice        0.22-5     2023-10-24 [1] CRAN (R 4.3.2)
#>  lava           1.7.3      2023-11-04 [1] CRAN (R 4.3.2)
#>  lhs            1.1.6      2022-12-17 [1] CRAN (R 4.2.2)
#>  lifecycle      1.0.4      2023-11-07 [1] CRAN (R 4.3.1)
#>  listenv        0.9.1      2024-01-29 [1] CRAN (R 4.3.2)
#>  lubridate      1.9.3      2023-09-27 [1] CRAN (R 4.3.2)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.1.3)
#>  MASS           7.3-60.0.1 2024-01-13 [1] CRAN (R 4.3.2)
#>  Matrix         1.6-5      2024-01-11 [1] CRAN (R 4.3.2)
#>  modeldata    * 1.3.0      2024-01-21 [1] CRAN (R 4.3.2)
#>  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.0.0)
#>  nnet           7.3-19     2023-05-03 [1] CRAN (R 4.3.0)
#>  parallelly     1.37.1     2024-02-29 [1] CRAN (R 4.3.2)
#>  parsnip      * 1.2.0      2024-02-16 [1] CRAN (R 4.3.2)
#>  pillar         1.9.0      2023-03-22 [1] CRAN (R 4.2.3)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.0.0)
#>  prodlim        2023.08.28 2023-08-28 [1] CRAN (R 4.3.2)
#>  purrr        * 1.0.2      2023-08-10 [1] CRAN (R 4.3.1)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.26.0     2024-01-24 [1] CRAN (R 4.3.2)
#>  R.utils        2.12.3     2023-11-18 [1] CRAN (R 4.3.2)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.1.1)
#>  RANN           2.6.1      2019-01-08 [1] CRAN (R 4.0.0)
#>  Rcpp           1.0.12     2024-01-09 [1] CRAN (R 4.3.2)
#>  recipes      * 1.0.10     2024-02-18 [1] CRAN (R 4.3.2)
#>  reprex         2.1.0      2024-01-11 [1] CRAN (R 4.3.2)
#>  rlang          1.1.3      2024-01-10 [1] CRAN (R 4.3.2)
#>  rmarkdown      2.25       2023-09-18 [1] CRAN (R 4.3.2)
#>  ROSE           0.0-4      2021-06-14 [1] CRAN (R 4.3.3)
#>  rpart          4.1.23     2023-12-05 [1] CRAN (R 4.3.2)
#>  rsample      * 1.2.0      2023-08-23 [1] CRAN (R 4.3.1)
#>  rstudioapi     0.15.0     2023-07-07 [1] CRAN (R 4.3.1)
#>  scales       * 1.3.0      2023-11-28 [1] CRAN (R 4.3.2)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.1.2)
#>  styler         1.10.2     2023-08-29 [1] CRAN (R 4.3.1)
#>  survival       3.5-8      2024-02-14 [1] CRAN (R 4.3.2)
#>  themis       * 1.0.2      2023-08-14 [1] CRAN (R 4.3.3)
#>  tibble       * 3.2.1      2023-03-20 [1] CRAN (R 4.2.3)
#>  tidymodels   * 1.1.1      2023-08-24 [1] CRAN (R 4.3.1)
#>  tidyr        * 1.3.1      2024-01-24 [1] CRAN (R 4.3.2)
#>  tidyselect     1.2.1      2024-03-11 [1] CRAN (R 4.3.3)
#>  timechange     0.3.0      2024-01-18 [1] CRAN (R 4.3.2)
#>  timeDate       4032.109   2023-12-14 [1] CRAN (R 4.3.2)
#>  tune         * 1.1.2      2023-08-23 [1] CRAN (R 4.3.1)
#>  utf8           1.2.4      2023-10-22 [1] CRAN (R 4.3.2)
#>  vctrs          0.6.5      2023-12-01 [1] CRAN (R 4.3.2)
#>  withr          3.0.0      2024-01-16 [1] CRAN (R 4.3.2)
#>  workflows    * 1.1.4      2024-02-19 [1] CRAN (R 4.3.2)
#>  workflowsets * 1.0.1      2023-04-06 [1] CRAN (R 4.2.3)
#>  xfun           0.42       2024-02-08 [1] CRAN (R 4.3.2)
#>  yaml           2.3.8      2023-12-11 [1] CRAN (R 4.3.2)
#>  yardstick    * 1.3.0      2024-01-19 [1] CRAN (R 4.3.2)
#> 
#>  [1] C:/Workdir/MyApps/R-Library/4.0
#>  [2] C:/Workdir/MyApps/R/R-4.3.3/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Proposed solution

I'm wondering whether this behaviour can be implemented in the functions in this package, if necessary not as the default behaviour. If there is some other solution that I've missed, I'd be more than happy to learn about it.

@EmilHvitfeldt EmilHvitfeldt added the Question Question about behaviour of package label Apr 17, 2024
EmilHvitfeldt (Member) commented

Hello @koenniem 👋

The main reason why steps such as step_smote() don't work with importance weights is that there is no information about how the weights should be imputed.

Suppose, for example, that the importance weight is a measure of how old an observation is. How should step_smote() fill in the weights? There is no assumption that the weights have any relation to the predictors, so it does the best it can and fills them in with NA.
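
To make that point concrete, here is a minimal base R sketch of a single SMOTE-style interpolation. It is a simplified illustration and not themis's actual implementation; the data frame `minority`, its `weight` column, and the values are made up for the example. The predictors of the synthetic row are defined by the interpolation, but nothing in the algorithm defines a weight for it.

```r
# Simplified illustration of one SMOTE-style step (NOT the themis implementation).
set.seed(1)

# Two hypothetical minority-class rows: numeric predictors plus a weight.
minority <- data.frame(
  compounds = c(100, 120),
  hour      = c(10, 14),
  weight    = c(82, 96)
)
predictors <- c("compounds", "hour")

# Pick a row and a neighbour, then interpolate at a random position
# on the line segment between them.
u   <- runif(1)
x_i <- unlist(minority[1, predictors])
x_j <- unlist(minority[2, predictors])
synthetic <- as.data.frame(as.list(x_i + u * (x_j - x_i)))

# The algorithm gives no rule for deriving a weight, so the new row
# can only receive a missing value here.
synthetic$weight <- NA_real_

result <- rbind(minority, synthetic)
print(result)
```

Interpolating the weight as well would be one possible rule, but it assumes the weight varies smoothly with the predictors, which is exactly the assumption that cannot be made in general.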

If you have prior knowledge, you could use step_mutate(w = if_else(is.na(w), importance_weights(52), w)), but you should be very careful when doing so.

And honestly, you would be better off using step_upsample() or step_downsample(), as they work with weights.
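
As a rough sketch of why upsampling is compatible with weights: it only duplicates existing rows, so each copy carries the weight of the row it came from. The base R code below is an assumption-laden simplification, not how step_upsample() is implemented; the data frame `dat` and its values are invented for the example.

```r
# Simplified illustration of upsampling by row duplication
# (NOT the step_upsample() implementation).
set.seed(1)

dat <- data.frame(
  compounds = c(97, 101, 93, 997),
  weight    = c(103, 75, 76, 137),
  class     = factor(c("VF", "VF", "VF", "F"))
)

# Target size per class: the size of the largest class.
n_max <- max(table(dat$class))

# For each class, keep its rows and draw extra rows with replacement
# until the class has n_max rows.
idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$class), function(rows) {
  extra <- n_max - length(rows)
  c(rows, rows[sample.int(length(rows), extra, replace = TRUE)])
}))

upsampled <- dat[idx, ]
rownames(upsampled) <- NULL
print(upsampled)
```

Every row in `upsampled` is a copy of an original row, so no weight is ever missing and no imputation rule is needed.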
