-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[New Feature]: "cube" method for balanced sampling. #88
Comments
Hello @coforfe 👋 Looks interesting! Could you say a little bit how this methods to be used to handle unbalanced data? |
Hi Emil, Yes, perhaps with this commented example, it is more clear. #-------------------- Library Loading --------------
library(sampling) # To use samplecube function - slow way
library(scorecard) # To get germancredit data.
library(dplyr) # Data wrangling.
library(janitor) # To clean names.
library(recipes) # To apply some transformations.
#------ Some checks.
#-- Target Distribution
table(germancredit$creditability)
# bad good
# 300 700
#------ Let's sample 300 rows from the "good" class.
#-- For demonstration purposes we will use just a subset of columns.
#-- Some numeric columns and some of the character columns.
german_good <- germancredit %>%
clean_names() %>%
filter(creditability == "good") %>%
select( duration_in_month, credit_amount, age_in_years,
status_of_existing_checking_account, credit_history,
savings_account_and_bonds, foreign_worker, creditability)
german_good_tr <- recipe( creditability ~ . , data = german_good) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
prep() %>% bake( new_data = NULL) %>%
select(-creditability) %>%
clean_names()
#---- Preparation for samplecube
target_df <- as.data.frame(table(germancredit$creditability))
perc_minority <- min(target_df$Freq) / max(target_df$Freq)
prob_minority <- rep(perc_minority, nrow(german_good_tr))
german_good_matrix <- as.matrix(german_good_tr)
#------------------------ SLOW APPROACH --------------
#---- Cube sampling gets back row indexes to sample.
s_slow <- samplecube( X = german_good_matrix, pik = prob_minority, order = 2, comment = FALSE, method = 1)
#---- Select subsample
german_slow <- german_good[ s_slow, ]
#--- Final sampled data.frame
german_sampled_slow <- rbind(
german_slow,
germancredit %>% filter(creditability == "bad") %>%
clean_names() %>%
select( duration_in_month, credit_amount, age_in_years,
status_of_existing_checking_account, credit_history,
savings_account_and_bonds, foreign_worker, creditability)
)
#--------------------- FAST APPROACH -------------
#--------------------- By using fast "BalancedSampling".
library(BalancedSampling)
#--- "cube" function gets back row indexes to sample.
s_fast <- cube( p = prob_minority, X = german_good_matrix)
#---- Select subsample
german_fast <- german_good[ s_fast, ]
#--- Final sampled data.frame
german_sampled_fast <- rbind(
german_fast,
germancredit %>% filter(creditability == "bad") %>%
clean_names() %>%
select( duration_in_month, credit_amount, age_in_years,
status_of_existing_checking_account, credit_history,
savings_account_and_bonds, foreign_worker, creditability)
)
Thanks, |
EmilHvitfeldt
added
feature
a feature request or enhancement
New Step
a new recipe step
labels
Apr 8, 2022
That looks like it would work. I'll get back to it in a little bit, I'm working on some higher priority projects these few months. |
Thanks Emil!. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Hi Emil,
Thanks for your work on this package.
Feature
I recently discovered this balance sampling method
cube
(paper). I think it could be very nice to have it included here inthemis
.It is already included in package BalancedSampling which offers a very fast (C++) implementation.
Thanks again,
Carlos.
The text was updated successfully, but these errors were encountered: