add fairness metrics #434
Conversation
The forthcoming metrics execute these functions via their constructor at load time and thus introduce errors at `load_all()`. We could instead rename the fairness files so they are sourced last, e.g. `zzz-fair.R`.
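For reference, R sources the files in `R/` alphabetically unless told otherwise, which is why a `zzz-` prefix works; an explicit `Collate` field in `DESCRIPTION` would be another way to pin the order (a sketch only — `fair-metrics.R` is a hypothetical file name, and a real `Collate` field must list every file in the package):

Collate:
    'fair-aaa.R'
    'fair-metrics.R'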
This doesn't actually function independently as a commit, as I've linked out to the metrics made with this function in docs, but is otherwise able to stand on its own.
`demographic_parity()`, `equalized_odds()`, and `equal_opportunity()` are all special cases of `fairness_metric()`.
Condition
  Error in `dplyr::group_by()`:
  ! Must group by variables found in `.data`.
  x Column `nonexistent_column` is not found.
`fairness_metric()` currently defers to `group_by()` to raise errors for columns that don't exist, and this is surfaced in the error context.
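If surfacing a yardstick-native error were preferred over deferring, an up-front check might look like this (a sketch with a hypothetical helper; the PR's real interface is tidyselect-based rather than string-based):

check_by_col <- function(data, by) {
  # error before dplyr::group_by() ever sees the column;
  # assumes `by` is a column name given as a string
  if (!by %in% colnames(data)) {
    cli::cli_abort("Column {.code {by}} is not found in {.arg data}.")
  }
}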
R/fair-aaa.R (outdated)

#' )
#'
#' @export
fairness_metric <- function(.fn, .name, .post) {
The second and third issues linked in the PR description may be related here; there's nothing fairness-specific about this function for now besides the naming choices. This might be a solution for grouped metrics generally.
Yes, I would lean toward changing the naming here (and the `direction = "minimize"` default?) so that it is more clearly `groupwise_metric()` or similar. I'd change the section in the pkgdown site to something like "Fairness and Group Metrics". I think this is the right way to go, both because folks have non-fairness group metric needs and because then the name helps users understand how fairness metrics work. I think it's better for learning/using, not worse.
I totally agree, switching this file over to talk about them as "group-wise" metrics is the right move.
I'm game! Thank you.

A difficult bit here is that all yardstick metrics know about groups, so I want to make sure we don't imply that non-fairness metrics aren't group-aware; there just isn't an intermediate grouped operation happening under the hood for them. I do think that `groupwise_metric()` could be a good way to phrase that (accompanied by strong docs), but I'm also very much open to other ideas, especially if there's some dplyr-ish concept that already speaks to this.
Notes from the group meeting:
- Max mentioned that it might be nice to prefix whatever this function name is with `create_` or some other eliciting verb to indicate that this is a function factory, and others agreed.
- I suggested `disparity_metric` as a descriptor for this type of metric that doesn't have as strong of a social connotation; "disparity" could describe differences across groups regardless of whether that group is regarded as a sensitive feature.
I propose moving forward with `create_disparity_metric()`. Any thoughts?
I like the idea of using `create_*` here, but it is a departure from the other function factories in yardstick (of which there are a lot, like `metric_tweak()`, `metric_set()`, and so forth). Do you think it's better to stay closer to the naming conventions of yardstick, or to use something like `create_*`?

I have a mild preference for something like `create_groupwise_metric()`, because I think there is more ML community vocabulary around what "groupwise" means. The word "disparity" makes me think of the specific metric disparate impact. That being said, my opinion is not super strong here.
Good points! If we want to look to other function factories in the package, maybe the parallel we want to draw is with `new_metric()`? Something like `new_groupwise_metric()`?
I like that a lot, `new_groupwise_metric()` 👍
My opinion is that this approach (with `.fn` and `.post`) looks excellent. We've got the basic fairness metrics included, but the ability to make new ones is fairly straightforward as well. FWIW, it feels quite a bit easier here to make a new fairness metric function than to make a custom yardstick metric.

Given that we are doing a lot of function-factory-type stuff, it might be nice to give folks a little more help in learning about and dealing with the inputs and outputs:

- Maybe a link in the docs for the current `fairness_metric()` to https://adv-r.hadley.nz/function-factories.html (plus use the phrase "function factory").
- A print method for the `metric` functions, if people are going to be creating them and dealing with them:
yardstick::sens
#> function (data, ...)
#> {
#> UseMethod("sens")
#> }
#> <bytecode: 0x117ee7e30>
#> <environment: namespace:yardstick>
#> attr(,"direction")
#> [1] "maximize"
#> attr(,"class")
#> [1] "class_metric" "metric" "function"
## just an idea:
format.metric <- function(x, ...) {
  # the first class is the most specific one, e.g. "class_metric"
  first_class <- class(x)[[1]]
  cli::cli_format_method({
    cli::cli_h3("A {.cls {first_class}} function to {attr(x, 'direction')} metrics")
  })
}

print.metric <- function(x, ...) {
  cat(format(x), sep = "\n")
  invisible(x)
}
yardstick::sens
#>
#> ── A <class_metric> function to maximize metrics
Created on 2023-05-15 with reprex v2.0.2
Super excited about these changes!
R/fair-aaa.R (outdated)

diff_range <- function(x, ...) {
  estimates <- x$.estimate

  max(estimates) - min(estimates)
Add `na.rm` to both of these.
Possibly? Should test interactions with `na_rm` here.
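A sketch of what that might look like (hypothetical plumbing; the PR may pass `na_rm` through differently):

diff_range <- function(x, na_rm = TRUE, ...) {
  estimates <- x$.estimate
  if (na_rm) {
    # drop NA group-wise estimates before taking the range
    estimates <- estimates[!is.na(estimates)]
  }
  max(estimates) - min(estimates)
}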
#' output of this function can be used.
#'
#' @examples
#' data(hpc_cv)
If we decide to move forward with a fairness-oriented name, it'd be great if we could use some example data here that has a plausibly "sensitive" attribute. yardstick doesn't `Suggests` modeldata at the moment, which has some options. `infer::gss` would also work well here.
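For example, an `@examples` entry could look something like this (a sketch: it assumes infer is available, and the `pred` column is fabricated so the example is self-contained):

library(yardstick)
data(gss, package = "infer")

# fabricate predictions by permuting the outcome; `sex` plays the role of
# the plausibly "sensitive" attribute
set.seed(1)
gss$pred <- sample(gss$college)

dp <- metric_set(demographic_parity(sex))
dp(gss, truth = college, estimate = pred)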
re: #434 (comment), I think it's fine, even preferable, that the interface makes clear that these functions are factories. I'm not on board for making e.g. `metric_set(roc_auc, demographic_parity)` work as-is; specifically, I'd rather we nudge people to the right documentation to learn how to use the disparity metric functions, where "disparity metric function" is a link to the docs.
now refers to the yardstick functions and `.post` values in docs. also, moves `max_positive_rate_diff()` to the top of the file so that `fairness_metric()` doesn't error out on load.
FWIW I also think it is a good thing (not a problem) that the syntax emphasizes that a function factory is involved.
I think part of the confusion arises from how we use these functions. If we change

metric_set(
  roc_auc,
  accuracy,
  demographic_parity(cyl)
)

to the following, then we are no longer calling the factory inline inside `metric_set()`:

demographic_parity_cyl <- demographic_parity(cyl)

metric_set(
  roc_auc,
  accuracy,
  demographic_parity_cyl
)

In the end this code is a little tricky because we are using a function factory. I'm happy with the interface as it stands right now, and I think we can get a lot of leeway by using better printing, plus we can add some robust error handling in `metric_set()`.
gives `fairness_metric()` output an additional `metric_factory` class
library(yardstick)
metric_set(demographic_parity)
#> Error in `metric_set()`:
#> ! The input `demographic_parity` is a group-wise metric
#> (`?yardstick::new_groupwise_metric()`) factory and must be passed a data-column
#> before addition to a metric set.
#> ℹ Did you mean to type e.g. `demographic_parity(col_name)`?
metric_set(demographic_parity, equal_opportunity)
#> Error in `metric_set()`:
#> ! The inputs `demographic_parity` and `equal_opportunity` are group-wise
#> metric (`?yardstick::new_groupwise_metric()`) factories and must be passed a
#> data-column before addition to a metric set.
#> ℹ Did you mean to type e.g. `demographic_parity(col_name)`?

Created on 2023-06-22 with reprex v2.0.2
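For reference, the guard this implies inside `metric_set()` could be as small as the following (a sketch with hypothetical helper names, not yardstick's actual internals):

# detect factory outputs that haven't been given a `by` column yet
is_metric_factory <- function(x) {
  inherits(x, "metric_factory")
}

check_not_factory <- function(fns) {
  bad <- names(fns)[vapply(fns, is_metric_factory, logical(1))]
  if (length(bad) > 0) {
    cli::cli_abort(c(
      "The input {.code {bad}} is a group-wise metric factory and must be
       passed a data-column before addition to a metric set.",
      "i" = "Did you mean to type e.g. {.code demographic_parity(col_name)}?"
    ))
  }
}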
@EmilHvitfeldt Some big-picture notes for re-review: I'd like to discuss and make some changes related to #434 (comment) before merging! The next steps for this work are:

If you'd like 1) to happen in this PR, that totally works; otherwise I'll plan to follow up on this PR with a smaller, separate one. After 1-3, it's time to delve into functionality for mitigation. 🏄
I would like the vignette work to happen in a new PR. It will be cleaner that way!
Pkgdown CI doesn't work because http://r-project.org/ is down. I'm merging anyway.
This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
This PR implements a fairness metric constructor, `fairness_metric()`, as well as 3 outputs of it giving canonical fairness metrics. One of them is `demographic_parity()`.

Under the hood, the 3 new metrics are "group-aware": each is associated with a "by" data-column that, given some dataset, generates pre-existing yardstick metrics by group and then summarizes across groups back to 1 number, the usual level of observation.
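Roughly, that computation amounts to the following (a sketch using yardstick's `hpc_cv` data, `Resample` as the grouping column, and `detection_prevalence` as the underlying metric; the real implementation differs):

library(dplyr)
library(yardstick)

data(hpc_cv)

hpc_cv |>
  group_by(Resample) |>                # the "by" column
  detection_prevalence(obs, pred) |>   # pre-existing metric, computed per group
  summarise(.estimate = max(.estimate) - min(.estimate))  # back to 1 number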
They're created with `fairness_metric()`; the actual definition of `demographic_parity()` looks like the factory call sketched below.

The idea here is that users can use `fairness_metric()` to create fairness metrics with their modeling context in mind.

This is already a large PR, though there's a lot more to do here. I felt that this was enough of a start to give a solid roadmap if we were to move ahead with this approach to assessment.
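The factory call behind `demographic_parity()` plausibly reduces to something like this (a sketch assuming the `.fn`/`.name`/`.post` signature from the diff; `diff_range` is the post-processor shown earlier and `detection_prevalence` is the underlying yardstick metric):

demographic_parity <- fairness_metric(
  .fn = detection_prevalence,   # metric computed within each group
  .name = "demographic_parity",
  .post = diff_range            # collapse group-wise estimates: max - min
)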
Related to #176, #371, #421. A follow-up tune PR is coming in a moment. :)
Reviewing commit-by-commit should be easier than via the whole PR!