
add fairness metrics #434

Merged · 34 commits into main · Oct 27, 2023

Conversation

@simonpcouch (Contributor):

First, an example with tune:

library(tidymodels)

# check out gss data from the infer package
gss
#> # A tibble: 500 × 11
#>     year   age sex    college   partyid hompop hours income class finrela weight
#>    <dbl> <dbl> <fct>  <fct>     <fct>    <dbl> <dbl> <ord>  <fct> <fct>    <dbl>
#>  1  2014    36 male   degree    ind          3    50 $2500… midd… below …  0.896
#>  2  1994    34 female no degree rep          4    31 $2000… work… below …  1.08 
#>  3  1998    24 male   degree    ind          1    40 $2500… work… below …  0.550
#>  4  1996    42 male   no degree ind          4    40 $2500… work… above …  1.09 
#>  5  1994    31 male   degree    rep          2    40 $2500… midd… above …  1.08 
#>  6  1996    32 female no degree rep          4    53 $2500… midd… average  1.09 
#>  7  1990    48 female no degree dem          2    32 $2500… work… below …  1.06 
#>  8  2016    36 female degree    ind          1    20 $2500… midd… above …  0.478
#>  9  2000    30 female degree    rep          5    40 $2500… midd… average  1.10 
#> 10  1998    33 female no degree dem          2    40 $1500… work… far be…  0.550
#> # ℹ 490 more rows

# tune an xgboost model on college completion
res <-
  tune_grid(
    boost_tree("classification", trees = tune(), min_n = tune()),
    college ~ age + income,
    vfold_cv(gss, v = 10, repeats = 2),
    metrics = metric_set(roc_auc, demographic_parity(sex))
  )

collect_metrics(res)
#> # A tibble: 20 × 9
#>    trees min_n .metric            .estimator .by     mean     n std_err .config 
#>    <int> <int> <chr>              <chr>      <chr>  <dbl> <int>   <dbl> <chr>   
#>  1  1028     3 demographic_parity binary     sex   0.105     20 0.0160  Preproc…
#>  2  1028     3 roc_auc            binary     <NA>  0.611     20 0.0238  Preproc…
#>  3  1767     6 demographic_parity binary     sex   0.109     20 0.0161  Preproc…
#>  4  1767     6 roc_auc            binary     <NA>  0.636     20 0.0242  Preproc…
#>  5   525    12 demographic_parity binary     sex   0.107     20 0.0146  Preproc…
#>  6   525    12 roc_auc            binary     <NA>  0.643     20 0.0225  Preproc…
#>  7   998    17 demographic_parity binary     sex   0.0689    20 0.0128  Preproc…
#>  8   998    17 roc_auc            binary     <NA>  0.632     20 0.0215  Preproc…
#>  9   640    20 demographic_parity binary     sex   0.0669    20 0.00920 Preproc…
#> 10   640    20 roc_auc            binary     <NA>  0.626     20 0.0208  Preproc…
#> 11    38    22 demographic_parity binary     sex   0.0683    20 0.0160  Preproc…
#> 12    38    22 roc_auc            binary     <NA>  0.666     20 0.0207  Preproc…
#> 13  1372    25 demographic_parity binary     sex   0.0727    20 0.0138  Preproc…
#> 14  1372    25 roc_auc            binary     <NA>  0.621     20 0.0219  Preproc…
#> 15  1828    30 demographic_parity binary     sex   0.0527    20 0.0122  Preproc…
#> 16  1828    30 roc_auc            binary     <NA>  0.616     20 0.0177  Preproc…
#> 17  1509    33 demographic_parity binary     sex   0.0544    20 0.00968 Preproc…
#> 18  1509    33 roc_auc            binary     <NA>  0.614     20 0.0174  Preproc…
#> 19   252    39 demographic_parity binary     sex   0.0370    20 0.00633 Preproc…
#> 20   252    39 roc_auc            binary     <NA>  0.633     20 0.0125  Preproc…

autoplot(res)

This PR implements a fairness metric constructor, fairness_metric(), as well as three canonical fairness metrics created with it. One of them is demographic_parity():

# quick data setup: 
gss$college_pred <- gss$college
gss$college_pred[sample(1:nrow(gss), 10)] <- "degree"

dp_sex <- demographic_parity(sex)

# demographic_parity(by) output is just a yardstick metric:
class(dp_sex)
#> [1] "class_metric" "metric"       "function"

# user interacts with it like any other:
dp_sex(gss, truth = college, estimate = college_pred)
#> # A tibble: 1 × 4
#>   .metric            .by   .estimator .estimate
#>   <chr>              <chr> <chr>          <dbl>
#> 1 demographic_parity sex   binary        0.0135

Under the hood, the three new metrics are “group-aware”: each is associated with a “by” data-column. Given some dataset, the metric computes a pre-existing yardstick metric by group and then summarizes across groups back to one number, the usual level of observation.
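
Conceptually (a sketch, reusing the toy college_pred column from above), the computation is:

# compute the underlying yardstick metric within each group, then
# summarize the per-group estimates back down to a single number
gss %>%
  dplyr::group_by(sex) %>%
  detection_prevalence(truth = college, estimate = college_pred) %>%
  dplyr::ungroup() %>%
  dplyr::summarize(.estimate = diff(range(.estimate)))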

They’re created with fairness_metric()—the actual definition of demographic_parity() looks like:

diff_range <- function(x, ...) {
  diff(range(x$.estimate))
}
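
# in the constructor call below, `.fn` is an existing yardstick metric
# computed within each group, and `.post` (here, diff_range()) summarizes
# the per-group `.estimate`s back to one number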

demographic_parity <-
  fairness_metric(
    .fn = detection_prevalence,
    .name = "demographic_parity",
    .post = diff_range
  )

The idea here is:

  1. Make the interface feel as tidymodels-idiomatic as possible—existing idioms “just work”, and
  2. Create a minimal set of canonical metrics using flexible user-facing tooling; we want to encourage users (in future vignettes and docs) to make use of fairness_metric() to create fairness metrics with their own modeling context in mind (see the sketch just below this list).
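
As a quick flavor of that flexibility, a hypothetical custom metric built from the same pieces (reusing diff_range() from above):

# hypothetical: the spread of group-wise accuracies across levels of `by`
accuracy_parity <-
  fairness_metric(
    .fn = accuracy,
    .name = "accuracy_parity",
    .post = diff_range
  )

accuracy_parity(sex)(gss, truth = college, estimate = college_pred)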

This is already a large PR, though there’s a lot more to do here. I felt that this was enough of a start to give a solid roadmap if we were to move ahead with this approach to assessment.

Related to #176, #371, #421. A follow-up tune PR is coming in a moment. :)

Reviewing commit-by-commit should be easier than via the whole PR!

The forthcoming metrics execute these functions via their constructor and thus introduce errors at `load_all()`. We could also instead rename the fairness files to be run last, e.g. `zzz-fair.R`.
This doesn't actually function independently as a commit, as I've linked out to the metrics made with this function in docs, but is otherwise able to stand on its own.
`demographic_parity()`, `equalized_odds()`, and `equal_opportunity()` are all special cases of `fairness_metric()`.
R/fair-aaa.R Outdated
Condition
Error in `dplyr::group_by()`:
! Must group by variables found in `.data`.
x Column `nonexistent_column` is not found.
@simonpcouch (Contributor Author):

fairness_metric() currently defers to group_by() to raise errors for columns that don't exist, and this is surfaced in the error context.
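
For example, a sketch of a call that surfaces that error:

dp_bad <- demographic_parity(nonexistent_column)
dp_bad(gss, truth = college, estimate = college_pred)
#> Error in `dplyr::group_by()`:
#> ! Must group by variables found in `.data`.
#> x Column `nonexistent_column` is not found.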

R/fair-aaa.R Outdated
#' )
#'
#' @export
fairness_metric <- function(.fn, .name, .post) {
@simonpcouch (Contributor Author):

The second and third issues linked in the PR description may be related here—there's nothing fairness-specific about this function for now besides naming choices. This might be a solution for grouped metrics generally.

@juliasilge (Member):

Yes, I would lean toward changing the naming here (and the direction = "minimize" default?) so that it is more clearly groupwise_metric() or similar. I'd change the section in the pkgdown site to something like "Fairness and Group Metrics". I think this is the right way to go both because folks have non-fairness group metric needs, and because then the name helps users understand how fairness metrics work. I think it's better for learning/using, not worse.

@EmilHvitfeldt (Member):

I totally agree, switching this file over to talk about them as "group-wise" metrics is the right move.

@simonpcouch (Contributor Author):

I'm game! Thank you.

A difficult bit here is that all yardstick metrics know about groups, so I want to make sure we don't imply that non-fairness metrics aren't group-aware; there just isn't an intermediate grouped operation happening under the hood. I do think that groupwise_metric() could be a good way to phrase that (accompanied by strong docs), but I'm also very much open to other ideas, esp. if there's some dplyr-ish concept that already speaks to this.
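
For instance (sketching with the toy columns from the PR description), any existing metric already returns per-group estimates on grouped data, just without an aggregation step afterward:

# one row per level of `sex`; no summarization across groups happens
gss %>%
  dplyr::group_by(sex) %>%
  accuracy(truth = college, estimate = college_pred)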

@simonpcouch (Contributor Author):

Notes from the group meeting:

  • Max mentioned that it might be nice to prefix whatever this function name is with create_ or some other eliciting verb to indicate that this is a function factory, and others agreed
  • I suggested disparity_metric as a descriptor for this type of metric that doesn't have as strong of a social connotation—seems like "disparity" could describe differences across groups regardless of whether that group is regarded as a sensitive feature

@simonpcouch (Contributor Author):

I propose moving forward with create_disparity_metric(). Any thoughts?

Member:

I like the idea of using create_* here, but it is a departure compared to the other function factories in yardstick (of which there are a lot, like metric_tweak(), metric_set(), and so forth). Do you think it's better to stay more similar to the naming conventions of yardstick, or to use something like create_*?

I have a mild preference for something like create_groupwise_metric() because I think there is more ML community vocabulary around what "groupwise" means. The word "disparity" makes me think about the specific metric disparate impact. That being said, my opinion is not super strong here.

@simonpcouch (Contributor Author):

Good points--if we want to look to other function factories in the package, maybe the parallel we might want to draw is with new_metric()? Something like new_groupwise_metric()?

Member:

I like that a lot, new_groupwise_metric() 👍
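
Sketching what that rename might look like for the constructor shown earlier (the argument names here are a guess, though an `aggregate` argument does come up later in this thread):

demographic_parity <-
  new_groupwise_metric(
    fn = detection_prevalence,
    name = "demographic_parity",
    aggregate = diff_range
  )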

@juliasilge (Member) left a review:

My opinion is that this approach (with .fn and .post) looks excellent. We've got the basic fairness metrics included, but the ability to make new ones is fairly straightforward as well. FWIW here it feels quite a bit easier to make a new fairness metric function than to make a custom yardstick metric.

Given that we are doing a lot of function factory type of stuff, it might be nice to give folks a little more help in learning/dealing with the inputs and outputs.

  • Maybe a link to this in the docs for the current fairness_metric() (plus use the phrase “function factory”): https://adv-r.hadley.nz/function-factories.html
  • I think a print method for the metric functions would help, if people are going to be creating them and dealing with them:
yardstick::sens
#> function (data, ...) 
#> {
#>     UseMethod("sens")
#> }
#> <bytecode: 0x117ee7e30>
#> <environment: namespace:yardstick>
#> attr(,"direction")
#> [1] "maximize"
#> attr(,"class")
#> [1] "class_metric" "metric"       "function"

## just an idea:
format.metric <- function(x, ...) {
    first_class <- class(x)[[1]]
    cli::cli_format_method({
        cli::cli_h3("A {.cls {first_class}} function to {attr(x, 'direction')} metrics")
    })
}

print.metric <- function(x, ...) {
    cat(format(x), sep = "\n")
    invisible(x)
}

yardstick::sens
#> 
#> ── A <class_metric> function to maximize metrics

Created on 2023-05-15 with reprex v2.0.2


@EmilHvitfeldt (Member) left a review:

Super excited about these changes!

R/fair-aaa.R Outdated
diff_range <- function(x, ...) {
  estimates <- x$.estimate

  max(estimates) - min(estimates)
@simonpcouch (Contributor Author):

Add na.rm to both of these.

@simonpcouch (Contributor Author):

Possibly? Should test interactions with na_rm here.

R/fair-aaa.R Outdated
#' output of this function can be used.
#'
#' @examples
#' data(hpc_cv)
@simonpcouch (Contributor Author):

If we decide to move forward with a fairness-oriented name, it'd be great if we could use some example data here that has a plausibly "sensitive" attribute. yardstick doesn't list modeldata (which has some options) in Suggests at the moment. infer::gss would also work well here.

@simonpcouch (Contributor Author):

re: #434 (comment)

I think it's fine, even preferable, that the interface makes clear that roc_auc and demographic_parity(cyl) are the same type of thing (and thus that roc_auc and demographic_parity are different). metric_set() is itself a function factory, so I don't think it's unrealistic to expect our users to be able to reason about functions that output functions.

“How can we differentiate the classes of metric functions to avoid the appearance of discordant inputs (function names vs. executed functions)?”

I'm not on board for making e.g. roc_auc and demographic_parity(cyl) have different classes, but I would definitely be up for giving demographic_parity a special function subclass. In that case, we could check for the mistake:

metric_set(
  roc_auc, 
  demographic_parity
)

...specifically, nudging people to the right documentation to learn how to use the disparity metric functions:

! The input `demographic_parity` is a disparity metric function, and must be associated with a data-column.
* Did you mean to type `demographic_parity(col_name)`?

...where "disparity metric function" is a link to the docs.
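
A minimal sketch of that check (hypothetical helper name; the `metric_factory` subclass is added in a commit further down):

# screen metric_set() inputs for unevaluated group-wise metric factories
check_metric_factory <- function(fn, fn_name) {
  if (inherits(fn, "metric_factory")) {
    cli::cli_abort(c(
      "The input `{fn_name}` is a disparity metric function, and must be associated with a data-column.",
      "*" = "Did you mean to type `{fn_name}(col_name)`?"
    ))
  }
}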

now refers to the yardstick functions and `.post` values in docs.

also, moves `max_positive_rate_diff()` to the top of the file so that `fairness_metric()` doesn't error out on load.
@juliasilge (Member):

FWIW I also think it is a good thing (not a problem) that the syntax emphasizes that demographic_parity(cyl) is the same kind of thing (a function) as roc_auc. Using a subclass would be a good option to help with how folks handle these fairness metrics, and relates back to what I mentioned before about a print method for metrics in general.

@EmilHvitfeldt (Member):

I think part of the confusion arises from how we use these functions.

If we change

metric_set(
  roc_auc, 
  accuracy, 
  demographic_parity(cyl)
)

to the following, then we are no longer using `()` inside `metric_set()`:

demographic_parity_cyl <- demographic_parity(cyl)

metric_set(
  roc_auc, 
  accuracy, 
  demographic_parity_cyl
)

In the end this code is a little tricky because we are using a function factory (demographic_parity()) to create a function that is passed to a different function factory (metric_set()).

I'm happy with the interface as it stands right now. And I think we can get a lot of leeway by using better printing, plus we can add some robust error handling in metric_set().

gives `fairness_metric()` output an additional `metric_factory` class
@simonpcouch (Contributor Author) commented on Jun 22, 2023:

metric_set() will now error (extra) informatively when passed an unevaluated fairness metric function! Interactively, "group-wise metric" is a hyperlink to ?yardstick::new_groupwise_metric().

library(yardstick)
  
metric_set(demographic_parity)
#> Error in `metric_set()`:
#> ! The input `demographic_parity` is a group-wise metric
#>   (`?yardstick::new_groupwise_metric()`) factory and must be passed a data-column
#>   before addition to a metric set.
#> ℹ Did you mean to type e.g. `demographic_parity(col_name)`?

metric_set(demographic_parity, equal_opportunity)
#> Error in `metric_set()`:
#> ! The inputs `demographic_parity` and `equal_opportunity` are group-wise
#>   metric (`?yardstick::new_groupwise_metric()`) factories and must be passed a
#>   data-column before addition to a metric set.
#> ℹ Did you mean to type e.g. `demographic_parity(col_name)`?

Created on 2023-06-22 with reprex v2.0.2

@simonpcouch (Contributor Author) commented on Jun 26, 2023:

@EmilHvitfeldt Some big-picture notes for re-review:

I'd like to discuss + make some changes related to #434 (comment) before merging!

The next steps for this work are

  1. a vignette (/article?) in yardstick using a proper example to demonstrate the use of fairness metrics,

  2. a tidymodels.org blog post with a separate example that uses these metrics while tuning and makes use of custom fairness metrics, and then

  3. a tidyverse blog post giving the bigger picture of this work and linking out to 1) and 2).

If you'd like 1) to happen in this PR, that totally works, otherwise I'll plan to follow up on this PR with a smaller, separate one.

After 1-3, it's time to delve into functionality for mitigation. 🏄

@EmilHvitfeldt (Member):

I would like the vignette work to happen in a new PR. It will be cleaner that way!

@simonpcouch (Contributor Author):

Realizing that `aggregate` is misspelled as `aggregrate`---one moment. 🙃

@EmilHvitfeldt (Member):

Pkgdown CI doesn't work because http://r-project.org/ is down. I'm merging anyway.

@EmilHvitfeldt merged commit ce03a94 into main on Oct 27, 2023
11 of 12 checks passed
@EmilHvitfeldt deleted the fairness branch on October 27, 2023 at 21:17
@github-actions (bot):

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions bot locked and limited conversation to collaborators on Nov 11, 2023