
add fairness metrics #434

Merged · 34 commits into main · Oct 27, 2023

Conversation

@simonpcouch (Contributor):

First, an example with tune:

library(tidymodels)

# check out gss data from the infer package
gss
#> # A tibble: 500 × 11
#>     year   age sex    college   partyid hompop hours income class finrela weight
#>    <dbl> <dbl> <fct>  <fct>     <fct>    <dbl> <dbl> <ord>  <fct> <fct>    <dbl>
#>  1  2014    36 male   degree    ind          3    50 $2500… midd… below …  0.896
#>  2  1994    34 female no degree rep          4    31 $2000… work… below …  1.08 
#>  3  1998    24 male   degree    ind          1    40 $2500… work… below …  0.550
#>  4  1996    42 male   no degree ind          4    40 $2500… work… above …  1.09 
#>  5  1994    31 male   degree    rep          2    40 $2500… midd… above …  1.08 
#>  6  1996    32 female no degree rep          4    53 $2500… midd… average  1.09 
#>  7  1990    48 female no degree dem          2    32 $2500… work… below …  1.06 
#>  8  2016    36 female degree    ind          1    20 $2500… midd… above …  0.478
#>  9  2000    30 female degree    rep          5    40 $2500… midd… average  1.10 
#> 10  1998    33 female no degree dem          2    40 $1500… work… far be…  0.550
#> # ℹ 490 more rows

# tune an xgboost model on college completion
res <-
  tune_grid(
    boost_tree("classification", trees = tune(), min_n = tune()),
    college ~ age + income,
    vfold_cv(gss, v = 10, repeats = 2),
    metrics = metric_set(roc_auc, demographic_parity(sex))
  )

collect_metrics(res)
#> # A tibble: 20 × 9
#>    trees min_n .metric            .estimator .by     mean     n std_err .config 
#>    <int> <int> <chr>              <chr>      <chr>  <dbl> <int>   <dbl> <chr>   
#>  1  1028     3 demographic_parity binary     sex   0.105     20 0.0160  Preproc…
#>  2  1028     3 roc_auc            binary     <NA>  0.611     20 0.0238  Preproc…
#>  3  1767     6 demographic_parity binary     sex   0.109     20 0.0161  Preproc…
#>  4  1767     6 roc_auc            binary     <NA>  0.636     20 0.0242  Preproc…
#>  5   525    12 demographic_parity binary     sex   0.107     20 0.0146  Preproc…
#>  6   525    12 roc_auc            binary     <NA>  0.643     20 0.0225  Preproc…
#>  7   998    17 demographic_parity binary     sex   0.0689    20 0.0128  Preproc…
#>  8   998    17 roc_auc            binary     <NA>  0.632     20 0.0215  Preproc…
#>  9   640    20 demographic_parity binary     sex   0.0669    20 0.00920 Preproc…
#> 10   640    20 roc_auc            binary     <NA>  0.626     20 0.0208  Preproc…
#> 11    38    22 demographic_parity binary     sex   0.0683    20 0.0160  Preproc…
#> 12    38    22 roc_auc            binary     <NA>  0.666     20 0.0207  Preproc…
#> 13  1372    25 demographic_parity binary     sex   0.0727    20 0.0138  Preproc…
#> 14  1372    25 roc_auc            binary     <NA>  0.621     20 0.0219  Preproc…
#> 15  1828    30 demographic_parity binary     sex   0.0527    20 0.0122  Preproc…
#> 16  1828    30 roc_auc            binary     <NA>  0.616     20 0.0177  Preproc…
#> 17  1509    33 demographic_parity binary     sex   0.0544    20 0.00968 Preproc…
#> 18  1509    33 roc_auc            binary     <NA>  0.614     20 0.0174  Preproc…
#> 19   252    39 demographic_parity binary     sex   0.0370    20 0.00633 Preproc…
#> 20   252    39 roc_auc            binary     <NA>  0.633     20 0.0125  Preproc…

autoplot(res)

This PR implements a fairness metric constructor, fairness_metric(), as well as three canonical fairness metrics created with it. One of them is demographic_parity():

# quick data setup: 
gss$college_pred <- gss$college
gss$college_pred[sample(1:nrow(gss), 10)] <- "degree"

dp_sex <- demographic_parity(sex)

# demographic_parity(by) output is just a yardstick metric:
class(dp_sex)
#> [1] "class_metric" "metric"       "function"

# user interacts with it like any other:
dp_sex(gss, truth = college, estimate = college_pred)
#> # A tibble: 1 × 4
#>   .metric            .by   .estimator .estimate
#>   <chr>              <chr> <chr>          <dbl>
#> 1 demographic_parity sex   binary        0.0135

Under the hood, the three new metrics are “group-aware”: each is associated with a “by” data-column. Given some dataset, the metric computes a pre-existing yardstick metric by group and then summarizes across groups back to one number, the usual level of observation.
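
Conceptually (a sketch, reusing the toy college_pred column from above), the computation is:

# compute the underlying yardstick metric within each group, then
# summarize the per-group estimates back down to a single number
gss %>%
  dplyr::group_by(sex) %>%
  detection_prevalence(truth = college, estimate = college_pred) %>%
  dplyr::ungroup() %>%
  dplyr::summarize(.estimate = diff(range(.estimate)))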

They’re created with fairness_metric()—the actual definition of demographic_parity() looks like:

diff_range <- function(x, ...) {
  diff(range(x$.estimate))
}
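
# in the constructor call below, `.fn` is an existing yardstick metric
# computed within each group, and `.post` (here, diff_range()) summarizes
# the per-group `.estimate`s back to one number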

demographic_parity <-
  fairness_metric(
    .fn = detection_prevalence,
    .name = "demographic_parity",
    .post = diff_range
  )

The idea here is:

  1. Make the interface feel as tidymodels-idiomatic as possible—existing idioms “just work”, and
  2. Create a minimal set of canonical metrics using flexible user-facing tooling; we want to encourage users (in future vignettes and docs) to make use of fairness_metric() to create fairness metrics with their own modeling context in mind (see the sketch just below this list).
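
As a quick flavor of that flexibility, a hypothetical custom metric built from the same pieces (reusing diff_range() from above):

# hypothetical: the spread of group-wise accuracies across levels of `by`
accuracy_parity <-
  fairness_metric(
    .fn = accuracy,
    .name = "accuracy_parity",
    .post = diff_range
  )

accuracy_parity(sex)(gss, truth = college, estimate = college_pred)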

This is already a large PR, though there’s a lot more to do here. I felt that this was enough of a start to give a solid roadmap if we were to move ahead with this approach to assessment.

Related to #176, #371, #421. A follow-up tune PR is coming in a moment. :)

Reviewing commit-by-commit should be easier than via the whole PR!

The forthcoming metrics execute these functions via their constructor and thus introduce errors at `load_all()`. We could also instead rename the fairness files to be run last, e.g. `zzz-fair.R`.
This doesn't actually function independently as a commit, as I've linked out to the metrics made with this function in docs, but is otherwise able to stand on its own.
`demographic_parity()`, `equalized_odds()`, and `equal_opportunity()` are all special cases of `fairness_metric()`.
R/fair-aaa.R Outdated
Condition
Error in `dplyr::group_by()`:
! Must group by variables found in `.data`.
x Column `nonexistent_column` is not found.
@simonpcouch (Contributor Author):

fairness_metric() currently defers to group_by() to raise errors for columns that don't exist, and this is surfaced in the error context.
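
For example, a sketch of a call that surfaces that error:

dp_bad <- demographic_parity(nonexistent_column)
dp_bad(gss, truth = college, estimate = college_pred)
#> Error in `dplyr::group_by()`:
#> ! Must group by variables found in `.data`.
#> x Column `nonexistent_column` is not found.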

R/fair-aaa.R Outdated
#' )
#'
#' @export
fairness_metric <- function(.fn, .name, .post) {
@simonpcouch (Contributor Author):

The second and third issues linked in the PR description may be related here—there's nothing fairness-specific about this function for now besides naming choices. This might be a solution for grouped metrics generally.

@juliasilge (Member):

Yes, I would lean toward changing the naming here (and the direction = "minimize" default?) so that it is more clearly groupwise_metric() or similar. I'd change the section in the pkgdown site to something like "Fairness and Group Metrics". I think this is the right way to go both because folks have non-fairness group metric needs, and because then the name helps users understand how fairness metrics work. I think it's better for learning/using, not worse.

@EmilHvitfeldt (Member):

I totally agree, switching this file over to talk about them as "group-wise" metrics is the right move.

@simonpcouch (Contributor Author):

I'm game! Thank you.

A difficult bit here is that all yardstick metrics know about groups, so I want to make sure we don't imply that non-fairness metrics aren't group-aware; there just isn't an intermediate grouped operation happening under the hood. I do think that groupwise_metric() could be a good way to phrase that (accompanied by strong docs), but I'm also very much open to other ideas, esp. if there's some dplyr-ish concept that already speaks to this.
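
For instance (sketching with the toy columns from the PR description), any existing metric already returns per-group estimates on grouped data, just without an aggregation step afterward:

# one row per level of `sex`; no summarization across groups happens
gss %>%
  dplyr::group_by(sex) %>%
  accuracy(truth = college, estimate = college_pred)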

@simonpcouch (Contributor Author):

Notes from the group meeting:

  • Max mentioned that it might be nice to prefix whatever this function name is with create_ or some other eliciting verb to indicate that this is a function factory, and others agreed
  • I suggested disparity_metric as a descriptor for this type of metric that doesn't have as strong of a social connotation—seems like "disparity" could describe differences across groups regardless of whether that group is regarded as a sensitive feature

@simonpcouch (Contributor Author):

I propose moving forward with create_disparity_metric(). Any thoughts?

Member:

I like the idea of using create_* here, but it is a departure compared to the other function factories in yardstick (of which there are a lot, like metric_tweak(), metric_set(), and so forth). Do you think it's better to stay more similar to the naming conventions of yardstick, or to use something like create_*?

I have a mild preference for something like create_groupwise_metric() because I think there is more ML community vocabulary around what "groupwise" means. The word "disparity" makes me think about the specific metric disparate impact. That being said, my opinion is not super strong here.

@simonpcouch (Contributor Author):

Good points--if we want to look to other function factories in the package, maybe the parallel we might want to draw is with new_metric()? Something like new_groupwise_metric()?

Member:

I like that a lot, new_groupwise_metric() 👍
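
Sketching what that rename might look like for the constructor shown earlier (the argument names here are a guess, though an `aggregate` argument does come up later in this thread):

demographic_parity <-
  new_groupwise_metric(
    fn = detection_prevalence,
    name = "demographic_parity",
    aggregate = diff_range
  )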

@juliasilge (Member) left a review:

My opinion is that this approach (with .fn and .post) looks excellent. We've got the basic fairness metrics included, but the ability to make new ones is fairly straightforward as well. FWIW here it feels quite a bit easier to make a new fairness metric function than to make a custom yardstick metric.

Given that we are doing a lot of function factory type of stuff, it might be nice to give folks a little more help in learning/dealing with the inputs and outputs.

  • Maybe a link to this in the docs for the current fairness_metric() (plus use the phrase “function factory”): https://adv-r.hadley.nz/function-factories.html
  • I think a print method for the metric functions would help, if people are going to be creating them and dealing with them:
yardstick::sens
#> function (data, ...) 
#> {
#>     UseMethod("sens")
#> }
#> <bytecode: 0x117ee7e30>
#> <environment: namespace:yardstick>
#> attr(,"direction")
#> [1] "maximize"
#> attr(,"class")
#> [1] "class_metric" "metric"       "function"

## just an idea:
format.metric <- function(x, ...) {
    first_class <- class(x)[[1]]
    cli::cli_format_method({
        cli::cli_h3("A {.cls {first_class}} function to {attr(x, 'direction')} metrics")
    })
}

print.metric <- function(x, ...) {
    cat(format(x), sep = "\n")
    invisible(x)
}

yardstick::sens
#> 
#> ── A <class_metric> function to maximize metrics

Created on 2023-05-15 with reprex v2.0.2


@EmilHvitfeldt (Member) left a review:

Super excited about these changes!

R/fair-aaa.R Outdated
diff_range <- function(x, ...) {
  estimates <- x$.estimate

  max(estimates) - min(estimates)
@simonpcouch (Contributor Author):

Add na.rm to both of these.

@simonpcouch (Contributor Author):

Possibly? Should test interactions with na_rm here.

R/fair-aaa.R Outdated
#' output of this function can be used.
#'
#' @examples
#' data(hpc_cv)
@simonpcouch (Contributor Author):

If we decide to move forward with a fairness-oriented name, it'd be great if we could use some example data here that has a plausibly "sensitive" attribute. yardstick doesn't list modeldata (which has some options) in Suggests at the moment. infer::gss would also work well here.

@simonpcouch (Contributor Author):

re: #434 (comment)

I think it's fine, even preferable, that the interface makes clear that roc_auc and demographic_parity(cyl) are the same type of thing (and thus that roc_auc and demographic_parity are different). metric_set() is itself a function factory, so I don't think it's unrealistic to expect our users to be able to reason about functions that output functions.

“How can we differentiate the classes of metric functions to avoid the appearance of discordant inputs (function names vs. executed functions)?”

I'm not on board for making e.g. roc_auc and demographic_parity(cyl) have different classes, but I would definitely be up for giving demographic_parity a special function subclass. In that case, we could check for the mistake:

metric_set(
  roc_auc, 
  demographic_parity
)

...specifically, nudging people to the right documentation to learn how to use the disparity metric functions:

! The input `demographic_parity` is a disparity metric function, and must be associated with a data-column.
* Did you mean to type `demographic_parity(col_name)`?

...where "disparity metric function" is a link to the docs.
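
A minimal sketch of that check (hypothetical helper name; the `metric_factory` subclass is added in a commit further down):

# screen metric_set() inputs for unevaluated group-wise metric factories
check_metric_factory <- function(fn, fn_name) {
  if (inherits(fn, "metric_factory")) {
    cli::cli_abort(c(
      "The input `{fn_name}` is a disparity metric function, and must be associated with a data-column.",
      "*" = "Did you mean to type `{fn_name}(col_name)`?"
    ))
  }
}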

now refers to the yardstick functions and `.post` values in docs.

also, moves `max_positive_rate_diff()` to the top of the file so that `fairness_metric()` doesn't error out on load.
@juliasilge (Member):

FWIW I also think it is a good thing (not a problem) that the syntax emphasizes that demographic_parity(cyl) is the same kind of thing (a function) as roc_auc. Using a subclass would be a good option to help with how folks handle these fairness metrics, and relates back to what I mentioned before about a print method for metrics in general.

@EmilHvitfeldt (Member):

I think part of the confusion arises from how we use these functions.

If we change

metric_set(
  roc_auc, 
  accuracy, 
  demographic_parity(cyl)
)

to the following, then we are no longer using `()` inside `metric_set()`:

demographic_parity_cyl <- demographic_parity(cyl)

metric_set(
  roc_auc, 
  accuracy, 
  demographic_parity_cyl
)

In the end this code is a little tricky because we are using a function factory (demographic_parity()) to create a function that is passed to a different function factory (metric_set()).

I'm happy with the interface as it stands right now. And I think we can get a lot of leeway by using better printing, plus we can add some robust error handling in metric_set().

gives `fairness_metric()` output an additional `metric_factory` class
@simonpcouch (Contributor Author) commented on Jun 22, 2023:

metric_set() will now error (extra) informatively when passed an unevaluated fairness metric function! Interactively, "group-wise metric" is a hyperlink to ?yardstick::new_groupwise_metric().

library(yardstick)
  
metric_set(demographic_parity)
#> Error in `metric_set()`:
#> ! The input `demographic_parity` is a group-wise metric
#>   (`?yardstick::new_groupwise_metric()`) factory and must be passed a data-column
#>   before addition to a metric set.
#> ℹ Did you mean to type e.g. `demographic_parity(col_name)`?

metric_set(demographic_parity, equal_opportunity)
#> Error in `metric_set()`:
#> ! The inputs `demographic_parity` and `equal_opportunity` are group-wise
#>   metric (`?yardstick::new_groupwise_metric()`) factories and must be passed a
#>   data-column before addition to a metric set.
#> ℹ Did you mean to type e.g. `demographic_parity(col_name)`?

Created on 2023-06-22 with reprex v2.0.2

@simonpcouch (Contributor Author) commented on Jun 26, 2023:

@EmilHvitfeldt Some big-picture notes for re-review:

I'd like to discuss + make some changes related to #434 (comment) before merging!

The next steps for this work are

  1. a vignette (/article?) in yardstick using a proper example to demonstrate the use of fairness metrics,

  2. a tidymodels.org blog post with a separate example that uses these metrics while tuning and makes use of custom fairness metrics, and then

  3. a tidyverse blog post giving the bigger picture of this work and linking out to 1) and 2).

If you'd like 1) to happen in this PR, that totally works, otherwise I'll plan to follow up on this PR with a smaller, separate one.

After 1-3, it's time to delve into functionality for mitigation. 🏄

@EmilHvitfeldt (Member):

I would like the vignette work to happen in a new PR. It will be cleaner that way!

@simonpcouch (Contributor Author):

Realizing that `aggregate` is misspelled as `aggregrate`---one moment. 🙃

@EmilHvitfeldt (Member):

Pkgdown CI doesn't work because http://r-project.org/ is down. I'm merging anyway.

@EmilHvitfeldt merged commit ce03a94 into main on Oct 27, 2023
11 of 12 checks passed
@EmilHvitfeldt deleted the fairness branch on October 27, 2023 at 21:17
@github-actions (bot):

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions bot locked and limited conversation to collaborators on Nov 11, 2023