Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prediction plots for logistic models: improvements #120

Open
DominiqueMakowski opened this issue Jun 9, 2021 · 16 comments
Open

Prediction plots for logistic models: improvements #120

DominiqueMakowski opened this issue Jun 9, 2021 · 16 comments
Labels
Plot 🎇 Something related to plotting

Comments

@DominiqueMakowski
Copy link
Member

The current default plot for logistic models is like that:

p <- plot(modelbased::estimate_relation(glm(vs ~ mpg, data = mtcars, family = "binomial")))
p

However, when the outcome variable is not on the 0-1 scale, but is a factor (and thus 1-2 by default), the plot is messed up:

p <- plot(modelbased::estimate_relation(glm(sex ~ body_mass_g, data = palmerpenguins::penguins, family = "binomial")))
p

Created on 2021-06-09 by the reprex package (v1.0.0)

What would be the most elegant solution to plot the line on the outcome's scale?


On a related note, I'm thinking about moving from a pseudo-rug geoms for data points towards maybe something that would look like this:

image

What's the best way to create something like that? Using ggdists?

@DominiqueMakowski DominiqueMakowski added the Bug 🐛 Something isn't working label Jun 9, 2021
@mattansb
Copy link
Member

mattansb commented Jun 9, 2021

However, when the outcome variable is not on the 0-1 scale, but is a factor (and thus 1-2 by default), the plot is messed up:

if (outcome_is_factor) {
  predicted <- predicted + 1
}

DominiqueMakowski added a commit that referenced this issue Jun 10, 2021
@DominiqueMakowski
Copy link
Member Author

The main issue is fixed using your suggestion Mattan, even though I'm not sure whether it's the best solution to modify the predictions rather than, somehow, the levels. Because, for instance, it also messes up the position of the dotdensity.

On top of that, I'm not sure how to pick the best default values for the dotdensity so that the plots are fairly consistent (the dots have the same size or the same "height" ) in respect to the amount of data.

library(ggplot2)

p <- plot(modelbased::estimate_relation(glm(vs ~ mpg, data = mtcars, family = "binomial")))
p +
  geom_dotplot(
    aes_string(x = "mpg", y = "vs"),
    data = mtcars[mtcars[["vs"]] == 0, ],
    method = "dotdensity",
    binwidth = 1/30 * diff(range(mtcars[["mpg"]])),
    stackdir = "up",
    show.legend = FALSE)

p <- plot(modelbased::estimate_relation(glm(sex ~ body_mass_g, data = palmerpenguins::penguins, family = "binomial")))
p +
  geom_dotplot(
    aes_string(x = "body_mass_g", y = "sex"),
    data = palmerpenguins::penguins[palmerpenguins::penguins[["sex"]] == "female", ],
    method = "dotdensity",
    binwidth = 1/30 * diff(range(palmerpenguins::penguins[["body_mass_g"]], na.rm = TRUE)),
    stackdir = "up",
    show.legend = FALSE)
#> Warning: Removed 11 rows containing non-finite values (stat_bindot).

Created on 2021-06-10 by the reprex package (v1.0.0)

@mattansb
Copy link
Member

These are the two options I can think of...

library(ggplot2)

fct <- mtcars$am |> factor()
x <- mtcars$mpg
pred <- runif(length(x))

ggplot(mapping = aes(x, fct)) + 
  geom_point() + 
  geom_point(aes(y = pred + 1), color = "red")

ggplot(mapping = aes(x, as.numeric(fct) - 1)) + 
  scale_y_continuous(breaks = c(0, 1), labels = levels(fct)) + 
  geom_point() + 
  geom_point(aes(y = pred), color = "red")

Created on 2021-06-10 by the reprex package (v2.0.0)

@bwiernik
Copy link
Contributor

What do you think of the viewport overlay I linked to?

@DominiqueMakowski
Copy link
Member Author

What do you think of the viewport overlay I linked to?

What do you mean?

@bwiernik
Copy link
Contributor

That code produces image

@DominiqueMakowski
Copy link
Member Author

DominiqueMakowski commented Jun 13, 2021

Yes this is my goal, it is the same as above in my example; i.e., it uses ggplot2::geom_dotplot. The problem of that geom is that it doesn't have afaik good default adjustments for the size of the dots, it must be specified manually to look really nice (as in the figure you mention). So my question was whether we can find some rules to adjust the dotplot parameters so that the plots look consistent (so that the user don't have to manually specificy the binwindth / dots-size)

@bwiernik
Copy link
Contributor

@mjskay you are the leading expert on dotplot scaling. Any ideas?

@mjskay
Copy link

mjskay commented Jun 14, 2021

sure, here's a quick example with ggdist::geom_dots, which is designed for exactly this kind of problem :)

# took me a minute to get this working since your master branch appears broken?
p <- plot(modelbased::estimate_relation(glm(sex ~ body_mass_g, data = palmerpenguins::penguins, family = "binomial")))

p +
  ggdist::geom_dots(
    aes_string(x = "body_mass_g", y = "sex"),
    data = na.omit(palmerpenguins::penguins[palmerpenguins::penguins[["sex"]] == "female", ]),
    color = "black",
    fill = "gray50",
    alpha = 0.5,
    scale = 0.45,
    side = "top"
  ) +
  ggdist::geom_dots(
    aes_string(x = "body_mass_g", y = "sex"),
    data = na.omit(palmerpenguins::penguins[palmerpenguins::penguins[["sex"]] == "male", ]),
    color = "black",
    fill = "gray50",
    alpha = 0.5,
    scale = 0.45,
    side = "bottom"
  ) +
  theme_light()

image

The use of scale = 0.45 here causes it to pick a binwidth that makes the dotplot at most 45% of the space between items on the categorical y axis, so you should be guaranteed at least 10% space between the tips of the dotplots. The binwidth adjusts to maintain this distance if you resize the plot.

Incidentally writing this example this caused me to find a bug (mjskay/ggdist#74) in the NA handling in ggdist::geom_dots() when the y axis has manually-defined limits, so thanks :) (should be a quick fix).

@DominiqueMakowski
Copy link
Member Author

Awesome, thanks a lot for your input! ☺️ Let's go with ggdist's solution then

@mjskay
Copy link

mjskay commented Jun 14, 2021

Happy to help! :)

Incidentally I fixed the NA handling bug in geom_dots, so the github version should now do the correct thing given na.rm = TRUE, which you can use instead of na.omit() as in the example I gave above.

@mjskay
Copy link

mjskay commented Jun 14, 2021

Also, if you find any weird corner cases where the automatic bin width detection doesn't work well, please let me know --- it's a hairy problem and improvements have really been driven by real-world examples of when it doesn't work. I'm happy to make further improvements!

@sven-stodtmann
Copy link

It seems I am late to the party... but in my field (pharmacometrics) the standard diagnostic plot for logistic regressions is an overlay with binned observations (often "quartiles" -- actually fourths, but they're called quartiles....)
plot_logreg(glm(vs ~ mpg, data = mtcars, family = "binomial"),terms='mpg',observed=cut_x(4))
image

I'm not sure how something like this would work with the dotplots

ds <- mtcars%>%mutate(am=factor(am))
plt_rng <- seq(1,40)
plot_logreg(glm(vs ~ mpg+am, data = ds, family = "binomial"),terms=c('mpg [plt_rng]','am'),observed=cut_x(2),by='am')

image

they're not strictly goodness of fit plots, but they give some sense of how the model predicts the data, especially if all relevant predictors can be visualized.

Internally, this uses both ggeffects and emmeans, I am not sure if all functionality can be also created with modelbased, but if it could and this type of plot can be natively supported, that would be cool.
maybe a geom_binom_binned()? But I imagine it would be hard to keep track of the facets/groups.

@strengejacke
Copy link
Member

I think the initial issue is fixed, we "just" need to add the ggdist-stuff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Plot 🎇 Something related to plotting
Projects
None yet
Development

No branches or pull requests

6 participants