Incorrect warning about clipping after adding wide xlim values to geom_histogram #5118

Open
cpsyctc2 opened this issue Dec 24, 2022 · 2 comments · May be fixed by #6323


@cpsyctc2

Brief description of the problem

If I add xlim(), or limits in scale_x_continuous(), to a geom_histogram() plot and set the limits outside the range of the data, I see this warning message:
Removed 2 rows containing missing values (geom_bar()).
In fact, nothing has been removed.

True for me on:
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

and

R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

and

R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

All three show ggplot2 at version 3.4.0

Reprex follows.

library(tidyverse)
set.seed(12345)
tibble(x = rnorm(5000) / 10) -> tmpTib
  
tmpTib %>% 
  summarise(min = min(x),
            max = max(x),
            nNA = sum(is.na(x)))
# # A tibble: 1 × 3
# min   max   nNA
# <dbl> <dbl> <int>
#   1 -0.388 0.333     0
### so no missing values and range well inside [-1, 1]
  
ggplot(data = tmpTib,
       aes(x = x)) + 
  geom_histogram()
### plots all 5000 points

ggplot(data = tmpTib,
       aes(x = x)) + 
  geom_histogram() +
  xlim(-1, 1)
### reports:
# Warning message:
# Removed 2 rows containing missing values (`geom_bar()`). 

### same happens using scale_x_continuous(limits = c(-1, 1)):
ggplot(data = tmpTib,
       aes(x = x)) + 
  geom_histogram() +
  scale_x_continuous(limits = c(-1, 1))
# Warning message:
# Removed 2 rows containing missing values (`geom_bar()`). 

tmpTib %>%
  filter(row_number() < 6) -> tmpTibSmall

tmpTibSmall
# # A tibble: 5 × 1
# x
# <dbl>
# 1  0.0586
# 2  0.0709
# 3 -0.0109
# 4 -0.0453
# 5  0.0606

### using small dataset shows that there is actually no removal of data
ggplot(data = tmpTibSmall,
       aes(x = x)) + 
  geom_histogram() +
  scale_x_continuous(limits = c(-.07, .085)) 

ggplot(data = tmpTibSmall,
       aes(x = x)) + 
  geom_histogram() +
  xlim(-.07, .085)

sessionInfo()

I hope I'm not being stupid!

@teunbrand
Collaborator

Because the scale range is larger than the data range, the histogram gets some empty, zero-count bins at its flanks. If these flanking bins are out of bounds, they are censored and dropped, which triggers the warning you see.

We can show in the layer data that there is an empty bin at the start and another at the end, each with an NA for either xmin or xmax (because it got censored).

library(ggplot2)
set.seed(12345)
df <- data.frame(x = rnorm(5000) / 10)

p <- ggplot(data = df, aes(x = x)) + 
  geom_histogram() +
  xlim(-1, 1)

ld <- layer_data(p)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
head(ld, 2)[, 1:8]
#>   y count          x xmin       xmax density ncount ndensity
#> 1 0     0         NA   NA -1.0000000       0      0        0
#> 2 0     0 -0.9655172   -1 -0.9310345       0      0        0
tail(ld, 2)[, 1:8]
#>    y count         x      xmin      xmax density ncount ndensity
#> 29 0     0 0.8965517 0.8620690 0.9310345       0      0        0
#> 30 0     0 0.9655172 0.9310345        NA       0      0        0

Created on 2022-12-24 by the reprex package (v2.0.1)
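
To illustrate the censoring step in isolation, here is a minimal sketch using scales::oob_censor(), which replaces values outside a range with NA. The bin coordinates are not taken from ggplot2 internals; they are inferred from the layer data above (30 bins over [-1, 1] gives a bin width of 2 / 29), and the names `width` and `first_bin` are only illustrative.

```r
library(scales)

# Assumption: bin width inferred from the layer data above
# (second bin runs from -1 to -0.9310345, i.e. a width of 2 / 29).
width <- 2 / 29

# Hypothetical coordinates of the first, empty bin before censoring:
# it sits just to the left of the lower scale limit.
first_bin <- c(xmin = -1 - width, x = -1 - width / 2, xmax = -1)

# oob_censor() turns anything outside `range` into NA.
censored <- oob_censor(first_bin, range = c(-1, 1))

# xmin and x fall below -1 and become NA; xmax is exactly -1 and survives.
is.na(censored)
```

This matches the pattern in the first row of the layer data, where x and xmin are NA and xmax is -1.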

This is all how it is supposed to work, but what I don't understand (yet) is why the breaks for the bins get calculated outside the bounds of the scale range.

If you want to remedy this issue, you could use scale_x_continuous(limits = c(-1, 1), oob = scales::oob_keep) to keep the out-of-bounds empty bins. If you use coord_cartesian(xlim = c(-1, 1)) instead, the breaks are calculated to fit the data rather than the scale range.
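
As a rough sketch of the two workarounds, applied to the same simulated data as the reprex: the object names `p_keep`, `p_zoom` and `count_na_bins` are made up for illustration, and `bins = 30` is set explicitly only to silence the stat_bin() message.

```r
library(ggplot2)

set.seed(12345)
df <- data.frame(x = rnorm(5000) / 10)

# Option 1: keep the limits on the scale, but keep out-of-bounds
# values instead of censoring them to NA.
p_keep <- ggplot(df, aes(x = x)) +
  geom_histogram(bins = 30) +
  scale_x_continuous(limits = c(-1, 1), oob = scales::oob_keep)

# Option 2: leave the scale alone and zoom with the coordinate system.
p_zoom <- ggplot(df, aes(x = x)) +
  geom_histogram(bins = 30) +
  coord_cartesian(xlim = c(-1, 1))

# Helper (illustrative): count bins whose position columns contain NA.
count_na_bins <- function(p) {
  ld <- layer_data(p)
  sum(!complete.cases(ld[, c("x", "xmin", "xmax")]))
}

count_na_bins(p_keep)
count_na_bins(p_zoom)
```

Both counts can be compared with the same check on the original xlim(-1, 1) plot, where the two flanking bins carry NA values.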

@cpsyctc2
Author

Wow. Brilliant answer: many thanks. I suspect I should have been able to work this out myself; perhaps trying coord_cartesian() would have tipped me off. I hadn't found oob. Now that I am starting to understand what's happening, I share your puzzlement about how the bin limits are set. I can't see how that's a good choice. (I accept there often are good explanations for things in R that I haven't understood until I have thought about them a lot!) However, if there is a good reason that's escaping us, I suspect it would be an improvement to have geom_histogram() throw a warning about what's happening and why. (I guess that the warning that is coming out is not coming from within geom_histogram() but from somewhere "deeper" in ggplot().) Fascinating.
