-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Add fillna at the beginning of _where not to fill NA. #60729 #60772
base: main
Are you sure you want to change the base?
Conversation
@@ -9674,6 +9674,13 @@ def _where( | |||
if axis is not None: | |||
axis = self._get_axis_number(axis) | |||
|
|||
# We should not be filling NA. See GH#60729 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this trying to fill missing values when NaN is the missing value indicator? I don't think that is right either - the missing values should propogate for all types. We may just be missing coverage for the NaN case (which should be added to the test)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback, @WillAyd .
I thought we could make the values propagate by filling cond
with True
, since _where()
would finally keep the values in self
alive where its cond
is True
.
Even if I don't fill those values here, _where
would call fillna()
using inplace at the below code. That's also why the result varies depending on whether inpalce=True
or not.
Lines 9695 to 9698 in e3b2de8
# make sure we are boolean | |
fill_value = bool(inplace) | |
cond = cond.fillna(fill_value) | |
cond = cond.infer_objects() |
Could you explain in more detail what you mean by propagate for all type? Do you mean we need to keep NA as it is even after this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @WillAyd,
I've done some further investigations on this, but I still belive the current code is the simplest way to make the missing values propagate.
If we want to let NA propagate without calling fillna()
here, there might be too many code changes needed. See below codes :
- Need to change the below code so that we don't fill the missing values when caller is
where()
ormask()
. If we don't,fillna()
will fill them withinplace
.
Lines 9695 to 9698 in f1441b2
# make sure we are boolean | |
fill_value = bool(inplace) | |
cond = cond.fillna(fill_value) | |
cond = cond.infer_objects() |
- Need to change the below code as well since
to_numpy()
will fill the missing value usinginplace
when cond is a DataFrame.
Lines 9703 to 9716 in f1441b2
if not isinstance(cond, ABCDataFrame): | |
# This is a single-dimensional object. | |
if not is_bool_dtype(cond): | |
raise TypeError(msg.format(dtype=cond.dtype)) | |
else: | |
for _dt in cond.dtypes: | |
if not is_bool_dtype(_dt): | |
raise TypeError(msg.format(dtype=_dt)) | |
if cond._mgr.any_extension_types: | |
# GH51574: avoid object ndarray conversion later on | |
cond = cond._constructor( | |
cond.to_numpy(dtype=bool, na_value=fill_value), | |
**cond._construct_axes_dict(), | |
) |
- Since
extract_bool_array()
fills the missing values using argna_value=False
atEABackedBlock.where()
, we might need to find every single NA index from cond before we call this function(using isna() for example) and then implement additional behaviour to make those values propagate atExtensionArray._where()
.
pandas/pandas/core/internals/blocks.py
Lines 1664 to 1668 in f1441b2
def where(self, other, cond) -> list[Block]: | |
arr = self.values.T | |
cond = extract_bool_array(cond) | |
pandas/pandas/core/array_algos/putmask.py
Lines 116 to 127 in f1441b2
def extract_bool_array(mask: ArrayLike) -> npt.NDArray[np.bool_]: | |
""" | |
If we have a SparseArray or BooleanArray, convert it to ndarray[bool]. | |
""" | |
if isinstance(mask, ExtensionArray): | |
# We could have BooleanArray, Sparse[bool], ... | |
# Except for BooleanArray, this is equivalent to just | |
# np.asarray(mask, dtype=bool) | |
mask = mask.to_numpy(dtype=bool, na_value=False) | |
mask = np.asarray(mask, dtype=bool) | |
return mask |
If _where() is trying to fill the missing values for cond anyway, I think we don't necessarily have to disfavour the current code change. Could you give me some feedback?
FYI, it seems this has already been discussed at #53124 (comment) |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Added fillna at the beginning of _where so that we can fill pd.NA.
Since this is my first PR, please correct me if I'm mistaken. Thanks!