
(fix): extension array indexers #9671

Open
wants to merge 222 commits into base: main
Conversation

ilan-gold
Contributor

Identical to kmuehlbauer#1. Probably not very helpful in terms of changes, since https://github.com/kmuehlbauer/xarray/tree/any-time-resolution-2 contains most of it.

kmuehlbauer and others added 30 commits October 18, 2024 07:31
…ore/variable.py to use any-precision datetime/timedelta with automatic inferring of resolution
…t resolution, fix code and tests to allow this
… more carefully, for now using pd.Series to convert `OMm` type datetimes/timedeltas (will result in ns precision)
…rray` series creating an extension array when `.array` is accessed
@ilan-gold
Contributor Author

@dcherian @benbovy @Illviljan anything left here?

Member

@benbovy benbovy left a comment


Just left a few comments and suggestions. I didn't test it, though.

) -> np.ndarray:
if dtype is None:
dtype = self.dtype
if pd.api.types.is_extension_array_dtype(dtype):

Maybe this would be cleaner?

        if dtype is None and is_valid_numpy_dtype(self.dtype):
            dtype = self.dtype

This will just let numpy set the appropriate dtype when coercing the pandas.Index.

Just a quick check that default output dtypes make sense for pd.CategoricalIndex and pd.PeriodIndex:

>>> import numpy as np
>>> import pandas as pd
>>> np.__version__
'2.0.2'
>>> pd.__version__
'2.2.3'
>>> cidx = pd.CategoricalIndex(["a"])
>>> np.asarray(cidx.values, dtype=None).dtype
dtype('O')
>>> cidx2 = pd.CategoricalIndex([1])
>>> np.asarray(cidx2.values, dtype=None).dtype
dtype('int64')
>>> pidx = pd.PeriodIndex([2022], freq="Y")
>>> np.asarray(pidx.values, dtype=None).dtype
dtype('O')
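The suggestion above can be sketched end to end. `is_valid_numpy_dtype` here is a minimal stand-in for xarray's internal helper (an assumption about its behavior, not its actual implementation), and `coerce_index` is a hypothetical name for the method being discussed:

```python
import numpy as np
import pandas as pd


def is_valid_numpy_dtype(dtype) -> bool:
    # Minimal stand-in (assumption): a dtype is "valid" if numpy accepts it.
    try:
        np.dtype(dtype)
    except (TypeError, ValueError):
        return False
    return True


def coerce_index(index: pd.Index, dtype=None) -> np.ndarray:
    # Only forward the index's dtype when it is a real numpy dtype;
    # for extension dtypes (categorical, period, ...) leave dtype=None
    # and let numpy pick the appropriate one during coercion.
    if dtype is None and is_valid_numpy_dtype(index.dtype):
        dtype = index.dtype
    return np.asarray(index.values, dtype=dtype)


cidx = pd.CategoricalIndex(["a"])
print(coerce_index(cidx).dtype)  # object
print(coerce_index(pd.Index([1, 2, 3])).dtype)  # int64
```

This mirrors the REPL check above: extension dtypes fall through to numpy's inference, while plain numpy dtypes pass through unchanged.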

@@ -1118,7 +1118,8 @@ def test_groupby_math_nD_group() -> None:
expected = da.isel(x=slice(30)) - expanded_mean
expected["labels"] = expected.labels.broadcast_like(expected.labels2d)
expected["num"] = expected.num.broadcast_like(expected.num2d)
expected["num2d_bins"] = (("x", "y"), mean.num2d_bins.data[idxr])
# mean.num2d_bins.data is a pandas IntervalArray so needs to be put in `numpy` to allow indexing
expected["num2d_bins"] = (("x", "y"), mean.num2d_bins.data.to_numpy()[idxr])

This is technically backwards-incompatible, but an improvement IMO. Just noting in case someone looks this up in the future.

Before:

ipdb> mean.num2d_bins
<xarray.DataArray 'num2d_bins' (num2d_bins: 2)> Size: 16B
array([Interval(0, 4, closed='right'), Interval(4, 6, closed='right')],
      dtype=object)
Coordinates:
  * num2d_bins  (num2d_bins) object 16B (0, 4] (4, 6]

After:

ipdb> mean.num2d_bins
<xarray.DataArray 'num2d_bins' (num2d_bins: 2)> Size: 16B
array([Interval(0, 4, closed='right'), Interval(4, 6, closed='right')],
      dtype=object)
Coordinates:
  * num2d_bins  (num2d_bins) interval[int64, right] 16B (0, 4] (4, 6]
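The reason for the `.to_numpy()` in the test diff can be reproduced outside xarray: pandas extension arrays are one-dimensional, so a 2-D integer indexer generally isn't accepted directly, while the numpy (object-dtype) conversion supports it. A minimal sketch with hypothetical data shaped like the test's:

```python
import numpy as np
import pandas as pd

# An IntervalArray like the one backing mean.num2d_bins.data in the test.
arr = pd.arrays.IntervalArray.from_breaks([0, 4, 6])  # (0, 4], (4, 6]

# A 2-D integer indexer, standing in for the idxr built by the test.
idxr = np.array([[0, 1], [1, 0]])

# Extension arrays are 1-D, so index through numpy instead:
result = arr.to_numpy()[idxr]
print(result.shape)  # (2, 2)
```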

@@ -834,6 +834,7 @@ def chunk(
if chunkmanager.is_chunked_array(data_old):
data_chunked = chunkmanager.rechunk(data_old, chunks) # type: ignore[arg-type]
else:
ndata: duckarray[Any, Any]

I removed the pandas-specific code. I'm not sure we should do that; we might as well just ask the user to cast.
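"Asking the user to cast" would look something like the following sketch (not the PR's API, just plain pandas): convert the extension array to a numpy-backed one before handing it to a chunk manager.

```python
import numpy as np
import pandas as pd

# A nullable-integer extension array, which chunk managers such as dask
# cannot rechunk directly.
ea = pd.array([1, 2, None], dtype="Int64")

# Explicit user-side cast to a plain numpy array before chunking
# (here to float64, so the missing value becomes NaN):
ndata = ea.to_numpy(dtype="float64", na_value=np.nan)
print(ndata)  # [ 1.  2. nan]
```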

* main:
  Vendor pandas to xarray conversion tests (pydata#10187)
  Fix: Correct axis labelling with units for FacetGrid plots (pydata#10185)
  Use explicit repo name in upstream wheels (pydata#10181)
  DOC: Update docstring to reflect renamed section (pydata#10180)
@@ -104,17 +104,11 @@ def index_flat(request):
index fixture, but excluding MultiIndex cases.
"""
key = request.param
if key in ["bool-object", "bool-dtype", "nullable_bool", "repeats"]:

There seems to be some weird broadcasting behaviour here.

@dcherian
Contributor

Sorry, this is a total mess. Apparently IndexVariable and Variable now behave differently, and I'm not sure why.

@@ -945,7 +944,7 @@ def load(self, **kwargs):
--------
dask.array.compute
"""
self._data = to_duck_array(self._data, **kwargs)
self._data = _maybe_wrap_data(to_duck_array(self._data, **kwargs))

Maybe we should just return the PandasExtensionArray wrapper class, but I'm wary of exposing that to users.
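For context, a wrapper of that kind could look roughly like this. This is a hypothetical, heavily simplified sketch, not xarray's actual `PandasExtensionArray` class: it holds the extension array and forwards the duck-array attributes that `Variable` relies on, re-wrapping array-valued indexing results.

```python
import numpy as np
import pandas as pd


class ExtensionArrayWrapper:
    # Hypothetical, simplified sketch; not xarray's actual wrapper.
    def __init__(self, array: pd.api.extensions.ExtensionArray):
        self.array = array

    @property
    def dtype(self):
        return self.array.dtype

    @property
    def shape(self):
        # Extension arrays are always one-dimensional.
        return (len(self.array),)

    def __getitem__(self, key):
        result = self.array[key]
        if isinstance(result, type(self.array)):
            return ExtensionArrayWrapper(result)
        return result  # a scalar

    def __array__(self, dtype=None, copy=None):
        # Fallback numpy coercion for consumers that need a real ndarray.
        return np.asarray(self.array, dtype=dtype)


wrapped = ExtensionArrayWrapper(pd.array([1, 2, None], dtype="Int64"))
print(wrapped.shape)  # (3,)
```

Returning such a wrapper from `load()` would keep the extension dtype intact, at the cost of users seeing a non-pandas, non-numpy object, which is presumably the concern above.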

8 participants