Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Series[Timestamp].diff() should return Series[Timedelta] #821

Closed
bersbersbers opened this issue Nov 26, 2023 · 13 comments · Fixed by #907
Closed

Series[Timestamp].diff() should return Series[Timedelta] #821

bersbersbers opened this issue Nov 26, 2023 · 13 comments · Fixed by #907

Comments

@bersbersbers
Copy link

bersbersbers commented Nov 26, 2023

Describe the bug
Series[Timestamp].diff() returns Series[Timedelta] but stubs indicate it's Series[Timestamp] instead.

To Reproduce

# python bug.py & mypy bug.py & pyright bug.py
from __future__ import annotations

from typing import TYPE_CHECKING, reveal_type

import pandas as pd

if TYPE_CHECKING:
    from pandas.core.series import TimedeltaSeries, TimestampSeries

def fun1(times: pd.Series[pd.Timestamp]) -> pd.Series[pd.Timedelta]:
    reveal_type(times)
    reveal_type(times.diff())
    print(type(times[1]))
    print(type(times.diff()[1]))
    return times.diff()

def fun2(times: TimestampSeries) -> TimedeltaSeries:
    reveal_type(times)
    reveal_type(times.diff())
    print(type(times[1]))
    print(type(times.diff()[1]))
    return times.diff()

times = pd.Series([pd.Timestamp(0), pd.Timestamp(0)])
fun1(times)
fun2(times)

Output:

Runtime type is 'Series'
Runtime type is 'Series'
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
Runtime type is 'Series'
Runtime type is 'Series'
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
bug.py:13: note: Revealed type is "pandas.core.series.Series[pandas._libs.tslibs.timestamps.Timestamp]"
bug.py:14: note: Revealed type is "pandas.core.series.Series[pandas._libs.tslibs.timestamps.Timestamp]"
bug.py:17: error: Incompatible return value type (got "Series[Timestamp]", expected "Series[Timedelta]")  [return-value]
bug.py:21: note: Revealed type is "pandas.core.series.TimestampSeries"
bug.py:22: note: Revealed type is "pandas.core.series.Series[pandas._libs.tslibs.timestamps.Timestamp]"
bug.py:25: error: Incompatible return value type (got "Series[Timestamp]", expected "TimedeltaSeries")  [return-value]
Found 2 errors in 1 file (checked 1 source file)
C:\Build\project\bug.py
  C:\Build\project\bug.py:13:17 - information: Type of "times" is "Series[Timestamp]"
  C:\Build\project\bug.py:14:17 - information: Type of "times.diff()" is "Series[Timestamp]"
  C:\Build\project\bug.py:17:12 - error: Expression of type "Series[Timestamp]" cannot be assigned to return type "Series[Timedelta]"
    "Series[Timestamp]" is incompatible with "Series[Timedelta]"
      Type parameter "S1@Series" is invariant, but "Timestamp" is not the same as "Timedelta" (reportGeneralTypeIssues)
  C:\Build\project\bug.py:21:17 - information: Type of "times" is "TimestampSeries"
  C:\Build\project\bug.py:22:17 - information: Type of "times.diff()" is "Series[Timestamp]"
  C:\Build\project\bug.py:25:12 - error: Expression of type "Series[Timestamp]" cannot be assigned to return type "TimedeltaSeries"
    "Series[Timestamp]" is incompatible with "TimedeltaSeries" (reportGeneralTypeIssues)
2 errors, 0 warnings, 4 informations

Please complete the following information:

  • Windows 10 22H2
  • Python 3.12.0
  • mypy==1.7.1, pyright==1.1.337, pandas-stubs==2.1.1.230928

Additional context
I tried both Series[Timestamp] and TimestampSeries due to #673 (comment)
Possibly related: #718

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Nov 26, 2023

Thanks for the report. I think we have to add diff() as a method to TimestampSeries .

@mutricyl
Copy link
Contributor

mutricyl commented Apr 3, 2024

looking at series.diff documentation it looks like diff of Series[int] returns Series[float] and I made a test to confirm this. Current stubs returns Series[int]. I am not sure this is a correct behavior of pandas?

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Apr 3, 2024

looking at series.diff documentation it looks like diff of Series[int] returns Series[float] and I made a test to confirm this. Current stubs returns Series[int]. I am not sure this is a correct behavior of pandas?

Yes, it seems that's an error in the stub for Series.diff(). So you'd have to make 2 changes to address this. One in Series.diff(), so it returns Series[float]. The other in TimestampSeries so that it returns TimedeltaSeries

@mutricyl
Copy link
Contributor

mutricyl commented Apr 3, 2024

We have to manage objects also from bool and pd.Period. pd.Interval will require a Never

float ➡️ float
int ➡️ float
uint ➡️ float
timestamp ➡️ timedelta
period ➡️ object
timedelta ➡️ timedelta64
interval ➡️ TypeError: IntervalArray has no 'diff' method. Convert to a suitable dtype prior to calling 'diff'.
bool ➡️ object
object ➡️ object

Any other object people can think of before starting to code ?

mutricyl pushed a commit to mutricyl/pandas-stubs that referenced this issue Apr 3, 2024
@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Apr 3, 2024

Any other object people can think of before starting to code ?

You'd have to look at the definition of S1 to make sure you've handled all the cases.

You might want to look at how we handle Series.__sub__() as a guide. Whereever you see def __sub__( in series.pyi, I think you'd have a similar overload for Series.diff()

Thanks for looking into this. It's non-trivial.

@mutricyl
Copy link
Contributor

mutricyl commented Apr 6, 2024

For info, if it is ok, I'll need to create new classes in series.pyi for Series[bool], Series[bytes], Series[complex] and Series[dtype]

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Apr 9, 2024

For info, if it is ok, I'll need to create new classes in series.pyi for Series[bool], Series[bytes], Series[complex] and Series[dtype]

You shouldn't have to do this. Just follow the pattern used for Series.__sub__() and other __sub__() declarations in series.pyi

@mutricyl
Copy link
Contributor

Almost finished fixing this issue in https://github.com/mutricyl/pandas-stubs/tree/821-series_diff-timestamp2

however I have got an issue with check function when running pytests:

>       check(
            assert_type(
                pd.Series(
                    [datetime.datetime.now().date(), datetime.datetime.now().date()]
                ).diff(),
                "TimedeltaSeries",
            ),
            pd.Series,
            pd.Timedelta,
        )

This test fails in pytest with:

RuntimeError: Expected type '<class 'pandas._libs.tslibs.timedeltas.Timedelta'>' but got '<class 'pandas._libs.tslibs.nattype.NaTTyp
>>> pd.Series( [datetime.datetime.now().date(), datetime.datetime.now().date()]  ).diff()
0      NaT
1   0 days
dtype: timedelta64[ns]

Looking at check function source code for series it uses the first item to determine the type:

    if isinstance(actual, pd.Series):
        value = actual.iloc[0]

first item of .diff() series is always some kind of None/NaN/NaT... altough type(NaN) is float which make it work for common case, type(NaT) is not timedelta or similar.

Any advise on how to solve this particular case @Dr-Irv ?

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Apr 17, 2024

Very interesting use case where how we designed check() doesn't work for Series.diff().

Do the following: Introduce a parameter to check() called index_to_check_for_type: Literal[0,-1] = 0 and use actual.iloc[index_to_check_for_type] instead of actual.iloc[0] in the line you quoted. Then for your test, you just add the parameter index_to_check_for_type = -1 in the check() call and the existing calls will all work, and yours should as well.

Dr-Irv pushed a commit that referenced this issue Apr 26, 2024
* Adding tests for #821

* finish to add tests from  pd._types.S1 definition

* start fixing stubs

* date le lt ge gt fixing

* new diff overloads

* adding a few Never

* fix: check function updated for series.diff

* clean comments and update period test

* cover windows specific RuntimeWarning

* PeriodSeries.diff is OffsetSeries

* remove comments

* try tables 3.9.2 to fix macos github test

---------

Co-authored-by: Laurent Mutricy <[email protected]>
@bersbersbers
Copy link
Author

#907 closed this, but I am still seeing issues with the code example above. I minimized the code into this example:

# python bug.py & mypy bug.py & pyright bug.py
from __future__ import annotations

from typing import reveal_type

import pandas as pd

def fun(stamps: pd.Series[pd.Timestamp]) -> pd.Series[pd.Timedelta]:
    reveal_type(stamps)
    print(type(stamps[1]))

    deltas = stamps.diff()

    reveal_type(deltas)
    print(type(deltas[1]))

    return deltas

times = pd.Series([pd.Timestamp(0), pd.Timestamp(0)])
fun(times)

python says:

Runtime type is 'Series'
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
Runtime type is 'Series'
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>

So runtime behavior is as expected.

However, mypy (as pyright) thinks deltas is a series of float still:

bug.py:9: note: Revealed type is "pandas.core.series.Series[pandas._libs.tslibs.timestamps.Timestamp]"
bug.py:14: note: Revealed type is "pandas.core.series.Series[builtins.float]"
bug.py:17: error: Incompatible return value type (got "Series[float]", expected "Series[Timedelta]")  [return-value]
Found 1 error in 1 file (checked 1 source file)

This is with version 2.2.2.240514.

@bersbersbers
Copy link
Author

bersbersbers commented May 15, 2024

As a workaround, I could replace pd.Series[pd.Timestamp] by TimestampSeries in my code, but then the type checkers don't allow this assignment: also, #718 seems to imply pd.Series is to prefer.

Before (working):

times: pd.Series[pd.Timestamp] = df[TIME_COL]

After (not working):

times: TimestampSeries = df[TIME_COL]

error: Incompatible types in assignment (expression has type "Series[Any]", variable has type "TimestampSeries")

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented May 15, 2024

As a workaround, I could replace pd.Series[pd.Timestamp] by TimestampSeries in my code, but then the type checkers don't allow this assignment:

That's the right workaround for your example

also, #718 seems to imply pd.Series is to prefer.

We had to introduce TimestampSeries and TimedeltaSeries to get around some issues with how mypy processes operators on generic classes. That's why #718 is there - to see if that is still necessary, since mypy has evolved since then. Having said that, making consistent behavior between pyright and mypy on this issue is a challenge. See microsoft/pyright#5216 (comment)

Before (working):

times: pd.Series[pd.Timestamp] = df[TIME_COL]

The reason that works is that we can't track the types of columns within a DataFrame. So the type of df[TIME_COL] is a Series without a tracked dtype, and you're then assigning it to a Series with a known dtype. If you wrote times: pd.Series[int] = df[TIME_COL], that would be accepted as well.

After (not working):

times: TimestampSeries = df[TIME_COL]

This is again related to how the type checker is handling the generic aspect of Series in typing, and that df[TIME_COL] doesn't have a dtype associated.

In general, whenever you refer to a column of a DataFrame, via df[TIME_COL], you can't assume the type and have to do a cast. So what would work is:

times = cast("TimestampSeries", df[TIME_COL])

@bersbersbers
Copy link
Author

That's the right workaround for your example
...

times = cast("TimestampSeries", df[TIME_COL])

So one cast needs to be necessary at least. Accepting that, it makes sense doing it as early as possible to catch errors in the processing. Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants