Default int type should be mapped to Int32 in Windows #726

probberechts · 2022-01-09T12:54:12Z

Describe the bug

Pandas handles the default int type differently on Windows and Linux. On Linuxint is interpreted as int64 but on Windows as int32. Since Pandera always maps int to int64, you get unexpected SchemaErrors on Windows. You can read more about it in these issues:

Code Sample

import pandas as pd
import pandera as pa

from pandera.typing import DataFrame, Series

class Schema(pa.SchemaModel):
    price: Series[int]

df = pd.DataFrame({ 'price': [8, 12, 10, 16] }).astype(int)

InputSchema.validate(df)

This is fine on Linux, but gives a SchemaError: expected series 'price' to have type int64, got int32 on Windows.

The text was updated successfully, but these errors were encountered:

cosmicBboy · 2022-01-09T16:03:26Z

Thanks for pointing this out @probberechts!

The relevant part of the code is here:
https://github.com/pandera-dev/pandera/blob/master/pandera/engines/pandas_engine.py#L211-L215

I think something like this will defer to the pandas default dtype for a particular python built-in type:

default_pd_dtype = pd.Series([1], dtype=builtin_name).dtype

assert np.dtype(int) == default_pd_dtype  # True
# Windows
assert np.dtype("int32") == default_pd_dtype
# non-Windows
assert np.dtype("int64") == default_pd_dtype

@jeffzi FYI, I forget why we decided to map int -> int64 regardless of OS, but I think we should defer to the behavior of the underlying platform for dtype defaults.

@probberechts would you be open to making a PR for this? Basically need to change that line of code and add an OS-specific unit test here: https://github.com/pandera-dev/pandera/blob/master/tests/core/test_dtypes.py

jeffzi · 2022-01-10T14:55:55Z

I forget why we decided to map int -> int64 regardless of OS

I can't remember either. It could be a workaround from when we were fixing the windows CI.

I agree the default int should follow pandas and pandas_engine should not hardcode the size.

That said, we do test that default pandera int matches pandas: https://github.com/pandera-dev/pandera/blob/9448d0a80b8dd02910f9cc553ce00349584b107f/tests/core/test_dtypes.py#L406-L411

The problem is that the "implicit" default for a pandas Series is int64 but astype(int) will apply numpy's default, i.e. int32 on windows.

import platform
import sys
import numpy as np
import pandas as pd

print(platform.system())
#> Windows
print(sys.version)
#> 3.8.12 (default, Oct 12 2021, 03:01:40) [MSC v.1916 64 bit (AMD64)]
print(pd.__version__)
#> 1.3.4
print(np.__version__)
#> 1.21.2

print(np.dtype(int))
#> int32
print(pd.Series([1]).dtype) # implicit dtype
#> int64
print(pd.Series([1], dtype=int).dtype) # explicit dtype
#> int32

If we have int32 as default on windows, then the validation will fail when the user does not explicitly cast the series to int (e.g.: pd.Dataframe(..., dtype=int) or astype(int)). I think that's fine but we should explain this inconsistency in the documentation.

Midnighter · 2023-04-01T15:41:48Z

This is quite annoying. I tried to fix it by explicitly defining

class Schema(pa.SchemaModel):
    price: Series[numpy.int64]

    class Config:
       coerce = True

but that didn't help either 🙁

joaoe · 2023-04-17T13:50:41Z

Howdy.
In my project we have people working on Linux and a few on Windows. This causes problems for us.
So far I worked around by monkeypatching Int

from pandera.dtypes import Int
Int.check = lambda self, pandera_dtype, data_container=None: isinstance(pandera_dtype, Int)

Having such work around in the project is ugly, but the fix itself is not too strange.

The default generic Int type shold allow any of int8, int16, int32 or int64. It is just Int. There are other IntN classes for the specific bid widths. But if I have a Schema with just column: int I'm not too interested in pandera checking for bid width.
Perhaps, the rules in the Int class should be relaxed ?

EDIT: for reference, this is the numpy bug numpy/numpy#9464

DrShushen · 2023-04-20T12:07:20Z

@joaoe
I agree, I've had to write custom workarounds in some of my projects to essentially accomplish this on Windows:
https://github.com/vanderschaarlab/temporai/blob/a6bb11d63341c4947f940e3d075e8a77be06b898/src/tempor/data/pandera_utils.py#L134-L140

cosmicBboy · 2023-05-07T23:09:30Z

So the problem here is stated by @jeffzi #726 (comment):

On Windows:

print(np.dtype(int))
#> int32
print(pd.Series([1]).dtype) # implicit dtype
#> int64
print(pd.Series([1], dtype=int).dtype) # explicit dtype
#> int32

It seems reasonable that pd.Series([1], dtype=int) should match pd.SeriesSchema(dtype=int). But it's not ideal that the implicit integer type of pd.Series([1]) is int64, so pd.SeriesSchema(dtype=int).validate(pd.Series([1])) will fail, which I'm sure will still be surprise many users. I think this will just have to be documented clearly somewhere.

There are 130 failing tests as a result of this PR fix: https://github.com/unionai-oss/pandera/actions/runs/4909424667/jobs/8765820508?pr=1179

If anyone on this thread so far has the bandwidth to fix all of these breaking tests on windows, that would be much appreciated! The code changes to fix this issue are already on #1179, just need to update the tests to explicitly use dtype=int in tests cases that rely on the assumption that pd.Series([1]) does the right thing.

probberechts added the bug Something isn't working label Jan 9, 2022

probberechts mentioned this issue Jan 9, 2022

SchemaError when converting Wyscout events to SPADL actions ML-KULeuven/socceraction#157

Closed

joaoe mentioned this issue Apr 17, 2023

Default int type is platform dependent numpy/numpy#9464

Closed

cosmicBboy added the help wanted Extra attention is needed label Apr 17, 2023

cosmicBboy linked a pull request May 7, 2023 that will close this issue

use int32 as default int representation in windows #1179

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Default int type should be mapped to Int32 in Windows #726

Default int type should be mapped to Int32 in Windows #726

probberechts commented Jan 9, 2022

cosmicBboy commented Jan 9, 2022

jeffzi commented Jan 10, 2022 •

edited

Loading

Midnighter commented Apr 1, 2023

joaoe commented Apr 17, 2023 •

edited

Loading

DrShushen commented Apr 20, 2023

cosmicBboy commented May 7, 2023 •

edited

Loading

Default int type should be mapped to Int32 in Windows #726

Default int type should be mapped to Int32 in Windows #726

Comments

probberechts commented Jan 9, 2022

Code Sample

cosmicBboy commented Jan 9, 2022

jeffzi commented Jan 10, 2022 • edited Loading

Midnighter commented Apr 1, 2023

joaoe commented Apr 17, 2023 • edited Loading

DrShushen commented Apr 20, 2023

cosmicBboy commented May 7, 2023 • edited Loading

jeffzi commented Jan 10, 2022 •

edited

Loading

joaoe commented Apr 17, 2023 •

edited

Loading

cosmicBboy commented May 7, 2023 •

edited

Loading