Default int type should be mapped to Int32 in Windows #726
Comments
Thanks for pointing this out @probberechts! The relevant part of the code is here: I think something like this will defer to the pandas default dtype for a particular Python built-in type:

default_pd_dtype = pd.Series([1], dtype=builtin_name).dtype
assert np.dtype(int) == default_pd_dtype # True
# Windows
assert np.dtype("int32") == default_pd_dtype
# non-Windows
assert np.dtype("int64") == default_pd_dtype

@jeffzi FYI, I forget why we decided to map

@probberechts would you be open to making a PR for this? Basically, we need to change that line of code and add an OS-specific unit test here: https://github.com/pandera-dev/pandera/blob/master/tests/core/test_dtypes.py |
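The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the Pandera implementation: `default_pandas_dtype` is a hypothetical helper name, and the exact integer width depends on the platform and the NumPy version, so the sketch only asserts what should hold everywhere.

```python
import platform

import numpy as np
import pandas as pd


def default_pandas_dtype(builtin_type):
    """Return the dtype pandas picks when given a Python built-in type.

    Hypothetical helper for illustration; not part of Pandera.
    """
    return pd.Series([1], dtype=builtin_type).dtype


default_int = default_pandas_dtype(int)

# pandas defers to NumPy for dtype=int, so the two agree on a given platform.
assert default_int == np.dtype(int)

# The concrete width is platform dependent (int32 was observed on Windows in
# this thread), so a portable check can only accept either width.
assert default_int in (np.dtype("int32"), np.dtype("int64"))

print(platform.system(), default_int)
```

An OS-specific unit test would replace the last assertion with one exact expectation per platform, guarded by `platform.system()`.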
I can't remember either. It could be a workaround from when we were fixing the Windows CI. I agree the default int should follow pandas and That said, we do test that the default pandera int matches pandas: https://github.com/pandera-dev/pandera/blob/9448d0a80b8dd02910f9cc553ce00349584b107f/tests/core/test_dtypes.py#L406-L411 The problem is that the "implicit" default for a pandas Series is int64 but

import platform
import sys
import numpy as np
import pandas as pd
print(platform.system())
#> Windows
print(sys.version)
#> 3.8.12 (default, Oct 12 2021, 03:01:40) [MSC v.1916 64 bit (AMD64)]
print(pd.__version__)
#> 1.3.4
print(np.__version__)
#> 1.21.2
print(np.dtype(int))
#> int32
print(pd.Series([1]).dtype) # implicit dtype
#> int64
print(pd.Series([1], dtype=int).dtype) # explicit dtype
#> int32

If we have int32 as the default on Windows, then the validation will fail when the user does not explicitly cast the series to |
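As background that is not stated in the thread: NumPy releases before 2.0 map Python's int to the C long type, which is 32 bits wide on 64-bit Windows and 64 bits wide on most 64-bit Linux and macOS builds. The standard-library sketch below prints the C long width of the running interpreter, which is the width `np.dtype(int)` follows on those NumPy versions.

```python
import ctypes
import platform
import struct

# Width of the C long type for this interpreter build.
c_long_bits = ctypes.sizeof(ctypes.c_long) * 8

# Pointer width, to show that a 64-bit build can still have a 32-bit long.
pointer_bits = struct.calcsize("P") * 8

print(f"{platform.system()}: C long is {c_long_bits} bits, pointers are {pointer_bits} bits")
```

The printed values depend on the machine, so no expected output is given here.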
This is quite annoying. I tried to fix it by explicitly defining

class Schema(pa.SchemaModel):
    price: Series[numpy.int64]

    class Config:
        coerce = True

but that didn't help either 🙁 |
Howdy.
Having such a workaround in the project is ugly, but the fix itself is not too strange. The default generic Int type should allow any of int8, int16, int32 or int64. It is just Int. There are other IntN classes for the specific bit widths. But if I have a Schema with just EDIT: for reference, this is the numpy bug numpy/numpy#9464 |
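The idea of a generic Int that accepts any width can be sketched with plain pandas and NumPy. This is only an illustration of the proposal, not how Pandera implements its dtypes; `is_any_signed_int` and `SIGNED_INT_DTYPES` are hypothetical names.

```python
import numpy as np
import pandas as pd

# The fixed-width signed integer dtypes named in the comment above.
SIGNED_INT_DTYPES = tuple(
    np.dtype(name) for name in ("int8", "int16", "int32", "int64")
)


def is_any_signed_int(series: pd.Series) -> bool:
    """Hypothetical check: True if the series uses any signed NumPy int width."""
    return any(series.dtype == dtype for dtype in SIGNED_INT_DTYPES)


# Both widths pass, so the check behaves the same on Windows and Linux.
assert is_any_signed_int(pd.Series([1], dtype="int32"))
assert is_any_signed_int(pd.Series([1], dtype="int64"))

# Floats and unsigned integers are still rejected.
assert not is_any_signed_int(pd.Series([1.0]))
assert not is_any_signed_int(pd.Series([1], dtype="uint8"))
```

The trade-off is that a width-agnostic check no longer detects an unintended downcast, which is why the width-specific IntN classes remain useful.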
@joaoe |
So the problem here is stated by @jeffzi #726 (comment): On Windows:

print(np.dtype(int))
#> int32
print(pd.Series([1]).dtype) # implicit dtype
#> int64
print(pd.Series([1], dtype=int).dtype) # explicit dtype
#> int32

It seems reasonable that

There are 130 failing tests as a result of this PR fix: https://github.com/unionai-oss/pandera/actions/runs/4909424667/jobs/8765820508?pr=1179

If anyone on this thread so far has the bandwidth to fix all of these breaking tests on Windows, that would be much appreciated! The code changes to fix this issue are already on #1179; we just need to update the tests to explicitly use |
Describe the bug
Pandas handles the default int type differently on Windows and Linux. On Linux int is interpreted as int64 but on Windows as int32. Since Pandera always maps int to int64, you get unexpected SchemaErrors on Windows. You can read more about it in these issues: int dtype on Linux/Windows pandas-dev/pandas#44925

Code Sample
This is fine on Linux, but gives a SchemaError: expected series 'price' to have type int64, got int32 on Windows.
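A pandas-only sketch of the mismatch, without Pandera, might look like the following. It assumes a series built with `dtype=int` and compares it against a fixed int64 expectation, which is the comparison the error message describes; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# dtype=int asks for the platform default integer, which is what differs
# between Windows and Linux in this report.
price = pd.Series([10, 20], dtype=int, name="price")

expected = np.dtype("int64")
if price.dtype != expected:
    # Mirrors the wording of the reported error; reached only where the
    # platform default integer is not int64.
    print(f"expected series 'price' to have type {expected}, got {price.dtype}")

# Pinning the width explicitly yields the same dtype on every platform.
price64 = price.astype("int64")
assert price64.dtype == expected
```

Casting with an explicit width such as "int64" sidesteps the platform default entirely, at the cost of having to remember the cast wherever the data is created.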