fix(breadbox): Refactor tabular dataset schema error #183
base: master
Conversation
```python
), # annotation_type_to_pandera_column_type(v.col_type),
nullable=False if dimension_type_identifier == k else True,
annotation_type_to_pandas_column_type(v.col_type),
coerce=True,  # SchemaErrorReason: DATATYPE_COERCION
```
Instead of reading the CSV into a df with the expected dtypes, check within the schema so that values in a df column that are not of the assigned dtype raise a SchemaError.
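To illustrate the schema-side check, here is a minimal sketch using pandera's public API (the column names and values are hypothetical, just to show the failure mode):

```python
import io

import pandas as pd
import pandera as pa

# Read the CSV without pinning dtypes up front.
df = pd.read_csv(io.StringIO("label,score\nA,1.5\nB,oops"))

schema = pa.DataFrameSchema(
    {
        "label": pa.Column(str, coerce=True),
        "score": pa.Column(float, coerce=True),  # "oops" cannot be coerced
    }
)

try:
    schema.validate(df)
except pa.errors.SchemaError as err:
    # On recent pandera versions, err.reason_code is
    # SchemaErrorReason.DATATYPE_COERCION for this failure.
    print(err.reason_code, "-", err)
```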
Thinking about this more, I do still have a bit of worry about this.
I think there's an edge case. I'd be most comfortable if you read all columns saying to expect "str" and then used the validator to coerce them into the right type.
I worry about unexpected conversions and the way that there might be some mangling along the way.
Here's a concrete example: imagine a CSV which looks like:
```
v
1
2.1
```
Now, for column "v", I'd expect the values to be `["1", "2.1"]`, but this process will yield `["1.0", "2.1"]`.
Here's the code I used to verify this behavior:
```python
>>> import io
>>> import pandas as pd
>>> import pandera as pa
>>> df = pd.read_csv(io.StringIO("v\n1\n2.1"))
>>> list(pa.DataFrameSchema({"v": pa.Column(str)}, coerce=True).validate(df)["v"])
['1.0', '2.1']
```
If you read in all columns as strings first, then there's no lossy transform and I think the problem is avoided:
```python
>>> df = pd.read_csv(io.StringIO("v\n1\n2.1"), dtype=str)
>>> list(pa.DataFrameSchema({"v": pa.Column(str)}, coerce=True).validate(df)["v"])
['1', '2.1']
```
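For what it's worth, the same two-step approach also works when the target type isn't `str`: read everything as strings, then let the schema coerce each column to its real type. A minimal sketch, assuming a continuous column "v":

```python
import io

import pandas as pd
import pandera as pa

# Step 1: read everything as strings, so pandas does no lossy inference.
df = pd.read_csv(io.StringIO("v\n1\n2.1"), dtype=str)

# Step 2: let the schema coerce each column to its target type.
schema = pa.DataFrameSchema({"v": pa.Column(float, coerce=True)})
print(list(schema.validate(df)["v"]))  # [1.0, 2.1]
```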
@pgm Is there really an issue with storing the values `1.0` vs `1`? I believe both the matrix and the tabular dataset validations coerce `continuous` value types to `Float64`. I personally liked standardizing all numbers to floats, but let me know if there is a specific reason we should preserve the original values.
Never mind, I understand the issue now.
LGTM
The error message returned to the client is an obscure pandera error. Change the error message to be more intuitive for users.
This PR addresses the original task's use case. While completing this task, I noticed that many pandera error codes are obscure and that stringifying the SchemaError often gives a better error message. However, in some cases doing so messes up the formatting of the message in the frontend, so that is a TODO I will address later. In addition, it is difficult to catch every single schema error case, but I have noted down the types of errors I have accounted for and expect the schema validation to catch so far.
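For readers of this thread, here is a rough sketch of the error-translation idea (the helper name and message wording are hypothetical, not breadbox's actual code):

```python
import pandas as pd
import pandera as pa


def validate_tabular_df(df: pd.DataFrame, schema: pa.DataFrameSchema) -> pd.DataFrame:
    """Validate df, re-raising pandera failures as user-readable errors."""
    try:
        return schema.validate(df)
    except pa.errors.SchemaError as err:
        # str(err) (e.g. "expected series 'v' to have type float64, got
        # object") is usually clearer than the raw reason code, though the
        # frontend formatting caveat above still applies.
        raise ValueError(f"Invalid tabular dataset: {err}") from err
```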