fix(breadbox): Refactor tabular dataset schema error #183

Open

wants to merge 3 commits into master
Conversation

jessica-cheng (Contributor) commented Jan 29, 2025

The error message returned to the client is an obscure pandera error. Change the error message to be more intuitive to users.

This PR addresses the original task's use case. While completing it, I noticed that many pandera error codes are obscure and that stringifying the SchemaError often gives a better error message. However, in some cases doing so breaks the formatting of the message in the frontend, so that is a TODO I will address later. In addition, it is difficult to catch every single schema error case, but I have noted down the types of errors I have accounted for and expect the schema validation to catch so far.

[Screenshot attached: 2025-01-29, 4:13 PM]
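For reference, here is a minimal sketch of the approach (the helper names and the ValueError wrapping are illustrative, not the actual breadbox code): catch pandera's SchemaError during validation and surface its stringified form, which usually reads better than the bare error code.

import pandera as pa

def format_schema_error(error: pa.errors.SchemaError) -> str:
    # str(SchemaError) is typically clearer than the raw reason code alone
    return f"Invalid tabular dataset: {error}"

def validate_tabular_df(df, schema: pa.DataFrameSchema):
    try:
        return schema.validate(df)
    except pa.errors.SchemaError as e:
        # re-raise with a message the frontend can display directly
        raise ValueError(format_schema_error(e)) from e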

jessica-cheng requested a review from pgm January 29, 2025 21:14
annotation_type_to_pandas_column_type(v.col_type),
nullable=False if dimension_type_identifier == k else True,
coerce=True,  # SchemaErrorReason: DATATYPE_COERCION
),  # annotation_type_to_pandera_column_type(v.col_type),
jessica-cheng (Contributor Author)

Instead of reading the csv into a df with the expected dtypes, check within the schema so that values in a df column that are not of the assigned dtype raise a SchemaError.
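For example, here is a small sketch of the idea (hypothetical column name, not the PR's exact code): with the dtype declared on the pandera Column and coerce=True, a value that cannot be converted raises a SchemaError instead of slipping through:

>>> import io
>>> import pandas as pd
>>> import pandera as pa
>>> df = pd.read_csv(io.StringIO("score\n1.5\nnot_a_number"), dtype=str)
>>> try:
...     pa.DataFrameSchema({"score": pa.Column(float, coerce=True)}).validate(df)
... except pa.errors.SchemaError as e:
...     print(type(e).__name__)  # coercion of "not_a_number" to float64 fails
...
SchemaError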

pgm (Contributor)

Thinking about this more, I do still have a bit of worry about this.

I think there's an edge case. I'd be most comfortable if you read all columns saying to expect "str" and then used the validator to coerce them into the right type.

I worry about unexpected conversions and the way that there might be some mangling along the way.

Here's a concrete example: imagine a csv which looks like:

v
1
2.1

Now, for column "v", I'd expect the values to be ["1", "2.1"] but this process will yield ["1.0", "2.1"].

Here's the code I used to verify this behavior:

>>> import io
>>> import pandas as pd
>>> import pandera as pa
>>> df = pd.read_csv(io.StringIO("v\n1\n2.1"))
>>> list(pa.DataFrameSchema({"v": pa.Column(str)}, coerce=True).validate(df)["v"])
['1.0', '2.1']

If you read in all columns as strings first, then there's no lossy transform and I think the problem is avoided:

>>> df = pd.read_csv(io.StringIO("v\n1\n2.1"), dtype=str)
>>> list(pa.DataFrameSchema({"v": pa.Column(str)}, coerce=True).validate(df)["v"])
['1', '2.1']
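Putting the two steps together (a sketch on the same toy data, not the PR's final code): read every column as str so pandas does no dtype inference of its own, then let the schema coerce to the real target type; string columns stay untouched and numeric columns still come out numeric:

>>> df = pd.read_csv(io.StringIO("v\n1\n2.1"), dtype=str)
>>> pa.DataFrameSchema({"v": pa.Column(float)}, coerce=True).validate(df)["v"].tolist()
[1.0, 2.1]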

jessica-cheng (Contributor Author)

@pgm Is there really an issue with storing the value 1.0 vs 1? I believe both the matrix and the tabular dataset validations coerce continuous value types to Float64. I personally liked standardizing all numbers to floats, but let me know if there is a specific reason we should preserve the original values.

jessica-cheng (Contributor Author) commented Feb 6, 2025

Never mind, I understand the issue now.

pgm (Contributor) left a comment

LGTM
