fix(breadbox): Refactor tabular dataset schema error #183
base: master
Conversation
```python
), # annotation_type_to_pandera_column_type(v.col_type),
nullable=False if dimension_type_identifier == k else True,
annotation_type_to_pandas_column_type(v.col_type),
coerce=True,  # SchemaErrorReason: DATATYPE_COERCION
```
Instead of reading the CSV into a df with the expected dtypes, check within the schema so that values in a df column that are not of the assigned dtype raise a SchemaError.
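To illustrate the schema-side check, here is a minimal sketch using pandera's public API (the column names and values are hypothetical, just to show the failure mode):

```python
import io

import pandas as pd
import pandera as pa

# Read the CSV without pinning dtypes up front.
df = pd.read_csv(io.StringIO("label,score\nA,1.5\nB,oops"))

schema = pa.DataFrameSchema(
    {
        "label": pa.Column(str, coerce=True),
        "score": pa.Column(float, coerce=True),  # "oops" cannot be coerced
    }
)

try:
    schema.validate(df)
except pa.errors.SchemaError as err:
    # On recent pandera versions, err.reason_code is
    # SchemaErrorReason.DATATYPE_COERCION for this failure.
    print(err.reason_code, "-", err)
```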
Thinking about this more, I do still have a bit of worry about this.
I think there's an edge case. I'd be most comfortable if you read all columns saying to expect "str" and then used the validator to coerce them into the right type.
I worry about unexpected conversions and the way that there might be some mangling along the way.
Here's a concrete example: imagine a CSV which looks like:
```
v
1
2.1
```
Now, for column "v", I'd expect the values to be `["1", "2.1"]`, but this process will yield `["1.0", "2.1"]`.
Here's the code I used to verify this behavior:
```python
>>> import io
>>> import pandas as pd
>>> import pandera as pa
>>> df = pd.read_csv(io.StringIO("v\n1\n2.1"))
>>> list(pa.DataFrameSchema({"v": pa.Column(str)}, coerce=True).validate(df)["v"])
['1.0', '2.1']
```
If you read in all columns as strings first, then there's no lossy transform and I think the problem is avoided:
```python
>>> df = pd.read_csv(io.StringIO("v\n1\n2.1"), dtype=str)
>>> list(pa.DataFrameSchema({"v": pa.Column(str)}, coerce=True).validate(df)["v"])
['1', '2.1']
```
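For what it's worth, the same two-step approach also works when the target type isn't `str`: read everything as strings, then let the schema coerce each column to its real type. A minimal sketch, assuming a continuous column "v":

```python
import io

import pandas as pd
import pandera as pa

# Step 1: read everything as strings, so pandas does no lossy inference.
df = pd.read_csv(io.StringIO("v\n1\n2.1"), dtype=str)

# Step 2: let the schema coerce each column to its target type.
schema = pa.DataFrameSchema({"v": pa.Column(float, coerce=True)})
print(list(schema.validate(df)["v"]))  # [1.0, 2.1]
```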
@pgm Is there really an issue with storing the values `1.0` vs `1`? I believe both the matrix and the tabular dataset validations coerce `continuous` value types to `Float64`. I personally liked standardizing all numbers to floats, but let me know if there is a specific reason we should preserve the original values.
Never mind, I understand the issue now.
LGTM
The error message returned to the client is an obscure pandera error. Change the error message to be more intuitive for users.
This PR addresses the original task's use case. While completing this task, I noticed that many pandera error codes are obscure and that stringifying the SchemaError often gives a better error message. However, in some cases doing so messes up the formatting of the message in the frontend, so that is a TODO I will address later. In addition, it is difficult to catch every single schema error case, but I have noted down the types of errors I have accounted for and expect the schema validation to catch so far.
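For readers of this thread, here is a rough sketch of the error-translation idea (the helper name and message wording are hypothetical, not breadbox's actual code):

```python
import pandas as pd
import pandera as pa


def validate_tabular_df(df: pd.DataFrame, schema: pa.DataFrameSchema) -> pd.DataFrame:
    """Validate df, re-raising pandera failures as user-readable errors."""
    try:
        return schema.validate(df)
    except pa.errors.SchemaError as err:
        # str(err) (e.g. "expected series 'v' to have type float64, got
        # object") is usually clearer than the raw reason code, though the
        # frontend formatting caveat above still applies.
        raise ValueError(f"Invalid tabular dataset: {err}") from err
```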