Raise error if FERC1 column renames don't match expected data #3791

Merged
merged 12 commits into from
Sep 5, 2024

Conversation

e-belfer
Member

@e-belfer e-belfer commented Aug 15, 2024

Overview

Closes #3756

What problem does this address?
Adds two validations into the FERC transforms:

  1. If we're attempting to rename a column, that column should exist in the raw data.
  2. All expected factoids are added as ENUMS in FIELD_METADATA_BY_RESOURCE for their respective tables. (Note that enums also log a warning if there are values that aren't in the constraints, but as all these values are FK values they'll raise a primary key error when wide_to_tidy is run).
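The first validation can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code from this PR: the function name `check_rename_keys_exist` and the example column names are assumptions made up for the sketch.

```python
def check_rename_keys_exist(raw_columns, rename_map):
    """Raise if any column we are asked to rename is absent from the raw data.

    Hypothetical helper for illustration only; not the PUDL implementation.
    """
    missing = sorted(set(rename_map) - set(raw_columns))
    if missing:
        # Sorting makes near-duplicates and typos easier to spot in the message.
        raise ValueError(
            f"Columns to rename are missing from the raw data: {missing}"
        )


# Made-up example: "ending_balanse" is a typo, so it is not found in the raw data.
raw_columns = ["adjustments", "transfers", "ending_balance"]
rename_map = {"adjustments": "adjustments", "ending_balanse": "ending_balance"}

try:
    check_rename_keys_exist(raw_columns, rename_map)
except ValueError as err:
    message = str(err)
```

Raising instead of warning means a typo in the rename mapping stops the transform rather than silently passing through.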

What did you change?
Added one validation assertion. Debugged some renaming issues outlined in #3576. Constrained the factoid column for each FERC table.

Testing

How did you make sure this worked? How can a reviewer verify this?
Generate all core_ferc1 assets and anything downstream.

To-do list


@e-belfer e-belfer requested a review from zaneselvans August 15, 2024 00:06
@e-belfer e-belfer self-assigned this Aug 15, 2024
@e-belfer e-belfer force-pushed the ferc1-column-errors branch from 33a6d83 to 9433653 Compare August 15, 2024 00:07
@e-belfer
Member Author

e-belfer commented Aug 15, 2024

@zaneselvans I still have to do some validation testing before this is ready to go, but I'd be curious to get your design feedback on this before I do as it's a bit of a harder column renaming requirement than the 3 errors you'd originally listed. I do think it more closely mirrors what we do on the spreadsheet side, where we explicitly name each column that we're pulling in whether or not we rename it, and it'll prevent us from having unexpected column typos propagate as different factoids, for example.

@e-belfer e-belfer added ferc1 Anything having to do with FERC Form 1 data-validation Issues related to checking whether data meets our quality expectations. labels Aug 15, 2024
Comment on lines 263 to 266
# A dictionary of columns to be renamed.
not_renamed_columns: list[str] = []
# A list of raw columns which are expected not to be renamed. Any other
# columns in the raw data which fail to be renamed will raise an error.
Member

If we make these comments into triple-quoted strings they'll show up in the documentation as docstrings on these attributes, which would be wonderful.
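A rough sketch of what that suggestion could look like. The class below is a hypothetical stand-in built on a standard-library dataclass, not the actual class in `src/pudl/transform/classes.py`. It assumes a documentation tool that picks up string literals placed directly after an attribute definition, as Sphinx autodoc does.

```python
from dataclasses import dataclass, field


@dataclass
class RenameColumnsSketch:
    """Hypothetical stand-in for the parameter class under review."""

    columns: dict[str, str] = field(default_factory=dict)
    """A dictionary of columns to be renamed."""

    not_renamed_columns: list[str] = field(default_factory=list)
    """A list of raw columns which are expected not to be renamed.

    Any other columns in the raw data which fail to be renamed will raise an error.
    """


params = RenameColumnsSketch(columns={"ending_balance": "ending_balance"})
```

The string literals have no effect at runtime; they only matter to tools that read the source to build documentation.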

src/pudl/transform/classes.py (outdated, resolved)
@@ -2623,7 +2623,93 @@
"adjustments": "adjustments",
"transfers": "transfers",
"ending_balance": "ending_balance",
}
},
"not_renamed_columns": [
Member

Lol okay it does necessitate a bunch of new metadata.

I assume that you've programmatically generated these lists of not renamed columns. Are they just all of the columns that weren't getting renamed? If so, do we need to worry that we might be codifying errors in specifying them now? Like, are any of these columns that we weren't renaming actually columns that we should have been renaming, but weren't resulting in their contents getting lost when we eventually enforce_schema() at the end of the transform step?

Member Author

Yes, these are all generated programmatically. I might suggest sorting the columns in the error message, to make it easier to spot near-duplicates/typos.

However, the overwhelming majority of these columns are factoid names, which we're collapsing into the xbrl_factoid column, so they wouldn't be affected by enforce_schema. Several others are not and should probably be checked.
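To illustrate why constraining the factoid column catches these cases, here is a simplified sketch in plain Python. The helper names `wide_to_tidy_sketch` and `check_factoids`, the ID columns, and the example record are all assumptions for illustration; the real `wide_to_tidy` step operates on dataframes and is not reproduced here.

```python
def wide_to_tidy_sketch(record, id_cols):
    """Collapse the non-ID columns of one wide record into tidy rows.

    Each remaining column name becomes a value in the xbrl_factoid column.
    """
    ids = {key: record[key] for key in id_cols}
    return [
        {**ids, "xbrl_factoid": col, "value": val}
        for col, val in record.items()
        if col not in id_cols
    ]


def check_factoids(rows, allowed):
    """Raise if any factoid falls outside the allowed (enum-like) set."""
    unexpected = sorted({row["xbrl_factoid"] for row in rows} - set(allowed))
    if unexpected:
        raise ValueError(f"Unexpected factoids: {unexpected}")


# Made-up record: the typo "ending_balanse" would otherwise become its own factoid.
record = {"report_year": 2023, "adjustments": 1.0, "ending_balanse": 3.0}
rows = wide_to_tidy_sketch(record, id_cols=["report_year"])

try:
    check_factoids(rows, allowed=["adjustments", "transfers", "ending_balance"])
except ValueError as err:
    message = str(err)
```

Because column names turn into data values in the tidy table, a schema check on column names alone would not notice the typo; a constraint on the factoid values does.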

@e-belfer e-belfer requested a review from cmgosnell September 3, 2024 22:21
@e-belfer
Copy link
Member Author

e-belfer commented Sep 3, 2024

@cmgosnell I still want to kick off a full build once the coverage issues are fixed to ensure I haven't introduced any downstream changes, but I implemented the changes discussed so it should be ready for review otherwise!

Member

@cmgosnell cmgosnell left a comment

it's a little funny that my suggestion to swap the not-being-renamed columns to enums results in more lines of code being changed.... but lol i think this is still a much better solution and the thing we really care about.

I made one small suggestion (idek if it's a good one). It's definitely not blocking, so feel free to ignore my tricks to reduce lines without actually reducing complexity.

Comment on lines +206 to +207
"enum": ASSET_TYPES_FERC1.extend(
# Add all possible correction records into enum
Member

not blocking at all: for all of these corrections, i am tempted to suggest adding a correction for all of them. i know not all factoids get corrections because not all factoids are calculated (and even within those not all of them get corrections bc some calculated fields are surprisingly clean). but i see why you wouldn't do this and don't feel strongly about it either way.

ASSET_TYPES_FERC1 + [f"{t}_correction" for t in ASSET_TYPES_FERC1]
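A small runnable sketch of the suggestion above, using made-up placeholder values rather than the real contents of `ASSET_TYPES_FERC1`. Two general Python points apply: the name interpolation needs an f-string, and `+` builds a new list whereas `list.extend` mutates the list in place and returns `None`.

```python
# Hypothetical placeholder values; the real list lives in the PUDL metadata.
ASSET_TYPES_SKETCH = ["asset_a", "asset_b"]

# Concatenation returns a new list and leaves the original untouched.
with_corrections = ASSET_TYPES_SKETCH + [
    f"{t}_correction" for t in ASSET_TYPES_SKETCH
]

# By contrast, list.extend mutates its list in place and returns None,
# so its return value cannot be used directly as an enum.
scratch = list(ASSET_TYPES_SKETCH)
extend_result = scratch.extend(["asset_a_correction"])
```

Concatenation also avoids modifying a module-level constant that other tables may share.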

Member Author

I pulled these lists from the calculation_components table, which if I remember correctly has a factoid for each theoretically possible component, rather than just those we observe. If we get corrections for records that can't technically have corrections, I would imagine this should be an error?

@e-belfer e-belfer enabled auto-merge September 5, 2024 14:09
@e-belfer e-belfer added this pull request to the merge queue Sep 5, 2024
Merged via the queue into main with commit 7ac304f Sep 5, 2024
17 checks passed
@e-belfer e-belfer deleted the ferc1-column-errors branch September 5, 2024 15:29
Labels
data-validation Issues related to checking whether data meets our quality expectations. ferc1 Anything having to do with FERC Form 1
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Make FERC column mapping warnings into errors
3 participants