-
-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Raise error if FERC1 column renames don't match expected data #3791
Conversation
33a6d83
to
9433653
Compare
@zaneselvans I still have to do some validation testing before this is ready to go, but I'd be curious to get your design feedback on this before I do as it's a bit of a harder column renaming requirement than the 3 errors you'd originally listed. I do think it more closely mirrors what we do on the spreadsheet side, where we explicitly name each column that we're pulling in whether or not we rename it, and it'll prevent us from having unexpected column typos propagate as different factoids, for example. |
src/pudl/transform/classes.py
Outdated
# A dictionary of columns to be renamed. | ||
not_renamed_columns: list[str] = [] | ||
# A list of raw columns which are expected not to be renamed. Any other | ||
# columns in the raw data which fail to be renamed will raise an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make these comments into triple-quoted strings they'll show up in the documentation as docstrings on these attributes, which would be wonderful.
src/pudl/transform/params/ferc1.py
Outdated
@@ -2623,7 +2623,93 @@ | |||
"adjustments": "adjustments", | |||
"transfers": "transfers", | |||
"ending_balance": "ending_balance", | |||
} | |||
}, | |||
"not_renamed_columns": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol okay it does necessitate a bunch of new metadata.
I assume that you've programmatically generated these lists of not renamed columns. Are they just all of the columns that weren't getting renamed? If so, do we need to worry that we might be codifying errors in specifying them now? Like, are any of these columns that we weren't renaming actually columns that we should have been renaming, but weren't resulting in their contents getting lost when we eventually enforce_schema()
at the end of the transform step?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, these are all generated programatically. I might suggest sorting the columns in the error message, to make it easier to spot near-duplicates/typos.
However, the overwhelming majority of these columns are factoid names, which we're collapsing into the xbrl_factoid
column, so they wouldn't be affected by enforce_schema
. Several others are not and should probably be checked.
Co-authored-by: Zane Selvans <[email protected]>
@cmgosnell I still want to kick off a full build once the coverage issues are fixed to ensure I haven't introduced any downstream changes, but I implemented the changes discussed so it should be ready for review otherwise! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its a little funny that my suggestion to swap the not-being-renamed columns to enums results in more lines of code being changed.... but lol i think this is still a much better solution and the thing we really care about.
I made one small idek if its a good suggestion definitely not blocking feel free to ignore my tricks to reduce lines but not actually reduce complexity.
"enum": ASSET_TYPES_FERC1.extend( | ||
# Add all possible correction records into enum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not blocking at all: for all of these corrections, it am tempted to suggest adding a correction for all of them. i know not all factoids get corrections because not all factoids are calculated (and even within those not all of them get corrections bc some calculated fields are surprisingly clean). but i see why you wouldn't do this and don't feel strongly about it either way.
ASSET_TYPES_FERC1 + ["{t}_correction" for t in ASSET_TYPES_FERC1]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pulled these lists from the calculation_components
table, which if I remember correctly has a factoid for each theoretically possible component, rather than just those we observe. If we get corrections for records that can't technically have corrections, I would imagine this should be an error?
Overview
Closes #3756
What problem does this address?
Adds two validations into the FERC transforms:
FIELD_METADATA_BY_RESOURCE
for their respective tables. (Note that enums also log a warning if there are values that aren't in the constraints, but as all these values are FK values they'll raise a primary key error whenwide_to_tidy
is run).What did you change?
Added one validation assertions. Debugged some renaming issues outlined in #3576. Constrained the factoid column for each FERC table.
Testing
How did you make sure this worked? How can a reviewer verify this?
Generate all
core_ferc1
assets and anything downstream.To-do list