Raise error if FERC1 column renames don't match expected data #3791

e-belfer · 2024-08-15T00:06:01Z

Overview

What problem does this address?
Adds two validations into the FERC transforms:

If we're attempting to rename a column, that column should exist in the raw data.
All expected factoids are added as ENUMS in FIELD_METADATA_BY_RESOURCE for their respective tables. (Note that enums also log a warning if there are values that aren't in the constraints, but as all these values are FK values they'll raise a primary key error when wide_to_tidy is run).
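As a rough illustration of the first validation, here is a minimal sketch in plain Python. The function name `check_rename_columns` and its arguments are hypothetical and are not taken from the PUDL codebase; the real check lives in the transform classes and works on dataframes.

```python
def check_rename_columns(raw_columns, rename_map):
    """Raise if any column we try to rename is absent from the raw data.

    Hypothetical helper: ``raw_columns`` is the list of raw column names and
    ``rename_map`` maps old names to new names.
    """
    # Sort the missing names so near-duplicates and typos sit side by side.
    missing = sorted(set(rename_map) - set(raw_columns))
    if missing:
        raise ValueError(
            f"Columns to rename are missing from the raw data: {missing}"
        )
    # Apply the rename to the column names, leaving unmapped names untouched.
    return [rename_map.get(col, col) for col in raw_columns]
```

Calling `check_rename_columns(["a", "b"], {"a": "x"})` returns the renamed column list, while a rename key that is not among the raw columns raises a `ValueError`.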

What did you change?
Added one validation assertion. Debugged some renaming issues outlined in #3576. Constrained the factoid column for each FERC table.

Testing

How did you make sure this worked? How can a reviewer verify this?
Generate all core_ferc1 assets and anything downstream.

To-do list

Run make pytest-coverage locally to ensure that the merge queue will accept your PR.
Review the PR yourself and call out any questions or issues you have.
For minor ETL changes or data additions, once make pytest-coverage passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and run relevant data validation tests using pytest and --live-dbs.

e-belfer · 2024-08-15T00:09:34Z

@zaneselvans I still have to do some validation testing before this is ready to go, but I'd be curious to get your design feedback on this before I do as it's a bit of a harder column renaming requirement than the 3 errors you'd originally listed. I do think it more closely mirrors what we do on the spreadsheet side, where we explicitly name each column that we're pulling in whether or not we rename it, and it'll prevent us from having unexpected column typos propagate as different factoids, for example.

zaneselvans · 2024-08-15T00:20:21Z

src/pudl/transform/classes.py

+    # A dictionary of columns to be renamed.
+    not_renamed_columns: list[str] = []
+    # A list of raw columns which are expected not to be renamed. Any other
+    # columns in the raw data which fail to be renamed will raise an error.


If we make these comments into triple-quoted strings they'll show up in the documentation as docstrings on these attributes, which would be wonderful.
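A minimal sketch of what that suggestion looks like, using a hypothetical stand-in class rather than the real one in `src/pudl/transform/classes.py`. A bare string literal placed directly after an attribute assignment is ignored at runtime, but documentation tools such as Sphinx autodoc can treat it as that attribute's docstring.

```python
class RenameColumnsSketch:
    """Hypothetical stand-in for the parameter class in the diff above."""

    # Mutable defaults mirror the diff; in a plain class like this one they
    # would be shared across instances, so treat this as illustration only.
    columns: dict[str, str] = {}
    """A dictionary of columns to be renamed."""

    not_renamed_columns: list[str] = []
    """A list of raw columns which are expected not to be renamed."""
```

The strings do not change behavior: `RenameColumnsSketch.not_renamed_columns` is still an empty list, and only the class-level docstring is stored in `__doc__`.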

zaneselvans · 2024-08-15T00:30:55Z

src/pudl/transform/params/ferc1.py

@@ -2623,7 +2623,93 @@
                    "adjustments": "adjustments",
                    "transfers": "transfers",
                    "ending_balance": "ending_balance",
-                }
+                },
+                "not_renamed_columns": [


Lol okay it does necessitate a bunch of new metadata.

I assume that you've programmatically generated these lists of not-renamed columns. Are they just all of the columns that weren't getting renamed? If so, do we need to worry that we might be codifying errors by specifying them now? That is, are any of these columns ones that we should have been renaming but weren't, resulting in their contents getting lost when we eventually enforce_schema() at the end of the transform step?

Yes, these are all generated programmatically. I might suggest sorting the columns in the error message, to make it easier to spot near-duplicates/typos.

However, the overwhelming majority of these columns are factoid names, which we're collapsing into the xbrl_factoid column, so they wouldn't be affected by enforce_schema. Several others are not and should probably be checked.
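To make the factoid point concrete, here is a simplified sketch of collapsing wide value columns into an `xbrl_factoid` column while checking them against an allowed list. The helper `melt_factoids` and the sample column names are hypothetical; the real `wide_to_tidy` step operates on dataframes and differs in detail.

```python
def melt_factoids(row, id_cols, allowed_factoids):
    """Melt one wide record into tidy records keyed by xbrl_factoid.

    Hypothetical helper: every non-ID column name becomes a factoid value,
    so a typo in a column name would silently become a new factoid unless
    it is checked against the allowed values.
    """
    value_cols = [col for col in row if col not in id_cols]
    # Sorted so that near-duplicate names are easy to compare by eye.
    unexpected = sorted(set(value_cols) - set(allowed_factoids))
    if unexpected:
        raise ValueError(f"Unexpected factoids: {unexpected}")
    ids = {col: row[col] for col in id_cols}
    return [
        {**ids, "xbrl_factoid": col, "value": row[col]} for col in value_cols
    ]
```

Because the column names end up as values rather than as columns, a schema check on column names alone would not notice a misspelled factoid; constraining the values is what catches it.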

Co-authored-by: Zane Selvans <[email protected]>

e-belfer · 2024-09-03T22:22:30Z

@cmgosnell I still want to kick off a full build once the coverage issues are fixed to ensure I haven't introduced any downstream changes, but I implemented the changes discussed so it should be ready for review otherwise!

cmgosnell

It's a little funny that my suggestion to swap the not-being-renamed columns to enums results in more lines of code being changed... but lol, I think this is still a much better solution and the thing we really care about.

I made one small suggestion (idek if it's a good one). It's definitely not blocking; feel free to ignore my tricks to reduce lines without actually reducing complexity.

cmgosnell · 2024-09-04T15:48:52Z

src/pudl/metadata/fields.py

+            "enum": ASSET_TYPES_FERC1.extend(
+                # Add all possible correction records into enum


Not blocking at all: for all of these corrections, I am tempted to suggest adding a correction for all of them. I know not all factoids get corrections because not all factoids are calculated (and even within those, not all of them get corrections because some calculated fields are surprisingly clean). But I see why you wouldn't do this and don't feel strongly about it either way.

ASSET_TYPES_FERC1 + [f"{t}_correction" for t in ASSET_TYPES_FERC1]

I pulled these lists from the calculation_components table, which if I remember correctly has a factoid for each theoretically possible component, rather than just those we observe. If we get corrections for records that can't technically have corrections, I would imagine this should be an error?
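A short sketch of the difference between the two ways of building the enum discussed in this thread. It assumes the constant is a plain Python list (the excerpt does not show its definition); `ASSET_TYPES` and its values are hypothetical placeholders.

```python
# Hypothetical placeholder values, not the real FERC1 asset types.
ASSET_TYPES = ["current_assets", "utility_plant"]

# list.extend mutates the list in place and returns None, so using its
# result directly as an enum value would yield None, not a list.
extend_result = list(ASSET_TYPES).extend(["current_assets_correction"])

# Concatenation builds a new list and leaves the original untouched.
# Note the f-string prefix: without it, "{t}_correction" stays literal.
with_corrections = ASSET_TYPES + [f"{t}_correction" for t in ASSET_TYPES]
```

The concatenated form adds a correction for every factoid, which is the trade-off raised above: it is shorter, but it permits corrections for factoids that can never have one.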

e-belfer requested a review from zaneselvans August 15, 2024 00:06

e-belfer self-assigned this Aug 15, 2024

zaneselvans and others added 4 commits August 14, 2024 20:07

Flag missing/extra columns as extraction errors.

aabe2cc

Add ignore_columns, additional validation test

874d08e

Finish handling explicit naming of columns that aren't expected to be…

6308765

… renamed

Fix missing column problems

9433653

e-belfer force-pushed the ferc1-column-errors branch from 33a6d83 to 9433653 Compare August 15, 2024 00:07

e-belfer added the ferc1 (Anything having to do with FERC Form 1) and data-validation (Issues related to checking whether data meets our quality expectations) labels Aug 15, 2024

zaneselvans reviewed Aug 15, 2024

e-belfer and others added 7 commits August 15, 2024 11:43

Update src/pudl/transform/classes.py

fc32a41

Co-authored-by: Zane Selvans <[email protected]>

Sort columns in error logs, update docstrings

73882e3

Fix merge conflicts

a49d19b

Merge branch 'main' into ferc1-column-errors

3ca30af

Remove not-renamed method from transform method and parameter

1c3af8a

Add factoids to enums and update migration - WIP

ed5c62b

Fix dtypes in table_dimensions

fcdd144

e-belfer requested a review from cmgosnell September 3, 2024 22:21

cmgosnell approved these changes Sep 4, 2024

Merge branch 'main' into ferc1-column-errors

f202cc4

e-belfer enabled auto-merge September 5, 2024 14:09

e-belfer added this pull request to the merge queue Sep 5, 2024

Merged via the queue into main with commit 7ac304f Sep 5, 2024
17 checks passed

e-belfer deleted the ferc1-column-errors branch September 5, 2024 15:29
