Update CEMS partitions to handle year-quarter files #3096
Codecov Report
Coverage diff, dev vs. #3096:
            dev      #3096    +/-
Coverage    92.6%    92.6%    -0.0%
Files       134      134
Lines       12577    12566    -11
Hits        11648    11634    -14
Misses      929      932      +3
…pudl into cems-year_quarters
… into cems-quarterly
…pudl into cems-year_quarters
WIP Transition CEMS partitions to `year_quarter` from `year` and `quarter`
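To make the partitioning change concrete, here is a minimal sketch of combined `year_quarter` keys replacing separate `year` and `quarter` values. The `"2023q1"` string format and the helper names `year_quarters` and `split_year_quarter` are assumptions made for illustration; they are not taken from the PUDL code in this PR.

```python
import re
from typing import Iterator

# Assumed key format: four-digit year, a literal "q", then a quarter from 1 to 4.
_YEAR_QUARTER = re.compile(r"^(\d{4})q([1-4])$")


def year_quarters(first_year: int, last_year: int) -> Iterator[str]:
    """Yield one combined key per quarter, in chronological order (hypothetical helper)."""
    for year in range(first_year, last_year + 1):
        for quarter in range(1, 5):
            yield f"{year}q{quarter}"


def split_year_quarter(key: str) -> tuple[int, int]:
    """Recover (year, quarter) from a combined key (hypothetical helper)."""
    match = _YEAR_QUARTER.match(key)
    if match is None:
        raise ValueError(f"Not a year_quarter key: {key!r}")
    return int(match.group(1)), int(match.group(2))
```

A single combined key means each partition maps to exactly one quarterly file, and the year can still be recovered when grouping outputs by year.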
Edit: I pre-populated the datastore before starting the CEMS materialization last night, so it should not have been downloading anything. I checked the timestamps on the files this morning and they were all from the same time, before I ran the ETL. Also, I've re-run the CEMS asset materializations this morning and it looks like it'll take 2 hours again. Not sure what the difference is between my system and yours, though.
Thanks for fixing that CI failure! It's curious that it took you 2 hours. eeee. It has consistently taken my computer ~40 minutes. When it needed to download a new archive, it took 58. Which is still ~2x the time of the previous setup. I 100% agree that we should take some time to make it faster, but I don't think we should delay integrating this until we do that.
Another weird thing I'm seeing locally when comparing my two sets of outputs: the new ETL produces significantly smaller outputs, despite including a little bit more data.
Hmm. Using the macOS Activity Monitor (rather than btop) I see 10 python3.11 processes, each of which claims to have 20 threads and appears to be using multiple GB of memory. That means a lot of it is spilling onto swap on disk, which would slow things down a lot. In
It seems that if you select the While if you select the I feel like the configs are flaky in general, and I have run into issues with them not getting updated when the settings files change in the context of the
Make CEMS quarterly and add 2023 data! See #2973 for detailed task list.
This PR:
- `pudl.extract.epacems` to read in quarterly data
- `pudl.transform.epacems` to handle quarterly data
- `pudl.etl.epacems_assets` to write year/state row groups from quarterly parquet files to the monolithic output
- hourly_emissions_epacems.process_single_year() to prevent OOM issues (currently to 2 threads)
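The thread cap mentioned above can be sketched with the standard library. This is only an illustration of bounding concurrency to limit peak memory, not the PR's actual implementation: `process_partition` and `process_all` are hypothetical names, and the worker body is a stand-in for the real per-partition work.

```python
from concurrent.futures import ThreadPoolExecutor

# The PR description says the limit is "currently 2 threads"; everything else
# in this sketch is an assumption made for illustration.
MAX_WORKERS = 2


def process_partition(year_quarter: str) -> str:
    """Hypothetical stand-in for the memory-heavy work done on one partition."""
    return f"processed {year_quarter}"


def process_all(partitions: list[str], max_workers: int = MAX_WORKERS) -> list[str]:
    """Run the work with at most `max_workers` partitions in flight at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, even if tasks finish out of order.
        return list(pool.map(process_partition, partitions))
```

Capping the pool trades wall-clock time for a lower memory ceiling: with only two partitions in memory at a time, the process is less likely to spill into swap.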