Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved outputs to analyze integration test results #445

Merged
merged 5 commits into from
Mar 18, 2025
Merged

Conversation

aspeake
Copy link
Collaborator

@aspeake aspeake commented Nov 16, 2024

Fixes #415

Introduces a class to compare integration test results on a branch with the results stored on master. A previous PR (#440) added all integration test results to master. This PR provides a way of evaluating the differences between the working branch and master, which include:

  • Store plot pdfs as CI artifacts so that they can be visually compared against master
  • New script compare_results.py to:
    • Output differences in keys between the branches' agg_results and ecm_results (output to agg_results_key_diffs.csv and ecm_results_key_diffs.csv)
    • Output percent differences in values between the branches' agg_results and ecm_results, as long as values meet an absolute threshold and the differences meet a percent threshold (output to agg_results_value_diffs.csv and ecm_results_value_diffs.csv)
    • Output percent differences in values between branches' Summary_Data-MAP.xlsx and Summary_Data-TP.xlsx (output to Summary_Data-MAP_percent_diffs.csv and Summary_Data-TP_percent_diffs.csv)
  • Update the Github Actions workflow so that when there are differences between the branch and master agg_results.json or ecm_results.json, then:
    • Commit new results and plots (same as before)
    • Pull down agg_results.json, ecm_results.json, Summary_Data-TP.xlsx, Summary_Data-MAP.xlsx from master, store in tests/integration_tests/results_base
    • Run tests/integration_tests/compare_results.py
    • Store the output csvs described above as CI artifacts

Example Outputs
Example CI artifacts are found at https://github.com/trynthink/scout/actions/runs/13660717679

Example *_results_key_diffs.csv:
image
image

Example *_results_value_diffs.csv:
image
image

Example Summary_Data-*_percent_diffs.xlsx:
Same format as original xlsx files, but values are the percent differences
image

@aspeake aspeake force-pushed the ci_outputs_2 branch 2 times, most recently from 1e36329 to 9e38c4e Compare November 18, 2024 22:45
@aspeake aspeake added this to the v1.1.0 milestone Nov 19, 2024
@aspeake aspeake self-assigned this Nov 19, 2024
@aspeake aspeake force-pushed the ci_outputs_2 branch 5 times, most recently from 7e81ca3 to a568072 Compare November 21, 2024 15:46

return key_diffs

def compare_dict_values(self, dict1, dict2, percent_threshold=10, abs_threshold=1000):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should the threshold be when deciding to report percent changes of json values? Should the thresholds for agg_results be different from ecm_results?

note - percent threshold means that only differences >= to that will be reported, absolute threshold only reports differences if the original values exceed that number to prevent outputting large percent diffs due to small numbers.

.gitignore Outdated

!tests/integration_testing/results/plots/tech_potential/*.xlsx
!tests/integration_testing/results/plots/max_adopt_potential/*.xlsx
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To overwrite the ignored .xlsx files specified above

@aspeake
Copy link
Collaborator Author

aspeake commented Jan 24, 2025

Remaining tasks

  • Revisit threshold for reporting diffs (is 10% and 1,000 correct?)
  • Add two columns to *_results_value_diffs.csv that show the original, absolute values to provide context for the percentage diffs.

@aspeake aspeake force-pushed the ci_outputs_2 branch 3 times, most recently from 35d720b to a782f5a Compare March 4, 2025 00:35
@aspeake aspeake force-pushed the ci_outputs_2 branch 5 times, most recently from b34265e to 0a12c03 Compare March 4, 2025 18:44
@aspeake aspeake requested a review from jtlangevin March 4, 2025 19:07
@aspeake
Copy link
Collaborator Author

aspeake commented Mar 4, 2025

@jtlangevin this is ready for your review (pending CI). Per your comment, I updated the absolute threshold for reporting to depend on the units being compared, where it is 1,000 if cost or energy, and 10 if it emissions. This means that one or both of the values must be greater than that (not that the difference is greater). The percent threshold remains the same for all, 10%.

A test case with results that change can be found here: https://github.com/trynthink/scout/actions/runs/13660717679. In the artifacts you will see *_diffs.csv files that summarize key and value differences, as well as Summary_Data* files for differences in the summary files. Because there were differences in results in that branch, the CI automatically uploads the new results, but also the plots now: 5d8680e

@jtlangevin
Copy link
Collaborator

jtlangevin commented Mar 10, 2025

@jtlangevin this is ready for your review (pending CI). Per your comment, I updated the absolute threshold for reporting to depend on the units being compared, where it is 1,000 if cost or energy, and 10 if it emissions. This means that one or both of the values must be greater than that (not that the difference is greater). The percent threshold remains the same for all, 10%.

A test case with results that change can be found here: https://github.com/trynthink/scout/actions/runs/13660717679. In the artifacts you will see *_diffs.csv files that summarize key and value differences, as well as Summary_Data* files for differences in the summary files. Because there were differences in results in that branch, the CI automatically uploads the new results, but also the plots now: 5d8680e

Thanks, I see the plots in the commit which is very helpful.

In the artifacts it looks like only the aggregate results are being differenced, and not results for individual ECMs – e.g., agg_results_value_diffs.csv exists but no similar file for individual ECMs. Only the file ecm_results_key_diffs.csv but it's unclear how that file is supposed to be read. I think we'll want to isolate which individual ECMs are causing the differences in values for cases where we change calculations that should only apply to a certain ECM or certain ECMs.

@aspeake
Copy link
Collaborator Author

aspeake commented Mar 10, 2025

@jtlangevin this is ready for your review (pending CI). Per your comment, I updated the absolute threshold for reporting to depend on the units being compared, where it is 1,000 if cost or energy, and 10 if it emissions. This means that one or both of the values must be greater than that (not that the difference is greater). The percent threshold remains the same for all, 10%.
A test case with results that change can be found here: https://github.com/trynthink/scout/actions/runs/13660717679. In the artifacts you will see *_diffs.csv files that summarize key and value differences, as well as Summary_Data* files for differences in the summary files. Because there were differences in results in that branch, the CI automatically uploads the new results, but also the plots now: 5d8680e

Thanks, I see the plots in the commit which is very helpful.

In the artifacts it looks like only the aggregate results are being differenced, and not results for individual ECMs – e.g., agg_results_value_diffs.csv exists but no similar file for individual ECMs. Only the file ecm_results_key_diffs.csv but it's unclear how that file is supposed to be read. I think we'll want to isolate which individual ECMs are causing the differences in values for cases where we change calculations that should only apply to a certain ECM or certain ECMs.

So the artifacts in that dummy PR are dependent of how the results changed. I just trimmed down the list of ECMs, meaning that there are a lot of differences in the json keys for ecm_results (found in ecm_results_key_diffs.csv), but the actual values of the common ECMs did not change, so ecm_results_value_diffs.csv was not produced.

In the PR description, the second screenshot under "Example *_results_value_diffs.csv:" shows pretty close what the ecm_results_value_diffs.csv would look like.

@aspeake aspeake force-pushed the ci_outputs_2 branch 3 times, most recently from 68d85ab to 1b4ad6d Compare March 17, 2025 20:25
@aspeake aspeake merged commit 436a197 into master Mar 18, 2025
9 checks passed
@aspeake aspeake deleted the ci_outputs_2 branch March 18, 2025 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Output visualizations/metrics for changes of integration test results
3 participants