Improved outputs to analyze integration test results #445

aspeake · 2024-11-16T01:07:01Z

Fixes #415

Introduces a class that compares integration test results on a branch with the results stored on master. A previous PR (#440) added all integration test results to master. This PR provides ways to evaluate the differences between the working branch and master, including:

- Store plot pdfs as CI artifacts so that they can be visually compared against master
- New script compare_results.py to:
  - Output differences in keys between the branches' agg_results and ecm_results (output to agg_results_key_diffs.csv and ecm_results_key_diffs.csv)
  - Output percent differences in values between the branches' agg_results and ecm_results, as long as values meet an absolute threshold and the differences meet a percent threshold (output to agg_results_value_diffs.csv and ecm_results_value_diffs.csv)
  - Output percent differences in values between the branches' Summary_Data-MAP.xlsx and Summary_Data-TP.xlsx (output to Summary_Data-MAP_percent_diffs.csv and Summary_Data-TP_percent_diffs.csv)
- Update the GitHub Actions workflow so that when agg_results.json or ecm_results.json differs between the branch and master, the workflow will:
  - Commit new results and plots (same as before)
  - Pull down agg_results.json, ecm_results.json, Summary_Data-TP.xlsx, and Summary_Data-MAP.xlsx from master and store them in tests/integration_testing/results_base
  - Run tests/integration_testing/compare_results.py
  - Store the output csvs described above as CI artifacts
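As a rough illustration of the key comparison listed above, the sketch below walks two nested dictionaries and collects the key paths that exist in only one of them. The function name `find_key_diffs`, the `/` path separator, and the sample data are hypothetical; the actual logic in compare_results.py may differ.

```python
def find_key_diffs(base, new, path=""):
    """Return (key_path, side) pairs for keys found in only one of two nested dicts."""
    diffs = []
    for key in sorted(set(base) | set(new)):
        key_path = f"{path}/{key}" if path else str(key)
        if key not in new:
            diffs.append((key_path, "only in base"))
        elif key not in base:
            diffs.append((key_path, "only in new"))
        elif isinstance(base[key], dict) and isinstance(new[key], dict):
            # Both sides have this key and both are dicts, so compare one level down.
            diffs.extend(find_key_diffs(base[key], new[key], key_path))
    return diffs


# Hypothetical stand-ins for the master and branch ecm_results contents.
base = {"ECM A": {"energy": 1.0}, "ECM B": {"energy": 2.0}}
new = {"ECM A": {"energy": 1.0, "cost": 3.0}}
key_diffs = find_key_diffs(base, new)
```

Each pair could then be written as one row of a *_results_key_diffs.csv file.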

Example Outputs
Example CI artifacts are found at https://github.com/trynthink/scout/actions/runs/13660717679

Example *_results_key_diffs.csv: (screenshot not included)

Example *_results_value_diffs.csv: (screenshots not included)

Example Summary_Data-*_percent_diffs.xlsx:
Same format as original xlsx files, but values are the percent differences

aspeake · 2024-11-21T15:48:49Z

tests/integration_testing/compare_results.py

+
+        return key_diffs
+
+    def compare_dict_values(self, dict1, dict2, percent_threshold=10, abs_threshold=1000):


What should the thresholds be when deciding whether to report percent changes of json values? Should the thresholds for agg_results be different from those for ecm_results?

Note: the percent threshold means that only differences greater than or equal to it are reported. The absolute threshold means that differences are reported only if the original values exceed that number, which prevents large percent diffs caused by small numbers from being output.
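To make the two thresholds concrete, here is a minimal sketch of the filtering rule for a single pair of values. The function name `reportable_percent_diff` is hypothetical, and two details are assumptions rather than confirmed behavior: the absolute check passes when either value exceeds the threshold, and the percent difference is taken relative to the base (master) value.

```python
def reportable_percent_diff(base_val, new_val, percent_threshold=10, abs_threshold=1000):
    """Return the percent difference if it should be reported, else None."""
    # Absolute check (assumed): ignore pairs where neither value exceeds the threshold.
    if max(abs(base_val), abs(new_val)) <= abs_threshold:
        return None
    # A percent change relative to zero is undefined, so this sketch skips it.
    if base_val == 0:
        return None
    percent_diff = 100 * (new_val - base_val) / abs(base_val)
    # Percent check: report only differences at or above the threshold.
    if abs(percent_diff) < percent_threshold:
        return None
    return percent_diff
```

With the defaults, a change from 5 to 10 is skipped despite being a 100% change, because both values are small, while a change from 2000 to 2500 is reported as 25%.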

aspeake · 2024-11-21T16:47:42Z

.gitignore

+
+!tests/integration_testing/results/plots/tech_potential/*.xlsx
+!tests/integration_testing/results/plots/max_adopt_potential/*.xlsx


To override the ignore rule for the .xlsx files specified above

aspeake · 2025-01-24T21:52:36Z

Remaining tasks

- Revisit threshold for reporting diffs (is 10% and 1,000 correct?)
- Add two columns to *_results_value_diffs.csv that show the original, absolute values to provide context for the percentage diffs.
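One possible shape for the second task is sketched below: each reported diff is written next to the two values it was computed from. The column names `key_path`, `base_value`, `new_value`, and `percent_diff`, the row layout, and the sample numbers are all hypothetical.

```python
import csv
import io


def write_value_diffs(rows, out):
    """Write (key_path, base_value, new_value) rows plus their percent diff as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["key_path", "base_value", "new_value", "percent_diff"])
    for key_path, base_val, new_val in rows:
        # Assumes base_val is nonzero; rows are expected to be filtered beforehand.
        percent_diff = 100 * (new_val - base_val) / abs(base_val)
        writer.writerow([key_path, base_val, new_val, round(percent_diff, 2)])


# Write to an in-memory buffer here; a real run would open a csv file instead.
buffer = io.StringIO()
write_value_diffs([("ECM A/energy", 2000.0, 2500.0)], buffer)
csv_text = buffer.getvalue()
```

Keeping the original values in the same row lets a reviewer judge whether a large percentage reflects a meaningful change.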

aspeake · 2025-03-04T19:07:29Z

@jtlangevin this is ready for your review (pending CI). Per your comment, I updated the absolute threshold for reporting to depend on the units being compared: it is 1,000 for cost or energy and 10 for emissions. This means that one or both of the values must be greater than the threshold (not that the difference must be greater). The percent threshold remains 10% for all units.
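A minimal sketch of that rule follows. The dictionary `ABS_THRESHOLDS`, its category labels, and the function `should_report` are hypothetical names for illustration; how compare_results.py actually determines the units of a value is not shown here.

```python
# Absolute thresholds by unit category, as described above (labels are assumed).
ABS_THRESHOLDS = {"cost": 1000, "energy": 1000, "emissions": 10}
PERCENT_THRESHOLD = 10  # same for all unit categories


def should_report(base_val, new_val, units):
    """Decide whether a value pair passes both the absolute and percent checks."""
    abs_threshold = ABS_THRESHOLDS[units]
    # One or both values must exceed the unit-specific absolute threshold.
    if max(abs(base_val), abs(new_val)) <= abs_threshold:
        return False
    # A percent change relative to zero is undefined, so this sketch skips it.
    if base_val == 0:
        return False
    percent_diff = 100 * (new_val - base_val) / abs(base_val)
    return abs(percent_diff) >= PERCENT_THRESHOLD
```

For example, a change from 50 to 80 passes for emissions (threshold 10) but not for energy (threshold 1,000), even though the percent change is the same.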

A test case with results that change can be found here: https://github.com/trynthink/scout/actions/runs/13660717679. In the artifacts you will see *_diffs.csv files that summarize key and value differences, as well as Summary_Data* files for differences in the summary files. Because there were differences in results in that branch, the CI automatically uploads the new results, and now also the plots: 5d8680e

jtlangevin · 2025-03-10T15:45:52Z

Thanks, I see the plots in the commit, which is very helpful.

In the artifacts it looks like only the aggregate results are being differenced, and not results for individual ECMs: agg_results_value_diffs.csv exists, but there is no similar file for individual ECMs. Only the file ecm_results_key_diffs.csv exists, but it's unclear how that file is supposed to be read. I think we'll want to isolate which individual ECMs are causing the differences in values for cases where we change calculations that should only apply to a certain ECM or certain ECMs.

aspeake · 2025-03-10T16:20:33Z

The artifacts in that dummy PR depend on how the results changed. I just trimmed down the list of ECMs, so there are a lot of differences in the json keys for ecm_results (found in ecm_results_key_diffs.csv), but the actual values of the common ECMs did not change, so ecm_results_value_diffs.csv was not produced.

In the PR description, the second screenshot under "Example *_results_value_diffs.csv:" shows pretty closely what ecm_results_value_diffs.csv would look like.

aspeake force-pushed the ci_outputs_2 branch 2 times, most recently from 1e36329 to 9e38c4e on November 18, 2024 22:45

aspeake added this to the v1.1.0 milestone Nov 19, 2024

aspeake force-pushed the ci_outputs_2 branch from 96bfccb to f5ba137 on November 19, 2024 18:07

aspeake self-assigned this Nov 19, 2024

aspeake force-pushed the ci_outputs_2 branch 5 times, most recently from 7e81ca3 to a568072 on November 21, 2024 15:46

aspeake commented Nov 21, 2024

aspeake assigned jmythms Jan 24, 2025

aspeake force-pushed the ci_outputs_2 branch 3 times, most recently from 35d720b to a782f5a on March 4, 2025 00:35

aspeake unassigned jmythms Mar 4, 2025

aspeake force-pushed the ci_outputs_2 branch 5 times, most recently from b34265e to 0a12c03 on March 4, 2025 18:44

aspeake requested a review from jtlangevin March 4, 2025 19:07

aspeake force-pushed the ci_outputs_2 branch from 0a12c03 to 3c49309 on March 11, 2025 17:36

aspeake added 2 commits March 14, 2025 08:29

New script to compare integration test results with master.

3dc2726

Method to compare summary xlsx reports from CI; better organization of baseline and new results for CI comparisons.

217bc45

Write each key as a column in the *results_value_diffs.csv files.

26a0f8f

aspeake force-pushed the ci_outputs_2 branch from 3c49309 to b6012c5 on March 14, 2025 14:29

jtlangevin approved these changes Mar 17, 2025

aspeake force-pushed the ci_outputs_2 branch from b6012c5 to 529c8da on March 17, 2025 18:16

In compare_results.py, set threshold based on reporting units; refactor some methods; better documentation.

d826fe5

aspeake force-pushed the ci_outputs_2 branch 3 times, most recently from 68d85ab to 1b4ad6d on March 17, 2025 20:25

Fix writing of profiler data on CI

68a3bde

aspeake force-pushed the ci_outputs_2 branch from 1b4ad6d to 68a3bde on March 18, 2025 03:29

aspeake merged commit 436a197 into master Mar 18, 2025
9 checks passed

aspeake deleted the ci_outputs_2 branch March 18, 2025 18:05
