Stabilize perf results #5064
base: main
Conversation
/bench_x64
/bench_x64
Change factor shows patch effect on x64 if merged compared to current head for main. Results are based on clocktick (CT) event cycles. Change Factor = (Patched_CT - Main_CT) / (Main_CT)
Averages (x64):
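(For reference, a minimal sketch of the arithmetic the bot reports, in Rust; `main_ct` and `patched_ct` are hypothetical clocktick totals, not numbers from this run. Scaling the same factor by 100 gives the percentage form discussed later in the thread.)

```rust
/// Change Factor = (Patched_CT - Main_CT) / Main_CT, as described in the
/// bot's output above.
fn change_factor(main_ct: f64, patched_ct: f64) -> f64 {
    (patched_ct - main_ct) / main_ct
}

fn main() {
    // Hypothetical clocktick totals for main and for the patched build.
    let (main_ct, patched_ct) = (1_000_000.0, 970_000.0);
    let factor = change_factor(main_ct, patched_ct);
    // A factor of -0.03 is equivalently a -3% change.
    println!("change factor: {:.4} ({:+.1}%)", factor, factor * 100.0);
}
```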
/bench_x64
/bench_x64
/bench_x64
Two ideas for bounding stability:
Change factor shows patch effect on x64 if merged compared to current head for main. Results are based on clocktick (CT) event cycles. Change Factor = (Patched_CT - Main_CT) / (Main_CT)
Averages (x64):
In my experience, even with dedicated hardware I've always had a lot of noise in time-based measurements, so for the long-term regression testing this is intended for, would it be possible to measure instructions retired instead of wall time (which I think clock cycles are more-or-less equivalent to)? That's what rust-lang/rust uses by default, and instructions are typically quite stable (although still not 100%). Also, as a minor thing, would it be possible to print the changes as %-based changes instead of factor-based changes?
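(As a minimal illustration of counting instructions retired from Rust, assuming the third-party `perf-event` crate (~v0.4), whose default counter is instructions retired; this is a sketch, not how the bench bot is actually wired up.)

```rust
use perf_event::Builder;

fn main() -> std::io::Result<()> {
    // The perf-event crate's default event is instructions retired.
    let mut counter = Builder::new().build()?;

    counter.enable()?;
    // ... the workload being measured goes here ...
    let v: Vec<u64> = (0..1_000u64).map(|i| i * i).collect();
    counter.disable()?;

    println!("{} instructions retired (len = {})", counter.read()?, v.len());
    Ok(())
}
```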
Change factor shows patch effect on x64 if merged compared to current head for main. Results are based on clocktick (CT) event cycles. Change Factor = (Patched_CT - Main_CT) / (Main_CT)
Averages (x64):
Change factor shows patch effect on x64 if merged compared to current head for main. Results are based on clocktick (CT) event cycles. Change Factor = (Patched_CT - Main_CT) / (Main_CT)
Averages (x64):
Seems fine to exclude instantiation. We have decent instantiation benchmarks in criterion anyways.
This is more something for the
Yeah, that's a good point actually; I agree. My main concern is making sure we have trustworthy results, and actually using the confidence-interval computation is the best way of doing that.
(And following on from that a bit more: I guess what I really want is to build up trust in the tool from first principles -- that's what I was trying to get at with the null-diff control. So perhaps this is a way we can validate the confidence-interval reporting once we get it integrated: if we submit an empty PR and benchmark it, we should see "no statistical difference" everywhere, or else we have a stats or configuration/setup bug.)
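(A rough sketch of that kind of check, under a normal-approximation assumption rather than whatever statistics the tooling actually implements: compare two samples and ask whether the 99% confidence interval for the difference of means excludes zero. For a null-diff control, both samples come from the same build, so the check should report no significant difference most of the time.)

```rust
/// Returns true if the two-sided 99% normal-approximation confidence interval
/// for the difference of means (b - a) excludes zero, i.e. a "statistically
/// significant" difference at roughly the 1% level.
fn significant_at_1pct(a: &[f64], b: &[f64]) -> bool {
    let mean = |s: &[f64]| s.iter().sum::<f64>() / s.len() as f64;
    let var = |s: &[f64], m: f64| {
        s.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (s.len() as f64 - 1.0)
    };
    let (ma, mb) = (mean(a), mean(b));
    let se = (var(a, ma) / a.len() as f64 + var(b, mb) / b.len() as f64).sqrt();
    let z = 2.576; // two-sided 99% critical value under the normal approximation
    let diff = mb - ma;
    let (lo, hi) = (diff - z * se, diff + z * se);
    !(lo <= 0.0 && 0.0 <= hi)
}

fn main() {
    // Null-diff control: both samples drawn from the same configuration, so we
    // expect no significant difference to be reported most of the time.
    let main_run = [100.0, 101.0, 99.5, 100.4, 100.1];
    let empty_pr_run = [100.2, 99.8, 100.6, 100.0, 99.9];
    println!("significant? {}", significant_at_1pct(&main_run, &empty_pr_run));
}
```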
(Note that the probability of a false positive is 1% (due to our default significance level), but that is per test, and we run 3 tests per Wasm input, so with only ~33 Wasm inputs we already expect about one false positive per benchmark run (33 × 3 × 0.01 ≈ 1). One of the many reasons to choose our Wasm inputs carefully.)
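(The arithmetic behind that estimate, spelled out with the numbers from the comment:)

```rust
fn main() {
    let alpha = 0.01;           // per-test false-positive rate (default significance level)
    let tests_per_input = 3.0;  // tests run per Wasm input
    let wasm_inputs = 33.0;     // roughly 1 / (alpha * tests_per_input)
    let expected_false_positives = alpha * tests_per_input * wasm_inputs;
    // 0.01 * 3 * 33 ≈ 1 expected false positive per benchmark run.
    println!("expected false positives per run: {:.2}", expected_false_positives);
}
```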
No description provided.