Reduce time spent in each benchmark #318
I've done some opportunity sizing on this. First, a refresher that there are 3 loops that pyperf runs a benchmark in:

1. A number of separate worker processes.
2. A number of samples (values) collected within each process.
3. An inner loop of iterations per sample, timed as a single block.
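Roughly, the structure looks something like this (a simplified sketch rather than pyperf's actual implementation; the constants and helper names are just illustrative, with 20 × 3 matching the 60 samples mentioned below):

```python
import time

N_PROCESSES = 20   # loop (1): separate worker processes in the real tool
N_VALUES = 3       # loop (2): samples (values) recorded per process
INNER_LOOPS = 10   # loop (3): iterations timed together as one block

def run_benchmark(bench_func):
    values = []
    for _ in range(N_PROCESSES):          # pyperf actually spawns a fresh subprocess here
        for _ in range(N_VALUES):
            t0 = time.perf_counter()
            for _ in range(INNER_LOOPS):  # no per-iteration timing, so no collection overhead
                bench_func()
            values.append((time.perf_counter() - t0) / INNER_LOOPS)
    return values

values = run_benchmark(lambda: sum(range(1000)))
print(len(values))  # 20 x 3 = 60 samples, matching the counts discussed below
```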
We have measurements for (1) and (2) (60 samples), but the idea behind pyperf is that the overhead of collecting a measurement on each iteration of (3) would interfere too much with the results. So the easy experiment, given only the data we already have, is to see if (1) and (2) could be reduced without affecting the reliability of the results.

Here's the normalized standard deviation of each benchmark (normalized here meaning the unit is "percentage change", not seconds). This gives a reasonable ranking from "less reliable" to "more reliable".

Using sample size determination statistics, we can compute how many samples we need to take before we are 95% confident that the error bounds are less than 1%, which gives a required sample count for each benchmark.
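For reference, the sample-size calculation is essentially the standard n ≥ (z·σ/E)² formula. Here's a minimal sketch, assuming roughly normal, i.i.d. samples and using made-up normalized standard deviations:

```python
import math

Z_95 = 1.96          # z-score for a 95% confidence level
TARGET_ERROR = 0.01  # want the error bounds within ±1%

def required_samples(normalized_std: float) -> int:
    """Standard sample-size formula: n >= (z * sigma / E)^2."""
    return math.ceil((Z_95 * normalized_std / TARGET_ERROR) ** 2)

print(required_samples(0.005))  # a very stable benchmark -> 1 sample
print(required_samples(0.03))   # a noisier benchmark -> 35 samples
```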
From this, you can calculate the time one could save by reducing iterations. tl;dr: this totals to a potential reduction of 26 minutes. Note that these totals include warmup time, which we assume we still need, which is why the reductions appear smaller than one might expect. Determining whether the warmups actually matter would be a separate experiment, but that is potentially more runtime we could remove.
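The totalling works roughly like this (made-up per-benchmark numbers; the point is that warmup time is held constant while the measured samples shrink from 60 down to the required count):

```python
CURRENT_SAMPLES = 60

# benchmark -> (mean seconds per sample, required samples, warmup seconds);
# all hypothetical numbers, just to show the shape of the calculation
benchmarks = {
    "fast_stable_bench": (0.05, 5, 2.0),
    "slow_noisy_bench": (1.50, 40, 10.0),
}

total_current = 0.0
total_saved = 0.0
for name, (sec_per_sample, needed, warmup_sec) in benchmarks.items():
    current = warmup_sec + CURRENT_SAMPLES * sec_per_sample
    reduced = warmup_sec + needed * sec_per_sample   # warmup time is kept as-is
    total_current += current
    total_saved += current - reduced

print(f"potential reduction: {total_saved / 60:.1f} min "
      f"({100 * total_saved / total_current:.0f}% of current runtime)")
```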
Just looking at pylint briefly: it doesn't appear to be noisy. The results seem pretty stable for a given number of iterations. However, from the benchmark's code, the timing depends on the iteration number: subsequent iterations within a process run faster than the first.
Yes, I see what you are saying about pylint.
That's true in practice, but yes, the sample size determination assumes the samples are i.i.d. I'm not sure how to resolve that for something like pylint, where subsequent iterations get faster, other than (a) only including the first iteration (which, yes, is how it's probably most used in practice) or (b) including more iterations in a single sample until it stabilizes; both of those would basically make the samples closer to i.i.d. There's probably a different sample size metric that would take this ordered stabilization into account and maybe arrive at a different number of required samples (probably a lower one). I'll look into that for a bit...
Running the same experiment again, but using the mean of the samples within the same process (resulting in 20 samples per benchmark) rather than treating each of the 60 samples as i.i.d., does seem to account for the "warmup" behavior in the pylint benchmark and moves it from one of the least stable to one of the most stable. You end up with slightly more potential time saved (26.8 min).
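A sketch of that re-analysis, with synthetic data standing in for the real measurements (the reshape into 20 × 3 assumes the samples are ordered by process):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data: 20 processes x 3 samples, with the first sample in
# each process artificially slower to mimic pylint-style warmup behaviour.
samples = rng.normal(1.0, 0.01, size=(20, 3))
samples[:, 0] += 0.2

flat_std = samples.ravel().std(ddof=1) / samples.mean()   # all 60 treated as i.i.d.
per_process = samples.mean(axis=1)                        # mean per process -> 20 samples
grouped_std = per_process.std(ddof=1) / per_process.mean()

print(f"normalized std over 60 i.i.d. samples:    {flat_std:.3f}")
print(f"normalized std over 20 per-process means: {grouped_std:.3f}")
```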
It's interesting that there is a real mix of where the variability lies in the highly variable benchmarks. For example:

- As @colesbury pointed out, pylint is pretty stable by the inner loop iteration number.
- coverage is pretty stable within each process run, but variable between them.
- connected_components seems to be a weird mix of the two.
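One rough way to quantify where the variability lies is to compare the spread within each process run against the spread of the per-process means. A sketch with synthetic, coverage-like data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic, coverage-like data: each process run is internally consistent,
# but the per-process baseline drifts between runs.
process_baselines = rng.normal(1.0, 0.05, size=(20, 1))
samples = process_baselines + rng.normal(0.0, 0.005, size=(20, 3))

within_std = samples.std(axis=1, ddof=1).mean()   # average spread inside each process run
between_std = samples.mean(axis=1).std(ddof=1)    # spread of the per-process means

print(f"within-process std:  {within_std:.4f}")   # small -> stable within a run
print(f"between-process std: {between_std:.4f}")  # large -> variable between runs
```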
Some benchmarks probably run for longer than they need to in order to provide useful information.

Given the data we already have, it should be easy enough to do an analysis of how much variability there is between runs of each benchmark (we usually run each one 20 times) to determine which benchmarks could safely be run fewer times.

Within each benchmark, however, the benchmarks could theoretically be doing less work and still providing the same value. That will require reducing data sizes and constants within the benchmarks themselves and measuring the impact on stability. It is probably worth focusing effort on the longest-running benchmarks first.
Cc: @mpage, @colesbury, @brandtbucher