
Reduce time spent in each benchmark #318

Open · mdboom opened this issue Dec 11, 2024 · 5 comments

mdboom (Contributor) commented Dec 11, 2024

Some benchmarks probably run for longer than they need to in order to provide useful information.

Given the data we already have, it should be easy enough to do an analysis of how much variability there is between runs of each benchmark (we usually run each 20 times), to determine which benchmarks could safely be run fewer times.

Within each benchmark, however, the workload could theoretically be reduced while still providing the same value. That will require reducing data sizes and constants within the benchmarks themselves and measuring the impact on stability. It is probably worth focusing effort on the longest-running benchmarks first.

Cc: @mpage, @colesbury, @brandtbucher

mdboom (Contributor, Author) commented Dec 11, 2024

I've done some opportunity sizing on this. First, a refresher: there are four levels of looping around each benchmark, the first three controlled by pyperf (a rough sketch of the nesting follows the list):

  1. individual processes (20 times, by default)
  2. outer loop (3 times, by default, preceded by a "warmup")
  3. inner loop (dynamically determined so that the benchmark runs for at least 100ms; bench_runner hard-codes this number across runs to save time and be more consistent -- confusingly, the pyperf docs call this the "outer loop")
  4. loops/data sizes/etc. determined by the benchmark itself
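
To make the nesting concrete, here is a rough sketch of how these levels relate. This is only an illustration of the structure described above, not pyperf's actual implementation; all function names and the default constants shown are hypothetical.

```python
# Rough sketch of the loop nesting described above -- illustrative only,
# not pyperf's real code. Level numbers refer to the list above.
import time

def do_workload():
    # (4) the benchmark's own loops / data sizes / constants
    sum(i * i for i in range(10_000))

def one_value(inner_loops):
    # (3) inner loop: sized so that one timed value takes at least ~100ms
    t0 = time.perf_counter()
    for _ in range(inner_loops):
        do_workload()
    return (time.perf_counter() - t0) / inner_loops

def run_benchmark(processes=20, values_per_process=3, inner_loops=8):
    samples = []
    for _ in range(processes):                 # (1) in reality, separate worker processes
        one_value(inner_loops)                 # warmup value, discarded
        for _ in range(values_per_process):    # (2) outer loop
            samples.append(one_value(inner_loops))
    return samples                             # 20 x 3 = 60 timed values
```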

We have measurements for (1) and (2) (20 × 3 = 60 samples), but the idea behind pyperf is that the overhead of collecting a timing on every iteration of (3) would interfere too much with the results.

So the easy experiment, given only the data we already have, is to see if (1) and (2) could be reduced without affecting the reliability of the results.

Here's the normalized standard deviation of each benchmark (normalized here meaning the unit is "percentage change", not seconds -- i.e. the standard deviation divided by the mean). This is a reasonable ranking from "less reliable" to "more reliable":

[chart: normalized standard deviation per benchmark]

Using sample size determination statistics, we can compute how many samples we need to take before we are 95% certain that the error bounds are less than 1% (a sketch of the calculation follows the list below). From this we can see:

  • red: The 60 samples we already collect are not meeting our reliability target. The number in this case is kind of meaningless except to say we need "more" samples. These benchmarks would be good candidates for improvement.
  • yellow: We could safely reduce the number of samples taken.
  • green: A single sample seems to be enough.
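
For reference, here is a sketch of the sample-size calculation behind these categories. It is my reconstruction from the description above (95% confidence, 1% error bound on the mean), not necessarily the exact code used to produce the plots.

```python
# Sketch of the sample-size determination described above (a reconstruction,
# not necessarily the exact analysis code).
import math
from statistics import mean, stdev

Z_95 = 1.96          # z-score for a 95% confidence level
ERROR_BOUND = 0.01   # target half-width of the confidence interval: 1% of the mean

def required_samples(samples: list[float]) -> int:
    rel_std = stdev(samples) / mean(samples)     # normalized ("percentage") stddev
    return math.ceil((Z_95 * rel_std / ERROR_BOUND) ** 2)

# Example: a benchmark with a 2% relative stddev needs
# (1.96 * 0.02 / 0.01) ** 2 ~= 15.4, i.e. 16 samples.
```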

[chart: required number of samples per benchmark, colored red/yellow/green]

From this, you can calculate the time one could save by reducing iterations:

[chart: potential time savings per benchmark]

tl;dr: This totals to a potential reduction of 26 minutes.

Note that these totals include warmup time, which we assume we still need, which is why the reductions are smaller than one might expect. Determining whether the warmups actually matter would be a separate experiment, but they are potentially more runtime we could remove.

colesbury commented Dec 11, 2024

Just looking at pylint briefly: it doesn't appear to be noisy. The results seem pretty stable for a given number of iterations. However, from the code:

  "pylint seems to speed up considerably as it progresses, and this benchmark includes that."

  1. We're throwing away the most important time (the first iteration / warmup). The first iteration is the one that's representative of how pylint is used.
  2. The iteration times are not i.i.d. Does that matter for sample size determination? pylint might be the worst offender, but I suspect this is often true in general.

mdboom (Contributor, Author) commented Dec 11, 2024

Yes, I see what you are saying about pylint.

The iteration times are not i.i.d.

That's true in practice, but yes, the sample size determination assumes that they are. I'm not sure how to resolve that for something like pylint, where subsequent iterations get faster, other than (a) only including the first iteration (which, yes, is probably how it's most used in practice) or (b) including enough iterations in a single sample that the timing stabilizes; either of those would effectively make the samples more i.i.d.

There's probably a different sample size metric that would take this ordered stabilization into account and maybe arrive at a different (probably lower) number of required samples. I'll look into that for a bit...

mdboom (Contributor, Author) commented Dec 11, 2024

If I run the same experiment again, but rather than treating each of the 60 samples as i.i.d., I use the mean of the samples within each process (resulting in 20 samples per benchmark), this does seem to account for the "warmup" behavior in the pylint benchmark and moves it from one of the least stable benchmarks to one of the most stable.
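
Roughly, the change looks like this (a sketch with hypothetical names, not the actual analysis code):

```python
# Sketch: collapse the 60 raw values into 20 per-process means before
# estimating variability (hypothetical helper, not the actual analysis code).
from collections import defaultdict
from statistics import mean

def per_process_means(samples: list[tuple[int, float]]) -> list[float]:
    """samples: (process_id, timing) pairs -- 20 processes x 3 values each."""
    by_process: dict[int, list[float]] = defaultdict(list)
    for process_id, timing in samples:
        by_process[process_id].append(timing)
    # Averaging within a process absorbs ordered effects like pylint's speed-up.
    return [mean(timings) for timings in by_process.values()]
```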

You end up with slightly more potential time saved (26.8 minutes).

[collapsed section: plotting code]

[chart: normalized standard deviation per benchmark (per-process means)]
[chart: required number of samples per benchmark (per-process means)]
[chart: potential time savings per benchmark (per-process means)]

mdboom (Contributor, Author) commented Dec 11, 2024

It's interesting that there is a real mix of where the variability lies in the highly variable benchmarks. For example, as @colesbury pointed out, pylint is pretty stable by the inner loop iteration number:

[chart: pylint timings by process and inner-loop iteration]

coverage is pretty stable within each process run, but variable between them:

[chart: coverage timings by process and inner-loop iteration]

connected_components seems to be a weird mix of the two:

[chart: connected_components timings by process and inner-loop iteration]
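
One way to quantify this distinction (a sketch with hypothetical names, not code from this repo) is to split each benchmark's variance into a within-process component and a between-process component:

```python
# Sketch: split a benchmark's variance into within-process and between-process
# parts (hypothetical helper, not code from this repo).
from statistics import mean, pvariance

def variance_split(samples_by_process: list[list[float]]) -> tuple[float, float]:
    """samples_by_process: one list of timings per worker process."""
    within = mean(pvariance(p) for p in samples_by_process)     # large for pylint-style warmup
    between = pvariance([mean(p) for p in samples_by_process])  # large for coverage-style drift
    return within, between
```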
