
Reduce time spent in each benchmark #318

Open · mdboom opened this issue Dec 11, 2024 · 5 comments

mdboom (Contributor) commented Dec 11, 2024

Some benchmarks probably run for longer than they need to in order to provide useful information.

Given the data we already have, it should be easy enough to do an analysis of how much variability there is between runs of each benchmark (we usually run each 20 times), to determine which benchmarks could safely be run fewer times.

Within each benchmark, however, the workload could theoretically be reduced while still providing the same value. That will require reducing data sizes and constants within the benchmarks themselves and measuring the impact on stability. It is probably worth focusing effort on the longest-running benchmarks first.

Cc: @mpage, @colesbury, @brandtbucher

mdboom (Contributor, Author) commented Dec 11, 2024

I've done some opportunity sizing on this. First, a refresher: there are four levels of looping around each benchmark, the first three controlled by pyperf (a rough sketch of the nesting follows the list):

  1. individual processes (20 times, by default)
  2. outer loop (3 times, by default, preceded by a "warmup")
  3. inner loop (dynamically determined so that the benchmark runs for at least 100ms; bench_runner hard-codes this number across runs to save time and be more consistent -- confusingly, the pyperf docs call this the "outer loop")
  4. loops/data sizes/etc. determined by the benchmark itself
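
To make the nesting concrete, here is a rough sketch of how these levels relate. This is only an illustration of the structure described above, not pyperf's actual implementation; all function names and the default constants shown are hypothetical.

```python
# Rough sketch of the loop nesting described above -- illustrative only,
# not pyperf's real code. Level numbers refer to the list above.
import time

def do_workload():
    # (4) the benchmark's own loops / data sizes / constants
    sum(i * i for i in range(10_000))

def one_value(inner_loops):
    # (3) inner loop: sized so that one timed value takes at least ~100ms
    t0 = time.perf_counter()
    for _ in range(inner_loops):
        do_workload()
    return (time.perf_counter() - t0) / inner_loops

def run_benchmark(processes=20, values_per_process=3, inner_loops=8):
    samples = []
    for _ in range(processes):                 # (1) in reality, separate worker processes
        one_value(inner_loops)                 # warmup value, discarded
        for _ in range(values_per_process):    # (2) outer loop
            samples.append(one_value(inner_loops))
    return samples                             # 20 x 3 = 60 timed values
```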

We have measurements for (1) and (2) (20 × 3 = 60 samples), but the idea behind pyperf is that the overhead of collecting a timing on every iteration of (3) would interfere too much with the results.

So the easy experiment, given only the data we already have, is to see if (1) and (2) could be reduced without affecting the reliability of the results.

Here's the normalized standard deviation of each benchmark (normalized here meaning the unit is "percentage change", not seconds -- i.e. the standard deviation divided by the mean). This is a reasonable ranking from "less reliable" to "more reliable":

[chart: normalized standard deviation per benchmark]

Using sample size determination statistics, we can compute how many samples we need to take before we are 95% certain that the error bounds are less than 1% (a sketch of the calculation follows the list below). From this we can see:

  • red: The 60 samples we already collect are not meeting our reliability target. The number in this case is kind of meaningless except to say we need "more" samples. These benchmarks would be good candidates for improvement.
  • yellow: We could safely reduce the number of samples taken.
  • green: A single sample seems to be enough.
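
For reference, here is a sketch of the sample-size calculation behind these categories. It is my reconstruction from the description above (95% confidence, 1% error bound on the mean), not necessarily the exact code used to produce the plots.

```python
# Sketch of the sample-size determination described above (a reconstruction,
# not necessarily the exact analysis code).
import math
from statistics import mean, stdev

Z_95 = 1.96          # z-score for a 95% confidence level
ERROR_BOUND = 0.01   # target half-width of the confidence interval: 1% of the mean

def required_samples(samples: list[float]) -> int:
    rel_std = stdev(samples) / mean(samples)     # normalized ("percentage") stddev
    return math.ceil((Z_95 * rel_std / ERROR_BOUND) ** 2)

# Example: a benchmark with a 2% relative stddev needs
# (1.96 * 0.02 / 0.01) ** 2 ~= 15.4, i.e. 16 samples.
```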

[chart: required number of samples per benchmark, colored red/yellow/green]

From this, you can calculate the time one could save by reducing iterations:

[chart: potential time savings per benchmark]

tl;dr: This totals to a potential reduction of 26 minutes.

Note that these totals include warmup time, which we assume we still need, which is why the reductions are smaller than one might expect. Determining whether the warmups actually matter would be a separate experiment, but they are potentially more runtime we could remove.

colesbury commented Dec 11, 2024

Just looking at pylint briefly: it doesn't appear to be noisy. The results seem pretty stable for a given number of iterations. However, from the code:

  "pylint seems to speed up considerably as it progresses, and this benchmark includes that."

  1. We're throwing away the most important time (the first iteration / warmup). The first iteration is the one that's representative of how pylint is used.
  2. The iteration times are not i.i.d. Does that matter for sample size determination? pylint might be the worst offender, but I suspect this is often true in general.

mdboom (Contributor, Author) commented Dec 11, 2024

Yes, I see what you are saying about pylint.

The iteration times are not i.i.d.

That's true in practice, but yes, the sample size determination assumes that they are. I'm not sure how to resolve that for something like pylint, where subsequent iterations get faster, other than (a) only including the first iteration (which, yes, is probably how it's most used in practice) or (b) including enough iterations in a single sample that the timing stabilizes; either of those would effectively make the samples more i.i.d.

There's probably a different sample size metric that would take this ordered stabilization into account and maybe arrive at a different (probably lower) number of required samples. I'll look into that for a bit...

mdboom (Contributor, Author) commented Dec 11, 2024

If I run the same experiment again, but rather than treating each of the 60 samples as i.i.d., I use the mean of the samples within each process (resulting in 20 samples per benchmark), this does seem to account for the "warmup" behavior in the pylint benchmark and moves it from one of the least stable benchmarks to one of the most stable.
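
Roughly, the change looks like this (a sketch with hypothetical names, not the actual analysis code):

```python
# Sketch: collapse the 60 raw values into 20 per-process means before
# estimating variability (hypothetical helper, not the actual analysis code).
from collections import defaultdict
from statistics import mean

def per_process_means(samples: list[tuple[int, float]]) -> list[float]:
    """samples: (process_id, timing) pairs -- 20 processes x 3 values each."""
    by_process: dict[int, list[float]] = defaultdict(list)
    for process_id, timing in samples:
        by_process[process_id].append(timing)
    # Averaging within a process absorbs ordered effects like pylint's speed-up.
    return [mean(timings) for timings in by_process.values()]
```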

You end up with slightly more potential time saved (26.8 minutes).

[collapsed section: plotting code]

[chart: normalized standard deviation per benchmark (per-process means)]
[chart: required number of samples per benchmark (per-process means)]
[chart: potential time savings per benchmark (per-process means)]

mdboom (Contributor, Author) commented Dec 11, 2024

It's interesting that there is a real mix of where the variability lies in the highly variable benchmarks. For example, as @colesbury pointed out, pylint is pretty stable by the inner loop iteration number:

[chart: pylint timings by process and inner-loop iteration]

coverage is pretty stable within each process run, but variable between them:

[chart: coverage timings by process and inner-loop iteration]

connected_components seems to be a weird mix of the two:

[chart: connected_components timings by process and inner-loop iteration]
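
One way to quantify this distinction (a sketch with hypothetical names, not code from this repo) is to split each benchmark's variance into a within-process component and a between-process component:

```python
# Sketch: split a benchmark's variance into within-process and between-process
# parts (hypothetical helper, not code from this repo).
from statistics import mean, pvariance

def variance_split(samples_by_process: list[list[float]]) -> tuple[float, float]:
    """samples_by_process: one list of timings per worker process."""
    within = mean(pvariance(p) for p in samples_by_process)     # large for pylint-style warmup
    between = pvariance([mean(p) for p in samples_by_process])  # large for coverage-style drift
    return within, between
```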
