Accept new requests on SVE builders only if idle #81

Open · wants to merge 3 commits into main

Conversation

@luporl (Contributor) commented Dec 7, 2023

Some of Linaro's SVE builders are suffering from starvation,
because there is currently no way to limit how many builds a given
builder can start simultaneously. As there are 4 SVE builders that
may use any of the 4 G3 workers, sometimes some builders use
multiple workers while others end up with none available. This is
aggravated by the fact that the same builders may use more than one
worker several times in a row, causing other builders to starve
(clang-aarch64-sve-vls, for instance, was once idle for over 3
days).

The buildbot documentation (2.5.2.8. Prioritizing Builders) says
that builds are started in the builder with the oldest pending
request, but it seems this is not working, possibly because the
collapse requests feature ends up resetting submittedAt time.

While there is a way to limit how many builds a single worker can
run, there is currently no way to limit how many builds a builder
can be running on a pool of workers.

As shown below, this PR aims to enable us to prevent Builder A from
building on Worker 1 and Worker 2 at the same time, starving
Builder B of resources.

Builder A -->|------------|
           | | Worker 1   |
           | |------------|
           |
Builder B -->|------------|
             | Worker 2   |
             |------------|

This is accomplished by introducing the max_simultaneous_builds
setting for builders. Its value specifies the maximum number of
builds that a builder can have running simultaneously. The limit is
enforced through BuilderConfig's nextBuild callback. With it, it was
possible to limit Linaro's SVE builders to only one build at a time.
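As a rough illustration, such a callback could look like the sketch
below. This assumes the master-side Builder object exposes its
in-progress builds through its building attribute (as recent
Buildbot releases do); makeNextBuild is a hypothetical helper name,
not code from this PR.

def makeNextBuild(max_simultaneous_builds):
    # Return a nextBuild callback enforcing a per-builder limit.
    def nextBuild(builder, requests):
        # Refuse to start another build while this builder already
        # has the maximum number of builds running across the pool.
        if len(builder.building) >= max_simultaneous_builds:
            return None  # the requests simply stay pending
        # Otherwise hand back the oldest pending request.
        return requests[0]
    return nextBuild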

Use nextBuild on SVE builders to accept new build requests only if
the corresponding builder is not currently building anything.

This is needed on SVE bots because any of the 4 available workers
can accept requests from any of the 4 SVE builders, and sometimes
some builders use multiple workers while others end up with none
available. This is aggravated by the fact that the same builders
may use more than one worker several times in a row, causing
other builders to starve (clang-aarch64-sve-vls, for instance, did
not build anything in the last 3 days).

The buildbot documentation (2.5.2.8. Prioritizing Builders) says
that builds are started in the builder with the oldest pending
request, but it seems this is not working, possibly because the
collapse requests feature ends up resetting submittedAt time.
if b['name'].startswith("clang-aarch64-sve-"):
    nextBuild = nextSVEBuild
builders.append(util.BuilderConfig(**b, nextBuild=nextBuild))
c['builders'] = builders
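For reference, a possible definition of the nextSVEBuild callback
used above, the binary special case of the limit (a sketch under the
same builder.building assumption as the earlier makeNextBuild
example):

def nextSVEBuild(builder, requests):
    # Accept a request only while this builder is completely idle.
    if builder.building:
        return None
    return requests[0]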
Contributor commented:

Thanks for doing this, I know it's a pain to set up a whole local infra for these things.

I get the logic here but I think we'll want it to be more general:

  • Add an attribute to our builder class: max_simultaneous_builds, or single_build_only if we're being binary about it.
  • Somehow get to that attribute here and check for it instead of name matching.

Also, I was thinking this could accept a number of builds, but I don't know if we have that information in the nextBuild callback. For Linaro's specific purpose right now, checking for 0 or 1 is fine, but as a general feature, "start a new build if active builds < N" is better.

@luporl (Contributor, Author) replied:

Thanks for the review. The last commit adds max_simultaneous_builds to the linaro-g3-* builders and uses it to set nextBuild.
It seemed better to avoid changing the BuilderConfig class, which is in another repository, so I added max_simultaneous_builds to the dictionaries that are used to create the BuilderConfig instances.
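A sketch of how that dictionary key could be consumed without
touching BuilderConfig: since max_simultaneous_builds is not a
BuilderConfig parameter, it has to be popped from the dictionary
before the remaining keys are passed on. The all_builder_dicts name
and the makeNextBuild helper are illustrative, not taken from the PR.

for b in all_builder_dicts:
    # Pop the new key so BuilderConfig never sees it; a missing key
    # means the default of unlimited simultaneous builds.
    max_builds = b.pop('max_simultaneous_builds', None)
    nextBuild = makeNextBuild(max_builds) if max_builds else None
    builders.append(util.BuilderConfig(**b, nextBuild=nextBuild))
c['builders'] = builders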

@DavidSpickett (Contributor) commented:

Also, some context that may be missing here (and worth adding to the PR message): as I understand it, there is a way to limit how many builds a single worker can run, but not how many builds a builder can be running on a pool of workers.

Builder A ---->|------------|
               |   Worker   |
Builder B ---->|------------|

Here we are already able to make sure that A and B are not building on the worker at the same time.

Builder A -->|------------|
           | | Worker 1   |   
           | |------------|
           |   
Builder B -->|------------|
             | Worker 2   |   
             |------------|

This PR aims to enable us to prevent Builder A from building on Worker 1 and Worker 2 at the same time, starving B of resources.

(if the diagrams make any sense, feel free to steal them :) )

Limit the number of simultaneous builds using this new attribute,
instead of name matching.
@@ -442,6 +442,7 @@
     'tags' : ["clang"],
     'workernames' : ["linaro-g3-01", "linaro-g3-02", "linaro-g3-03", "linaro-g3-04"],
     'builddir': "clang-aarch64-sve-vla",
+    'max_simultaneous_builds' : 1,
Contributor commented:

I wonder if there are any documentation comments somewhere that we can add an explanation of this to.

Contributor commented:

Mostly to mention that the "default" is unlimited jobs per builder.

@DavidSpickett (Contributor) commented:

Also, I thought I had sent you on a wild goose chase when I remembered the collapse requests feature exists. Then I remembered that we're already using it (it defaults to True) on the SVE builders, so it is not a possible solution.

@DavidSpickett (Contributor) commented:

This looks good for Linaro's intent, but let's see if @gkistanova can tell us if we're reinventing the wheel here.

@gkistanova (Contributor) commented:

Do you really want/need to limit the number of builds of a particular configuration running on the pool? I mean, if the build queues of the other builders are empty, there is nothing wrong with using the whole pool of workers for a single build configuration, right? In that case there is no reason to keep some of the workers in the pool idle and make the response time for that build configuration longer.

Maybe instead of limiting, we should reconsider how the build jobs get scheduled on workers, to make sure we actually distribute the available workers fairly?

Let me think about this.

@luporl (Contributor, Author) commented Jan 3, 2024

> Do you really want/need to limit the number of builds of a particular configuration running on the pool?

It's not really needed, but it's an easy way to keep a particular configuration from using too many workers at the same time.

> I mean, if the build queues of the other builders are empty, there is nothing wrong with using the whole pool of workers for a single build configuration, right? In that case there is no reason to keep some of the workers in the pool idle and make the response time for that build configuration longer.

Right.

> Maybe instead of limiting, we should reconsider how the build jobs get scheduled on workers, to make sure we actually distribute the available workers fairly?

This would indeed be the best solution. I just don't know how to implement it :)

@gkistanova (Contributor) commented:

The Buildbot scheduler has been fixed. Now it should handle this case properly.
Could you see how it works for you, please?
If everything works as expected, let's close this PR, if you don't mind.

@luporl (Contributor, Author) commented Jun 19, 2024

I am afraid Linaro's SVE builders are still not getting a fair share of the available workers.
For instance, right now clang-aarch64-sve-vla-2stage is using 3 workers, clang-aarch64-sve-vls-2stage is using 1, and no worker has been allocated to clang-aarch64-sve-vls in the past 12 hours. If we look at its builds, we can see that more than 22 hours passed between builds 115 and 116.
