
Add option for passing extended resources in node labels in GKE #7604

Merged: 4 commits into kubernetes:master on Feb 12, 2025

Conversation

@mu-soliman (Contributor) commented Dec 13, 2024

/kind feature

What this PR does / why we need it:

On GKE, the cluster autoscaler reads extended resource information from kube-env->AUTOSCALER_ENV_VARS->extended_resources in the managed instance group (MIG) template definition.

However, users have no way to add a variable to extended_resources; it is controlled from the GKE side. This results in the cluster autoscaler not knowing about extended resources and in turn not supporting scale-up from zero for node pools that have extended resources (like GPUs) on GKE.

On the other hand, node labels are passed from the node pool to the managed instance group template through kube-env->AUTOSCALER_ENV_VARS->node_labels.

This commit introduces the ability to pass extended resources to the cluster autoscaler as node labels with a defined prefix on GKE, similar to how the cluster autoscaler expects extended resources on AWS. This allows scaling from zero for node pools with extended resources.

On GKE, adding node labels that start with "clusterautoscaler-nodetemplate-resources-", with a value equal to the amount of the resource, allows the cluster autoscaler to detect extended resources and scale the node pool up from zero.

On GCE, the cluster autoscaler reads extended resource information from kube-env->AUTOSCALER_ENV_VARS->extended_resources in the managed instance group template definition.

However, users have no way to add a variable to extended_resources; it is controlled from the GKE side. This results in the cluster autoscaler not supporting scale-up from zero for node pools that have extended resources (like GPUs) on GCE.

Node labels, however, are passed from the node pool to the managed instance group template through kube-env->AUTOSCALER_ENV_VARS->node_labels.

This commit introduces the ability to pass extended resources as node labels with a defined prefix on GCE, similar to how the cluster autoscaler expects extended resources on AWS. This allows scaling from zero for node pools with extended resources.
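
As an illustration of the mechanism described above, here is a minimal Go sketch of turning prefixed node labels into an extended-resource list. The function name and package layout are assumptions for the example, not the PR's actual code, and the prefix is the one quoted in this description (later in the review its trailing "-" is changed to ".").

```go
package main

import (
	"fmt"
	"strings"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Prefix as quoted in the PR description (the merged version ends with ".").
const extendedResourcesKeyPrefix = "clusterautoscaler-nodetemplate-resources-"

// extendedResourcesFromLabels is an illustrative helper, not the PR's code:
// it collects labels carrying the prefix into an extended-resource list that
// a scale-from-zero node template could use.
func extendedResourcesFromLabels(labels map[string]string) (apiv1.ResourceList, error) {
	result := apiv1.ResourceList{}
	for key, value := range labels {
		if !strings.HasPrefix(key, extendedResourcesKeyPrefix) {
			continue
		}
		name := strings.TrimPrefix(key, extendedResourcesKeyPrefix)
		quantity, err := resource.ParseQuantity(value)
		if err != nil {
			return apiv1.ResourceList{}, err
		}
		result[apiv1.ResourceName(name)] = quantity
	}
	return result, nil
}

func main() {
	labels := map[string]string{
		"clusterautoscaler-nodetemplate-resources-example.com/foo": "42",
		"team": "ml", // unrelated label, ignored
	}
	resources, err := extendedResourcesFromLabels(labels)
	if err != nil {
		panic(err)
	}
	for name, quantity := range resources {
		fmt.Printf("%s = %s\n", name, quantity.String()) // example.com/foo = 42
	}
}
```
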
linux-foundation-easycla bot commented Dec 13, 2024

CLA Signed. The committers are authorized under a signed CLA.

@k8s-ci-robot added the area/cluster-autoscaler, area/provider/gce, and cncf-cla: no labels on Dec 13, 2024
@k8s-ci-robot (Contributor) commented:

Hi @mu-soliman. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-ok-to-test label on Dec 13, 2024
@k8s-ci-robot (Contributor) commented:

Welcome @mu-soliman!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the size/M and cncf-cla: yes labels and removed the cncf-cla: no label on Dec 13, 2024
@whisperity (Contributor) commented:

However, users have no way to add a variable to extended_resources; it is controlled from the GKE side. This results in the cluster autoscaler not knowing about extended resources and in turn not supporting scale-up from zero for node pools that have extended resources (like GPUs) on GCE.

So is this for GKE (cloud-managed Kubernetes) or GCE (virtual machines with self-managed Kubernetes)? I am asking because I am having trouble with scaling up from 0 and am looking into ways to do it, but for me the kubelets are all self-installed, and I have no idea where the KUBE_ENV could be set.

@mu-soliman changed the title from "Add option for passing extended resources in node labels in GCE" to "Add option for passing extended resources in node labels in GKE" on Dec 16, 2024
@mu-soliman (Contributor, Author) commented:

However, users have no way to add a variable to extended_resources; it is controlled from the GKE side. This results in the cluster autoscaler not knowing about extended resources and in turn not supporting scale-up from zero for node pools that have extended resources (like GPUs) on GCE.

So is this for GKE (cloud-managed Kubernetes) or GCE (virtual machines with self-managed Kubernetes)? I am asking because I am having trouble with scaling up from 0 and am looking into ways to do it, but for me the kubelets are all self-installed, and I have no idea where the KUBE_ENV could be set.

On GKE, the cluster autoscaler is configured with the cloudProvider parameter set to the value gce (I don't know why, probably for historical reasons), so the code change was made under the GCE subdirectory. The same cluster autoscaler code runs for both GKE and GCE.

The change I submitted was tested on GKE, so I updated the description and title to mention GKE, but I expect that it will run on GCE.

@whisperity (Contributor) commented:

The change I submitted was tested on GKE, so I updated the description and title to mention GKE, but I expect that it will run on GCE.

@mu-soliman For reference, I found out that if someone like me uses "pure" GCE, setting kube-env under the VM's Metadata makes it possible to simulate the GKE behaviour (such as providing these extended resource flags). That was nothing short of a godsend that spared me days of developing a patch for this.
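
For reference, the sketch below approximates the shape of such a kube-env metadata value with an AUTOSCALER_ENV_VARS entry; the exact keys, separators, and values are assumptions for illustration, not a documented contract.

```go
package main

import "fmt"

// kubeEnvExample approximates the value of the "kube-env" metadata key on a
// GCE instance template as the cluster autoscaler's GCE provider reads it.
// The keys, separators, and values here are assumptions for illustration.
const kubeEnvExample = `AUTOSCALER_ENV_VARS: os=linux;os_distribution=cos;node_labels=team=ml,pool=gpu;extended_resources=example.com/foo=42
`

func main() {
	fmt.Print(kubeEnvExample)
}
```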

if err != nil {
	return apiv1.ResourceList{}, err
}
const extendedResourcesKeyPrefix = "clusterautoscaler-nodetemplate-resources-"
A Contributor commented:

The Amazon and AWS providers (where these additional configuration options and flags are much better documented…) use k8s.io/cluster-autoscaler/node-template/… as the prefix for what users should/could self-configure; maybe it would be wise to adhere to this pattern here if possible.

@mu-soliman (Contributor, Author) commented Dec 16, 2024

No, it is not possible. You can have at most one / in node labels. AWS has a separation between node labels and autoscaling group templates; GKE does not have such a separation, and templates just copy attributes from node labels.
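
A small sketch with the upstream label-validation helper illustrates the constraint; the keys below are examples, not values used by the PR.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation"
)

func main() {
	// An AWS-style key gains extra "/" characters once a namespaced resource
	// name such as example.com/foo is appended, which violates Kubernetes
	// label-key syntax (at most one "/", separating an optional prefix from
	// the name).
	awsStyle := "k8s.io/cluster-autoscaler/node-template/resources/example.com/foo"
	fmt.Println(validation.IsQualifiedName(awsStyle)) // prints validation errors

	// The prefix proposed in this PR keeps the whole key down to one "/".
	gkeStyle := "clusterautoscaler-nodetemplate-resources-example.com/foo"
	fmt.Println(validation.IsQualifiedName(gkeStyle)) // prints [] (valid)
}
```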

A Member commented:

nit: Could this end with a dot? So that extended resource example.com/foo: 42 would be injected via clusterautoscaler-nodetemplate-resources.example.com/foo=42?

A Collaborator commented:

It seems that at least the Azure and Huawei providers have a similar limitation around slashes, and they get around it by replacing slashes with underscores.

Following this approach and using k8s.io_cluster-autoscaler_node-template_resources_ as the prefix seems strictly better than introducing a new, separate format just for GCE. Is there a reason why we can't go this way?

@mu-soliman (Contributor, Author) commented:

@BigDarkClown @jayantjain93 can you please review this pull request?

@jayantjain93 (Contributor) commented:

Please assign this to @towca, who could be an active reviewer. I'm currently not reviewing on this repo.

@mu-soliman (Contributor, Author) commented:

/assign @towca

@towca (Collaborator) commented Jan 29, 2025

This PR effectively introduces a new workload-level API to Cluster Autoscaler, in a pretty hacky way. I see two main problems here:

  • We've agreed to align all such new workload-level APIs with Karpenter in the Alignment AEP. So at minimum this would require coordinating with the Karpenter stakeholders and converging to a version that works for both autoscalers. I'm not sure if specially-prefixed labels would survive these discussions.
  • Cluster Autoscaler has a lot of rough edges around extended resources - scale-from-0 is one of them. Another example is that if it takes any meaningful time to actually add the resources to the Nodes, CA will overshoot the scale-ups. We have these rough edges handled for the GPU case, but not arbitrary extended resources. DRA will effectively replace extended resources in the near future, and it actually makes it easier to solve these rough edges in a generic way (for scale-from-0, you'd just define the expected resources in a proper K8s object).

I fully get the scale-from-0 for extended resources problem here, but I'd strongly prefer not to introduce a temporary/redundant API for this.

I see the following workarounds until the proper DRA solution lands:

  • Keep at least 1 Node around by setting MinSize=1. This obviously has a negative impact on cost, especially if the extended resources are very expensive.
  • Manually swap the MIG instance templates and add the extended resources. However, this would have to be re-done after every MIG update/upgrade because it changes the instance template.

There are probably some things we could hack around on the GKE side if the above are not acceptable for you, but you should go through the GKE support channels for that.

I'm also curious about this part:

This results in the cluster autoscaler not knowing about extended resources and in turn not supporting scale-up from zero for node pools that have extended resources (like GPUs) on GKE.

Are you really seeing this for regular GKE GPU node pools? If so, that's very likely a bug - scale-from-0 should work for all GPUs you can configure on a GKE node pool.

@mu-soliman (Contributor, Author) commented Jan 30, 2025

1- Keeping an idle node does not work for us. We run a huge fleet (we run a cloud service) in a lot of regions, in 3 zones per region, with at least 18 node pools in each zone of each cluster, and we sometimes have more than one cluster per region. Keeping one idle machine per node pool makes our cost skyrocket. Manually swapping MIG instance templates does not work when we have updates or create node pools for infrastructure at this size.
2- We contacted GCP support. After months of requests, multiple escalations, and attempts by them to get us a meeting with the responsible team, we got a canned email response from the engineering team saying "we know about this issue", with the suggested solution of not letting node pool machine counts go to zero, so basically the solution you suggested. There was no indication of when or whether this will be addressed. I am ready to share our support engineers' contacts so you can check with them if you are interested in following up on this.
3- We handle the wait for readiness of the extended resources by relying on startup taints that get removed when the extended resource is ready. We have been doing this on AWS with no problems for years now. For us, scaling from zero has rough edges only with CA on GCP.
4- We don't use GCP's autoscaler as it does not allow us to configure a lot of parameters and enforces defaults that don't suit us; we disable it and use the open-source version, the same as we do on AWS and Azure.
5- Arranging with Karpenter is meaningless since we have no other options; GCP controls the way parameters are passed to CA. For example, there is a suggestion in the discussion above of using / instead of -, like what happens on AWS, and it doesn't work because GCP ties node labels to ASG metadata, while other cloud service providers don't. Having a unified API means importing GCP's implementation restrictions into the others. We can have a discussion when we have options to choose from.

Side note: This is not the only example where GCP's obsessive control over parameters passed to the autoscaler hits a wall. Take, for example, autodiscovery by ASG prefix, where GCP enforces the naming convention gke-cluster-name at the start of the ASG name, BUT it truncates the cluster name if it is too long. This means that in our case, where we have multiple clusters in the same account all with a similar first x characters in the cluster name, there is no way to correctly identify the ASGs that back node pools in each cluster, since the truncated cluster name is the same everywhere! That is another issue; I am just giving you an example of the core problem at the heart of many problems with CA on GCP that will be imported into Karpenter if we try to unify the API, OR it will be rejected because no one will want to add restrictions that add no value to themselves.

6- This is not a temporary/redundant new API; it already exists for AWS, Azure, and GCP, with the GCP part being the one with a messy implementation. The passing of extended resource info is already implemented in the GCP part of CA. I am not introducing something new; I am just moving it from one place in the ASG metadata that the user has no control over to another one beside it that the user can set with partial freedom. When extended resources are deprecated (which will not be soon) we can remove this part.

@towca (Collaborator) commented Jan 31, 2025

4- We don't use GCP's autoscaler as it does not allow us to configure a lot of parameters and enforces defaults that don't suit us; we disable it and use the open-source version, the same as we do on AWS and Azure.

Ah, this explains things and also takes away the GKE support channel option (OSS CA is not supported in GKE clusters).

5- Arranging with Karpenter is meaningless since we have no other options; GCP controls the way parameters are passed to CA. For example, there is a suggestion in the discussion above of using / instead of -, like what happens on AWS, and it doesn't work because GCP ties node labels to ASG metadata, while other cloud service providers don't. Having a unified API means importing GCP's implementation restrictions into the others. We can have a discussion when we have options to choose from.

Not sure what you mean by meaningless here. As SIG Autoscaling, we made a commitment to align on any new labels that Cluster Autoscaler or Karpenter react to. It certainly has meaning for us to honor this commitment.

Side note: This is not the only example where GCP's obsessive control over parameters passed to the autoscaler hits a wall. Take, for example, autodiscovery by ASG prefix, where GCP enforces the naming convention gke-cluster-name at the start of the ASG name, BUT it truncates the cluster name if it is too long. This means that in our case, where we have multiple clusters in the same account all with a similar first x characters in the cluster name, there is no way to correctly identify the ASGs that back node pools in each cluster, since the truncated cluster name is the same everywhere!

Again, I'm not sure I understand this part. If you set up OSS CA on bare GCE, there are no restrictions on how you can configure it. In this case, you can just set the correct extended resources in the instance template - which is the standard interface for configuring NodeGroups for OSS CA on GCE.

If you use GKE and the GKE CA, there are indeed a lot of config restrictions that allow us to support it on a huge scale (but the resources GKE allows you to configure should work correctly). Using the OSS CA in a GKE cluster isn't really something we officially support from either the GKE or OSS CA side. It's not surprising that you're hitting bugs and poor UX in this case.

That is another issue; I am just giving you an example of the core problem at the heart of many problems with CA on GCP that will be imported into Karpenter if we try to unify the API, OR it will be rejected because no one will want to add restrictions that add no value to themselves.

The point of the alignment agreement is to design all new workload-level/label-based APIs in a way that doesn't cause problems for either CA or Karpenter. This requires design and discussion, which didn't happen. The outcome of such discussion might very well be deciding that something is CA/GCP-specific and doesn't require implementation from Karpenter, or rejecting the idea altogether. What I'm saying is that we can't just skip this alignment step if we're adding a new API.

6- This is not a temporary/redundant new API; it already exists for AWS, Azure, and GCP, with the GCP part being the one with a messy implementation. The passing of extended resource info is already implemented in the GCP part of CA. I am not introducing something new; I am just moving it from one place in the ASG metadata that the user has no control over to another one beside it that the user can set with partial freedom. When extended resources are deprecated (which will not be soon) we can remove this part.

I could see that if you were at least using the same labels but you're adding a new one. The label (or label prefix) constitutes the API. Why not just reuse k8s.io/cluster-autoscaler/node-template/resources/?

By "redundant" I meant that it should be soon made obsolete by the upcoming DRA changes. But if we don't need to introduce new labels for it, I guess it shouldn't hurt until then.

@gjtempleton @jackfrancis @x13n I'm curious to hear your thoughts on this.


BTW, have you considered using your own fork of Cluster Autoscaler? As mentioned above, your setup is not really officially supported either by GKE, or by Kubernetes/SIG Autoscaling. You're probably bound to run into more issues because of it, and attempting to fix them upstream won't always make sense.

@jackfrancis (Contributor) commented:

@towca I'm not a GCE provider maintainer (I'm an Azure provider maintainer), but we have similar issues with folks running the OSS CAS on AKS: AKS has its own CAS implementation that is tuned for AKS features and, in similar ways to Kube's description, intentionally scoped to enable support as part of the managed service SLA. It is possible to use AKS w/ the OSS CAS, but with no guarantees. The two main considerations when deciding whether to add AKS-specific foo to the Azure provider are:

  1. Does this take a dependency upon AKS behaviors or APIs that are subject to change by the AKS product effort (which is by definition an opinionated, Azure-managed Kubernetes)?
  2. Does this have any negative effect on the BYO Azure Kubernetes cluster scenario?

If the answer is yes to either of the above, we would not be able to accept that.

For the record, it'd be great (IMO) if we could get more of the OSS code into the managed product offerings across cloud providers; being able to solicit user feedback directly like this is a great benefit of using the OSS solution. We're not there at the moment.

@x13n (Member) left a comment:

Mostly LGTM, just a minor comment.

It would be good to have a cloud-provider-agnostic way of doing this, but addressing the GKE gap in the short term makes sense to me.

if err != nil {
	return apiv1.ResourceList{}, err
}
const extendedResourcesKeyPrefix = "clusterautoscaler-nodetemplate-resources-"
A Member commented:

nit: Could this end with a dot? So that extended resource example.com/foo: 42 would be injected via clusterautoscaler-nodetemplate-resources.example.com/foo=42?

mu-soliman and others added 2 commits on February 12, 2025:

  • Change the last character in the extended resources prefix to `.` instead of `-`.
  • Add a warning if the extended resource already exists.
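
A rough sketch of what the warning added in the second commit could look like; the helper name and the precedence between the two sources are assumptions for illustration, not necessarily the PR's actual behavior.

```go
package main

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/klog/v2"
)

// mergeExtendedResources is a hypothetical helper, not the PR's code: it
// merges label-derived extended resources into those already read from
// AUTOSCALER_ENV_VARS and warns when a resource is defined in both places.
// Which value wins below is an assumption, not something the PR states.
func mergeExtendedResources(fromKubeEnv, fromLabels apiv1.ResourceList) apiv1.ResourceList {
	merged := apiv1.ResourceList{}
	for name, quantity := range fromKubeEnv {
		merged[name] = quantity
	}
	for name, quantity := range fromLabels {
		if _, exists := merged[name]; exists {
			klog.Warningf("extended resource %q already defined in kube-env, keeping the label-provided value", name)
		}
		merged[name] = quantity
	}
	return merged
}

func main() {
	fromKubeEnv := apiv1.ResourceList{"example.com/foo": resource.MustParse("1")}
	fromLabels := apiv1.ResourceList{"example.com/foo": resource.MustParse("42")}
	merged := mergeExtendedResources(fromKubeEnv, fromLabels)
	fmt.Println(len(merged)) // 1, with a warning logged about example.com/foo
}
```
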
@towca (Collaborator) commented Feb 12, 2025

Repeating my last comment here for visibility (it's on an outdated file):


It seems that at least the Azure and Huawei providers have a similar limitation around slashes, and they get around it by replacing slashes with underscores.

Following this approach and using k8s.io_cluster-autoscaler_node-template_resources_ as the prefix seems strictly better than introducing a new, separate format just for GCE. Is there a reason why we can't go this way?

@mu-soliman (Contributor, Author) commented Feb 12, 2025

Repeating my last comment here for visibility (it's on an outdated file):

It seems that at least the Azure and Huawei providers have a similar limitation around slashes, and they get around it by replacing slashes with underscores:

* https://github.com/kubernetes/autoscaler/blob/4f98ba196da8ef8d89ee1a8595a6c284e24baafb/cluster-autoscaler/cloudprovider/azure/README.md?plain=1#L53

* https://github.com/kubernetes/autoscaler/blob/4f98ba196da8ef8d89ee1a8595a6c284e24baafb/cluster-autoscaler/cloudprovider/huaweicloud/huaweicloud_service_manager.go#L636

Following this approach and using k8s.io_cluster-autoscaler_node-template_resources_ as the prefix seems strictly better than introducing a new, separate format just for GCE. Is there a reason why we can't go this way?

Other cloud service providers allow putting tags on autoscaling groups or virtual machine scale sets directly. GKE does not allow such a thing.

The solution suggested in this pull request reads k8s node labels that are defined by the user for node pools and copied by GKE into the ASG template metadata.

Because they are originally k8s node labels they have to abide by k8s node label naming rules. One of those rules is to have at most one - character.

k8s.io_cluster-autoscaler_node-template_resources_ has two - characters so it cannot be used.

I thought I had clarified this in the Zoom meeting, which is why I didn't want to repeat myself by answering it here in the comments. Sorry if it was not clear.

I also clarified this to Daniel in another meeting with Google engineers.

@towca (Collaborator) commented Feb 12, 2025

The solution suggested in this pull request reads k8s node labels that are defined by the user for node pools and copied by GKE into the ASG template metadata.

My bad, I was somehow under the impression that they were using k8s labels as well, not provider-specific tags.

Because they are originally k8s node labels they have to abide by k8s node label naming rules. One of those rules is to have at most one - character.

k8s.io_cluster-autoscaler_node-template_resources_ has two - characters so it cannot be used.

I think the problem for k8s labels is the underscore being there at all in the domain part, not more than one dash (see https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set, http://www.tcpipguide.com/free/t_DNSLabelsNamesandSyntaxRules.htm). But anyway, you're right - that format won't work. In this case the proposed one LGTM, it seems as close as possible to the other one while adhering to k8s label rules.

I thought I had clarified this in the Zoom meeting, which is why I didn't want to repeat myself by answering it here in the comments. Sorry if it was not clear.

No, that's probably on me; it must have been the part where I had audio issues. Sorry for that.

/lgtm
/approve

For future reference, #7799 tracks an effort that would allow configuring extended resources and other Node template parts in a provider-agnostic way.

@k8s-ci-robot added the lgtm label on Feb 12, 2025
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mu-soliman, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Feb 12, 2025
@k8s-ci-robot merged commit 6cacce1 into kubernetes:master on Feb 12, 2025. 6 checks passed.
@mu-soliman (Contributor, Author) commented:

/cherry-pick cluster-autoscaler-release-1.30

@k8s-infra-cherrypick-robot commented:

@mu-soliman: only kubernetes org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually.

In response to this:

/cherry-pick cluster-autoscaler-release-1.30

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@x13n (Member) commented Mar 27, 2025

/cherry-pick cluster-autoscaler-release-1.30
/cherry-pick cluster-autoscaler-release-1.31
/cherry-pick cluster-autoscaler-release-1.32

@k8s-infra-cherrypick-robot commented:

@x13n: new pull request created: #7986

@k8s-infra-cherrypick-robot commented:

@x13n: new pull request created: #7987

@k8s-infra-cherrypick-robot commented:

@x13n: new pull request created: #7988

Labels
approved, area/cluster-autoscaler, area/provider/gce, cncf-cla: yes, lgtm, needs-ok-to-test, size/M