Add option for passing extended resources in node labels in GKE #7604
Conversation
On GCE, Cluster Autoscaler reads extended resource information from kubenv->AUTOSCALER_ENV_VARS->extended_resources in the managed scaling group template definition. However, users have no way to add a variable to extended_resources; it is controlled from the GKE side. This results in Cluster Autoscaler not supporting scale-up from zero for any node pool that has extended resources (like GPUs) on GCE. Node labels, however, are passed from the node pool to the managed scaling group template through kubenv->AUTOSCALER_ENV_VARS->node_labels. This commit introduces the ability to pass extended resources as node labels with a defined prefix on GCE, similar to how Cluster Autoscaler expects extended resources on AWS. This allows scaling from zero for node pools with extended resources.
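For illustration only, here is a minimal Go sketch of the mapping this description talks about - turning prefixed node labels into extended resources for the node template. The helper name, the exact prefix string, and the error handling are assumptions, not the PR's actual diff:

```go
// Hypothetical sketch, not the PR's diff: extract extended resources from
// node labels that carry the proposed prefix.
package main

import (
	"fmt"
	"strings"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

const extendedResourcesKeyPrefix = "clusterautoscaler-nodetemplate-resources."

// extendedResourcesFromLabels (assumed name) collects every label of the form
// "clusterautoscaler-nodetemplate-resources.<resource-name>: <quantity>" into
// a ResourceList that a node template could expose for scale-from-zero.
func extendedResourcesFromLabels(labels map[string]string) apiv1.ResourceList {
	result := apiv1.ResourceList{}
	for key, value := range labels {
		if !strings.HasPrefix(key, extendedResourcesKeyPrefix) {
			continue
		}
		name := strings.TrimPrefix(key, extendedResourcesKeyPrefix)
		quantity, err := resource.ParseQuantity(value)
		if err != nil {
			continue // ignore labels whose value is not a valid quantity
		}
		result[apiv1.ResourceName(name)] = quantity
	}
	return result
}

func main() {
	labels := map[string]string{
		"clusterautoscaler-nodetemplate-resources.example.com/foo": "42",
		"kubernetes.io/arch": "amd64", // unrelated labels are ignored
	}
	for name, quantity := range extendedResourcesFromLabels(labels) {
		fmt.Printf("%s = %s\n", name, quantity.String())
	}
}
```

Under this assumption, a node pool labelled this way would let scale-from-zero simulations see example.com/foo: 42 on the template node.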
Hi @mu-soliman. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Welcome @mu-soliman!
So is this for GKE (cloud-managed Kubernetes) or GCE (virtual machines with self-managed Kubernetes)? I am asking because I am having trouble with scaling up from 0 and looking into ways, but for me the
On GKE, cluster autoscaler is configured with the cloudProvider parameter set to the value
The change I submitted was tested on GKE, so I updated the description and title to mention GKE, but I expect that it will run on GCE.
@mu-soliman For reference, I found out that in case someone like me uses "pure" GCE, setting
if err != nil {
	return apiv1.ResourceList{}, err
}
const extendedResourcesKeyPrefix = "clusterautoscaler-nodetemplate-resources-"
The Amazon/AWS provider (where these additional configuration options and flags are much better documented…) uses k8s.io/cluster-autoscaler/node-template/… as the prefix for what users should/could self-configure; maybe it would be wise to adhere to this pattern here if possible.
No, it is not possible. You can have at most one / in node labels. AWS has a separation between node labels and autoscaling group templates; GKE does not have such a separation, and templates just copy attributes from node labels.
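As a side note, a small sketch (not part of the PR) of the constraint being described: Kubernetes label keys allow at most one slash, so the AWS-style tag prefix cannot be reused verbatim as a label key. The keys below are only examples, checked with the stock apimachinery validation helper:

```go
// Illustrative only: show why an AWS-style multi-slash key is not a valid
// Kubernetes label key, while the single-slash form proposed here is.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation"
)

func main() {
	keys := []string{
		"k8s.io/cluster-autoscaler/node-template/resources/foo",    // several slashes: rejected
		"clusterautoscaler-nodetemplate-resources.example.com/foo", // one slash: accepted
	}
	for _, key := range keys {
		errs := validation.IsQualifiedName(key)
		fmt.Printf("%s -> valid=%t %v\n", key, len(errs) == 0, errs)
	}
}
```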
nit: Could this end with a dot? So that extended resource example.com/foo: 42 would be injected via clusterautoscaler-nodetemplate-resources.example.com/foo=42?
It seems that at least the Azure and Huawei providers have a similar limitation around slashes, and they get around it by replacing slashes with underscores, e.g.:

k8s.io_cluster-autoscaler_node-template_resources_cpu: 3800m

From autoscaler/cluster-autoscaler/cloudprovider/huaweicloud/huaweicloud_service_manager.go (line 636 at 4f98ba1):

// The tag is of the format "k8s.io_cluster-autoscaler_node-template_taint_<taint-key>". "<taint-key>" is

Following this approach and using k8s.io_cluster-autoscaler_node-template_resources_ as the prefix seems strictly better than introducing a new, separate format just for GCE. Is there a reason why we can't go this way?
@BigDarkClown @jayantjain93 Can you please review this pull request?
Please assign this to @towca, who could be an active reviewer. I'm currently not reviewing on the repo.
/assign @towca
This PR effectively introduces a new workload-level API to Cluster Autoscaler, in a pretty hacky way. I see two main problems here:
I fully get the scale-from-0 for extended resources problem here, but I'd strongly prefer not to introduce a temporary/redundant API for this. I see the following workarounds until the proper DRA solution lands:
There are probably some things we could hack around on the GKE side if the above are not acceptable for you, but you should go through the GKE support channels for that. I'm also curious about this part:
Are you really seeing this for regular GKE GPU node pools? If so, that's very likely a bug - scale-from-0 should work for all GPUs you can configure on a GKE node pool.
1- Keeping an idle node does not work for us. We run a huge fleet (we run a cloud service), in a lot of regions, in 3 zones in each region, with at least 18 node pools in each zone in each region in each cluster - and we sometimes have more than one cluster per region. Keeping one idle machine per node pool makes our cost skyrocket. Manually swapping MIG instances does not work when we have updates or create node pools for infra at this size. Side note: this is not the only example where GCP's obsessive control over parameters passed to the autoscaler hits a wall; take for example autodiscovery by ASG prefix, where GCP enforces the naming convention
6- This is not a temporary/redundant new API; it already exists for AWS, Azure and GCP, with the GCP part being the one with the messy implementation. The passing of extended resources info is already implemented in the GCP part of CA. I am not introducing something new, I am just moving it from one place in the ASG metadata that the user has no control over to another one beside it that the user can set with partial freedom. When extended resources are deprecated (which will not be soon) we can remove this part.
Ah, this explains things and also takes away the GKE support channel option (OSS CA is not supported in GKE clusters).
Not sure what you mean by meaningless here. As SIG Autoscaling, we made a commitment to align on any new labels that Cluster Autoscaler or Karpenter react to. It certainly has meaning for us to honor this commitment.
Again, I'm not sure I understand this part. If you set up OSS CA on bare GCE, there are no restrictions on how you can configure it. In this case, you can just set the correct extended resources in the instance template - which is the standard interface for configuring NodeGroups for OSS CA on GCE. If you use GKE and the GKE CA, there are indeed a lot of config restrictions that allow us to support it on a huge scale (but the resources GKE allows you to configure should work correctly). Using the OSS CA in a GKE cluster isn't really something we officially support from either the GKE or OSS CA side. It's not surprising that you're hitting bugs and poor UX in this case.
The point of the alignment agreement is to design all new workload-level/label-based APIs in a way that doesn't cause problems for either CA or Karpenter. This requires design and discussion, which didn't happen. The outcome of such discussion might very well be deciding that something is CA/GCP-specific and doesn't require implementation from Karpenter, or rejecting the idea altogether. What I'm saying is that we can't just skip this alignment step if we're adding a new API.
I could see that if you were at least using the same labels, but you're adding a new one. The label (or label prefix) constitutes the API. Why not just reuse
By "redundant" I meant that it should soon be made obsolete by the upcoming DRA changes. But if we don't need to introduce new labels for it, I guess it shouldn't hurt until then.
@gjtempleton @jackfrancis @x13n I'm curious to hear your thoughts on this.
BTW, have you considered using your own fork of Cluster Autoscaler? As mentioned above, your setup is not really officially supported either by GKE or by Kubernetes/SIG Autoscaling. You're probably bound to run into more issues because of it, and attempting to fix them upstream won't always make sense.
@towca I'm not a GCE provider maintainer (I'm an Azure provider maintainer), but we have similar issues with folks running the OSS CAS on AKS: AKS has its own CAS implementation that is tuned for AKS features and, in similar ways to Kube's description, intentionally scoped to enable support as part of the managed service SLA. It is possible to use AKS w/ the OSS CAS, but with no guarantees. The two main considerations when deciding whether to add AKS-specific foo to the Azure provider are:
If the answer is yes to either of the above, we would not be able to accept that. For the record, it'd be great (IMO) if we could get more of the OSS code into the managed product offerings across cloud providers; being able to solicit user feedback directly like this is a great benefit of using the OSS solution. We're not there at the moment.
Mostly LGTM, just a minor comment.
It would be good to have a cloud-provider-agnostic way of doing this, but addressing the GKE gap in the short term makes sense to me.
if err != nil {
	return apiv1.ResourceList{}, err
}
const extendedResourcesKeyPrefix = "clusterautoscaler-nodetemplate-resources-"
nit: Could this end with a dot? So that extended resource example.com/foo: 42 would be injected via clusterautoscaler-nodetemplate-resources.example.com/foo=42?
change last character in extended resources prefix to be `.` instead of `-`. Add a warning if the extended resource already exists.
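A rough sketch of what the "warn if the extended resource already exists" behaviour from this commit could look like; the helper name and signature are assumptions rather than the actual diff, and klog is simply the logger the project already uses:

```go
// Assumed shape of the duplicate-resource warning, not the actual diff.
package main

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/klog/v2"
)

const extendedResourcesKeyPrefix = "clusterautoscaler-nodetemplate-resources."

// addExtendedResource records an extended resource parsed from a prefixed
// node label and warns (instead of failing) when a value for the same
// resource is already present.
func addExtendedResource(resources apiv1.ResourceList, name string, value resource.Quantity) {
	resourceName := apiv1.ResourceName(name)
	if _, exists := resources[resourceName]; exists {
		klog.Warningf("extended resource %q from label %q already exists and will be overwritten",
			name, extendedResourcesKeyPrefix+name)
	}
	resources[resourceName] = value
}

func main() {
	resources := apiv1.ResourceList{}
	addExtendedResource(resources, "example.com/foo", resource.MustParse("42"))
	addExtendedResource(resources, "example.com/foo", resource.MustParse("43")) // triggers the warning
	for name, quantity := range resources {
		fmt.Printf("%s = %s\n", name, quantity.String())
	}
}
```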
Repeating my last comment here for visibility (it's on an outdated file): It seems that at least the Azure and Huawei providers have a similar limitation around slashes, and they get around it by replacing slashes with underscores:
Following this approach and using k8s.io_cluster-autoscaler_node-template_resources_ as the prefix seems strictly better than introducing a new, separate format just for GCE. Is there a reason why we can't go this way?
All cloud service providers allow putting tags on autoscaling groups or virtual machine scale sets directly. GKE does not allow such a thing. The solution suggested in this pull request reads k8s node labels that are defined by the user for node pools and copied by GKE into the ASG template metadata. Because they are originally k8s node labels, they have to abide by k8s node label naming rules. One of those rules is to have at most one
I thought I clarified this in the Zoom meeting, and this is why I didn't want to repeat myself by answering it here in the comments. Sorry if it was not clear. Also, I clarified this to Daniel in another meeting with Google engineers.
My bad, I was somehow under the impression that they were using k8s labels as well, not provider-specific tags.
I think the problem for k8s labels is the underscore being there at all in the domain part, not having more than one slash (see https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set, http://www.tcpipguide.com/free/t_DNSLabelsNamesandSyntaxRules.htm). But anyway, you're right - that format won't work. In this case the proposed one LGTM; it seems as close as possible to the other one while adhering to k8s label rules.
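To make the same point concretely, a short sketch (illustrative only, not from the PR) that runs both formats through the stock label-key validation - the underscore format fails once a slash-carrying resource name forces the underscores into the domain part, while the proposed dotted prefix passes:

```go
// Illustrative only: the underscore-based tag format vs. the prefix proposed
// in this PR, run through the standard label-key validation.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation"
)

func main() {
	keys := []string{
		"k8s.io_cluster-autoscaler_node-template_resources_example.com/foo", // underscores in the domain part: rejected
		"clusterautoscaler-nodetemplate-resources.example.com/foo",          // proposed prefix: accepted
	}
	for _, key := range keys {
		errs := validation.IsQualifiedName(key)
		fmt.Printf("%s -> valid=%t %v\n", key, len(errs) == 0, errs)
	}
}
```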
No, that's probably on me - must've been the part where I had audio issues. Sorry for that.
/lgtm
For future reference, #7799 tracks an effort that would allow configuring extended resources and other Node template parts in a provider-agnostic way.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mu-soliman, towca
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/cherry-pick cluster-autoscaler-release-1.30
@mu-soliman: only kubernetes org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cherry-pick cluster-autoscaler-release-1.30
@x13n: new pull request created: #7986 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@x13n: new pull request created: #7987 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@x13n: new pull request created: #7988 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/kind feature
What this PR does / why we need it:
On GKE, Cluster Autoscaler reads extended resource information from kubenv->AUTOSCALER_ENV_VARS->extended_resources in the managed scaling group template definition.
However, users have no way to add a variable to extended_resources; it is controlled from the GKE side. This results in cluster autoscaler not knowing about extended resources and, in turn, not supporting scale-up from zero for any node pool that has extended resources (like GPUs) on GKE.
On the other hand, node labels are passed from the node pool to the managed scaling group template through kubenv->AUTOSCALER_ENV_VARS->node_labels.
This commit introduces the ability to pass extended resources to the cluster autoscaler as node labels with a defined prefix on GKE, similar to how cluster autoscaler expects extended resources on AWS. This allows scaling from zero for node pools with extended resources.