[Bug] [Doc] TAS's document is wrong #3481

Open
KunWuLuan opened this issue Nov 7, 2024 · 4 comments · May be fixed by #3488
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@KunWuLuan
Member

What happened:
image

What you expected to happen:
I think this should be cloud.provider.com/topology-block

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@KunWuLuan KunWuLuan added the kind/bug Categorizes issue or PR as related to a bug. label Nov 7, 2024
@KunWuLuan KunWuLuan changed the title from "[Bug] [Doc ]" to "[Bug] [Doc] TAS's document is wrong" Nov 7, 2024
@mimowo
Contributor

mimowo commented Nov 7, 2024

Yeah, this is corrupted :), but it should not be cloud.provider.com/topology-block.

The role of the nodeLabel is to restrict TAS to a dedicated subset of nodes. You may want to do that, for example, to use TAS only on GPU nodes, to exclude control plane nodes, or to have two disjoint TAS pools.

The right label may depend on the cloud provider and your use case. For example, on GKE you may want to restrict TAS to a dedicated GPU node pool using: cloud.google.com/gke-nodepool: tas-a100-pool. Still, this node group may contain many blocks or racks.

We should definitely adjust the docs by renaming the label to, for example, cloud.provider.com/node-group: tas-node-group, and we can better explain the role of the label in the text.

@mimowo
Contributor

mimowo commented Nov 7, 2024

@KunWuLuan feel free to submit a PR

@KunWuLuan
Member Author

The role of the nodeLabel is to restrict TAS to a dedicated subset of nodes. You may want to do that, for example, to use TAS only on GPU nodes, to exclude control plane nodes, or to have two disjoint TAS pools.

Thanks mimowo. Let me check my understanding: if I set cloud.provider.com/topology-block=tas-node-group in the RF (ResourceFlavor), then the podSet will only be scheduled on nodes with cloud.provider.com/topology-block=tas-node-group, right?
What if only some of the labels are set on the nodes?

@KunWuLuan KunWuLuan linked a pull request Nov 8, 2024 that will close this issue
@mimowo
Contributor

mimowo commented Nov 8, 2024

if I set cloud.provider.com/topology-block=tas-node-group in the RF, then the podSet will only be scheduled on nodes with cloud.provider.com/topology-block=tas-node-group, right?

This is not the intention of the cloud.provider.com/topology-block label. The cloud.provider.com/topology-block label is meant to be used in the Topology API to denote the "block" level.

The label on the ResourceFlavor is meant to constrain the pool of nodes dedicated to TAS. For example, on GKE a good candidate might be the cloud.google.com/gke-nodepool label, but it will depend on the use case.
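
To make the split between the two labels concrete, here is a rough sketch that builds both objects as Python dicts and prints them as a JSON list. Everything specific in it is an assumption for illustration: the object names (`default`, `tas-flavor`), the rack label, the API versions, and the `topologyName` field linking the flavor to the topology. The node-group label is the docs placeholder proposed in this thread, not a label a real provider sets.

```python
import json

# Placeholder label keys (assumptions for illustration only).
NODE_GROUP_LABEL = "cloud.provider.com/node-group"   # selects the TAS node pool
BLOCK_LABEL = "cloud.provider.com/topology-block"    # a topology level
RACK_LABEL = "cloud.provider.com/topology-rack"      # hypothetical second level

# Topology: lists the label keys that define the hierarchy levels.
topology = {
    "apiVersion": "kueue.x-k8s.io/v1alpha1",  # assumed API version
    "kind": "Topology",
    "metadata": {"name": "default"},          # hypothetical name
    "spec": {
        "levels": [
            {"nodeLabel": BLOCK_LABEL},
            {"nodeLabel": RACK_LABEL},
            {"nodeLabel": "kubernetes.io/hostname"},
        ]
    },
}

# ResourceFlavor: nodeLabels restricts TAS to one pool of nodes.
# Note that the block label does not appear here.
resource_flavor = {
    "apiVersion": "kueue.x-k8s.io/v1beta1",   # assumed API version
    "kind": "ResourceFlavor",
    "metadata": {"name": "tas-flavor"},       # hypothetical name
    "spec": {
        "nodeLabels": {NODE_GROUP_LABEL: "tas-node-group"},
        "topologyName": "default",            # assumed link to the Topology above
    },
}


def render(manifests):
    """Serialize the manifests as a single JSON List object."""
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2
    )


if __name__ == "__main__":
    print(render([topology, resource_flavor]))
```

The point of the sketch is only that the block label lives in the Topology levels, while the ResourceFlavor carries a separate label that picks the node pool.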

What if only some of the labels are set on nodes?

TAS only considers nodes which:

  • have the same key and value as in the ResourceFlavor, spec.nodeLabels
  • have the same key as in the Topology, spec.levels[*].nodeLabel
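
The two conditions above can be sketched as a small predicate. This is not Kueue's implementation, only an illustration of the rule as stated; the function name, the node names, and the label values are hypothetical, and it assumes that a node must carry the label key of every Topology level.

```python
def tas_eligible(node_labels, flavor_node_labels, topology_level_keys):
    """Return True if a node satisfies both conditions described above.

    node_labels: dict of labels set on the node.
    flavor_node_labels: dict from the ResourceFlavor's spec.nodeLabels.
    topology_level_keys: label keys listed in the Topology levels.
    """
    # Condition 1: every ResourceFlavor label matches by key and value.
    matches_flavor = all(
        node_labels.get(key) == value
        for key, value in flavor_node_labels.items()
    )
    # Condition 2: every Topology level key is present (any value).
    has_all_levels = all(key in node_labels for key in topology_level_keys)
    return matches_flavor and has_all_levels


# Hypothetical example data.
flavor_labels = {"cloud.provider.com/node-group": "tas-node-group"}
level_keys = ["cloud.provider.com/topology-block", "kubernetes.io/hostname"]

nodes = {
    # Has the flavor label and every level key.
    "node-a": {
        "cloud.provider.com/node-group": "tas-node-group",
        "cloud.provider.com/topology-block": "b1",
        "kubernetes.io/hostname": "node-a",
    },
    # Flavor label matches, but the block label is missing.
    "node-b": {
        "cloud.provider.com/node-group": "tas-node-group",
        "kubernetes.io/hostname": "node-b",
    },
    # Has every level key, but belongs to a different node group.
    "node-c": {
        "cloud.provider.com/node-group": "other-group",
        "cloud.provider.com/topology-block": "b1",
        "kubernetes.io/hostname": "node-c",
    },
}

eligible = [
    name for name, labels in nodes.items()
    if tas_eligible(labels, flavor_labels, level_keys)
]

if __name__ == "__main__":
    print(eligible)
```

Under these assumptions only node-a is considered: node-b lacks one of the level labels, and node-c sits outside the node group named in the ResourceFlavor.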

cc @mwysokin

2 participants