
add kep 4986 sakkara cluster topology #837

Open
wants to merge 1 commit into master

Conversation

atantawi
Contributor

@atantawi atantawi commented Dec 10, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add a KEP for the Sakkara plugin, a hierarchical cluster-topology group scheduler.

Which issue(s) this PR fixes:

Fixes #4986
kubernetes/enhancements#4986

Special notes for your reviewer:

NONE

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 10, 2024
@k8s-ci-robot
Contributor

Hi @atantawi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Dec 10, 2024

netlify bot commented Dec 10, 2024

Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.

🔨 Latest commit: 02fd878
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-scheduler-plugins/deploys/6794159a1819a100082e534f

@googs1025
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 13, 2024
@atantawi
Contributor Author

atantawi commented Jan 7, 2025

/retest-required

@atantawi
Contributor Author

atantawi commented Jan 7, 2025

/retest

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 24, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atantawi
Once this PR has been reviewed and has the lgtm label, please assign denkensk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@13567436138

When will this plugin be implemented?

@atantawi
Contributor Author

When will this plugin be implemented?

I am planning to raise a PR and put the code in there shortly.

@13567436138

@atantawi awesome

@googs1025
Member

/cc

I will try to understand this plugin.

@13567436138

How shortly? I can't wait.

@atantawi
Contributor Author

/retest

@k8s-ci-robot
Contributor

@atantawi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-scheduler-plugins-verify | 02fd878 | link | true | /test pull-scheduler-plugins-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +45 to +55
1- **User A:**

Wants to deploy a job with 8 pods, equally partitioned into two partitions, where each partition is placed in a separate rack and the pods belonging to a partition are packed onto nodes within that rack. Partitioning improves availability, and packing improves performance.

![before-1](images/use-cases/before-1.png)

![after-1](images/use-cases/after-1.png)

2- **User B:**

Wants to place a job with 6 pods packed onto the same physical server as much as possible, with the pods on that server spread across its nodes within the range [2,4] pods per node. Having all pods on the same physical server improves communication among the pods; having fewer than 2 pods on a node spreads the pods too thinly, and having more than 4 may lead to congestion.
Member

My initial thought is: can similar problems be solved with the existing affinity/anti-affinity or Topology Spread Constraints? But this seems to be "multi-level", and there are many things to consider. 🤔
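For context, a minimal sketch of what the built-in Pod Topology Spread Constraints can express today: one topologyKey per constraint, spreading across the chosen domain. The rack label used below is hypothetical, and there is no single built-in mechanism for the combined multi-level pack-and-partition behavior in the use cases above (e.g. User B's [2,4] pods-per-node range).

```yaml
# Sketch only: spreads the pods of job-a evenly across racks, assuming nodes
# carry a (hypothetical) topology.kubernetes.io/rack label. It cannot also
# pack one partition's pods onto few nodes inside the chosen rack, which is
# the multi-level part of the use case.
apiVersion: v1
kind: Pod
metadata:
  name: job-a-worker
  labels:
    app: job-a
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/rack
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: job-a
  containers:
    - name: worker
      image: registry.example.com/job-a:latest
```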


### Cluster topology

The cluster topology is specified as a tree where the leaves are the nodes in the cluster.
Member

I am not sure whether clusters need such a complex definition (at least I do not have such a complex definition internally... this might be my problem). In my work there is only a single, one-dimensional level, described as a "node pool", and I optimize scheduling based on that dimension. 🤔

- *level-names*: Names of the levels as an array of strings ordered top to bottom (excluding the root).
- *tree*: String specification of the tree.

### Group constraints
Member

One question: will the number of ConfigMaps for group constraints increase as more group settings are added? Can we merge the configs into a single one? More configs sometimes means more difficult maintenance.

Contributor Author

There is only one ConfigMap (alternatively, a CR or an extension of the PodGroup) per group. Group constraints are a key in the map (or a field in the spec).
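To make the two keys quoted above concrete, a hypothetical cluster-topology ConfigMap for a tiny two-rack cluster might look like the following; the object name, the JSON encoding of `tree`, and the rack names are assumptions for illustration, not the KEP's final format.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sakkara-cluster-topology   # hypothetical name
data:
  # level names ordered top to bottom, excluding the root
  level-names: '["rack", "node"]'
  # tree as nested JSON; the leaves are the cluster nodes
  tree: |
    {
      "rack-0": { "node-0": {}, "node-1": {} },
      "rack-1": { "node-2": {} }
    }
```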

@atantawi
Contributor Author

atantawi commented Feb 3, 2025

How shortly? I can't wait.

#877


```yaml
apiVersion: v1
kind: ConfigMap
Contributor

Not a blocker for the KEP, but I highly recommend to start with CRD since day 1.

Contributor Author

Agreed. Should we introduce a new CR or extend the PodGroup?
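Purely to make that discussion concrete, a dedicated CR carrying the same per-group information could be sketched as below; the group/version, kind, and every field name are hypothetical, and extending the existing PodGroup with an equivalent field would be the other option.

```yaml
# Hypothetical sketch of a dedicated CR; names and fields are illustrative only.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: GroupTopologyPlacement
metadata:
  name: job-a
spec:
  podGroup: job-a              # reference to the existing PodGroup
  constraints:
    - level: rack              # one entry per topology level
      partitions: 2
    - level: node
      range: [2, 4]
```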

Comment on lines +119 to +121
"node-0": {},
"node-1": {},
"node-2": {}
Contributor

How do we present the topological info if the cluster has a couple of thousand nodes?

Contributor Author

Good question! We need a tool to generate/update the tree definition. Or, we should consider an alternative way to encode the tree. One possibility is to have each node (a leaf in the cluster topology tree) encode the path to the root as labels. The problem is that the plugin would then need to build the tree every time a group needs to be scheduled (unless it's cached and updated whenever node labels change).
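For illustration, the node-label alternative mentioned above might look roughly like this, with the plugin rebuilding the tree by grouping nodes on these labels; the label keys and level names are hypothetical.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-0
  labels:
    # Hypothetical labels encoding the path from this leaf to the root,
    # one label per topology level (top to bottom).
    sakkara.topology/rack: rack-0
    sakkara.topology/server: server-3
```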

sakkara.member.status: Bound
```

The rank, status, and number of placement trials for the pod are provided as labels. This way, an (init) container in the pod may read the rank label and configure itself accordingly.
Contributor

It's an anti-pattern to update a pod's labels to store extra info. Updating a pod is expensive, not to mention updating a gang of pods.

Contributor Author

@atantawi atantawi Feb 6, 2025

The most important piece of data is the rank. Having it as a pod label simplifies the logic for the container to get it initially as an environment variable. Alternatively, rank values for all pods in the group may be written to the CR for the group; the challenge then is how the containers get them. I guess if we're using a ConfigMap for the group, the container could mount it.
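As a sketch of the pod-label route, the Kubernetes Downward API can expose a label to a container as an environment variable; the rank label key below is assumed for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  labels:
    sakkara.member.rank: "3"   # hypothetical rank label set by the plugin
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:latest
      env:
        - name: GROUP_RANK
          valueFrom:
            fieldRef:
              fieldPath: "metadata.labels['sakkara.member.rank']"
```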


![architecture](images/architecture.png)

![plugin state transitions](images/state-transitions.png)
Contributor

After reading the Design section, I still cannot figure out the exact responsibility of each phase. For example, which phase performs the "pre-scheduling" that updates the pod's labels, what happens if the "pre-scheduling" result conflicts with the real-time scheduling decision, how preemption works, etc.

It's essential to illustrate, given a group of pods carrying the group constraints, how they are processed end-to-end in each phase of a scheduling cycle.

Contributor Author

OK, I'll work on providing a clarifying diagram and maybe a table with description.

@alculquicondor

cc @dom4ha @sanposhiho

@dom4ha

dom4ha commented Feb 7, 2025

/cc @johnbelamaric
There are a few aspects that we should discuss regarding the overall approach to topology-aware scheduling:

  1. Resource management in the scheduler becomes more complex, mainly because of rapid DRA development. Some of the resource types being introduced already enforce topology constraints implicitly (see KEP-4815 DRA Partitionable devices support for multi-host kubernetes/enhancements#5069). This means that we cannot assume there is only one topology.
  2. DRA is becoming a new method to define and manage resources (especially dynamically allocated ones), but it also introduces a new way of expressing resource requirements and is evolving toward covering all types of resources that pods may need to run.
  3. We (and a few other parties) are currently exploring the possibility of pre-allocating (reserving) resources. There is a proposal for a first attempt that leverages the DRA framework for this. Note that a reservation does not necessarily need to consider the final Pods at the time of allocation, but may need to consider topology (directly or indirectly).
  4. More fine-grained allocations (not yet proposed anywhere) could become dynamically scheduled/allocated (somehow) and would not leave much room for the scheduler to decide on the final pod placement.
  5. The scheduler might play a crucial role in dynamic scheduling/allocation of such resources/reservations, but this is a very new and vague idea not yet discussed or brainstormed in the community forum (except at the last SIG meeting). Note, however, that ResourceClaims themselves are in fact already a sort of reservation method.
  6. Finally, any TAS implementation makes sense when we talk about group/gang scheduling. This concept is not yet well defined at the Kubernetes level, so before we provide solutions for it we need to introduce such a well-thought-out concept, which may be a challenge in itself. I know that gang scheduling can be modeled in the current scheduler framework, but I'm not sure this is sufficient for long-term maintainability.

I believe that TAS is important and it's great that you are sharing your approach and offering a contribution. We should think about how the proposal interacts with the aspects I mentioned (frankly, I know the questions but not the answers yet). I'm especially curious whether the reservation idea is something that we want to pursue (as a community).

To scope the discussion for now, my major concern is whether a proposal that encloses topology scheduling in a single plugin is capable of taking into consideration external constraints (including implicit topology constraints). IIUC the solver knows the topology and knows the group of pods, but does it know about external constraints imposed by other plugins?
