
add kep 4986 sakkara cluster topology #837

Open
wants to merge 1 commit into master

Conversation

atantawi
Contributor

@atantawi atantawi commented Dec 10, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add a KEP for the Sakkara plugin, a hierarchical cluster-topology group scheduler.

Which issue(s) this PR fixes:

Fixes #4986
kubernetes/enhancements#4986

Special notes for your reviewer:

NONE

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 10, 2024
@k8s-ci-robot
Contributor

Hi @atantawi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Dec 10, 2024

netlify bot commented Dec 10, 2024

Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.

🔨 Latest commit: 02fd878
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-scheduler-plugins/deploys/6794159a1819a100082e534f

@googs1025
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 13, 2024
@atantawi
Contributor Author

atantawi commented Jan 7, 2025

/retest-required

@atantawi
Contributor Author

atantawi commented Jan 7, 2025

/retest

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 24, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atantawi
Once this PR has been reviewed and has the lgtm label, please assign denkensk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@13567436138

When will this plugin be implemented?

@atantawi
Contributor Author

When will this plugin be implemented?

I am planning to raise a PR and put the code in there shortly.

@13567436138

@atantawi awesome

@googs1025
Member

/cc

I will try to understand this plugin.

@13567436138

How shortly? I can't wait.

@atantawi
Contributor Author

/retest

@k8s-ci-robot
Contributor

@atantawi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-scheduler-plugins-verify | 02fd878 | link | true | /test pull-scheduler-plugins-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +45 to +55
1- **User A:**

Wants to deploy a job with 8 pods, equally partitioned into two partitions, where each partition is placed in a separate rack and the pods belonging to a partition are packed onto nodes within that rack. Partitioning improves availability, and packing improves performance.

![before-1](images/use-cases/before-1.png)

![after-1](images/use-cases/after-1.png)

2- **User B:**

Wants to place a job with 6 pods packed onto the same physical server as much as possible, with the pods on that server spread across its nodes within the range [2,4] pods per node. Having all pods on the same physical server improves communication among the pods; having fewer than 2 pods on a node spreads the pods too thinly, and having more than 4 may lead to congestion.
Member

My initial thought is: can similar problems be solved with the existing affinity/anti-affinity or Topology Spread Constraints? But this seems to be "multi-level", and there are many things to consider. 🤔
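For context, a minimal sketch of what the built-in Pod Topology Spread Constraints can express today: one topologyKey per constraint, spreading across the chosen domain. The rack label used below is hypothetical, and there is no single built-in mechanism for the combined multi-level pack-and-partition behavior in the use cases above (e.g. User B's [2,4] pods-per-node range).

```yaml
# Sketch only: spreads the pods of job-a evenly across racks, assuming nodes
# carry a (hypothetical) topology.kubernetes.io/rack label. It cannot also
# pack one partition's pods onto few nodes inside the chosen rack, which is
# the multi-level part of the use case.
apiVersion: v1
kind: Pod
metadata:
  name: job-a-worker
  labels:
    app: job-a
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/rack
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: job-a
  containers:
    - name: worker
      image: registry.example.com/job-a:latest
```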


### Cluster topology

The cluster topology is specified as a tree where the leaves are the nodes in the cluster.
Member

I am not sure whether clusters need such a complex definition (at least I do not have such a complex definition internally... this might be my problem). In my work there is only a single, one-dimensional level, described as a "node pool", and I optimize scheduling based on that dimension. 🤔

- *level-names*: Names of the levels as an array of strings ordered top to bottom (excluding the root).
- *tree*: String specification of the tree.

### Group constraints
Member

One question: will the number of ConfigMaps for group constraints increase as more group settings are added? Can we merge the configs into a single one? More configs sometimes means more difficult maintenance.

Contributor Author

There is only one ConfigMap (alternatively, a CR or an extension of the PodGroup) per group. Group constraints are a key in the map (or a field in the spec).
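To make the two keys quoted above concrete, a hypothetical cluster-topology ConfigMap for a tiny two-rack cluster might look like the following; the object name, the JSON encoding of `tree`, and the rack names are assumptions for illustration, not the KEP's final format.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sakkara-cluster-topology   # hypothetical name
data:
  # level names ordered top to bottom, excluding the root
  level-names: '["rack", "node"]'
  # tree as nested JSON; the leaves are the cluster nodes
  tree: |
    {
      "rack-0": { "node-0": {}, "node-1": {} },
      "rack-1": { "node-2": {} }
    }
```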

@atantawi
Contributor Author

atantawi commented Feb 3, 2025

How shortly? I can't wait.

#877


```yaml
apiVersion: v1
kind: ConfigMap
Contributor

Not a blocker for the KEP, but I highly recommend to start with CRD since day 1.

Contributor Author

Agreed. Should we introduce a new CR or extend the PodGroup?
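Purely to make that discussion concrete, a dedicated CR carrying the same per-group information could be sketched as below; the group/version, kind, and every field name are hypothetical, and extending the existing PodGroup with an equivalent field would be the other option.

```yaml
# Hypothetical sketch of a dedicated CR; names and fields are illustrative only.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: GroupTopologyPlacement
metadata:
  name: job-a
spec:
  podGroup: job-a              # reference to the existing PodGroup
  constraints:
    - level: rack              # one entry per topology level
      partitions: 2
    - level: node
      range: [2, 4]
```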

Comment on lines +119 to +121
"node-0": {},
"node-1": {},
"node-2": {}
Contributor

How do we present the topological info if the cluster has a couple of thousand nodes?

Contributor Author

Good question! We need a tool to generate/update the tree definition. Or, we should consider an alternative way to encode the tree. One possibility is to have each node (a leaf in the cluster topology tree) encode the path to the root as labels. The problem is that the plugin would then need to build the tree every time a group needs to be scheduled (unless it's cached and updated whenever node labels change).
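For illustration, the node-label alternative mentioned above might look roughly like this, with the plugin rebuilding the tree by grouping nodes on these labels; the label keys and level names are hypothetical.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-0
  labels:
    # Hypothetical labels encoding the path from this leaf to the root,
    # one label per topology level (top to bottom).
    sakkara.topology/rack: rack-0
    sakkara.topology/server: server-3
```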

sakkara.member.status: Bound
```

The rank, status, and number of placement trials for the pod are provided as labels. This way, an (init) container in the pod may read the rank label and configure itself accordingly.
Contributor

It's an anti-pattern to update a pod's labels to store extra info. Updating a pod is expensive, not to mention updating a gang of pods.

Contributor Author

@atantawi atantawi Feb 6, 2025

The most important piece of data is the rank. Having it as a pod label simplifies the logic for the container to get it initially as an environment variable. Alternatively, rank values for all pods in the group may be written to the CR for the group; the challenge then is how the containers get them. I guess if we're using a ConfigMap for the group, the container could mount it.
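As a sketch of the pod-label route, the Kubernetes Downward API can expose a label to a container as an environment variable; the rank label key below is assumed for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  labels:
    sakkara.member.rank: "3"   # hypothetical rank label set by the plugin
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:latest
      env:
        - name: GROUP_RANK
          valueFrom:
            fieldRef:
              fieldPath: "metadata.labels['sakkara.member.rank']"
```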


![architecture](images/architecture.png)

![plugin state transitions](images/state-transitions.png)
Contributor

After reading the Design section, I still cannot figure out the exact responsibility of each phase. For example, which phase performs the "pre-scheduling" that updates the pod's labels, what happens if the "pre-scheduling" result conflicts with the real-time scheduling decision, how preemption works, etc.

It's essential to illustrate, given a group of pods carrying the group constraints, how they are processed end-to-end in each phase of a scheduling cycle.

Contributor Author

OK, I'll work on providing a clarifying diagram and maybe a table with description.

@alculquicondor

cc @dom4ha @sanposhiho

@dom4ha

dom4ha commented Feb 7, 2025

/cc @johnbelamaric
There are a few aspects that we should discuss regarding the overall approach to topology-aware scheduling:

  1. Resource management in the scheduler becomes more complex, mainly because of rapid DRA development. Some of the resource types being introduced already enforce topology constraints implicitly (see KEP-4815 DRA Partitionable devices support for multi-host kubernetes/enhancements#5069). This means that we cannot assume there is only one topology.
  2. DRA is becoming a new method to define and manage resources (especially dynamically allocated ones), but it also introduces a new way of expressing resource requirements and is evolving toward covering all types of resources that pods may need to run.
  3. We (and a few other parties) are currently exploring the possibility of pre-allocating (reserving) resources. There is a proposal for a first attempt that leverages the DRA framework for this. Note that a reservation does not necessarily need to consider the final Pods at the time of allocation, but may need to consider topology (directly or indirectly).
  4. More fine-grained allocations (not yet proposed anywhere) could become dynamically scheduled/allocated (somehow) and would not leave much room for the scheduler to decide on the final pod placement.
  5. The scheduler might play a crucial role in dynamic scheduling/allocation of such resources/reservations, but this is a very new and vague idea not yet discussed or brainstormed in the community forum (except at the last SIG meeting). Note, however, that ResourceClaims themselves are in fact already a sort of reservation method.
  6. Finally, any TAS implementation makes sense when we talk about group/gang scheduling. This concept is not yet well defined at the Kubernetes level, so before we provide solutions for it we need to introduce such a well-thought-out concept, which may be a challenge in itself. I know that gang scheduling can be modeled in the current scheduler framework, but I'm not sure this is sufficient for long-term maintainability.

I believe that TAS is important and it's great that you are sharing your approach and offering a contribution. We should think about how the proposal interacts with the aspects I mentioned (frankly, I know the questions but not the answers yet). I'm especially curious whether the reservation idea is something that we want to pursue (as a community).

To scope the discussion for now, my major concern is whether a proposal that encloses topology scheduling in a single plugin is capable of taking into consideration external constraints (including implicit topology constraints). IIUC the solver knows the topology and knows the group of pods, but does it know about external constraints imposed by other plugins?
