add kep 4986 sakkara cluster topology #837
base: master
Conversation
Hi @atantawi. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
✅ Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.
/ok-to-test
/retest-required
/retest
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: atantawi. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
When will this plugin be implemented?
I am planning to raise a PR and put the code in there shortly.
@atantawi awesome
/cc I will try to understand this plugin.
How short?
/retest
@atantawi: The following test failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
1- **User A:**
Wants to deploy a job with 8 pods, equally divided into two partitions, where each partition is placed in a separate rack and the pods belonging to a partition are packed on nodes within that rack. Partitioning improves availability, and packing improves performance.
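To make the user story concrete, here is a purely hypothetical sketch of how User A's constraints might be written in the per-group ConfigMap that this KEP introduces later; the key names (`constraints`, `partitions`, `levels`, `type`) are illustrative assumptions, not the KEP's actual schema.

```yaml
# Hypothetical sketch only: key names and structure are assumed for
# illustration and are not the schema defined by this KEP.
apiVersion: v1
kind: ConfigMap
metadata:
  name: sakkara-group-job-a        # assumed name of the per-group ConfigMap
data:
  constraints: |
    size: 8
    partitions: 2                  # split the 8 pods into 2 equal partitions
    levels:
      - name: rack
        type: spread               # place each partition in a separate rack
      - name: node
        type: pack                 # pack a partition's pods onto few nodes in its rack
```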
2- **User B:**
Wants to place a job with 6 pods packed on the same physical server, as much as possible, where the pods on that server are spread across its nodes with between 2 and 4 pods per node. Having all pods on the same physical server improves communication among the pods; having fewer than 2 pods on a node spreads the pods too thinly, and having more than 4 may lead to congestion.
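Again purely as an illustration, under the same caveat that the schema shown here is assumed rather than the KEP's:

```yaml
# Hypothetical sketch only: key names and structure are assumed.
apiVersion: v1
kind: ConfigMap
metadata:
  name: sakkara-group-job-b        # assumed name of the per-group ConfigMap
data:
  constraints: |
    size: 6
    levels:
      - name: server
        type: pack                 # keep all 6 pods on one physical server if possible
      - name: node
        type: range
        min: 2                     # at least 2 pods per node (avoid spreading too thinly)
        max: 4                     # at most 4 pods per node (avoid congestion)
```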
My initial thought is: can similar problems be solved by using existing affinity, anti-affinity, or Topology Spread Constraints? But this seems to be "multi-level", and there are many things to consider. 🤔
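For reference, this is roughly what the existing single-level mechanism referred to above looks like; it spreads pods across one topology key at a time, which is why the multi-level, range-bounded placement in the user stories above is hard to express with it alone (the pod and app names are placeholders):

```yaml
# Existing Pod Topology Spread Constraints: one topology key, one maxSkew,
# no notion of nested levels or per-level pod-count ranges.
apiVersion: v1
kind: Pod
metadata:
  name: job-member                 # placeholder name
  labels:
    app: my-job
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-job
  containers:
    - name: main
      image: registry.k8s.io/pause:3.9   # pause image keeps the example self-contained
```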
### Cluster topology
The cluster topology is specified as a tree where the leaves are the nodes in the cluster. |
I am not sure whether clusters need such a complex definition (at least I do not have such a complex definition internally... this might be my problem). In my work, there is only a one-dimensional level, which is described as "node pool". And I optimize scheduling based on this dimension. 🤔
- *level-names*: Names of the levels as an array of strings ordered top to bottom (excluding the root).
- *tree*: String specification of the tree (a hypothetical encoding is sketched below).
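A hypothetical sketch of such a topology specification, assuming it lives in a ConfigMap; the object name, level names, and the exact tree encoding are illustrative (the tree here is written as nested JSON, consistent with the node entries shown further down, but the KEP's final encoding may differ):

```yaml
# Hypothetical sketch: names and encoding are assumptions for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: sakkara-cluster-topology   # assumed name
  namespace: kube-system           # assumed namespace
data:
  level-names: '["rack", "server", "node"]'
  tree: |
    {
      "rack-0": {
        "server-0": { "node-0": {}, "node-1": {} },
        "server-1": { "node-2": {} }
      },
      "rack-1": {
        "server-2": { "node-3": {}, "node-4": {} }
      }
    }
```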
### Group constraints |
One question: will the number of ConfigMaps for group constraints increase as more group settings are added? Can we merge configs into the same one? More configs sometimes mean more difficult maintenance.
There is only one ConfigMap (alternatively, a CR or an extension of the PodGroup) per group. The group constraints are a key in the map (or a spec).
```yaml
apiVersion: v1
kind: ConfigMap
```
Not a blocker for the KEP, but I highly recommend starting with a CRD since day 1.
Agreed. Should we introduce a new CR or extend the PodGroup?
"node-0": {}, | ||
"node-1": {}, | ||
"node-2": {} |
How do we present the topological info if the cluster has a couple of thousand nodes?
Good question! We need a tool to generate/update the tree definition. Or, we should consider an alternative way to encode the tree. One possibility is to have each node (a leaf in the cluster topology tree) encode its path to the root as labels. The problem is that the plugin would need to create the tree every time a group needs to be scheduled (unless it's cached and updated whenever node labels change).
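A sketch of that alternative, with hypothetical label keys: each Node carries its path to the root of the topology tree as one label per level, and the plugin rebuilds (and caches) the tree from them.

```yaml
# Hypothetical label keys; not part of the current KEP text.
apiVersion: v1
kind: Node
metadata:
  name: node-0
  labels:
    sakkara.topology/rack: rack-0
    sakkara.topology/server: server-0
```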
```yaml
sakkara.member.status: Bound
```
The rank, status, and number of placement trials for the pod are provided as labels. This way, an (init) container in the pod may read the rank label and configure itself accordingly. |
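For illustration, an init container can pick up such a label through the downward API; the rank label key below (`sakkara.member.rank`) is an assumption, since only `sakkara.member.status` is shown in the excerpt above.

```yaml
# Sketch: expose an assumed rank label to an init container via the downward API.
apiVersion: v1
kind: Pod
metadata:
  name: job-member-0
  labels:
    sakkara.member.rank: "0"       # assumed label key, written by the scheduler plugin
spec:
  initContainers:
    - name: configure-rank
      image: busybox
      command: ["sh", "-c", "echo rank=$MEMBER_RANK > /work/rank.env"]
      env:
        - name: MEMBER_RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['sakkara.member.rank']
      volumeMounts:
        - name: work
          mountPath: /work
  containers:
    - name: main
      image: busybox
      command: ["sh", "-c", "cat /work/rank.env && sleep 3600"]
      volumeMounts:
        - name: work
          mountPath: /work
  volumes:
    - name: work
      emptyDir: {}
```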
It's an anti-pattern to update a pod's labels to store extra info. Updating a pod is expensive, not to mention updating a gang of pods.
The most important data is the rank. Having it as a pod label simplifies the logic of the container getting it initially as an environment variable. Alternatively, rank values for all pods in the group may be written in the CR for the group. The challenge is how the containers would get them. I guess if we're using a ConfigMap for the group, the container could mount it.
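A sketch of that alternative: mounting the per-group ConfigMap so containers can read the rank values from a file. The ConfigMap name and key are placeholders, not defined by the KEP.

```yaml
# Sketch: mount the group ConfigMap (assumed name/key) into the pod.
apiVersion: v1
kind: Pod
metadata:
  name: job-member-0
spec:
  containers:
    - name: main
      image: busybox
      command: ["sh", "-c", "cat /etc/sakkara/ranks && sleep 3600"]
      volumeMounts:
        - name: group-config
          mountPath: /etc/sakkara
          readOnly: true
  volumes:
    - name: group-config
      configMap:
        name: sakkara-group-job-a   # assumed per-group ConfigMap name
        items:
          - key: ranks              # assumed key holding per-member ranks
            path: ranks
```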
After reading the Design section, I still cannot figure out the exact responsibility of each phase: which phase does the "pre-scheduling" that updates the pod's labels, what happens if the "pre-scheduling" result conflicts with the real-time scheduling decision, how preemption works, etc.
It's quite essential to illustrate, given a group of pods carrying the group constraints, how they're processed end to end in each phase of a scheduling cycle.
OK, I'll work on providing a clarifying diagram and maybe a table with description.
/cc @johnbelamaric
I believe that TAS scheduling is important and it's great that you are sharing your approach and offering a contribution. I believe that we should think about how the proposal interacts with the aspects that I mentioned (frankly, I know the questions but not the answers yet). I'm especially curious whether the reservation idea is something that we want to pursue (as a community). To scope the discussion for now, my major concern is whether a proposal based on topology scheduling enclosed in a single plugin is capable of taking into consideration external constraints (including implicit topology constraints). IIUC the solver knows the topology and knows the group of pods, but does it know about the external constraints put by other plugins?
What type of PR is this?
/kind feature
What this PR does / why we need it:
Add a KEP for the Sakkara plugin, a hierarchical cluster-topology group scheduler.
Which issue(s) this PR fixes:
Fixes #4986
kubernetes/enhancements#4986
Special notes for your reviewer:
NONE
Does this PR introduce a user-facing change?