Add node affinity feature enhancements REP #22
Conversation
This REP seems like it would be blocked on the autoscaler / GCS refactoring proposal, right @scv119? Do we need to have that out first before reviewing this one?
3. Scheduling optimization through Labels
Every node's raylet now holds the static label information for all nodes in the cluster.
If NodeAffinity scheduling traversed the labels of every node, the algorithmic complexity would be high and performance would be poor.
**Therefore, it is necessary to generate a full-cluster node label index table to improve scheduling performance.**
What's the index strategy for the more complex expressions such as negation and containment? I wonder if we need to cache the list of valid nodes per task/expression to avoid O(n^2) scheduling slowdowns with many nodes and tasks.
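One common answer to this question is an inverted index over node labels, where negation is answered by set difference against the full node set. The sketch below is illustrative only, assuming nodes report static labels to the GCS; the class and method names (`LabelIndex`, `match_in`, `match_not_in`) are hypothetical and not part of Ray's actual implementation.

```python
from collections import defaultdict

class LabelIndex:
    """Hypothetical inverted index: label key -> value -> set of node ids."""

    def __init__(self):
        self._index = defaultdict(lambda: defaultdict(set))
        self._all_nodes = set()

    def add_node(self, node_id, labels):
        self._all_nodes.add(node_id)
        for key, value in labels.items():
            self._index[key][value].add(node_id)

    def match_in(self, key, values):
        # label_in: union of the node sets for each candidate value,
        # proportional to len(values) rather than the number of nodes.
        result = set()
        for value in values:
            result |= self._index[key].get(value, set())
        return result

    def match_not_in(self, key, values):
        # Negation via set difference against all known nodes, which is
        # why the full node set must be tracked alongside the index.
        return self._all_nodes - self.match_in(key, values)

idx = LabelIndex()
idx.add_node("n1", {"instance": "4C8G", "dc": "zone-1"})
idx.add_node("n2", {"instance": "8C16G", "dc": "zone-1"})
assert idx.match_in("instance", ["4C8G"]) == {"n1"}
assert idx.match_not_in("dc", ["zone-2"]) == {"n1", "n2"}
```

Even with such an index, caching the resolved node set per queued expression (as suggested above) would still help when many tasks share the same constraint.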
This looks great, thanks for putting it together!
The main comment we should resolve before approval is about the autoscaling implementation. Ideally we would not block this REP on the autoscaler refactor, but it depends on the complexity of this design. Could you address the comment I left about this, and then we can better determine order of operations?
The other question I had is about semantics for "soft". Can you confirm that soft only considers cluster feasibility of a request's label constraints? I.e., it would still schedule a task to a feasible node even if the node has high load?
3. If a custom_resource happens to be the same as the spliced string of labels, it will affect the correctness of scheduling.
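To illustrate the concern above: if labels were implemented by splicing key and value into a custom-resource name, a user-defined custom resource with the same spliced name would be indistinguishable from a label. The separator and encoding below are hypothetical, purely to demonstrate the collision, and are not Ray's actual scheme.

```python
# Hypothetical label-to-resource encoding; "@" as separator is illustrative.
SEPARATOR = "@"

def label_to_resource(key, value):
    # e.g. ("instance", "4C8G") -> "instance@4C8G"
    return f"{key}{SEPARATOR}{value}"

encoded = label_to_resource("instance", "4C8G")

# If a user registers a custom resource literally named "instance@4C8G",
# the scheduler cannot tell it apart from the encoded label, so a task
# requesting that custom resource could be mis-scheduled as label affinity.
user_custom_resource = "instance@4C8G"
assert encoded == user_custom_resource  # collision
```

This is why keeping labels as a first-class concept, rather than string-encoding them into resources, avoids the correctness issue.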
### AutoScaler adaptation
I understand that adapting this part of the AutoScaler should not be difficult. NodeAffinity is now just one more scheduling strategy; we can follow the existing AutoScaler implementation and adapt it to the new strategy.
This proposal is more advanced than the autoscaling for NodeAffinity.
NodeAffinity is only used for nodes that are already in the cluster, so I believe the only change there was to make sure that the autoscaler didn't mistakenly start additional nodes for a queued task that wouldn't actually be able to be scheduled there.
Since this proposal is for more flexible labels, we will need the autoscaler to be aware of which nodes it should add or remove from the cluster based on the label constraints of queued tasks. So this actually is something the proposal should cover. Specifically:
- what API changes do we need at the cluster config level and for the interface between the autoscaler and the GCS?
- what will the autoscaler policy look like?
- how will we handle scalability issues? (If an application uses too many different label policies, we'll have to report all of these in the task load.)
Yeah I don't think this is quite so simple. The autoscaler runs emulated bin-packing to determine what types of nodes it needs to add to the cluster. For example, suppose a user requests {"dc": "zone-1"}. Then, the autoscaler will need to run its bin packing routine with this request and node labels in mind, in order to determine the nodes to launch: https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/resource_demand_scheduler.py
As you can see, the logic is algorithmically complex.
My assessment is that we probably need to refactor the autoscaler to unify this bin packing routine with the C++ implementation, to avoid having to write label affinity evaluation in both Python and C++.
@ericl @stephanie-wang
After thinking about it carefully, this piece is indeed very complicated. I learned that @scv119 (Chen Shen) is currently working on the AutoScaler refactoring REP. Because NodeAffinity/ActorAffinity will take a while to develop, I think we can focus on adapting to the new version of the AutoScaler. I will also actively discuss the AutoScaler redesign with @scv119 (Chen Shen).
I think the AutoScaler adaptation is divided into two parts.
- Interaction API between the AutoScaler and the GCS.
  I plan to adapt it following the existing approach, mainly adding new fields.
- AutoScaler policy with node affinity (which decides what nodes need to be added to the cluster).
  This part is indeed more complicated, and can be realized as follows:
- The types of nodes that the AutoScaler can add to the cluster are generally predefined or configured. For example:
  - Node Type 1: 4C8G, labels={"instance": "4C8G"}
  - Node Type 2: 8C16G, labels={"instance": "8C16G"}
  - Node Type 3: 16C32G, labels={"instance": "16C32G"}
- The NodeAffinity label matches one of the predefined node types.
  For example, consider the following scenario:
  `Actor.options(num_cpus=1, scheduling_strategy=NodeAffinity(label_in("instance", "4C8G"))).remote()`
  If the Actor is pending, the AutoScaler traverses the predefined node types to see which one meets the requirements [resource: 1C, node label: {"instance": "4C8G"}]. If one does, it adds a node of that type to the cluster.
- The NodeAffinity label is a unique, node-specific label.
  In this scenario, the Actor/Task wants to be co-located with one particular node, and there is no need to scale up a node for it. For example:
  `Actor.options(num_cpus=1, scheduling_strategy=NodeAffinity(label_in("node_ip", "xxx.xx.xx.xx"))).remote()`
- Anti-affinity to a node with a special label.
  We can pre-judge whether the predefined node types can satisfy the anti-affinity requirement, and if so, add such a node to the cluster.
- Soft strategy.
  We can pre-judge whether the labels and resources of the predefined node types can be satisfied, and if so, use such a node. If no node type satisfies the labels, just add a node with enough resources.
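The pre-judgement described in the bullets above can be sketched as a small matching routine over the configured node types. Everything here is illustrative: the node-type table, field names, and the `pick_node_type` helper are assumptions for the sketch, not the actual AutoScaler policy.

```python
# Hypothetical table of configured node types, mirroring the example above.
NODE_TYPES = [
    {"name": "4C8G",   "cpu": 4,  "labels": {"instance": "4C8G"}},
    {"name": "8C16G",  "cpu": 8,  "labels": {"instance": "8C16G"}},
    {"name": "16C32G", "cpu": 16, "labels": {"instance": "16C32G"}},
]

def pick_node_type(num_cpus, label_in=None, soft=False):
    """Return the name of a node type to launch for a pending request, or None."""
    # Hard pass: labels must match AND resources must fit.
    for nt in NODE_TYPES:
        if nt["cpu"] < num_cpus:
            continue
        if label_in is None or all(
            nt["labels"].get(k) == v for k, v in label_in.items()
        ):
            return nt["name"]
    # Soft strategy: if no type satisfies the labels, fall back to any
    # type with enough resources rather than leaving the task pending.
    if soft:
        for nt in NODE_TYPES:
            if nt["cpu"] >= num_cpus:
                return nt["name"]
    # Covers the "unique special label" case (e.g. node_ip): no configured
    # type can ever carry that label, so we never scale up for it.
    return None

assert pick_node_type(1, {"instance": "4C8G"}) == "4C8G"
assert pick_node_type(1, {"node_ip": "10.0.0.1"}) is None
assert pick_node_type(1, {"node_ip": "10.0.0.1"}, soft=True) == "4C8G"
```

Anti-affinity would follow the same shape with the label predicate inverted; the real policy would of course also run this inside the bin-packing loop rather than per request.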
@scv119 @ericl @stephanie-wang
@ericl @stephanie-wang @wumuzi520 SenlinZhu @Chong Li @scv119 (Chen Shen) @Sang Cho @jjyao (Jiajun Yao) @Yi Cheng
### Shepherd of the Proposal (should be a senior committer)
This is a required piece of information. I suggest @scv119.
Agreed. I will add it.
Signed-off-by: 稚鱼 <[email protected]>
@larrylian let me know once you are done with the autoscaler part.
The implementation & API makes sense to me. I have a couple questions regarding usability for common cases.
@rkooo567 Thanks for the very helpful advice.
I don't really understand what you mean. Could you explain?
Oh, I think it is actually the same comment as #22 (comment). So I believe it is addressed.
The value of introducing the labels mechanism for scheduling was discussed earlier in REP [Add labels mechanism] #13. NodeAffinity is now discussed separately from that REP.
In fact, everyone should already be familiar with NodeAffinity. I think we can reach an agreement on the following points in this REP.