
Add node affinity feature enhancements REP #22

Merged (3 commits) on Apr 17, 2023

Conversation

larrylian
Contributor

@larrylian larrylian commented Feb 6, 2023

The value of introducing the Labels mechanism for scheduling was discussed earlier in REP #13 ([Add labels mechanism]). NodeAffinity is now discussed separately in this REP.

In fact, everyone already knows about NodeAffinity. I think we can reach agreement on the following points in this REP:

  • Agree on the API format
  • Agree on the main implementation plan
  • Raise specific single-point questions, then discuss them and reach a consensus
  • Discuss the remaining details during the coding process

@ericl
Contributor

ericl commented Feb 6, 2023

This REP seems like it would be blocked on the autoscaler / GCS refactoring proposal, right @scv119 ? Do we need to have that out first before reviewing this one?

3. Scheduling optimization through Labels
Every raylet now holds the static label information of all nodes.
If NodeAffinity scheduling traversed the labels of every node, the algorithmic complexity would be very high and performance would suffer.
**Therefore, it is necessary to generate a full-cluster node-labels index table to improve scheduling performance.**
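As a minimal sketch (all names here are illustrative, not the REP's actual implementation), such an index table can be a map from (label key, label value) pairs to the set of node IDs carrying them, so a `label_in` lookup touches only the matching values instead of scanning every node:

```python
from collections import defaultdict


class LabelIndex:
    """Cluster-wide index: (label_key, label_value) -> set of node IDs."""

    def __init__(self):
        self._index = defaultdict(set)

    def add_node(self, node_id, labels):
        # Register every key/value pair of the node's static labels.
        for key, value in labels.items():
            self._index[(key, value)].add(node_id)

    def remove_node(self, node_id, labels):
        for key, value in labels.items():
            self._index[(key, value)].discard(node_id)

    def nodes_with_label_in(self, key, values):
        # Union over the requested values: O(|values|) lookups,
        # independent of the total number of nodes in the cluster.
        result = set()
        for value in values:
            result |= self._index[(key, value)]
        return result


index = LabelIndex()
index.add_node("node-1", {"instance": "4C8G", "dc": "zone-1"})
index.add_node("node-2", {"instance": "8C16G", "dc": "zone-1"})
print(index.nodes_with_label_in("instance", ["4C8G"]))  # {'node-1'}
```

Anti-affinity (`label_not_in`) could then be answered by subtracting the indexed set from the set of all node IDs.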
Contributor


What's the index strategy for the more complex expressions such as negation and containment? I wonder if we need to cache the list of valid nodes per task/expression to avoid O(n^2) scheduling slowdowns with many nodes and tasks.

Contributor Author


The label indexing strategy is actually very simple; I have added it to the REP. The label index only needs to look up the nodes that carry the given keys and values.

@larrylian
Contributor Author

larrylian commented Feb 7, 2023

> This REP seems like it would be blocked on the autoscaler / GCS refactoring proposal, right @scv119 ? Do we need to have that out first before reviewing this one?
@ericl
Thanks for checking it so quickly; I just finished it now.
The GCS changes in this REP are small, so I didn't describe them in a separate chapter. I also briefly mentioned the AutoScaler; if you think there are open questions, I can add more detail.

Contributor

@stephanie-wang stephanie-wang left a comment


This looks great, thanks for putting it together!

The main comment we should resolve before approval is about the autoscaling implementation. Ideally we would not block this REP on the autoscaler refactor, but it depends on the complexity of this design. Could you address the comment I left about this, and then we can better determine order of operations?

The other question I had is about semantics for "soft". Can you confirm that soft only considers cluster feasibility of a request's label constraints? I.e., it would still schedule a task to a feasible node even if the node has high load?

3. If a custom_resource happens to be identical to the spliced label string, it will affect the correctness of scheduling.
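The concern quoted above can be illustrated with a toy encoding (the prefix and separator below are assumptions for illustration, not the REP's actual scheme):

```python
# Hypothetical scheme that splices a node label into a custom resource name.
def label_as_resource(key, value):
    return f"node_label:{key}:{value}"


# The node label {"instance": "4C8G"} becomes this resource name...
encoded = label_as_resource("instance", "4C8G")

# ...which collides with a user who happens to define a custom resource
# with exactly the same spliced string:
user_custom_resource = "node_label:instance:4C8G"
print(encoded == user_custom_resource)  # True: the scheduler cannot tell them apart
```

Under such an encoding the scheduler has no way to distinguish a genuine label from a same-named user resource, which is the correctness problem the REP calls out.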

### AutoScaler adaptation
I understand that adapting this part of the AutoScaler should not be difficult. NodeAffinity is just one more scheduling strategy, so we can follow the existing AutoScaler implementation and adapt it to the new strategy.
Contributor


This proposal is more advanced than the autoscaling for NodeAffinity.

NodeAffinity is only used for nodes that are already in the cluster, so I believe the only change there was to make sure that the autoscaler didn't mistakenly start additional nodes for a queued task that wouldn't actually be able to be scheduled there.

Since this proposal is for more flexible labels, we will need the autoscaler to be aware of which nodes it should add or remove from the cluster based on the label constraints of queued tasks. So this actually is something the proposal should cover. Specifically:

  • What API changes do we need at the cluster config level and for the interface between the autoscaler and the GCS?
  • What will the autoscaler policy look like?
  • How do we handle scalability issues? (If an application uses too many different label policies, we'll have to report all of them in the task load.)

Contributor


Yeah I don't think this is quite so simple. The autoscaler runs emulated bin-packing to determine what types of nodes it needs to add to the cluster. For example, suppose a user requests {"dc": "zone-1"}. Then, the autoscaler will need to run its bin packing routine with this request and node labels in mind, in order to determine the nodes to launch: https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/resource_demand_scheduler.py

As you can see, the logic is algorithmically complex.

My assessment is that we probably need to refactor the autoscaler to unify this bin packing routine with the C++ implementation, to avoid having to write label affinity evaluation in both Python and C++.
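To make the concern concrete, here is a deliberately simplified, hypothetical first-fit sketch of label-aware node selection; the real resource_demand_scheduler linked above is far more involved, and every name and dict shape here is an assumption for illustration:

```python
def pick_node_types_to_launch(pending, node_types):
    """First-fit sketch: for each pending request (resources + required
    labels), find a launchable node type whose labels and per-node
    capacity satisfy it."""
    to_launch = []
    for demand in pending:
        for nt in node_types:
            labels_ok = all(nt["labels"].get(k) == v
                            for k, v in demand["labels"].items())
            resources_ok = all(nt["resources"].get(r, 0) >= amt
                               for r, amt in demand["resources"].items())
            if labels_ok and resources_ok:
                to_launch.append(nt["name"])
                break  # first fit; real bin packing also packs multiple demands
    return to_launch


node_types = [
    {"name": "4C8G",  "resources": {"CPU": 4}, "labels": {"instance": "4C8G"}},
    {"name": "8C16G", "resources": {"CPU": 8}, "labels": {"instance": "8C16G"}},
]
pending = [{"resources": {"CPU": 1}, "labels": {"instance": "4C8G"}}]
print(pick_node_types_to_launch(pending, node_types))  # ['4C8G']
```

Even this toy version shows why the label predicate has to be evaluated inside the bin-packing loop, which is what motivates unifying it with the C++ scheduler implementation rather than maintaining it in both languages.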

Contributor Author

@larrylian larrylian Feb 13, 2023


@ericl @stephanie-wang
After thinking about it carefully, this part is indeed very complicated. I learned that @scv119 (Chen Shen) is currently working on the AutoScaler refactoring REP. Since NodeAffinity/ActorAffinity will take a while to develop, I think we can focus on adapting to the new version of the AutoScaler. I will also actively discuss the AutoScaler redesign with @scv119 (Chen Shen).

I think the AutoScaler adaptation divides into two parts:

  1. The interaction API between the AutoScaler and the GCS.
    I will adapt it following the original approach, mainly adding new fields.

  2. The AutoScaler policy with node affinity (which decides what nodes to add to the cluster).
    This part is indeed more complicated, and can be handled as follows:

  1. The node types the AutoScaler can add to the cluster are generally predefined or configured. For example:
    Node type 1: 4C8G, labels={"instance": "4C8G"}
    Node type 2: 8C16G, labels={"instance": "8C16G"}
    Node type 3: 16C32G, labels={"instance": "16C32G"}
  2. The NodeAffinity label matches one of these standby node types.
    For example, in the following scenario:
    Actor.options(num_cpus=1, scheduling_strategy=NodeAffinity(label_in("instance", "4C8G"))).remote()
    If the actor is pending, the autoscaler traverses the prepared node types to see which one meets the requirements [resources: 1 CPU, node label: {"instance": "4C8G"}]. If one does, add it to the cluster.
  3. The NodeAffinity label is a unique special label.
    In this scenario the actor/task is tied to one particular node, and there is no need to scale up a node for it. E.g.:
    Actor.options(num_cpus=1, scheduling_strategy=NodeAffinity(label_in("node_ip", "xxx.xx.xx.xx"))).remote()
  4. Anti-affinity to a node with a special label.
    We can pre-check whether the prefabricated node types satisfy the anti-affinity requirement, and if so, add one to the cluster.
  5. Soft strategy.
    We can pre-check whether the labels and resources of the prefabricated node types can be satisfied; if so, use such a node. If no labels can be satisfied, just add a node with enough resources.
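The cases above could be sketched roughly as follows; this is purely illustrative, and SPECIAL_KEYS, the dict shapes, and the fallback order are assumptions rather than the final design:

```python
# Assumed identity labels that should never trigger a scale-up
# (case: affinity to one unique, special node).
SPECIAL_KEYS = {"node_ip", "node_id"}


def should_launch_for(demand, node_types):
    """Decide which prepared node type (if any) to launch for one pending
    task/actor demand. Returns a node type name, or None for no scale-up."""
    # Affinity to a unique special label: never expand the cluster for it.
    if any(k in SPECIAL_KEYS for k in demand["labels"]):
        return None
    # Hard affinity: launch a prepared node type matching labels + resources.
    for nt in node_types:
        if (all(nt["labels"].get(k) == v for k, v in demand["labels"].items())
                and all(nt["resources"].get(r, 0) >= a
                        for r, a in demand["resources"].items())):
            return nt["name"]
    # Soft affinity: no labeled match, fall back to resources alone.
    if demand.get("soft"):
        for nt in node_types:
            if all(nt["resources"].get(r, 0) >= a
                   for r, a in demand["resources"].items()):
                return nt["name"]
    return None


node_types = [
    {"name": "4C8G",  "resources": {"CPU": 4}, "labels": {"instance": "4C8G"}},
    {"name": "8C16G", "resources": {"CPU": 8}, "labels": {"instance": "8C16G"}},
]
demand = {"resources": {"CPU": 1}, "labels": {"instance": "4C8G"}}
print(should_launch_for(demand, node_types))  # 4C8G
```

Anti-affinity would slot in as one more pre-check over the prepared node types before the hard-affinity loop.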

@larrylian
Contributor Author

larrylian commented Feb 15, 2023

@scv119 @ericl @stephanie-wang

  1. We understand that you also have demands for Runtime Env acceleration, so I have added a proposal to use Node Labels to achieve Runtime Env acceleration; please take a look.

  2. Do you have any comments on the two REPs, Node Affinity and Actor Affinity? Can you help advance them so that I can start development as soon as possible?

cc @wumuzi520 @SongGuyang

Use Node Labels to achieve Runtime Env acceleration:


@ericl @stephanie-wang @wumuzi520 SenlinZhu @Chong Li @scv119 (Chen Shen) @Sang Cho @jjyao (Jiajun Yao) @Yi Cheng
### Shepherd of the Proposal (should be a senior committer)

Collaborator


This is a required piece of information. I suggest @scv119

Contributor Author


Agreed. I will add it.

@scv119
Contributor

scv119 commented Mar 9, 2023

@larrylian let me know once you are done with autoscaler part.

@rkooo567
Contributor

rkooo567 commented Mar 24, 2023

The implementation & API makes sense to me. I have a couple questions regarding usability for common cases.

  1. When deploying Ray, many people use the autoscaler/KubeRay, which only allows homogeneous commands for all worker nodes (worker_startup_commands). I.e., for normal users it will be impossible or difficult to set different node labels for different worker nodes. We should allow passing labels via an env var, and allow users to specify labels the way they specify resources (from the autoscaler & KubeRay config). For example, in this yaml file we should have a labels field next to the resources fields: https://docs.ray.io/en/master/cluster/vms/user-guides/launching-clusters/aws.html#create-a-minimal-cluster-config-yaml-named-cloudwatch-basic-yaml-with-the-following-contents.
  2. NodeAffinitySchedulingStrategy vs node_affinity seems confusing, and I think we should combine the two.
  3. We may need to introduce default labels, e.g., instance type, zone, and whether the node has a GPU could be added by default. This probably doesn't need to be addressed within this REP.
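The suggested labels field next to resources might look like this once the cluster-config YAML is parsed into a dict; the field name, node type, and values below are all hypothetical:

```python
# Parsed form of a hypothetical cluster-config node type.
# The "labels" key is the proposed addition, mirroring the
# existing "resources" key in autoscaler/KubeRay configs.
node_type_config = {
    "ray.worker.gpu": {
        "node_config": {"InstanceType": "p3.2xlarge"},  # illustrative EC2 type
        "resources": {"CPU": 8, "GPU": 1},
        "labels": {"instance": "p3.2xlarge", "accelerator": "gpu"},
    }
}

# Each launched worker of this type would then register its static labels
# with the GCS at startup, just as it registers its resources today.
print(sorted(node_type_config["ray.worker.gpu"]["labels"]))  # ['accelerator', 'instance']
```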

@larrylian
Contributor Author

@rkooo567 Thanks for the very helpful advice.

  1. Setting a node's static labels through environment variables is no problem. I'll add it to the docs later.
  2. Presetting default labels: I also have this plan. I will preset some common default labels such as "IP", "hostname", and "whether the node has a GPU".

> NodeAffinitySchedulingStrategy vs node_affinity seems confusing, and I think we should combine these two.

I don't quite understand what you mean; could you explain?

@rkooo567
Contributor

Oh, I think it is actually the same as #22 (comment), so I believe it is addressed.


7 participants