
[Core][REP] GPU Memory awareness scheduling #47

Open
wants to merge 7 commits into base: main

Conversation

jonathan-anyscale

@jonathan-anyscale jonathan-anyscale commented Nov 19, 2023

The GPU memory scheduling prototype:
ray-project/ray#41147

@jonathan-anyscale jonathan-anyscale marked this pull request as draft November 19, 2023 03:34
Contributor

@jjyao jjyao left a comment


will continue

jonathan-anyscale and others added 2 commits December 2, 2023 20:20
@jonathan-anyscale jonathan-anyscale marked this pull request as ready for review December 6, 2023 20:00
```python
# Request a fractional GPU with the specified gpu_memory in bytes.
# Mutually exclusive with num_gpus.
@ray.remote(gpu_memory=1024 * 1024 * 1024)  # 1 GiB request
```
Contributor


Can we support string-based syntactic sugar? Feels more pythonic that way (e.g., gpu_memory="3gb").

Author


For now we just follow how memory is defined. I think the pythonic string support can be added separately, covering both the gpu_memory and memory changes.
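
For illustration only, a minimal sketch of what such string-based sugar could look like; the parse_size helper and the accepted suffixes are assumptions for this thread, not part of the REP or Ray's API:

```python
# Hypothetical sketch of string-based sugar, e.g. gpu_memory="3gb" (not Ray API).
_UNITS = {"kb": 1024, "mb": 1024**2, "gb": 1024**3}

def parse_size(value):
    """Convert "3gb"-style strings to bytes; pass integers through unchanged."""
    if isinstance(value, int):
        return value
    number, unit = value[:-2], value[-2:].lower()
    return int(float(number) * _UNITS[unit])

assert parse_size("3gb") == 3 * 1024**3
```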

```python
from ray.util.placement_group import placement_group

# Bundle 0 requests 1 MiB of GPU memory plus 1 CPU; bundle 1 requests a whole GPU.
pg = placement_group([{"gpu_memory": 1024 * 1024, "CPU": 1}, {"GPU": 1}])
```

Contributor


I think we need an observability section here, since this complicates the observability semantics.

  • How is it displayed in ray status?
    • For ray status, it should potentially display something like gpu_memory: 4 gpus (A10) * 3gb?
    • In ray status, if a task is scheduled with gpu_memory, are both gpu and gpu_memory values subtracted?
  • How is it displayed in resource_requirement in ray list tasks? Is it translated into num_gpus? Does it only include gpu_memory? Or both?

Author


In ray list nodes, it will be shown as GPU (resources left) * gpu_memory_per_gpu, where gpu_memory_per_gpu is the constant stored in the node label. ray status, ray list tasks, and ray.available_resources() currently don't show GPU memory, but if we add it, it will be the same as ray list nodes.

And yes, both the gpu and gpu_memory values are subtracted to show what remains.
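
As a rough sketch of that bookkeeping: remaining GPU memory is the remaining GPU fraction times the per-GPU capacity. The 24 GiB constant below is illustrative; under the proposal it would come from the node label rather than being hard-coded.

```python
import ray

ray.init()

# Remaining GPU memory ≈ remaining GPU fraction * per-GPU capacity.
# 24 GiB stands in for the value the REP would store in the node label.
GPU_MEMORY_PER_GPU = 24 * 1024**3

remaining_gpus = ray.available_resources().get("GPU", 0)
remaining_gpu_memory = remaining_gpus * GPU_MEMORY_PER_GPU
print(f"~{remaining_gpu_memory / 1024**3:.1f} GiB of GPU memory still schedulable")
```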


```python
# Requesting 30GB of GPU memory from an A10 GPU with 24GB of memory.
# The task won't be able to be scheduled.
@ray.remote(gpu_memory=30 * 1024 * 1024 * 1024, accelerator_type="NVIDIA_TESLA_A10G")
```
Contributor


If you have a 40GB GPU,

and schedule one task with 20GB,
and schedule another with num_gpus=1, would it fail to schedule?

Author


Yes, the second one will fail, since the GPU fraction remaining after scheduling the 20GB task will be 0.5.
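
The arithmetic behind that answer, as a standalone sketch (plain Python, not Ray API): gpu_memory is translated into a GPU fraction using the node's per-GPU capacity, and the usual fractional-GPU accounting then applies.

```python
gpu_memory_per_gpu = 40 * 1024**3                    # node with a single 40GB GPU
first_request = 20 * 1024**3                         # task with gpu_memory=20GB
used_fraction = first_request / gpu_memory_per_gpu   # consumes 0.5 of the GPU
remaining_fraction = 1.0 - used_fraction             # 0.5 of the GPU left
# A second task with num_gpus=1 needs a full GPU, so it cannot be scheduled.
assert remaining_fraction < 1.0
```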


```python
# Requesting a fractional GPU with both num_gpus and gpu_memory is not allowed.
@ray.remote(gpu_memory=1024 * 1024 * 1024, num_gpus=0.5)  # raises ValueError
```
Contributor


Is it possible to express 2 GPUs using gpu_memory? Or is that not allowed?

Contributor


Can you specify this in the REP?

Author


It's not allowed, since only one of num_gpus or gpu_memory can be specified in a request, and a gpu_memory request is limited to a single GPU.


Could they both be allowed? If both num_gpus and gpu_memory are specified, then it would require that much memory on that many GPUs. num_gpus would default to 1, so not specifying it would get the behavior described above. It could be an error condition to specify a fractional value for num_gpus if also specifying gpu_memory. Thoughts?
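
For concreteness, a minimal sketch of the mutual-exclusion check described in the reply above; the function name and error messages are illustrative, not the actual Ray implementation:

```python
def validate_gpu_request(num_gpus=None, gpu_memory=None):
    """Reject requests that mix num_gpus and gpu_memory (per the current proposal)."""
    if gpu_memory is not None and num_gpus is not None:
        raise ValueError("Specify only one of num_gpus or gpu_memory.")
    if gpu_memory is not None and gpu_memory <= 0:
        raise ValueError("gpu_memory must be a positive number of bytes.")
    return num_gpus, gpu_memory
```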
