[WIP][ADAG]Enable NPU (hccl) communication for aDAG #47658
base: master
Signed-off-by: zhilong <[email protected]>
cc @ruisearch42
Having a round of review since I was tagged.
Overall looks good. Do you plan to add a test?
Let me know when this is ready to review.
    from ray.experimental.channel.nccl_group import _NcclGroup

else:
    from ray.experimental.channel.hccl_group import _HcclGroup as _NcclGroup
hmm, this looks like a hack. Do you plan to change to a cleaner approach?
@@ -16,7 +16,7 @@
 @DeveloperAPI
 class GPUCommunicator(ABC):
     """
-    Communicator for a group of aDAG actors on Nvidia GPU.
+    Communicator for a group of aDAG actors on Nvidia GPU or other XPUs.
We should probably change the class name to a more general one if this is to support other XPUs. This is not yet used externally so backward compatibility is not an issue.
self._device_id = device_id

if rank is not None:
    assert ray.get_gpu_ids(), "HCCL actor has no NPUs assigned"
ray.get_gpu_ids() seems to only get GPU IDs?
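A minimal sketch of what an NPU-aware check could look like. It assumes a Ray version where `ray.get_runtime_context().get_accelerator_ids()` returns a dict keyed by accelerator resource name, and that Ascend devices appear under the key `"NPU"`; both should be verified against the Ray version in use. The helper `visible_npu_ids` is hypothetical and takes the mapping as an argument so that it runs without Ray.

```python
from typing import Dict, List


def visible_npu_ids(accelerator_ids: Dict[str, List[str]]) -> List[str]:
    """Return the NPU IDs from a {resource name: [ids]} mapping.

    Assumption: NPUs are reported under the "NPU" key.
    """
    return list(accelerator_ids.get("NPU", []))


# Inside an actor the mapping would come from Ray (assumed API, see above):
#   accelerator_ids = ray.get_runtime_context().get_accelerator_ids()
# A hand-written mapping is used here so the example is self-contained.
example_ids = {"GPU": [], "NPU": ["0", "1"]}
npu_ids = visible_npu_ids(example_ids)
assert npu_ids, "HCCL actor has no NPUs assigned"
```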
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['HCCL_WHITELIST_DISABLE'] = '1'
Do we need to make these configurable?