load data before comms and nccl abort to solve issues with empty partitions #464

Merged

Collaborator

@eordentlich eordentlich commented Oct 7, 2023

In further testing of the empty-partition handling added previously, on a workstation and on a DGX (local mode), I found that even though we now raise an exception in this case, Python workers can still hang. Sometimes the job hangs (DGX); other times it exits but the Python worker isn't killed (Docker on a local workstation).

Loading data before setting up comms (to prevent a worker with data from racing ahead and starting, e.g., an allreduce) and adding the NCCL abort seem to clean things up in these cases.

I don't think this is the root cause of #453, but it should be addressed.
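
For readers following along, the ordering described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `FakeComms`, `EmptyPartitionError`, and `run_worker` are hypothetical names, and `FakeComms` just records calls instead of talking to NCCL. The point is that data is loaded (and the empty-partition check made) before any communicator exists, and that a failure after comms setup triggers an abort rather than a normal teardown.

```python
from typing import Callable, List, Sequence


class EmptyPartitionError(RuntimeError):
    """Hypothetical error for a worker that received no rows."""


class FakeComms:
    """Hypothetical stand-in for an NCCL communicator; it only logs calls."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def init(self) -> None:
        self.log.append("comms.init")

    def abort(self) -> None:
        # In a real setup this would tear the communicator down without
        # waiting on peers that may never arrive.
        self.log.append("comms.abort")

    def destroy(self) -> None:
        self.log.append("comms.destroy")


def run_worker(
    load_partition: Callable[[], Sequence[float]],
    fit: Callable[[Sequence[float], FakeComms], float],
    log: List[str],
) -> float:
    # Step 1: load data before any comms exist, so an empty partition
    # fails here and no collective is ever started on this worker.
    rows = list(load_partition())
    log.append("load")
    if not rows:
        raise EmptyPartitionError("worker received no data")

    # Step 2: only now set up comms.
    comms = FakeComms(log)
    comms.init()
    try:
        result = fit(rows, comms)
    except Exception:
        # Step 3: on failure, abort instead of a normal teardown.
        comms.abort()
        raise
    comms.destroy()
    return result
```

With this ordering, a worker with an empty partition raises before `comms.init` is ever called, so the call log for that worker contains only the load step.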

@eordentlich
Collaborator Author

build

@@ -326,7 +326,7 @@ def test_nearest_neighbors(
random_state=0,
) # make_blobs creates a random dataset of isotropic gaussian blobs.

- # set average norm to be 1 to allow comparisons with default error thresholds
+ # set average norm sq to be 1 to allow comparisons with default error thresholds
Collaborator Author

This code comment was still wrong, so I'm correcting it in this PR.
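
To make the distinction in the corrected comment concrete, here is a small pure-Python sketch. It is an assumption about what the normalization amounts to, not the test's actual code, and `scale_to_unit_mean_sq_norm` is a hypothetical helper. Dividing every row by the square root of the mean squared norm makes the average squared norm equal to 1, while the average norm itself generally ends up below 1.

```python
import math
from typing import List


def scale_to_unit_mean_sq_norm(rows: List[List[float]]) -> List[List[float]]:
    """Scale rows so that the mean of their squared L2 norms is 1."""
    sq_norms = [sum(x * x for x in row) for row in rows]
    mean_sq = sum(sq_norms) / len(sq_norms)
    scale = 1.0 / math.sqrt(mean_sq)
    return [[x * scale for x in row] for row in rows]


rows = [[3.0, 4.0], [0.0, 1.0]]  # norms 5 and 1, squared norms 25 and 1
scaled = scale_to_unit_mean_sq_norm(rows)

mean_sq_norm = sum(sum(x * x for x in r) for r in scaled) / len(scaled)
mean_norm = sum(math.sqrt(sum(x * x for x in r)) for r in scaled) / len(scaled)

print(mean_sq_norm)  # approximately 1.0
print(mean_norm)     # strictly less than 1.0 for this data
```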

leewyang previously approved these changes Oct 9, 2023
@eordentlich
Collaborator Author

Did some benchmarking with 2 GPUs, and the change (an additional barrier) seems to add about 1 sec of latency to fit.

@eordentlich eordentlich changed the title from "add a sync and nccl abort to solve issues with empty partitions" to "load data before comms and nccl abort to solve issues with empty partitions" Oct 12, 2023
@eordentlich
Collaborator Author

build

@eordentlich
Collaborator Author

Moving data loading before NCCL comms setup also seems to solve the issues, without the extra barrier from the previous version. Tested with UVM on Databricks and it seems to be OK.
While testing, I had to upgrade the spark-rapids jar to 23.08.2 to avoid Databricks crashes.

@eordentlich eordentlich merged commit da9661c into NVIDIA:branch-23.10 Oct 12, 2023
2 checks passed
@eordentlich eordentlich deleted the eo_fix_empty_partition_hangs branch October 12, 2023 19:53