[core][aDAG] Clean up shutdown path #47702
Conversation
LGTM. Can you add a description of the changes to the PR text?
return [c.read() for c in self._input_channels]
results = []
for c in self._input_channels:
    try:
add while loop
Otherwise LGTM
This might also close #47413?
Yeah, I actually asked him to try... let's see how it goes.
Note - in standup today this is blocking other PRs from merging in #compiled-graph cc @stephanie-wang |
Ah yeah, sorry about that... Feel free to push code directly and merge it. I think it should be almost ready.
Signed-off-by: Stephanie Wang <[email protected]>
Fixed the issue. Should be ready to be merged tomorrow.
@stephanie-wang @ruisearch42 can you give an approval for merge?
Otherwise LGTM
@@ -1874,6 +1874,11 @@ def shutdown(_exiting_interpreter: bool = False):
    and false otherwise. If we are exiting the interpreter, we will
    wait a little while to print any extra error messages.
    """
    # Make sure to clean up compiled dag node if exists.
Update docstring as well?
Feels like that is kind of an implementation detail?
for c in self._input_channels:
    exiting = retry_and_check_interpreter_exit(
        lambda: results.append(c.read(timeout=1))
    )
    if exiting:
        break
Just to understand, the original code hangs because it waits inside C++ and therefore does not respect interpreter exiting?
No, it is due to this:
Fix the issue of asyncio reads/writes being blocked and joined forever. We retry the read/write every 1 second and check sys.is_finalizing(), which is set to True when the interpreter starts exiting. We can't rely on the atexit-handler teardown because the asyncio read/write runs in a thread pool, and the thread pool is joined before the atexit handler is executed. See https://github.com/python/cpython/blob/8f82d9aa2191db7826bb7a453fe06ce65f966cf8/Lib/concurrent/futures/thread.py#L37 (this exit hook is always called before Python's regular atexit handlers).
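For illustration only (the names and the timeout exception type are assumptions, not Ray's actual implementation), a helper like retry_and_check_interpreter_exit from the diff above could look roughly like this:

```python
import sys
from typing import Callable


def retry_and_check_interpreter_exit(f: Callable[[], None]) -> bool:
    """Retry ``f`` until it succeeds or the interpreter starts finalizing.

    ``f`` should block for at most about 1 second per attempt (e.g. a channel
    read with ``timeout=1``). Returns True if the interpreter began exiting
    before ``f`` completed, so the caller can stop waiting instead of
    blocking shutdown forever.
    """
    while True:
        try:
            f()
            return False
        except TimeoutError:  # placeholder for the channel's timeout error
            if sys.is_finalizing():
                return True
```

The point is that the blocking read is broken into short, timed attempts, and sys.is_finalizing() is checked between attempts, because an atexit handler would never get the chance to unblock a thread-pool worker that is joined before it runs.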
I will address the comments in a follow-up.
Various fixes, including ray-project#47685, to clean up the shutdown path. Make teardown idempotent. Remove the non-blocking teardown(wait=False); it is prone to all sorts of weird errors because we don't synchronize properly before shutdown. Previously, we tore down on __del__, which works well at runtime but not at shutdown time, because destruction order is not guaranteed at shutdown; we should use an atexit handler instead. To fix this, keep track of all compiled DAGs created (via weakref) and tear them down inside the shutdown API, which is called when the interpreter shuts down. Fix the issue of asyncio reads/writes being blocked and joined forever: retry the read/write every 1 second and check sys.is_finalizing(), which is set to True when the interpreter starts exiting. We can't rely on the atexit-handler teardown because the asyncio read/write runs in a thread pool, and the thread pool is joined before the atexit handler is executed. See https://github.com/python/cpython/blob/8f82d9aa2191db7826bb7a453fe06ce65f966cf8/Lib/concurrent/futures/thread.py#L37 (this exit hook is always called before Python's regular atexit handlers). Change teardown logs to debug level so that they aren't printed unless necessary.
This reverts commit bc99a3b. Signed-off-by: Rui Qiao <[email protected]>
(#48204) Why are these changes needed?
This PR fixes test_torch_tensor_dag_gpu with the following quick patches:
1. Revert #47702, otherwise there is a segfault.
2. Move TestNcclGroup to be an inner class in the tests, otherwise there is the following error:
```
(TorchTensorWorker pid=2261373) No module named 'test_torch_tensor_dag'
(TorchTensorWorker pid=2261373) Traceback (most recent call last):
(TorchTensorWorker pid=2261373)   File "/home/ubuntu/ray/python/ray/_private/serialization.py", line 460, in deserialize_objects
(TorchTensorWorker pid=2261373)     obj = self._deserialize_object(data, metadata, object_ref)
(TorchTensorWorker pid=2261373)   File "/home/ubuntu/ray/python/ray/_private/serialization.py", line 317, in _deserialize_object
(TorchTensorWorker pid=2261373)     return self._deserialize_msgpack_data(data, metadata_fields)
(TorchTensorWorker pid=2261373)   File "/home/ubuntu/ray/python/ray/_private/serialization.py", line 272, in _deserialize_msgpack_data
(TorchTensorWorker pid=2261373)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(TorchTensorWorker pid=2261373)   File "/home/ubuntu/ray/python/ray/_private/serialization.py", line 262, in _deserialize_pickle5_data
(TorchTensorWorker pid=2261373)     obj = pickle.loads(in_band)
(TorchTensorWorker pid=2261373) ModuleNotFoundError: No module named 'test_torch_tensor_dag'
```
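As an aside (not part of the original PR text): the ModuleNotFoundError above is the usual symptom of a class being pickled by reference. A class defined at the top level of test_torch_tensor_dag.py is serialized as a module path plus qualified name, so the worker has to import the test module and fails; a class defined inside the test is serialized by value by cloudpickle. A minimal sketch with hypothetical names:

```python
import ray


def test_nccl_group_as_inner_class():
    # Defined inside the test, so cloudpickle ships the class definition by
    # value and the worker never has to import the test module itself.
    class TestNcclGroup:
        def __init__(self, world_size: int):
            self.world_size = world_size

    @ray.remote
    def world_size_of(group) -> int:
        return group.world_size

    ray.init(ignore_reinit_error=True)
    try:
        assert ray.get(world_size_of.remote(TestNcclGroup(2))) == 2
    finally:
        ray.shutdown()
```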
(ray-project#48250) This reverts commit 23bb654 (the revert in ray-project#48204), i.e., it re-applies ray-project#47702 by reverting its revert.
Why are these changes needed?
Various fixes, including #47685, to clean up the shutdown path.
- Make teardown idempotent.
- Remove the non-blocking teardown(wait=False). It is prone to all sorts of weird errors because we don't synchronize properly before shutdown.
- Previously, we tore down on __del__. It works well at runtime but not at shutdown time, because destruction order is not guaranteed at shutdown. We should use an atexit handler instead. To fix this, I keep track of all compiled DAGs created (via weakref) and tear them down inside the shutdown API, which is called when the interpreter shuts down.
- Fix the issue of asyncio reads/writes being blocked and joined forever. We retry the read/write every 1 second and check sys.is_finalizing(), which is set to True when the interpreter starts exiting. We can't rely on the atexit-handler teardown because the asyncio read/write runs in a thread pool, and the thread pool is joined before the atexit handler is executed. See https://github.com/python/cpython/blob/8f82d9aa2191db7826bb7a453fe06ce65f966cf8/Lib/concurrent/futures/thread.py#L37 (this exit hook is always called before Python's regular atexit handlers).
- Change teardown logs to debug level so that they aren't printed unless necessary.
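For illustration, here is a minimal sketch of the weakref-based bookkeeping described above; the class and function names are placeholders, not the actual Ray APIs:

```python
import weakref

# Registry of live compiled DAGs; weak references so the registry never keeps
# a DAG alive after user code has dropped its last reference.
_compiled_dags = weakref.WeakSet()


class CompiledDAG:
    def __init__(self, name: str) -> None:
        self.name = name
        self._torn_down = False
        _compiled_dags.add(self)

    def teardown(self) -> None:
        # Idempotent: a second call is a no-op, so an explicit user teardown
        # and the shutdown-time teardown can safely coexist.
        if self._torn_down:
            return
        self._torn_down = True
        print(f"tearing down {self.name}")


def shutdown() -> None:
    # Called from the shutdown path (which is registered with atexit), so any
    # DAG still alive at interpreter exit is cleaned up exactly once,
    # regardless of destruction order.
    for dag in list(_compiled_dags):
        dag.teardown()
```

Driving teardown from shutdown() rather than from __del__ avoids relying on garbage-collection order during interpreter finalization.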
Related issue number
Closes #47685 (comment)
Closes #47628.
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- For any new APIs: if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.