
[core][aDAG] Clean up shutdown path #47702

Merged 14 commits into ray-project:master on Oct 22, 2024

Conversation

@rkooo567 (Contributor) commented Sep 17, 2024

Why are these changes needed?

Various fixes, including #47685, to clean up the shutdown path.

  • Make teardown idempotent.
  • Remove the non-blocking teardown(wait=False). It is prone to all kinds of weird errors because we don't synchronize properly before shutdown.
  • Previously, we ran teardown in __del__. That works well at runtime but not at interpreter shutdown, because destruction order is not guaranteed then. We should use an atexit handler instead. To fix this, I keep track of all compiled DAGs created (via weakref) and run teardown inside the shutdown API, which is called when the interpreter shuts down.
  • Fix asyncio read/write being blocked and joined forever. We retry read/write every 1 second and check sys.is_finalizing(), which becomes True when the interpreter is exiting. We can't rely on the atexit teardown handler here because asyncio read/write runs in a thread pool, and the thread pool is joined before regular atexit handlers are executed. See https://github.com/python/cpython/blob/8f82d9aa2191db7826bb7a453fe06ce65f966cf8/Lib/concurrent/futures/thread.py#L37 (that internal atexit handler always runs before Python's regular atexit handlers).
  • Change teardown logs to debug level so they aren't printed unless necessary.

Related issue number

Closes #47685 (comment)

Closes #47628.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 rkooo567 added the go add ONLY when ready to merge, run all tests label Sep 17, 2024
@stephanie-wang (Contributor) left a comment

LGTM. Can you add a description of the changes to the PR text?

return [c.read() for c in self._input_channels]
results = []
for c in self._input_channels:
try:
@rkooo567 (Contributor, Author):

add while loop

@rkooo567 rkooo567 changed the title [WIP][core][aDAG] Clean up shutdown path [core][aDAG] Clean up shutdown path Sep 19, 2024
@ruisearch42 (Contributor) left a comment

Otherwise LGTM

@stephanie-wang (Contributor):

This might also close #47413?

@rkooo567 (Contributor, Author):

Yeah I actually asked him to try... let's see how it goes

@anyscalesam (Contributor):

Note - in standup today this is blocking other PRs from merging in #compiled-graph cc @stephanie-wang

@rkooo567 (Contributor, Author) commented Oct 8, 2024

ah yeah sorry about that... Feel free to push code directly and merge it. I think it should be almost ready

@rkooo567 (Contributor, Author):

Fixed the issue. Should be ready to be merged tomorrow.

@rkooo567 (Contributor, Author):

@stephanie-wang @ruisearch42 can you give an approval for merge?

@ruisearch42 (Contributor) left a comment

Otherwise LGTM

@@ -1874,6 +1874,11 @@ def shutdown(_exiting_interpreter: bool = False):
and false otherwise. If we are exiting the interpreter, we will
wait a little while to print any extra error messages.
"""
# Make sure to clean up compiled dag node if exists.
Contributor: Update docstring as well?

@rkooo567 (Contributor, Author): Feels like it is kind of an implementation detail?

Comment on lines +385 to +390
for c in self._input_channels:
exiting = retry_and_check_interpreter_exit(
lambda: results.append(c.read(timeout=1))
)
if exiting:
break
Contributor:

Just to understand, the original code hangs because it waits inside C++ and therefore does not respect interpreter exiting?

@rkooo567 (Contributor, Author):

No, it is due to the asyncio fix described in the PR text: asyncio reads/writes were blocked and joined forever. We retry read/write every 1 second and check sys.is_finalizing(), which becomes True when the interpreter is exiting. We can't rely on the atexit teardown handler because asyncio read/write runs in a thread pool, and the thread pool is joined before regular atexit handlers are executed. See https://github.com/python/cpython/blob/8f82d9aa2191db7826bb7a453fe06ce65f966cf8/Lib/concurrent/futures/thread.py#L37 (that internal atexit handler always runs before Python's regular atexit handlers).
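The retry pattern discussed here can be sketched like this. It is a minimal illustration, not Ray's actual implementation: the helper name matches the snippet quoted in the diff above, but the `RayChannelTimeoutError` exception type is an assumption standing in for whatever timeout error `c.read(timeout=1)` raises.

```python
import sys
from typing import Callable


class RayChannelTimeoutError(Exception):
    """Illustrative stand-in for the timeout raised by a channel read/write."""


def retry_and_check_interpreter_exit(f: Callable[[], None]) -> bool:
    """Retry f until it succeeds, bailing out if the interpreter is exiting.

    f should block for at most about 1 second per attempt (e.g. a channel
    read with timeout=1). Returns True if we stopped because the interpreter
    is finalizing, False if f eventually succeeded.
    """
    while True:
        try:
            f()
            return False
        except RayChannelTimeoutError:
            # sys.is_finalizing() becomes True while the interpreter is
            # shutting down. We poll it here because this code runs in an
            # executor thread, and those threads are joined before regular
            # atexit handlers run, so an atexit-based teardown never fires
            # in time to unblock us.
            if sys.is_finalizing():
                return True
```

The caller then breaks out of its loop over input channels as soon as the helper reports that the interpreter is exiting, instead of blocking forever on a read that will never complete.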

@rkooo567 (Contributor, Author):

I will address the comments in a follow-up.

@rkooo567 rkooo567 merged commit bc99a3b into ray-project:master Oct 22, 2024
5 checks passed
akyang-anyscale pushed a commit to akyang-anyscale/ray that referenced this pull request Oct 22, 2024
ruisearch42 pushed a commit to ruisearch42/ray that referenced this pull request Oct 22, 2024
aslonnie pushed a commit that referenced this pull request Oct 23, 2024

## Why are these changes needed?

This PR fixes test_torch_tensor_dag_gpu with the following quick patches:
1. Revert #47702 , otherwise there is segfault
2. Move TestNcclGroup to an inner class in the tests; otherwise the following error occurs:

```
(TorchTensorWorker pid=2261373) No module named 'test_torch_tensor_dag'
(TorchTensorWorker pid=2261373) Traceback (most recent call last):
(TorchTensorWorker pid=2261373)   File "/home/ubuntu/ray/python/ray/_private/serialization.py", line 460, in deserialize_objects
(TorchTensorWorker pid=2261373)     obj = self._deserialize_object(data, metadata, object_ref)
(TorchTensorWorker pid=2261373)   File "/home/ubuntu/ray/python/ray/_private/serialization.py", line 317, in _deserialize_object
(TorchTensorWorker pid=2261373)     return self._deserialize_msgpack_data(data, metadata_fields)
(TorchTensorWorker pid=2261373)   File "/home/ubuntu/ray/python/ray/_private/serialization.py", line 272, in _deserialize_msgpack_data
(TorchTensorWorker pid=2261373)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(TorchTensorWorker pid=2261373)   File "/home/ubuntu/ray/python/ray/_private/serialization.py", line 262, in _deserialize_pickle5_data
(TorchTensorWorker pid=2261373)     obj = pickle.loads(in_band)
(TorchTensorWorker pid=2261373) ModuleNotFoundError: No module named 'test_torch_tensor_dag'
```
stephanie-wang pushed a commit that referenced this pull request Oct 25, 2024
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
Labels
go add ONLY when ready to merge, run all tests

Successfully merging this pull request may close these issues.

[core][aDAG] asyncio run hangs upon shutdown
[aDAG] [Tests] Seg fault when running failed tests back-to-back
5 participants