
[Serve] Deflake test_metrics #47750

Closed
wants to merge 36 commits

Conversation


@GeneDer GeneDer commented Sep 19, 2024

Why are these changes needed?

Split out test_metrics to run on its own so the metrics will not be polluted by other tests.
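
A rough illustration of the kind of split described above, assuming Ray's usual Bazel conventions for Serve tests; the target name, size, tags, and deps here are illustrative, not the PR's exact diff:

```python
# Hypothetical BUILD (Starlark) fragment: giving test_metrics its own
# py_test target means it no longer shares a test runner (and thus
# metrics state) with the rest of the Serve suite. The "exclusive" tag
# additionally forces Bazel to run the test serially.
py_test(
    name = "test_metrics",
    size = "large",
    srcs = ["tests/test_metrics.py"],
    tags = ["exclusive", "team:serve"],
    deps = [":serve_lib"],
)
```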

Related issue number

Closes #45843

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@GeneDer GeneDer added the go label (add ONLY when ready to merge, run all tests) Sep 19, 2024
@GeneDer GeneDer changed the title from right size tests to [Serve] Deflake test_metrics Sep 23, 2024
@GeneDer GeneDer marked this pull request as ready for review September 23, 2024 04:11

GeneDer commented Sep 23, 2024

Tested 5 runs (29452, 29455, 29457, 29459, 29461) and I'm not seeing any more failures with this change.

@GeneDer GeneDer self-assigned this Sep 23, 2024
@GeneDer GeneDer requested a review from a team September 23, 2024 14:57

edoakes commented Sep 23, 2024

@GeneDer can you explain a little more about what the issue was and how this solves it? And also why is this specific to Windows?


GeneDer commented Sep 23, 2024

> @GeneDer can you explain a little more about what the issue was and how this solves it? And also why is this specific to Windows?

This test was already running on Windows before; there is no change on that.

I think the main issue is that there are some race conditions between different tests running in this build that pollute the metrics, so oftentimes we see this test failing (timed out with the condition never met).

Also, after I factored this test out, we started seeing those unexpected kwarg errors, so those got fixed as well.


edoakes commented Sep 23, 2024

> @GeneDer can you explain a little more about what the issue was and how this solves it? And also why is this specific to Windows?
>
> This test was already running on Windows before; there is no change on that.
>
> I think the main issue is that there are some race conditions between different tests running in this build that pollute the metrics, so oftentimes we see this test failing (timed out with the condition never met).
>
> Also, after I factored this test out, we started seeing those unexpected kwarg errors, so those got fixed as well.

We should be able to set up the fixtures properly to clear all of the expected metrics. We rely on metrics in quite a few tests, so it's probably better to nail down the root cause there.
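
For context, a minimal sketch of the fixture-based reset being suggested here, assuming the test module's serve_start_shutdown fixture; the exact setup/teardown steps in the PR may differ:

```python
import pytest
import ray
from ray import serve


@pytest.fixture
def serve_start_shutdown():
    """Give each test a fresh Ray cluster and Serve instance."""
    ray.init()
    serve.start()
    yield
    # Tear everything down so no metrics state leaks into the next test.
    serve.shutdown()
    ray.shutdown()
```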

@GeneDer GeneDer marked this pull request as draft September 23, 2024 23:20
@GeneDer GeneDer marked this pull request as ready for review September 26, 2024 14:46

GeneDer commented Sep 26, 2024

@edoakes I refactored the call to clean up metrics between tests into a fixture and call it in each of the tests here. Can you PTAL?

```python
requests.post(delete_all_series_url)
requests.post(clean_tombstones_url)


@pytest.fixture
def serve_start_shutdown():
    """Fixture provides a fresh Ray cluster to prevent metrics state sharing."""
```
Contributor

Hm, shouldn't this be sufficient on its own? The Prometheus endpoint is the raylet, so if Ray is shut down between runs there should be no state sharing.

What am I missing?

Contributor Author

My expectation is that there is something that's not cleaned up in between those tests, and in fact adding those calls seems to have helped. Now that I think through it again, maybe just adding some sleep in between would help the same way, and maybe the issue is that Serve and/or Ray wasn't completely shut down before the next test started? 🤔 Let me do some more experiments.

Contributor

We should not add any sleeps -- if we need to wait for anything to clean up, then explicitly wait for the cleanup to happen.

Sleeps are what make things flaky in the first place.
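
The explicit-wait pattern being asked for here is, roughly, what Ray's wait_for_condition test helper provides. A self-contained sketch of the idea (the timeout and interval values are illustrative):

```python
import time


def wait_for_condition(condition, timeout=30, retry_interval=0.5):
    """Poll `condition` until it returns True instead of sleeping blindly.

    Raises TimeoutError if the condition is never met -- the same
    "timed out with condition never met" failure mode described above,
    but surfaced deterministically rather than hidden behind a sleep.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(retry_interval)
    raise TimeoutError(f"Condition not met within {timeout} seconds.")
```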

@GeneDer GeneDer marked this pull request as draft September 27, 2024 16:56

GeneDer commented Oct 15, 2024

Closing for now; will dig into this deeper when we have more bandwidth. There still seem to be some port-binding issues with this change.

@GeneDer GeneDer closed this Oct 15, 2024
Labels: go (add ONLY when ready to merge, run all tests)

Successfully merging this pull request may close these issues.

CI test windows://python/ray/serve/tests:test_metrics is consistently_failing