[Serve] Deflake test_metrics
#47750
Signed-off-by: Gene Su <[email protected]>
cac6f7e
to
a0d6bd0
Compare
@GeneDer can you explain a little more about what the issue was and how this solves it? Also, why is this specific to Windows?
This test was already running on Windows before; there is no change there. I think the main issue is a race condition between different tests running in this build and polluting the metrics, so we often see this test fail (timed out with the condition never met). Also, after I factored this test out, we started seeing unexpected kwarg errors, so those are fixed as well.
We should be able to set up the fixtures properly to clear all of the expected metrics. We rely on metrics in quite a few tests, so it's probably better to nail down the root cause there.
@edoakes I refactored the call that cleans up metrics between tests into a fixture and call it in each of the tests here. Can you PTAL?
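For readers following the thread, here is a minimal sketch of what such a metrics cleanup could look like. It assumes the `delete_all_series_url` and `clean_tombstones_url` names in the diff point at Prometheus's TSDB admin API (`/api/v1/admin/tsdb/delete_series` and `/api/v1/admin/tsdb/clean_tombstones`), which is only available when Prometheus runs with `--web.enable-admin-api`. The base URL, the default matcher, and the helper name `build_cleanup_requests` are hypothetical and not taken from the PR.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: Prometheus listens on its default local port with the admin API enabled.
PROMETHEUS_BASE_URL = "http://localhost:9090"


def build_cleanup_requests(base_url=PROMETHEUS_BASE_URL, matcher='{__name__=~".+"}'):
    """Build the two POST requests: delete matching series, then purge tombstones.

    Hypothetical helper; the PR's actual URLs are not shown in the conversation.
    """
    # delete_series requires at least one match[] selector; this default matches every series.
    delete_url = (
        f"{base_url}/api/v1/admin/tsdb/delete_series?"
        + urlencode({"match[]": matcher})
    )
    # clean_tombstones removes the deleted data from disk.
    clean_url = f"{base_url}/api/v1/admin/tsdb/clean_tombstones"
    return [Request(delete_url, method="POST"), Request(clean_url, method="POST")]
```

A fixture could send these with `urllib.request.urlopen(req)` (or `requests.post`, as in the diff) before each test. Whether that addresses the root cause is the open question in the review below.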
    requests.post(delete_all_series_url)
    requests.post(clean_tombstones_url)


@pytest.fixture
def serve_start_shutdown():
    """Fixture provides a fresh Ray cluster to prevent metrics state sharing."""
Hm shouldn't this be sufficient on its own? The prometheus endpoint is the raylet, so if ray is shut down between runs there should be no state sharing.
What am I missing?
My expectation is that there is something that's not cleaned up between those tests, and in fact adding those calls seems to have helped. Thinking it through again, maybe just adding some sleep in between would help the same way, and maybe the issue is that Serve and/or Ray wasn't completely shut down before the next test starts? 🤔 Let me do some more experiments.
we should not add any sleeps -- if we need to wait for anything to clean up, then explicitly wait for the cleanup to happen
sleeps are what make things flaky in the first place
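To illustrate the "explicitly wait for the cleanup" suggestion, here is a minimal sketch of a polling helper that replaces a fixed sleep with a bounded wait on a condition. The name `wait_for_condition` and its parameters are illustrative assumptions for this sketch, not necessarily the helper used in the Ray test suite.

```python
import time


def wait_for_condition(predicate, timeout_s=10.0, retry_interval_s=0.1):
    """Poll `predicate` until it returns a truthy value.

    Raises TimeoutError if the condition is still not met once `timeout_s` has elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout_s} seconds.")
        # Short pause between checks; the wait ends as soon as the condition holds.
        time.sleep(retry_interval_s)
```

A fixture could then call something like `wait_for_condition(lambda: metrics_are_cleared())`, where `metrics_are_cleared` is a hypothetical check. Unlike a fixed sleep, the test proceeds as soon as the cleanup is observed and fails loudly if it never happens.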
Closing for now; will dig into this deeper when we have more bandwidth. There still seem to be some port-binding issues with this change.
Why are these changes needed?
Split out test_metrics to run on its own so the metrics will not be polluted by other tests.
Related issue number
Closes #45843
Checks
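I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.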