[Serve] Deflake test_metrics
#47750
Closed
36 commits
a0d6bd0 right size tests (GeneDer)
20bb4b5 trigger another build (GeneDer)
e688c20 factor out test_metrics on it's own and use large sized test (GeneDer)
ecbc2f6 fix (GeneDer)
320d1ba fix tag (GeneDer)
1870bdc Merge branch 'master' into deflak-test-metrics (GeneDer)
70fbb9d fix kwargs (GeneDer)
9165510 try again (GeneDer)
d15d0d0 test again (GeneDer)
2ffefce test again (GeneDer)
8e22021 test again (GeneDer)
efceb03 revert change and add logics to clean up metrics between tests (GeneDer)
eb15873 lint (GeneDer)
9502dd4 check health for prometheus before cleanup (GeneDer)
71d1336 refactor clean up metrics as a fixture (GeneDer)
dab214c test again (GeneDer)
22f9145 test again (GeneDer)
46ce3a0 test again (GeneDer)
3e63207 test again (GeneDer)
0a5cbf0 clean up serve and ray before and after the tests (GeneDer)
04ee77c try again (GeneDer)
869594b try again (GeneDer)
0838d2a Merge branch 'master' into deflak-test-metrics (GeneDer)
25cc12a Merge branch 'master' into deflak-test-metrics (GeneDer)
ff1a839 Merge branch 'master' into deflak-test-metrics (GeneDer)
1482c03 try again (GeneDer)
e9a88c6 try again (GeneDer)
747d479 only decrement num_scheduling_tasks_in_backoff if it's greater than 0 (GeneDer)
79fce0e try again (GeneDer)
d924b7e try again (GeneDer)
650452c try again (GeneDer)
e0aa69c wait for proxies to be healthy before starting any tests (GeneDer)
64d88a7 try again (GeneDer)
853662d Merge branch 'master' into deflak-test-metrics (GeneDer)
709a0a9 try again (GeneDer)
00a45bf try again (GeneDer)
@@ -443,4 +443,3 @@ py_test_module_list(
"//python/ray/serve:serve_lib",
],
)
Hm, shouldn't this be sufficient on its own? The Prometheus endpoint is the raylet, so if Ray is shut down between runs there should be no state sharing.
What am I missing?
My expectation is that something isn't being cleaned up in between those tests, and in fact adding those calls seems to have helped. Thinking it through again, though, maybe just adding a sleep in between would help in the same way, which would mean the issue is that Serve and/or Ray wasn't completely shut down before the next test started? 🤔 Let me do some more experiments.
We should not add any sleeps. If we need to wait for anything to clean up, then explicitly wait for the cleanup to happen.
Sleeps are what make things flaky in the first place.
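To illustrate the "explicitly wait for the cleanup" suggestion, here is a minimal standalone sketch of a bounded polling helper. It is not the code from this PR: `wait_for_condition` and `port_is_closed` are hypothetical names written for this example, and checking that a TCP port has stopped accepting connections is only an assumed stand-in for "the metrics endpoint from the previous test is gone". The short `time.sleep` inside the loop is just the polling interval; the test proceeds as soon as the condition holds and fails loudly if it never does.

```python
import socket
import time
from typing import Callable, Optional


def wait_for_condition(
    predicate: Callable[[], bool],
    timeout_s: float = 10.0,
    retry_interval_s: float = 0.1,
) -> None:
    """Poll `predicate` until it returns True; raise TimeoutError otherwise.

    Hypothetical helper for illustration. Exceptions raised by the
    predicate are treated as "not ready yet" and reported on timeout.
    """
    deadline = time.monotonic() + timeout_s
    last_error: Optional[Exception] = None
    while True:
        try:
            if predicate():
                return
        except Exception as e:  # keep polling, remember the last failure
            last_error = e
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"condition not met within {timeout_s}s "
                f"(last error: {last_error!r})"
            )
        time.sleep(retry_interval_s)


def port_is_closed(host: str, port: int) -> bool:
    """Return True if nothing accepts TCP connections on (host, port).

    Assumed stand-in for "the old endpoint has finished shutting down".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) != 0
```

In a test teardown this might be used as `wait_for_condition(lambda: port_is_closed("127.0.0.1", metrics_port))`, where `metrics_port` is whatever port the test configured. Unlike a fixed sleep, the wait is tied to the state the next test depends on, so it neither wastes time on fast machines nor silently passes on slow ones.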