Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RSDK-9366] Clear poisoned lock in tests #371

Merged

Conversation

mattjperez
Copy link
Member

the global_network_test_lock is used to serialize access to network resources to ensure parallelized tests don't trip over one another. We don't actually care to propagate poisons across tests because each test is meant to be self-contained and the underlying network resources are in-fact available.

@mattjperez mattjperez requested a review from a team as a code owner January 7, 2025 15:06
Comment on lines +216 to +219
lock.clear_poison();
lock_result.into_inner()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there shouldn't be an issue with multiple threads since:
a. we still hold the LockResult which contains the guard
b. we return the guard

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason why this is OK is that lock poisoning is mostly about ensuring that the state guarded by the mutex can be trusted to be in a consistent state. In this case, the state is unit, so there is no possibility of it having an inconsistent state.

Honestly this would probably be better written with a binary semaphore, since there isn't really state to be protected, but there doesn't seem to be such a thing in the standard library right now that we can use.

That said, I think there should be a comment here above clear_poison explaining why we feel justified to clear the poison state.

Copy link
Member

@acmorrow acmorrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM mod a comment, but I would like a second opinion from another reviewer before this merges.

Meanwhile, have you been able to confirm that once you make this change that you see a failed test sometimes, but that it doesn't propagate like it did before?

Comment on lines +216 to +219
lock.clear_poison();
lock_result.into_inner()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason why this is OK is that lock poisoning is mostly about ensuring that the state guarded by the mutex can be trusted to be in a consistent state. In this case, the state is unit, so there is no possibility of it having an inconsistent state.

Honestly this would probably be better written with a binary semaphore, since there isn't really state to be protected, but there doesn't seem to be such a thing in the standard library right now that we can use.

That said, I think there should be a comment here above clear_poison explaining why we feel justified to clear the poison state.

@mattjperez
Copy link
Member Author

Meanwhile, have you been able to confirm that once you make this change that you see a failed test sometimes, but that it doesn't propagate like it did before?

I added a panic!() in one of the tests after it takes the lock, the other tests still pass.

Copy link
Member

@npmenard npmenard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it lead to false positive? The mutex ensure tests are run sequentially if one fails then we should focus on fixing why the test fails. The mutex should never be poisoned no?

@mattjperez
Copy link
Member Author

mattjperez commented Jan 7, 2025

Why does it lead to false positive? The mutex ensure tests are run sequentially if one fails then we should focus on fixing why the test fails. The mutex should never be poisoned no?

ideally it would never be poisoned, but this will at least reduce the poison-lock noise, showing just the relevant errors of the test failing for legitimate reasons.

@acmorrow
Copy link
Member

acmorrow commented Jan 7, 2025

Why does it lead to false positive? The mutex ensure tests are run sequentially if one fails then we should focus on fixing why the test fails. The mutex should never be poisoned no?

The mutex will be poisoned if a test fails due to a panic, and then all of the subsequent tests which attempt to use the lock will also fail, making it hard to see which test actually failed and which were just fallout from the lock poisoning. Since there isn't actually any meaningful state protected by the mutex, I believe it is OK to clear the poisoning, so that the other tests can meaningfully execute and we get a clear indication of which test failed. Which we should then of course fix.

Here is a playground link which will show the issue: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=6161ddf7b147e9e58714b35ccdf7454e

You may need to run it 10 or so times before you get the ordering that shows the issue, but when ok_test runs after panicky_test you will see that both tests are reported as failed.

@npmenard
Copy link
Member

npmenard commented Jan 7, 2025

Some tests do share resources even if not explicitly guarded by the Mutex (like sockets). It's fair to clear the poison but later tests may fail in less obvious ways. We should revisit this when we address CI and focus on fixing flaky tests if they happen. Too bad we can't have a fail-fast option :P But fine to merge it

@mattjperez mattjperez force-pushed the rsdk-9366-poisoned-lock-in-tests branch from 18e27f1 to 3e0d578 Compare January 7, 2025 19:01
@mattjperez mattjperez merged commit 3baca97 into viamrobotics:main Jan 7, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants