[RSDK-9366] Clear poisoned lock in tests #371

mattjperez · 2025-01-07T15:06:02Z

the global_network_test_lock is used to serialize access to network resources to ensure parallelized tests don't trip over one another. We don't actually care to propagate poisons across tests because each test is meant to be self-contained and the underlying network resources are in-fact available.

mattjperez · 2025-01-07T15:08:34Z

micro-rdk/src/lib.rs

+            lock.clear_poison();
+            lock_result.into_inner()


I believe there shouldn't be an issue with multiple threads since:
a. we still hold the LockResult which contains the guard
b. we return the guard

Another reason why this is OK is that lock poisoning is mostly about ensuring that the state guarded by the mutex can be trusted to be in a consistent state. In this case, the state is unit, so there is no possibility of it having an inconsistent state.

Honestly this would probably be better written with a binary semaphore, since there isn't really state to be protected, but there doesn't seem to be such a thing in the standard library right now that we can use.

That said, I think there should be a comment here above clear_poison explaining why we feel justified to clear the poison state.

acmorrow

LGTM mod a comment, but I would like a second opinion from another reviewer before this merges.

Meanwhile, have you been able to confirm that once you make this change that you see a failed test sometimes, but that it doesn't propagate like it did before?

acmorrow · 2025-01-07T16:36:16Z

micro-rdk/src/lib.rs

+            lock.clear_poison();
+            lock_result.into_inner()


Another reason why this is OK is that lock poisoning is mostly about ensuring that the state guarded by the mutex can be trusted to be in a consistent state. In this case, the state is unit, so there is no possibility of it having an inconsistent state.

Honestly this would probably be better written with a binary semaphore, since there isn't really state to be protected, but there doesn't seem to be such a thing in the standard library right now that we can use.

That said, I think there should be a comment here above clear_poison explaining why we feel justified to clear the poison state.

mattjperez · 2025-01-07T16:56:05Z

Meanwhile, have you been able to confirm that once you make this change that you see a failed test sometimes, but that it doesn't propagate like it did before?

I added a panic!() in one of the tests after it takes the lock, the other tests still pass.

npmenard

Why does it lead to false positive? The mutex ensure tests are run sequentially if one fails then we should focus on fixing why the test fails. The mutex should never be poisoned no?

mattjperez · 2025-01-07T17:12:55Z

Why does it lead to false positive? The mutex ensure tests are run sequentially if one fails then we should focus on fixing why the test fails. The mutex should never be poisoned no?

ideally it would never be poisoned, but this will at least reduce the poison-lock noise, showing just the relevant errors of the test failing for legitimate reasons.

acmorrow · 2025-01-07T17:17:50Z

Why does it lead to false positive? The mutex ensure tests are run sequentially if one fails then we should focus on fixing why the test fails. The mutex should never be poisoned no?

The mutex will be poisoned if a test fails due to a panic, and then all of the subsequent tests which attempt to use the lock will also fail, making it hard to see which test actually failed and which were just fallout from the lock poisoning. Since there isn't actually any meaningful state protected by the mutex, I believe it is OK to clear the poisoning, so that the other tests can meaningfully execute and we get a clear indication of which test failed. Which we should then of course fix.

Here is a playground link which will show the issue: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=6161ddf7b147e9e58714b35ccdf7454e

You may need to run it 10 or so times before you get the ordering that shows the issue, but when ok_test runs after panicky_test you will see that both tests are reported as failed.

npmenard · 2025-01-07T18:54:01Z

Some tests do share resources even if not explicitly guarded by the Mutex (like sockets). It's fair to clear the poison but later tests may fail in less obvious ways. We should revisit this when we address CI and focus on fixing flaky tests if they happen. Too bad we can't have a fail-fast option :P But fine to merge it

mattjperez requested a review from a team as a code owner January 7, 2025 15:06

mattjperez commented Jan 7, 2025

View reviewed changes

acmorrow approved these changes Jan 7, 2025

View reviewed changes

npmenard reviewed Jan 7, 2025

View reviewed changes

mattjperez added 2 commits January 7, 2025 14:01

clear lock if poisoned

a138ff6

add comment to poison clearing

3e0d578

mattjperez force-pushed the rsdk-9366-poisoned-lock-in-tests branch from 18e27f1 to 3e0d578 Compare January 7, 2025 19:01

mattjperez merged commit 3baca97 into viamrobotics:main Jan 7, 2025
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RSDK-9366] Clear poisoned lock in tests #371

[RSDK-9366] Clear poisoned lock in tests #371

mattjperez commented Jan 7, 2025

mattjperez Jan 7, 2025

acmorrow Jan 7, 2025

acmorrow left a comment

acmorrow Jan 7, 2025

mattjperez commented Jan 7, 2025

npmenard left a comment

mattjperez commented Jan 7, 2025 •

edited

Loading

acmorrow commented Jan 7, 2025

npmenard commented Jan 7, 2025

[RSDK-9366] Clear poisoned lock in tests #371

[RSDK-9366] Clear poisoned lock in tests #371

Conversation

mattjperez commented Jan 7, 2025

mattjperez Jan 7, 2025

Choose a reason for hiding this comment

acmorrow Jan 7, 2025

Choose a reason for hiding this comment

acmorrow left a comment

Choose a reason for hiding this comment

acmorrow Jan 7, 2025

Choose a reason for hiding this comment

mattjperez commented Jan 7, 2025

npmenard left a comment

Choose a reason for hiding this comment

mattjperez commented Jan 7, 2025 • edited Loading

acmorrow commented Jan 7, 2025

npmenard commented Jan 7, 2025

mattjperez commented Jan 7, 2025 •

edited

Loading