CI: Enable automated valgrind tests #280

rousskov · 2025-01-07T16:03:27Z

TBD

It is best to version valgrind suppression file in Squid repository (as opposed to some CI tool repository) because suppressions are Squid-version specific. The suppression file should not rot too much because automated CI tests will flag pull requests that add leaks. It would be nice if CI tests could also warn about no-longer-needed suppressions, but implementing that is difficult because it requires analyzing valgrind logs across all CI tests that use valgrind (including across all tested Squid build environments/configurations). SquidMain() was split into two functions (and two AsyncJob::Start() calls were moved) to be able to ignore leaks that happen, roughly speaking, _before_ the start of the main loop. See initialization-before-primary-loop suppression. This split would not be necessary if valgrind supported "negative" suppressions (i.e. if valgrind could suppress call stacks that do _not_ include Foo()).

* lack of ip_table cleanup affects DNS queries, not just /etc/hosts * valgrind reports "Invalid read of size 8" errors for OpenSSL code used by Security::ServerOptions::updateContextEecdh() Also fixed a spelling error in a comment and removed a placeholder(?).

IP cache entries store callbacks. Callbacks may lock objects that are awaiting them. If that happens, Valgrind will show a memory leak. Eliminating these false leak signals helps in leak triage by removing difficult-to-rule-out suspects. Context: Looking for an object that holds leaking HttpRequest that holds leaking ConnStateData. Seeing leaking ipcache_entry objects that were allocated after ConnStateData was. Suspecting that an ipcache_entry may keep a shared pointer lock on something that is related to HttpRequest. This change rejected that suspicion, but stopped ipcache_entry leaks.

This change partially addresses XXX in the beginning of SquidShutdown() by allowing connection closure handlers in helper code to run for two TLS-related helpers. TODO: Apply similar changes to all helpers. Without this change, Helper::Session objects are not destroyed at Squid exit, and Valgrind reports at least two MemBuf memory leaks associated with Helper::Session fields: 32 bytes in 1 blocks are definitely lost in loss record 354 of 1,458 at 0x4848899: malloc (vg_replace_malloc.c:381) by 0xAE4247: xmalloc (xalloc.cc:105) by 0xAC4202: MemPoolMalloc::allocate() (PoolMalloc.cc:39) by 0x695C19: Mem::Allocator::alloc() (Allocator.h:47) by 0xAC0A95: Mem::AllocatorProxy::alloc() (AllocatorProxy.cc:18) by 0x695EA6: cbdata::operator new(unsigned long) (cbdata.cc:48) by 0x694FC0: cbdataInternalAlloc(int) (cbdata.cc:154) by 0x585932: MemBuf::operator new(unsigned long) (MemBuf.h:25) by 0x70A053: helperDispatch(Helper::Session*, Helper::Xaction*) (helper.cc:1478) by 0x703140: Helper::Client::submitRequest(Helper::Xaction*) (helper.cc:464) 32 bytes in 1 blocks are definitely lost in loss record 355 of 1,458 at 0x4848899: malloc (vg_replace_malloc.c:381) by 0xAE4247: xmalloc (xalloc.cc:105) by 0xAC4202: MemPoolMalloc::allocate() (PoolMalloc.cc:39) by 0x695C19: Mem::Allocator::alloc() (Allocator.h:47) by 0xAC0A95: Mem::AllocatorProxy::alloc() (AllocatorProxy.cc:18) by 0x695EA6: cbdata::operator new(unsigned long) (cbdata.cc:48) by 0x694FC0: cbdataInternalAlloc(int) (cbdata.cc:154) by 0x585932: MemBuf::operator new(unsigned long) (MemBuf.h:25) by 0x70A053: helperDispatch(Helper::Session*, Helper::Xaction*) (helper.cc:1478) by 0x703140: Helper::Client::submitRequest(Helper::Xaction*) (helper.cc:464)

…er I/O One positive side effect of this workaround is that ALE and/or other CodeContext objects have fewer chances to get stuck at exit (and be declared as leaking by valgrind or find-alive.pl).

It is not yet clear whether cleaning IP cache on shutdown is a good idea in general, but it should _not_ help with Valgrind memory leak reports related to lookup callbacks (i.e. the primary motivation behind adding this cleanup code) because the corresponding IP cache entries ought to be locked[^1], and the new cache cleaning code skips locked entries. [^1]: If entries with callbacks are not locked, then the requestor will get stuck when the entry is purged (e.g., for cache capacity reasons, during regular Squid operations; see ipcache_purgelru()). This reverts commit 9ab2f85.

This suppression is not necessary to pass current tests, but it is a good idea to suppress this known case of a limited-allocations "global", especially since keeping those allocations "exposed" complicates triage of other leak suspects by inciting developers to chaise this red herring.

When Squid is built with optimizations, this function is usually inlined. Valgrind does not mangle inlined functions when checking suppressions, so the original fun:_ZL14initializeOnceiPPc suppression does not match in optimized builds! To combat the above problem, I added fun:initializeOnce, a second initializeOnce() suppression. This one was targeting optimized builds. After doing that, I began to worry that this function name is too generic for its essentially "global" suppression context and renamed it to SquidMainInitializeOnce(). Further testing suggests that dealing with optimized builds is probably too difficult -- too many suppressions need to be "duplicated" to accommodate optimization variations. The next change will commit to non-optimized builds and will remove initialization-before-primary-loop-optimized suppression added here, but I wanted to keep this change for the record. Also polished initializeOnce() description.

As detailed in the previous branch commit, testing optimized builds requires maintaining multiple sets of suppressions. Technically, the number of suppressions per "leak" can get large due to optimization variations. The number of variations can be cut down using function name masks, but such masks decrease suppression precision, increasing the risk of accidentally matching an unknown/true/undisclosed leak. Non-optimized builds also take less time to compile and ease triage of coredumps and similar problems. The drawback of using non-optimized builds is that deployed Squids run optimized code. Ideally, we want to test deployed code. TODO: Consider testing both builds, one with valgrind and one without.

Without this option, Valgrind reports too many false leaks. I did not investigate why, assuming that this has something to do with cbdata memory allocation trick in regular builds (and that this ./configure option was added primarily to avoid those false positives).

The old "start as nobody" scheme was too awkward to support, especially as we start adding more configuration files (e.g., suppression file) because, on GitHub Actions runners, user "nobody" cannot read cloned sources. Moreover, it is arguably better to start Squid as root because many (most?) deployments probably do that, and some Squid features require that. Latest squid-overlord agents listen on localhost (by default) and, hence, are far more difficult to abuse, even when running as root, in most test environments. Also reduce chances that a developer starts Overlord accidentally by effectively requiring passwordless sudo (which is typical for CI environments). Folks can still start Overlord manually (as any user), and test-functionality.sh will just use that running agent, of course. Requires squid-overlord commit 953ef9b6. Benefits from squid-overlord commit 47cf864e.

thus preventing Valgrind from reporting leak problems.

This reverts commit 08ec76e. That (experimental) commit was wrong -- HeaderUpdater leak was caused by a different problem. The correct fix will be committed next.

This fix also reduces memory leak false positives, reported by Valgrind. ---- Cherry-picked SQUID-1034-convert-cbdata-htable-global commit 2d2ffd0 that became master/v8 commit bbcef1c (with an adjusted commit message).

... fixed by the previous branch commit. Adjusted Ipc::UdsSender leak suppressions. Those leaks were (evidently incorrectly) suspected to be caused by HeaderUpdater leaks, but they are still present after HeaderUpdater leaks were fixed.

Resolved minor conflicts to facilitate GitHub Actions branch tests and future bag71 updates with recent branch fixes.

rousskov added 8 commits November 13, 2024 08:58

WIP: SQUID-683 workaround: Do not hijack transaction context for help…

5cbd862

…er I/O One positive side effect of this workaround is that ALE and/or other CodeContext objects have fewer chances to get stuck at exit (and be declared as leaking by valgrind or find-alive.pl).

Suppress more "needs investigation" leaks

c20b1d5

rousskov changed the base branch from master to bag71 January 7, 2025 16:04

rousskov and others added 12 commits January 7, 2025 11:14

Merge master/v7 to resolve trivial conflicts with recent workflows split

387c0b4

fixup: Clarify "before-primary-loop" does not imply "before main()"

8b1eba9

fixup: Document that these suppressions need --disable-optimizations

3b13cbb

Stop the HeaderUpdater job on shutdown

08ec76e

thus preventing Valgrind from reporting leak problems.

Revert "Stop the HeaderUpdater job on shutdown"

41d6d08

This reverts commit 08ec76e. That (experimental) commit was wrong -- HeaderUpdater leak was caused by a different problem. The correct fix will be committed next.

NoNewGlobals for cbdata_htable

a5ee42d

This fix also reduces memory leak false positives, reported by Valgrind. ---- Cherry-picked SQUID-1034-convert-cbdata-htable-global commit 2d2ffd0 that became master/v8 commit bbcef1c (with an adjusted commit message).

Merged bag71 into SQUID-1011-ci-test-reconfiguration

1c51e37

Resolved minor conflicts to facilitate GitHub Actions branch tests and future bag71 updates with recent branch fixes.

rousskov changed the base branch from bag71 to master February 12, 2025 14:41

rousskov changed the base branch from master to bag71 February 12, 2025 14:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI: Enable automated valgrind tests #280

CI: Enable automated valgrind tests #280

rousskov commented Jan 7, 2025

CI: Enable automated valgrind tests #280

Are you sure you want to change the base?

CI: Enable automated valgrind tests #280

Conversation

rousskov commented Jan 7, 2025