Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report cache_peer context in probe and standby pool messages #277

Open
wants to merge 16 commits into
base: bag51
Choose a base branch
from

Conversation

eduard-bagdasaryan
Copy link

@eduard-bagdasaryan eduard-bagdasaryan commented Dec 2, 2024

ERROR: Failed to establish a TCP connection
    error: timeout while trying to established a TCP connection
    destination: ...
    Squid-to-peer TCP connection: ... cache_peer=peerXyz flags=1
    seconds past the establishment deadline: 1

The absence of the usual "current master transaction:..." detail in
certain errors raises “Has Squid lost the current transaction context?”
red flags. In some cases, Squid may have lost that context, but for
cache_peer TCP probes, Squid has not because those probes are not
associated with master transactions. It is difficult to distinguish the
two cases because no processing context is reported. To address those
concerns, we now report current cache_peer probing context (instead of
just not reporting absent master transaction context):

ERROR: ...
    current cache_peer probe: peerXyz

When maintaining a cache_peer standy=N connection pool, Squid has and
now reports both contexts, attributing messages to pool maintenance:

ERROR: ...
    current cache_peer standy pool: peerXyz
    current master transaction: master42

The new PrecomputedCodeContext class handles both reporting cases and
can be reused whenever the cost of pre-computing detailCodeContext()
output is acceptable.

src/neighbors.cc Outdated
@@ -1113,7 +1113,7 @@ neighborUp(const CachePeer * p)
{
if (!p->tcp_up) {
// TODO: When CachePeer gets its own CodeContext, pass that context instead of nullptr
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this PR completes this TODO:

Suggested change
// TODO: When CachePeer gets its own CodeContext, pass that context instead of nullptr

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src/CachePeer.cc Outdated
CachePeerProbeCodeContext::detailCodeContext(std::ostream &os) const
{
if (peer.valid())
os << Debug::Extra << "current cache_peer probe: " << peer->name;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see our commit a555a85 regarding CachePeer reporting rules.

Suggested change
os << Debug::Extra << "current cache_peer probe: " << peer->name;
os << Debug::Extra << "current cache_peer probe: " << *peer;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

src/CachePeer.cc Outdated
Comment on lines 120 to 123
if (peer.valid())
return peer->id.detach();

return ScopedId("CachePeerProbeCodeContext w/o peer");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's simplify this code and not add CachePeer::id in this PR. We already have a TODO about adding InstanceId to such classes. Also, the whole idea of having codeContextGist() returning an identifier may be an API bug: We should rethink how FadingCodeContext works, so that we can avoid identifier copying and convert this method into an std::ostream-printing method.

Suggested change
if (peer.valid())
return peer->id.detach();
return ScopedId("CachePeerProbeCodeContext w/o peer");
// Unfortunately, peer->name lifetime is too short. See also: AnyP::PortCfg::codeContextGist().
return ScopedId("cache_peer probe");

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@@ -18,6 +18,20 @@ class HttpRequest;
class CachePeer;
class CommConnectCbParams;

class PeerPoolMgrContext : public CodeContext
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see whether this class can be generalized into DetailedCodeContext class that takes an SBuf detail and a c-string gist as constructor parameters (and stores that info in its data members, to be reported using CodeContext APIs). If that works, use DetailedCodeContext instead of CachePeerProbeCodeContext as well.

Eventually, CachePeer will be reference-counted and, hence, may not need a dedicated CodeContext data member, but (re)using DetailedCodeContext for CachePeer now may help us make progress while those (complex) changes are pending (and waiting other/stuck smooth reconfiguration PRs).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Author

@eduard-bagdasaryan eduard-bagdasaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (30978d0).

src/CachePeer.cc Outdated
Comment on lines 120 to 123
if (peer.valid())
return peer->id.detach();

return ScopedId("CachePeerProbeCodeContext w/o peer");
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

src/CachePeer.cc Outdated
CachePeerProbeCodeContext::detailCodeContext(std::ostream &os) const
{
if (peer.valid())
os << Debug::Extra << "current cache_peer probe: " << peer->name;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

src/neighbors.cc Outdated
@@ -1113,7 +1113,7 @@ neighborUp(const CachePeer * p)
{
if (!p->tcp_up) {
// TODO: When CachePeer gets its own CodeContext, pass that context instead of nullptr
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -18,6 +18,20 @@ class HttpRequest;
class CachePeer;
class CommConnectCbParams;

class PeerPoolMgrContext : public CodeContext
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src/neighbors.cc Outdated
ipcache_nbgethostbyname(p->host, peerDNSConfigure, p);
for (p = Config.peers; p; p = p->next) {
CallService(p->probeCodeContext, [&] {
ipcache_nbgethostbyname(p->host, peerDNSConfigure, p);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CallService() is for switching to "service" context. ipcache_nbgethostbyname() service is not peer-specific -- the DNS lookup service serves all peers and may even share one lookup across multiple service requesting clients. We should be setting the service caller context here, not the service context. I think it would be more appropriate to use explicit CodeContext::Reset() calls in this loop, just like in neighborsUdpPing() loop modified in this PR.

N.B. peerProbeConnect() is a peer-specific "service".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@@ -32,7 +33,8 @@ PeerPoolMgr::PeerPoolMgr(CachePeer *aPeer): AsyncJob("PeerPoolMgr"),
request(),
transportWait(),
encryptionWait(),
addrUsed(0)
addrUsed(0),
context(new DetailedCodeContext("cache_peer standby pool", ToSBuf("current cache_peer standby pool", *peer)))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing delimiter?

Suggested change
context(new DetailedCodeContext("cache_peer standby pool", ToSBuf("current cache_peer standby pool", *peer)))
context(new DetailedCodeContext("cache_peer standby pool", ToSBuf("current cache_peer standby pool: ", *peer)))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -69,6 +92,9 @@ class PeerPoolMgr: public AsyncJob
JobWait<Security::BlindPeerConnector> encryptionWait;

unsigned int addrUsed; ///< counter for cycling through peer addresses

public:
DetailedCodeContext::Pointer context;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding CodeContext to PeerPoolMgr job is a red flag for me: We have numerous AsyncJobs that do master transaction-related work. Virtually none1 of them store a CodeContext pointer to report that master transaction -- after their ALE-based code context is established, those jobs automatically get the right context via AsyncCalls and such. DetailedCodeContext is different from ALE-based context, but I see no reason for PeerPoolMgr job to be different with regard to its overall context storage approach.

If we can apply existing ALE CodeContext handling/storage approach to PeerPoolMgr jobs without heroic efforts, let's do that instead of adding this data member.

Footnotes

  1. TunnelStateData stores CodeContext but with a TODO to remove that storage.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's move it from the job, but we have to keep it somewhere anyway in order to pass to PeerPoolMgr::checkpoint(). I moved it to CachePeer::standby structure, which looks logical.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's move it from the job,

Hm... Its precise location (inside the job) was not the red flag I was trying to describe. The red flag was the addition of that CodeContext data member (as such, in any class), not the addition of PeerPoolMgr::context (specifically). Sorry.

we have to keep it somewhere anyway in order to pass to PeerPoolMgr::checkpoint()

Aha, PeerPoolMgr::Checkpoint() is the piece I was missing! We have to store this CodeContext (somewhere) because PeerPoolMgr::Checkpoint() is not called via events or async calls and, hence, cannot reuse those mechanisms for relaying context back into PeerPoolMgr job. PeerPoolMgr is probably the best location for this context storage because this context is specific to PeerPoolMgr. Please undo that move.

Copy link

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are zeroing-in on the final solution here. Please note that some change requests will require more code adjustments than a single GitHub suggestion can express.

Comment on lines 35 to 36

void setMasterXaction(const MasterXaction::Pointer &mx) { masterXaction = mx; }
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not add this unused method:

Suggested change
void setMasterXaction(const MasterXaction::Pointer &mx) { masterXaction = mx; }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

transportWait(),
encryptionWait(),
addrUsed(0)
{
const auto mx = MasterXaction::MakePortless<XactionInitiator::initPeerPool>();
context = new DetailedCodeContext("cache_peer standby pool", ToSBuf("current cache_peer standby pool: ", *peer), mx);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR code avoids storing CachePeer inside DetailedCodeContext by computing the corresponding detail in advance. Let's assume that decision to precompute was correct.

If possible, please apply the same "let's precompute!" decision to MasterXaction storage, eliminating DetailedCodeContext::masterXaction data member and generalizing/simplifying DetailedCodeContext to better match the new class name/intent.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new SBuf field (storing mx->id) instead.

src/PeerPoolMgr.cc Outdated Show resolved Hide resolved
public:
typedef RefCount<DetailedCodeContext> Pointer;

DetailedCodeContext(const char *gist, const SBuf &detail, const MasterXaction::Pointer &mx) : gist_(gist),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere, please use natural language "items: item1, item2" formatting for labeled lists:

Suggested change
DetailedCodeContext(const char *gist, const SBuf &detail, const MasterXaction::Pointer &mx) : gist_(gist),
DetailedCodeContext(const char *gist, const SBuf &detail, const MasterXaction::Pointer &mx): gist_(gist),

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 21 to 22
/// CodeContext for cache_peer related classes
class DetailedCodeContext : public CodeContext
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming other change requests in this review are correct:

Suggested change
/// CodeContext for cache_peer related classes
class DetailedCodeContext : public CodeContext
/// CodeContext with constant details known at construction time
class PrecomputedCodeContext: public CodeContext

And move this declaration to src/base/PrecomputedCodeContext.{h,cc}.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -32,6 +56,8 @@ class PeerPoolMgr: public AsyncJob
explicit PeerPoolMgr(CachePeer *aPeer);
~PeerPoolMgr() override;

DetailedCodeContext::Pointer context; ///< the pool manager context
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename -- there are too many contexts in this context :-).

Suggested change
DetailedCodeContext::Pointer context; ///< the pool manager context
DetailedCodeContext::Pointer codeContext;

I cannot think of any useful documentation for this self-documented declaration. The proposed description in current PR code reads like "foo is foo" to me. Existing official code has many self-documented codeContext data members, so I hope we do not have to describe this one either.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed.

Copy link
Author

@eduard-bagdasaryan eduard-bagdasaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (7fd1491).

transportWait(),
encryptionWait(),
addrUsed(0)
{
const auto mx = MasterXaction::MakePortless<XactionInitiator::initPeerPool>();
context = new DetailedCodeContext("cache_peer standby pool", ToSBuf("current cache_peer standby pool: ", *peer), mx);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new SBuf field (storing mx->id) instead.

@@ -32,6 +56,8 @@ class PeerPoolMgr: public AsyncJob
explicit PeerPoolMgr(CachePeer *aPeer);
~PeerPoolMgr() override;

DetailedCodeContext::Pointer context; ///< the pool manager context
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed.

Comment on lines 35 to 36

void setMasterXaction(const MasterXaction::Pointer &mx) { masterXaction = mx; }
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Comment on lines 21 to 22
/// CodeContext for cache_peer related classes
class DetailedCodeContext : public CodeContext
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

public:
typedef RefCount<DetailedCodeContext> Pointer;

DetailedCodeContext(const char *gist, const SBuf &detail, const MasterXaction::Pointer &mx) : gist_(gist),
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src/PeerPoolMgr.cc Outdated Show resolved Hide resolved
src/PeerPoolMgr.cc Outdated Show resolved Hide resolved
src/base/PrecomputedCodeContext.cc Outdated Show resolved Hide resolved
@rousskov rousskov changed the title Use cache_peer context to report probe failures Report cache_peer context in probe and standby pool messages Dec 10, 2024
Copy link

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with two minor adjustments. Please resolve merge conflicts as well.

Please also review new PR title/description and adjust as needed.


codeContext = new PrecomputedCodeContext("cache_peer standby pool", ToSBuf("current cache_peer standby pool: ", *peer,
Debug::Extra, "current master transaction: ", mx->id));
// ErrorState, getOutgoingAddress(), and other APIs may require a request.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep request-related code lines separate:

Suggested change
// ErrorState, getOutgoingAddress(), and other APIs may require a request.
// ErrorState, getOutgoingAddress(), and other APIs may require a request.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#include "base/InstanceId.h"
#include "sbuf/SBuf.h"

/// CodeContext with constant details known at construction time

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, compiling this header code requires including <ostream> because this code uses "<<" expressions. Please include that STL header even if it is currently included (directly or indirectly) by the headers above.

Suggested change
/// CodeContext with constant details known at construction time
#include <ostream>
/// CodeContext with constant details known at construction time

IIRC, including <iosfwd> is not enough in such cases.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants