What would you like to be added?
GetClusterFromRemotePeers, which is called when bootstrapping an etcd member against an existing cluster, uses a hard-coded 10s timeout. This could be made configurable via the existing bootstrap timeout option.
Why is this needed?
This bit me while I was attempting to add new etcd members to my cluster, which is used with Patroni for a high-availability PostgreSQL setup: the members endpoint was taking a few hundred milliseconds longer than the fixed 10 second timeout. Of course, the source of the problem is that my etcd members could not reply in a timely manner (for which the cause remains TBD), but the ability to override this at runtime could have saved me a lot of time. I did try redeploying etcd with increased timeouts, but it took me some time to realize that none of them applied when bootstrapping from an existing cluster, so I had to build my own etcd container with code changes. I'd be happy to submit a patch for this, assuming it's a desirable feature.
It's worth mentioning that I eventually identified the problem, in case anyone runs into something similar. We have several internal DNS servers that etcd members rely on to resolve peer hostnames. When one of them became unavailable (unfortunately the first one tried), the time the resolvers took to fail over to the next nameserver exceeded this bootstrap timeout. Let pN be the uninitialized peer we were attempting to add, and pA and pB be existing cluster members. We invested a lot of time diagnosing the connection between pN and pA (which pN was bootstrapping from), when it was actually the lengthy name resolution between pA and pB that was causing the hang-up. I was able to reproduce the hanging bootstrap request via