Configurable bootstrap timeout for existing clusters #19316

pchaseh · 2025-02-02T21:23:06Z

What would you like to be added?

GetClusterFromRemotePeers, which is called when bootstrapping an etcd member against an existing cluster, uses a hard-coded 10s timeout. This could be made configurable via the existing bootstrap timeout option.

Why is this needed?

This bit me while I was adding new etcd members to my cluster, which is used with Patroni for a high-availability PostgreSQL setup: the members endpoint was taking a few hundred milliseconds longer than the fixed 10-second timeout. Of course, the source of the problem is that my etcd members could not reply in a timely manner (the cause remains TBD), but the ability to override this at runtime could have saved me a lot of time. I did try redeploying etcd with increased timeouts, but it took me a while to realize that none of them apply when bootstrapping from an existing cluster, so I had to build my own etcd container with code changes. I'd be happy to submit a patch for this, assuming it's a desirable feature.


ivanvc · 2025-02-13T19:20:24Z

Discussed during our triage meeting. ~~This is a support question, so we'll move it to a discussion.~~

Can we have a TL confirm whether we want to do this? @ahrtr, @serathius.

pchaseh · 2025-02-13T19:50:44Z

It's worth mentioning that I eventually identified the problem, in case anyone runs into something similar. We have several internal DNS servers that etcd members rely on to resolve peer hostnames. When one of them became unavailable (unfortunately the first one tried), the time the resolvers took to fail over to the next nameserver exceeded this bootstrap timeout. Let pN be an uninitialized peer we were attempting to add, and pA and pB be existing cluster members. We spent a lot of time diagnosing the connection between pA (which pN was bootstrapping from) and pN, when it was actually slow name resolution between pA and pB that was causing the hang-up. I was able to reproduce the hanging bootstrap request via:

time curl -v --cacert ./etcd-peer.ca.pem --cert pN-cert.pem --key pN-key.pem -s -w "\n" --http1.1 https://pA:2380/members

pchaseh added the type/feature label Feb 2, 2025

ivanvc added the stage/triaged label Feb 13, 2025
