Configurable bootstrap timeout for existing clusters #19316

Open
pchaseh opened this issue Feb 2, 2025 · 2 comments

pchaseh commented Feb 2, 2025

What would you like to be added?

GetClusterFromRemotePeers, which is called when bootstrapping an etcd member against an existing cluster, uses a hard-coded 10s timeout. This could be made configurable via the existing bootstrap timeout option.

Why is this needed?

This bit me while I was adding new etcd members to my cluster, which is used with Patroni for a high-availability PostgreSQL setup: the members endpoint was taking a few hundred milliseconds longer than the fixed 10-second timeout. Of course, the source of the problem is that my etcd members could not reply in a timely manner (the cause of which remains TBD), but the ability to override this at runtime could have saved me a lot of time. I did try redeploying etcd with increased timeouts, but it took me some time to realize that none of them apply when bootstrapping from an existing cluster, so I had to build my own etcd container with code changes. I'd be happy to submit a patch for this, assuming it's a desirable feature.

ivanvc (Member) commented Feb 13, 2025

Discussed during our triage meeting. This is a support question, so we'll move it to a discussion.

Can we have a TL to confirm if we want to do this? @ahrtr, @serathius.

pchaseh (Author) commented Feb 13, 2025

It's worth mentioning that I eventually identified the problem, in case anyone runs into something similar. We have several internal DNS servers that etcd members rely on to resolve peer hostnames. When one of them became unavailable (unfortunately the first one that was being tried), the time the resolvers took to fail over to the next nameserver exceeded this bootstrap timeout. Let pN be an uninitialized peer we were attempting to add, and pA and pB be existing cluster members. We invested a lot of time diagnosing the connection between pA (which pN was bootstrapping from) and pN, when it was the lengthy name resolution between pA and pB that was causing the hang-up. I was able to reproduce the hanging bootstrap request via:

time curl -v --cacert ./etcd-peer.ca.pem --cert pN-cert.pem --key pN-key.pem -s -w "\n" --http1.1 https://pA:2380/members
