Support multiple RPC URLs in watchtower #4748

Merged


@nk4rter nk4rter commented Feb 3, 2025

Problem

Agave watchtower currently supports only one RPC URL, which makes that endpoint a single point of failure. We would like to be able to specify several RPC endpoints for redundancy.
More details and discussion are available in issue #4621.

Summary of Changes

  • Added a --urls argument that takes exactly 3 values and conflicts with --url (the reasoning is described in README.md; see also the discussion in the issue).
  • Added a watchtower-reliability test, whose alert is triggered if fewer than endpoints.len() / 2 + 1 endpoints are reachable or if the endpoints provide inconsistent results.
  • The info collected from the cluster and the logic that analyzes it are unchanged. This PR only adds logic on top of it to aggregate the issues reported by multiple endpoints.
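
To illustrate the majority rule described above, here is a minimal Rust sketch. It is not the code from this PR: `RoundOutcome` and `aggregate` are hypothetical names, and each endpoint's result is reduced to an optional string (`None` meaning the endpoint was unreachable). Only the `endpoints.len() / 2 + 1` threshold comes from the PR itself.

```rust
use std::collections::HashMap;

/// Outcome of one polling round across all configured RPC endpoints.
/// (Hypothetical type, used only for this illustration.)
#[derive(Debug, PartialEq, Eq)]
enum RoundOutcome {
    /// A strict majority of endpoints answered with the same report.
    Agreed(String),
    /// Too few endpoints answered, or no report reached a majority.
    Unreliable,
}

/// Smallest number of endpoints that forms a strict majority.
fn min_agreeing_endpoints(endpoint_count: usize) -> usize {
    endpoint_count / 2 + 1
}

/// `results` holds one entry per endpoint: `Some(report)` if the endpoint
/// answered, `None` if it was unreachable.
fn aggregate(results: &[Option<String>]) -> RoundOutcome {
    let quorum = min_agreeing_endpoints(results.len());

    // Count how many reachable endpoints produced each distinct report.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for report in results.iter().flatten() {
        *counts.entry(report.as_str()).or_insert(0) += 1;
    }

    // At most one report can reach a strict majority.
    counts
        .into_iter()
        .find(|(_, n)| *n >= quorum)
        .map(|(report, _)| RoundOutcome::Agreed(report.to_string()))
        .unwrap_or(RoundOutcome::Unreliable)
}
```

With 3 endpoints the threshold is 2, so one unreachable or disagreeing endpoint is tolerated, while two are not.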

@mergify mergify bot requested a review from a team February 3, 2025 11:18
@nk4rter nk4rter force-pushed the feature/support-multiple-rpc-urls-in-watchtower branch 2 times, most recently from abb96e8 to e1139ae Compare February 4, 2025 11:40
@nk4rter nk4rter marked this pull request as ready for review February 4, 2025 11:42
@willhickey willhickey self-requested a review February 4, 2025 17:51
@nk4rter nk4rter force-pushed the feature/support-multiple-rpc-urls-in-watchtower branch 2 times, most recently from 81a2e42 to 6cdb076 Compare February 11, 2025 12:15
@steviez steviez requested a review from t-nelson February 16, 2025 23:03
@willhickey

Thank you for the PR! Sorry for the delay in reviewing but I'll get to it this week.

@mircea-c

mircea-c commented Mar 3, 2025

Hi @nk4rter! Thank you for submitting this. Community contributions to the code are very welcome. While it generally looks good to me, I do have a couple of comments. Please address them when you get the time.

})
.collect();

let min_agreeing_endpoints = endpoints.len() / 2 + 1;

Since there is no validation of the endpoints, this is a bit of a misnomer.

You can set each endpoint to a different network. I tested with one of each devnet / testnet / mainnet configured for the same process and watchtower didn't complain.

This needs some verification that the endpoints it's connecting to actually agree. I think the bare minimum should be that they all return the same genesis hash when queried. Ideally, watchtower could check for slot and / or block hash and try to do a meaningful comparison between the 3 endpoints at startup.
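
A minimal sketch of the genesis hash check suggested here, in Rust. It assumes the hashes have already been fetched from each endpoint (for example with the `getGenesisHash` RPC method) and are passed in as plain strings. The function name `check_same_genesis` and the `(url, hash)` pair representation are hypothetical, not taken from the PR.

```rust
/// Returns the shared genesis hash if every endpoint reported the same one,
/// or an error message naming the first endpoint that disagrees.
/// Each entry is a hypothetical `(endpoint_url, genesis_hash)` pair.
fn check_same_genesis<'a>(hashes: &[(&'a str, &'a str)]) -> Result<&'a str, String> {
    // Use the first endpoint as the reference value.
    let Some(&(first_url, first_hash)) = hashes.first() else {
        return Err("no endpoints configured".to_string());
    };

    // Every other endpoint must report exactly the same hash.
    for &(url, hash) in &hashes[1..] {
        if hash != first_hash {
            return Err(format!(
                "genesis hash mismatch: {first_url} reports {first_hash}, {url} reports {hash}"
            ));
        }
    }
    Ok(first_hash)
}
```

A check like this would reject the mixed devnet / testnet / mainnet configuration described above, since each cluster has its own genesis hash.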

Author

@nk4rter nk4rter Mar 5, 2025


Thanks for the review! Yeah, it does make sense to compare genesis hashes as a sanity check. I'll do that.

That said, I'm not sure I fully understand what else you could check at startup. If you query the slot number, for example, my understanding is that you're very likely to get slightly different slots from different nodes.

We could certainly check that they are within a certain range of each other. Should we do that, or did you mean something different? If so, should this range be configurable?


You are correct. If you query 3 nodes, even in parallel, odds are you'll get slightly different slot numbers.

Comparing that they are within range is probably the easiest solution, and I think that would be enough for the purposes of this app.

Author

@nk4rter nk4rter Mar 5, 2025


  • What would be an appropriate range for slots to check?
  • Is it fine to hardcode this range?


  • I think it needs to be configurable, as the values for the different networks (mainnet / testnet / devnet) will probably differ.
  • The default... hmm... using mainnet numbers of 150 slots per minute, maybe a third of that to start and see how it does. So within 50 slots of each other.
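
The slot range check discussed in this thread could look roughly like the Rust sketch below. The 50-slot default is the figure proposed above (a third of roughly 150 slots per minute, which corresponds to about 400 ms per slot). The names `DEFAULT_MAX_SLOT_DISTANCE` and `slots_within_range` are hypothetical, and the slots are assumed to have been fetched from each endpoint beforehand.

```rust
/// Default tolerance proposed in the review thread. Hypothetical constant
/// name; the thread suggests making the value configurable.
const DEFAULT_MAX_SLOT_DISTANCE: u64 = 50;

/// Returns true if the highest and lowest reported slots are at most
/// `max_distance` apart. An empty slice is treated as a failed check.
fn slots_within_range(slots: &[u64], max_distance: u64) -> bool {
    match (slots.iter().min(), slots.iter().max()) {
        (Some(min), Some(max)) => max - min <= max_distance,
        _ => false,
    }
}
```

Comparing only the minimum and maximum is sufficient: if those two are within the tolerance, every other pair of endpoints is as well.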

Author


@mircea-c I implemented endpoint validation at startup. Sorry for the slight delay, it has been a busy time at work lately. Please review :)


Thank you for your patience @nk4rter. This LGTM now.

@nk4rter nk4rter force-pushed the feature/support-multiple-rpc-urls-in-watchtower branch from 6cdb076 to 16cd6e7 Compare March 13, 2025 11:20
@nk4rter nk4rter requested a review from mircea-c March 13, 2025 13:36
mircea-c previously approved these changes Mar 19, 2025

@mircea-c mircea-c left a comment


Ready to 🚢

@nk4rter nk4rter requested a review from mircea-c March 20, 2025 13:21
@mircea-c mircea-c added the CI Pull Request is ready to enter CI label Mar 20, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Mar 20, 2025
@mircea-c mircea-c merged commit a2d7ead into anza-xyz:master Mar 20, 2025
49 checks passed