Support multiple RPC URLs in watchtower #4748

nk4rter · 2025-02-03T11:18:22Z

Problem

Agave watchtower currently supports only one RPC URL. This RPC URL is a single point of failure, so we would like to be able to specify several RPC endpoints for redundancy.
More details and discussion are available in this issue: #4621

Summary of Changes

Added a --urls argument that takes exactly 3 values and conflicts with --url (the reasoning is described in README.md; see also the discussion in the issue).
Added a watchtower-reliability test, whose alert is triggered if fewer than endpoints.len() / 2 + 1 endpoints are reachable or if the endpoints return inconsistent results.
The information collected from the cluster and the logic used to analyze it are unchanged. This PR only adds logic on top of that to aggregate the issues reported by multiple endpoints.
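
The aggregation rule in the summary can be illustrated with a small sketch. This is not the code from the PR: the `Verdict` enum, the `aggregate` function, and the representation of each endpoint's result as `Option<Result<(), String>>` are hypothetical simplifications. Only the `endpoints.len() / 2 + 1` threshold and the idea that unreliability is reported when too few endpoints agree come from the PR description and commit titles.

```rust
use std::collections::BTreeMap;

/// Outcome of aggregating one polling round across all endpoints (hypothetical type).
#[derive(Debug, PartialEq)]
enum Verdict {
    /// A majority of endpoints report a healthy cluster.
    Healthy,
    /// A majority of endpoints report the same failure message.
    Failure(String),
    /// Too few endpoints are reachable, or their results disagree.
    Unreliable,
}

/// Each entry is one endpoint's result for the round:
/// `None` = unreachable, `Some(Ok(()))` = healthy, `Some(Err(msg))` = failure reported.
fn aggregate(results: &[Option<Result<(), String>>]) -> Verdict {
    // Majority threshold, as in the PR: more than half of the endpoints.
    let min_agreeing_endpoints = results.len() / 2 + 1;

    let num_healthy = results
        .iter()
        .filter(|r| matches!(r, Some(Ok(()))))
        .count();
    if num_healthy >= min_agreeing_endpoints {
        return Verdict::Healthy;
    }

    // Count how many endpoints report each distinct failure message.
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for r in results {
        if let Some(Err(msg)) = r {
            *counts.entry(msg.as_str()).or_insert(0) += 1;
        }
    }
    for (msg, n) in counts {
        if n >= min_agreeing_endpoints {
            return Verdict::Failure(msg.to_string());
        }
    }

    // Neither "healthy" nor any single failure reached a majority.
    Verdict::Unreliable
}
```

With three endpoints the threshold is 2, so a single unreachable or disagreeing endpoint is tolerated, while two unreachable endpoints yield `Verdict::Unreliable`.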

willhickey · 2025-02-25T16:56:27Z

Thank you for the PR! Sorry for the delay in reviewing but I'll get to it this week.

mircea-c · 2025-03-03T20:43:06Z

Hi @nk4rter! Thank you for submitting this. Community contributions to the code are very welcome. While it generally looks good to me, I do have a couple of comments. Please address them when you get the time.

mircea-c · 2025-03-03T20:15:42Z

watchtower/src/main.rs

+        })
+        .collect();
+
+    let min_agreeing_endpoints = endpoints.len() / 2 + 1;


Since there is no validation of the endpoints, this is a bit of a misnomer.

You can set each endpoint to a different network. I tested with one of each devnet / testnet / mainnet configured for the same process and watchtower didn't complain.

This needs some verification that the endpoints it's connecting to actually agree. I think the bare minimum should be that they all return the same genesis hash when queried. Ideally, watchtower could check for slot and / or block hash and try to do a meaningful comparison between the 3 endpoints at startup.
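
As a rough illustration of the genesis hash check suggested here, the comparison itself could look like the sketch below. The function name `check_genesis_hashes` and the `(url, hash)` string pairs are hypothetical; in practice the hashes would be obtained from each endpoint (for example via the `getGenesisHash` JSON-RPC method), which is left out so the example stays self-contained.

```rust
/// Returns Ok(()) if every endpoint reports the same genesis hash.
/// Otherwise returns an error naming the first endpoint that disagrees
/// with the first one in the list.
/// `hashes` holds hypothetical (endpoint URL, genesis hash) pairs.
fn check_genesis_hashes(hashes: &[(&str, &str)]) -> Result<(), String> {
    let (first, rest) = match hashes.split_first() {
        Some(parts) => parts,
        None => return Err("no endpoints configured".to_string()),
    };
    let (first_url, first_hash) = first;

    for (url, hash) in rest {
        if hash != first_hash {
            return Err(format!(
                "genesis hash mismatch: {} reports {}, but {} reports {}",
                first_url, first_hash, url, hash
            ));
        }
    }
    Ok(())
}
```

Running this once at startup would reject a configuration that mixes, say, a devnet endpoint with two mainnet endpoints, because their genesis hashes differ.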

nk4rter · 2025-03-05

Thanks for the review! Yeah, it does make sense to compare genesis hashes as a sanity check. I'll do that.

Although I'm not sure I fully understand what else you could check at startup... For example, if you query the slot number, my understanding is that you're very likely to get slightly different slots from different nodes.

We could certainly check that they are within a certain range of each other. Should we do that, or did you mean something different? If so, should this range be configurable?

mircea-c · 2025-03-05

You are correct. If you query 3 nodes, even in parallel, odds are you'll get slightly different slot numbers.

Comparing that they are within range is probably the easiest solution, and I think that would be enough for the purposes of this app.

nk4rter · 2025-03-05

What would be an appropriate range for slots to check?

Is it fine to hardcode this range?

mircea-c · 2025-03-05

I think it needs to be configurable as values for the different networks (mainnet / testnet / devnet) will probably differ.

The default... hmm... using mainnet numbers of 150 slots per minute, maybe a third of that to start and see how it does. So within 50 slots of each other.
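
A minimal sketch of the slot range check discussed above, assuming the slots have already been fetched from each endpoint. The names `DEFAULT_MAX_SLOT_DISTANCE` and `slots_within_range` are hypothetical, and how the tolerance would be exposed as a configuration option is not shown. Only the default of 50 slots comes from this thread.

```rust
/// Default tolerance suggested in the review: roughly a third of the
/// ~150 slots that mainnet produces per minute. (Hypothetical constant name.)
const DEFAULT_MAX_SLOT_DISTANCE: u64 = 50;

/// Returns true if the highest and lowest reported slots are at most
/// `max_distance` apart. An empty list is treated as trivially consistent.
fn slots_within_range(slots: &[u64], max_distance: u64) -> bool {
    match (slots.iter().min(), slots.iter().max()) {
        (Some(min), Some(max)) => max - min <= max_distance,
        _ => true, // nothing to compare
    }
}
```

Comparing only the minimum and maximum is enough here, since any pair of endpoints is within range exactly when the two extremes are.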

nk4rter · 2025-03-13

@mircea-c I implemented endpoint validation at startup. Sorry for the slight delay, things have been busy at work lately... Please review :)

mircea-c · 2025-03-19

Thank you for your patience @nk4rter. This LGTM now.

mircea-c

Ready to 🚢

mergify bot added community need:merge-assist labels Feb 3, 2025

mergify bot requested a review from a team February 3, 2025 11:18

nk4rter force-pushed the feature/support-multiple-rpc-urls-in-watchtower branch 2 times, most recently from abb96e8 to e1139ae Compare February 4, 2025 11:40

nk4rter marked this pull request as ready for review February 4, 2025 11:42

willhickey self-requested a review February 4, 2025 17:51

nk4rter force-pushed the feature/support-multiple-rpc-urls-in-watchtower branch 2 times, most recently from 81a2e42 to 6cdb076 Compare February 11, 2025 12:15

steviez requested a review from t-nelson February 16, 2025 23:03

mircea-c reviewed Mar 3, 2025

nk4rter added 12 commits March 13, 2025 13:20

Accepting multiple urls in watchtower

426703d

Getting data from multiple endpoints

59ed580

Collect and report multiple failures, add watchtower-reliability alert

6addd1b

Take exactly 1 or 3 URLs in watchtower

4a95ae8

Make failure mesasge order stable, concatenate msgs for thresh skip

5be57ca

Send concatenated msg to the notifier

1fc0832

Cleaup: use original tuple instead of EndpointFailure

e631985

Ignoring failures if num_healthy >= min_agreeing_endpoints

754de2d

Bring watchtower readme up to date

76a7f59

Unrealiability takes priority over other failures, trigger unrealiability if results are inconsistent

02620f8

Add some minor fixes

ed5c766

Implement endpoint validation

16cd6e7

nk4rter force-pushed the feature/support-multiple-rpc-urls-in-watchtower branch from 6cdb076 to 16cd6e7 Compare March 13, 2025 11:20

nk4rter requested a review from mircea-c March 13, 2025 13:36

mircea-c previously approved these changes Mar 19, 2025

Fix clippy issues

c387725

nk4rter dismissed mircea-c’s stale review via c387725 March 19, 2025 21:54

nk4rter requested a review from mircea-c March 20, 2025 13:21

mircea-c added the CI label (Pull Request is ready to enter CI) Mar 20, 2025

anza-team removed the CI label (Pull Request is ready to enter CI) Mar 20, 2025

mircea-c approved these changes Mar 20, 2025

mircea-c merged commit a2d7ead into anza-xyz:master Mar 20, 2025
49 checks passed
