
Side load of fully-signed snapshot #1858

Open · noonio opened this issue Feb 18, 2025 · 6 comments · May be fixed by #1864
Labels
amber ⚠️ Medium complexity or partly unclear feature 💭 idea An idea or feature request

Comments

@noonio (Contributor) commented Feb 18, 2025

Todo:

  • Refine this description and potential solutions
  • Define a scenario that we can use to test solutions
    • [ ] Need to find out how to construct these "diverging views" and how to resolve them (pumba sets? Maybe, if any still fail after raft!)
  • Understand why this is a better solution than Clear pending transactions API command #1284 (i.e. doesn't fall into the same trap)

Description

Processing transactions in a Hydra head requires each node to agree on transactions. The protocol validates transactions (on the NewTx command) against its local view of the ledger state, using the passed --ledger-protocol-parameters. As transactions can be valid or invalid depending on configuration (or, to some extent, on the exact build version of hydra-node), it is possible that one node accepts a transaction while the peer nodes do not.

Currently, this means that the node which accepted the transaction now has a different local state than the other nodes and might try to spend outputs that the other nodes don't see as available. For example, when using hydraw, the node would use outputs introduced by the previous pixel-painting transaction, but the other nodes would deem any such new transaction invalid with a BadInputs error.
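This divergence can be illustrated with a minimal sketch (hypothetical names and values; not hydra-node code): two nodes validate the same transaction against different --ledger-protocol-parameters, so one accepts it and the other does not.

```python
# Sketch (hypothetical, not hydra-node code): two nodes validate the
# same transaction against different --ledger-protocol-parameters.

def validate(tx_size: int, params: dict) -> bool:
    # A node accepts a tx only if it fits that node's configured maxTxSize.
    return tx_size <= params["maxTxSize"]

node_a = {"maxTxSize": 16384}  # correctly configured peer
node_b = {"maxTxSize": 8192}   # misconfigured peer

tx_size = 12000
accepted_a = validate(tx_size, node_a)  # True: tx enters A's local ledger view
accepted_b = validate(tx_size, node_b)  # False: B rejects it, views diverge

# A may now build further txs on outputs B has never seen, so B answers
# any such follow-up tx with a BadInputs error.
```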

Within this feature, we want to improve the UX of hydra-node in the presence of such misalignments.

Note: We should only adopt snapshots that are enforceable on L1.

Suggested solution

  • Allow adoption of a new snapshot. This snapshot has to be:

    • the same for everyone, and have a snapshot number and version strictly bigger than the previous ones.
    • enforceable and valid against the current state of the protocol on L1:
      • it must be signed by everyone (somehow).
  • Allow introspection of the current snapshot in a particular node

    • We want to be able to notice if the head has become stuck. The why might be tricky, but the who would be sufficient for the time being.
    • We want to be able to observe who is missing their signature on the current snapshot in flight (which is preventing it from getting confirmed).
    • Having flow metrics would not help in this scenario, given we can face a use case where a single tx is being rejected by one of the peers, and this does not depend on volume.
  • Work out what constraints are required to accept a new snapshot
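The constraints listed above can be sketched as a single acceptance predicate (hypothetical types and party names; not the hydra-node API):

```python
# Sketch (hypothetical types and party names, not the hydra-node API) of
# the acceptance checks for a side-loaded snapshot.
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    number: int
    version: int
    signatures: frozenset  # identifiers of the parties that signed

ALL_PARTIES = frozenset({"alice", "bob", "carol"})

def acceptable(new: Snapshot, confirmed: Snapshot) -> bool:
    """A side-loaded snapshot must be strictly newer than the confirmed one
    and signed by every party, so it stays enforceable on L1."""
    return (
        new.number > confirmed.number
        and new.version > confirmed.version
        and new.signatures == ALL_PARTIES
    )

confirmed = Snapshot(number=4, version=1, signatures=ALL_PARTIES)
ok = acceptable(Snapshot(5, 2, ALL_PARTIES), confirmed)                # newer, fully signed
stale = acceptable(Snapshot(4, 1, ALL_PARTIES), confirmed)             # not strictly newer
partial = acceptable(Snapshot(5, 2, frozenset({"alice"})), confirmed)  # missing signatures
```

Any snapshot failing such a predicate would be ignored, so side-loading can never roll a head back or adopt a partially signed state.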

What

In this setting, every peer has to manually cooperate by posting a command to reset their local state to a previously confirmed snapshot.

Scenarios

  1. A configuration discrepancy (like --ledger-protocol-parameters) arises after the Head is open; for example, maxTxSize/maxTxExecutionUnits could be fine for the first NewTx but too small for the second, putting a peer in disagreement.

    • this also requires the misconfigured peer to reset its node and fix its config to be aligned with the rest of the parties.
  2. A peer going offline for too long and failing to catch up or resend AckSn.

Additional context

Compared to #1284, this solution does not depend on how long a peer is offline.
Here, when a peer comes back online, the networking layer will make sure the reconnecting peer catches up; but if someone clears its pending txs while that is happening, it will create a worse scenario, where parties end up on different confirmed snapshots.

@noonio noonio added the 💭 idea An idea or feature request label Feb 18, 2025
@noonio noonio added the red 💣 💥 ⁉️ Very complex, risky or just not well understood feature label Feb 18, 2025
@ffakenz ffakenz changed the title Allow parties to adopt a fully-signed snapshot Side load of fully-signed snapshot Feb 19, 2025
@ffakenz ffakenz added amber ⚠️ Medium complexity or partly unclear feature and removed red 💣 💥 ⁉️ Very complex, risky or just not well understood feature labels Feb 19, 2025
@noonio noonio moved this from Triage 🏥 to In progress 🕐 in ☕ Hydra Team Work Feb 20, 2025
@noonio noonio moved this from In progress 🕐 to Todo 📋 in ☕ Hydra Team Work Feb 20, 2025
@ffakenz ffakenz linked a pull request Feb 25, 2025 that will close this issue
@noonio noonio moved this from Todo 📋 to In progress 🕐 in ☕ Hydra Team Work Feb 28, 2025
@noonio (Contributor, Author) commented Mar 3, 2025

As discussed, there are two quirks here:

  1. The networking side is resolved, basically, by the Replace network with etcd #1854 branch. So we won't try to replicate that here (even though we could maybe attempt it on top of Investigate resilience of hydra when one node goes offline and a tx is submitted #1792).
  2. The protocol parameter changes were not found to be quite as impactful because of details of how we write the persistence when detecting that situation (cc @ch1bo)

That said, I think there are still two ways/reasons to proceed here:

  1. This feature is a nice "escape hatch" if there does ever happen to be a stuck head. We can't know all the ways this can happen, but it may be useful to have this as a debugging/hot-fixing tool for when we encounter it.
  2. It should still be possible to write a BehaviourSpec test that simply rejects txs (i.e. mocking out invalid protocol parameters) and then still allows a snapshot to be loaded.

Regarding this final point, @ffakenz do you want to take a shot at writing up the gist of a test that would do that? And we can at least see if we can get to a failure that this feature could resolve?

@GeorgeFlerovsky commented Mar 5, 2025

@ch1bo @noonio
I thought about this problem recently while writing the Hydrozoa spec. My solution is to no longer have the non-leader peers maintain unconfirmed local ledger states.

Specifically:

  • When a transaction is broadcast via ReqTx, all non-leader peers simply cache it in the mempool without verifying. Only the next block's leader verifies it, updating the leader's unconfirmed ledger state.

  • When the current block obtains all signatures, the next block is broadcast by its leader as usual, but the block's contents are slightly different. The block affirms one list of transactions and rejects another list of transactions.

  • The non-leader peers only start validating transactions when they receive the next block from the leader. Verification proceeds as follows:

    1. Initialize a temporary ledger state, setting it equal to the confirmed ledger state.
    2. Apply each affirmed tx to the temporary ledger state.
    3. Check that each rejected tx is invalid relative to the last ledger state reached in (2).
  • If block verification succeeds, the peer updates its local confirmed ledger state, removes the block's affirmed and rejected txs from mempool, and broadcasts the block signature.

The advantages of this approach are that peers' local ledger states never diverge and peers' mempools are regularly flushed of stale/invalid txs.
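The verification steps above can be sketched as follows (hypothetical UTxO and tx representations; not Hydrozoa or hydra-node code): non-leaders replay the leader's affirmed txs from the confirmed state and check that each rejected tx really is invalid against the result.

```python
# Sketch (hypothetical types, not Hydrozoa code) of leader-block
# verification by a non-leader peer.

def apply_tx(ledger: dict, tx: dict) -> dict:
    """Consume the tx's inputs and add its outputs; fail on missing inputs."""
    missing = tx["inputs"] - ledger.keys()
    if missing:
        raise ValueError(f"BadInputs: {missing}")
    new = {k: v for k, v in ledger.items() if k not in tx["inputs"]}
    new.update(tx["outputs"])
    return new

def verify_block(confirmed: dict, affirmed: list, rejected: list) -> dict:
    # 1. Initialize a temporary ledger state from the confirmed state.
    state = dict(confirmed)
    # 2. Apply each affirmed tx in order.
    for tx in affirmed:
        state = apply_tx(state, tx)
    # 3. Each rejected tx must be invalid against the state reached in (2).
    for tx in rejected:
        try:
            apply_tx(state, tx)
        except ValueError:
            continue
        raise ValueError("block rejects a tx that is actually valid")
    # On success, the peer adopts `state` as its new confirmed ledger state.
    return state

confirmed = {"utxo1": 10}
affirmed = [{"inputs": {"utxo1"}, "outputs": {"utxo2": 10}}]
rejected = [{"inputs": {"utxo1"}, "outputs": {"utxo3": 10}}]  # double-spend
new_state = verify_block(confirmed, affirmed, rejected)
```

On success, the peer would remove the affirmed and rejected txs from its mempool and broadcast its block signature, as described above.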

@GeorgeFlerovsky commented Mar 5, 2025

I think that Hydra's current approach of peers maintaining local unconfirmed ledger states is a redundant remnant from Cardano L1.

On L1, every peer applies every tx to its local ledger state as soon as it is received. This is done because the peer needs to know immediately whether the tx is valid and should be gossiped further or the tx is invalid and should be suppressed.

In Hydra's L2 protocol, every tx is directly broadcast to all peers regardless of its validity, so there's no need to immediately decide whether it is valid.

@ch1bo (Member) commented Mar 11, 2025

@GeorgeFlerovsky Good points, and some of them crossed my mind too. The current implementation has not departed from the original paper for fear of accidentally breaking the original consensus protocol. Maybe it's time to be braver and make things better by design, as you suggest.

I know where the need for a local ledger view comes from, though: in the original design there was no round-robin leader; anyone could propose snapshots, as often or rarely as they wanted. For this, each participant would maintain a current view of the world, which is especially important if snapshots are not taken after each tx. In summary, we departed from this in two ways:

  1. coordinated leadership
  2. snapshots every transaction

I like the mempool way of putting things, as it would hint at specifying the off-chain protocol in a more robust-by-design way of propagating information (i.e. not using a reliable multicast assumption) and separates "diffusion" of transactions and snapshots. This follows more the Cardano design and we could leverage similar pull-based networking. But before we could do that, I would want to have a protocol written up this way first (I asked Matthias Fitzi a couple months ago about it, but priorities changed).

@GeorgeFlerovsky commented Mar 11, 2025

@ch1bo Yup, makes sense.

As a starting point for your protocol writeup, please take a look at how we described the offchain consensus protocol in the Hydrozoa spec (§ 5):

We'll likely have a newer version in the next two weeks (incorporating feedback from ~120 comments in the discussion), but it should already give you a clear idea of our protocol.

If you or your team have any feedback/questions, please comment in the same discussion.

@GeorgeFlerovsky commented Mar 11, 2025

I like the mempool way of putting things, as it would hint at specifying the off-chain protocol in a more robust-by-design way of propagating information (i.e. not using a reliable multicast assumption) and separates "diffusion" of transactions and snapshots. This follows more the Cardano design and we could leverage similar pull-based networking.

Theoretically, ReqTx could be sent only to the next K snapshot leaders (K > 1 for robustness) instead of all peers. The snapshot leader who affirms/rejects the transaction would then include the whole transaction in the snapshot, not just the tx hash.

However, I think it's more optimal for the transaction submitter to multicast ReqTx to all peers, for these reasons:

  1. It avoids an extra hop (Submitter -> Snapshot leader -> Peers) for each L2 transaction.
  2. L2 transactions are multicast to all peers in parallel without waiting for the next snapshot to be created.

not using a reliable multicast assumption

Indeed, if we drop the reliable multicast assumption for ReqTx, then any peer that reaches a timeout before receiving an L2 transaction mentioned by a snapshot can request the missing L2 transactions from the snapshot leader.

However, we'll keep the assumption in Hydrozoa for now, as I'm sure that many other nuances will arise if we drop it.
