
Side load of fully-signed snapshot #1858

Open · noonio opened this issue Feb 18, 2025 · 6 comments · May be fixed by #1864
Labels
amber ⚠️ Medium complexity or partly unclear feature 💭 idea An idea or feature request

Comments

@noonio (Contributor) commented Feb 18, 2025

Todo:

  • Refine this description and potential solutions
  • Define a scenario that we can use to test solutions
    • [ ] Need to find out how to construct these "diverging views" and how to resolve them (pumba sets? Maybe, if any still fail after raft!)
  • Understand why this is a better solution than Clear pending transactions API command #1284 (i.e. doesn't fall into the same trap)

Description

Processing transactions in a Hydra head requires each node to agree on transactions. The protocol validates transactions (on the NewTx command) against its local view of the ledger state, using the passed --ledger-protocol-parameters. As transactions can be valid or invalid depending on configuration (or, to some extent, on the exact build version of hydra-node), it is possible that one node accepts a transaction while the peer nodes do not.

Currently, this means that the node which accepted the transaction now has a different local state than the other nodes and might try to spend outputs that the other nodes don't see as available. For example, when using hydraw, the node would use outputs introduced by the previous pixel-painting transaction, but the other nodes would deem any such new transaction invalid with a BadInputs error.
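This divergence can be illustrated with a minimal sketch (hypothetical names and values; not hydra-node code): two nodes validate the same transaction against different --ledger-protocol-parameters, so one accepts it and the other does not.

```python
# Sketch (hypothetical, not hydra-node code): two nodes validate the
# same transaction against different --ledger-protocol-parameters.

def validate(tx_size: int, params: dict) -> bool:
    # A node accepts a tx only if it fits that node's configured maxTxSize.
    return tx_size <= params["maxTxSize"]

node_a = {"maxTxSize": 16384}  # correctly configured peer
node_b = {"maxTxSize": 8192}   # misconfigured peer

tx_size = 12000
accepted_a = validate(tx_size, node_a)  # True: tx enters A's local ledger view
accepted_b = validate(tx_size, node_b)  # False: B rejects it, views diverge

# A may now build further txs on outputs B has never seen, so B answers
# any such follow-up tx with a BadInputs error.
```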

Within this feature, we want to improve the UX of hydra-node in the presence of such misalignments.

Note: We should only adopt snapshots that are enforceable on L1.

Suggested solution

  • Allow adoption of a new snapshot. This snapshot has to be:

    • the same for everyone, and have a snapshot number and version strictly bigger than the previous ones.
    • enforceable and valid against the current state of the protocol on L1:
      • it must be signed by everyone (somehow).
  • Allow introspection of the current snapshot in a particular node

    • We want to be able to notice if the head has become stuck. The why might be tricky, but the who would be sufficient for the time being.
    • We want to be able to observe who is missing their signature on the current snapshot in flight (which is preventing it from getting confirmed).
    • Having flow metrics would not help in this scenario, given we can face a use case where a single tx is being rejected by one of the peers, and this does not depend on volume.
  • Work out what constraints are required to accept a new snapshot
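The constraints listed above can be sketched as a single acceptance predicate (hypothetical types and party names; not the hydra-node API):

```python
# Sketch (hypothetical types and party names, not the hydra-node API) of
# the acceptance checks for a side-loaded snapshot.
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    number: int
    version: int
    signatures: frozenset  # identifiers of the parties that signed

ALL_PARTIES = frozenset({"alice", "bob", "carol"})

def acceptable(new: Snapshot, confirmed: Snapshot) -> bool:
    """A side-loaded snapshot must be strictly newer than the confirmed one
    and signed by every party, so it stays enforceable on L1."""
    return (
        new.number > confirmed.number
        and new.version > confirmed.version
        and new.signatures == ALL_PARTIES
    )

confirmed = Snapshot(number=4, version=1, signatures=ALL_PARTIES)
ok = acceptable(Snapshot(5, 2, ALL_PARTIES), confirmed)                # newer, fully signed
stale = acceptable(Snapshot(4, 1, ALL_PARTIES), confirmed)             # not strictly newer
partial = acceptable(Snapshot(5, 2, frozenset({"alice"})), confirmed)  # missing signatures
```

Any snapshot failing such a predicate would be ignored, so side-loading can never roll a head back or adopt a partially signed state.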

What

In this setting, every peer has to manually cooperate by posting a command to reset their local state to a previously confirmed snapshot.

Scenarios

  1. A configuration discrepancy (like --ledger-protocol-parameters) arises after the Head is open; for example, maxTxSize/maxTxExecutionUnits could be fine for the first NewTx but too small for the second, putting a peer in disagreement.

    • this also requires the misconfigured peer to reset its node and fix its config to be aligned with the rest of the parties.
  2. A peer going offline for too long and failing to catch up or resend AckSn.

Additional context

Compared to #1284, this solution does not depend on how long a peer is offline.
Here, when a peer comes back online, the networking layer will make sure the reconnecting peer catches up; but if someone clears its pending txs while that is happening, it will create a worse scenario, where parties end up on different confirmed snapshots.

@noonio noonio added the 💭 idea An idea or feature request label Feb 18, 2025
@noonio noonio added the red 💣 💥 ⁉️ Very complex, risky or just not well understood feature label Feb 18, 2025
@ffakenz ffakenz changed the title Allow parties to adopt a fully-signed snapshot Side load of fully-signed snapshot Feb 19, 2025
@ffakenz ffakenz added amber ⚠️ Medium complexity or partly unclear feature and removed red 💣 💥 ⁉️ Very complex, risky or just not well understood feature labels Feb 19, 2025
@noonio noonio moved this from Triage 🏥 to In progress 🕐 in ☕ Hydra Team Work Feb 20, 2025
@noonio noonio moved this from In progress 🕐 to Todo 📋 in ☕ Hydra Team Work Feb 20, 2025
@ffakenz ffakenz linked a pull request Feb 25, 2025 that will close this issue
@noonio noonio moved this from Todo 📋 to In progress 🕐 in ☕ Hydra Team Work Feb 28, 2025
@noonio (Contributor, Author) commented Mar 3, 2025

As discussed, there are two quirks here:

  1. The networking side is resolved, basically, by the Replace network with etcd #1854 branch. So we won't try to replicate that here (even though we could maybe attempt it on top of Investigate resilience of hydra when one node goes offline and a tx is submitted #1792).
  2. The protocol parameter changes were not found to be quite as impactful because of details of how we write the persistence when detecting that situation (cc @ch1bo)

That said, I think there are still two ways/reasons to proceed here:

  1. This feature is a nice "escape hatch" if there does ever happen to be a stuck head. We can't know all the ways this can happen, but it may be useful to have this as a debugging/hot-fixing tool for when we encounter it.
  2. It should still be possible to write a BehaviourSpec test that simply rejects txs (i.e. mocking out invalid protocol parameters) and then still allows a snapshot to be loaded.

Regarding this final point, @ffakenz do you want to take a shot at writing up the gist of a test that would do that? And we can at least see if we can get to a failure that this feature could resolve?

@GeorgeFlerovsky commented Mar 5, 2025

@ch1bo @noonio
I thought about this problem recently while writing the Hydrozoa spec. My solution is to no longer have the non-leader peers maintain unconfirmed local ledger states.

Specifically:

  • When a transaction is broadcast via ReqTx, all non-leader peers simply cache it in the mempool without verifying. Only the next block's leader verifies it, updating the leader's unconfirmed ledger state.

  • When the current block obtains all signatures, the next block is broadcast by its leader as usual, but the block's contents are slightly different. The block affirms one list of transactions and rejects another list of transactions.

  • The non-leader peers only start validating transactions when they receive the next block from the leader. Verification proceeds as follows:

    1. Initialize a temporary ledger state, setting it equal to the confirmed ledger state.
    2. Apply each affirmed tx to the temporary ledger state.
    3. Check that each rejected tx is invalid relative to the last ledger state reached in (2).
  • If block verification succeeds, the peer updates its local confirmed ledger state, removes the block's affirmed and rejected txs from mempool, and broadcasts the block signature.

The advantages of this approach are that peers' local ledger states never diverge and peers' mempools are regularly flushed of stale/invalid txs.
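The verification steps above can be sketched as follows (hypothetical UTxO and tx representations; not Hydrozoa or hydra-node code): non-leaders replay the leader's affirmed txs from the confirmed state and check that each rejected tx really is invalid against the result.

```python
# Sketch (hypothetical types, not Hydrozoa code) of leader-block
# verification by a non-leader peer.

def apply_tx(ledger: dict, tx: dict) -> dict:
    """Consume the tx's inputs and add its outputs; fail on missing inputs."""
    missing = tx["inputs"] - ledger.keys()
    if missing:
        raise ValueError(f"BadInputs: {missing}")
    new = {k: v for k, v in ledger.items() if k not in tx["inputs"]}
    new.update(tx["outputs"])
    return new

def verify_block(confirmed: dict, affirmed: list, rejected: list) -> dict:
    # 1. Initialize a temporary ledger state from the confirmed state.
    state = dict(confirmed)
    # 2. Apply each affirmed tx in order.
    for tx in affirmed:
        state = apply_tx(state, tx)
    # 3. Each rejected tx must be invalid against the state reached in (2).
    for tx in rejected:
        try:
            apply_tx(state, tx)
        except ValueError:
            continue
        raise ValueError("block rejects a tx that is actually valid")
    # On success, the peer adopts `state` as its new confirmed ledger state.
    return state

confirmed = {"utxo1": 10}
affirmed = [{"inputs": {"utxo1"}, "outputs": {"utxo2": 10}}]
rejected = [{"inputs": {"utxo1"}, "outputs": {"utxo3": 10}}]  # double-spend
new_state = verify_block(confirmed, affirmed, rejected)
```

On success, the peer would remove the affirmed and rejected txs from its mempool and broadcast its block signature, as described above.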

@GeorgeFlerovsky commented Mar 5, 2025

I think that Hydra's current approach of peers maintaining local unconfirmed ledger states is a redundant remnant from Cardano L1.

On L1, every peer applies every tx to its local ledger state as soon as it is received. This is done because the peer needs to know immediately whether the tx is valid and should be gossiped further or the tx is invalid and should be suppressed.

In Hydra's L2 protocol, every tx is directly broadcast to all peers regardless of its validity, so there's no need to immediately decide whether it is valid.

@ch1bo (Member) commented Mar 11, 2025

@GeorgeFlerovsky Good points, and some of them crossed my mind too. The current implementation has not departed from the original paper for fear of accidentally breaking the original consensus protocol. Maybe it's time to be braver and make things better by design, as you suggest.

I know where the need for a local ledger view comes from, though: in the original design there was no round-robin leader; anyone could propose snapshots, as often or rarely as they wanted. For this, each participant would maintain a current view of the world, which is especially important if snapshots are not taken after each tx. In summary, we departed from this in two ways:

  1. coordinated leadership
  2. snapshots every transaction

I like the mempool way of putting things, as it would hint at specifying the off-chain protocol in a more robust-by-design way of propagating information (i.e. not using a reliable multicast assumption) and separates "diffusion" of transactions and snapshots. This follows more the Cardano design and we could leverage similar pull-based networking. But before we could do that, I would want to have a protocol written up this way first (I asked Matthias Fitzi a couple months ago about it, but priorities changed).

@GeorgeFlerovsky commented Mar 11, 2025

@ch1bo Yup, makes sense.

As a starting point for your protocol writeup, please take a look at how we described the offchain consensus protocol in the Hydrozoa spec (§ 5):

We'll likely have a newer version in the next two weeks (incorporating feedback from ~120 comments in the discussion), but it should already give you a clear idea of our protocol.

If you or your team have any feedback/questions, please comment in the same discussion.

@GeorgeFlerovsky commented Mar 11, 2025

I like the mempool way of putting things, as it would hint at specifying the off-chain protocol in a more robust-by-design way of propagating information (i.e. not using a reliable multicast assumption) and separates "diffusion" of transactions and snapshots. This follows more the Cardano design and we could leverage similar pull-based networking.

Theoretically, ReqTx could be sent only to the next K snapshot leaders (K > 1 for robustness) instead of all peers. The snapshot leader who affirms/rejects the transaction would then include the whole transaction in the snapshot, not just the tx hash.

However, I think it's more optimal for the transaction submitter to multicast ReqTx to all peers, for these reasons:

  1. It avoids an extra hop (Submitter -> Snapshot leader -> Peers) for each L2 transaction.
  2. L2 transactions are multicast to all peers in parallel without waiting for the next snapshot to be created.

not using a reliable multicast assumption

Indeed, if we drop the reliable multicast assumption for ReqTx, then any peer that reaches a timeout before receiving an L2 transaction mentioned by a snapshot can request the missing L2 transactions from the snapshot leader.

However, we'll keep the assumption in Hydrozoa for now, as I'm sure that many other nuances will arise if we drop it.
