
Pass messages from network crate to managers #147

Draft · wants to merge 4 commits into base: unstable
Conversation

@dknopik (Member) commented Feb 18, 2025

Passes messages to the application by converting them into messages understood by the QBFT and signing managers, and by directly calling their functions meant for receiving network messages. These functions queue the messages for consumption by the corresponding long-running tasks.
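For orientation, a minimal sketch of that pattern with hypothetical names and simplified types (not the crate's actual API): the network layer calls directly into the manager, which only queues the message; a long-running task drains the queue and does the actual work.

use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

// Simplified stand-in for the decoded wire message.
struct NetworkMessage {
    committee_id: [u8; 32],
    payload: Vec<u8>,
}

// The manager owns the sending half of a bounded queue; its long-running
// task owns the receiving half and drains it.
struct QbftManagerSketch {
    tx: SyncSender<NetworkMessage>,
}

impl QbftManagerSketch {
    fn new(capacity: usize) -> (Self, Receiver<NetworkMessage>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx }, rx)
    }

    // Called directly by the network layer; it only queues and never blocks.
    fn receive_network_message(&self, msg: NetworkMessage) {
        if let Err(TrySendError::Full(_)) = self.tx.try_send(msg) {
            eprintln!("qbft queue full, dropping message");
        }
    }
}

fn main() {
    let (manager, rx) = QbftManagerSketch::new(1024);

    // Long-running consumer, here just a thread for simplicity.
    let consumer = std::thread::spawn(move || {
        for msg in rx {
            println!("processing message for committee {:02x?}", &msg.committee_id[..4]);
            let _ = msg.payload;
        }
    });

    // Network side: convert the wire message and hand it to the manager.
    manager.receive_network_message(NetworkMessage {
        committee_id: [0u8; 32],
        payload: vec![1, 2, 3],
    });

    drop(manager); // dropping the sender ends the consumer loop
    consumer.join().unwrap();
}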

There is also some infrastructure created in this PR:

  • CommitteeId
  • Partial Signature Message

It includes the changes from #137, and therefore supersedes it.

@dknopik marked this pull request as ready for review February 19, 2025 08:19
@dknopik added the ready-for-review ("This PR is ready to be reviewed") and network labels Feb 19, 2025
let network = Network::try_new(
    &config.network,
    subnet_tracker,
    qbft_manager.clone(),

@diegomrsantos:

Do you think the network should be aware of the QBFT Manager? Could we communicate between them using a channel?

@dknopik (Member, Author):

Effectively, that's what the QBFT manager struct does: selecting the correct QBFT instance and sending a message to it.

@diegomrsantos:

Could this be achieved by decoupling the network and the manager and establishing communication between them using a single channel, through which all messages are routed within the manager?

@dknopik (Member, Author):

One more channel means one more channel that must do one of the following if it can't keep up:

  • grow unbounded
  • block
  • drop messages

Not sure if that is worth it.

@diegomrsantos:

Those are good points, but let's pause this for a while. I noticed we're using unbounded channels for both per-instance communication and network transmission. Since that could lead to memory issues in production, I was thinking that switching to bounded channels might be a good idea. What do you think?

@dknopik (Member, Author):

Agree on that.
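For reference, a minimal sketch of the difference, assuming tokio's mpsc channels are what the crate uses (an assumption about its channel type): a bounded channel caps memory but forces an explicit policy when the queue fills up.

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Unbounded: sending never waits or fails due to capacity, but memory
    // grows without limit if the consumer falls behind.
    let (unbounded_tx, mut unbounded_rx) = mpsc::unbounded_channel::<Vec<u8>>();
    unbounded_tx.send(vec![1]).unwrap();

    // Bounded: capacity caps memory; when full, the sender must either
    // await (backpressure) or drop the message via try_send.
    let (bounded_tx, mut bounded_rx) = mpsc::channel::<Vec<u8>>(1024);
    if bounded_tx.try_send(vec![2]).is_err() {
        eprintln!("queue full, dropping message");
    }

    println!("{:?} {:?}", unbounded_rx.recv().await, bounded_rx.recv().await);
}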

@diegomrsantos:

Regarding the coupling:

  1. Direct method calls between Network → QbftManager create tight coupling
  2. Makes it difficult to test the network layer without QBFT logic

Central Channel-Based Solution:

Network → (channel) → QbftManager → (per-instance channels) → QBFT instances

Benefits:

  1. Enables testing the network in isolation by:
    • Mocking the output channel
    • Verifying sent messages without QBFT dependencies
  2. Allows testing the QBFT manager with:
    • Mocked network messages
  3. Clearer boundaries between layers
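A rough sketch of the routing this proposal implies, using hypothetical names and std channels for brevity (not the crate's actual types): the network holds only a sender, and the manager's task fans messages out to per-instance queues.

use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

type CommitteeId = [u8; 32];

struct NetMessage {
    committee_id: CommitteeId,
    payload: Vec<u8>,
}

// The network layer holds only this sender and knows nothing about QBFT.
fn network_side(tx: Sender<NetMessage>) {
    let _ = tx.send(NetMessage { committee_id: [0; 32], payload: vec![1, 2, 3] });
}

// The manager's long-running task routes each message to its instance queue.
fn manager_loop(rx: Receiver<NetMessage>, instances: HashMap<CommitteeId, Sender<Vec<u8>>>) {
    for msg in rx {
        if let Some(instance_tx) = instances.get(&msg.committee_id) {
            let _ = instance_tx.send(msg.payload);
        }
        // Messages for unknown committees are simply ignored.
    }
}

fn main() {
    let (net_tx, net_rx) = channel();
    let (instance_tx, instance_rx) = channel();
    let mut instances = HashMap::new();
    instances.insert([0u8; 32], instance_tx);

    let manager = std::thread::spawn(move || manager_loop(net_rx, instances));
    network_side(net_tx); // sender dropped when this returns, ending the loop
    manager.join().unwrap();

    println!("instance received: {:?}", instance_rx.recv().unwrap());
}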

@diegomrsantos:

I need to think deeper about it, but right now it seems to me the core challenge here isn't using a central channel, but what to do when a qbft instance's bounded channel is full.

@dknopik (Member, Author):

> I need to think deeper about it, but right now it seems to me the core challenge here isn't using a central channel, but what to do when a qbft instance's bounded channel is full.

We should discard the message. And that's fine. If it's happening due to temporary resource constraints, we can recover afterwards (and maybe even operate partially). If we block, we can also recover, but we block other incoming messages. If we grow unbounded, we crash.

The central channel makes it worse because we introduce another bottleneck.

@diegomrsantos:

> The central channel makes it worse because we introduce another bottleneck.

Could you elaborate more on how it's a bottleneck? Isn't it only the QBFT instances that process messages? If a specific message can't be delivered, it's dropped and the QBFT instance won't participate in this consensus round.

@@ -36,6 +37,14 @@ impl Cluster {
    pub fn get_f(&self) -> u64 {
        (self.cluster_members.len().saturating_sub(1) / 3) as u64
    }

    pub fn committee_id(&self) -> CommitteeId {

@diegomrsantos:

Would it be better to move this to the file where CommitteeId is defined? It'd receive the cluster members and return the Id.

@dknopik (Member, Author):

The one holding a Cluster should not need to fiddle with the fields and pass them somewhere; a utility function is clearer at the call site, I think. The actual logic (hashing the operator ids) is already contained in the committee.rs file.

@diegomrsantos commented Feb 19, 2025:

It's not important, but something like CommitteeId::from(&self.cluster_members); could be more natural. But I see, your motivation with this function was to make it even less verbose at the caller.

#[derive(Clone, Copy, Debug, Default, Eq, PartialEq, Hash, From, Deref)]
pub struct CommitteeId(pub [u8; COMMITTEE_ID_LEN]);

impl From<Vec<OperatorId>> for CommitteeId {

@diegomrsantos:

If we'd like to simplify the caller code, we could implement something like:

impl From<&IndexSet<OperatorId>> for CommitteeId {
    fn from(cluster_members: &IndexSet<OperatorId>) -> Self {
        let mut sorted: Vec<_> = cluster_members.iter().copied().collect();
        sorted.sort();
        let mut data: Vec<u8> = Vec::with_capacity(sorted.len() * 4);
        for id in sorted {
            data.extend_from_slice(&id.to_le_bytes());
        }
        keccak256(data).0.into()
    }
}

Then call let committee_id: CommitteeId = (&self.cluster_members).into();

@dknopik (Member, Author) commented Feb 19, 2025:

Thinking about bottlenecks makes me reconsider validation again. What do you think about moving all validation behind the QBFT manager queues to parallelize it, and only doing the most basic validation (e.g. is the message of interest to us?) before?
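A hedged sketch of that split, with hypothetical names and placeholder checks: only a cheap relevance test runs before queueing, while the expensive validation happens inside each instance's consumer and therefore runs in parallel across instances.

type CommitteeId = [u8; 32];

struct IncomingMessage {
    committee_id: CommitteeId,
    payload: Vec<u8>,
}

// Cheap pre-filter on the network path: only "is this message for us?".
fn is_of_interest(msg: &IncomingMessage, our_committees: &[CommitteeId]) -> bool {
    our_committees.contains(&msg.committee_id)
}

// Expensive checks (signatures, consensus rules) run later, inside each
// instance's long-running task; a placeholder check stands in here.
fn full_validation(msg: &IncomingMessage) -> bool {
    !msg.payload.is_empty()
}

fn main() {
    let our_committees = [[0u8; 32]];
    let msg = IncomingMessage { committee_id: [0u8; 32], payload: vec![1] };

    if is_of_interest(&msg, &our_committees) {
        // ...queue the message; the instance's consumer then runs:
        if full_validation(&msg) {
            println!("message accepted");
        }
    }
}

Under this split, the per-message cost on the network thread stays at a cheap lookup, while the heavy checks scale out with the number of instances.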

}
}

fn on_consensus_message_received(&mut self, message: SignedSSVMessage) {

@diegomrsantos:

Could it be moved to the qbft manager?

@diegomrsantos commented:

> Thinking about bottlenecks makes me reconsider validation again. What do you think about moving all validation behind the QBFT manager queues to parallelize it, and only doing the most basic validation (e.g. is the message of interest to us?) before?

Seems a great idea!

@dknopik marked this pull request as draft February 20, 2025 12:59
@dknopik removed the ready-for-review label Feb 21, 2025