Enhanced split-brain protection #51
Comments
From first principles (apologies for teaching readers to suck eggs): in Openfire terms, a split-brain occurs when two (or more) nodes in a cluster both think they are the senior node. For example, in a two-node cluster where the network between the two nodes is lost, neither node can see that the other is available, so each assumes it is the senior node. Ref: https://en.wikipedia.org/wiki/Split-brain_(computing).

A typical solution to this problem is to introduce the concept of a quorum value. A quorum value would be (nodecount/2+1), with the division rounded down: e.g. 2 nodes in a 3-node cluster, 3 nodes in a 4-node cluster, 3 nodes in a 5-node cluster. Ref: https://en.wikipedia.org/wiki/Quorum_(distributed_computing)

So a proposal to implement this would be as follows. (Note the distinction between a Hazelcast cluster and an Openfire cluster: they may be in different states.)

If a quorum value is configured, when a node starts, Openfire clustering remains "starting" until the node can see the quorum number of nodes in the Hazelcast cluster. These nodes would then agree on a senior member (currently it's the oldest member of the cluster; I don't see a need to change that).

When a node leaves the Openfire cluster and the number of remaining nodes is less than the quorum value, the remaining node(s) would disable clustering and then immediately re-enable it. Clustering would then, as above, remain "starting" until the node can see the quorum number of nodes in the cluster.

Possible further enhancements; |
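To make the proposal concrete, here is a minimal sketch of the quorum arithmetic and the two decisions described above. It is an illustration only: `QuorumSketch`, `ClusterState`, `stateFor` and `mustRestartClustering` are hypothetical names, not part of Openfire or the Hazelcast plugin, and treating a quorum of 0 or less as "not configured" is an assumption.

```java
/** Hypothetical sketch; none of these names exist in Openfire or the Hazelcast plugin. */
final class QuorumSketch {

    /** Hypothetical states for Openfire clustering on a single node. */
    enum ClusterState { STARTING, STARTED }

    /** Majority quorum: nodeCount / 2 + 1, using integer (floor) division. */
    static int quorum(int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        return nodeCount / 2 + 1;
    }

    /**
     * State a node should be in, given how many Hazelcast members it can see
     * (itself included). Assumption: a configuredQuorum of 0 or less means
     * no quorum is configured, so the quorum check is skipped.
     */
    static ClusterState stateFor(int visibleMembers, int configuredQuorum) {
        if (configuredQuorum <= 0) {
            return ClusterState.STARTED;
        }
        return visibleMembers >= configuredQuorum
                ? ClusterState.STARTED
                : ClusterState.STARTING;
    }

    /** True when a departure should trigger the proposed disable-then-re-enable. */
    static boolean mustRestartClustering(int remainingMembers, int configuredQuorum) {
        return configuredQuorum > 0 && remainingMembers < configuredQuorum;
    }
}
```

With a quorum of 2 (a three-node cluster), `stateFor(1, 2)` stays `STARTING` and `mustRestartClustering(1, 2)` is true, which matches the case where a lone node must not declare itself senior.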
This seems to trade availability for consistency. I can imagine that there are scenarios in which one or the other is preferred. We'd need to make sure that this behavior is highly configurable. Unless I'm misunderstanding, the suggested approach would basically reduce or remove service for the entire domain when one cluster node fails. My gut feeling says that most deployments would prefer not to lock/log off the entire domain in such a scenario, choosing availability over consistency. |
Yes, it is a trade-off. Typically you'd need an odd number of nodes, and just under half of them can fail before you lose the whole cluster. But to make it explicit, I was only expecting the above behaviour if a quorum value was set. If no quorum is set, behaviour is as it is today. |
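The "odd number of nodes" point can be checked with a little arithmetic. The sketch below assumes the majority quorum of nodecount/2+1 from the proposal; `FailureToleranceSketch` and `toleratedFailures` are hypothetical names used only for illustration.

```java
/** Hypothetical sketch of how many node failures a majority quorum survives. */
final class FailureToleranceSketch {

    /** Majority quorum: nodeCount / 2 + 1, using integer (floor) division. */
    static int quorum(int nodeCount) {
        return nodeCount / 2 + 1;
    }

    /** Nodes that can fail while the survivors still meet the quorum. */
    static int toleratedFailures(int nodeCount) {
        return nodeCount - quorum(nodeCount);
    }

    public static void main(String[] args) {
        // Print cluster size, quorum and tolerated failures for sizes 2..5.
        for (int n = 2; n <= 5; n++) {
            System.out.println(n + " nodes: quorum " + quorum(n)
                    + ", tolerates " + toleratedFailures(n) + " failure(s)");
        }
    }
}
```

Under this formula a 4-node cluster tolerates one failure, the same as a 3-node cluster, and a 2-node cluster tolerates none, so adding a node to reach an even count does not improve fault tolerance.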
Ah, right, I misunderstood that. My interpretation was that the entire cluster would grind to a halt when just one node disappears. That's not what you suggested: it only happens when the cluster falls below half-plus-one of the anticipated cluster size. |
As @GregDThomas suggested:
A new enhancement, that would require you to have an odd number of cluster nodes. Basically, assuming three nodes, you have to have two communicating to get a cluster. If you only have one node, you're not clustered.
@GregDThomas: I'm assuming here that the aim of this is to have a resolution where a majority of servers dictates the resulting state?