
Etcd networking gets into invalid state after downtime of one peer/after some time #1879

Open
noonio opened this issue Mar 7, 2025 · 1 comment

Comments


noonio commented Mar 7, 2025

Context & versions

On current master, we observed that the etcd networking seems to fail and is only resolved by restarting the node (or perhaps even the entire computer).

One hypothesis is that it is related to a resource leak somewhere between the hydra-node and etcd itself (e.g. our logging, or the gRPC interface).

Some trivia

Leaking file descriptors?

On one computer, lsof showed a large number of open file descriptors related to etcd:

> lsof | grep 'etcd' | wc -l
306

Most of these were TCP connections.
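To see whether that count is actually growing over time (i.e. a connection/descriptor leak rather than a steady state), a rough check like the one below could be left running for a while. Note this is only a sketch: the process name `etcd` is an assumption and may need adjusting depending on how etcd runs alongside the hydra-node.

```sh
# Sketch: sample the etcd process's open fds and TCP connections every minute.
# "etcd" as the process name is an assumption; adjust to how it runs here.
PID=$(pgrep -o etcd)
while true; do
  printf '%s fds=%s tcp=%s\n' \
    "$(date +%T)" \
    "$(ls /proc/"$PID"/fd | wc -l)" \
    "$(ss -tnp 2>/dev/null | grep -c "pid=$PID,")"
  sleep 60
done
```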

Buffer full?

There was this error in the etcd logs:

"message-type
":"MsgHeartbeat","msg":"dropped internal Raft message since sending buffer is full (overloaded network)"

Perhaps it's related?
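One way to check whether the buffer-full warnings and the open file descriptors line up over time (a sketch only, assuming the node runs as a systemd unit called hydra-node and etcd's output ends up in its journal, which may not match this setup):

```sh
# Count the buffer-full warnings logged since boot
# (assumes etcd's output is captured by the hydra-node unit's journal).
journalctl -u hydra-node -b --no-pager | grep -c 'sending buffer is full'

# Take another snapshot of etcd-related open files to compare against later.
lsof | grep 'etcd' | wc -l
```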

hydra-node not stopping when signalled with `systemctl stop hydra-node`

The process spun for a while without stopping, so I restarted my computer instead, and it came back fine afterwards.
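Before resorting to a reboot next time, a few generic systemd checks might help distinguish a stop-timeout problem from a genuinely hung process (assuming the unit is named hydra-node):

```sh
# Inspect the unit state and its configured stop behaviour.
systemctl status hydra-node
systemctl show hydra-node -p TimeoutStopSec,KillMode

# Look at the last log lines to see what it was doing while "deactivating".
journalctl -u hydra-node -n 100 --no-pager

# Last resort instead of rebooting: force-kill the unit's processes.
systemctl kill --signal=SIGKILL hydra-node
```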

@noonio noonio added the bug 🐛 Something isn't working label Mar 7, 2025
@github-project-automation github-project-automation bot moved this to Triage 🏥 in ☕ Hydra Team Work Mar 7, 2025

ch1bo commented Mar 7, 2025

FWIW we observed this behavior after one node was disconnected for 12+ hours

@noonio noonio moved this from Triage 🏥 to Todo 📋 in ☕ Hydra Team Work Mar 10, 2025
@noonio noonio changed the title Etcd networking gets into invalid state after downtime of one peer Etcd networking gets into invalid state after downtime of one peer/after some time Mar 10, 2025
@ch1bo ch1bo mentioned this issue Mar 10, 2025
4 tasks
@noonio noonio removed the bug 🐛 Something isn't working label Mar 10, 2025
github-merge-queue bot pushed a commit that referenced this issue Mar 12, 2025
A few first changes to help with debugging the connectivity issues we saw in the course of #1879.

Note that changing the `msg` key is not a (major) breaking change, as watching is done using the `msg` prefix and the port parsing in `matchVersion` is done defensively. The version check is bound to change anyway (so that it is not done on each message!).

---

* [x] CHANGELOG update not needed
* [x] Documentation update not needed
* [x] Haddocks updated
* [x] No new TODOs introduced
Labels: None yet
Project status: Todo 📋

2 participants