
Etcd networking gets into invalid state after downtime of one peer/after some time #1879

Open
noonio opened this issue Mar 7, 2025 · 1 comment

Comments


noonio commented Mar 7, 2025

Context & versions

On current master, we observed that the etcd networking seems to fail and is only resolved by restarting the node (or perhaps even the entire computer).

One hypothesis is that it is related to a resource leak somewhere between the hydra-node and etcd itself (e.g. our logging, or the gRPC interface).

Some trivia

Leaking file descriptors?

On one computer, lsof showed a large number of open file descriptors related to etcd:

> lsof | grep 'etcd' | wc -l
306

Most of these were TCP connections.
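To see whether that count is actually growing over time (i.e. a connection/descriptor leak rather than a steady state), a rough check like the one below could be left running for a while. Note this is only a sketch: the process name `etcd` is an assumption and may need adjusting depending on how etcd runs alongside the hydra-node.

```sh
# Sketch: sample the etcd process's open fds and TCP connections every minute.
# "etcd" as the process name is an assumption; adjust to how it runs here.
PID=$(pgrep -o etcd)
while true; do
  printf '%s fds=%s tcp=%s\n' \
    "$(date +%T)" \
    "$(ls /proc/"$PID"/fd | wc -l)" \
    "$(ss -tnp 2>/dev/null | grep -c "pid=$PID,")"
  sleep 60
done
```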

Buffer full?

There was this error in the etcd logs:

"message-type
":"MsgHeartbeat","msg":"dropped internal Raft message since sending buffer is full (overloaded network)"

Perhaps it's related?
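One way to check whether the buffer-full warnings and the open file descriptors line up over time (a sketch only, assuming the node runs as a systemd unit called hydra-node and etcd's output ends up in its journal, which may not match this setup):

```sh
# Count the buffer-full warnings logged since boot
# (assumes etcd's output is captured by the hydra-node unit's journal).
journalctl -u hydra-node -b --no-pager | grep -c 'sending buffer is full'

# Take another snapshot of etcd-related open files to compare against later.
lsof | grep 'etcd' | wc -l
```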

hydra-node not stopping when signalled with `systemctl stop hydra-node`

The process spun for a while without stopping, so I restarted my computer instead, and it came back fine afterwards.
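Before resorting to a reboot next time, a few generic systemd checks might help distinguish a stop-timeout problem from a genuinely hung process (assuming the unit is named hydra-node):

```sh
# Inspect the unit state and its configured stop behaviour.
systemctl status hydra-node
systemctl show hydra-node -p TimeoutStopSec,KillMode

# Look at the last log lines to see what it was doing while "deactivating".
journalctl -u hydra-node -n 100 --no-pager

# Last resort instead of rebooting: force-kill the unit's processes.
systemctl kill --signal=SIGKILL hydra-node
```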

@noonio noonio added the bug 🐛 Something isn't working label Mar 7, 2025
@github-project-automation github-project-automation bot moved this to Triage 🏥 in ☕ Hydra Team Work Mar 7, 2025

ch1bo commented Mar 7, 2025

FWIW we observed this behavior after one node was disconnected for 12+ hours

@noonio noonio moved this from Triage 🏥 to Todo 📋 in ☕ Hydra Team Work Mar 10, 2025
@noonio noonio changed the title Etcd networking gets into invalid state after downtime of one peer Etcd networking gets into invalid state after downtime of one peer/after some time Mar 10, 2025
@ch1bo ch1bo mentioned this issue Mar 10, 2025
4 tasks
@noonio noonio removed the bug 🐛 Something isn't working label Mar 10, 2025
github-merge-queue bot pushed a commit that referenced this issue Mar 12, 2025
A few first changes to help with debugging the connectivity issues we saw in the course of #1879.

Note that changing the `msg` key is not a (major) breaking change, as watching is done using the `msg` prefix and the port parsing in `matchVersion` is done defensively. The version check is bound to change anyway (so that it is not done on each message!).

---

* [x] CHANGELOG update not needed
* [x] Documentation update not needed
* [x] Haddocks updated
* [x] No new TODOs introduced
Labels: None yet
Project status: Todo 📋

2 participants