Gateway silently dies #2380

Open
Fyko opened this issue Oct 16, 2024 · 11 comments

Comments


Fyko commented Oct 16, 2024

Hello there! I'm experiencing an issue where my bot will randomly die and no longer print heartbeat ack logs.
This is how my gateway is defined: https://github.com/Fyko/run.sh/blob/main/src/main.rs#L85

Is this a problem of my own doing?

[screenshot: heartbeat ack log lines]

Fyko changed the title from "Gateway dies silently" to "Gateway silently dies" on Oct 16, 2024

Erk- (Member) commented Oct 16, 2024

Do you have any other logs? The log in the screenshot is not very useful for figuring out the problem here.


Fyko (Author) commented Oct 16, 2024

Unfortunately, those are the only logs I have; this is the only other thing in that list of acks. I can turn on trace logging and see if anything new happens.
[screenshot: the only other log line in the list of acks]

Gelbpunkt (Member) commented:

I see you're using 0.15. The gateway has seen a lot of bugfixes and changes in 0.16, so please try the release candidate and see if that changes anything for you.


Fyko (Author) commented Oct 16, 2024

I'll give 0.16 a shot. Here are the tracing logs, for what it's worth.
https://paste.fyko.net/uluxepumug.log


z0w13 commented Feb 8, 2025

So I did just experience this again: the gateway silently stops receiving messages (implementation here).

It did coincide with uptime healthchecks on other services timing out, so I suspect a possible blip in either the ISP connection or something with the machine the services are running on.

2025-02-08T19:26:03.510381Z DEBUG tulpje_gateway: event sent uuid=0194e706-b2f6-76e2-8e77-869a659270df shard=0
2025-02-08T19:26:09.142366Z DEBUG tulpje_gateway: event sent uuid=0194e706-c8f6-7430-9ab2-d1aa45867445 shard=0
2025-02-08T19:26:09.142441Z DEBUG tulpje_gateway: event sent uuid=0194e706-c8f6-7430-9ab2-d1b05b469338 shard=0
2025-02-08T19:26:13.569345Z DEBUG tulpje_gateway: event sent uuid=0194e706-da41-7a83-abc1-71770e59a35b shard=0

I've increased the log levels for twilight_gateway and my own code to see if that adds more info, and I've also added more trace logging in my own code to rule out the possibility that something there is holding everything up.


vilgotf (Member) commented Feb 9, 2025

> I've increased the log levels for twilight_gateway and my own code to see if that adds more info, and I've also added more trace logging in my own code to rule out the possibility that something there is holding everything up.

It sounds to me like the WebSocket closing procedure may have gotten stuck (as the healthchecks timed out). I fear you may have to look at the underlying WebSocket/TCP traffic instead, because Twilight should detect and log missed heartbeats unless the shard is disconnected. When disconnected, the WebSocket connection may still be active (but in the process of closing), and the shard will only reconnect (open a new WebSocket connection) once the previous one is fully terminated. See https://github.com/twilight-rs/twilight/blob/main/twilight-gateway/src/shard.rs#L923-L931 and

ShardState::Disconnected { reconnect_attempts } if self.connection.is_none() => {
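
To make that gating concrete, here is a heavily paraphrased sketch of the behaviour described above; the types and the method are invented for illustration and are not Twilight's actual code:

```rust
// Paraphrased sketch, not Twilight's code: while the old WebSocket is still
// draining its close handshake, the shard sits in Disconnected without opening
// a new connection, so from the outside it can look like a silent stall.
enum ShardState {
    Active,
    Disconnected { reconnect_attempts: u8 },
}

struct WebSocketConnection;

struct Shard {
    state: ShardState,
    // Stays `Some` until the close handshake with Discord has finished.
    connection: Option<WebSocketConnection>,
}

impl Shard {
    fn poll_reconnect(&mut self) {
        match self.state {
            // A new WebSocket connection is only opened once the previous one
            // is fully terminated (`self.connection.is_none()`).
            ShardState::Disconnected { reconnect_attempts } if self.connection.is_none() => {
                let _ = reconnect_attempts;
                // ... open a new connection and resume or re-identify ...
            }
            // Otherwise the shard keeps driving the old connection's close
            // handshake.
            _ => {}
        }
    }
}
```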


vilgotf (Member) commented Feb 9, 2025

Also, what TLS backend are you using (rustls/Windows/OpenSSL), and have you noticed increased CPU usage when the shard gets stuck?


z0w13 commented Feb 9, 2025

As far as the TLS backend goes, we're running in Docker scratch containers, so:

twilight-gateway = { version = "0.16.0", features = ["rustls-ring", "rustls-webpki-roots" ], default-features = false }

@Erk- also suggested that Redis could be causing the event loop to stall, so that's another avenue my current trace logging should shed light on.

When it does stall again, and it's not something else causing the loop to stall, would netstat reveal whether there's still any (sort of) open connection to Discord?
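
One cheap way to rule the broker out, independent of the extra trace logging, is to bound the send with a timeout so a hung Redis/AMQP connection shows up as an explicit error instead of a silent stall. A minimal sketch; `publish` here is a generic placeholder future, not the actual send in the tulpje implementation:

```rust
use std::time::Duration;

use tokio::time::timeout;

// Minimal sketch: bound the broker publish so a hung Redis/AMQP connection
// surfaces as a logged timeout instead of silently blocking the gateway loop.
// `publish` is a placeholder for the real send future.
async fn publish_bounded<Fut, T>(publish: Fut) -> Option<T>
where
    Fut: std::future::Future<Output = T>,
{
    match timeout(Duration::from_secs(10), publish).await {
        Ok(result) => Some(result),
        Err(_elapsed) => {
            // If this ever fires, the loop really was stuck on the broker call.
            tracing::error!("broker publish exceeded 10s; the gateway loop was stalled here");
            None
        }
    }
}
```

Wrapping the existing send call site in something like this would immediately distinguish a broker stall from a gateway stall in the logs.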


vilgotf (Member) commented Feb 9, 2025

By the way, the reason I asked about CPU usage is that you might have run into an infinite loop due to a workaround for discord/discord-api-docs#6011, which I now somewhat suspect is related to this issue.

> @Erk- also suggested that Redis could be causing the event loop to stall, so that's another avenue my current trace logging should shed light on.

Maybe, although I would expect the Redis connection to generate some kind of error instead of just stalling.

> When it does stall again, and it's not something else causing the loop to stall, would netstat reveal whether there's still any (sort of) open connection to Discord?

After quickly Googling around, I'm not sure... Capturing the network traffic (e.g. with Wireshark) should provide a clearer picture, as long as you also capture the TLS keys; I found a guide for that here: https://users.rust-lang.org/t/support-for-sslkeylogfile-in-https-client-from-reqwest/47616. I can help analyze the captured data if you want.


vilgotf (Member) commented Feb 9, 2025

Oh right, setting up Wireshark capturing would require you to patch Twilight... Still, I think that's the easiest way to troubleshoot this.
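
For reference, the rustls side of such a patch is small: rustls can dump TLS session secrets to the file named by the SSLKEYLOGFILE environment variable via its KeyLogFile key log. A rough sketch, assuming a rustls 0.23-style builder API; where exactly twilight-gateway builds its ClientConfig is not shown here:

```rust
use std::sync::Arc;

use rustls::{ClientConfig, KeyLogFile, RootCertStore};

// Rough sketch (rustls 0.23-style API assumed): wherever the gateway's
// ClientConfig is constructed, enable the key log so Wireshark can decrypt the
// captured traffic using the file pointed to by SSLKEYLOGFILE.
fn client_config_with_key_log(roots: RootCertStore) -> ClientConfig {
    let mut config = ClientConfig::builder()
        .with_root_certificates(roots)
        .with_no_client_auth();
    // KeyLogFile writes nothing unless SSLKEYLOGFILE is set, so this is safe
    // to leave enabled in a debugging build.
    config.key_log = Arc::new(KeyLogFile::new());
    config
}
```

With that in place, run the bot with SSLKEYLOGFILE pointing at a writable file and load that file into Wireshark's TLS protocol settings to decrypt the capture.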


z0w13 commented Feb 11, 2025

OK, an update: after adding extra trace statements (which turned out not to be necessary for now), it seems we just freeze up after twilight_gateway::shard: connection is failed or "zombied"

Log lines before it:

2025-02-11T17:48:47.435895Z TRACE tulpje_gateway: loop start
2025-02-11T17:48:49.339198Z DEBUG shard{id=[0, 1]}: twilight_gateway::shard: received dispatch event_type=PRESENCE_UPDATE sequence=5017
2025-02-11T17:48:49.339223Z TRACE tulpje_gateway: parsing opcode
2025-02-11T17:48:49.339227Z TRACE tulpje_gateway: opcode received opcode=Dispatch
2025-02-11T17:48:49.339230Z TRACE tulpje_gateway: metrics
2025-02-11T17:48:49.339254Z TRACE tulpje_gateway: updating metrics
2025-02-11T17:48:49.339264Z TRACE tulpje_gateway: updating shard state
2025-02-11T17:48:49.339270Z TRACE tulpje_gateway: metrics done
2025-02-11T17:48:49.339272Z TRACE tulpje_gateway: dispatch
2025-02-11T17:48:49.339281Z TRACE tulpje_gateway::amqp: sending amqp message content_size=311
2025-02-11T17:48:49.339295Z DEBUG tulpje_gateway: event sent uuid=0194f620-c13b-7c33-b5a6-278086dafc44 shard=0
2025-02-11T17:48:49.339300Z TRACE tulpje_gateway: dispatch and loop done
2025-02-11T17:48:49.339302Z TRACE tulpje_gateway: loop start
2025-02-11T17:49:03.818751Z DEBUG shard{id=[0, 1]}: twilight_gateway::shard: sending heartbeat
2025-02-11T17:49:45.068747Z  INFO shard{id=[0, 1]}: twilight_gateway::shard: connection is failed or "zombied"
