Gateway silently dies #2380

Open
Fyko opened this issue Oct 16, 2024 · 11 comments

Comments


Fyko commented Oct 16, 2024

Hello there! I'm experiencing an issue where my bot will randomly die and no longer print heartbeat ack logs.
This is how my gateway is defined: https://github.com/Fyko/run.sh/blob/main/src/main.rs#L85

Is this a problem of my own doing?

[screenshot: heartbeat ack log lines]

Fyko changed the title from "Gateway dies silently" to "Gateway silently dies" on Oct 16, 2024

Erk- (Member) commented Oct 16, 2024

Do you have any other logs? The log in the screenshot is not very useful for figuring out the problem here.


Fyko (Author) commented Oct 16, 2024

Unfortunately, those are the only logs I have; this is the only other thing in that list of acks. I can turn on trace logging and see if anything new happens.
[screenshot: the only other log line in the list of acks]

Gelbpunkt (Member) commented:

I see you're using 0.15. The gateway has seen a lot of bugfixes and changes in 0.16, so please try the release candidate and see if that changes anything for you.


Fyko (Author) commented Oct 16, 2024

I'll give 0.16 a shot. Here are the tracing logs, for what it's worth.
https://paste.fyko.net/uluxepumug.log


z0w13 commented Feb 8, 2025

So I did just experience this again: the gateway silently stops receiving messages (implementation here).

It did coincide with uptime healthchecks on other services timing out, so I suspect a possible blip in either the ISP connection or something with the machine the services are running on.

2025-02-08T19:26:03.510381Z DEBUG tulpje_gateway: event sent uuid=0194e706-b2f6-76e2-8e77-869a659270df shard=0
2025-02-08T19:26:09.142366Z DEBUG tulpje_gateway: event sent uuid=0194e706-c8f6-7430-9ab2-d1aa45867445 shard=0
2025-02-08T19:26:09.142441Z DEBUG tulpje_gateway: event sent uuid=0194e706-c8f6-7430-9ab2-d1b05b469338 shard=0
2025-02-08T19:26:13.569345Z DEBUG tulpje_gateway: event sent uuid=0194e706-da41-7a83-abc1-71770e59a35b shard=0

I've increased the log levels for twilight_gateway and my own code to see if that adds more info, and I've also added more trace logging in my own code to rule out the possibility that something there is holding everything up.


vilgotf (Member) commented Feb 9, 2025

> I've increased the log levels for twilight_gateway and my own code to see if that adds more info, and I've also added more trace logging in my own code to rule out the possibility that something there is holding everything up.

It sounds to me like the WebSocket closing procedure may have gotten stuck (as the healthchecks timed out). I fear you may have to look at the underlying WebSocket/TCP traffic instead, because Twilight should detect and log missed heartbeats unless the shard is disconnected. When disconnected, the WebSocket connection may still be active (but in the process of closing), and the shard will only reconnect (open a new WebSocket connection) once the previous one is fully terminated. See https://github.com/twilight-rs/twilight/blob/main/twilight-gateway/src/shard.rs#L923-L931 and

ShardState::Disconnected { reconnect_attempts } if self.connection.is_none() => {
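
To make that gating concrete, here is a heavily paraphrased sketch of the behaviour described above; the types and the method are invented for illustration and are not Twilight's actual code:

```rust
// Paraphrased sketch, not Twilight's code: while the old WebSocket is still
// draining its close handshake, the shard sits in Disconnected without opening
// a new connection, so from the outside it can look like a silent stall.
enum ShardState {
    Active,
    Disconnected { reconnect_attempts: u8 },
}

struct WebSocketConnection;

struct Shard {
    state: ShardState,
    // Stays `Some` until the close handshake with Discord has finished.
    connection: Option<WebSocketConnection>,
}

impl Shard {
    fn poll_reconnect(&mut self) {
        match self.state {
            // A new WebSocket connection is only opened once the previous one
            // is fully terminated (`self.connection.is_none()`).
            ShardState::Disconnected { reconnect_attempts } if self.connection.is_none() => {
                let _ = reconnect_attempts;
                // ... open a new connection and resume or re-identify ...
            }
            // Otherwise the shard keeps driving the old connection's close
            // handshake.
            _ => {}
        }
    }
}
```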


vilgotf (Member) commented Feb 9, 2025

Also, what TLS backend are you using (rustls/Windows/OpenSSL), and have you noticed increased CPU usage when the shard gets stuck?


z0w13 commented Feb 9, 2025

As far as the TLS backend goes, we're running in Docker scratch containers, so:

twilight-gateway = { version = "0.16.0", features = ["rustls-ring", "rustls-webpki-roots" ], default-features = false }

@Erk- also suggested that Redis could be causing the event loop to stall, so that's another avenue my current trace logging should shed light on.

When it does stall again, and it's not something else causing the loop to stall, would netstat reveal whether there's still any (sort of) open connection to Discord?
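
One cheap way to rule the broker out, independent of the extra trace logging, is to bound the send with a timeout so a hung Redis/AMQP connection shows up as an explicit error instead of a silent stall. A minimal sketch; `publish` here is a generic placeholder future, not the actual send in the tulpje implementation:

```rust
use std::time::Duration;

use tokio::time::timeout;

// Minimal sketch: bound the broker publish so a hung Redis/AMQP connection
// surfaces as a logged timeout instead of silently blocking the gateway loop.
// `publish` is a placeholder for the real send future.
async fn publish_bounded<Fut, T>(publish: Fut) -> Option<T>
where
    Fut: std::future::Future<Output = T>,
{
    match timeout(Duration::from_secs(10), publish).await {
        Ok(result) => Some(result),
        Err(_elapsed) => {
            // If this ever fires, the loop really was stuck on the broker call.
            tracing::error!("broker publish exceeded 10s; the gateway loop was stalled here");
            None
        }
    }
}
```

Wrapping the existing send call site in something like this would immediately distinguish a broker stall from a gateway stall in the logs.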


vilgotf (Member) commented Feb 9, 2025

By the way, the reason I asked about CPU usage is that you might have run into an infinite loop due to a workaround for discord/discord-api-docs#6011, which I now somewhat suspect is related to this issue.

> @Erk- also suggested that Redis could be causing the event loop to stall, so that's another avenue my current trace logging should shed light on.

Maybe, although I would expect the Redis connection to generate some kind of error instead of just stalling.

> When it does stall again, and it's not something else causing the loop to stall, would netstat reveal whether there's still any (sort of) open connection to Discord?

After quickly Googling around, I'm not sure... Capturing the network traffic (e.g. with Wireshark) should provide a clearer picture, as long as you also capture the TLS keys; I found a guide for that here: https://users.rust-lang.org/t/support-for-sslkeylogfile-in-https-client-from-reqwest/47616. I can help analyze the captured data if you want.


vilgotf (Member) commented Feb 9, 2025

Oh right, setting up Wireshark capturing would require you to patch Twilight... Still, I think that's the easiest way to troubleshoot this.
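
For reference, the rustls side of such a patch is small: rustls can dump TLS session secrets to the file named by the SSLKEYLOGFILE environment variable via its KeyLogFile key log. A rough sketch, assuming a rustls 0.23-style builder API; where exactly twilight-gateway builds its ClientConfig is not shown here:

```rust
use std::sync::Arc;

use rustls::{ClientConfig, KeyLogFile, RootCertStore};

// Rough sketch (rustls 0.23-style API assumed): wherever the gateway's
// ClientConfig is constructed, enable the key log so Wireshark can decrypt the
// captured traffic using the file pointed to by SSLKEYLOGFILE.
fn client_config_with_key_log(roots: RootCertStore) -> ClientConfig {
    let mut config = ClientConfig::builder()
        .with_root_certificates(roots)
        .with_no_client_auth();
    // KeyLogFile writes nothing unless SSLKEYLOGFILE is set, so this is safe
    // to leave enabled in a debugging build.
    config.key_log = Arc::new(KeyLogFile::new());
    config
}
```

With that in place, run the bot with SSLKEYLOGFILE pointing at a writable file and load that file into Wireshark's TLS protocol settings to decrypt the capture.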


z0w13 commented Feb 11, 2025

OK, an update: after adding extra trace statements (which turned out not to be necessary for now), it seems we just freeze up after twilight_gateway::shard: connection is failed or "zombied"

Log lines before it:

2025-02-11T17:48:47.435895Z TRACE tulpje_gateway: loop start
2025-02-11T17:48:49.339198Z DEBUG shard{id=[0, 1]}: twilight_gateway::shard: received dispatch event_type=PRESENCE_UPDATE sequence=5017
2025-02-11T17:48:49.339223Z TRACE tulpje_gateway: parsing opcode
2025-02-11T17:48:49.339227Z TRACE tulpje_gateway: opcode received opcode=Dispatch
2025-02-11T17:48:49.339230Z TRACE tulpje_gateway: metrics
2025-02-11T17:48:49.339254Z TRACE tulpje_gateway: updating metrics
2025-02-11T17:48:49.339264Z TRACE tulpje_gateway: updating shard state
2025-02-11T17:48:49.339270Z TRACE tulpje_gateway: metrics done
2025-02-11T17:48:49.339272Z TRACE tulpje_gateway: dispatch
2025-02-11T17:48:49.339281Z TRACE tulpje_gateway::amqp: sending amqp message content_size=311
2025-02-11T17:48:49.339295Z DEBUG tulpje_gateway: event sent uuid=0194f620-c13b-7c33-b5a6-278086dafc44 shard=0
2025-02-11T17:48:49.339300Z TRACE tulpje_gateway: dispatch and loop done
2025-02-11T17:48:49.339302Z TRACE tulpje_gateway: loop start
2025-02-11T17:49:03.818751Z DEBUG shard{id=[0, 1]}: twilight_gateway::shard: sending heartbeat
2025-02-11T17:49:45.068747Z  INFO shard{id=[0, 1]}: twilight_gateway::shard: connection is failed or "zombied"
