
Working, but need advice on settings because it's not checking frequently enough, but I don't want to overdo it #5586

douglerner opened this issue Jan 30, 2025 · 4 comments

@douglerner

⚠️ Please verify that this question has NOT been raised before.

  • I checked and didn't find a similar issue

🛡️ Security Policy

📝 Describe your problem

I have some servers I'm monitoring and sometimes they become non-responsive and I get alerts. Sometimes the cause is that the server crashed, cannot be connected to, and needs a manual restart. Sometimes there was a segmentation fault crash and it restarts itself. Sometimes it just goes unresponsive. Sometimes it's responsive, but the keywords aren't present (maybe it's syncing something).

I don't think I have the right balance in the heartbeat interval, retries, heartbeat retry interval, and request timeout settings and I'd like to confirm what they mean and maybe get some advice. For example, the other day I didn't get a notification even though the site was down (in this case not-responsive) for 14 minutes. I knew it was down because of a simple connection alert from another system.

These are my current settings:

  • heartbeat interval: 300 seconds. That just means to test every 5 minutes. I think I get that.
  • retries: 3 - I think that means it will try, wait 5 minutes, try again, wait another 5 minutes, and then try a 3rd time and alert me after that? So at a minimum it would be 10 minutes before I got a notice?
  • heartbeat retry interval: 300 seconds - So this is related just to retries. In my calculation above, it would wait another 5 minutes because this was also set to 300 seconds, right? In other words, I could maybe tweak this to check sooner after the first check failed, right?
  • request timeout: 240 seconds - This would mean wait 4 minutes if the connection didn't completely fail, but was just stuck waiting, right?

Sometimes there are just blips (like waiting for a file to sync) and they are just normal. But normal events like that don't take 4 minutes.

I don't want the alerts going off too much. But I was thinking getting alerts sooner might help. I think the heartbeat interval of 300 seconds is reasonable. Maybe I should reduce the heartbeat retry interval to just a couple of minutes? And maybe reduce the timeout seconds a bit?
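
Here's a rough sketch of how I think these settings combine into a time-to-alert. These are my assumptions, not necessarily how Uptime Kuma actually implements it: each retry is spaced by the heartbeat retry interval, and a hanging request can take the full request timeout before it counts as failed.

```python
# Back-of-the-envelope only; my reading of the settings, not Uptime Kuma's code.
def time_to_alert(heartbeat_interval, retries, retry_interval, timeout):
    """Return (best_case, worst_case) seconds from outage start to notification."""
    # Best case: the outage begins right before a scheduled check and every
    # failing check fails instantly (e.g. connection refused).
    best = retries * retry_interval
    # Worst case: the outage begins right after a check, and every failing
    # check hangs for the full request timeout before counting as failed.
    worst = heartbeat_interval + timeout + retries * (retry_interval + timeout)
    return best, worst

# My current settings: 300s interval, 3 retries, 300s retry interval, 240s timeout.
best, worst = time_to_alert(300, 3, 300, 240)
print(f"best case:  {best / 60:.0f} minutes")   # 15 minutes
print(f"worst case: {worst / 60:.0f} minutes")  # 36 minutes
```

If that reasoning is roughly right, it would explain how a 14-minute outage could come and go without a notification.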

If I change settings I have to manually change them at each server, right?

Any suggestions?

Thanks.

📝 Error Message(s) or Log

No response

🐻 Uptime-Kuma Version

Sorry, I don't know how to check. Whatever pikapod uses.

💻 Operating System and Arch

Whatever is running at https://poised-moth.pikapod.net/dashboard

🌐 Browser

Google Chrome 132.0.6834.83 (Official Build) (arm64)

🖥️ Deployment Environment

@CommanderStorm
Copy link
Collaborator

It really depends on what you want to prevent.
If waiting 240s is okay for your use case, that timeout might make sense (though you likely want it a bit lower, like 5s).

How you balance retries vs. how often you check is totally up to you and depends on which services you are monitoring. I can't really give you good advice without knowing your SLA/... requirements.

@douglerner
Author

Yes, I guess it does depend a lot on the use case.

Here's an example of a server issue that happened about 15 minutes ago. Uptime Kuma didn't alert me, but a simpler monitoring system that just checks every 5 minutes and tries twice before alerting let me know. I went to the server and saw that it had restarted due to a memory issue. It came back up and the simple monitor let me know. It was out of commission for about 9 minutes and I didn't need to do anything.

But I got no alert from Uptime Kuma. Am I correct in what I wrote about the parameters? Maybe instead of 3 retries I should switch to 2. Or, if I understood the heartbeat retry interval correctly, maybe just reduce that time? For example, keep checking every 5 minutes, but if a check fails, use a smaller retry interval?

Does each monitor need to have its parameters adjusted manually?

Thanks.

@CommanderStorm
Collaborator

CommanderStorm commented Jan 31, 2025

With 2 retries, it will alert you after retrying two times.

With your current configuration (3 retries, 300s intervals between checks in any case), I would expect a mean time to detect (MTTD) of approximately 15-20 minutes, assuming you have lowered the timeout as suggested above.
(How high timeouts factor into the MTTD would need to be looked up.)
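
A rough way to see where that range comes from (back-of-the-envelope arithmetic, not the exact implementation):

```python
# Assuming the timeout has been lowered enough to be negligible: the outage
# starts, on average, halfway through a 300s heartbeat interval, and then
# 3 retries spaced 300s apart must also fail before the notification is sent.
heartbeat_interval = 300
retries = 3
retry_interval = 300

mean_wait_for_first_failed_check = heartbeat_interval / 2           # ~150s
mttd = mean_wait_for_first_failed_check + retries * retry_interval  # 1050s
print(f"estimated mean time to detect: {mttd / 60:.1f} minutes")    # 17.5 minutes
```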

@douglerner
Author

I guess 15-20 minutes is too long. Just to clarify the terminology: with my current setting of 3 retries, does that mean there are 4 tries in all? The first try after 5 minutes, and if that fails, another 3 retries, each waiting the "heartbeat retry interval" of 5 minutes?

That alone could add up to 20 minutes.

Maybe switching to 2 retries (3 tries in all) and reducing the heartbeat retry interval a bit would be a good first step.
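
Roughly, using the same back-of-the-envelope model from earlier in the thread (the 120-second retry interval is just an example value I'm picking here, and 5s is the timeout suggested above):

```python
# Same rough model as before: worst case = one full heartbeat interval,
# plus each failing check hanging for the full timeout, plus the retries.
def worst_case_time_to_alert(heartbeat_interval, retries, retry_interval, timeout):
    return heartbeat_interval + timeout + retries * (retry_interval + timeout)

current  = worst_case_time_to_alert(300, 3, 300, 240)  # settings from this issue
proposed = worst_case_time_to_alert(300, 2, 120, 5)    # example values, see above
print(f"current  worst case: ~{current / 60:.0f} minutes")   # ~36 minutes
print(f"proposed worst case: ~{proposed / 60:.0f} minutes")  # ~9 minutes
```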

Thanks.
