Working, but need advice on settings because it's not checking frequently enough, but I don't want to overdo it #5586
Comments
It really depends on what you want to prevent. How you balance retries against how often you check is entirely up to you and depends on which services you are monitoring. I cannot really give you good advice without knowing your SLA or similar requirements.
Yes, I guess it does depend a lot on the use case. Here's an example of a server issue that happened about 15 minutes ago. Uptime Kuma didn't alert me, but a simpler monitoring system that just checks every 5 minutes and tries twice before alerting let me know. I went to the server and saw that it had restarted due to a memory issue. It came back up and the simple monitor let me know. It was out of commission for about 9 minutes and I didn't need to do anything. But I got no alert from Uptime Kuma. Am I correct in what I wrote about the parameters? Maybe instead of 3 retries I should switch to 2. Or if I understood the heartbeat retry interval, maybe just reduce that time? For example, keep on checking every 5 minutes, but if a check fails have the retry interval smaller? Does each monitor need to have its parameters adjusted manually? Thanks.
With 2 retries, this will alert you after retrying two times. With your current configuration (3 retries, 300 s between checks in any case), I would expect a mean time to detect of approximately 15 to 20 minutes, assuming you have lowered the timeout as suggested above.
I guess 15 to 20 minutes is too long. Just to clarify the language, with my current 3 retries set that means there are 4 tries in all? The try after 5 minutes and if that fails another 3 retries, all of those with the "heartbeat retry interval" wait time of 5 minutes? That alone could add up to 20 minutes. Maybe 2 retries (3 tries in all) and maybe reducing the heartbeat retry interval a bit might be a good first step. Thanks.
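To make the arithmetic in this exchange concrete, here is a rough Python sketch of the timing. It is a simplified model, not Uptime Kuma's actual scheduling code: the function name `detection_window` and its parameters are made up for illustration, and it assumes an alert fires on the failed check that follows the last retry, that retries are spaced by the heartbeat retry interval, and that each failed check can add up to one request timeout.

```python
def detection_window(interval, retries, retry_interval, timeout=0):
    """Return (best_case, worst_case) seconds from outage start to alert.

    Simplified model (an assumption, not Uptime Kuma's real scheduler):
    the first failed check is followed by `retries` more checks spaced
    `retry_interval` seconds apart, and the alert fires on the last one.
    """
    checks_needed = retries + 1  # 3 retries means 4 tries in all
    # Time from the first failed check until the alert fires.
    confirm = retries * retry_interval + checks_needed * timeout
    best = confirm              # outage starts just before a regular check
    worst = interval + confirm  # outage starts just after a regular check
    return best, worst


# Settings discussed in the thread: 300 s interval, 3 retries, 300 s retry interval.
best, worst = detection_window(interval=300, retries=3, retry_interval=300)
print(f"{best / 60:.0f} to {worst / 60:.0f} minutes")  # prints "15 to 20 minutes"
```

Under the same assumptions, 2 retries with a 120 s retry interval gives `detection_window(300, 2, 120) == (240, 540)`, that is, roughly 4 to 9 minutes before any timeout is added.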
🛡️ Security Policy
📝 Describe your problem
I have some servers I'm monitoring and sometimes they become non-responsive and I get alerts. Sometimes the cause is the server crashed and cannot be connected to and needs a manual restart. Sometimes there was a segmentation fault crash and it's self-restarting. Sometimes it just goes unresponsive. Sometimes it's responsive, but the keywords aren't present (maybe it's syncing something).
I don't think I have the right balance in the heartbeat interval, retries, heartbeat retry interval, and request timeout settings and I'd like to confirm what they mean and maybe get some advice. For example, the other day I didn't get a notification even though the site was down (in this case not-responsive) for 14 minutes. I knew it was down because of a simple connection alert from another system.
These are my current settings:
Sometimes there are just blips (like waiting for a file to sync) and they are just normal. But normal events like that don't take 4 minutes.
I don't want the alerts going off too much. But I was thinking getting alerts sooner might help. I think the heartbeat interval of 300 seconds is reasonable. Maybe I should reduce the heartbeat retry interval to just a couple of minutes? And maybe reduce the timeout seconds a bit?
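One way to weigh these trade-offs is to ask, for a given outage length, whether a set of settings would never, sometimes, or always alert. The Python sketch below does that under a simplified model that ignores request timeouts and assumes the alert fires only after the first failed check plus all retries have failed. `MonitorSettings` and `classify_outage` are hypothetical names for illustration, not part of Uptime Kuma, and the 2 retries / 120 s candidate is only an example value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorSettings:
    interval: int        # heartbeat interval, seconds
    retries: int         # retries after the first failed check
    retry_interval: int  # heartbeat retry interval, seconds

    @property
    def confirm_span(self) -> int:
        """Seconds between the first failed check and the alerting check."""
        return self.retries * self.retry_interval


def classify_outage(settings: MonitorSettings, outage_seconds: int) -> str:
    """Say whether an outage of this length can trigger an alert.

    Simplified model (assumption): timeouts are ignored and every check
    inside the outage fails instantly.
    """
    if outage_seconds < settings.confirm_span:
        return "never"   # outage ends before the last retry can run
    if outage_seconds >= settings.interval + settings.confirm_span:
        return "always"  # long enough regardless of when it starts
    return "maybe"       # depends on where the outage falls between checks


current = MonitorSettings(interval=300, retries=3, retry_interval=300)
candidate = MonitorSettings(interval=300, retries=2, retry_interval=120)

for name, settings in (("current", current), ("candidate", candidate)):
    for minutes in (3, 9, 14):
        print(name, f"{minutes} min outage:", classify_outage(settings, minutes * 60))
```

In this model the current settings never alert on a 14-minute outage, because the last retry would run 15 minutes after the first failed check, while the candidate settings ignore blips shorter than 4 minutes and always alert on outages of 9 minutes or more.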
If I change settings I have to manually change them at each server, right?
Any suggestions?
Thanks.
📝 Error Message(s) or Log
No response
🐻 Uptime-Kuma Version
Sorry, I don't know how to check. Whatever pikapod uses.
💻 Operating System and Arch
Whatever is running at https://poised-moth.pikapod.net/dashboard
🌐 Browser
Google Chrome 132.0.6834.83 (Official Build) (arm64)
🖥️ Deployment Environment