
Working, but need advice on settings because it's not checking frequently enough, but I don't want to overdo it #5586

douglerner opened this issue Jan 30, 2025 · 4 comments

@douglerner

⚠️ Please verify that this question has NOT been raised before.

  • I checked and didn't find a similar issue

🛡️ Security Policy

📝 Describe your problem

I have some servers I'm monitoring and sometimes they become non-responsive and I get alerts. Sometimes the cause is that the server crashed, cannot be connected to, and needs a manual restart. Sometimes there was a segmentation fault crash and it restarts itself. Sometimes it just goes unresponsive. Sometimes it's responsive, but the keywords aren't present (maybe it's syncing something).

I don't think I have the right balance in the heartbeat interval, retries, heartbeat retry interval, and request timeout settings and I'd like to confirm what they mean and maybe get some advice. For example, the other day I didn't get a notification even though the site was down (in this case not-responsive) for 14 minutes. I knew it was down because of a simple connection alert from another system.

These are my current settings:

  • heartbeat interval: 300 seconds. That just means to test every 5 minutes. I think I get that.
  • retries: 3 - I think that means it will try, wait 5 minutes, try again, wait another 5 minutes, and then try a 3rd time and alert me after that? So at a minimum it would be 10 minutes before I got a notice?
  • heartbeat retry interval: 300 seconds - So this is related just to retries. In my calculation above, it would wait another 5 minutes because this was also set to 300 seconds, right? In other words, I could maybe tweak this to check sooner after the first check failed, right?
  • request timeout: 240 seconds - This would mean wait 4 minutes if the connection didn't completely fail, but was just stuck waiting, right?

Sometimes there are just blips (like waiting for a file to sync) and they are just normal. But normal events like that don't take 4 minutes.

I don't want the alerts going off too much. But I was thinking getting alerts sooner might help. I think the heartbeat interval of 300 seconds is reasonable. Maybe I should reduce the heartbeat retry interval to just a couple of minutes? And maybe reduce the timeout seconds a bit?
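
Here's a rough sketch of how I think these settings combine into a time-to-alert. These are my assumptions, not necessarily how Uptime Kuma actually implements it: each retry is spaced by the heartbeat retry interval, and a hanging request can take the full request timeout before it counts as failed.

```python
# Back-of-the-envelope only; my reading of the settings, not Uptime Kuma's code.
def time_to_alert(heartbeat_interval, retries, retry_interval, timeout):
    """Return (best_case, worst_case) seconds from outage start to notification."""
    # Best case: the outage begins right before a scheduled check and every
    # failing check fails instantly (e.g. connection refused).
    best = retries * retry_interval
    # Worst case: the outage begins right after a check, and every failing
    # check hangs for the full request timeout before counting as failed.
    worst = heartbeat_interval + timeout + retries * (retry_interval + timeout)
    return best, worst

# My current settings: 300s interval, 3 retries, 300s retry interval, 240s timeout.
best, worst = time_to_alert(300, 3, 300, 240)
print(f"best case:  {best / 60:.0f} minutes")   # 15 minutes
print(f"worst case: {worst / 60:.0f} minutes")  # 36 minutes
```

If that reasoning is roughly right, it would explain how a 14-minute outage could come and go without a notification.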

If I change settings I have to manually change them at each server, right?

Any suggestions?

Thanks.

📝 Error Message(s) or Log

No response

🐻 Uptime-Kuma Version

Sorry, I don't know how to check. Whatever pikapod uses.

💻 Operating System and Arch

Whatever is running at https://poised-moth.pikapod.net/dashboard

🌐 Browser

Google Chrome 132.0.6834.83 (Official Build) (arm64)

🖥️ Deployment Environment

@CommanderStorm
Copy link
Collaborator

It really depends on what you want to prevent.
If waiting 240s is okay for your use case, that timeout might make sense (though you likely want it a bit lower, like 5s).

How you balance retries vs. how often you check is totally up to you and depends on which services you are monitoring. I can't really give you good advice without knowing your SLA/... requirements.

@douglerner
Author

Yes, I guess it does depend a lot on the use case.

Here's an example of a server issue that happened about 15 minutes ago. Uptime Kuma didn't alert me, but a simpler monitoring system that just checks every 5 minutes and tries twice before alerting let me know. I went to the server and saw that it had restarted due to a memory issue. It came back up and the simple monitor let me know. It was out of commission for about 9 minutes and I didn't need to do anything.

But I got no alert from Uptime Kuma. Am I correct in what I wrote about the parameters? Maybe instead of 3 retries I should switch to 2. Or, if I understood the heartbeat retry interval correctly, maybe just reduce that time? For example, keep checking every 5 minutes, but if a check fails, use a smaller retry interval?

Does each monitor need to have its parameters adjusted manually?

Thanks.

@CommanderStorm
Collaborator

CommanderStorm commented Jan 31, 2025

With 2 retries, it will alert you after retrying two times.

With your current configuration (3 retries, 300s intervals between checks in any case), I would expect a mean time to detect (MTTD) of approximately 15-20 minutes, assuming you have lowered the timeout as suggested above.
(How high timeouts factor into the MTTD would need to be looked up.)
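
A rough way to see where that range comes from (back-of-the-envelope arithmetic, not the exact implementation):

```python
# Assuming the timeout has been lowered enough to be negligible: the outage
# starts, on average, halfway through a 300s heartbeat interval, and then
# 3 retries spaced 300s apart must also fail before the notification is sent.
heartbeat_interval = 300
retries = 3
retry_interval = 300

mean_wait_for_first_failed_check = heartbeat_interval / 2           # ~150s
mttd = mean_wait_for_first_failed_check + retries * retry_interval  # 1050s
print(f"estimated mean time to detect: {mttd / 60:.1f} minutes")    # 17.5 minutes
```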

@douglerner
Author

I guess 15-20 minutes is too long. Just to clarify the terminology: with my current setting of 3 retries, does that mean there are 4 tries in all? The first try after 5 minutes, and if that fails, another 3 retries, each waiting the "heartbeat retry interval" of 5 minutes?

That alone could add up to 20 minutes.

Maybe switching to 2 retries (3 tries in all) and reducing the heartbeat retry interval a bit would be a good first step.
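
Roughly, using the same back-of-the-envelope model from earlier in the thread (the 120-second retry interval is just an example value I'm picking here, and 5s is the timeout suggested above):

```python
# Same rough model as before: worst case = one full heartbeat interval,
# plus each failing check hanging for the full timeout, plus the retries.
def worst_case_time_to_alert(heartbeat_interval, retries, retry_interval, timeout):
    return heartbeat_interval + timeout + retries * (retry_interval + timeout)

current  = worst_case_time_to_alert(300, 3, 300, 240)  # settings from this issue
proposed = worst_case_time_to_alert(300, 2, 120, 5)    # example values, see above
print(f"current  worst case: ~{current / 60:.0f} minutes")   # ~36 minutes
print(f"proposed worst case: ~{proposed / 60:.0f} minutes")  # ~9 minutes
```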

Thanks.
