
simple OOM Monitor - we gracefully quit before we hit our OOM limit #1215

Open
reillyse wants to merge 1 commit into main

Conversation

reillyse
Contributor

Description

  • New feature (non-breaking change which adds functionality)

What's Changed

  • Added an OOM monitor that we can use to gracefully exit when we are close to our real memory limit.


vercel bot commented Jan 24, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name: hatchet-docs
Status: ✅ Ready
Updated (UTC): Jan 24, 2025 1:26am

@reillyse
Contributor Author

SERVER_OOM_ENABLED=true SERVER_OOM_THRESHOLD_BYTES=27574104 SERVER_OOM_SIGNAL="SIGTERM" SERVER_OOM_CHECK_INTERVAL="20s" task start-dev
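
A minimal sketch of the shape such a monitor could take, inferred from the SERVER_OOM_* variables in the command above. The function names, the choice of memStats.Sys, and the hard-coded SIGTERM are assumptions for illustration, not the PR's actual code:

package main

import (
    "log"
    "os"
    "runtime"
    "strconv"
    "syscall"
    "time"
)

// monitorOOM polls the Go runtime's memory stats and asks the process to
// shut down gracefully once usage crosses the configured threshold.
func monitorOOM(thresholdBytes uint64, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        runtime.GC() // force a collection so the stats reflect live memory
        var memStats runtime.MemStats
        runtime.ReadMemStats(&memStats)
        if memStats.Sys >= thresholdBytes {
            log.Printf("memory %d >= threshold %d, requesting graceful shutdown", memStats.Sys, thresholdBytes)
            // Send ourselves SIGTERM here; the real monitor presumably honors SERVER_OOM_SIGNAL.
            _ = syscall.Kill(os.Getpid(), syscall.SIGTERM)
            return
        }
    }
}

func main() {
    threshold, err := strconv.ParseUint(os.Getenv("SERVER_OOM_THRESHOLD_BYTES"), 10, 64)
    if err != nil {
        log.Fatalf("invalid SERVER_OOM_THRESHOLD_BYTES: %v", err)
    }
    interval, err := time.ParseDuration(os.Getenv("SERVER_OOM_CHECK_INTERVAL"))
    if err != nil {
        log.Fatalf("invalid SERVER_OOM_CHECK_INTERVAL: %v", err)
    }
    go monitorOOM(threshold, interval)

    select {} // stand-in for the real engine/server loop
}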

@reillyse requested a review from abelanger5 on January 24, 2025 01:38
default:
    runtime.GC()
    var memStats runtime.MemStats
    runtime.ReadMemStats(&memStats)
Contributor

This concerns me -- I don't think we can necessarily get a true estimate of memory usage from this. More information: https://www.datadoghq.com/blog/go-memory-metrics/

Has there been any investigation into whether the memory reported by ReadMemStats matches the system memory (divided by two, according to the DataDog article) when we're running in a Docker image?

From what I understand, not only might the count be inaccurate, but we would also have to multiply the threshold bytes by 2 to get the actual threshold.
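
One way to investigate, sketched below: log what ReadMemStats reports next to the cgroup v2 accounting the kernel OOM killer actually uses inside the container. This is not part of the PR, and the paths assume cgroup v2:

package main

import (
    "fmt"
    "os"
    "runtime"
    "strconv"
    "strings"
)

// readCgroupBytes reads a single integer value from a cgroup v2 file.
// memory.max may contain the literal string "max" (no limit); parsing
// fails in that case and we report 0.
func readCgroupBytes(path string) uint64 {
    b, err := os.ReadFile(path)
    if err != nil {
        return 0
    }
    v, err := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
    if err != nil {
        return 0
    }
    return v
}

func main() {
    runtime.GC()
    var m runtime.MemStats
    runtime.ReadMemStats(&m)

    fmt.Printf("go: HeapAlloc=%d Sys=%d HeapReleased=%d\n", m.HeapAlloc, m.Sys, m.HeapReleased)
    fmt.Printf("cgroup: current=%d limit=%d\n",
        readCgroupBytes("/sys/fs/cgroup/memory.current"),
        readCgroupBytes("/sys/fs/cgroup/memory.max"))
}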

Contributor Author

This is 100% a tuning thing -- we need to experiment and find out what the correct limit is. Memory is always very difficult to measure on Linux boxes; even top and the various memory tools don't give an accurate picture (shared libraries are shared, etc.). The strategy here is to set a conservative limit and increase it. What we have now is OOMs, which are a disaster and very difficult to mitigate -- all of our code can just stop running right in the middle of executing, which is the worst possible failure. We are running redundant engines, right? So restarting an engine should be a normal part of our operations and not lead to any downtime.
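
For illustration, a hedged sketch of the receiving side under that assumption: SIGTERM from the monitor is treated as a routine event and in-flight work gets a bounded window to drain before the process exits. The names and the 30-second timeout are assumptions, not the PR's code:

package main

import (
    "context"
    "log"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    // ctx is cancelled when SIGTERM (e.g. from the OOM monitor) or Ctrl-C arrives.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
    defer stop()

    // ... start engine workers with ctx so they stop accepting new work on cancel ...

    <-ctx.Done()
    log.Println("shutdown signal received, draining in-flight work")

    // Give running work a bounded window to finish, then exit cleanly.
    drainCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    _ = drainCtx // e.g. pass to server.Shutdown(drainCtx) or worker drain logic
    log.Println("graceful shutdown complete")
}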
