
simple OOM Monitor - we gracefully quit before we hit our OOM limit #1215

Open
reillyse wants to merge 1 commit into main

Conversation

reillyse
Contributor

Description

  • New feature (non-breaking change which adds functionality)

What's Changed

  • Added an OOM monitor that we can use to gracefully exit when we are close to our real memory limit.


vercel bot commented Jan 24, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name: hatchet-docs
Status: ✅ Ready
Updated (UTC): Jan 24, 2025 1:26am

@reillyse
Contributor Author

SERVER_OOM_ENABLED=true SERVER_OOM_THRESHOLD_BYTES=27574104 SERVER_OOM_SIGNAL="SIGTERM" SERVER_OOM_CHECK_INTERVAL="20s" task start-dev
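
A minimal sketch of the shape such a monitor could take, inferred from the SERVER_OOM_* variables in the command above. The function names, the choice of memStats.Sys, and the hard-coded SIGTERM are assumptions for illustration, not the PR's actual code:

package main

import (
    "log"
    "os"
    "runtime"
    "strconv"
    "syscall"
    "time"
)

// monitorOOM polls the Go runtime's memory stats and asks the process to
// shut down gracefully once usage crosses the configured threshold.
func monitorOOM(thresholdBytes uint64, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        runtime.GC() // force a collection so the stats reflect live memory
        var memStats runtime.MemStats
        runtime.ReadMemStats(&memStats)
        if memStats.Sys >= thresholdBytes {
            log.Printf("memory %d >= threshold %d, requesting graceful shutdown", memStats.Sys, thresholdBytes)
            // Send ourselves SIGTERM here; the real monitor presumably honors SERVER_OOM_SIGNAL.
            _ = syscall.Kill(os.Getpid(), syscall.SIGTERM)
            return
        }
    }
}

func main() {
    threshold, err := strconv.ParseUint(os.Getenv("SERVER_OOM_THRESHOLD_BYTES"), 10, 64)
    if err != nil {
        log.Fatalf("invalid SERVER_OOM_THRESHOLD_BYTES: %v", err)
    }
    interval, err := time.ParseDuration(os.Getenv("SERVER_OOM_CHECK_INTERVAL"))
    if err != nil {
        log.Fatalf("invalid SERVER_OOM_CHECK_INTERVAL: %v", err)
    }
    go monitorOOM(threshold, interval)

    select {} // stand-in for the real engine/server loop
}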

@reillyse requested a review from abelanger5 on January 24, 2025 01:38
default:
    runtime.GC()
    var memStats runtime.MemStats
    runtime.ReadMemStats(&memStats)
Contributor

This concerns me -- I don't think we can necessarily get a true estimate of memory usage from this. More information: https://www.datadoghq.com/blog/go-memory-metrics/

Has there been any investigation into whether the memory reported by ReadMemStats matches the system memory (divided by two, according to the DataDog article) when we're running in a Docker image?

From what I understand, not only might the count be inaccurate, but we would also have to multiply the threshold bytes by 2 to get the actual threshold.
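
One way to investigate, sketched below: log what ReadMemStats reports next to the cgroup v2 accounting the kernel OOM killer actually uses inside the container. This is not part of the PR, and the paths assume cgroup v2:

package main

import (
    "fmt"
    "os"
    "runtime"
    "strconv"
    "strings"
)

// readCgroupBytes reads a single integer value from a cgroup v2 file.
// memory.max may contain the literal string "max" (no limit); parsing
// fails in that case and we report 0.
func readCgroupBytes(path string) uint64 {
    b, err := os.ReadFile(path)
    if err != nil {
        return 0
    }
    v, err := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
    if err != nil {
        return 0
    }
    return v
}

func main() {
    runtime.GC()
    var m runtime.MemStats
    runtime.ReadMemStats(&m)

    fmt.Printf("go: HeapAlloc=%d Sys=%d HeapReleased=%d\n", m.HeapAlloc, m.Sys, m.HeapReleased)
    fmt.Printf("cgroup: current=%d limit=%d\n",
        readCgroupBytes("/sys/fs/cgroup/memory.current"),
        readCgroupBytes("/sys/fs/cgroup/memory.max"))
}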

Contributor Author

This is 100% a tuning thing -- we need to experiment and find out what the correct limit is. Memory is always very difficult to measure on Linux boxes; even top and the various memory tools don't give an accurate picture (shared libraries are shared, etc.). The strategy here is to set a conservative limit and increase it. What we have now is OOMs, which are a disaster and very difficult to mitigate -- all of our code can just stop running right in the middle of executing, which is the worst possible failure. We are running redundant engines, right? So restarting an engine should be a normal part of our operations and not lead to any downtime.
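
For illustration, a hedged sketch of the receiving side under that assumption: SIGTERM from the monitor is treated as a routine event and in-flight work gets a bounded window to drain before the process exits. The names and the 30-second timeout are assumptions, not the PR's code:

package main

import (
    "context"
    "log"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    // ctx is cancelled when SIGTERM (e.g. from the OOM monitor) or Ctrl-C arrives.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
    defer stop()

    // ... start engine workers with ctx so they stop accepting new work on cancel ...

    <-ctx.Done()
    log.Println("shutdown signal received, draining in-flight work")

    // Give running work a bounded window to finish, then exit cleanly.
    drainCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    _ = drainCtx // e.g. pass to server.Shutdown(drainCtx) or worker drain logic
    log.Println("graceful shutdown complete")
}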
