OOM killed #2265

Closed
qicz opened this issue Dec 5, 2023 · 20 comments
Assignees
Labels
kind/bug Something isn't working stale triage

Comments

@qicz
Member

qicz commented Dec 5, 2023

Description:

EG watches some HTTPRoutes that have errors, possibly because the backend service does not exist. EG was OOM-killed while reconciling them.

Logs:
image

@qicz qicz added kind/bug Something isn't working triage labels Dec 5, 2023
@qicz
Member Author

qicz commented Dec 5, 2023

IMO, we should set RequeueAfter when requeuing.

@Xunzhuo
Member

Xunzhuo commented Dec 6, 2023

I could not reproduce it. Can you provide the steps to reproduce it? @qicz

@cnvergence
Member

It could be that we are missing a valid error return on getting those resources

@qicz
Member Author

qicz commented Dec 7, 2023

I could not reproduce it. Can you provide the steps to reproduce it? @qicz

Create one HTTPRoute that references a service that does not exist.

@Xunzhuo
Member

Xunzhuo commented Dec 7, 2023

I tried that, and the HTTPRoute just reported BackendNotFound; EG still works well.

@qicz
Member Author

qicz commented Dec 13, 2023

I tried that, and the HTTPRoute just reported BackendNotFound; EG still works well.

This is reported too often, and there are more invalid HTTPRoutes. EG was killed while reconciling them.

@zzjin
Contributor

zzjin commented Dec 18, 2023

@qicz I'm facing the same error here.
But my setup has about 1300 HTTPRoute CRs and about 20 Gateways with mergeGateway=true.
image

Maybe it is not the nonexistent backends that cause the EG OOM, but the number of Gateway API CRs. In my deployment, the envoy-gateway pod consumes too much memory.
image
The default EG memory limit is 1g. You can change this to unlimited, but the underlying problem remains.

@arkodg
Contributor

arkodg commented Dec 18, 2023

@qicz can you please paste the entire log showing the namespace and name of the service, along with kubectl info on the service as well as the HTTPRoute that links to it?

@qicz
Member Author

qicz commented Jan 10, 2024

@qicz can you please paste the entire log showing the namespace and name of the service, along with kubectl info on the service as well as the HTTPRoute that links to it?

@arkodg sorry for the slow reply. The namespace and service are from my company's app, so I have removed them. Sorry about this.

@qicz
Member Author

qicz commented Jan 10, 2024

@qicz I'm facing the same error here. But my setup has about 1300 HTTPRoute CRs and about 20 Gateways with mergeGateway=true. image

Maybe it is not the nonexistent backends that cause the EG OOM, but the number of Gateway API CRs. In my deployment, the envoy-gateway pod consumes too much memory. image The default EG memory limit is 1g. You can change this to unlimited, but the underlying problem remains.

In my case, there are only ~30 HTTPRoutes. But I cannot set the memory to unlimited; that is bad for the Kubernetes cluster.

@zzjin
Contributor

zzjin commented Jan 10, 2024

In my case, there are only ~30 HTTPRoutes. But I cannot set the memory to unlimited; that is bad for the Kubernetes cluster.

It does not need to be unlimited; a larger limit that fits the number of routes is enough.
But still, something must be wrong for EG to OOM here.

@arkodg
Contributor

arkodg commented Jan 10, 2024

@qicz @zzjin can you outline steps to reproduce the problem? From this chat it's hard to understand what the trigger is.

@qicz
Member Author

qicz commented Jan 17, 2024

@qicz @zzjin can you outline steps to reproduce the problem? From this chat it's hard to understand what the trigger is.

Our analysis concludes that the OOM is caused by a large number of Secrets combined with a memory limit that is not set properly.

@qicz
Member Author

qicz commented Jan 17, 2024

Suggestion: use protobuf when connecting to Kubernetes to reduce memory usage. xref #1596
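
As a rough illustration of what "using protobuf" means on the wire, the Kubernetes API server lets a client ask for the protobuf encoding of built-in resources such as Secrets through the `Accept` header, with JSON as a fallback. The Go sketch below only builds such a request with the standard library; the function name `newSecretListRequest` and the server address are hypothetical. A client-go based controller would more likely set the content type on its `rest.Config` instead, and this sketch does not claim to reflect what #1596 changes. Custom resources such as HTTPRoute are not served as protobuf, so any saving would apply to built-in types only.

```go
package main

import (
	"fmt"
	"net/http"
)

// protobufAccept prefers the Kubernetes protobuf encoding and falls back to JSON.
const protobufAccept = "application/vnd.kubernetes.protobuf, application/json"

// newSecretListRequest (hypothetical helper) builds a list request for the
// Secrets in one namespace and asks for a protobuf-encoded response.
func newSecretListRequest(apiServer, namespace string) (*http.Request, error) {
	url := fmt.Sprintf("%s/api/v1/namespaces/%s/secrets", apiServer, namespace)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", protobufAccept)
	return req, nil
}

func main() {
	// The address is an assumption; nothing is sent over the network here.
	req, err := newSecretListRequest("https://kubernetes.default.svc", "default")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.Path, req.Header.Get("Accept"))
}
```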

@zzjin
Contributor

zzjin commented Jan 17, 2024

@qicz @zzjin can you outline steps to reproduce the problem? From this chat it's hard to understand what the trigger is.

Our analysis concludes that the OOM is caused by a large number of Secrets combined with a memory limit that is not set properly.

Maybe that's the problem. Our cluster has about 3000 Ingresses with HTTPS, which means about 3000 Secrets.

@arkodg
Contributor

arkodg commented Jan 17, 2024

@qicz can you share memory stats of EG before and after #1596?


This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Feb 16, 2024
@arkodg
Contributor

arkodg commented May 22, 2024

Closing due to no response. Please reopen if you hit this issue again.

@arkodg arkodg closed this as completed May 22, 2024
@miguelvr

miguelvr commented Jul 15, 2024

Hi @arkodg, I've hit the same issue.

It seems that Envoy Gateway keeps creating HTTPRoutes for the HTTP01 challenge without end while the challenge is not satisfied. My (unproven) theory is that it provisions the HTTPRoute resource with generateName instead of a predictable name, and this causes an infinite reconciliation loop.

EDIT: looking at the HTTPRoute owner references, this now looks like a cert-manager issue.

@arkodg
Contributor

arkodg commented Jul 15, 2024

Thanks for debugging this one @miguelvr. Cross-linking the cert-manager issue here: cert-manager/cert-manager#7176

@envoyproxy/gateway-maintainers should we consider something like Envoy's overload manager, where we stop reconciling more resources (and flag this in the GatewayClass status) once we hit a specified memory threshold?
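
To make the idea concrete, here is a minimal Go sketch of a memory gate that a reconciler could consult before taking on more work. It is only an illustration under stated assumptions: the names `overLimit` and `heapInUse`, the 1 GiB limit, and the 0.9 fraction are hypothetical, and the Go runtime's heap figure is used as a rough proxy for what the container runtime actually counts toward the limit. Surfacing the condition in a GatewayClass status is not shown.

```go
package main

import (
	"fmt"
	"runtime"
)

// overLimit reports whether used bytes exceed the given fraction of limit.
// A limit of zero is treated as "no limit configured".
func overLimit(used, limit uint64, fraction float64) bool {
	if limit == 0 {
		return false
	}
	return float64(used) > float64(limit)*fraction
}

// heapInUse returns the bytes in in-use heap spans as seen by the Go runtime.
func heapInUse() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse
}

func main() {
	const limit = 1 << 30 // assumed 1 GiB limit, matching the default mentioned above
	if overLimit(heapInUse(), limit, 0.9) {
		fmt.Println("memory pressure: pausing reconciliation of new resources")
		return
	}
	fmt.Println("below threshold: reconciling")
}
```

`runtime.ReadMemStats` briefly stops the world, so a real gate would likely sample it periodically rather than on every reconcile.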


6 participants