Operator Deployment OOMKill. #86

Open
matthewhembree opened this issue May 31, 2023 · 11 comments

@matthewhembree
Contributor

matthewhembree commented May 31, 2023

The memory limits might be a little too low. I wonder if anyone else is seeing the same with this version. I'm not doing anything fancy.

Version: v0.10.0

I needed to patch them to 400Mi (not a precise figure; I just picked a number):

[
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/1/resources/limits/memory",
    "value": "400Mi"
  }
]
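
For anyone who prefers a one-off change over a kustomize patch, roughly the same thing can be done directly with kubectl (a sketch, assuming the default deployment name and namespace from the install manifests, with the manager as the second container):

kubectl patch deployment cloudflare-operator-controller-manager \
  -n cloudflare-operator-system \
  --type=json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/1/resources/limits/memory", "value": "400Mi"}]'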

Thanks!

Edit: Removed an incorrect code line reference. https://github.com/adyanth/cloudflare-operator/blob/c38e0cc14dceef41729f8f9852c5e3743d392bff/controllers/reconciler.go#L491
matthewhembree changed the title from "Operator Deployment crashlooping." to "Operator Deployment OOMKill." on May 31, 2023
@adyanth
Owner

adyanth commented May 31, 2023

I am assuming you mean the Cloudflare tunnel deployment and not the operator itself?

Mine seems to be running fine with the same limits. May I ask which version of cloudflared you are using?

@matthewhembree
Contributor Author

> I am assuming you mean the Cloudflare tunnel deployment and not the operator itself?

No, the operator.

This is the snippet from my kustomization.yaml:

patches:
- path: patches/cloudflare-operator-controller-manager-resources.json
  target:
    group: apps
    version: v1
    kind: Deployment
    name: cloudflare-operator-controller-manager
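
To double-check that the patch lands on the manager container, something like this prints the effective resources (assuming the manager is container index 1, as in the patch above):

kubectl -n cloudflare-operator-system get deployment cloudflare-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[1].resources}'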

@matthewhembree
Contributor Author

I'll get a log capture. Not at the system right now.

@matthewhembree
Contributor Author

Okay, I see the confusion. I referenced the tunnel deployment code in the original post.

This is what I meant to reference:

Pod logs:

manager I0531 06:34:20.420863       1 request.go:601] Waited for 1.002272575s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/batch/v1?timeout=32s
manager 1.6855148613748305e+09    INFO    controller-runtime.metrics    Metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
kube-rbac-proxy I0531 06:31:02.064905       1 main.go:190] Valid token audiences:
kube-rbac-proxy I0531 06:31:02.065069       1 main.go:262] Generating self signed cert as no cert is provided
kube-rbac-proxy I0531 06:31:02.628100       1 main.go:311] Starting TCP socket on 0.0.0.0:8443
kube-rbac-proxy I0531 06:31:02.628691       1 main.go:318] Listening securely on 0.0.0.0:8443
manager 1.6855148613757348e+09    INFO    setup    starting manager
manager 1.685514861376296e+09    INFO    Starting server    {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
manager 1.6855148613763669e+09    INFO    Starting server    {"kind": "health probe", "addr": ":8081"}
manager I0531 06:34:21.376443       1 leaderelection.go:248] attempting to acquire leader lease cloudflare-operator-system/9f193cf8.cfargotunnel.com...
manager I0531 06:34:37.877037       1 leaderelection.go:258] successfully acquired lease cloudflare-operator-system/9f193cf8.cfargotunnel.com
manager 1.6855148778770685e+09    DEBUG    events    cloudflare-operator-controller-manager-548fc568dc-cfs8c_033a82a3-9de3-4d13-90d8-123523d8bed3 became leader    {"type": "Normal", "object": {"kind":"Lease","namespace":"cloudflare-operator-system","name":"9f193cf8.cfargotunnel.com","uid":"b5e6b291-a2a9-4593-aac0-d5a4dc43119b","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1514017449"}, "reason": "LeaderElection"}
manager 1.6855148778774395e+09    INFO    Starting EventSource    {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1alpha1.Tunnel"}
manager 1.6855148778775022e+09    INFO    Starting EventSource    {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1.ConfigMap"}
manager 1.685514877877516e+09    INFO    Starting EventSource    {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1.Secret"}
manager 1.6855148778775249e+09    INFO    Starting EventSource    {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1.Deployment"}
manager 1.685514877877531e+09    INFO    Starting Controller    {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel"}
manager 1.6855148778777483e+09    INFO    Starting EventSource    {"controller": "tunnelbinding", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "TunnelBinding", "source": "kind source: *v1alpha1.TunnelBinding"}
manager 1.6855148778777907e+09    INFO    Starting Controller    {"controller": "tunnelbinding", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "TunnelBinding"}
manager 1.6855148778777394e+09    INFO    Starting EventSource    {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1alpha1.ClusterTunnel"}
manager 1.6855148778778672e+09    INFO    Starting EventSource    {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1.ConfigMap"}
manager 1.6855148778778884e+09    INFO    Starting EventSource    {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1.Secret"}
manager 1.6855148778779e+09    INFO    Starting EventSource    {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1.Deployment"}
manager 1.6855148778779075e+09    INFO    Starting Controller    {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel"}
Stream closed EOF for cloudflare-operator-system/cloudflare-operator-controller-manager-548fc568dc-cfs8c (manager)

It seems doubling the limit to 200Mi will get the container to successfully start.
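
For completeness, the OOMKill itself shows up in the container's last terminated state, e.g. (a rough check; the pod name is a placeholder):

kubectl -n cloudflare-operator-system get pod <manager-pod-name> \
  -o jsonpath='{.status.containerStatuses[?(@.name=="manager")].lastState.terminated.reason}'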

@adyanth
Owner

adyanth commented May 31, 2023

Interesting. The manager hasn't been updated since the last release, v0.10.0, so I'm surprised that this container suddenly needs more memory out of the blue.

[Screenshot 2023-05-31 at 12:35:32 AM: manager container memory usage]

It only takes about 26MiB in mine. Would you mind sharing a bit more detail about your setup (approximately how many tunnels and services it is handling)? Is it by any chance running on ARM rather than x64? I have not validated that myself.

I am interested to see whether it is a runtime thing based on usage, which I should probably call out in the README somewhere, since the deployments I have seen so far never go near 100MiB.
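
For anyone comparing numbers, per-container usage can be checked with something like the following (assuming metrics-server is installed in the cluster):

kubectl -n cloudflare-operator-system top pod --containers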

@matthewhembree
Contributor Author

Well, this is interesting. Two EKS clusters, different versions, both AL2:
1.24.13 : 5.4.241-150.347.amzn2.x86_64 - lower mem
1.23.17 : 5.10.178-162.673.amzn2.x86_64 - higher mem

[Screenshot: memory usage on the two clusters]

@matthewhembree
Contributor Author

I wonder if the kube client discovery cache is bloating the memory.

I don't have an excessive number of CRDs in either. I cleaned up the 1.24 cluster before the image above.

I'll clean up the 1.23 cluster tomorrow and see what happens.
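
As a rough sanity check of how large discovery is on a cluster (the client-side throttling line in the log above is typically caused by discovery requests), counting CRDs and listable API resources is a start, e.g.:

kubectl get crds | wc -l
kubectl api-resources --verbs=list | wc -l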

@adyanth
Owner

adyanth commented Jun 1, 2023

That does not seem right, and I cannot think of a way to debug why this one is taking more memory (other than profiling it, which I am not sure is worth the effort, haha), since the containers themselves do not have any tools for you to exec into. The 50 MB sounds about right. I do not think the kube discovery cache has anything to do with this, but sure, let me know. Mine used to be on k8s 1.22 and is now on 1.26, so the version should not be an issue.

@matthewhembree
Contributor Author

I did get an alloc flame graph with the krew flame plugin. GitHub renders it as a static image, so the 15-minute one is not very useful when posted here.

1m: [alloc-flamegraph image]

15m: [long-alloc-flamegraph image]

I guess I have something wrong with that cluster. I'll roll this out to the rest and compare.

FWIW, there's just a single ClusterTunnel in my deployment. The overlays only change the name of the tunnel.
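
For anyone wanting to reproduce the capture, the krew flame plugin invocation looks roughly like this (a sketch; exact flags depend on the kubectl-flame version, and the pod name is a placeholder):

kubectl flame <manager-pod-name> -n cloudflare-operator-system -t 1m --lang go -f alloc-flamegraph.svg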

@adyanth
Owner

adyanth commented Jun 2, 2023

Is that an alloc count graph or a byte graph? Either way, all I see are k8s libraries used by the controller, nothing from this project's code. The widest call, x/net/http2 -> compress/gzip, looks like a lot of (or a large body of, depending on which graph this is) HTTP requests to the manager pod. If health checks or something like that are misconfigured (to either send a lot of requests or requests with large content), that could be a reason too.
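
One quick way to rule that out is to dump the manager container's probe configuration and compare it against the request volume in the graph (again assuming the manager is container index 1, as in the patch earlier in the thread):

kubectl -n cloudflare-operator-system get deployment cloudflare-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[1].livenessProbe}{"\n"}{.spec.template.spec.containers[1].readinessProbe}{"\n"}'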

@hrrrsn

hrrrsn commented Nov 14, 2023

FWIW, I'm seeing this behaviour on an OpenShift 4.14 (k8s 1.27) cluster:

[Screenshot 2023-11-14 at 7:18:49 PM: operator pod memory usage hitting the limit]

After patching the limit, memory usage hovers around 150 MB:

[Screenshot 2023-11-14 at 7:28:33 PM: memory usage after raising the limit]
