Operator Deployment OOMKill. #86

Open
matthewhembree opened this issue May 31, 2023 · 11 comments

@matthewhembree
Contributor

matthewhembree commented May 31, 2023

The memory limits might be a little too low. I wonder if anyone else is seeing the same with this version. I'm not doing anything fancy.

Version: v0.10.0

I needed to patch them to 400Mi (not a precise figure; I just picked a number):

[
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/1/resources/limits/memory",
    "value": "400Mi"
  }
]
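
For anyone who prefers a one-off change over a kustomize patch, roughly the same thing can be done directly with kubectl (a sketch, assuming the default deployment name and namespace from the install manifests, with the manager as the second container):

kubectl patch deployment cloudflare-operator-controller-manager \
  -n cloudflare-operator-system \
  --type=json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/1/resources/limits/memory", "value": "400Mi"}]'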

Thanks!

Edit: Removed an incorrect code line reference. https://github.com/adyanth/cloudflare-operator/blob/c38e0cc14dceef41729f8f9852c5e3743d392bff/controllers/reconciler.go#L491
matthewhembree changed the title from "Operator Deployment crashlooping." to "Operator Deployment OOMKill." on May 31, 2023
@adyanth
Owner

adyanth commented May 31, 2023

I am assuming you mean the Cloudflare tunnel deployment and not the operator itself?

Mine seems to be running fine with the same limits. May I ask which version of cloudflared you are using?

@matthewhembree
Contributor Author

> I am assuming you mean the Cloudflare tunnel deployment and not the operator itself?

No, the operator.

This is the snippet from my kustomization.yaml:

patches:
- path: patches/cloudflare-operator-controller-manager-resources.json
  target:
    group: apps
    version: v1
    kind: Deployment
    name: cloudflare-operator-controller-manager
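
To double-check that the patch lands on the manager container, something like this prints the effective resources (assuming the manager is container index 1, as in the patch above):

kubectl -n cloudflare-operator-system get deployment cloudflare-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[1].resources}'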

@matthewhembree
Contributor Author

I'll get a log capture. Not at the system right now.

@matthewhembree
Contributor Author

Okay, I see the confusion. I referenced the tunnel deployment code in the original post.

This is what I meant to reference:

Pod logs:

manager I0531 06:34:20.420863       1 request.go:601] Waited for 1.002272575s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/batch/v1?timeout=32s
manager 1.6855148613748305e+09    INFO    controller-runtime.metrics    Metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
kube-rbac-proxy I0531 06:31:02.064905       1 main.go:190] Valid token audiences:
kube-rbac-proxy I0531 06:31:02.065069       1 main.go:262] Generating self signed cert as no cert is provided
kube-rbac-proxy I0531 06:31:02.628100       1 main.go:311] Starting TCP socket on 0.0.0.0:8443
kube-rbac-proxy I0531 06:31:02.628691       1 main.go:318] Listening securely on 0.0.0.0:8443
manager 1.6855148613757348e+09    INFO    setup    starting manager
manager 1.685514861376296e+09    INFO    Starting server    {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
manager 1.6855148613763669e+09    INFO    Starting server    {"kind": "health probe", "addr": ":8081"}
manager I0531 06:34:21.376443       1 leaderelection.go:248] attempting to acquire leader lease cloudflare-operator-system/9f193cf8.cfargotunnel.com...
manager I0531 06:34:37.877037       1 leaderelection.go:258] successfully acquired lease cloudflare-operator-system/9f193cf8.cfargotunnel.com
manager 1.6855148778770685e+09    DEBUG    events    cloudflare-operator-controller-manager-548fc568dc-cfs8c_033a82a3-9de3-4d13-90d8-123523d8bed3 became leader    {"type": "Normal", "object": {"kind":"Lease","namespace":"cloudflare-operator-system","name":"9f193cf8.cfargotunnel.com","uid":"b5e6b291-a2a9-4593-aac0-d5a4dc43119b","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1514017449"}, "reason": "LeaderElection"}
manager 1.6855148778774395e+09    INFO    Starting EventSource    {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1alpha1.Tunnel"}
manager 1.6855148778775022e+09    INFO    Starting EventSource    {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1.ConfigMap"}
manager 1.685514877877516e+09    INFO    Starting EventSource    {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1.Secret"}
manager 1.6855148778775249e+09    INFO    Starting EventSource    {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1.Deployment"}
manager 1.685514877877531e+09    INFO    Starting Controller    {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel"}
manager 1.6855148778777483e+09    INFO    Starting EventSource    {"controller": "tunnelbinding", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "TunnelBinding", "source": "kind source: *v1alpha1.TunnelBinding"}
manager 1.6855148778777907e+09    INFO    Starting Controller    {"controller": "tunnelbinding", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "TunnelBinding"}
manager 1.6855148778777394e+09    INFO    Starting EventSource    {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1alpha1.ClusterTunnel"}
manager 1.6855148778778672e+09    INFO    Starting EventSource    {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1.ConfigMap"}
manager 1.6855148778778884e+09    INFO    Starting EventSource    {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1.Secret"}
manager 1.6855148778779e+09    INFO    Starting EventSource    {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1.Deployment"}
manager 1.6855148778779075e+09    INFO    Starting Controller    {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel"}
Stream closed EOF for cloudflare-operator-system/cloudflare-operator-controller-manager-548fc568dc-cfs8c (manager)

It seems doubling the limit to 200Mi will get the container to successfully start.
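
For completeness, the OOMKill itself shows up in the container's last terminated state, e.g. (a rough check; the pod name is a placeholder):

kubectl -n cloudflare-operator-system get pod <manager-pod-name> \
  -o jsonpath='{.status.containerStatuses[?(@.name=="manager")].lastState.terminated.reason}'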

@adyanth
Owner

adyanth commented May 31, 2023

Interesting. The manager hasn't been updated since the last release, v0.10.0, so I'm surprised that this container suddenly needs more memory out of the blue.

[Screenshot 2023-05-31 at 12:35:32 AM: manager container memory usage]

It only takes about 26MiB in mine. Would you mind sharing a bit more detail about your setup (approximately how many tunnels and services it is handling)? Is it by any chance running on ARM rather than x64? I have not validated that myself.

I am interested to see whether it is a runtime thing based on usage, which I should probably call out in the README somewhere, since the deployments I have seen so far never go near 100MiB.
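
For anyone comparing numbers, per-container usage can be checked with something like the following (assuming metrics-server is installed in the cluster):

kubectl -n cloudflare-operator-system top pod --containers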

@matthewhembree
Contributor Author

Well, this is interesting. Two EKS clusters, different versions, both AL2:
1.24.13 : 5.4.241-150.347.amzn2.x86_64 - lower mem
1.23.17 : 5.10.178-162.673.amzn2.x86_64 - higher mem

[Screenshot: memory usage on the two clusters]

@matthewhembree
Contributor Author

I wonder if the kube client discovery cache is bloating the memory.

I don't have an excessive number of CRDs in either. I cleaned up the 1.24 cluster before the image above.

I'll clean up the 1.23 cluster tomorrow and see what happens.
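
As a rough sanity check of how large discovery is on a cluster (the client-side throttling line in the log above is typically caused by discovery requests), counting CRDs and listable API resources is a start, e.g.:

kubectl get crds | wc -l
kubectl api-resources --verbs=list | wc -l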

@adyanth
Owner

adyanth commented Jun 1, 2023

That does not seem right, and I cannot think of a way to debug why this one is taking more memory (other than profiling it, which I am not sure is worth the effort, haha), since the containers themselves do not have any tools for you to exec into. The 50 MB sounds about right. I do not think the kube discovery cache has anything to do with this, but sure, let me know. Mine used to be on k8s 1.22 and is now on 1.26, so the version should not be an issue.

@matthewhembree
Contributor Author

I did get an alloc flame graph with the krew flame plugin. GitHub renders it as a static image, so the 15-minute one is not very useful when posted here.

1m: [alloc-flamegraph image]

15m: [long-alloc-flamegraph image]

I guess I have something wrong with that cluster. I'll roll this out to the rest and compare.

FWIW, there's just a single ClusterTunnel in my deployment. The overlays only change the name of the tunnel.
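
For anyone wanting to reproduce the capture, the krew flame plugin invocation looks roughly like this (a sketch; exact flags depend on the kubectl-flame version, and the pod name is a placeholder):

kubectl flame <manager-pod-name> -n cloudflare-operator-system -t 1m --lang go -f alloc-flamegraph.svg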

@adyanth
Owner

adyanth commented Jun 2, 2023

Is that an alloc count graph or a byte graph? Either way, all I see are k8s libraries used by the controller, nothing from this project's code. The widest call, x/net/http2 -> compress/gzip, looks like a lot of (or a large body of, depending on which graph this is) HTTP requests to the manager pod. If health checks or something like that are misconfigured (to either send a lot of requests or requests with large content), that could be a reason too.
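
One quick way to rule that out is to dump the manager container's probe configuration and compare it against the request volume in the graph (again assuming the manager is container index 1, as in the patch earlier in the thread):

kubectl -n cloudflare-operator-system get deployment cloudflare-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[1].livenessProbe}{"\n"}{.spec.template.spec.containers[1].readinessProbe}{"\n"}'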

@hrrrsn

hrrrsn commented Nov 14, 2023

FWIW, I'm seeing this behaviour on an OpenShift 4.14 (k8s 1.27) cluster:

[Screenshot 2023-11-14 at 7:18:49 PM: operator pod memory usage hitting the limit]

After patching the limit, memory usage hovers around 150 MB:

[Screenshot 2023-11-14 at 7:28:33 PM: memory usage after raising the limit]
