Cap / Limit number of objects ingested for native and Custom Resource Metrics #2622

Open
mrueg opened this issue Mar 3, 2025 · 3 comments · May be fixed by #2626
Labels
kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@mrueg
Member

mrueg commented Mar 3, 2025

What would you like to be added:
KSM should be able to set an upper limit on the number of objects it ingests.
Why is this needed:
We observed an incident in which an autoscaler accidentally created more than 10k ReplicaSets, which KSM then tried to report on. KSM ran out of memory, and we lost visibility into the cluster.
I know we can already limit this on the scraping end in Prometheus; this is just to keep KSM from running out of resources and to give another signal on what's going on in the cluster.
Describe the solution you'd like

  • Have a generic and a resource-level command-line option that KSM uses to limit the number of items read from the Kubernetes API.
  • Expose the metrics kube_objects_watched{group="foo", kind="bar", version="baz"} and kube_objects_watched_max, the latter showing the configured limit, to allow alerting when the threshold is hit.
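
To make the proposal concrete, here is a minimal Go sketch of how a generic limit, a resource-level override, and the two metrics could fit together. It is not KSM code: `GVK`, `CappedCounter`, `Admit`, and `Render` are hypothetical names, the labels on kube_objects_watched_max are an assumption (the issue only shows labels on kube_objects_watched), and a real implementation would have to enforce the cap where objects are listed and watched rather than in a standalone counter.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// GVK identifies a resource type; the fields mirror the labels proposed in the issue.
type GVK struct {
	Group, Version, Kind string
}

// CappedCounter is a hypothetical helper (not part of KSM) that counts watched
// objects per resource type and refuses to admit more than a configured maximum.
type CappedCounter struct {
	defaultMax int         // generic limit; 0 means unlimited
	perKindMax map[GVK]int // resource-level overrides
	watched    map[GVK]int // objects currently admitted
}

func NewCappedCounter(defaultMax int, perKindMax map[GVK]int) *CappedCounter {
	return &CappedCounter{defaultMax: defaultMax, perKindMax: perKindMax, watched: map[GVK]int{}}
}

// Max returns the effective limit for a resource type: the override if one
// exists, otherwise the generic limit.
func (c *CappedCounter) Max(r GVK) int {
	if m, ok := c.perKindMax[r]; ok {
		return m
	}
	return c.defaultMax
}

// Admit records one more object and reports whether it fit under the limit.
func (c *CappedCounter) Admit(r GVK) bool {
	if m := c.Max(r); m > 0 && c.watched[r] >= m {
		return false
	}
	c.watched[r]++
	return true
}

// Render emits both proposed metrics in the Prometheus text exposition format.
func (c *CappedCounter) Render() string {
	keys := make([]GVK, 0, len(c.watched))
	for r := range c.watched {
		keys = append(keys, r)
	}
	// Sort for deterministic output.
	sort.Slice(keys, func(i, j int) bool {
		a, b := keys[i], keys[j]
		return a.Group+"/"+a.Version+"/"+a.Kind < b.Group+"/"+b.Version+"/"+b.Kind
	})
	var sb strings.Builder
	for _, r := range keys {
		labels := fmt.Sprintf("group=%q,kind=%q,version=%q", r.Group, r.Kind, r.Version)
		fmt.Fprintf(&sb, "kube_objects_watched{%s} %d\n", labels, c.watched[r])
		fmt.Fprintf(&sb, "kube_objects_watched_max{%s} %d\n", labels, c.Max(r))
	}
	return sb.String()
}

func main() {
	rs := GVK{Group: "apps", Version: "v1", Kind: "ReplicaSet"}
	// Generic limit of 1000, with a resource-level limit of 3 for ReplicaSets.
	c := NewCappedCounter(1000, map[GVK]int{rs: 3})
	for i := 0; i < 5; i++ {
		c.Admit(rs) // the 4th and 5th calls return false
	}
	fmt.Print(c.Render())
}
```

With both series exposed, an alert can fire when kube_objects_watched reaches kube_objects_watched_max for a given group, kind, and version.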

Additional context

@mrueg mrueg added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 3, 2025
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Mar 3, 2025
@rexagod
Member

rexagod commented Mar 4, 2025

/triage accepted

If this is planned further down the line, would you prefer if I moved this issue to https://github.com/rexagod/resource-state-metrics?

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 4, 2025
@mrueg
Member Author

mrueg commented Mar 4, 2025

I'd rather duplicate it. I think the unbounded number of objects needs to be addressed for both CRs and native resources.

@dgrisonnet
Member

+1 for that feature

> give another signal on what's going on in the cluster

Having new metrics and an alert that tells us when KSM reaches its object limits could definitely help. For a more in-depth investigation, we could document using the apiserver_storage_objects metric from the kube-apiserver, as well as the audit logs, to tell what is happening and which rogue client is creating the objects.

@mrueg mrueg linked a pull request Mar 10, 2025 that will close this issue