toolkit: Reconciliation status as Prometheus metrics #170
-
The toolkit controllers are recording the reconciliation status in a condition named Ready:

status:
  conditions:
  - type: Ready
    lastTransitionTime: "<last status change timestamp>"
    message: "<human readable description including errors if any>"
    reason: "<Progressing|Suspended|DependencyNotReady|ReconciliationSucceeded|etc>"
    status: "<Unknown|True|False>"

Metric spec

We could export the ready condition as a Prometheus gauge:
Gauge values:
Alert manager example:

groups:
- name: GitOpsToolkit
  rules:
  - alert: ReconciliationFailure
    expr: gitops_toolkit_ready_condition == -1
    for: 10m
    labels:
      severity: page
    annotations:
      summary: '{{ $labels.kind }} {{ $labels.namespace }}/{{ $labels.name }} reconciliation has been failing with {{ $labels.reason }} for more than ten minutes.'

Implementation spec

Create a
Monitoring stack

The ready condition metrics would allow us to build Grafana dashboards for monitoring the reconciliation process and drilling down to individual objects. The dashboards could be used to diagnose specific issues if we plot the reconciliation failures grouped by kind/namespace/name/reason.
Replies: 3 comments 3 replies
-
What about just exposing all conditions by including
-
This is a misuse of gauges, which are to record a quantity. The things you'd expect to do with a gauge are:
Neither of these yields anything useful when the gauge value represents a discrete state. (OK, you could make a chart of the state transitions of a single object, provided you could select a single object with labels. But there are better ways to get that information, like events.)

A lesser misuse might be to give an aggregate count of the number of items in each state, which (since this can fall as well as rise) would need to be a gauge.
This gives a sensible value when aggregated (and charted in aggregate), and can be used for alerts. However, you'd need to set the gauge for all values of
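To make the per-state idea concrete, here is a minimal sketch in plain Python (no Prometheus client library involved). It assumes one series per possible Ready status, with the current status set to 1 and the others reset to 0, so that summing across objects gives a count per state. The label set, the `STATUSES` tuple, the helper names, and the object names are all hypothetical and chosen only for illustration; they are not taken from the toolkit.

```python
from collections import Counter

# Assumed set of possible Ready statuses (illustrative, not from the toolkit).
STATUSES = ("True", "False", "Unknown")


def one_hot_samples(kind, namespace, name, current_status):
    """Return one (labels, value) sample per possible status.

    The series matching the current status is 1 and every other series is 0,
    so a previous state is explicitly reset instead of being left stale.
    """
    if current_status not in STATUSES:
        raise ValueError(f"unknown status: {current_status!r}")
    return [
        (
            {"kind": kind, "namespace": namespace, "name": name,
             "type": "Ready", "status": status},
            1 if status == current_status else 0,
        )
        for status in STATUSES
    ]


def count_by_status(samples):
    """Sum sample values per status, similar to `sum by (status)` in PromQL."""
    totals = Counter({status: 0 for status in STATUSES})
    for labels, value in samples:
        totals[labels["status"]] += value
    return dict(totals)


# Hypothetical objects and their current Ready status.
objects = [
    ("GitRepository", "flux-system", "podinfo", "True"),
    ("Kustomization", "flux-system", "podinfo", "False"),
    ("Kustomization", "flux-system", "infra", "False"),
]
samples = [s for obj in objects for s in one_hot_samples(*obj)]
print(count_by_status(samples))
```

With this encoding an aggregate such as "number of objects that are not ready" is a plain sum, which is a quantity a gauge can sensibly hold. The cost is the one noted above: every possible value has to be written on each update, otherwise an old state keeps reporting 1.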
-
The custom metrics implementation is tracked here: #329