add build team availability alerts #353

Open
wants to merge 1 commit into base: main
23 changes: 23 additions & 0 deletions rhobs/alerting/data_plane/prometheus.build_service_alerts.yaml
@@ -0,0 +1,23 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rhtap-build-service-github-app-alerting
  labels:
    tenant: rhtap
spec:
  groups:
    - name: build_service_github_app_alerts
      interval: 1m
      rules:
        - alert: GitHubAppFailureAlert
          expr: absent(konflux_up{service="github", check="build-service"}) == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "'konflux_up' availability metric missing for GitHub App in build-service."
            description: >-
              The 'konflux_up' availability metric for the GitHub App in the build-service has not been reported for check {{ $labels.check }} on service {{ $labels.service }} for over 5 minutes, indicating a possible service disruption.
            team: build
            alert_team_handle: <!subteam^S03DM1RL0TF>
            runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/build-service/availability_github_app.md
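A note on the expression in this rule: `absent()` returns either an empty vector or a single element whose value is 1, so the trailing `== 1` comparison filters nothing out. Because both matchers are equality matchers, `absent()` also copies `service` and `check` onto its result, which is why `{{ $labels.check }}` and `{{ $labels.service }}` resolve in the description. A minimal sketch of the same rule entry without the comparison, shown only as an equivalent form and not as a change this PR makes:

```yaml
# Sketch only: equivalent expression without the redundant comparison.
- alert: GitHubAppFailureAlert
  # absent() yields a single 1-valued sample when no matching series exists
  # and nothing otherwise, so "== 1" is already implied.
  expr: absent(konflux_up{service="github", check="build-service"})
  for: 5m
  labels:
    severity: warning
```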
24 changes: 24 additions & 0 deletions rhobs/alerting/data_plane/prometheus.image_controller_alerts.yaml
@@ -0,0 +1,24 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rhtap-image-controller-quay-alerting
  labels:
    tenant: rhtap
spec:
  groups:
    - name: image_controller_quay_alerts
      interval: 1m
      rules:
        - alert: QuayFailureAlert
          expr: absent(konflux_up{service="quay", check="image-controller"}) == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Availability metric 'konflux_up' missing for Quay in image-controller.
            description: >-
              The availability metric 'konflux_up' is missing for the Quay service in the image-controller
              for more than 5 minutes, indicating a potential service failure.
            team: build
            alert_team_handle: <!subteam^S03DM1RL0TF>
            runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/image-controller/availability_quay.md
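The rule above fires only when the series is missing entirely. If `konflux_up` follows the convention of Prometheus's built-in `up` metric (1 for healthy, 0 for failing), which this PR does not state, a reported failure would not be caught by `absent()` and would need its own expression. A hedged sketch of such a companion rule, where the alert name `QuayDownAlert` is hypothetical and the 0-means-failing semantics are an assumption:

```yaml
# Hypothetical companion rule; not part of this PR.
# Assumes konflux_up reports 0 when the Quay check fails.
- alert: QuayDownAlert
  expr: konflux_up{service="quay", check="image-controller"} == 0
  for: 5m
  labels:
    severity: warning
```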
31 changes: 31 additions & 0 deletions test/promql/tests/data_plane/github_app_test.yaml
@@ -0,0 +1,31 @@
rule_files:
  - prometheus.build_service_alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: konflux_up{service="github", check="build-service"}
        values: '_x6 1'
    alert_rule_test:
      - alertname: GitHubAppFailureAlert
        eval_time: 4m
        exp_alerts: []
      - alertname: GitHubAppFailureAlert
        eval_time: 5m
        exp_alerts:
          - exp_labels:
              severity: warning
              check: build-service
              service: github
            exp_annotations:
              summary: "'konflux_up' availability metric missing for GitHub App in build-service."
              description: >-
                The 'konflux_up' availability metric for the GitHub App in the build-service has not been reported for check build-service on service github for over 5 minutes, indicating a possible service disruption.
              team: build
              alert_team_handle: <!subteam^S03DM1RL0TF>
              runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/build-service/availability_github_app.md
      - alertname: GitHubAppFailureAlert
        eval_time: 7m
        exp_alerts: []
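The test above exercises only the path where the metric is missing and then recovers. A minimal sketch of a complementary case, assuming the same rule file, in which the series is present at every evaluation and the alert is expected never to fire (the ten-minute window is an arbitrary choice for illustration):

```yaml
# Sketch only: healthy-path case, not part of this PR.
rule_files:
  - prometheus.build_service_alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: konflux_up{service="github", check="build-service"}
        # '1+0x10' expands to eleven samples of value 1 (0m through 10m).
        values: '1+0x10'
    alert_rule_test:
      - alertname: GitHubAppFailureAlert
        eval_time: 10m
        # The series exists throughout, so absent() returns nothing.
        exp_alerts: []
```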
32 changes: 32 additions & 0 deletions test/promql/tests/data_plane/quay_failure_test.yaml
@@ -0,0 +1,32 @@
rule_files:
  - prometheus.image_controller_alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: konflux_up{service="quay", check="image-controller"}
        values: '_x6 1'
    alert_rule_test:
      - alertname: QuayFailureAlert
        eval_time: 4m
        exp_alerts: []
      - alertname: QuayFailureAlert
        eval_time: 5m
        exp_alerts:
          - exp_labels:
              severity: warning
              check: image-controller
              service: quay
            exp_annotations:
              summary: Availability metric 'konflux_up' missing for Quay in image-controller.
              description: >-
                The availability metric 'konflux_up' is missing for the Quay service in the image-controller
                for more than 5 minutes, indicating a potential service failure.
              team: build
              alert_team_handle: <!subteam^S03DM1RL0TF>
              runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/image-controller/availability_quay.md
      - alertname: QuayFailureAlert
        eval_time: 7m
        exp_alerts: []
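The three evaluation times in this test follow from how the `'_x6 1'` shorthand expands. As a sketch of my reading of the promtool series notation, the same input with the shorthand written out, annotated with the expected alert state at each step:

```yaml
# Sketch only: '_x6' produces six missing samples (0m through 5m);
# the trailing 1 is the sample at 6m.
input_series:
  - series: konflux_up{service="quay", check="image-controller"}
    values: '_ _ _ _ _ _ 1'
# 0m  series missing, absent() returns 1, alert becomes pending
# 4m  still pending, "for: 5m" not yet satisfied   -> exp_alerts: []
# 5m  pending for the full 5m, alert fires         -> one expected alert
# 6m  sample present, absent() returns nothing, alert resolves
# 7m  no alert                                     -> exp_alerts: []
```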