Crons: Detect ingestion outages during clock ticks #79328

Open · 19 of 28 tasks
evanpurkhiser opened this issue Oct 17, 2024 · 6 comments

evanpurkhiser (Member) commented Oct 17, 2024

There is a failure scenario that can be very disruptive for our customers.

If we have an outage in our ingestion of cron check-ins, specifically one where we are dropping check-ins, then we may incorrectly mark customers' cron monitors as having missed check-ins. This only happens when we drop check-ins; if check-ins are merely delayed, the clock ticks that drive missed and time-out detection slow down to match the consumption of check-ins from our topic.

This is highly problematic as it means customers are unable to trust that cron alerts are accurate. It is, however, a difficult problem: if check-ins never make it into the check-ins topic, how can we differentiate between a customer's job failing to send a check-in and us failing to ingest their check-in?

In most of our ingestion failure scenarios we have seen a significant drop in check-ins. That typically looks something like this:

[Image: chart showing a significant drop in check-in ingestion volume during an incident]

Improved behavior

If we were able to detect this extreme drop in volume, we could produce clock ticks that are marked as "unknown" ticks, meaning we are moving the clock forward but suspect we may have lost many check-ins. When this happens, instead of creating missed and timed-out check-ins that trigger alerts, we can create missed check-ins with an "unknown" status and mark in-progress check-ins that are past their latest check-in time as "unknown", again without alerting customers. Once we are certain we have recovered from the incident, the clock will resume producing regular ticks that are not marked as unknown.
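
A rough sketch of what this branching could look like in code. The names here (`TickResult`, `CheckInStatus`, `create_miss`, `handle_missed_checkin`) are illustrative only, not Sentry's actual types:

```python
import enum
from dataclasses import dataclass
from datetime import datetime


class TickResult(enum.Enum):
    NORMAL = "normal"
    UNKNOWN = "unknown"  # high chance check-ins were dropped this minute


class CheckInStatus(enum.Enum):
    MISSED = "missed"
    TIMEOUT = "timeout"
    UNKNOWN = "unknown"


@dataclass
class ClockTick:
    ts: datetime
    result: TickResult


def create_miss(monitor_id: int, ts: datetime, status: CheckInStatus, notify: bool) -> None:
    """Stub: persist the missed check-in and optionally fan out an alert."""
    ...


def handle_missed_checkin(monitor_id: int, tick: ClockTick) -> None:
    if tick.result is TickResult.UNKNOWN:
        # We may simply have dropped this monitor's check-in, so record the
        # miss with an "unknown" status and do not notify the customer.
        create_miss(monitor_id, tick.ts, status=CheckInStatus.UNKNOWN, notify=False)
    else:
        create_miss(monitor_id, tick.ts, status=CheckInStatus.MISSED, notify=True)
```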

Detecting ingestion volume drops

The tricky part here is deciding whether we are in an incident state. Ideally we do not rely on an external service to tell us that we may be in an incident, since that service itself may be part of the incident (e.g., if we had Relay report that it was having problems, there is no guarantee that, when it is having problems, it would not simply fail to report them to us).

My proposed detection solution is rather simple. As we consume check-ins, we keep a bucket for each minute's worth of check-ins; each bucket is a counter of how many check-ins were consumed during that minute. We keep these buckets for 7 days' worth of data, which is 10,080 buckets.
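
As a rough illustration, the bucketing could be one counter per minute with a 7-day expiry. Using Redis here, along with the key layout, is an assumption for the sketch, not the actual implementation:

```python
from datetime import datetime

import redis

# 7 days of retention -> 10,080 one-minute buckets
VOLUME_RETENTION_SECONDS = 7 * 24 * 60 * 60

r = redis.Redis()


def record_checkin(ts: datetime) -> None:
    """Increment the volume counter for the minute this check-in belongs to."""
    minute = ts.replace(second=0, microsecond=0)
    key = f"crons:volume:{minute.strftime('%Y%m%d%H%M')}"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, VOLUME_RETENTION_SECONDS)
    pipe.execute()
```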

Each time the clock ticks across a minute, we will look at the last 7 days of that particular minute, take some type of average of those 7 counts, and compare it with the count of the minute we just ticked past. If the count differs by more than some percentage from the previous 7 days of that minute, we will produce the clock tick with an "unknown" marker, meaning we are unsure whether we collected enough data for this minute and are likely in an incident. In that case we will create misses and time-outs as "unknown".
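
A minimal sketch of that per-tick evaluation, assuming a `get_count(minute)` helper that returns the stored counter for a minute (or `None` when missing) and using a placeholder 50% deviation threshold:

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Callable, Optional


def evaluate_tick(ts: datetime, get_count: Callable[[datetime], Optional[int]]) -> str:
    """Return "normal" or "unknown" for the minute that ends at `ts`."""
    current = get_count(ts - timedelta(minutes=1))
    if current is None:
        return "normal"

    # Same minute-of-day over each of the previous 7 days.
    history = [get_count(ts - timedelta(days=d, minutes=1)) for d in range(1, 8)]

    # Drop missing buckets and the -1 sentinels written during past incidents.
    history = [c for c in history if c is not None and c >= 0]
    if not history:
        # Not enough data -- an open question discussed below.
        return "normal"

    expected = mean(history)
    if expected == 0:
        return "normal"

    # Placeholder threshold: flag the tick when volume dropped by more than 50%.
    pct_deviation = (expected - current) / expected
    return "unknown" if pct_deviation > 0.5 else "normal"
```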

Ignoring previous incidents

When a minute is detected as having an abnormally low volume, we should reset its count to a sentinel value such as -1 so that when we pick up this minute over the next 7 days, we know to ignore the data, since it will not be an accurate representation of typical volume.
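
Continuing the illustrative Redis sketch above (reusing its `r` client and `VOLUME_RETENTION_SECONDS`), resetting the bucket could look like this:

```python
def mark_minute_as_anomalous(minute_key: str) -> None:
    # Overwrite the counter with a -1 sentinel, kept for the full retention
    # window, so the next 7 days of evaluations know to skip this data point.
    r.set(minute_key, -1, ex=VOLUME_RETENTION_SECONDS)
```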

Not enough data

Warning

What should we do if we don't have enough data to determine if the past minute is within the expected volume?

Implementation

We should start by implementing this as a simple metric that we track, so we can understand what our typical difference looks like each day. It's possible some days have many more check-ins, such as Mondays at midnight, so we may need a different way to evaluate anomalies.

Warning

The implementation described above has changed. See the comment later in this issue for a description of the new approach.

Implementation

PRs needed for both approaches

  1. Scope: Backend
  2. Scope: Backend
  3. Scope: Backend
  4. Scope: Backend
  5. Scope: Backend
  6. Scope: Backend

Outdated PRs for previous approach

  1. Scope: Backend
  2. Scope: Backend
  3. Scope: Backend

PRs needed for new approach

  1. Scope: Backend
  2. Scope: Backend
  3. Scope: Backend
  4. Scope: Backend
  5. Scope: Backend
  6. Scope: Backend
evanpurkhiser added a commit that referenced this issue Oct 22, 2024
Part of GH-79328
cmanallen pushed a commit that referenced this issue Oct 23, 2024
Part of GH-79328
evanpurkhiser added a commit to getsentry/sentry-kafka-schemas that referenced this issue Oct 23, 2024
This will be used to inform the clock tick tasks that the tick detected
an abnormal amount of check-in volume for the previous minute.

Part of getsentry/sentry#79328
evanpurkhiser added a commit to getsentry/sentry-kafka-schemas that referenced this issue Oct 23, 2024
This task will be triggered when we detect an anomaly in check-in
volume during a clock tick. When this happens we are unable to know that
all check-ins before that tick have been received and will need to mark
all in-progress monitors as resulting in an 'unknown' state.

Part of getsentry/sentry#79328
evanpurkhiser added a commit to getsentry/sentry-kafka-schemas that referenced this issue Oct 24, 2024
When a clock tick is marked as having an abnormal volume we may have
lost check-ins that should have been processed during this minute. In
this scenario we do not want to notify on misses, and instead should
create them as unknown misses.

Part of getsentry/sentry#79328
evanpurkhiser added a commit that referenced this issue Oct 24, 2024
This adds a function `_evaluate_tick_decision` which looks back at the
last MONITOR_VOLUME_RETENTION days worth of history and compares the
minute we just ticked past to that data.

We record 3 metrics from this comparison

- `z_value`: This is measured as a ratio of standard deviations from the
mean value
- `pct_deviation`: This is the percentage we've deviated from the mean
- `count`: This is the number of historic data points we're considering

The z_value and pct_deviation will be most helpful in making our
decision as to whether we've entered an "incident" state or not.

Part of #79328
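
For reference, those three metrics could be computed roughly as follows. This is a sketch only, not the actual `_evaluate_tick_decision` code; `current` and `history` are assumed inputs:

```python
from statistics import mean, stdev


def tick_metrics(current: int, history: list[int]) -> dict[str, float]:
    """Compute z_value, pct_deviation, and count for one clock tick."""
    if not history:
        return {"z_value": 0.0, "pct_deviation": 0.0, "count": 0.0}

    mu = mean(history)
    # stdev needs at least two data points; guard the illustration.
    sigma = stdev(history) if len(history) > 1 else 0.0
    z_value = (current - mu) / sigma if sigma else 0.0
    pct_deviation = abs(current - mu) / mu * 100 if mu else 0.0
    return {
        "z_value": z_value,
        "pct_deviation": pct_deviation,
        "count": float(len(history)),
    }
```
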
evanpurkhiser added a commit that referenced this issue Oct 24, 2024
This will be used to pass around the result of anomaly detection during
clock ticks.

Part of #79328
evanpurkhiser added a commit to getsentry/sentry-kafka-schemas that referenced this issue Nov 11, 2024
Part of getsentry/sentry#79328

This creates the new topic necessary for delaying issue occurrences so
that we can delay creation of notifications in the case where we detect
an anomalous system incident.
gaprl (Member) commented Nov 11, 2024

Hey @swanson, thank you for the feedback! That's our goal: to completely eliminate these misleading alerts and regain your trust in our product.

I wanted to follow up on this bullet point:

We would love to be able to opt-in on a per-monitor basis to more "lax alerting" -- we have some recurring processes that are critical and we can accept false positives, others where we don't care as much

Today you are able to configure the alerting thresholds on a per-monitor basis. Can you clarify if these consecutive failures/successes would be enough for your use case, or if you're referring to something else?

swanson commented Nov 11, 2024

Today you are able to configure the alerting thresholds on a per-monitor basis. Can you clarify if these consecutive failures/successes would be enough for your use case, or if you're referring to something else?

I was imagining something like a per-monitor toggle for "Don't alert on unknown status" that would allow some monitors to opt in to behavior changes related to service degradation. The current consecutive-failures setting helps, but it is not sufficient and not that legible (we would want it to alert immediately; it's just that service degradation is so often the cause that we use it as a workaround).

evanpurkhiser added a commit that referenced this issue Nov 12, 2024
Since I'll be adding a `clock_tick` argument to `mark_failed`, it was
going to become confusing what the `ts` argument in `mark_failed` means.

This updates `mark_{ok,failed}` to use more appropriate names for what
the timestamp represents.

Refactoring as part of GH-79328
evanpurkhiser added a commit that referenced this issue Nov 12, 2024
…80600)

Refactoring as part of GH-79328

I intend for this to have a few more functions that really do not belong
in the `clock_dispatch` module.
evanpurkhiser (Member, Author) commented

Hey thanks for the comments on this @swanson!

run a "tracer" job through our actual infrastructure

This is an approach we considered, but it has its drawbacks, since that system would need to be highly fault-resistant itself; otherwise we risk a false positive where the tracer not coming through causes true misses / time-outs not to be sent out. The new approach we have, which delays sending notifications when the system detects an anomaly in check-in volume, should work in a similar way, but will have the ability to self-correct once check-in volume recovers.

We don't really care at all about the history of the check-ins, rather the notifications as you mention. So backfilling "unknown" status would be nice but honestly, we rarely look at the history during normal operations

This is super helpful feedback, since I was trying to weigh the added complexity of marking check-ins as unknown once we "think" we're in an incident and then, if we aren't, going back and correctly marking them with a missed / timeout status. I think we probably won't do that, but we will mark incorrectly generated misses / timeouts as unknown once we know we are in an incident.

Thanks again for the feedback!

evanpurkhiser (Member, Author) commented

Don't alert on unknown status

By default we won't alert if there is a system incident. We considered an option that still sends notifications in the case that we think we're producing false positive misses (for example, if you have a job that is so critical that you want to know if we think it could be down). But I think the need for this is going to be pretty rare, so we're probably not going to build this (at least not right away).

evanpurkhiser added a commit that referenced this issue Nov 13, 2024
This function returns a DecisionResult which encapsulates the
TickAnomalyDecision and AnomalyTransition values for a particular clock
tick.

In the future this logic will be run at each clock tick and the result
will later be used to decide if we can process issue occurrences in the
incident_occurrences consumer for a specific clock tick.

Part of GH-79328
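
A hedged guess at the shape of the types named in that commit, purely to illustrate how the result might be passed around; the actual fields and enum members in Sentry's codebase may differ:

```python
import enum
from dataclasses import dataclass
from datetime import datetime


class TickAnomalyDecision(enum.Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"
    INCIDENT = "incident"
    RECOVERING = "recovering"


class AnomalyTransition(enum.Enum):
    NONE = "none"
    ABNORMALITY_STARTED = "abnormality_started"
    INCIDENT_STARTED = "incident_started"
    INCIDENT_RECOVERED = "incident_recovered"


@dataclass
class DecisionResult:
    # The clock tick this decision applies to, plus the decision and any
    # state transition that happened on this tick.
    ts: datetime
    decision: TickAnomalyDecision
    transition: AnomalyTransition
```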