feat(crons): Record stats for volume history at clock tick #79574

evanpurkhiser · 2024-10-22T22:47:57Z

This adds a function `_evaluate_tick_decision` which looks back at the last `MONITOR_VOLUME_RETENTION` days' worth of history and compares the minute we just ticked past to that data.

We record three metrics from this comparison:

- `z_value`: The number of standard deviations by which the minute's volume differs from the historic mean
- `pct_deviation`: The percentage by which we've deviated from the mean
- `count`: The number of historic data points we're considering

The `z_value` and `pct_deviation` will be most helpful in deciding whether we've entered an "incident" state or not.
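For readers skimming the thread, here is a minimal sketch of how the three metrics could be derived from a list of historic per-minute volumes. The function name `tick_volume_stats` and its signature are hypothetical, and treating `pct_deviation` as an absolute value is an assumption; this is not the code from the PR.

```python
import statistics


def tick_volume_stats(past_minute_volume: int, historic_volume: list[int]) -> dict[str, float]:
    """Compare one minute's volume against historic per-minute volumes.

    Hypothetical helper for illustration only.
    """
    historic_mean = statistics.mean(historic_volume)
    historic_stdev = statistics.stdev(historic_volume)

    # z-score: signed distance from the mean, in units of standard deviation
    z_value = (past_minute_volume - historic_mean) / historic_stdev

    # Assumption: deviation is reported as an absolute percentage of the mean
    pct_deviation = abs(past_minute_volume - historic_mean) / historic_mean * 100

    return {
        "z_value": z_value,
        "pct_deviation": pct_deviation,
        "count": len(historic_volume),
    }


stats = tick_volume_stats(50, [90, 100, 110, 100, 100])
```

With this sample history the mean is 100, so a minute with volume 50 deviates by 50% and lands roughly seven standard deviations below the mean.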

Part of #79328

ram-senth · 2024-10-23T15:50:36Z

src/sentry/monitors/clock_dispatch.py

+    historic_mean = statistics.mean(historic_volume)
+    historic_stdev = statistics.stdev(historic_volume)
+
+    historic_stdev_pct = (historic_stdev / historic_mean) * 100


Curious, this metric (aka coefficient of variation) is not used in the actual logic. Is that intentional?

Yeah right now all this function is doing is recording metrics.

I want to see what the numbers look like before making any decisions on what our thresholds are.
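As a rough illustration of what the coefficient of variation captures, the sketch below compares two made-up histories that share a mean but differ in spread. The helper name and the sample data are hypothetical.

```python
import statistics


def coefficient_of_variation_pct(volumes: list[int]) -> float:
    """Standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(volumes) / statistics.mean(volumes) * 100


# Both histories have a mean of 100, but very different spread
steady = [98, 100, 102, 100, 100]
bursty = [20, 180, 100, 60, 140]

steady_cv = coefficient_of_variation_pct(steady)
bursty_cv = coefficient_of_variation_pct(bursty)
```

A high value suggests the history is noisy enough that a single minute's deviation carries less signal, which is one reason the number may be worth recording before picking thresholds.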

ram-senth · 2024-10-23T16:10:18Z

src/sentry/monitors/clock_dispatch.py

+
+    # Calculate the z-score of our past minutes volume in comparison to the
+    # historic volume data. The z-score is measured in terms of standard
+    # deviations from the mean


This interpretation of z-score as measuring the number of standard deviations from the mean is applicable only for normally distributed data. I would recommend looking at the distribution of per-minute volume. If it is not normally distributed, then I would recommend using a different metric. Seer uses interquartile range for this same reason.

Going to include it for now. I'll take a look at our existing data, but I am pretty sure it's going to be relatively normally distributed.
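For comparison, an interquartile-range check could look roughly like the sketch below. This is a generic Tukey-fence test, not Seer's implementation; the function name, the multiplier `k=1.5`, and the sample data are all assumptions.

```python
import statistics


def is_iqr_outlier(value: float, history: list[float], k: float = 1.5) -> bool:
    """Return True if value falls outside the Tukey fences of history."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3
    q1, _median, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr


# A history containing one large spike, which would inflate a standard deviation
history = [90, 95, 100, 100, 105, 110, 100, 98, 102, 1000]

low_minute_flagged = is_iqr_outlier(50, history)
normal_minute_flagged = is_iqr_outlier(101, history)
```

Because quartiles ignore how extreme the tails are, the single spike in `history` barely moves the fences, whereas it would noticeably widen a mean-and-stdev band.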

codecov · 2024-10-23T19:05:03Z

Codecov Report

Attention: Patch coverage is 88.23529% with 4 lines in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/sentry/monitors/clock_dispatch.py | 87.87% | 2 Missing and 2 partials ⚠️ |

Additional details and impacted files

@@             Coverage Diff             @@
##           master   #79574       +/-   ##
===========================================
+ Coverage   57.67%   78.47%   +20.79%     
===========================================
  Files        7125     7137       +12     
  Lines      315226   315781      +555     
  Branches    43383    43442       +59     
===========================================
+ Hits       181812   247794    +65982     
+ Misses     128695    61661    -67034     
- Partials     4719     6326     +1607

wedamija · 2024-10-24T17:44:23Z

src/sentry/monitors/clock_dispatch.py

+    if not options.get("crons.tick_volume_anomaly_detection"):
+        return


Nit: Any reason to not use a feature flag?

I couldn’t remember how to use it when we don’t have an organization lol

sentry-io · 2024-10-24T19:59:15Z

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

‼️ StatisticsError: stdev requires at least two data points monitors.monitor_consumer
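The error above is what `statistics.stdev` raises when given fewer than two values. One way to guard against it is sketched below; the helper name `safe_z_value` is hypothetical and this is not necessarily the fix that was applied afterwards.

```python
import statistics
from typing import Optional


def safe_z_value(past_minute_volume: float, historic_volume: list[float]) -> Optional[float]:
    """Return the z-score, or None when it cannot be computed."""
    # statistics.stdev raises StatisticsError with fewer than two data points
    if len(historic_volume) < 2:
        return None

    mean = statistics.mean(historic_volume)
    stdev = statistics.stdev(historic_volume)

    # A perfectly flat history has zero spread, so the z-score is undefined
    if stdev == 0:
        return None

    return (past_minute_volume - mean) / stdev
```

Returning `None` lets the caller skip recording the metric for that tick instead of failing inside the consumer.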

evanpurkhiser requested a review from a team as a code owner October 22, 2024 22:47

github-actions bot added the Scope: Backend label Oct 22, 2024

vercel bot deployed to Preview October 22, 2024 22:54

wedamija reviewed Oct 22, 2024

ram-senth reviewed Oct 23, 2024

evanpurkhiser force-pushed the evanpurkhiser/feat-crons-record-stats-for-volume-history-at-clock-tick branch from 19b0130 to 95182d9 Compare October 23, 2024 17:28

vercel bot deployed to Preview October 23, 2024 17:35

evanpurkhiser force-pushed the evanpurkhiser/feat-crons-record-stats-for-volume-history-at-clock-tick branch from 20199a4 to 630ee8b Compare October 23, 2024 18:29

vercel bot deployed to Preview October 23, 2024 18:37

evanpurkhiser force-pushed the evanpurkhiser/feat-crons-record-stats-for-volume-history-at-clock-tick branch from 90fa41b to e2280f5 Compare October 23, 2024 19:34

evanpurkhiser requested a review from wedamija October 23, 2024 19:39

vercel bot deployed to Preview October 23, 2024 19:40

evanpurkhiser force-pushed the evanpurkhiser/feat-crons-record-stats-for-volume-history-at-clock-tick branch from e2280f5 to e23481f Compare October 23, 2024 19:54

vercel bot deployed to Preview October 23, 2024 19:57

evanpurkhiser mentioned this pull request Oct 24, 2024

Crons: Detect ingestion outages during clock ticks #79328

Open

wedamija approved these changes Oct 24, 2024

evanpurkhiser merged commit 3a23ad2 into master Oct 24, 2024
49 of 50 checks passed

evanpurkhiser deleted the evanpurkhiser/feat-crons-record-stats-for-volume-history-at-clock-tick branch October 24, 2024 18:01

github-actions bot locked and limited conversation to collaborators Nov 9, 2024
