Post mortem report for incident on Aug 26, 2024 #473
Describe what went well, what went wrong and where we got lucky during the resolution of this problem.
-->

- Ensure all connected tables have the same column sizes.
How?
There is more than one way to do this. It requires knowledge of the system architecture.
One thing I can point to is a thorough review.
Then this isn't something we learned. We know we should do code-review :)
Maybe it rather points to a bad architectural decision that we should revise?
I've improved this reasoning 🙂
@darix brought up some worthwhile thoughts in openSUSE/open-build-service#16751
1975ebb to 0c3cf5b
I've checked the conversation. Sounds like we are going with the migration to change the column size.
For sure, but what about the other things mentioned? Like more sophisticated exception handling...
0c3cf5b to 2cd51cb
Done. I've mentioned this in the action items.
<!--
Classify the severity of this problem. We usually say:
- service degradation: if only a few customers were impacted
- severe service degradation: if nearly every customer was impacted
- downtime: if every visit to the OBS ended up on some error page
-->
Maybe this will revive the excerpt on the index page...
<!--
What happened, for how long and who was impacted by it?
For customers to be able to identify if their problem is related to this post mortem or not.
-->
<!--
How did you resolve or work around this problem?
For customers and community to understand what happened technically.
-->
<!--
Are there any actions we are going to do that are not done yet?
For customers and community to be able to follow up on this.
-->
<!--
Describe what went well, what went wrong and where we got lucky during the resolution of this problem.
-->
On Friday, August 23rd at 11:59 UTC, Email and Web notifications stopped working until August 26th at 10:16 UTC.

In the notification creation process, the `payload` field in the `events` table is copied to the `event_payload` field in the `notifications` table. These fields are required to have the same column size, but we discovered they don't have the same size.
In the notification creation process, the `payload` field in the `events` table is copied to the `event_payload` field in the `notifications` table. These fields are required to have the same column size, but we discovered they don't have the same size.
The cause of this was mismatched database table column size restrictions that we did not notice.
For customers to be able to identify if their problem is related to this post mortem or not.
-->

On Friday, August 23rd at 11:59 UTC, Email and Web notifications stopped working until August 26th at 10:16 UTC.
On Friday, August 23rd at 11:59 UTC, Email and Web notifications stopped working until August 26th at 10:16 UTC.
No user of build.opensuse.org received notifications via the web interface or email from Friday, August 23rd at 11:59 UTC until Monday, August 26th at 10:16 UTC. No notification or email was lost; all of them were delivered with a delay, some with a delay of several days.
@@ -0,0 +1,75 @@
---
layout: post
title: "Degraded performance of OBS Web and Email Notifications system"
title: "Degraded performance of OBS Web and Email Notifications system"
title: "Service degradation of OBS Web and Email Notifications system"
In the notification creation process, the `payload` field in the `events` table is copied to the `event_payload` field in the `notifications` table. These fields are required to have the same column size, but we discovered they don't have the same size.

As a result, when the system attempted to create a notification for an event with a long description in its payload, an exception was raised, causing the notification creation process to fail.
As a result, when the system attempted to create a notification for an event with a long description in its payload, an exception was raised, causing the notification creation process to fail.
As a result, when the system attempted to create a notification, an exception was raised, causing the notification creation process to fail entirely.
## Detection

We received the first Grafana alert on Friday, August 23rd at 15:00 UTC, indicating that no notifications had been sent in the last three hours. Upon checking Errbit, we found a single exception with over 100,000 occurrences.
We received the first Grafana alert on Friday, August 23rd at 15:00 UTC, indicating that no notifications had been sent in the last three hours. Upon checking Errbit, we found a single exception with over 100,000 occurrences
We received the first alert on Friday, August 23rd at 15:00 UTC, indicating that no notifications had been sent in the last three hours.
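As a rough sketch of the kind of check behind such an alert (the actual alert rule is not shown in this report, and `notifications_stalled` is a hypothetical name), the three-hour window can be expressed in Python like this:

```python
from datetime import datetime, timedelta, timezone


def notifications_stalled(last_sent: datetime, now: datetime,
                          threshold: timedelta = timedelta(hours=3)) -> bool:
    """Return True if no notification was sent within the threshold window."""
    return now - last_sent >= threshold


# Timestamps taken from the incident description above.
last_sent = datetime(2024, 8, 23, 11, 59, tzinfo=timezone.utc)
alert_time = datetime(2024, 8, 23, 15, 0, tzinfo=timezone.utc)

stalled_at_alert = notifications_stalled(last_sent, alert_time)
stalled_an_hour_earlier = notifications_stalled(
    last_sent, alert_time - timedelta(hours=1)
)
```

With these inputs the check is satisfied at 15:00 UTC but not yet at 14:00 UTC, which is consistent with an alert arriving roughly three hours after notifications stopped.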
## Root Cause

The root cause of the problem was a mismatch in column sizes between the `payload` field in the `events` table and the `event_payload` field in the `notifications` table. Because of this mismatch, when the system attempted to create a notification from an event with a very long description in its payload, an exception was raised that blocked the creation of further email and web notifications.
The root cause of the problem was a mismatch in column sizes between the `payload` field in the `events` table and the `event_payload` field in the `notifications` table. Because of this mismatch, when the system attempted to create a notification from an event with a very long description in its payload, an exception was raised that blocked the creation of further email and web notifications.
In February we changed the size limit of the `payload` field in the `events` table by switching it from TEXT to MEDIUMTEXT ([PR#15649](https://github.com/openSUSE/open-build-service/pull/15649)). During the notification process we copy the content of this column to the `event_payload` field in the `notifications` table, which is of type TEXT. This mismatch caused an exception to be raised when the system attempted to create a notification from an event with a very long `payload` column.
Additionally, the queue that handles notification creation works sequentially, with very many retries and a very long hold-off time. That blocked the creation of *all* email and web notifications.
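To illustrate the mismatch described above: in MySQL and MariaDB a TEXT column holds at most 65,535 bytes, while a MEDIUMTEXT column holds up to 16,777,215 bytes. The following Python sketch is not OBS code; `fits_column` and the 70,000-character payload are made up for illustration. It shows a value that the source column accepts but that is too large for the destination column.

```python
# Maximum lengths in bytes of the two MySQL/MariaDB column types involved.
TEXT_MAX_BYTES = 2**16 - 1        # TEXT: 65,535 bytes
MEDIUMTEXT_MAX_BYTES = 2**24 - 1  # MEDIUMTEXT: 16,777,215 bytes


def fits_column(value: str, max_bytes: int) -> bool:
    """Return True if the UTF-8 encoding of value fits into max_bytes."""
    return len(value.encode("utf-8")) <= max_bytes


# Hypothetical payload: larger than TEXT allows, well within MEDIUMTEXT.
payload = "x" * 70_000

# events.payload (MEDIUMTEXT) can store it ...
fits_in_events = fits_column(payload, MEDIUMTEXT_MAX_BYTES)
# ... but notifications.event_payload (TEXT) cannot.
fits_in_notifications = fits_column(payload, TEXT_MAX_BYTES)
```

Any payload between the two limits can be stored as an event but cannot be copied into a notification unchanged.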
For customers and community to understand what happened technically.
-->

We were able to identify the problematic event, take a backup of that event, and then delete it. After this, the system began to recover gradually.
We were able to identify the problematic event, take a backup of that event, and then delete it. After this, the system began to recover gradually.
We identified the problematic event and took it out of the queue. After this, the system began to recover gradually.
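The blocking behaviour and the manual fix can be sketched as follows. This is a minimal Python illustration, not how the OBS job queue is implemented: `create_notification`, `drain`, `PayloadTooLong` and the limit of three attempts are assumptions chosen for the example. It shows one way more defensive exception handling could work, where a job that keeps failing is set aside after a bounded number of attempts so that later jobs still get processed.

```python
from collections import deque


class PayloadTooLong(Exception):
    """Stand-in for the database error raised when a value exceeds the column size."""


def create_notification(event: dict, max_bytes: int = 65_535) -> str:
    # Hypothetical handler: copying the payload fails if it exceeds the destination limit.
    if len(event["payload"].encode("utf-8")) > max_bytes:
        raise PayloadTooLong(f"event {event['id']} does not fit")
    return f"notification for event {event['id']}"


def drain(queue: deque, max_attempts: int = 3):
    """Process jobs in order; park a job after max_attempts failures instead of blocking."""
    delivered, parked = [], []
    while queue:
        event = queue.popleft()
        for attempt in range(1, max_attempts + 1):
            try:
                delivered.append(create_notification(event))
                break
            except PayloadTooLong:
                if attempt == max_attempts:
                    # Set the job aside for manual inspection and move on.
                    parked.append(event["id"])
    return delivered, parked


events = deque([
    {"id": 1, "payload": "short"},
    {"id": 2, "payload": "x" * 70_000},  # the oversized event
    {"id": 3, "payload": "short"},
])
delivered, parked = drain(events)
```

In this sketch the oversized event ends up in `parked`, which corresponds to taking it out of the queue by hand, while the events before and after it are still delivered.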
| Action Item | Owner |
|--- |--- |
| [Data migration to update column size](https://github.com/openSUSE/open-build-service/pull/16751) | Developer Team |
| [Improve exception handling](ihttps://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |
A comment in a PR is not a way for us to track work...
Describe what went well, what went wrong and where we got lucky during the resolution of this problem.
-->

- Ensure that all connected tables have consistent column sizes by thoroughly reviewing the system architecture. We recognize that there is close coupling of components, and this design can be improved.
And where are we going to do that? And when?
-->

- Ensure that all connected tables have consistent column sizes by thoroughly reviewing the system architecture. We recognize that there is close coupling of components, and this design can be improved.
- We had failed notifications in delayed jobs, and that saved us from data loss.
I'm not sure I understand what you want to say here...
| Action Item | Owner |
|--- |--- |
| [Data migration to update column size](https://github.com/openSUSE/open-build-service/pull/16751) | Developer Team |
| [Improve exception handling](ihttps://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |
| [Improve exception handling](ihttps://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |
| [Improve exception handling](https://github.com/openSUSE/open-build-service/pull/16751#issuecomment-2309776466) | Developer Team |
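One possible way to catch this class of problem before it reaches production is an automated check that compares the types of columns that data is copied between. The sketch below is an illustration in Python resting on several assumptions: the `schema` dictionary stands in for what could be read from `information_schema.columns`, the `copies` list of coupled columns would have to be maintained by hand, and `undersized_destinations` is a hypothetical helper.

```python
# Capacity in bytes of the MySQL/MariaDB text column types.
CAPACITY = {
    "TINYTEXT": 2**8 - 1,
    "TEXT": 2**16 - 1,
    "MEDIUMTEXT": 2**24 - 1,
    "LONGTEXT": 2**32 - 1,
}

# Hypothetical schema snapshot, keyed by (table, column).
schema = {
    ("events", "payload"): "MEDIUMTEXT",
    ("notifications", "event_payload"): "TEXT",
}

# (source, destination) pairs where data is copied from one column into another.
copies = [(("events", "payload"), ("notifications", "event_payload"))]


def undersized_destinations(schema, copies):
    """Return the copy pairs whose destination column is smaller than the source."""
    return [
        (src, dst)
        for src, dst in copies
        if CAPACITY[schema[dst]] < CAPACITY[schema[src]]
    ]


problems = undersized_destinations(schema, copies)

# After a migration that widens the destination column, the check passes.
migrated = dict(schema)
migrated[("notifications", "event_payload")] = "MEDIUMTEXT"
problems_after_migration = undersized_destinations(migrated, copies)
```

Run as part of a test suite, a check like this would flag the pair from this incident and report nothing once both columns are MEDIUMTEXT.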