Clean Up and Improve Data Platform Alerts #1460
Labels
Data Engineering
product:data-platform
Issues related to the Data Platform product
product:infrastructure
Issues related to application and operations infrastructure
User Story
Currently, we have a Slack channel dedicated to data platform alerts: data-platform-alerts. A number of errors are posted there daily, related to recurring issues in Dagster and Airbyte. The error messages are often unhelpful and sometimes confusing, and they rarely pinpoint what went wrong with a given pipeline run or sync. Most of the time, we have to spend additional time digging through the error logs in Dagster, Airbyte, or Grafana to figure out what went wrong.
As a member of the data platform team, I want to be notified of issues and failures in our ingestion pipelines and be able to respond to them as quickly as possible. Having more detailed and specific error alerts would reduce the amount of time needed to investigate and address a particular issue. This can be achieved by improving the error handling and formatting of our notifications to increase clarity and serviceability.
Description/Context
We should start by documenting the issues we are seeing daily in the channel.
Issues and failures that affect data flowing downstream should be addressed first.
Once the higher-priority errors are addressed, we should improve the remaining alerts to be more helpful.
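As a starting point for discussion, here is a minimal sketch of what a more detailed alert could look like. It is not tied to our current Dagster or Airbyte setup: `PipelineFailure`, `format_alert`, and all field names are hypothetical, and the only external assumption is that the alert is posted as a Slack incoming-webhook payload, which accepts a JSON object with a `text` field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineFailure:
    """Hypothetical container for the details an alert should carry."""
    source: str                       # e.g. "Dagster" or "Airbyte"
    pipeline: str                     # job or connection name
    run_id: str
    error_summary: str
    failed_step: Optional[str] = None
    logs_url: Optional[str] = None


def format_alert(failure: PipelineFailure) -> dict:
    """Build a Slack webhook payload (a dict with a "text" key)."""
    lines = [
        f":rotating_light: *{failure.source} failure*: `{failure.pipeline}`",
        f"*Run ID:* `{failure.run_id}`",
    ]
    if failure.failed_step:
        lines.append(f"*Failed step:* `{failure.failed_step}`")

    # Keep only the first line of the error so the alert stays scannable;
    # the full traceback remains available behind the logs link.
    summary = failure.error_summary.strip()
    first_line = summary.splitlines()[0] if summary else "(no error message captured)"
    lines.append(f"*Error:* {first_line}")

    if failure.logs_url:
        lines.append(f"*Logs:* {failure.logs_url}")
    return {"text": "\n".join(lines)}


# Example with made-up values.
example = PipelineFailure(
    source="Dagster",
    pipeline="example_ingest_job",
    run_id="abc123",
    error_summary="ConnectionError: upstream API timed out\nTraceback ...",
    failed_step="extract_records",
    logs_url="https://example.invalid/runs/abc123",
)
payload = format_alert(example)
```

The idea is that every alert names the system, the pipeline, the run, the failing step, and a one-line cause, with a direct link to the logs, so the first round of debugging can happen from the channel itself.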
Plan/Design