Clean Up and Improve Data Platform Alerts #1460

Open
2 of 11 tasks
quazi-h opened this issue Feb 3, 2025 · 1 comment
Assignees
Labels
Data Engineering product:data-platform Issues related to the Data Platform product product:infrastructure Issues related to application and operations infrastructure

Comments

@quazi-h
Contributor

quazi-h commented Feb 3, 2025

User Story

Currently, we have a Slack channel designated for data-platform-specific alerts: data-platform-alerts. A number of errors related to recurring issues in Dagster and Airbyte are posted there daily. The error messages are not very helpful, and are sometimes confusing, because they do not point out exactly what went wrong with a given pipeline run or sync. Most of the time, this means spending additional time debugging and digging into the error logs in Dagster, Airbyte, or Grafana to figure out what went wrong.

As a member of the data platform team, I want to be notified of issues and failures in our ingestion pipelines and be able to respond to them as quickly as possible. Having more detailed and specific error alerts would reduce the amount of time needed to investigate and address a particular issue. This can be achieved by improving the error handling and formatting of our notifications to increase clarity and serviceability.

Description/Context

We should start by documenting the issues we are seeing daily in the channel.
Issues and failures that are affecting data flowing downstream should be addressed first.
Once we have the higher priority errors addressed, we should improve the remaining alerts to be more helpful.
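
One possible way to start documenting the recurring errors is to group alert messages that differ only in run-specific details. The following is a minimal sketch, not an existing tool: the function names and the normalization rules (UUIDs, timestamps, and bare numbers replaced by placeholders) are assumptions chosen for illustration, and the real messages in the channel may need different patterns.

```python
import re
from collections import Counter

# Hypothetical normalization rules: replace run-specific details with
# placeholders so that repeated occurrences of one error collapse together.
# Order matters: UUIDs and timestamps are replaced before bare numbers.
_PATTERNS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<uuid>"),
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<timestamp>"),
    (re.compile(r"\b\d+\b"), "<n>"),
]


def normalize(message: str) -> str:
    """Return the message with run-specific details replaced by placeholders."""
    text = message.strip()
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    # Collapse runs of whitespace so formatting differences do not matter.
    return re.sub(r"\s+", " ", text)


def count_unique_errors(messages):
    """Count how often each normalized error message occurs."""
    return Counter(normalize(m) for m in messages)
```

Calling `count_unique_errors(...).most_common()` on a list of message strings would give a frequency-ordered list of distinct errors, which could seed the documentation step.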

Plan/Design

  • Document the unique errors and alerts we are seeing in that channel
  • Update the channel name to follow the "notifications" convention
  • Address the current pipeline issues and bugs that are affecting data ingestion
    • Migrate edx.org Production Course Structure Airbyte ingestion to Iceberg Connector (1461)
    • Dagster course_xml asset materialization failures (6729)
  • Add more specific alerts to Dagster (https://github.com/mitodl/hq/issues/6223)
    • Code Locations failing to load (Testing in QA)
    • Assets failing to materialize (WIP)
    • Job run failures or errors
    • Run queue errors
  • Add more specific alerts to Airbyte
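
The alert types listed above could share one message format, so that every notification states the source, the failure category, the affected target, and a short error summary. The sketch below is only an illustration under stated assumptions: `AlertCategory`, `Alert`, and `format_alert` are hypothetical names, not part of Dagster or Airbyte, and the only external convention used is the `{"text": ...}` JSON payload accepted by Slack incoming webhooks.

```python
from dataclasses import dataclass
from enum import Enum


class AlertCategory(Enum):
    """Hypothetical categories mirroring the Dagster alert types in the plan."""
    CODE_LOCATION_LOAD_FAILURE = "Code location failed to load"
    ASSET_MATERIALIZATION_FAILURE = "Asset failed to materialize"
    JOB_RUN_FAILURE = "Job run failed"
    RUN_QUEUE_ERROR = "Run queue error"


@dataclass(frozen=True)
class Alert:
    category: AlertCategory
    source: str          # e.g. "Dagster" or "Airbyte"
    target: str          # name of the asset, job, or code location
    error_summary: str   # short description of what went wrong
    logs_url: str = ""   # optional link to the relevant logs


def format_alert(alert: Alert, max_summary_chars: int = 300) -> dict:
    """Build a Slack incoming-webhook payload for the given alert."""
    summary = alert.error_summary.strip()
    if len(summary) > max_summary_chars:
        # Truncate long tracebacks so the message stays readable in Slack.
        summary = summary[:max_summary_chars].rstrip() + "..."
    lines = [
        f"[{alert.source}] {alert.category.value}: {alert.target}",
        f"Error: {summary}",
    ]
    if alert.logs_url:
        lines.append(f"Logs: {alert.logs_url}")
    return {"text": "\n".join(lines)}
```

Keeping the formatting in one small function means the same structure could be reused for Airbyte alerts by passing a different `source`, leaving only the category list to extend.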
@quazi-h quazi-h added Data Engineering product:data-platform Issues related to the Data Platform product product:infrastructure Issues related to application and operations infrastructure labels Feb 3, 2025
@quazi-h quazi-h self-assigned this Feb 3, 2025
@quazi-h
Contributor Author

quazi-h commented Feb 7, 2025

I am tracking issues on this Google Sheet.

Issue #1461 has been resolved by migrating the edX.org course structure syncs to the new Airbyte Iceberg connector.
