[Improvement][Master] Allow Recovery of Failed Tasks in Running Workflows #16606

Open
2 of 3 tasks
sketchmind opened this issue Sep 11, 2024 · 1 comment
Labels
discussion, improvement, Stale

Comments

@sketchmind
Contributor

sketchmind commented Sep 11, 2024

Search before asking

  • I have searched the issues and found no similar feature request.

Description

When DolphinScheduler is configured so that a workflow continues after a task fails, the "Recovery Failed" feature has a limitation. If one task in a workflow fails while other tasks are still running, the "Recovery Failed" option is unavailable. The workflow can be recovered only after the entire workflow has failed, which delays the failed task and all of its downstream tasks.

For example, in the attached scenario (see image):
image
Task B1 has failed, while other tasks such as A1 (which Workflow2 depends on) are still running. If we wait for Workflow1 to fail before recovering B1, B1's completion is delayed. If we instead terminate Workflow1 immediately and then recover it, A1 is killed, so the dependent Workflow2 fails unnecessarily and has to be recovered as well.

Proposed Feature:
We suggest adding a feature that allows failed tasks to be recovered while their workflow is still running. This would let us recover tasks like B1 proactively, before the entire workflow fails, giving workflows that would otherwise fail a chance to complete successfully.

This enhancement could save time and prevent cascading failures in dependent workflows. It would be particularly useful in scenarios where we can foresee a task's failure leading to the workflow’s eventual failure.
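To make the request concrete, here is a minimal sketch of the Workflow1 scenario. It is a toy in-memory model, not DolphinScheduler code: `State`, `Workflow`, `can_recover_today`, and `recover_failed_tasks` are hypothetical names used only for illustration, and the state rule assumes a workflow counts as running while any of its tasks is running.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class State(Enum):
    RUNNING = "running"
    FAILED = "failed"
    SUCCESS = "success"


@dataclass
class Workflow:
    # Hypothetical model for illustration; not DolphinScheduler's real classes.
    name: str
    tasks: Dict[str, State] = field(default_factory=dict)

    @property
    def state(self) -> State:
        states = set(self.tasks.values())
        if State.RUNNING in states:
            # Assumption: the workflow stays "running" while any task runs.
            return State.RUNNING
        if State.FAILED in states:
            return State.FAILED
        return State.SUCCESS


def can_recover_today(wf: Workflow) -> bool:
    """Behaviour described in this issue: recovery needs a fully failed workflow."""
    return wf.state is State.FAILED


def recover_failed_tasks(wf: Workflow) -> List[str]:
    """Proposed behaviour: rerun only the failed tasks, leave running tasks alone."""
    retried = [name for name, state in wf.tasks.items() if state is State.FAILED]
    for name in retried:
        wf.tasks[name] = State.RUNNING
    return retried


# Workflow1 from the example: B1 has failed while A1 is still running.
wf1 = Workflow("Workflow1", {"A1": State.RUNNING, "B1": State.FAILED})
print(can_recover_today(wf1))     # False
print(recover_failed_tasks(wf1))  # ['B1']
```

In this sketch A1 is never touched, so a workflow that depends on A1 (Workflow2 in the example) would not be affected by recovering B1.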

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@sketchmind added the improvement and Waiting for reply labels Sep 11, 2024
@SbloodyS added the discussion label and removed the Waiting for reply label Sep 11, 2024

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Oct 13, 2024