Describe the current behavior
As a cost-overrun prevention measure, our Kubernetes work pool base job template has `active_deadline_seconds` set. If a pod is killed out from under the job, the flow run stays in the `Running` state forever and takes up a slot in the work queue.
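For context, this is roughly where the deadline lives in the Kubernetes Job manifest that the base job template produces (illustrative values; the image and deadline here are just examples):

```yaml
# Illustrative Kubernetes Job spec fragment. With this cap, Kubernetes
# kills the job's pods after one hour, but the Prefect flow run can be
# left in the Running state.
apiVersion: batch/v1
kind: Job
spec:
  activeDeadlineSeconds: 3600  # hard runtime cap for the job
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: prefect-job
          image: prefecthq/prefect:2-latest
```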
Describe the proposed behavior
I think it would make sense to track the pod state for Kubernetes work pools, so that jobs know either to start a new pod and re-run, or to report `Failed` with a reason.
Likewise, it would be good if some job metadata (such as the pod name, resource requests, etc.) were visible from the Prefect UI.
Example Use
Users can add a simple layer of protection against cost overruns by setting `active_deadline_seconds`.
Additional context
No response
Thanks for the enhancement request @jpedrick-numeus! One idea we've had in this area is to have pods send heartbeats back to the Prefect server, so that if the heartbeats stop, the server knows the pod went down. In that case, we'd probably mark the flow run as CRASHED, since the underlying infrastructure caused the failure. Does that sound like it would work for your use case?
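The heartbeat idea could look something like the sketch below. This is not Prefect's actual API, just an illustration of the mechanism: the server records the last heartbeat time per flow run and marks a run CRASHED once heartbeats go stale, which frees its slot in the work queue.

```python
import time


class HeartbeatMonitor:
    """Illustrative sketch (hypothetical, not Prefect's real implementation):
    tracks the last heartbeat per flow run and crashes stale runs."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock  # injectable clock, handy for testing
        self.last_beat: dict[str, float] = {}
        self.states: dict[str, str] = {}

    def heartbeat(self, flow_run_id: str) -> None:
        """Called each time a pod reports in for its flow run."""
        self.last_beat[flow_run_id] = self.clock()
        self.states[flow_run_id] = "RUNNING"

    def sweep(self) -> list[str]:
        """Mark runs with stale heartbeats as CRASHED; return newly crashed ids."""
        now = self.clock()
        crashed = []
        for run_id, beat in self.last_beat.items():
            if self.states[run_id] == "RUNNING" and now - beat > self.timeout:
                self.states[run_id] = "CRASHED"
                crashed.append(run_id)
        return crashed
```

A run that transitions to CRASHED no longer occupies a work-queue slot, so the next job can start, which addresses the stuck-`Running` problem described above.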
Also, where would you expect to see Kubernetes information for a flow run in the Prefect UI?
@desertaxle that would work for me. In my case I only need the pod state to be tracked so that Prefect knows to move on to the next job in the work queue.
I think the details tab under https:///flow-runs/flow-run/?tab=Details would be perfect.