I searched the issues and found no similar issues.
What happened
SeaTunnel has its own checkpoint mechanism, and DolphinScheduler has its own fault-tolerance recovery mechanism. The combination of the two is currently imperfect and buggy. Verified as follows:
1. Deploy a SeaTunnel CDC task through DolphinScheduler.
2. Simulate an unexpected outage: kill the task process scheduled by DolphinScheduler (the SeaTunnel client task is not killed at this point).
3. Restart DolphinScheduler.
4. DolphinScheduler now triggers its fault-tolerance recovery mechanism and resubmits a new SeaTunnel client task.
5. When there are many SeaTunnel client tasks, they are recovered one after another, so duplicate SeaTunnel jobs get created. With enough tasks, this makes the CPU spike within a short time and eventually causes an avalanche.
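The root of the duplicate-job problem in step 5 is that recovery resubmits a task whose original SeaTunnel client process is still alive. A minimal, hypothetical sketch of an orphan check (none of these names come from the DolphinScheduler codebase; `clientStillRunning` and the job marker are assumptions for illustration):

```java
// Hypothetical pre-recovery check: before resubmitting a task, scan live
// processes for an existing SeaTunnel client that carries the same job marker
// (e.g. a job id embedded in the launch command line).
public class OrphanClientCheck {

    /** True if some live process command line contains the given job marker. */
    public static boolean clientStillRunning(String jobMarker) {
        return ProcessHandle.allProcesses()
                // commandLine() is empty for processes we lack permission to inspect
                .map(p -> p.info().commandLine().orElse(""))
                .anyMatch(cmd -> cmd.contains(jobMarker));
    }

    public static void main(String[] args) {
        // A marker no process on this machine uses, so the check comes back false
        // and recovery would be allowed to resubmit.
        System.out.println(clientStillRunning("seatunnel-hypothetical-marker-123"));
    }
}
```

If the check returns true, recovery could skip the resubmit (or reattach to the surviving client) instead of creating a duplicate SeaTunnel job.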
What you expected to happen
1. DolphinScheduler's fault-tolerance recovery currently appears to run concurrently. Given the number of tasks, should the concurrency be limited during recovery, or should tasks even be recovered serially?
2. When the scheduler crashes unexpectedly and is started again, if the SeaTunnel task was never killed, it should not need to be recovered at all.
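Both suggestions above can be sketched together. This is only an illustration under stated assumptions: the class and method names (`ThrottledRecovery`, `recover`) are hypothetical and not part of DolphinScheduler; the real resubmit call is replaced by a counter, and process liveness is checked via the JDK's `ProcessHandle`.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of throttled fault-tolerance recovery:
// (1) bound the number of concurrent resubmissions with a fixed-size pool
//     (maxConcurrent = 1 gives fully serial recovery), and
// (2) skip recovery when the original task process is still alive.
public class ThrottledRecovery {

    /** Recover tasks with at most maxConcurrent resubmissions in flight; returns how many were resubmitted. */
    public static int recover(List<Long> taskPids, int maxConcurrent) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        AtomicInteger resubmitted = new AtomicInteger();
        for (long pid : taskPids) {
            pool.submit(() -> {
                // Suggestion 2: if the old process still exists, do not resubmit;
                // resubmitting would create a duplicate SeaTunnel job.
                boolean stillAlive = ProcessHandle.of(pid)
                        .map(ProcessHandle::isAlive).orElse(false);
                if (!stillAlive) {
                    resubmitted.incrementAndGet(); // placeholder for the real resubmit call
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return resubmitted.get();
    }

    public static void main(String[] args) {
        // Fake PIDs that almost certainly do not exist, so all three tasks
        // count as dead and get resubmitted, one at a time (maxConcurrent = 1).
        int n = recover(List.of(999999991L, 999999992L, 999999993L), 1);
        System.out.println(n);
    }
}
```

With a small `maxConcurrent`, the recovery load is spread out instead of all SeaTunnel client tasks being launched at once, which is what drives the CPU spike described above.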
How to reproduce
1. Deploy a SeaTunnel CDC task through DolphinScheduler.
2. Simulate an unexpected outage: kill the task process scheduled by DolphinScheduler (the SeaTunnel client task is not killed at this point).
3. Restart DolphinScheduler.
4. DolphinScheduler now triggers its fault-tolerance recovery mechanism and resubmits a new SeaTunnel client task.
5. When there are many SeaTunnel client tasks, they are recovered one after another, so duplicate SeaTunnel jobs get created. With enough tasks, this makes the CPU spike within a short time and eventually causes an avalanche.
SbloodyS changed the title from "[Bug] [Seatunnel JOB] Can not adapt checkpoint in Seatunnel?" to "[Improvement] [Seatunnel JOB] Can not adapt checkpoint in Seatunnel?" on Sep 4, 2024.
Anything else
No response
Version
dev
Are you willing to submit PR?
Code of Conduct