You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
What Operating System are you using (both controller, and any agents involved in the problem)?
Linux - 4.14.268-205.500.amzn2.x86_64
Reproduction steps
In real-life: Arrange/wait for Azure to not respond swiftly to API requests (it happens if Azure is "having a really bad day" and possibly for other reasons too).
In a dev environment: add a bunch of calls to Thread.sleep() into AzureVMAgentCleanUpTask.cleanup() such that they won't finish in a timely manner (you may also need to ignore/clear the thread's interrupt status to fully re-create the worst-case scenario)
Either way, the end effect was that calls to do anything with Azure "took forever", meaning that AzureVMAgentCleanUpTask.cleanup() did not complete.
Expected Results
If, when AzureVMAgentCleanUpTask.execute is called, it finds that a previous call is still in progress, then the just-triggered call should abort (and say so to the log).
It should NOT commence another cleanup() to run in parallel.
...and, if it times out waiting for the cleanup to complete, maybe it should abort the task it was waiting for instead of leaving it running.
Regardless of "how" this is achieved, the end effect should be that we never have more than one thread executing AzureVMAgentCleanUpTask.cleanup() at any one time.
Actual Results
Each time the Jenkins controller triggered the AzureVMAgentCleanUpTask, a new parallel run of AzureVMAgentCleanUpTask.cleanup() was commenced.
The Jenkins controller thread dump relealed that we had 5 of them running (in parallel), all in different phases of completion (whereas we should only have one at a time, at most).
It looks like AzureVMAgentCleanUpTask.execute() gets started every 5 minutes, and each run will wait up to 15 minutes for AzureVMAgentCleanUpTask.cleanup() to complete before logging an error
... but if the call to result.get(CLEAN_TIMEOUT_IN_MINUTES, TimeUnit.MINUTES) times out it will leave the cleanup thread running in the background.
...and it never checks to see if a cleanup is already running before starting a new one.
Anything else?
The as-is situation means that, if things are going bad, this plugin makes things worse, effectively DoS-attacking the controller and making the bad situation harder to diagnose & recover.
AzureVMAgentCleanUpTask needs to have some form of "am I already running? If so, don't start a new run" logic to prevent it from having more than one cleanup running in parallel (a AtomicBoolean field could suffice).
It could/should also add in some "if we timeout waiting for clean() to return then send an interrupt" logic (i.e. in the catch(TimeoutException block, include a call to result.cancel(true)) ... but this would only be a good idea if the cleanup code is able to react to interrupts nicely (if it doesn't then we'll simply have to wait for it to finish).
The text was updated successfully, but these errors were encountered:
Jenkins and plugins versions report
Environment
What Operating System are you using (both controller, and any agents involved in the problem)?
Linux - 4.14.268-205.500.amzn2.x86_64
Reproduction steps
In real-life: Arrange/wait for Azure to not respond swiftly to API requests (it happens if Azure is "having a really bad day" and possibly for other reasons too).
In a dev environment: add a bunch of calls to
Thread.sleep()
intoAzureVMAgentCleanUpTask.cleanup()
such that they won't finish in a timely manner (you may also need to ignore/clear the thread's interrupt status to fully re-create the worst-case scenario)Either way, the end effect was that calls to do anything with Azure "took forever", meaning that AzureVMAgentCleanUpTask.cleanup() did not complete.
Expected Results
If, when
AzureVMAgentCleanUpTask.execute
is called, it finds that a previous call is still in progress, then the just-triggered call should abort (and say so to the log).It should NOT commence another
cleanup()
to run in parallel....and, if it times out waiting for the cleanup to complete, maybe it should abort the task it was waiting for instead of leaving it running.
Regardless of "how" this is achieved, the end effect should be that we never have more than one thread executing
AzureVMAgentCleanUpTask.cleanup()
at any one time.Actual Results
Each time the Jenkins controller triggered the AzureVMAgentCleanUpTask, a new parallel run of AzureVMAgentCleanUpTask.cleanup() was commenced.
The Jenkins controller thread dump relealed that we had 5 of them running (in parallel), all in different phases of completion (whereas we should only have one at a time, at most).
It looks like
AzureVMAgentCleanUpTask.execute()
gets started every 5 minutes, and each run will wait up to 15 minutes forAzureVMAgentCleanUpTask.cleanup()
to complete before logging an error... but if the call to
result.get(CLEAN_TIMEOUT_IN_MINUTES, TimeUnit.MINUTES)
times out it will leave the cleanup thread running in the background....and it never checks to see if a cleanup is already running before starting a new one.
Anything else?
The as-is situation means that, if things are going bad, this plugin makes things worse, effectively DoS-attacking the controller and making the bad situation harder to diagnose & recover.
AzureVMAgentCleanUpTask needs to have some form of "am I already running? If so, don't start a new run" logic to prevent it from having more than one cleanup running in parallel (a
AtomicBoolean
field could suffice).It could/should also add in some "if we timeout waiting for clean() to return then send an interrupt" logic (i.e. in the
catch(TimeoutException
block, include a call toresult.cancel(true)
) ... but this would only be a good idea if the cleanup code is able to react to interrupts nicely (if it doesn't then we'll simply have to wait for it to finish).The text was updated successfully, but these errors were encountered: