VMs getting terminated mid-build after exactly 2 hours #634
Comments
Anything in the logs at the time it gets deleted? You can check the Azure VM Agent (Auto) logger in Log Recorders; you may want to adjust the log level to FINE. |
It appears to be a cleanup task? This is the last thing I saw before the VM terminated.
|
And that definitely occurs before the VM gets deleted? |
Sorry, I want to correct the previous message. I did a run again and was constantly refreshing the logs near the 2-hour mark and I noticed that the node was deleted by the idle timeout.
This is definitely not expected.
Could it be that there is some task in the code that triggers this check and interprets the node as "idle" when it's been running for 2+ hours? |
There's nothing in this plugin doing this as far as I can tell but the idleness information comes from Jenkins itself. Code being triggered is here: azure-vm-agents-plugin/src/main/java/com/microsoft/azure/vmagent/AzureVMCloudRetensionStrategy.java Line 88 in a3819d7
Which depends on: I'll try to find some time to look at this, but it probably won't be this week, and given the 2 hours it takes to reproduce, it's hard to debug. |
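To make the discussion concrete, here is a minimal sketch of the kind of idle-timeout decision being described. It is not the plugin's code: the `Agent` class, the `should_delete_for_idleness` function, and the millisecond fields are hypothetical names chosen for illustration. The only assumption taken from the thread is that the idle state and idle start time are reported by Jenkins itself.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical stand-in for what Jenkins reports about a node."""
    name: str
    idle: bool          # True when no executor on the node is busy
    idle_start_ms: int  # timestamp at which the node last became idle


def should_delete_for_idleness(agent: Agent, idle_timeout_ms: int, now_ms: int) -> bool:
    """Return True only if the agent is idle and has been idle longer than the timeout."""
    if not agent.idle:
        # A node that is running a build should never be reclaimed as idle.
        return False
    return now_ms - agent.idle_start_ms > idle_timeout_ms


TWO_HOURS_MS = 2 * 60 * 60 * 1000
busy_agent = Agent(name="windows-test-1", idle=False, idle_start_ms=0)
idle_agent = Agent(name="windows-test-2", idle=True, idle_start_ms=0)
```

Under this simplified model, a node that has been busy for 2+ hours is never eligible for deletion, which is why the reported behaviour is unexpected if the idle timeout really is the trigger.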
Lucky guess/hint: the job uses a flyweight executor.
Context (by ChatGPT): The getIdleStartMilliseconds() method in the Computer class pertains to the state of heavyweight executors associated with that node. Flyweight executors operate independently and are not associated with a specific Computer instance. Therefore, the getIdleStartMilliseconds() method does not account for the state of flyweight executors.
Update: |
No, it won't be flyweight. Can you confirm how you've got the job sleeping for 3 hours so I can do the same? Something like? Start-Sleep -Duration (New-TimeSpan -Hours 3) |
This is the pipeline I'm using:
pipeline {
    agent none
    stages {
        stage('Launching nodes') {
            agent {
                label "windows-test"
            }
            steps {
                echo "sleeping..."
                sleep time: 10840
            }
        }
    }
} |
same issue pipeline {
} |
I think I’ve stumbled upon something interesting. We have 20 Jenkins instances, and the plugin works well on all of them except 5. All instances have different legacy instance IDs, except for those 5, which share the same instance ID across 2 different groups: ID 1: ID 2: I’m wondering whether this could be related to the issue we’re experiencing. Could you let me know if the plugin or Azure interacts with the legacy ID? |
The problem occurs with the plugin because the plugin interacts with Jenkins. |
That will most likely be the issue. The Jenkins instance ID is what's used to identify ownership of the VM: azure-vm-agents-plugin/src/main/java/com/microsoft/azure/vmagent/util/AzureUtil.java Line 521 in bf44f98
If they are the same, it will think: hey, there's this VM that I don't know about, it must be an orphan, I'll clean it up. |
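The failure mode described above can be sketched as a small model. This is an illustration under stated assumptions, not the plugin's implementation: the `Controller` class, the `find_orphans` function, the instance ID values, and the VM name are all hypothetical. The only behaviour taken from the thread is that a VM is treated as an orphan when it carries the controller's instance ID but is not known to that controller.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Controller:
    """Hypothetical model of one Jenkins controller."""
    name: str
    instance_id: str
    known_agents: Set[str] = field(default_factory=set)


def find_orphans(controller: Controller, cloud_vms: Dict[str, str]) -> List[str]:
    """Return VMs tagged with this controller's instance ID that it does not know.

    cloud_vms maps a VM name to the instance ID recorded as its owner.
    """
    return sorted(
        vm
        for vm, owner_id in cloud_vms.items()
        if owner_id == controller.instance_id and vm not in controller.known_agents
    )


# Two controllers that (incorrectly) share one instance ID.
controller_a = Controller("jenkins-a", "shared-id", known_agents={"agent-1"})
controller_b = Controller("jenkins-b", "shared-id")

# agent-1 was launched by controller A and is busy running its build.
cloud_vms = {"agent-1": "shared-id"}

orphans_seen_by_a = find_orphans(controller_a, cloud_vms)
orphans_seen_by_b = find_orphans(controller_b, cloud_vms)

# The same setup with a unique ID for controller B.
controller_c = Controller("jenkins-c", "unique-id")
orphans_seen_by_c = find_orphans(controller_c, cloud_vms)
```

In this model controller A sees nothing to clean up, while controller B classifies A's in-use VM as an orphan because the ownership tag matches its own ID. A controller with a unique ID ignores the VM entirely, which is consistent with the plugin working on the instances whose IDs differ.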
Jenkins and plugins versions report
Environment
What Operating System are you using (both controller, and any agents involved in the problem)?
Controller: Ubuntu 22.04.5 LTS
Agent (Windows): Microsoft Windows 10 Enterprise
Agent (Linux): Ubuntu 20.04.6 LTS (Focal Fossa)
Reproduction steps
Expected Results
The VMs stay connected through jobs that are longer than 2 hours.
Actual Results
The VM is terminated after 2 hours, removed from both Jenkins and Azure, which interrupts and fails the job it was running.
Anything else?
This issue also mentions the same 2-hour timeout and was not resolved: Plugins cleanup actions is removing VMs that are in use and working. · Issue #481 · jenkinsci/azure-vm-agents-plugin
We looked into whether Windows Defender was causing the VMs to restart, but after disabling it, nodes still disconnect after 2 hours.
Are you interested in contributing a fix?
No response