VMs getting terminated mid-build after exactly 2 hours #634

Open
LaloBenitez opened this issue Feb 11, 2025 · 12 comments

Comments

@LaloBenitez

Jenkins and plugins versions report

Environment
Jenkins: 2.462.1
OS: Linux - 4.14.348-265.565.amzn2.x86_64
Java: 21.0.4 - Eclipse Adoptium (OpenJDK 64-Bit Server VM)
---
Exclusion:0.15
Matrix-sorter-plugin:1.3
Office-365-Connector:4.21.5
PrioritySorter:5.1.0
analysis-model-api:12.4.0
ansicolor:1.0.4
ant:511.v0a_a_1a_334f41b_
antisamy-markup-formatter:162.v0e6ec0fcfcf6
any-buildstep:14.ve115ec1484f0
apache-httpcomponents-client-4-api:4.5.14-208.v438351942757
apache-httpcomponents-client-5-api:5.3.1-110.v77252fb_d4da_5
artifact-manager-s3:871.v72f7f642a_245
artifactory:4.0.8
asm-api:9.7-33.v4d23ef79fcc8
audit-trail:361.v82cde86c784e
authentication-tokens:1.119.v50285141b_7e1
authorize-project:1.7.2
aws-credentials:231.v08a_59f17d742
aws-global-configuration:130.v35b_7b_96f53c3
aws-java-sdk:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-api-gateway:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-autoscaling:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-cloudformation:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-cloudfront:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-codebuild:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-codedeploy:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-ec2:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-ecr:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-ecs:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-efs:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-elasticbeanstalk:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-elasticloadbalancingv2:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-iam:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-kinesis:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-lambda:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-logs:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-minimal:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-organizations:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-secretsmanager:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-sns:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-sqs:1.12.767-467.vb_e93f0c614b_6
aws-java-sdk-ssm:1.12.767-467.vb_e93f0c614b_6
aws-secrets-manager-credentials-provider:1.214.va_0a_d8268d068
aws-secrets-manager-secret-source:1.72.v61781b_35c542
azure-credentials:312.v0f3973cd1e59
azure-sdk:174.va_89c1df897d2
azure-vm-agents:966.v39138b_4ca_5cd
badge:1.13
basic-branch-build-strategies:81.v05e333931c7d
blueocean:1.27.9
blueocean-autofavorite:1.2.5
blueocean-bitbucket-pipeline:1.27.9
blueocean-commons:1.27.9
blueocean-config:1.27.9
blueocean-core-js:1.27.9
blueocean-dashboard:1.27.9
blueocean-display-url:2.4.3
blueocean-events:1.27.9
blueocean-git-pipeline:1.27.9
blueocean-github-pipeline:1.27.9
blueocean-i18n:1.27.9
blueocean-jira:1.27.9
blueocean-jwt:1.27.9
blueocean-personalization:1.27.9
blueocean-pipeline-api-impl:1.27.9
blueocean-pipeline-editor:1.27.9
blueocean-pipeline-scm-api:1.27.9
blueocean-rest:1.27.9
blueocean-rest-impl:1.27.9
blueocean-web:1.27.9
bootstrap5-api:5.3.3-1
bouncycastle-api:2.30.1.78.1-248.ve27176eb_46cb_
branch-api:2.1178.v969d9eb_c728e
build-cause-run-condition:0.1
build-keeper-plugin:19.va_df8a_2c65123
build-monitor-plugin:1.14-883.vf620a_44eb_ec1
build-name-setter:2.4.3
build-timeout:1.33
build-user-vars-plugin:166.v52976843b_435
buildresult-trigger:0.18
built-on-column:1.4
caffeine-api:3.1.8-133.v17b_1ff2e0599
categorized-view:1.13
checks-api:2.2.0
chucknorris:159.vdfe649cb_9c37
claim:554.va_f9b_58b_0a_088
cloud-stats:336.v788e4055508b_
cloudbees-bitbucket-branch-source:888.v8e6d479a_1730
cloudbees-folder:6.942.vb_43318a_156b_2
clover:4.14.2.596.vb_4d6475e990b_
cobertura:1.17
code-coverage-api:4.99.0
command-launcher:115.vd8b_301cc15d0
commons-compress-api:1.26.1-2
commons-httpclient3-api:3.1-3
commons-lang3-api:3.16.0-82.ve2b_07d659d95
commons-text-api:1.12.0-129.v99a_50df237f7
computer-queue-plugin:1.7
conditional-buildstep:1.4.3
config-file-provider:973.vb_a_80ecb_9a_4d0
configuration-as-code:1836.vccda_4a_122a_a_e
configuration-as-code-groovy:1.1
copy-project-link:106.veb_028794a_844
copyartifact:749.vfb_dca_a_9b_6549
coverage:1.16.1
cppncss:1.2
credentials:1378.v81ef4269d764
credentials-binding:681.vf91669a_32e45
cucumber-reports:5.8.3
custom-tools-plugin:0.8
cvs:2.19.1
dashboard-view:2.508.va_74654f026d1
data-tables-api:2.0.8-1
datetime-constraint:0.1.2
delivery-pipeline-plugin:1.4.2
depgraph-view:1.0.5
description-setter:1.10-CUSTOM
disable-github-multibranch-status:1.2
display-url-api:2.204.vf6fddd8a_8b_e9
docker-commons:443.v921729d5611d
docker-workflow:580.vc0c340686b_54
downstream-ext:73.vdda_16e6eb_0da
dropdown-viewstabbar-plugin:1.7
dtkit-api:3.0.2
durable-task:568.v8fb_5c57e8417
dynamic-axis:1.0.3
ec2:1688.v8c07e01d657f
echarts-api:5.5.0-1
ecutest:2.44
eddsa-api:0.3.0-4.v84c6f0f4969e
email-ext:1814.v404722f34263
embeddable-build-status:487.va_0ef04c898a_2
envinject:2.919.v009a_a_1067cd0
envinject-api:1.199.v3ce31253ed13
environment-variable-page-decoration:1.3.0
excludeMatrixParent:1.1
extended-choice-parameter:382.v5697b_32134e8
extended-read-permission:53.v6499940139e5
extensible-choice-parameter:1.8.1
external-monitor-job:215.v2e88e894db_f8
external-workspace-manager:1.3.1
ez-templates:1.3.5
favorite:2.221.v19ca_666b_62f5
file-parameters:339.v4b_cc83e11455
fitnesse:1.36
flexible-publish:0.16.1
font-awesome-api:6.5.2-1
forensics-api:2.4.0
git:5.3.0
git-client:5.0.0
git-forensics:2.1.0
git-parameter:0.9.19
git-server:126.v0d945d8d2b_39
github:1.40.0
github-api:1.321-468.v6a_9f5f2d5a_7e
github-autostatus:3.6.2
github-branch-source:1793.v1831e9c68d77
github-issues:1.2.4
github-oauth:597.ve0c3480fcb_d0
github-scm-trait-commit-skip:0.4.0
google-oauth-plugin:1.330.vf5e86021cb_ec
gradle:2.12
groovy:457.v99900cb_85593
groovy-postbuild:228.vcdb_cf7265066
gson-api:2.11.0-41.v019fcf6125dc
h2-api:11.1.4.199-30.v1c64e772f3a_c
handy-uri-templates-2-api:2.1.8-30.v7e777411b_148
htmlpublisher:1.36
hubot-steps:95.va_30176518a_5a
instance-identity:185.v303dc7c645f9
ionicons-api:74.v93d5eb_813d5f
ivy:2.6
jackson2-api:2.17.0-379.v02de8ec9f64c
jacoco:3.3.6
jakarta-activation-api:2.1.3-1
jakarta-mail-api:2.1.3-1
javadoc:280.v050b_5c849f69
javax-activation-api:1.2.0-7
javax-mail-api:1.6.2-10
jaxb:2.3.9-1
jdk-tool:80.v8a_dee33ed6f0
jenkins-design-language:1.27.14
jenkins-metrics-sender:0.01-SNAPSHOT (private-1926b506-?)
jenkins-multijob-plugin:630.v80676e0dc658
jersey2-api:2.44-151.v6df377fff741
jfrog:1.5.1
jira:3.13
jjwt-api:0.11.5-112.ve82dfb_224b_a_d
jnr-posix-api:3.1.19-2
job-dsl:1.87
job-restrictions:0.8
jobConfigHistory:1229.v3039470161a_d
joda-time-api:2.12.7-29.v5a_b_e3a_82269a_
jquery:1.12.4-1
jquery3-api:3.7.1-2
jsch:0.2.16-86.v42e010d9484b_
json-api:20240303-41.v94e11e6de726
json-path-api:2.9.0-58.v62e3e85b_a_655
junit:1265.v65b_14fa_f12f0
junit-attachments:239.v9e003a_c80a_8c
keepSlaveOffline:1.0
kubernetes:4304.v1b_39d4f98210
kubernetes-client-api:6.10.0-240.v57880ce8b_0b_2
kubernetes-credentials:190.v03c305394deb_
label-verifier:105.vf9d080687b_92
ldap:725.v3cb_b_711b_1a_ef
locale:519.v4e20f313cfa_f
lockable-resources:1255.vf48745da_35d0
log-parser:2.3.5
logstash:2.5.0218.v0a_ff8fefc12b_
mailer:472.vf7c289a_4b_420
mapdb-api:1.0.9-40.v58107308b_7a_7
mask-passwords:173.v6a_077a_291eb_5
matrix-auth:3.2.2
matrix-combinations-parameter:1.3.3
matrix-project:838.v4d7b_7b_f9b_d4b_
maven-plugin:3.23
measurement-plots:0.1
mercurial:1260.vdfb_723cdcc81
metrics:4.2.21-451.vd51df8df52ec
mina-sshd-api-common:2.13.2-125.v200281b_61d59
mina-sshd-api-core:2.13.2-125.v200281b_61d59
monitoring:1.99.0
multi-branch-priority-sorter:1.0
multibranch-action-triggers:1.8.10
multibranch-build-strategy-extension:51.v88f14e2a_4075
nested-view:1.34
next-build-number:1.8
node-iterator-api:55.v3b_77d4032326
nodelabelparameter:1.12.0
oauth-credentials:0.653.v14cf2088e950
offlineonfailure-plugin:1.1-SNAPSHOT (private-03/11/2016 13:37-gsix)
okhttp-api:4.11.0-172.vda_da_1feeb_c6e
pam-auth:1.11
parameterized-scheduler:277.v61a_4b_a_49a_c5c
parameterized-trigger:806.vf6fff3e28c3e
pipeline-aggregator-view:104.v94a_e5f6cdb_c3
pipeline-build-step:540.vb_e8849e1a_b_d8
pipeline-github:2.8-159.09e4403bc62f
pipeline-github-lib:61.v629f2cc41d83
pipeline-githubnotify-step:49.vf37bf92d2bc8
pipeline-graph-analysis:216.vfd8b_ece330ca_
pipeline-groovy-lib:730.ve57b_34648c63
pipeline-input-step:495.ve9c153f6067b_
pipeline-maven:1421.v610fa_b_e2d60e
pipeline-maven-api:1421.v610fa_b_e2d60e
pipeline-milestone-step:119.vdfdc43fc3b_9a_
pipeline-model-api:2.2205.vc9522a_9d5711
pipeline-model-definition:2.2205.vc9522a_9d5711
pipeline-model-extensions:2.2205.vc9522a_9d5711
pipeline-multibranch-defaults:2.1
pipeline-rest-api:2.34
pipeline-stage-step:312.v8cd10304c27a_
pipeline-stage-tags-metadata:2.2205.vc9522a_9d5711
pipeline-stage-view:2.34
pipeline-timeline:1.0.3
pipeline-utility-steps:2.17.0
plain-credentials:183.va_de8f1dd5a_2b_
plot:2.1.12
plugin-usage-plugin:4.5
plugin-util-api:4.1.0
pollscm:1.5
postbuild-task:1.9
postbuildscript:3.3.0-654.v67cf36130d78
preSCMbuildstep:71.v1f2990a_37e27
prism-api:1.29.0-15
progress-bar-column-plugin:11.vdef198c2d6c1
project-description-setter:1.2
project-health-report:1.2
promoted-builds:957.vf5b_cee587563
publish-over:0.22
publish-over-cifs:0.16
pubsub-light:1.18
python:1.3
random-string-parameter:1.0
read-only-configurations:1.10
rebuild:332.va_1ee476d8f6d
release:2.19
remote-file:1.24
resource-disposer:0.23
rich-text-publisher-plugin:1.5
role-strategy:743.v142ea_b_d5f1d3
run-condition:1.7
saferestart:0.7
saml:4.464.vea_cb_75d7f5e0
scm-api:696.v778d637b_a_762
script-security:1354.va_70a_fe478c7f
sectioned-view:1.27
secure-requester-whitelist:70.ve2a_3c4a_dc9f5
seed:2.1.4
shelve-project-plugin:3.2
show-build-parameters:1.0
sidebar-link:2.4.1
simple-build-for-pipeline:0.2
simple-theme-plugin:191.vcd207ef9dd24
slave-setup:1.16
snakeyaml-api:2.2-121.v5a_68b_9300b_d4
sse-gateway:1.27
ssh-agent:376.v8933585c69d3
ssh-credentials:343.v884f71d78167
ssh-slaves:2.973.v0fa_8c0dea_f9f
sshd:3.330.vc866a_8389b_58
statusmonitor:1.3
structs:338.v848422169819
subversion:1275.va_7b_014f3fc2c
summary_report:1.15
test-stability:2.3
throttle-concurrents:2.14
timestamper:1.27
token-macro:400.v35420b_922dcb_
trilead-api:2.147.vb_73cc728a_32e
urltrigger:1.02
validating-string-parameter:183.v3748e79b_9737
variant:60.v7290fc0eb_b_cd
versioncolumn:243.vda_c20eea_a_8a_f
view-job-filters:382.vdf2d5e3f02f0
warnings-ng:11.4.0
webhook-step:342.v620877effe14
workflow-aggregator:600.vb_57cdd26fdd7
workflow-api:1336.vee415d95c521
workflow-basic-steps:1058.vcb_fc1e3a_21a_9
workflow-cps:3922.va_f73b_7c4246b_
workflow-durable-task-step:1364.v2fd76fb_6fd41
workflow-job:1436.vfa_244484591f
workflow-multibranch:795.ve0cb_1f45ca_9a_
workflow-scm-step:427.v4ca_6512e7df1
workflow-step-api:678.v3ee58b_469476
workflow-support:920.v59f71ce16f04
ws-cleanup:0.46
xtrigger-api:1.0
xunit:3.1.3
zentimestamp:4.2

What Operating System are you using (both controller, and any agents involved in the problem)?

Controller: Ubuntu 22.04.5 LTS
Agent (Windows): Microsoft Windows 10 Enterprise
Agent (Linux): Ubuntu 20.04.6 LTS (Focal Fossa)

Reproduction steps

  1. Create a simple job that spins up a Windows VM from a gallery image and sleeps for 3 hours.
  2. After exactly 2 hours, the VM is terminated even though it is still in use and not idle (the Jenkins job is still in its sleep step on that VM).

Expected Results

The VMs stay connected through jobs that are longer than 2 hours.

Actual Results

The VM is terminated after 2 hours, removed from both Jenkins and Azure, which interrupts and fails the job it was running.

Anything else?

Issue #481 in jenkinsci/azure-vm-agents-plugin ("Plugins cleanup actions is removing VMs that are in use and working") reports the same 2-hour timeout and was not resolved.

We looked into whether Windows Defender was causing the VMs to restart, but after disabling it, nodes still disconnect after 2 hours.

Are you interested in contributing a fix?

No response

@timja
Member

timja commented Feb 11, 2025

Anything in the logs at the time it gets deleted?

You can check the Azure VM Agent (Auto) logger in Log Recorders; you may want to set the log level to FINE.

@LaloBenitez
Author

It appears to be a cleanup task? This is the last thing I saw before the VM terminated.

Started Azure VM Agents Clean Task
Feb 12, 2025 12:29:58 AM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask
Start
Feb 12, 2025 12:29:58 AM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask
Running clean with 15 minute timeout
Feb 12, 2025 12:29:58 AM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask
Beginning
Feb 12, 2025 12:29:58 AM INFO com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate virtualMachineExists
Checking VM exists for myVm378be0
Feb 12, 2025 12:29:58 AM INFO com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate virtualMachineExists
myVm378be0 doesnt exist
Feb 12, 2025 12:29:58 AM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask
Node myVm378be00 doesnt exist, removing

@timja
Member

timja commented Feb 12, 2025

And that definitely occurs before the VM gets deleted?

@LaloBenitez
Author

LaloBenitez commented Feb 12, 2025

Sorry, I want to correct the previous message. I ran the job again, kept refreshing the logs near the 2-hour mark, and noticed that the node was deleted by the idle timeout.

Feb 12, 2025 10:32:25 AM INFO com.microsoft.azure.vmagent.AzureVMCloudRetensionStrategy check
Idle timeout reached for agent: myVm585f0, action: delete
Feb 12, 2025 10:32:25 AM INFO com.microsoft.azure.vmagent.AzureVMAgent deprovision
Deprovision called for agent myVm585f0, for reason: Node is being deleted by Jenkins after idle timeout
Feb 12, 2025 10:32:26 AM INFO com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate isVMAliveOrHealthy
Status PowerState/running

This is definitely not expected.

  • The retention strategy is "Azure VM Idle Retention Strategy", set to 5 minutes.
  • The node was still in use. The job still had 1 hour until it was finished.

Could it be that there is some task in the code that triggers this check and interprets the node as "idle" when it's been running for 2+ hours?

@timja
Member

timja commented Feb 12, 2025

As far as I can tell, nothing in this plugin does this; the idleness information comes from Jenkins itself.

Code being triggered is here:

LOGGER.log(Level.INFO, "Idle timeout reached for agent: {0}, action: {1}",

Which depends on:
agentNode.getIdleStartMilliseconds()

I'll try to find some time to look at this, but probably not this week, and since it takes 2 hours to reproduce, it's hard to debug.
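
For readers following along, the dependency described above can be pictured with a minimal sketch. This is not the plugin's code: the class `IdleRetentionSketch`, the method `idleTimeoutReached`, and its parameters are hypothetical names used only for illustration. The only identifier taken from the thread is `getIdleStartMilliseconds()`, which is modeled here as a plain `long`.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical model of an idle-timeout retention check; NOT the plugin's code. */
final class IdleRetentionSketch {

    /**
     * Returns true when the agent has been idle for at least the configured timeout.
     * idleStartMillis stands in for the value that getIdleStartMilliseconds() would report.
     */
    static boolean idleTimeoutReached(long idleStartMillis, long nowMillis, long idleTimeoutMinutes) {
        long idleForMillis = nowMillis - idleStartMillis;
        return idleForMillis >= TimeUnit.MINUTES.toMillis(idleTimeoutMinutes);
    }

    public static void main(String[] args) {
        long now = TimeUnit.HOURS.toMillis(2);
        // Idle start reported one minute ago: below the 5-minute timeout, so the agent is kept.
        System.out.println(idleTimeoutReached(now - TimeUnit.MINUTES.toMillis(1), now, 5));
        // Idle start reported two hours ago: timeout reached, so the agent would be deleted.
        System.out.println(idleTimeoutReached(0L, now, 5));
    }
}
```

The point of the sketch is that the decision rests entirely on the reported idle-start value. If whichever controller evaluates the check does not see the running build, a 5-minute timeout can fire on a VM that is actually busy.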

@psciborek
Contributor

psciborek commented Feb 13, 2025

lucky guess/hint: the job uses a flyweight executor

Context (by chatgpt): The getIdleStartMilliseconds() method in the Computer class pertains to the state of heavyweight executors associated with that node. Flyweight executors operate independently and are not associated with a specific Computer instance. Therefore, the getIdleStartMilliseconds() method does not account for the state of flyweight executors.

Update:
I do not entirely agree that they run independently of any computers. In my experience, they can be attached to a certain node for some job types — for example, when waiting for sub-jobs to finish.
If you find this useful, please consider verifying this hypothesis.

@timja
Member

timja commented Feb 13, 2025

No, it won't be flyweight.

Can you confirm how you've got the job sleeping for 3 hours so I can do the same?

Something like?

Start-Sleep -Duration (New-TimeSpan -Hours 3)

@LaloBenitez
Author

This is the pipeline I'm using

pipeline {
    agent none
    stages {
        stage('Launching nodes') {
            agent {
                label "windows-test"
            }
            steps {
                echo "sleeping..."
                // sleep defaults to seconds; 10840 s is just over 3 hours
                sleep time: 10840, unit: 'SECONDS'
            }
        }
    }
}

@cbautomation

Same issue here.

pipeline {
    agent {
        label 'azure_vm'
    }

    stages {
        stage('Echo Hello World') {
            steps {
                script {
                    def logs = []
                    // Loop forever so the agent stays busy well past the 2-hour mark
                    while (true) {
                        def logEntry = 'hello world'
                        echo logEntry
                        logs.add(logEntry)
                        if (logs.size() > 20) {
                            logs.remove(0) // delete old logs
                        }
                        sleep(time: 1, unit: 'SECONDS')
                    }
                }
            }
        }
    }
}

@LaloBenitez
Author

I think I’ve stumbled upon something interesting. We have 20 Jenkins instances, and the plugin appears to be working well on all of them except for 5.

All instances have different legacy instance IDs, except for those 5, which share the same instance ID across 2 different groups:

ID 1:
Instances a, b, c

ID 2:
Instances d, e

I’m wondering if this could be related to the issue we’re experiencing. Could you let me know if the plugin or Azure interacts with the legacy ID Jenkins.instance.legacyInstanceID?
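
A quick way to spot this kind of collision across many controllers is to group them by ID. The sketch below assumes the IDs have already been collected into a map from controller name to instance ID by some other means; the class `DuplicateInstanceIdSketch` and the method `sharedIds` are hypothetical names, not part of Jenkins or the plugin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: finds instance IDs that are used by more than one controller. */
final class DuplicateInstanceIdSketch {

    /** Groups controller names by instance ID and keeps only IDs shared by two or more controllers. */
    static Map<String, List<String>> sharedIds(Map<String, String> idByController) {
        Map<String, List<String>> byId = new TreeMap<>();
        // Iterate in sorted controller order so the output is deterministic.
        for (Map.Entry<String, String> e : new TreeMap<>(idByController).entrySet()) {
            byId.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // Drop IDs that belong to exactly one controller; those are fine.
        byId.values().removeIf(controllers -> controllers.size() < 2);
        return byId;
    }

    public static void main(String[] args) {
        Map<String, String> ids = new TreeMap<>();
        ids.put("a", "ID 1");
        ids.put("b", "ID 1");
        ids.put("c", "ID 1");
        ids.put("d", "ID 2");
        ids.put("e", "ID 2");
        ids.put("f", "ID 3"); // unique ID, should not be reported
        System.out.println(sharedIds(ids));
    }
}
```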

@cbautomation

The problem seems specific to the plugin and how it interacts with Jenkins.
We also have a script that provisions the VMs through the Azure CLI, without the Azure VM Agents plugin. Those VMs run for more than 2 hours without problems.

@timja
Member

timja commented Feb 19, 2025

> I’m wondering if this could be related to the issue we’re experiencing. Could you let me know if the plugin or Azure interacts with the legacy ID Jenkins.instance.legacyInstanceID?

That is most likely the issue: the Jenkins instance ID is what's used to identify ownership of the VM.

If two instances share the same ID, one of them will think, "Hey, there's this VM that I don't know about; it must be an orphan, so I'll clean it up."
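
To illustrate that reasoning, here is a minimal sketch of an ownership-based orphan check. It is a simplified model, not the plugin's implementation: the names `OrphanCleanupSketch`, `TaggedVm`, and `findOrphans` are hypothetical, and recording the owner as a per-VM field is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical model of orphan detection keyed on the controller's instance ID. */
final class OrphanCleanupSketch {

    /** A VM as seen in the cloud: its name plus the instance ID of the controller that created it. */
    static final class TaggedVm {
        final String name;
        final String ownerInstanceId;

        TaggedVm(String name, String ownerInstanceId) {
            this.name = name;
            this.ownerInstanceId = ownerInstanceId;
        }
    }

    /** Returns VMs that carry this controller's ID but are not among the nodes it knows about. */
    static List<String> findOrphans(String myInstanceId, Set<String> myKnownNodes, List<TaggedVm> vmsInCloud) {
        return vmsInCloud.stream()
                .filter(vm -> vm.ownerInstanceId.equals(myInstanceId)) // "this VM is mine"
                .filter(vm -> !myKnownNodes.contains(vm.name))         // "but I have no node for it"
                .map(vm -> vm.name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<TaggedVm> cloud = Arrays.asList(
                new TaggedVm("vm-a", "id-1"),   // started by controller A
                new TaggedVm("vm-b", "id-1"));  // started by controller B, which shares id-1
        // Controller A only knows vm-a, so it treats vm-b as an orphan.
        System.out.println(findOrphans("id-1", Collections.singleton("vm-a"), cloud));
    }
}
```

With distinct IDs, the first filter would exclude the other controller's VM and nothing would be flagged. With a shared ID, a healthy VM owned by a sibling controller looks exactly like a leftover.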
