Plugins cleanup actions is removing VMs that are in use and working. #481

Closed
limeman40 opened this issue Nov 17, 2023 · 25 comments

@limeman40

Jenkins and plugins versions report

Environment
Jenkins: 2.432
OS: Linux - 6.2.0-1016-azure
Java: 11.0.20.1 - Ubuntu (OpenJDK 64-Bit Server VM)
---
Office-365-Connector:4.20.2
ace-editor:1.1
ansible:285.v2f044b_eb_7a_3e
ant:497.v94e7d9fffa_b_9
antisamy-markup-formatter:162.v0e6ec0fcfcf6
apache-httpcomponents-client-4-api:4.5.14-208.v438351942757
apache-httpcomponents-client-5-api:5.2.1-1.1
async-http-client:1.9.40.0
authentication-tokens:1.53.v1c90fd9191a_b_
azure-acs:1.0.4
azure-ad:433.v1982e2b_b_4a_fe
azure-app-service:1.0.2
azure-artifact-manager:133.vf94ad3455cdc
azure-cli:0.9
azure-commons:1.1.3
azure-container-agents:253.vd2f5cd5c5040
azure-container-registry-tasks:0.6.5
azure-credentials:293.vb_d506148f506
azure-credentials-ext:1.0
azure-function:0.3.3
azure-keyvault:228.va_31b_a_451e7d6
azure-sdk:157.v855da_0b_eb_dc2
azure-vm-agents:883.v63c930b_025dc
azure-vmss:0.2.4
badge:1.9.1
bitbucket:223.vd12f2bca5430
blackduck-detect:9.0.0
block-queued-job:0.2.0
blueocean-bitbucket-pipeline:1.27.9
blueocean-commons:1.27.9
blueocean-core-js:1.27.9
blueocean-jwt:1.27.9
blueocean-pipeline-api-impl:1.27.9
blueocean-pipeline-scm-api:1.27.9
blueocean-rest:1.27.9
blueocean-rest-impl:1.27.9
blueocean-web:1.27.9
bootstrap4-api:4.6.0-6
bootstrap5-api:5.3.2-2
bouncycastle-api:2.29
branch-api:2.1135.v8de8e7899051
build-user-vars-plugin:1.9
caffeine-api:3.1.8-133.v17b_1ff2e0599
changes-since-last-success:0.6
checks-api:2.0.2
cloud-stats:320.v96b_65297a_4b_b_
cloudbees-bitbucket-branch-source:848.v42c6a_317eda_e
cloudbees-folder:6.858.v898218f3609d
command-launcher:107.v773860566e2e
commons-httpclient3-api:3.1-3
commons-lang3-api:3.13.0-62.v7d18e55f51e2
commons-text-api:1.11.0-94.v3e1f4a_926e49
conditional-buildstep:1.4.3
config-file-provider:959.vcff671a_4518b_
copyartifact:722.v0662a_9b_e22a_c
credentials:1309.v8835d63eb_d8a_
credentials-binding:642.v737c34dea_6c2
crx-content-package-deployer:1.9
data-tables-api:1.13.6-5
datadog:5.6.0
digitalocean-plugin:1.3.1
display-url-api:2.200.vb_9327d658781
docker-commons:439.va_3cb_0a_6a_fb_29
docker-java-api:3.3.1-79.v20b_53427e041
durable-task:523.va_a_22cf15d5e0
echarts-api:5.4.0-7
envinject:2.908.v66a_774b_31d93
envinject-api:1.199.v3ce31253ed13
extended-read-permission:53.v6499940139e5
extensible-choice-parameter:1.8.1
external-monitor-job:215.v2e88e894db_f8
favorite:2.4.3
font-awesome-api:6.4.2-1
generic-webhook-trigger:1.88.0
git:5.2.1
git-client:4.5.0
git-parameter:0.9.19
git-server:99.va_0826a_b_cdfa_d
github:1.37.3.1
github-api:1.316-451.v15738eef3414
github-branch-source:1741.va_3028eb_9fd21
github-pullrequest:0.5.0
gitlab-api:5.3.0-91.v1f9a_fda_d654f
gitlab-branch-source:684.vea_fa_7c1e2fe3
google-metadata-plugin:0.5
google-oauth-plugin:1.318.vb_39c5db_e3041
gradle:2.9
handlebars:3.0.8
handy-uri-templates-2-api:2.1.8-22.v77d5b_75e6953
htmlpublisher:1.32
instance-identity:185.v303dc7c645f9
ionicons-api:56.v1b_1c8c49374e
jackson2-api:2.15.3-372.v309620682326
jakarta-activation-api:2.0.1-3
jakarta-mail-api:2.0.1-3
javadoc:243.vb_b_503b_b_45537
javax-activation-api:1.2.0-6
javax-mail-api:1.6.2-9
jaxb:2.3.9-1
jdk-tool:73.vddf737284550
jenkins-design-language:1.27.9
jersey2-api:2.41-133.va_03323b_a_1396
jjwt-api:0.11.5-77.v646c772fddb_0
jnr-posix-api:3.1.18-1
jobConfigHistory:1229.v3039470161a_d
jquery:1.12.4-1
jquery-detached:1.2.1
jquery3-api:3.7.1-1
jsch:0.2.8-65.v052c39de79b_2
junit:1240.vf9529b_881428
kubernetes-cd:2.3.1
kubernetes-client-api:6.8.1-224.vd388fca_4db_3b_
kubernetes-credentials:0.11
label-linked-jobs:6.0.1
ldap:711.vb_d1a_491714dc
lockable-resources:1185.v0c528656ce04
mailer:463.vedf8358e006b_
mapdb-api:1.0.9-28.vf251ce40855d
matrix-auth:3.2.1
matrix-project:818.v7eb_e657db_924
maven-plugin:3.23
mercurial:1260.vdfb_723cdcc81
metrics:4.2.18-442.v02e107157925
mina-sshd-api-common:2.11.0-86.v836f585d47fa_
mina-sshd-api-core:2.11.0-86.v836f585d47fa_
momentjs:1.1.1
msbuild:1.30
nexus-jenkins-plugin:3.16.510.v4d23e22cf563
node-iterator-api:55.v3b_77d4032326
node-sharing-executor:2.0.8
oauth-credentials:0.646.v02b_66dc03d2e
okhttp-api:4.11.0-157.v6852a_a_fa_ec11
pam-auth:1.10
pipeline-build-step:516.v8ee60a_81c5b_9
pipeline-graph-analysis:202.va_d268e64deb_3
pipeline-groovy-lib:689.veec561a_dee13
pipeline-input-step:477.v339683a_8d55e
pipeline-milestone-step:111.v449306f708b_7
pipeline-model-api:2.2151.ve32c9d209a_3f
pipeline-model-definition:2.2151.ve32c9d209a_3f
pipeline-model-extensions:2.2151.ve32c9d209a_3f
pipeline-rest-api:2.34
pipeline-stage-step:305.ve96d0205c1c6
pipeline-stage-tags-metadata:2.2151.ve32c9d209a_3f
pipeline-stage-view:2.34
pipeline-utility-steps:2.16.0
plain-credentials:143.v1b_df8b_d3b_e48
plugin-util-api:3.6.0
popper-api:1.16.1-3
popper2-api:2.11.6-4
powershell:2.1
promoted-builds:936.va_571a_a_b_f8da_5
pubsub-light:1.18
rebuild:320.v5a_0933a_e7d61
resource-disposer:0.23
run-condition:1.7
saml:4.429.v9a_781a_61f1da_
scm-api:683.vb_16722fb_b_80b_
script-security:1275.v23895f409fb_d
service-fabric:1.6
shelve-project-plugin:3.2
snakeyaml-api:2.2-111.vc6598e30cc65
ssh:2.6.1
ssh-agent:346.vda_a_c4f2c8e50
ssh-credentials:308.ve4497b_ccd8f4
ssh-slaves:2.916.vd17b_43357ce4
ssh2easy:1.6
sshd:3.312.v1c601b_c83b_0e
stashNotifier:1.439.v202358346a_7d
strict-crumb-issuer:2.1.1
structs:325.vcb_307d2a_2782
synopsys-coverity:3.0.3
thinBackup:1.18
timestamper:1.26
token-macro:384.vf35b_f26814ec
trilead-api:2.84.v72119de229b_7
uno-choice:2.8.1
variant:60.v7290fc0eb_b_cd
windows-azure-storage:386.v673495b0a5de
windows-slaves:1.8.1
workflow-aggregator:596.v8c21c963d92d
workflow-api:1283.v99c10937efcb_
workflow-basic-steps:1042.ve7b_140c4a_e0c
workflow-cps:3806.va_3a_6988277b_2
workflow-cps-global-lib:609.vd95673f149b_b
workflow-durable-task-step:1289.v4d3e7b_01546b_
workflow-job:1360.vc6700e3136f5
workflow-multibranch:756.v891d88f2cd46
workflow-scm-step:415.v434365564324
workflow-step-api:639.v6eca_cd8c04a_a_
workflow-support:865.v43e78cc44e0d
ws-cleanup:0.45

What Operating System are you using (both controller, and any agents involved in the problem)?

Controller: Ubuntu 22.04.3 LTS
Agent: Windows Server 2019 Datacenter
Agent: Ubuntu 22.04.3 LTS

Reproduction steps

  1. Have the plugin spin up any VM gallery image using any Idle Retention Strategy
  2. Wait about 2 hours and the cleanup task will remove the VM for no reason
  3. I have to watch Jenkins and the pull request runs from the multibranch pipeline and stop/restart them as the VM agents drop offline during builds.

Expected Results

The VMs stay connected for the duration set by the chosen Idle Retention Strategy.

Actual Results

VMs disconnect at various intervals and I have to babysit builds all day, which is not a great use of my time.

Anything else?

Things have gotten better after I switched to "Azure VM Idle Retention Strategy" from "Azure VM Pool Retention Strategy (Experimental)"

Previously it was happening more often, about every 20 to 30 minutes; now it happens about every 2 hours.

I opened a support case with Microsoft and they looked at the back end to see what was going on. I put a delete lock on the resource group to let them see what is trying to delete the VMs. It was the application we set up for this integration; the app ID lines up with what the support engineer is seeing in the logs.

This used to work fine, but now it has been broken for a while. One other thing to note: I had to restore this VM from a snapshot due to an issue where files needed by Jenkins were deleted. I am not sure how that would cause this to happen, though. I think it is more likely a new bug in this plugin.

@limeman40 limeman40 added the bug label Nov 17, 2023
@limeman40
Author

I hope someone will look into this. This issue is forcing me to babysit Jenkins builds all day.

@limeman40
Author

Can you please tell me how the cleanup process works? I am seeing this in the activity logs:

"properties": { "title": "Down: Virtual machine has been unavailable for 15 minutes", "details": "Unknown", "currentHealthStatus": "Unavailable", "previousHealthStatus": "Unavailable", "type": "Downtime", "cause": "UserInitiated"

I feel like there is some timeout of 15 minutes and it is cleaning up VMs that are still in use.

@timja
Member

timja commented Nov 20, 2023

These are the three tasks that are run:
https://github.com/search?q=repo%3Ajenkinsci%2Fazure-vm-agents-plugin%20AsyncPeriodic&type=code

Code shouldn't be too hard to follow, the first and third will be the most interesting I think.

@limeman40
Author

Sure, but do you have any idea what could cause this race condition?

My co-worker had this thought: “It wouldn't be the quotas. I wonder if we are hitting the cap set in the plugin.”

Are there any caps set in the plugin? Without more guidance I feel like I am looking for a needle in a haystack.

I do have some Java experience, but it has been a while since I coded anything in it.

@limeman40
Author

It is definitely something with the plugin. I have a ticket open with Azure on this issue. They have seen in the logs that the application ID asking for the VM to be deleted is the app ID we set up for this integration with Azure. I may be able to look into this next week, but it has been a while since I have done anything in Java and I am unsure of what I will be able to work out.

@limeman40
Author

I still have no idea on this. It seems to be some kind of race condition, something in the cleanup part of the code. VMs typically only stay connected for about 2 hours at most and then things go sideways.

I am surprised nobody else has run into this issue.

@timja
Member

timja commented Nov 29, 2023

Unsure, we would sometimes have ours up for many hours and definitely don't hit an issue like this

@limeman40
Author

The only way I have been able to correlate anything is that I see messages in the Azure activity log health events saying the VM was unavailable for 10 to 15 minutes.

However, I am not seeing anything in the plugin logs that says it will remove that VM, so I am not sure what is happening. Is there anything besides the cleanup functions that could cause this?

It is a shame. We have used this plugin for probably 2 years without issue and now, all of a sudden, there is a problem.

I have even tried to pull the plugin out completely and put it back and the issue persists.

@timja
Member

timja commented Nov 29, 2023

You could maybe add logging here:

public void onClosed(Channel channel, IOException cause) {
    jschChannel.disconnect();
    cleanupSession.disconnect();
}

to see why it's closing.
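
A minimal sketch of what such logging could look like, written as a self-contained stand-in rather than a patch to the plugin. The names `ClosedChannelLogging` and `describeClose` are hypothetical; only the `onClosed(Channel, IOException)` shape and the two `disconnect()` calls come from the snippet above. It assumes `cause` can be null when the channel closes cleanly and that `java.util.logging` is an acceptable logger.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-in for the listener above; not the plugin's real class. */
final class ClosedChannelLogging {
    private static final Logger LOGGER =
            Logger.getLogger(ClosedChannelLogging.class.getName());

    /** Builds the log message; cause is assumed to be null on a clean close. */
    static String describeClose(String agentName, IOException cause) {
        if (cause == null) {
            return "Channel to " + agentName + " closed without an exception";
        }
        return "Channel to " + agentName + " closed abnormally: " + cause.getMessage();
    }

    /** Same order as the original method would need: log first, then clean up. */
    static void onClosed(String agentName, IOException cause, Runnable cleanup) {
        Level level = (cause == null) ? Level.INFO : Level.WARNING;
        // Passing the exception as the third argument also records its stack trace.
        LOGGER.log(level, describeClose(agentName, cause), cause);
        // In the plugin this is where jschChannel and cleanupSession disconnect.
        cleanup.run();
    }

    public static void main(String[] args) {
        onClosed("agent-1", new IOException("Connection reset by peer"), () -> { });
        onClosed("agent-2", null, () -> { });
    }
}
```

In the real method the two added lines would simply be the `LOGGER.log(...)` call placed before the existing `disconnect()` calls.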

Is there anything in the agent log (may be hard to get)?


Moon shot but maybe an inbound agent would work better? They should be more resilient.

@limeman40
Author

Can you give me a few more details on how I would add logging to this section?

Could you give me an example of what this would look like code-wise? I guess I can look up how to create an HPI out of my changes.

Also, are you suggesting I use a JNLP connection instead? I can try it tomorrow and see if I have better luck; I am just making sure I understand what you are suggesting.

@timja
Member

timja commented Nov 30, 2023

> Can you give me a few more details on how I would add logging to this section?
>
> Could you give me an example of what this would look like code-wise? I guess I can look up how to create an HPI out of my changes.

Add something similar to https://github.com/jenkinsci/azure-vm-agents-plugin/blob/d35d6b366b733b5475a354d0f39815051bbecf04/src/main/java/com/microsoft/azure/vmagent/remote/AzureVMAgentSSHLauncher.java#L261C13-L261C89
above line 248, then run mvn clean install -P quick-build. The HPI will be in the target directory.

> Also, are you suggesting I use a JNLP connection instead? I can try it tomorrow and see if I have better luck; I am just making sure I understand what you are suggesting.

Yes, JNLP although it has been renamed to Inbound agent.
Example init scripts are in these folders:
https://github.com/jenkinsci/azure-vm-agents-plugin/tree/master/docs/init-scripts

@limeman40
Author

Sorry, I am a little confused about the inbound agent. I would think that if you tell it to use an inbound agent, it would just use that init script to set it up.

I will give both things a try today and see if I can gather more details on this issue.

@timja
Member

timja commented Nov 30, 2023

> Sorry, I am a little confused about the inbound agent. I would think that if you tell it to use an inbound agent, it would just use that init script to set it up.
>
> I will give both things a try today and see if I can gather more details on this issue.

Maybe it could. Currently it's set up to be quite flexible so you can configure the agent however you like, and an example is given to make it easy for you to set up.

@limeman40
Author

Please correct me if I am wrong, but I would think selecting this option would automatically make it use the PS1 scripts to connect, right? If you tell it to use SSH, it does all of the init script setup for you:

Screenshot from 2023-11-30 14-16-59

Is this not the case if I choose this option? I am a little confused and need more details on how to properly set up inbound connections for this.

@timja
Member

timja commented Nov 30, 2023

No, inbound is more complicated, as Jenkins doesn't reach out to your agent at all.
The help for the launch method should explain it more.

The init script is uploaded to a storage account and then run on VM startup. Either that script or something in the VM image needs to do things like include the remoting jar file and set up a service.

Are you using Windows agents, by the way (just from looking at that screenshot)? I don't have much experience with them, although the Jenkins project does use them quite a lot without this issue as far as I know. I think they use them as 'one-shot' agents, though, and don't do multiple builds on them.

@limeman40
Author

limeman40 commented Nov 30, 2023

I have both Windows and Linux agents. Most of my stuff runs on Windows.

I had another question. I chose "Idle Retention Strategy" per your suggestion. However, even if I set the timeout to 0, the VMs only last around 2 hours and still get removed.

Is it also possible there is something wrong with the way this timeout is being set in the code?

Currently we are using a mix of Windows and Linux agents, all set up via SSH connection. Everything was working fine for 2 years, and now all of a sudden this issue has come up and I cannot for the life of me figure out why.

I have looked through the activity logs in Azure as well as the various logs in Jenkins, and they are not showing me much as to why this happens.

If you have any other ideas please let me know; I am at a loss right now. It is also getting tiresome having to babysit the Jenkins server all day long.

@timja
Member

timja commented Dec 1, 2023

0 will mean it won't go idle:
https://github.com/jenkinsci/azure-vm-agents-plugin/blob/master/src/main/java/com/microsoft/azure/vmagent/AzureVMCloudRetensionStrategy.java#L77

You would see this log line anyway:
https://github.com/jenkinsci/azure-vm-agents-plugin/blob/master/src/main/java/com/microsoft/azure/vmagent/AzureVMCloudRetensionStrategy.java#L88-L89
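
To illustrate the rule that 0 disables idle retirement, here is a small sketch. It is not the code at the links above; `IdleRetentionSketch` and `shouldRetire` are hypothetical names, and the comparison is only an assumption about how an idle timeout is typically checked.

```java
import java.time.Duration;

/** Hypothetical illustration of an idle-timeout check; not the plugin's implementation. */
final class IdleRetentionSketch {

    /**
     * Returns true when an agent should be retired for idleness.
     * A timeout of zero is treated as "never retire for being idle".
     */
    static boolean shouldRetire(boolean idle, Duration idleFor, Duration idleTimeout) {
        if (idleTimeout.isZero()) {
            return false; // 0 means the agent is never considered idle-expired
        }
        return idle && idleFor.compareTo(idleTimeout) >= 0;
    }

    public static void main(String[] args) {
        // Timeout 0: idle for 3 hours, still kept.
        System.out.println(shouldRetire(true, Duration.ofHours(3), Duration.ZERO));
        // Timeout 60 minutes: idle for 61 minutes, retired.
        System.out.println(shouldRetire(true, Duration.ofMinutes(61), Duration.ofMinutes(60)));
        // Busy agents are never retired by this check.
        System.out.println(shouldRetire(false, Duration.ofHours(5), Duration.ofMinutes(60)));
    }
}
```

If VMs disappear without the retention strategy's log line appearing, a check of this kind is probably not what removed them.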

You should be able to get help from MS Support on this as these are officially supported (for another 2 months anyway)

@limeman40
Author

As I have previously stated, I have a support ticket open. They have looked through the logs on their end, and they are saying the Enterprise Application ID that is asking for the VM deletion is the one we have set up as the service principal for this plugin.

The support person has agreed to have a Teams meeting with me to discuss this issue.

You are correct. It is curious to me that the plugin logs in Jenkins do not show it saying it is going to clean up a VM, so I am not totally believing what support is saying.

I am going to try to get an inbound agent configuration working today and see if it performs any better. I have not gotten around to putting a try/catch around the function call. I will see if I can do this today.

@limeman40
Author

I tried everything to get this inbound connection set up and nothing works.

Would it be possible for you to try the steps on your end, figure out what they are, and report back? Nothing I am trying is working and I am starting to get very frustrated with this issue.

Also, what about issue #484? It seems possibly related to my issue.

@timja
Member

timja commented Dec 2, 2023

Have you tried logging into the VM that's connecting to the Jenkins controller as an inbound agent?
Linux agents log to this path:
https://github.com/jenkinsci/azure-vm-agents-plugin/blob/master/docs/init-scripts/linux-inbound-agent.sh#L38

Basically you:

  1. Set the VM template to inbound agent
  2. Configure an init script with this: https://raw.githubusercontent.com/jenkinsci/azure-vm-agents-plugin/master/docs/init-scripts/linux-inbound-agent.sh

Then log in to the agent before it gets deleted and check the logs for any issues.
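
For orientation, this is roughly the command an inbound agent ends up running once the init script has fetched the remoting jar. The sketch below is not copied from the linked script; `InboundAgentCommand` is a hypothetical helper, the controller URL, agent name, secret, and work directory are placeholders, and it assumes the agent name needs no URL encoding.

```java
import java.util.List;

/** Hypothetical helper that assembles an inbound agent launch command. */
final class InboundAgentCommand {

    /** Builds the argument list for starting agent.jar against a controller. */
    static List<String> build(String controllerUrl, String agentName,
                              String secret, String workDir) {
        // Normalize the controller URL so exactly one slash precedes "computer/".
        String base = controllerUrl.endsWith("/") ? controllerUrl : controllerUrl + "/";
        return List.of(
                "java", "-jar", "agent.jar",
                "-jnlpUrl", base + "computer/" + agentName + "/jenkins-agent.jnlp",
                "-secret", secret,
                "-workDir", workDir);
    }

    public static void main(String[] args) {
        // Placeholder values only; the real secret comes from the agent's page in Jenkins.
        List<String> command = build(
                "https://jenkins.example.com", "azure-agent-1", "<secret>", "/home/jenkins");
        System.out.println(String.join(" ", command));
    }
}
```

Because the agent dials out to the controller, the output of this process on the VM is the first place to look when the connection drops.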


I don't think #484 is related, but I am unsure without looking closer.

@limeman40
Author

@timja Not sure if you saw my note a couple weeks back.

We had an issue where I deleted some files needed by Jenkins, so I restored the Jenkins controller from an Azure snapshot. It all seems to work fine.

However, could this have caused any issues with this plugin? I have even tried to pull out the plugin completely and add it back in. I am just curious whether this could have any bearing on the issue I am facing.

Just trying to go over all avenues to try to figure out what is going on here.

@timja
Member

timja commented Dec 5, 2023

Not really sure, maybe another plugin or library could be conflicting.

You could try updating all plugins / removing some, only a guess though based on never having seen this before

@limeman40
Author

I tried to reinstall each plugin via its HPI file, and that seems not to have done anything. This evening I see this error in the log, which is perplexing to me:

[a1be7590-9, L:/10.188.0.7:41090 - R:management.azure.com/4.150.241.10:443] The connection observed an error, the request cannot be retried as the headers/body were sent io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer

My setup uses NGINX as a reverse proxy. I am wondering if this is an error from NGINX.

@limeman40
Author

limeman40 commented Dec 8, 2023

I am still really confused why the plugin would tell it to spin down a VM when it has not even reached what I have set for the time the VM should stay up:

"properties": { "title": "Stopping and deallocating", "details": "This virtual machine is stopped and deallocated as requested by an authorized user or process.", "currentHealthStatus": "Available", "previousHealthStatus": "Available", "type": "Downtime", "cause": "UserInitiated" }, "relatedEvents": [] }

You can see here I am telling it to keep the VMs up for 48 hours, yet it removes them anyway:

Screenshot from 2023-12-08 16-13-56

From my research and the ticket I have open with Azure support, it really seems like it is the plugin telling Azure to spin down these VMs even though I have told them to stay up for 2 days. I would really like some help figuring this out.

@limeman40
Author

Forget it, we are just going to stop using this plugin.
