Plugins cleanup actions is removing VMs that are in use and working. #481
Comments
I hope someone will look into this. This issue is forcing me to babysit Jenkins builds all day. |
Can you please tell me how the cleanup process works? I am seeing this in the activity logs:
I feel like there is some timeout of about 15 minutes, and it is cleaning up VMs that are still in use. |
These are the three tasks that are run: The code shouldn't be too hard to follow; the first and third will be the most interesting, I think. |
Sure, but do you have any ideas what could cause this race condition? My co-worker had this thought: “it wouldn't be the quotas, I wonder if we are hitting the cap set in the plugin”. Are there any caps set in the plugin? I feel like without more guidance I am looking for a needle in a haystack. I do have some Java experience, but it has been a while since I coded anything in it. |
It is definitely something with the plugin. I have a ticket open with Azure on this issue. They have seen in the logs that the application ID asking for the VM to be deleted is the app ID we set up for this integration into Azure. I can possibly look into this next week, but it's been a while since I have done anything in Java and I am unsure of what I will be able to work out. |
Still have no idea on this. It seems to be some kind of race condition, something with the cleanup aspect of the code. VMs typically only stay connected for about 2 hours at most and then things go sideways. I am surprised nobody else has run into this issue. |
Unsure, we would sometimes have ours up for many hours and definitely don't hit an issue like this |
The only way I have been able to correlate anything is that I see messages in the Azure activity logs (health events) saying the VM was not around for 10 to 15 minutes. However, I am not seeing anything in the plugin logs that says it will remove that VM, so I am not sure what is happening. Is there anything besides the cleanup functions that could cause this? It is a shame: we have used this plugin for probably 2 years without issue, and now all of a sudden there is some problem. I have even tried pulling the plugin out completely and putting it back, and the issue persists. |
You could maybe add logging here: azure-vm-agents-plugin/src/main/java/com/microsoft/azure/vmagent/remote/AzureVMAgentSSHLauncher.java Lines 247 to 250 in d35d6b3
to see why it's closing. Is there anything in the agent log (may be hard to get)? Moon shot but maybe an inbound agent would work better? They should be more resilient. |
Can you give me a few more details on how I would add logging to this section? Could you give me an example of what this would look like code-wise? I guess I can look up how to create an HPI out of my changes. Also, are you suggesting I use a JNLP connection instead? I can try it tomorrow and see if I have better luck; I'm just making sure I understand what you are suggesting. |
Add something similar to https://github.com/jenkinsci/azure-vm-agents-plugin/blob/d35d6b366b733b5475a354d0f39815051bbecf04/src/main/java/com/microsoft/azure/vmagent/remote/AzureVMAgentSSHLauncher.java#L261C13-L261C89
Yes, JNLP although it has been renamed to Inbound agent. |
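For readers wondering what "add logging" around a connection close could look like, here is a minimal, self-contained sketch using `java.util.logging`. It is not the plugin's code: `DisconnectLoggingSketch`, `Closer`, and `closeWithLogging` are hypothetical names invented for illustration, and the real change would go around the lines linked above in `AzureVMAgentSSHLauncher.java`.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch; none of these names come from the plugin. */
class DisconnectLoggingSketch {
    private static final Logger LOGGER =
            Logger.getLogger(DisconnectLoggingSketch.class.getName());

    /** Stand-in for whatever operation closes the agent connection. */
    interface Closer {
        void close() throws IOException;
    }

    /**
     * Logs that a close was requested (with the caller's stack trace),
     * runs the close, logs the outcome, and returns a short summary.
     */
    static String closeWithLogging(String agentName, Closer closer) {
        // Attaching a Throwable makes the logger print the call stack,
        // which shows which code path asked for the disconnect.
        LOGGER.log(Level.INFO, "Closing connection to agent " + agentName,
                new Throwable("close requested from"));
        try {
            closer.close();
            LOGGER.log(Level.INFO, "Connection to agent {0} closed cleanly", agentName);
            return "closed";
        } catch (IOException e) {
            LOGGER.log(Level.WARNING,
                    "Failed to close connection to agent " + agentName, e);
            return "failed: " + e.getMessage();
        }
    }
}
```

The stack trace on the first log call is the useful part here: if something is closing connections unexpectedly, the trace should show which caller triggered it.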
Sorry, I'm a little confused about the inbound agent. I would think that if you tell it to use an inbound agent, it would just use that init script to set it up. I will give both things a try today and see if I can gain more details on this issue. |
Maybe it could. Currently it's set up to be quite flexible, so you can configure the agent however you like, and an example is given to make it easy for you to set up. |
Please correct me if I am wrong, but I would think selecting this option would automatically have it use the PS1 scripts to connect, right? If you tell it to use SSH, it does all that script init stuff for you: Is this not the case if I choose this option? I am a little confused and need more details on how to properly set up inbound connections for this. |
No, inbound is more complicated, as Jenkins doesn't reach out to your agent at all. The init script is uploaded to a storage account and then run on VM startup, and either that script or something in the VM image needs to do things like include the remoting jar file and a service. Are you using Windows agents, by the way (just from looking at that screenshot)? I don't have much experience with them, although the Jenkins project does use them quite a lot without this issue as far as I know. I think they use them as 'one-shot' agents, though, and don't do multiple builds on them. |
I have both Windows and Linux agents. Most of my stuff runs on Windows. I had another question: I chose "Idle Retention Strategy" per your suggestion. However, even if I tell it 0 for the timeout, the VMs only last around 2 hours and still get removed. Is it also possible there is something wrong with the way this timeout is being set in the code? Currently we are using a mix of Windows and Linux agents set up via SSH connection. All was working fine for 2 years, and now all of a sudden this issue has come up and I cannot for the life of me figure out why. I have looked through the activity logs in Azure as well as the various logs in Jenkins, and they are not showing me much as to why this happens. If you have any other ideas, please let me know; I am at a loss right now. It is also getting tiresome having to babysit the Jenkins server all day long. |
0 will mean it won't go idle: You would see this log line anyway: You should be able to get help from MS Support on this as these are officially supported (for another 2 months anyway) |
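To make the "0 means it won't go idle" rule concrete, here is a small sketch of how such a check is commonly written. It is an illustration only, not the plugin's implementation: `IdleRetentionSketch` and `shouldTerminate` are hypothetical names, and the treatment of non-positive values as "never remove" is an assumption based on the comment above.

```java
import java.time.Duration;

/** Hypothetical sketch of an idle-retention check; not the plugin's code. */
class IdleRetentionSketch {
    /**
     * Returns true when an agent that has been idle for {@code idleFor}
     * should be removed. A timeout of 0 (or less) is assumed to disable
     * idle removal entirely.
     */
    static boolean shouldTerminate(Duration idleFor, long idleTimeoutMinutes) {
        if (idleTimeoutMinutes <= 0) {
            return false; // 0 means "never remove for being idle"
        }
        return idleFor.compareTo(Duration.ofMinutes(idleTimeoutMinutes)) >= 0;
    }
}
```

Under this rule, an agent idle for 2 hours is kept both with a timeout of 0 and with a timeout of 2880 minutes (48 hours), so a removal after roughly 2 hours would have to come from some other code path.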
As I have previously stated, I have a support ticket open. They have looked through the logs on their end, and they are saying the Enterprise Application ID that is asking for the VMs' deletion is the one we have set up as the service principal for this plugin. The support person has agreed to have a Teams meeting with me to discuss this issue. You are correct, and it is curious to me that I am not seeing anything in the plugin logs in Jenkins saying it is going to clean up a VM. So I am not totally believing what support is saying. I am going to try to get an inbound agent configuration working today and see if it performs any better. I have not gotten around to putting a try/catch around the function call. I will see if I can do this today. |
I tried everything to get this inbound connection set up and nothing works. Would it be possible for you to please try the steps on your end, figure out what they are, and report back? Nothing I am trying is working, and I am starting to get very frustrated with this issue. Also, what about issue 484? It seems possibly related to my issue. |
Have you tried logging into the VM that's connecting to the Jenkins controller as an inbound agent? Basically you:
And log in to the agent before it gets deleted and check the logs for any issues. I don't think #484 is related, but I'm unsure without looking closer. |
@timja Not sure if you saw my note a couple of weeks back. We had an issue where I deleted some files needed for Jenkins, and I did a restore of the Jenkins controller from an Azure snapshot. It all seems to work fine. However, could this have caused any issues with this plugin? I have even tried to pull the plugin out completely and add it back in. Just curious if this could have any bearing on the issue I am facing. I'm just trying to go over all avenues to figure out what is going on here. |
Not really sure, maybe another plugin or library could be conflicting. You could try updating all plugins / removing some, only a guess though based on never having seen this before |
I tried to reinstall each plugin via the HPI file; that seems to not have done anything. This evening I see this error in the log, which is perplexing to me:
My setup uses NGINX as a reverse proxy, so I'm wondering if this is an error from NGINX. |
I am still really confused why the plugin would tell it to spin down a VM when it has not even reached what I have set for the time the VM should stay up:
You can see here I am telling it to keep the VMs up for 48 hours, yet it removes them anyway: From my research and the ticket I have open with Azure support, it really seems like it is the plugin telling Azure to spin down these VMs even though I have told them to stay up for 2 days. I would really like some help figuring this out. |
Forget it, we are just going to stop using this plugin. |
Jenkins and plugins versions report
Environment
What Operating System are you using (both controller, and any agents involved in the problem)?
Controller: Ubuntu 22.04.3 LTS
Agent: Windows Server 2019 Datacenter
Agent: Ubuntu 22.04.3 LTS
Reproduction steps
Expected Results
The VMs stay connected for the duration set by the chosen Idle Retention Strategy.
Actual Results
VMs disconnect at various intervals and I have to babysit builds all day, which is not a great use of my time.
Anything else?
Things have gotten better after I switched to "Azure VM Idle Retention Strategy" from "Azure VM Pool Retention Strategy (Experimental)".
Previously it was happening more often, about every 20 to 30 minutes; now it happens every 2 hours.
I opened a support case with Microsoft and they looked at the back end to see what was going on. I put a delete policy lock at the resource group level to allow them to see what is trying to delete the VMs. It was the application we set up for this integration; the app ID lines up with what the support engineer is seeing in the logs.
This used to work fine, but now it has been broken for a while. One other thing to note is that I had to restore this VM from a snapshot due to a delete issue with the files for Jenkins. Not sure how that would cause this to happen, though. I think it is more likely a new bug in this plugin.