Rootful mode - node to node networking does not work #365
Comments
Is this issue reproducible on GHA?
Probably not, unless you are able and willing to pump out money for custom runners with GPUs! 😆 I can barely get V100s on the clouds... can't imagine what regular use in CI would require.
"Rootful mode - node to node networking does not work" does not seem relevant to GPUs?
Good idea! Is there a way to do the equivalent of a restart with Lima? I can install Docker (rootful, skipping the dockerd-rootless installer tool) and add the user to the docker group, but normally I then need to log out and back in for the group change to take effect. In this case I'm getting a permissions error on the socket (because I haven't done that). https://github.com/researchapps/usernetes/actions/runs/13460718730/job/37615121293
Just found this - testing now!
All set: #366 Thanks for the help @AkihiroSuda 🙏
To give you an update @AkihiroSuda - I've spent about 48 hours on the rootless case, and I've actually gotten it working several times with a strategy that uses CDI from the host. The problem is consistency, and all the manual tweaks / customizations that are required. For example, tonight I've brought up a few clusters per hour: I'll get one working and try to harden the setup, but when I bring up (what I deem to be) "the same" cluster again, I get a slightly different error. I'll even see cases where it runs once, and then there is a containerd error about permissions. Do you have any experience with what might be causing that? That error seems to be unsolvable, in the sense that once I see it, there is no way to fix it and go back to a working state. I'm probably not going to work on this over the weekend because I'm a bit behind on everything else that I should have been doing for the last few days. But if you are interested, the VM build, customizations, and setup branch is here. For safekeeping, this is the error that ultimately happens:
I tried mounting a tmpfs there, and made various attempts to clean up and restart, with no fix. What seems to be working OK is getting the container built from the host and being able to run
Originally posted by @AkihiroSuda in #366 (comment) The NVIDIA stuff is irrelevant to "Rootful mode - node to node networking does not work", and should be discussed in a separate issue.
Yes! Apologies for that. It was only relevant in that the NVIDIA GPUs / device plugins install easily with rootful mode, and helped me discover the bug here.
Hi @AkihiroSuda - I was testing GPU with usernetes, and got fairly far but hit two erroneous cases:
I don't think it's in scope for you to help with getting GPUs working with rootless, but I'm hoping you might have insight into why networking stops working when it's rootful. From the README it sounds like it should work?
Thanks!