Issues with openmpi #158
Hi @rbeucher, I didn't particularly want to build a new version of OpenMPI; however, at the time I felt it was the least bad option, as the lack of one was blocking some fairly crucial package updates. In building OpenMPI 4.1.6, I tried to get it as close as possible to NCI's OpenMPI 4.1.5 build. I used the same set of configure flags (recovered from ...
I hope that's helpful. The documentation for the installations is still online here: https://coecms.github.io/cms-wiki/resources/resources-conda-setup.html, though admittedly there isn't much detail about how MPI works within the environment.
Dale
Hi @dsroberts, thank you. I really appreciate you taking the time to provide such a detailed response. It gives us valuable context and will be very useful moving forward. I'll go through it carefully and let you know if I have any questions.
Hi @dsroberts,
I noticed that the containerized environment on hh5 is using an OpenMPI build located at /g/data/hh5/public/apps/openmpi/4.1.6. However, upon checking the build, it appears to actually be version 4.1.7.
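A minimal sketch of how this kind of mismatch can be checked, assuming the install prefix ends in a version directory and that `ompi_info --version` prints a line of the form `Open MPI vX.Y.Z`. The helper names are hypothetical and are not part of the hh5 or xp65 tooling.

```python
import re
import subprocess
from pathlib import Path

VERSION_RE = re.compile(r"(\d+\.\d+\.\d+)")


def version_from_prefix(prefix):
    """Return the version implied by the last path component, e.g. '4.1.6'."""
    match = VERSION_RE.fullmatch(Path(prefix).name)
    return match.group(1) if match else None


def version_from_ompi_info(text):
    """Extract 'X.Y.Z' from text such as 'Open MPI v4.1.7'."""
    match = re.search(r"Open MPI v?(\d+\.\d+\.\d+)", text)
    return match.group(1) if match else None


def reported_version(prefix):
    """Run <prefix>/bin/ompi_info --version and parse what it reports."""
    exe = Path(prefix) / "bin" / "ompi_info"
    result = subprocess.run(
        [str(exe), "--version"], capture_output=True, text=True, check=True
    )
    return version_from_ompi_info(result.stdout)


def check_prefix(prefix, reported):
    """Return (matches, expected), comparing the directory name to the reported version."""
    expected = version_from_prefix(prefix)
    return expected == reported, expected
```

On a system where the build is available, `check_prefix(prefix, reported_version(prefix))` would return `False` along with the expected version whenever the directory name and the installed build disagree.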
Since the default OpenMPI on Gadi is now also version 4.1.7, there shouldn’t be any inherent difference. However, I’ve encountered an issue: there are no Conda packages available for OpenMPI 4.1.7, and most existing packages were built with links to version 4.1.6.
Here’s what I’ve tried so far:
- Loading the Gadi OpenMPI module before the Singularity module in the .common_v3 script.
- Linking to a copy of the OpenMPI 4.1.6 build in /g/data/xp65/public/apps/openmpi/4.1.6. (This is a copy for now, and I may need to perform a proper build here.)
Unfortunately, we’re still encountering intermittent issues, including:
- Problems with OpenFabrics and InfiniBand.
- Significant slowdowns or errors due to missing library links.
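One way to narrow down the missing library links is to inspect `ldd` output for a given binary or shared object. The sketch below is only illustrative: it assumes the usual `name => path (address)` and `name => not found` line formats, the helper names are hypothetical, and it says nothing about the OpenFabrics or InfiniBand problems themselves.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


def resolved_paths(ldd_output, needle="libmpi"):
    """Map library names containing `needle` to the path they resolve to."""
    resolved = {}
    for line in ldd_output.splitlines():
        name, sep, target = line.strip().partition(" => ")
        if sep and needle in name and target.strip() != "not found":
            # Drop the trailing load address, e.g. " (0x00007f...)".
            resolved[name] = target.split(" (")[0]
    return resolved


def run_ldd(path):
    """Run ldd on a binary or shared object and return its text output."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return result.stdout
```

Running `resolved_paths(run_ldd(path))` inside and outside the container would show whether a package picks up the intended OpenMPI prefix or a different one.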
I’m curious about the rationale behind your current implementation. I understand you’re busy with your new role, but any insights or advice would be greatly appreciated.
Thanks for your help!