LAMMPS parallel MD simulation crashes soon after it begins [multi-node + CUDA-aware MPI] #161
Comments
Thanks for the detailed error report. For our paper we ran an 8-GPU, 3000 K simulation of Si3N4 with more than 100,000 atoms, so the high temperature or the simulation conditions themselves shouldn't be the cause.
Could you share more details about this? It seems like e3gnn/parallel computes some wrong values. If the simulation at least starts, you should be able to get the energy and stress of the first snapshot in log.lammps by setting thermo 1.
Yes. In fact, all my tests were done with thermo 1, but it was useless because the crash happens immediately after MD begins. This is an excerpt from the log file:
I used
for this MD. In serial, this runs well with the same geometry and all other settings identical, but when I try it with 2 GPUs in parallel, it crashes. Just in case:
Here is how I executed:
Modules I loaded:
Let me know what else you need; the geometry is 8400-atom amorphous silica.
The $SLURM_NNODES here is incorrect. You should put the number of GPUs you want to use and request the same number of tasks from SLURM. In this case, assuming your GPU node has at least two GPUs: # of tasks = 2, # of GPUs = 2, # of nodes = 1. However, even if $SLURM_NNODES is passed, it would be '1', and the simulation should then run correctly, although it will use only one GPU. I'll try to reproduce the error ASAP. ps1. Did you use the exact same MPI module that was used to compile LAMMPS?
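For reference, a minimal sketch of the single-node layout described above; the resource flags and input file name are illustrative placeholders, not taken from this thread:

    #!/bin/bash
    #SBATCH --nodes=1        # one node that has (at least) two GPUs
    #SBATCH --ntasks=2       # one MPI rank per GPU
    #SBATCH --gres=gpu:2     # request both GPUs on that node

    mpirun -np 2 lmp -in input.in   # rank count matches the GPU count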
Our cluster is old and we aren't rich... ;P It has only one GPU per node. My job script requests 2 GPU nodes with 1 GPU each. So mpirun -np 1 would just be a serial run while requesting 2 GPUs, and it might be slower than a real serial run without mpirun. (For serial MD, I use lmp -in input.in without mpirun.)
Yes. And my MD script loads the same modules that I used for compiling. I compiled SevenNet and LAMMPS SevenNet into the same virtual environment (using Python 3.9). In my MD job script, I activate the virtual environment and load the modules.
I know; I just tested it in case it helped. It didn't. :(
I'm sorry to hear that. If multiple nodes are involved it becomes much harder to test, but here are some tests I do for debugging. At least we can narrow down the root cause of the problem.
ps. Are you using CUDA-aware MPI? SevenNet parallel supports both CUDA-aware MPI on and off, and both should work. But to the best of my knowledge, since you're relying on inter-node communication, CUDA-aware MPI won't help that much; it mainly helps intra-node communication between GPUs.
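As a side note, if the cluster MPI is Open MPI, one standard way to check whether the build is CUDA-aware (assuming ompi_info is on the PATH) is:

    ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
    # ends with ...:value:true for a CUDA-aware build, ...:value:false otherwise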
I ran some tests from the console command line. With 2 GPUs, I see the same type of non-numeric-pressure crash soon after the MD steps begin. The same MD runs with 1 GPU, though. With export CUDA_VISIBLE_DEVICES=, mpirun -np 1 lmp -in input runs, but very, very slowly. With
Here's the error message from the Slurm error file:
All I hear from the person who manages the server is that the OpenMPI I used is a CUDA-aware build.
Hi devs, sorry to ask this, but if possible, could you test a 1500 K, 1 atm NPT and an NVT run of any condensed-phase solid material with 2 nodes and 1 GPU per node? That would match the conditions of the MD tests I'm running here. Also, do you use venv or a similar virtual environment to compile SevenNet and LAMMPS-SevenNet on your computing cluster? I really can't track down what the problem is with this strange 2-node crash... On my side, both NVT and NPT crash with 2 nodes (1 GPU per node).
Just in case it helps find the source of the crash, I'll share this as well. I tried to compile SevenNet and LAMMPS SevenNet using CUDA 11.3 and prebuilt PyTorch 1.12.1. I don't know whether this CUDA version works for SevenNet and LAMMPS SevenNet, but that prebuilt PyTorch is within the required range. I compiled them in a venv. The SevenNet compilation went without issue, but the LAMMPS SevenNet configuration fails with the following error message:
I guess CUDA 11.3 may not be compatible with SevenNet or LAMMPS SevenNet? BTW, LAMMPS SevenNet with CUDA 11.8 and prebuilt PyTorch 2.4.1, 2.3.0, or 2.3.1 also crashes with 2 and 4 GPUs, with the same error "Non-numeric pressure - simulation unstable"...
Just in case, let me share the Slurm error message from a crash using the official parallel MD example in the source code folder (example_inputs/md_parallel_example). I used two GPU nodes (1 GPU each) and executed using
It printed out a segfault error.
Again, the serial example in example_inputs/md_serial_example works fine with 1 GPU, with an execution command of
I'm using prebuilt PyTorch; could that be a reason behind it?
Sorry for the late reply.
It will no longer use CUDA-aware MPI for communication; communication will instead be done on the CPU side. From your last error message, the segfault looks like a driver-level issue. For example: Unfortunately, your situation is particularly hard to debug because you're using 2 nodes; there are many possible error sources. We first need to get the parallel + single node + 2 process case working, with or without a GPU.
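A hypothetical way to run that single-node, 2-process check from inside the md_parallel_example folder (the CUDA_VISIBLE_DEVICES toggle only controls GPU visibility):

    export CUDA_VISIBLE_DEVICES=""   # hide the GPU so both ranks run on CPU
    mpirun -np 2 lmp -in in.lmp
    export CUDA_VISIBLE_DEVICES=0    # then retry with the node's GPU visible to both ranks
    mpirun -np 2 lmp -in in.lmp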
First, I set up a venv and activated it. Then I compiled custom PyTorch 2.2.2 using cmake and the Python setup script (for the wheel), SevenNet 0.10.3 (using pip install), and LAMMPS with the SevenNet module (using cmake), following the instructions. No crashes were observed while compiling any of these. The Slurm environment and loaded modules are:
Our cluster only has 1 GPU per GPU node, so I typically use
to run 2-GPU jobs. I tested using in.lmp, res.dat, and deployed_parallel/ from the {SevenNet_sourcecode}/example_inputs/md_parallel_example directory, from the console command line, after I salloc'd into a GPU console with 2 nodes.
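For context, a hypothetical version of that 2-node, 1-GPU-per-node allocation; the exact flags are placeholders, not the ones used on this cluster:

    salloc --nodes=2 --ntasks-per-node=1 --gres=gpu:1   # 2 nodes, 1 GPU and 1 rank each
    mpirun -np 2 lmp -in in.lmp                         # 2 ranks total, one per node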
Hope this helps. Please let me know if you have any further requests.
Huge thanks for the very detailed report. Since the only non-working case seems to be CUDA-aware MPI + multi-node + multi-GPU, I'll try to reproduce it locally. Until then, you can simply use export OFF_E3GNN_PARALLEL_CUDA_MPI=1 to use multiple GPUs. I think that in this particular case, where communication necessarily goes over the inter-node cable, CUDA-aware MPI won't help that much (I'm not sure, as this setup is uncommon). The reason you're not seeing a speed-up is the size of the test system: you need a lot of atoms to fully utilize each GPU, and if the GPUs are underutilized because of a small number of atoms, there is no speed-up. Thanks again for the detailed report. If you don't want to export that variable, you can simply recompile LAMMPS with an MPI that is not CUDA-aware.
Thank you. I understand the official example plays with a small number of atoms. I will test bigger systems with "export OFF_E3GNN_PARALLEL_CUDA_MPI=1". The reason I want to utilize multiple GPUs is to overcome OOM crashes. With our server, 8k to 9k atoms for a condensed-phase solid material, or 5k to 7k water molecules, is the maximum with 1 GPU node before hitting an OOM crash; any bigger system requires 2+ GPU nodes. (All MDs for my system used the July 11 pre-trained model. For the example-folder tests, I used the pre-trained model in the example folder.) But then, what is the role of "export OFF_E3GNN_PARALLEL_CUDA_MPI=1", and how much would it impact simulation speed compared to the "normal" case? Let me know if you need anything more, and I hope a bypass or patch comes soon. Thanks
Sorry for the lack of explanation. These two approaches execute different code paths, and the default is: if CUDA-aware MPI is available, use it, as it is faster. OFF_E3GNN_PARALLEL_CUDA_MPI overrides this choice and uses the CPU to communicate data.
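As a sketch, with Open MPI the variable can be propagated to every rank of a multi-node run via the -x option (the input file name is a placeholder):

    # force CPU-side communication on all ranks, then launch as usual
    mpirun -np 2 -x OFF_E3GNN_PARALLEL_CUDA_MPI=1 lmp -in in.lmp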
Thanks. Then, would this be an OpenMPI problem? But our server admin assured me the OpenMPI we are using is a CUDA-aware one... Anyway, I hope this problem gets resolved ASAP. Thanks!
Thanks. Any recommended OpenMPI or other MPI modules for SevenNet? I know it is hardware-specific, but there are so many versions; at least I'd like to know which MPI would be a good choice to start testing with. Meanwhile, if you could run tests on this to narrow it down further, that would be great. Thank you.
CUDA-aware MPI + CUDA + 2 GPUs + md_parallel_example turned out to be safe. The log:
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: True
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: True
Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 6
  ghost atom cutoff = 6
  binsize = 3, bins = 8 7 7
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair e3gnn/parallel, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.002
[W129 14:58:54.649093819 TensorAdvancedIndexing.cpp:231] Warning: The reduce argument of torch.scatter with Tensor src is deprecated and will be removed in a future PyTorch release. Use torch.scatter_reduce instead for more reduction options. (function operator())
[W129 14:58:54.649151820 TensorAdvancedIndexing.cpp:231] Warning: The reduce argument of torch.scatter with Tensor src is deprecated and will be removed in a future PyTorch release. Use torch.scatter_reduce instead for more reduction options. (function operator())
Per MPI rank memory allocation (min/avg/max) = 3.344 | 3.344 | 3.344 Mbytes
Step T/CPU PotEng KinEng Volume Press Temp
0 0 -22222.717 49.571266 8625.7383 4085.2407 500
1 0.054480366 -22222.114 48.973063 8625.7383 4088.391 493.96624
2 0.051336446 -22220.725 47.595317 8625.7383 4248.8111 480.06961
3 0.052770978 -22218.757 45.647225 8625.7383 4532.3253 460.42021
4 0.051208429 -22216.527 43.440153 8625.7383 4863.6884 438.15861
5 0.05160953 -22214.399 41.33311 8625.7383 5139.0678 416.90593
Loop time of 0.192437 on 2 procs for 5 steps with 768 atoms
Performance: 4.490 ns/day, 5.345 hours/ns, 25.983 timesteps/s, 19.955 katom-step/s
98.8% CPU use with 2 MPI tasks x 1 OpenMP threads
Maybe I could test multi-node with this. I recommend you consult with your server administrator. The
I don't have that. One day I tried to compile OpenMPI from source as an exercise and failed. :P It was out of my domain.
Thanks for your test. Could you perform tests with 2 nodes while using only 1 GPU per node? That would be similar to our environment.
I will contact the server admin about NVHPC. Thanks.
Hello,
##############################################################################################
RuntimeError: CUDA out of memory. Tried to allocate 13.71 GiB. GPU 0 has a total capacity of 79.10 GiB of which 5.27 GiB is free. Including non-PyTorch memory, this process has 73.82 GiB memory in use. Of the allocated memory 69.16 GiB is allocated by PyTorch, and 3.89 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
##############################################################################################
These are my configs:
I think you should open a new issue about this instead of replying to someone else's issue on a different topic, but let me answer anyway. An OOM crash means you need to request more GPU nodes for that job. Each GPU node has a limited amount of memory, and the memory from two nodes isn't enough to describe all the atoms and their interactions.
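As a stopgap, the error text above already names one knob; a hypothetical combination is to request more nodes and set the allocator option it mentions (the node count is only an example):

    #SBATCH --nodes=4                                         # more nodes = more total GPU memory
    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   # fragmentation workaround quoted in the error
    mpirun -np 4 lmp -in in.lmp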
One more question: instead of "best" or "optimal" options, could you just share which versions of OpenMPI (or another MPI) and cuDNN are used in any working SevenNet-LAMMPS setup? I'd like to test different OpenMPI versions, but blindly testing everything is crazy; I just hope to have some starting point...
Hi @turbosonics
In my case, OpenMPI versions were fine all the time. What you used in this issue is also fine, except for CUDA-aware MPI.
[parkyutack@odin ~]$ module load NV_HPC/22.7
[parkyutack@odin ~]$ which mpirun
/TGM/Apps/NVHPC/2022_227/Linux_x86_64/22.7/comm_libs/mpi/bin/mpirun
[parkyutack@odin ~]$ mpirun --version
mpirun (Open MPI) 3.1.5
Report bugs to http://www.open-mpi.org/community/help/
As the path says, this is the OpenMPI embedded in the NVHPC kit. As for the version I used for the experiments in the paper (where multi-node worked successfully), it says this (copied from the paper):
Thank you so much for the additional info. Could you run the same LAMMPS-SevenNet tests with 2 or 4 GPU nodes, with 1 GPU per node? I just hope to isolate the reason behind this further. Given the crazy prices of GPUs these days, I'm sure some people will try to use SevenNet with a similar or worse environment, and I think it would be great for SevenNet to verify this...
Thanks for the reminder. I will discuss NVHPC with our server admin. On our cluster, I've been testing with OpenMPI 4.1.1 and 4.1.7 + CUDA 12.1 + PyTorch 2.2.2 (both the prebuilt one and a manually compiled one) + cuDNN 8.1.1 + gcc/11.2.0. I tested both OpenMPI 4.1.1 and 4.1.7 (installed by the server admin), and I tried the prebuilt and the manually compiled PyTorch 2.2.2 under both OpenMPI conditions. All tests were conducted from separate venvs. And I see the same thing in all my test results: multi-GPU-node only works when the CPU performs the node-to-node communication. Since both OpenMPI versions had the same issue, I think the problem might be with OpenMPI, but I'm not sure what the reason is...
Sorry to ask this again, but could you please test the 2 or 4 GPU node case with one GPU each, using the example, and see whether the current version of SevenNet-LAMMPS works in your environment or not? Over the last 2 weeks I've been testing various combinations of OpenMPI and PyTorch versions, but all failed. It seems like putting in more effort is becoming a waste of time. All I hope to see is confirmation: confirmation that this same crash happens on different hardware with a similar GPU setup. Then I can peacefully stop this endless testing, and I can tell our server people that this is not a problem with our server.
@YutackPark I loaded the nvhpc and cmake modules. Then I modified the nvhpc.cmake preset for nvc++, nvc, nvfortran, and mpicxx, and used that preset. But the configuration error says nvcc is broken. Yet when I check with $ nvcc --version, it prints proper output.
Would it be possible that the NVHPC on our cluster uses CUDA 10 instead of CUDA 11 or 12?
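A quick, generic check of which CUDA toolkit the loaded modules actually put first on the PATH (the exact output wording may differ):

    which nvcc                     # the path shows which module/toolkit provides nvcc
    nvcc --version | grep release  # e.g. "Cuda compilation tools, release 11.8"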
I'm trying to run a series of high-temperature (1500 K to 3000 K) 1 atm NPT and high-temperature NVT simulations with a 1 fs timestep using the pre-trained SevenNet model (July 2024), but the parallel simulations are too unstable.
The exact same simulation runs with the serial version without any issues: no crashes, no errors, nothing. But if I submit the same job with e3gnn/parallel on 2 GPUs, the MD becomes very unstable and crashes soon after it begins.
I just compiled SevenNet and LAMMPS SevenNet, so they are version 0.10.3. The OpenMPI on our local cluster is a CUDA-aware build, and I used CUDA 11.8 with prebuilt PyTorch 2.4.0.
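For what it's worth, a quick generic sanity check that the prebuilt PyTorch build matches the loaded CUDA stack (not something SevenNet requires, just a debugging aid):

    python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
    # expect something like: 2.4.0+cu118 11.8 True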
The geometry is not that huge; it contains just 8400 atoms (it is just silica), and the crash is not an OOM.
The very first crash I faced with high-temperature 1 atm NPT was:
The log file printed only the information for step 0 and then immediately crashed.
I have the following lines in my LAMMPS input script:
So, when I remove that line and resubmit the same job, then:
Then, I changed to NVT with the same temperature:
I tried a 0.5 fs timestep, but the non-numeric error still appeared. Yet the serial run with a 1 fs timestep runs smoothly. All crashes happened within 30 seconds of submitting the job.
I just want to know where the problem comes from. The pre-trained model and my LAMMPS input script should be fine, because the serial MD with 1 GPU has been running well for more than 12 hours. So the problem would be with SevenNet parallel, my installation, or our local server cluster...
Has anyone tried high-temperature NPT and NVT using LAMMPS SevenNet with 2+ GPUs for a system of 8k-10k or more atoms? Has anyone faced a similar problem?