
LAMMPS parallel MD simulation crashes soon after it begins [multi-node + CUDA-aware MPI] #161

Open
turbosonics opened this issue Jan 16, 2025 · 27 comments


@turbosonics

I'm trying to run a series of high-temperature (1500 K~3000 K) 1 atm NPT and high-temperature NVT simulations with a 1 fs timestep, using the pre-trained SevenNet model (July 2024). But the parallel simulations are far too unstable.

The exact same simulation runs with the serial version without any issues. No crashes, no errors, nothing. But if I submit the same job with e3gnn/parallel on 2 GPUs, the MD becomes very unstable and crashes soon after it begins.

I just compiled SevenNet & LAMMPS SevenNet, so they are version 0.10.3. The OpenMPI build on our local cluster is a CUDA-aware one, and I used CUDA 11.8 with prebuilt PyTorch 2.4.0.

The geometry is not that huge; it contains just 8400 atoms (it is just silica), and the crash is not an OOM.

The very first crash I hit with high-temperature 1 atm NPT was:

ERROR on proc 0: Too many neighbor bins (src/nbin_standard.cpp:213)
Last command: run             100000

The log file printed the information for step 0 and then immediately crashed.

I have the following lines in my LAMMPS input script:

neighbor        1 bin
neigh_modify    every 10 delay 0 check no

When I remove those lines and resubmit the same job, I get:

ERROR: Non-numeric pressure - simulation unstable (src/fix_nh.cpp:1049)
Last command: run             100000

Then I switched to NVT at the same temperature:

ERROR: Pair e3gnn requires consecutive atom IDs (src/pair_e3gnn_parallel.cpp:204)
Last command: run             100000

I tried a 0.5 fs timestep, but the non-numeric error still appeared, while the serial run with 1 fs runs smoothly. All crashes happened within 30 seconds of submitting the job.

I just want to know where the problem comes from. The pre-trained model and my LAMMPS input script should be fine, because the serial MD with 1 GPU has been running well for more than 12 hours. So the problem should be in SevenNet parallel, my installation, or our local server cluster...

Has anyone tried high-temperature NPT and NVT with LAMMPS SevenNet on 2+ GPUs for a system of 8k~10k or more atoms? Has anyone faced a similar problem?

@YutackPark
Member

Thanks for the detailed error report. Our paper ran an 8-GPU + 3000 K + Si3N4 + >100,000-atom simulation, so the high temperature or the simulation conditions themselves shouldn't be the cause.

I tried a 0.5 fs timestep, but the non-numeric error still appeared

Could you share more details about this? It seems like e3gnn/parallel computes some wrong values. If the simulation at least starts, you should be able to get the energy and stress of the first snapshot in log.lammps by setting thermo 1.
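
For example, something like this near the top of your input prints energy and pressure every step (a sketch only; the thermo_style columns here are just an illustration, pick whichever you need):

thermo          1
thermo_style    custom step temp pe ke etotal press pxx pyy pzz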

@turbosonics
Author

turbosonics commented Jan 16, 2025

Could you share more details about this? It seems like e3gnn/parallel computes some wrong values. If the simulation at least starts, you should be able to get the energy and stress of the first snapshot in log.lammps by setting thermo 1.

Yes. In fact, all my tests were done with thermo 1, but it didn't help much because the crash happens immediately after the MD begins. This is an excerpt from the log file:

Per MPI rank memory allocation (min/avg/max) = 14.61 | 14.61 | 14.61 Mbytes
   Step          Time           CPU            Temp          PotEng         KinEng         TotEng         Volume        Density        Enthalpy        Press           Pxx            Pyy            Pzz            Pxy            Pxz            Pyz             Lx             Ly             Lz            Xlo            Xhi            Ylo            Yhi            Zlo            Zhi
         0   0              0              1500          -62826.285      1628.4839     -61197.801      127736.31      2.187025      -58210.753      37466.072      131781.15     -44333.013      24950.083     -77481.717      55015.9       -37282.4        50.406284      50.298206      50.382206      0              50.406284      0              50.298206      0              50.382206
         1   0.001          0.36608123     41968.468     -62486.863      45563.317     -16923.546      127736.84      2.1870159      2597.5891      244849.53      302074.14      146287.67      286186.78      4126.1976      90559.068     -43493.681      50.406353      50.298275      50.382276     -3.4664005e-05  50.406319     -3.4589681e-05  50.298241     -3.4647447e-05  50.382241
         2   0.002          0.7405778      0             -595047.26      0             -595047.26      127744.25      2.1868891      37229548       4.7439849e+08 -16598257       5.0485517e+08  9.3493855e+08 -72115753       1.9693267e+08  7.80558e+08    50.407328      50.299248      50.38325      -0.00052201274  50.406806     -0.00052089348  50.298727     -0.00052176339  50.382728
         3   0.003          1.0752766      0             -1.0315807e+10  0             -1.0315807e+10  134596.07      2.0755621      1.9616428e+13  2.3362873e+14  1.9426976e+14  3.6426898e+14  1.4234743e+14  2.5372405e+14 -1.1384995e+14 -1.8678415e+14  51.292912      51.182933      51.268411     -0.44331392     50.849598     -0.4423634      50.74057      -0.44310216     50.825309
         4   0.004          1.1106459      0             -14540.14       0             -14540.14       inf            0             -nan            0              0              0              0              0              0              0              inf            inf            inf           -inf            inf           -inf            inf           -inf            inf  
ERROR: Non-numeric pressure - simulation unstable (src/fix_nh.cpp:1049)
Last command: run             ${iter01}

I used

velocity        all create 1500 500000 mom yes rot yes dist gaussian
fix             npt02_NPTheating all npt temp 1500 1500 0.1 iso 1 1 1

for this MD.

In serial, this runs well with the same geometry and all the same settings. But when I run it with 2 GPUs in parallel, it crashes.

Just in case:

pair_style      e3gnn/parallel
pair_coeff      * * 5 ./deployed_parallel_0.pt ./deployed_parallel_1.pt ./deployed_parallel_2.pt ./deployed_parallel_3.pt ./deployed_parallel_4.pt O Si

Here is how I executed it:

mpirun -np $SLURM_NNODES /home/user/bin/lmp_SevenNet_v0p10p3_noD3_cuda118_pytorch240_extra_20250115 -in SevenNet.input

Modules I loaded:

module load gcc/11.2.0
module load openmpi/4.1.1-gcc-milan-a100
module load cuda11.8
module load cudnn/8.1.1.33-11.2-gcc-milan-a100

Let me know what else you need; the geometry is 8400-atom amorphous silica.

@YutackPark
Member

This is incorrect: $SLURM_NNODES. You should pass the number of GPUs you want to use and request the same number of tasks from SLURM. In this case, assuming your GPU node has at least two GPUs: # of tasks = 2, # of GPUs = 2, # of nodes = 1.
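
For that single-node case, a job script could look roughly like this (a sketch only, assuming one node with two GPUs; the binary path and input name are placeholders):

#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gres=gpu:2

mpirun -np $SLURM_NTASKS /path/to/lmp -in SevenNet.input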

However, even if $SLURM_NNODES is passed, it would be '1', and the simulation should then run correctly, although it will use only one GPU. I'll try to reproduce the error ASAP.

ps1. Did you use the exact same MPI module that was used when compiling LAMMPS?
ps2. sevenn_get_model -p will yield a directory deployed_parallel. You can simply put the path to this directory instead of enumerating a long list of parallel potentials.
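
For illustration, the directory form of the pair_coeff line would look something like the sketch below (keeping the same leading layer-count argument and element list as in your original line; adjust the path to wherever deployed_parallel lives):

pair_style      e3gnn/parallel
pair_coeff      * * 5 ./deployed_parallel O Si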

@turbosonics
Author

turbosonics commented Jan 16, 2025

This is incorrect: $SLURM_NNODES. You should pass the number of GPUs you want to use and request the same number of tasks from SLURM. In this case, assuming your GPU node has at least two GPUs: # of tasks = 2, # of GPUs = 2, # of nodes = 1.

However, even if $SLURM_NNODES is passed, it would be '1', and the simulation should then run correctly, although it will use only one GPU. I'll try to reproduce the error ASAP.

Our cluster is old and we aren't rich... ;P It has only one GPU per node. My job script requests 2 GPU nodes with 1 GPU per node. So mpirun -np 1 would just be a serial run while requesting 2 GPUs, and it might even be slower than a real serial run without mpirun. (For serial MD, I execute with lmp -in input.in, without mpirun.)
Let me test different mpirun executions with different options, but mpirun -np 2 gives the same non-numeric pressure crash.

ps1. Did you use the exact same MPI module that was used when compiling LAMMPS?

Yes. My MD job script loads the same modules that I used for compiling. I compiled SevenNet and LAMMPS SevenNet into the same virtual environment (using Python 3.9). In my MD job script, I activate the virtual environment and load the modules.

ps2. sevenn_get_model -p will yield a directory deployed_parallel. You can simply put the path to this directory instead of enumerating a long list of parallel potentials.

I know; I tested it just in case it helped. It didn't. :(

@YutackPark
Member

I'm sorry to hear that. If multiple nodes are involved it becomes much harder to test, but here are some tests I run for debugging purposes (see the sketch after this list). At least we can narrow down the root cause of the problem.

  1. export CUDA_VISIBLE_DEVICES= so that no GPU is detected, then run mpirun -np 2 {lmp} -in {script} to see whether it works on CPU.

  2. Similarly, mpirun -np 1 {lmp} -in {script}, with CPU or GPU. If your cluster allows you to 'ssh' directly into a node, I recommend doing that instead of using a job script, as it is often much easier to debug.
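
As shell commands, the two tests would look roughly like this (a sketch; {lmp} and {script} are the placeholders from above):

export CUDA_VISIBLE_DEVICES=       # hide all GPUs so the pair style falls back to CPU
mpirun -np 2 {lmp} -in {script}    # test 1: two ranks on CPU
mpirun -np 1 {lmp} -in {script}    # test 2: single rank (CPU here; unset the variable to test GPU)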

ps. Are you using CUDA-aware MPI? SevenNet parallel supports CUDA-aware MPI both on and off, and both should work. But to the best of my knowledge, since you are relying on inter-node communication, CUDA-aware MPI won't help that much; it mainly helps intra-node communication between GPUs.

@turbosonics
Author

turbosonics commented Jan 17, 2025

I'm sorry to hear that. If multiple nodes are involved it becomes much harder to test, but here are some tests I run for debugging purposes. At least we can narrow down the root cause of the problem.

  1. export CUDA_VISIBLE_DEVICES= so that no GPU is detected, then run mpirun -np 2 {lmp} -in {script} to see whether it works on CPU.
  2. Similarly, mpirun -np 1 {lmp} -in {script}, with CPU or GPU. If your cluster allows you to 'ssh' directly into a node, I recommend doing that instead of using a job script, as it is often much easier to debug.

I conducted some tests from the console command line. With 2 GPUs, I see the same type of non-numeric pressure crash soon after the MD steps begin. The same MD runs fine with 1 GPU, though.

With export CUDA_VISIBLE_DEVICES=, mpirun -np 1 lmp -in input runs, but very slowly.

With

$ export CUDA_VISIBLE_DEVICES=
$ mpirun -np 2 lmp -in input

Here's the error message from the Slurm error file:

LAMMPS (2 Aug 2023 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0 0 0) to (50.406284 50.298206 50.382206)
  2 by 1 by 1 MPI processor grid
  reading atoms ...
  8400 atoms
  read_data CPU = 0.016 seconds
PairE3GNNParallel using device : CPU
PairE3GNNParallel cuda-aware mpi: True
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: True
Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 10 steps, delay = 0 steps, check = no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 6
  ghost atom cutoff = 6
  binsize = 3, bins = 17 17 17
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair e3gnn/parallel, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.001
[cluster137:09692] *** An error occurred in MPI_Irecv
[cluster137:09692] *** reported by process [1336016897,0]
[cluster137:09692] *** on communicator MPI_COMM_WORLD
[cluster137:09692] *** MPI_ERR_BUFFER: invalid buffer pointer
[cluster137:09692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[cluster137:09692] ***    and potentially your MPI job)

ps. Are you using CUDA-aware MPI? SevenNet parallel supports CUDA-aware MPI both on and off, and both should work. But to the best of my knowledge, since you are relying on inter-node communication, CUDA-aware MPI won't help that much; it mainly helps intra-node communication between GPUs.

All I heard from the person who manages the server is that the OpenMPI build I used is the CUDA-aware version.

@turbosonics
Author

turbosonics commented Jan 21, 2025

Hi devs, sorry to ask this, but if possible, could you test-run 1500 K 1 atm NPT and NVT of any condensed-phase solid material with 2 nodes and 1 GPU per node? That would be the same condition as the MD tests I'm running here.

Also, do you use venv or a similar virtual environment to compile SevenNet and LAMMPS SevenNet for your computing cluster?

I really can't track down the problem behind this strange 2-node crash... On my end, both NVT and NPT crash with 2 nodes (1 GPU per node).

@turbosonics
Author

turbosonics commented Jan 21, 2025

Let me share this too, in case it helps find the source of the crash.

I tried to compile SevenNet and LAMMPS SevenNet using CUDA 11.3 and prebuilt PyTorch 1.12.1. I don't know whether this CUDA version works for SevenNet and LAMMPS SevenNet, but the prebuilt PyTorch is within the required range. I compiled them using venv.

SevenNet compiled without any issue, but the LAMMPS SevenNet configuration fails with the following error message:

-- The CXX compiler identification is GNU 11.2.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /cm/local/apps/gcc/11.2.0/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /cm/shared/userapps/opensource-23/milan-a100/__spack_path_placeholder__/__spack_path_placeholder__/__spack_path_placeholder__/_/linux-rocky8-zen/gcc-8.5.0/git-2.31.1-p3mgayxz32agrpiuzjdkx66bt53u5omm/bin/git (found version "2.31.1")
-- Appending /cm/local/apps/python39/lib:/cm/shared/apps/slurm/current/lib64/slurm:/cm/shared/apps/slurm/current/lib64 to CMAKE_LIBRARY_PATH: /cm/local/apps/python39/lib:/cm/shared/apps/slurm/current/lib64/slurm:/cm/shared/apps/slurm/current/lib64
-- Running check for auto-generated files from make-based build system
-- Running in virtual environment: /home/user/venv_sevennet_v0p10p3_gpu_cuda113_prebuilt_pytorch1121_20250121
   Setting Python interpreter to: /home/user/venv_sevennet_v0p10p3_gpu_cuda113_prebuilt_pytorch1121_20250121/bin/python
-- Found MPI_CXX: /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/libmpi_cxx.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Looking for C++ include omp.h
-- Looking for C++ include omp.h - found
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5") found components: CXX
-- Found GZIP: /usr/bin/gzip
-- Could NOT find FFMPEG (missing: FFMPEG_EXECUTABLE)
-- Looking for C++ include cmath
-- Looking for C++ include cmath - found
-- Generating style headers...
-- Generating package headers...
-- Generating lmpinstalledpkgs.h...
-- Could NOT find ClangFormat (missing: ClangFormat_EXECUTABLE) (Required is at least version "8.0")
-- The following tools and libraries have been found and configured:
 * Git
 * MPI
 * OpenMP

-- <<< Build configuration >>>
   LAMMPS Version:   20230802 stable_2Aug2023_update3-modified
   Operating System: Linux Rocky 8.6
   CMake Version:    3.21.4
   Build type:       RelWithDebInfo
   Install path:     /home/user/.local
   Generator:        Unix Makefiles using /usr/bin/gmake
-- Enabled packages: <None>
-- <<< Compilers and Flags: >>>
-- C++ Compiler:     /cm/local/apps/gcc/11.2.0/bin/c++
      Type:          GNU
      Version:       11.2.0
      C++ Flags:     -O2 -g -DNDEBUG
      Defines:       LAMMPS_SMALLBIG;LAMMPS_MEMALIGN=64;LAMMPS_OMP_COMPAT=4;LAMMPS_GZIP
-- <<< Linker flags: >>>
-- Executable name:  lmp
-- Static library flags:
-- <<< MPI flags >>>
-- MPI_defines:      MPICH_SKIP_MPICXX;OMPI_SKIP_MPICXX;_MPICC_H
-- MPI includes:     /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/include
-- MPI libraries:    /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/libmpi_cxx.so;/cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/libmpi.so;
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found

-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /cm/shared/apps/cuda11.3 (found version "11.3")
-- The CUDA compiler identification is unknown
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - failed
-- Check for working CUDA compiler: /cm/shared/apps/cuda11.3/bin/nvcc
-- Check for working CUDA compiler: /cm/shared/apps/cuda11.3/bin/nvcc - broken
CMake Error at /cm/shared/userapps/opensource-23/milan-a100/__spack_path_placeholder__/__spack_path_placeholder__/__spack_path_placeholder__/_/linux-rocky8-zen/gcc-8.5.0/cmake-3.21.4-v4kvg7me7jzftye6qz4mja3ahurweqpq/share/cmake-3.21/Modules/CMakeTestCUDACompiler.cmake:56 (message):
  The CUDA compiler

    "/cm/shared/apps/cuda11.3/bin/nvcc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /home/user/Sourcecode_LAMMPS_SevenNet_v0p10p3_cuda113_pytorch1121_20250121/build/CMakeFiles/CMakeTmp

    Run Build Command(s):/usr/bin/gmake -f Makefile cmTC_4b4d4/fast && /usr/bin/gmake  -f CMakeFiles/cmTC_4b4d4.dir/build.make CMakeFiles/cmTC_4b4d4.dir/build
    gmake[1]: Entering directory '/home/user/Sourcecode_LAMMPS_SevenNet_v0p10p3_cuda113_pytorch1121_20250121/build/CMakeFiles/CMakeTmp'
    Building CUDA object CMakeFiles/cmTC_4b4d4.dir/main.cu.o
    /cm/shared/apps/cuda11.3/bin/nvcc      -c /home/user/Sourcecode_LAMMPS_SevenNet_v0p10p3_cuda113_pytorch1121_20250121/build/CMakeFiles/CMakeTmp/main.cu -o CMakeFiles/cmTC_4b4d4.dir/main.cu.o
    In file included from /cm/shared/apps/cuda11.3/bin/../targets/x86_64-linux/include/cuda_runtime.h:83,
                     from <command-line>:
    /cm/shared/apps/cuda11.3/bin/../targets/x86_64-linux/include/crt/host_config.h:139:2: error: #error -- unsupported GNU version! gcc versions later than 10 are not supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
      139 | #error -- unsupported GNU version! gcc versions later than 10 are not supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
          |  ^~~~~
    gmake[1]: *** [CMakeFiles/cmTC_4b4d4.dir/build.make:78: CMakeFiles/cmTC_4b4d4.dir/main.cu.o] Error 1
    gmake[1]: Leaving directory '/home/user/Sourcecode_LAMMPS_SevenNet_v0p10p3_cuda113_pytorch1121_20250121/build/CMakeFiles/CMakeTmp'
    gmake: *** [Makefile:127: cmTC_4b4d4/fast] Error 2

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  /home/user/venv_sevennet_v0p10p3_gpu_cuda113_prebuilt_pytorch1121_20250121/lib/python3.9/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:47 (enable_language)
  /home/user/venv_sevennet_v0p10p3_gpu_cuda113_prebuilt_pytorch1121_20250121/lib/python3.9/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /home/user/venv_sevennet_v0p10p3_gpu_cuda113_prebuilt_pytorch1121_20250121/lib/python3.9/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:1083 (find_package)

-- Configuring incomplete, errors occurred!
See also "/home/user/Sourcecode_LAMMPS_SevenNet_v0p10p3_cuda113_pytorch1121_20250121/build/CMakeFiles/CMakeOutput.log".
See also "/home/user/Sourcecode_LAMMPS_SevenNet_v0p10p3_cuda113_pytorch1121_20250121/build/CMakeFiles/CMakeError.log".

I guess CUDA 11.3 may not be compatible with SevenNet or LAMMPS SevenNet? (The nvcc check above fails because CUDA 11.3's nvcc does not support GCC versions later than 10, and the loaded GCC is 11.2.0.)

BTW, LAMMPS SevenNet built with CUDA 11.8 and prebuilt PyTorch 2.4.1, 2.3.0, or 2.3.1 also crashes with 2 and 4 GPUs, with the same error "Non-numeric pressure - simulation unstable"...

@turbosonics
Author

turbosonics commented Jan 21, 2025

Just in case, let me share the Slurm error message from a crash using the official parallel MD example in the source code folder (example_inputs/md_parallel_example). I used two GPU nodes (1 GPU each) and executed with

mpirun -np 2 {lmp} -in in.lmp 

It printed a segfault error.

[cluster131:50634:0:50634] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x155401e20400)
==== backtrace (tid:  50634) ====
 0  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x15548b209894]
 1  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucs.so.0(+0x31a4f) [0x15548b209a4f]
 2  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucs.so.0(+0x31d16) [0x15548b209d16]
 3  /lib64/libc.so.6(+0xcfce3) [0x15550243ece3]
 4  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_dt_pack+0x69) [0x15548b6e9109]
 5  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc45fd) [0x15548b7505fd]
 6  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_ep_am_bcopy+0xd7) [0x15548a2f7b17]
 7  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc66a7) [0x15548b7526a7]
 8  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_tag_send_nbx+0x89c) [0x15548b76297c]
 9  /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf9) [0x15548bbf7bb9]
10  /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/libmpi.so.40(PMPI_Send+0x11b) [0x1555547a745b]
11  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x6270c4]
12  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x86d379]
13  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x5c675a]
14  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x55d579]
15  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x45d5b0]
16  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x45d8ce]
17  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x43cd6d]
18  /lib64/libc.so.6(__libc_start_main+0xf3) [0x1555023a9ca3]
19  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x43df8e]
=================================
[cluster131:50634] *** Process received signal ***
[cluster131:50634] Signal: Segmentation fault (11)
[cluster131:50634] Signal code:  (-6)
[cluster131:50634] Failing at address: 0x47a0000c5ca
[cluster131:50634] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x155554c92ce0]
[cluster131:50634] [ 1] /lib64/libc.so.6(+0xcfce3)[0x15550243ece3]
[cluster131:50634] [ 2] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_dt_pack+0x69)[0x15548b6e9109]
[cluster131:50634] [ 3] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc45fd)[0x15548b7505fd]
[cluster131:50634] [ 4] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_ep_am_bcopy+0xd7)[0x15548a2f7b17]
[cluster131:50634] [ 5] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc66a7)[0x15548b7526a7]
[cluster131:50634] [ 6] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_tag_send_nbx+0x89c)[0x15548b76297c]
[cluster131:50634] [ 7] /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf9)[0x15548bbf7bb9]
[cluster131:50634] [ 8] /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/libmpi.so.40(PMPI_Send+0x11b)[0x1555547a745b]
[cluster131:50634] [ 9] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x6270c4]
[cluster131:50634] [10] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x86d379]
[cluster131:50634] [11] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x5c675a]
[cluster131:50634] [12] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x55d579]
[cluster131:50634] [13] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x45d5b0]
[cluster131:50634] [14] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x45d8ce]
[cluster131:50634] [15] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x43cd6d]
[cluster131:50634] [16] /lib64/libc.so.6(__libc_start_main+0xf3)[0x1555023a9ca3]
[cluster131:50634] [17] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x43df8e]
[cluster131:50634] *** End of error message ***
[cluster133:612533:0:612533] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x155403e20400)
==== backtrace (tid: 612533) ====
 0  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x1554903a6894]
 1  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucs.so.0(+0x31a4f) [0x1554903a6a4f]
 2  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucs.so.0(+0x31d16) [0x1554903a6d16]
 3  /lib64/libc.so.6(+0xcfce3) [0x15550243ece3]
 4  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_dt_pack+0x69) [0x155490886109]
 5  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc45fd) [0x1554908ed5fd]
 6  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_ep_am_bcopy+0xd7) [0x15548b33bb17]
 7  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc66a7) [0x1554908ef6a7]
 8  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_tag_send_nbx+0x89c) [0x1554908ff97c]
 9  /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf9) [0x155490d94bb9]
10  /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/libmpi.so.40(PMPI_Send+0x11b) [0x1555547a745b]
11  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x6270c4]
12  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x86d379]
13  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x5c675a]
14  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x55d579]
15  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x45d5b0]
16  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x45d8ce]
17  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x43cd6d]
18  /lib64/libc.so.6(__libc_start_main+0xf3) [0x1555023a9ca3]
19  /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd() [0x43df8e]
=================================
[cluster133:612533] *** Process received signal ***
[cluster133:612533] Signal: Segmentation fault (11)
[cluster133:612533] Signal code:  (-6)
[cluster133:612533] Failing at address: 0x47a000958b5
[cluster133:612533] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x155554c92ce0]
[cluster133:612533] [ 1] /lib64/libc.so.6(+0xcfce3)[0x15550243ece3]
[cluster133:612533] [ 2] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_dt_pack+0x69)[0x155490886109]
[cluster133:612533] [ 3] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc45fd)[0x1554908ed5fd]
[cluster133:612533] [ 4] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_ep_am_bcopy+0xd7)[0x15548b33bb17]
[cluster133:612533] [ 5] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc66a7)[0x1554908ef6a7]
[cluster133:612533] [ 6] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_tag_send_nbx+0x89c)[0x1554908ff97c]
[cluster133:612533] [ 7] /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf9)[0x155490d94bb9]
[cluster133:612533] [ 8] /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/libmpi.so.40(PMPI_Send+0x11b)[0x1555547a745b]
[cluster133:612533] [ 9] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x6270c4]
[cluster133:612533] [10] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x86d379]
[cluster133:612533] [11] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x5c675a]
[cluster133:612533] [12] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x55d579]
[cluster133:612533] [13] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x45d5b0]
[cluster133:612533] [14] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x45d8ce]
[cluster133:612533] [15] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x43cd6d]
[cluster133:612533] [16] /lib64/libc.so.6(__libc_start_main+0xf3)[0x1555023a9ca3]
[cluster133:612533] [17] /home/user/bin/lmp_SevenNet_v0p10p3_gpu_cuda118_pytorch241_20250121_2nd[0x43df8e]
[cluster133:612533] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node cluster131 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Again, the serial example in the serial example folder (example_inputs/md_serial_example) works fine with 1 GPU, executed with

{lmp} -in in.lmp

I'm using prebuilt PyTorch; could that be a reason behind this?

@YutackPark
Member

Sorry for the late reply.
I recommend setting the environment variable
export OFF_E3GNN_PARALLEL_CUDA_MPI=1
and then trying the following, in this order (see the sketch after the list).

  1. mpirun -np 2 with cpu single node
  2. mpirun -np 2 with cpu two nodes
  3. mpirun -np 2 with gpu two nodes
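
In shell form, the sequence would look roughly like this (a sketch; lmp and in.lmp stand for your binary and input, and the single-node vs. two-node split comes from your SLURM allocation, not from these commands):

export OFF_E3GNN_PARALLEL_CUDA_MPI=1   # route GPU communication through host memory

# 1. single-node allocation, CPU, 2 ranks
export CUDA_VISIBLE_DEVICES=
mpirun -np 2 lmp -in in.lmp

# 2. two-node allocation, CPU, 2 ranks (1 per node)
mpirun -np 2 lmp -in in.lmp

# 3. two-node allocation, GPU, 2 ranks
unset CUDA_VISIBLE_DEVICES
mpirun -np 2 lmp -in in.lmp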

It will no longer use CUDA-aware MPI; communication will be done on the CPU side. From your last error message, the segfault seems related to driver-level issues. For example:

openucx/ucx#7845

Unfortunately, your situation is particularly hard to debug since you're using 2 nodes; there are many possible error sources. We first need the parallel + single-node + 2-process case to succeed, with or without GPU.

@turbosonics
Author

turbosonics commented Jan 27, 2025

First, I set up a venv and activated it. Then I compiled custom PyTorch 2.2.2 using cmake and python setup (for the wheel), SevenNet 0.10.3 (using pip install), and LAMMPS with the SevenNet module (using cmake), following the instructions.

No crashes were observed while compiling those. The Slurm environment and loaded modules are:

$ module load cuda12.1 gcc/11.2.0 openmpi/4.1.1-gcc-milan-a100 cudnn/8.1.1.33-11.2-gcc-milan-a100

Our cluster has only 1 GPU per GPU node, so I typically use

#SBATCH --nodes=2
#SBATCH --gres=gpu:1

to run 2 GPU jobs.

I tested with in.lmp, res.dat, and deployed_parallel/ from the {SevenNet_sourcecode}/example_inputs/md_parallel_example directory, from the console command line after salloc'ing into a 2-node GPU allocation.

salloc -psc_int_gpu -N2 -n2 --gres=gpu:1
  1. Just $ mpirun -np 1 lmp -in in.lmp
LAMMPS (2 Aug 2023 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task

The 'box' command has been removed and will be ignored

Reading data file ...
  triclinic box = (0 0 0) to (10.129786 10.371119 10.263143) with tilt (1.7303548 0 0)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  96 atoms
  read_data CPU = 0.002 seconds
Replication is creating a 2x2x2 = 8 times larger system...
  triclinic box = (0 0 0) to (20.259573 20.742238 20.526287) with tilt (3.4607097 0 0)
  1 by 1 by 1 MPI processor grid
  768 atoms
  replicate CPU = 0.000 seconds
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: True
Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 6
  ghost atom cutoff = 6
  binsize = 3, bins = 8 7 7
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair e3gnn/parallel, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.002
Per MPI rank memory allocation (min/avg/max) = 3.463 | 3.463 | 3.463 Mbytes
   Step         T/CPU          PotEng         KinEng         Volume         Press           Temp
         0   0             -22222.715      49.571266      8625.7383      4085.2502      500
         1   0.067074898   -22222.115      48.973063      8625.7383      4088.3949      493.96624
         2   0.065407594   -22220.723      47.595317      8625.7383      4248.8131      480.06962
         3   0.065605995   -22218.758      45.647226      8625.7383      4532.3227      460.42022
         4   0.065609901   -22216.527      43.440154      8625.7383      4863.6849      438.15862
         5   0.065825567   -22214.398      41.33311       8625.7383      5139.062       416.90594
Loop time of 0.152783 on 1 procs for 5 steps with 768 atoms

Performance: 5.655 ns/day, 4.244 hours/ns, 32.726 timesteps/s, 25.134 katom-step/s
88.1% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.14734    | 0.14734    | 0.14734    |   0.0 | 96.44
Neigh   | 0          | 0          | 0          |   0.0 |  0.00
Comm    | 5.4686e-05 | 5.4686e-05 | 5.4686e-05 |   0.0 |  0.04
Output  | 0.0052524  | 0.0052524  | 0.0052524  |   0.0 |  3.44
Modify  | 8.4507e-05 | 8.4507e-05 | 8.4507e-05 |   0.0 |  0.06
Other   |            | 5.461e-05  |            |       |  0.04

Nlocal:            768 ave         768 max         768 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:           2196 ave        2196 max        2196 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:              0 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:        60672 ave       60672 max       60672 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 60672
Ave neighs/atom = 79
Neighbor list builds = 0
Dangerous builds = 0
Total wall time: 0:00:02
  2. Just $ mpirun -np 2 lmp -in in.lmp
LAMMPS (2 Aug 2023 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task

The 'box' command has been removed and will be ignored

Reading data file ...
  triclinic box = (0 0 0) to (10.129786 10.371119 10.263143) with tilt (1.7303548 0 0)
  1 by 2 by 1 MPI processor grid
  reading atoms ...
  96 atoms
  read_data CPU = 0.002 seconds
Replication is creating a 2x2x2 = 8 times larger system...
  triclinic box = (0 0 0) to (20.259573 20.742238 20.526287) with tilt (3.4607097 0 0)
  1 by 2 by 1 MPI processor grid
  768 atoms
  replicate CPU = 0.001 seconds
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: True
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: True
Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 6
  ghost atom cutoff = 6
  binsize = 3, bins = 8 7 7
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair e3gnn/parallel, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.002
[cluster128:493841:0:493841] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x1553e8020400)
==== backtrace (tid: 493841) ====
 0  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x1554dc7ad894]
 1  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucs.so.0(+0x31a4f) [0x1554dc7ada4f]
 2  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucs.so.0(+0x31d16) [0x1554dc7add16]
 3  /lib64/libc.so.6(+0xcfce3) [0x15553b422ce3]
 4  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_dt_pack+0x69) [0x1554dcc8d109]
 5  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc45fd) [0x1554dccf45fd]
 6  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_ep_am_bcopy+0xd7) [0x1554d7746b17]
 7  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc66a7) [0x1554dccf66a7]
 8  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_tag_send_nbx+0x89c) [0x1554dcd0697c]
 9  /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf9) [0x1554dd19bbb9]
10  /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/libmpi.so.40(PMPI_Send+0x11b) [0x1555547a745b]
11  lmp() [0x6bfca4]
12  lmp() [0x909856]
13  lmp() [0x5cd89a]
14  lmp() [0x565724]
15  lmp() [0x4696e6]
16  lmp() [0x4699ce]
17  lmp() [0x44a0fd]
18  /lib64/libc.so.6(__libc_start_main+0xf3) [0x15553b38dca3]
19  lmp() [0x44b37e]
=================================
[cluster128:493841] *** Process received signal ***
[cluster128:493841] Signal: Segmentation fault (11)
[cluster128:493841] Signal code:  (-6)
[cluster128:493841] Failing at address: 0x47a00078911
[cluster128:493841] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x155554c92ce0]
[cluster128:493841] [ 1] /lib64/libc.so.6(+0xcfce3)[0x15553b422ce3]
[cluster128:493841] [ 2] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_dt_pack+0x69)[0x1554dcc8d109]
[cluster128:493841] [ 3] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc45fd)[0x1554dccf45fd]
[cluster128:493841] [ 4] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_ep_am_bcopy+0xd7)[0x1554d7746b17]
[cluster128:493841] [ 5] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc66a7)[0x1554dccf66a7]
[cluster128:493841] [ 6] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_tag_send_nbx+0x89c)[0x1554dcd0697c]
[cluster128:493841] [ 7] /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf9)[0x1554dd19bbb9]
[cluster128:493841] [ 8] /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/libmpi.so.40(PMPI_Send+0x11b)[0x1555547a745b]
[cluster128:493841] [ 9] lmp[0x6bfca4]
[cluster128:493841] [10] lmp[0x909856]
[cluster128:493841] [11] lmp[0x5cd89a]
[cluster128:493841] [12] lmp[0x565724]
[cluster128:493841] [13] lmp[0x4696e6]
[cluster128:493841] [14] lmp[0x4699ce]
[cluster128:493841] [15] lmp[0x44a0fd]
[cluster128:493841] [16] /lib64/libc.so.6(__libc_start_main+0xf3)[0x15553b38dca3]
[cluster128:493841] [17] lmp[0x44b37e]
[cluster128:493841] *** End of error message ***
[cluster125:518507:0:518507] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x1553e6020400)
==== backtrace (tid: 518507) ====
 0  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x1554d7617894]
 1  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucs.so.0(+0x31a4f) [0x1554d7617a4f]
 2  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucs.so.0(+0x31d16) [0x1554d7617d16]
 3  /lib64/libc.so.6(+0xcfce3) [0x15553b422ce3]
 4  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_dt_pack+0x69) [0x1554d7af7109]
 5  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc45fd) [0x1554d7b5e5fd]
 6  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_ep_am_bcopy+0xd7) [0x1554d6705b17]
 7  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc66a7) [0x1554d7b606a7]
 8  /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_tag_send_nbx+0x89c) [0x1554d7b7097c]
 9  /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf9) [0x1554dc15abb9]
10  /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/libmpi.so.40(PMPI_Send+0x11b) [0x1555547a745b]
11  lmp() [0x6bfca4]
12  lmp() [0x909856]
13  lmp() [0x5cd89a]
14  lmp() [0x565724]
15  lmp() [0x4696e6]
16  lmp() [0x4699ce]
17  lmp() [0x44a0fd]
18  /lib64/libc.so.6(__libc_start_main+0xf3) [0x15553b38dca3]
19  lmp() [0x44b37e]
=================================
[cluster125:518507] *** Process received signal ***
[cluster125:518507] Signal: Segmentation fault (11)
[cluster125:518507] Signal code:  (-6)
[cluster125:518507] Failing at address: 0x47a0007e96b
[cluster125:518507] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x155554c92ce0]
[cluster125:518507] [ 1] /lib64/libc.so.6(+0xcfce3)[0x15553b422ce3]
[cluster125:518507] [ 2] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_dt_pack+0x69)[0x1554d7af7109]
[cluster125:518507] [ 3] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc45fd)[0x1554d7b5e5fd]
[cluster125:518507] [ 4] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_ep_am_bcopy+0xd7)[0x1554d6705b17]
[cluster125:518507] [ 5] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(+0xc66a7)[0x1554d7b606a7]
[cluster125:518507] [ 6] /cm/shared/userapps/server/external/milan/ucx-v2/1.14.1-gcc11.2.0/lib/libucp.so.0(ucp_tag_send_nbx+0x89c)[0x1554d7b7097c]
[cluster125:518507] [ 7] /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf9)[0x1554dc15abb9]
[cluster125:518507] [ 8] /cm/shared/userapps/server/external/milan-a100/openmpi/4.1.6-gcc11.2.0/lib/libmpi.so.40(PMPI_Send+0x11b)[0x1555547a745b]
[cluster125:518507] [ 9] lmp[0x6bfca4]
[cluster125:518507] [10] lmp[0x909856]
[cluster125:518507] [11] lmp[0x5cd89a]
[cluster125:518507] [12] lmp[0x565724]
[cluster125:518507] [13] lmp[0x4696e6]
[cluster125:518507] [14] lmp[0x4699ce]
[cluster125:518507] [15] lmp[0x44a0fd]
[cluster125:518507] [16] /lib64/libc.so.6(__libc_start_main+0xf3)[0x15553b38dca3]
[cluster125:518507] [17] lmp[0x44b37e]
[cluster125:518507] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node cluster125 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
  1. "export OFF_E3GNN_PARALLEL_CUDA_MPI=1" then mpirun -np 1 lmp -in.lmp
LAMMPS (2 Aug 2023 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task

The 'box' command has been removed and will be ignored

Reading data file ...
  triclinic box = (0 0 0) to (10.129786 10.371119 10.263143) with tilt (1.7303548 0 0)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  96 atoms
  read_data CPU = 0.002 seconds
Replication is creating a 2x2x2 = 8 times larger system...
  triclinic box = (0 0 0) to (20.259573 20.742238 20.526287) with tilt (3.4607097 0 0)
  1 by 1 by 1 MPI processor grid
  768 atoms
  replicate CPU = 0.000 seconds
cuda-aware mpi not found, communicate via host device
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: False
Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 6
  ghost atom cutoff = 6
  binsize = 3, bins = 8 7 7
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair e3gnn/parallel, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.002
Per MPI rank memory allocation (min/avg/max) = 3.463 | 3.463 | 3.463 Mbytes
   Step         T/CPU          PotEng         KinEng         Volume         Press           Temp
         0   0             -22222.715      49.571266      8625.7383      4085.2506      500
         1   0.059257535   -22222.115      48.973063      8625.7383      4088.3937      493.96624
         2   0.058174952   -22220.723      47.595317      8625.7383      4248.8101      480.06962
         3   0.056188694   -22218.758      45.647227      8625.7383      4532.3255      460.42023
         4   0.063291061   -22216.527      43.440155      8625.7383      4863.687       438.15862
         5   0.064231404   -22214.398      41.33311       8625.7383      5139.061       416.90594
Loop time of 0.167517 on 1 procs for 5 steps with 768 atoms

Performance: 5.158 ns/day, 4.653 hours/ns, 29.848 timesteps/s, 22.923 katom-step/s
78.2% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.162      | 0.162      | 0.162      |   0.0 | 96.71
Neigh   | 0          | 0          | 0          |   0.0 |  0.00
Comm    | 5.9083e-05 | 5.9083e-05 | 5.9083e-05 |   0.0 |  0.04
Output  | 0.0053124  | 0.0053124  | 0.0053124  |   0.0 |  3.17
Modify  | 8.5556e-05 | 8.5556e-05 | 8.5556e-05 |   0.0 |  0.05
Other   |            | 5.915e-05  |            |       |  0.04

Nlocal:            768 ave         768 max         768 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:           2196 ave        2196 max        2196 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:              0 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:        60672 ave       60672 max       60672 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 60672
Ave neighs/atom = 79
Neighbor list builds = 0
Dangerous builds = 0
Total wall time: 0:00:02
  1. "export OFF_E3GNN_PARALLEL_CUDA_MPI=1" then mpirun -np 2 lmp -in.lmp
LAMMPS (2 Aug 2023 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task

The 'box' command has been removed and will be ignored

Reading data file ...
  triclinic box = (0 0 0) to (10.129786 10.371119 10.263143) with tilt (1.7303548 0 0)
  1 by 2 by 1 MPI processor grid
  reading atoms ...
  96 atoms
  read_data CPU = 0.002 seconds
Replication is creating a 2x2x2 = 8 times larger system...
  triclinic box = (0 0 0) to (20.259573 20.742238 20.526287) with tilt (3.4607097 0 0)
  1 by 2 by 1 MPI processor grid
  768 atoms
  replicate CPU = 0.001 seconds
cuda-aware mpi not found, communicate via host device
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: False
cuda-aware mpi not found, communicate via host device
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: False
Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 6
  ghost atom cutoff = 6
  binsize = 3, bins = 8 7 7
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair e3gnn/parallel, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.002
Per MPI rank memory allocation (min/avg/max) = 3.344 | 3.344 | 3.344 Mbytes
   Step         T/CPU          PotEng         KinEng         Volume         Press           Temp
         0   0             -22222.717      49.571266      8625.7383      4085.2498      500
         1   0.065596869   -22222.114      48.973063      8625.7383      4088.3967      493.96624
         2   0.064871558   -22220.725      47.595317      8625.7383      4248.811       480.06962
         3   0.065833931   -22218.757      45.647227      8625.7383      4532.3258      460.42023
         4   0.065820916   -22216.527      43.440155      8625.7383      4863.686       438.15862
         5   0.062747506   -22214.399      41.33311       8625.7383      5139.0638      416.90594
Loop time of 0.15456 on 2 procs for 5 steps with 768 atoms

Performance: 5.590 ns/day, 4.293 hours/ns, 32.350 timesteps/s, 24.845 katom-step/s
95.3% CPU use with 2 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.15113    | 0.15115    | 0.15116    |   0.0 | 97.79
Neigh   | 0          | 0          | 0          |   0.0 |  0.00
Comm    | 0.00023753 | 0.00025382 | 0.00027011 |   0.0 |  0.16
Output  | 0.0029648  | 0.0029992  | 0.0030336  |   0.1 |  1.94
Modify  | 7.0852e-05 | 7.3768e-05 | 7.6684e-05 |   0.0 |  0.05
Other   |            | 8.792e-05  |            |       |  0.06

Nlocal:            384 ave         384 max         384 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Nghost:           1637 ave        1637 max        1637 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Neighs:              0 ave           0 max           0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
FullNghs:        30336 ave       30336 max       30336 min
Histogram: 2 0 0 0 0 0 0 0 0 0

Total # of neighbors = 60672
Ave neighs/atom = 79
Neighbor list builds = 0
Dangerous builds = 0
Total wall time: 0:00:02
  1. "export CUDA_VISIBLE_DEVICES=" and "export OFF_E3GNN_PARALLEL_CUDA_MPI=1" then mpirun -np 1 lmp -in.lmp
LAMMPS (2 Aug 2023 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task

The 'box' command has been removed and will be ignored

Reading data file ...
  triclinic box = (0 0 0) to (10.129786 10.371119 10.263143) with tilt (1.7303548 0 0)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  96 atoms
  read_data CPU = 0.003 seconds
Replication is creating a 2x2x2 = 8 times larger system...
  triclinic box = (0 0 0) to (20.259573 20.742238 20.526287) with tilt (3.4607097 0 0)
  1 by 1 by 1 MPI processor grid
  768 atoms
  replicate CPU = 0.000 seconds
PairE3GNNParallel using device : CPU
PairE3GNNParallel cuda-aware mpi: False
Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 6
  ghost atom cutoff = 6
  binsize = 3, bins = 8 7 7
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair e3gnn/parallel, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.002
Per MPI rank memory allocation (min/avg/max) = 3.463 | 3.463 | 3.463 Mbytes
   Step         T/CPU          PotEng         KinEng         Volume         Press           Temp
         0   0             -22222.717      49.571266      8625.7383      4085.2502      500
         1   0.0027308085  -22222.113      48.973063      8625.7383      4088.3953      493.96624
         2   0.0027373388  -22220.725      47.595317      8625.7383      4248.8091      480.06962
         3   0.0027423737  -22218.758      45.647226      8625.7383      4532.3281      460.42022
         4   0.0027388049  -22216.527      43.440154      8625.7383      4863.6816      438.15861
         5   0.002755007   -22214.4        41.333109      8625.7383      5139.0583      416.90593
Loop time of 3.64967 on 1 procs for 5 steps with 768 atoms

Performance: 0.237 ns/day, 101.380 hours/ns, 1.370 timesteps/s, 1.052 katom-step/s
86.7% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 3.6434     | 3.6434     | 3.6434     |   0.0 | 99.83
Neigh   | 0          | 0          | 0          |   0.0 |  0.00
Comm    | 0.0001158  | 0.0001158  | 0.0001158  |   0.0 |  0.00
Output  | 0.005878   | 0.005878   | 0.005878   |   0.0 |  0.16
Modify  | 0.00016175 | 0.00016175 | 0.00016175 |   0.0 |  0.00
Other   |            | 6.991e-05  |            |       |  0.00

Nlocal:            768 ave         768 max         768 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:           2196 ave        2196 max        2196 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:              0 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:        60672 ave       60672 max       60672 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 60672
Ave neighs/atom = 79
Neighbor list builds = 0
Dangerous builds = 0
Total wall time: 0:00:04
  1. "export CUDA_VISIBLE_DEVICES=" and "export OFF_E3GNN_PARALLEL_CUDA_MPI=1" then mpirun -np 2 lmp -in.lmp
LAMMPS (2 Aug 2023 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task

The 'box' command has been removed and will be ignored

Reading data file ...
  triclinic box = (0 0 0) to (10.129786 10.371119 10.263143) with tilt (1.7303548 0 0)
  1 by 2 by 1 MPI processor grid
  reading atoms ...
  96 atoms
  read_data CPU = 0.002 seconds
Replication is creating a 2x2x2 = 8 times larger system...
  triclinic box = (0 0 0) to (20.259573 20.742238 20.526287) with tilt (3.4607097 0 0)
  1 by 2 by 1 MPI processor grid
  768 atoms
  replicate CPU = 0.001 seconds
PairE3GNNParallel using device : CPU
PairE3GNNParallel cuda-aware mpi: False
cuda-aware mpi not found, communicate via host device
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: False
Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 6
  ghost atom cutoff = 6
  binsize = 3, bins = 8 7 7
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair e3gnn/parallel, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.002
Per MPI rank memory allocation (min/avg/max) = 3.344 | 3.344 | 3.344 Mbytes
   Step         T/CPU          PotEng         KinEng         Volume         Press           Temp
         0   0             -22222.717      49.571266      8625.7383      4085.2515      500
         1   0.0042590339  -22222.114      48.973063      8625.7383      4088.3922      493.96624
         2   0.0043118229  -22220.724      47.595317      8625.7383      4248.8161      480.06962
         3   0.0043031226  -22218.756      45.647226      8625.7383      4532.3296      460.42022
         4   0.0042611664  -22216.527      43.440154      8625.7383      4863.6858      438.15862
         5   0.0041990195  -22214.401      41.33311       8625.7383      5139.0571      416.90594
Loop time of 2.34462 on 2 procs for 5 steps with 768 atoms
Performance: 0.369 ns/day, 65.128 hours/ns, 2.133 timesteps/s, 1.638 katom-step/s
92.4% CPU use with 2 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 2.1157     | 2.2281     | 2.3405     |   7.5 | 95.03
Neigh   | 0          | 0          | 0          |   0.0 |  0.00
Comm    | 0.0002515  | 0.11262    | 0.22499    |  33.5 |  4.80
Output  | 0.0036482  | 0.003689   | 0.0037298  |   0.1 |  0.16
Modify  | 0.00012215 | 0.0001266  | 0.00013104 |   0.0 |  0.01
Other   |            | 0.0001017  |            |       |  0.00

Nlocal:            384 ave         384 max         384 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Nghost:           1637 ave        1637 max        1637 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Neighs:              0 ave           0 max           0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
FullNghs:        30336 ave       30336 max       30336 min
Histogram: 2 0 0 0 0 0 0 0 0 0

Total # of neighbors = 60672
Ave neighs/atom = 79
Neighbor list builds = 0
Dangerous builds = 0
Total wall time: 0:00:04

Hope these help. Please let me know if you have any further requests.

@YutackPark
Member

Huge thanks for the very detailed report. Since the only non-working case seems to be CUDA-aware MPI + multi-node + multi-GPU, I'll try to reproduce it locally.

Before then, you can simply use export OFF_E3GNN_PARALLEL_CUDA_MPI=1 to use multiple GPUs. I think in this particular case, where communication is necessarily done over the inter-node cable, CUDA-aware MPI won't help that much (I'm not sure, as this setup is uncommon). The reason you're not seeing a speed-up is the size of the test system: you need a lot of atoms to fully utilize each GPU, and if the GPUs are underutilized due to a small number of atoms, there is no speed-up.

Thanks again for the detailed report. If you don't want to export that variable, you can simply recompile LAMMPS with an MPI that is not CUDA-aware.
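
For reference, a minimal sketch of how that workaround could look in a SLURM job script for 2 nodes with 1 GPU each. The module names, the lmp binary name, and in.lmp are placeholders for whatever your cluster actually provides:

#!/bin/bash
#SBATCH --nodes=2                # two GPU nodes
#SBATCH --ntasks-per-node=1      # one MPI rank per node
#SBATCH --gres=gpu:1             # one GPU per node

module load gcc cuda openmpi     # placeholder module names

# Skip the CUDA-aware MPI code path; stage GPU-GPU communication through the CPU.
export OFF_E3GNN_PARALLEL_CUDA_MPI=1

mpirun -np 2 lmp -in in.lmp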

@turbosonics
Author

turbosonics commented Jan 28, 2025

Huge thanks for the very detailed report. Since the only non-working case seems to be CUDA-aware MPI + multi-node + multi-GPU, I'll try to reproduce it locally.

Before then, you can simply use export OFF_E3GNN_PARALLEL_CUDA_MPI=1 to use multiple GPUs. I think in this particular case, where communication is necessarily done over the inter-node cable, CUDA-aware MPI won't help that much (I'm not sure, as this setup is uncommon). The reason you're not seeing a speed-up is the size of the test system: you need a lot of atoms to fully utilize each GPU, and if the GPUs are underutilized due to a small number of atoms, there is no speed-up.

Thanks again for the detailed report. If you don't want to export that variable, you can simply recompile LAMMPS with an MPI that is not CUDA-aware.

Thank you.

I understand the official example uses a small number of atoms. I will test bigger systems with "export OFF_E3GNN_PARALLEL_CUDA_MPI=1".

The reason I hope to utilize multiple GPUs is to overcome OOM crashes. On our server, 8k to 9k atoms for a condensed-phase solid material, or 5k to 7k water molecules, is the maximum with one GPU node before facing an OOM crash. Any bigger system requires 2+ GPU nodes. (All MDs for my system used the July 11 pre-trained model; for the example folder tests, I used the pre-trained model in the example folder.)

But then, what exactly does "export OFF_E3GNN_PARALLEL_CUDA_MPI=1" do, and how much would it impact simulation speed compared to the "normal" case?

Let me know if you need anything more, and I hope a bypass or patch comes soon.

Thanks

@YutackPark
Member

Sorry for the lack of explanation. OFF_E3GNN_PARALLEL_CUDA_MPI determines which communication routine SevenNet will use. If MPI is compiled as CUDA-aware, SevenNet can communicate between GPUs without transferring data through the CPU, which improves parallel performance. If MPI is not compiled as CUDA-aware, the communication routine becomes something like GPU1 -> CPU (process 1) -> CPU (process 2) -> GPU2, which is inefficient.

These two approaches have different code paths to execute, and the default is to use CUDA-aware MPI whenever it is available, since it is faster. OFF_E3GNN_PARALLEL_CUDA_MPI overrides this choice and forces the data communication through the CPU.
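
To see which path you actually got: OpenMPI itself can report whether it was built with CUDA support, and the pair style echoes its choice at startup (the "PairE3GNNParallel cuda-aware mpi: True/False" lines in your logs). A quick check, assuming an OpenMPI toolchain:

# Prints ...mpi_built_with_cuda_support:value:true if this OpenMPI build is CUDA-aware
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

# What e3gnn/parallel actually selected is printed in log.lammps:
#   PairE3GNNParallel cuda-aware mpi: True   -> direct GPU-to-GPU communication
#   PairE3GNNParallel cuda-aware mpi: False  -> data staged through the CPU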

@turbosonics
Author

Sorry for the lack of explanation. OFF_E3GNN_PARALLEL_CUDA_MPI determines which communication routine SevenNet will use. If MPI is compiled as CUDA-aware, SevenNet can communicate between GPUs without transferring data through the CPU, which improves parallel performance. If MPI is not compiled as CUDA-aware, the communication routine becomes something like GPU1 -> CPU (process 1) -> CPU (process 2) -> GPU2, which is inefficient.

These two approaches have different code paths to execute, and the default is to use CUDA-aware MPI whenever it is available, since it is faster. OFF_E3GNN_PARALLEL_CUDA_MPI overrides this choice and forces the data communication through the CPU.

Thanks. Then, would this be an OpenMPI problem? But our server admin assured me that the OpenMPI we are using is a CUDA-aware one... Anyway, I hope this problem gets resolved ASAP. Thanks!

@YutackPark
Member

Then, would this be an OpenMPI problem?
Yes. SevenNet can use both CUDA-aware and non-CUDA-aware MPI, but incorrect use of an MPI module can be a problem. Modules in supercomputer centers are not guaranteed to work on every compute node they have. Modules are usually hardware-specific, and each node can have different hardware. It is possible that certain modules are targeted to be used on specific nodes (partitions).

@turbosonics
Author

Yes. SevenNet can use both CUDA-aware and non-CUDA-aware MPI, but incorrect use of an MPI module can be a problem. Modules in supercomputer centers are not guaranteed to work on every compute node they have. Modules are usually hardware-specific, and each node can have different hardware. It is possible that certain modules are targeted to be used on specific nodes (partitions).

Thanks. Any recommended OpenMPI or other MPI modules for SevenNet? I know it is hardware-specific, but there are so many versions; I would at least like to know which MPI would be a good choice to start testing with.

Meanwhile, if you guys could conduct tests on this to narrow it down further, that would be great. Thank you.

@YutackPark
Member

YutackPark commented Jan 29, 2025

CUDA-aware MPI + CUDA + 2 GPUs + md_parallel_example turned out to be safe. The log:

PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: True
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: True
Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 6
  ghost atom cutoff = 6
  binsize = 3, bins = 8 7 7
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair e3gnn/parallel, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.002
[W129 14:58:54.649093819 TensorAdvancedIndexing.cpp:231] Warning: The reduce argument of torch.scatter with Tensor src is deprecated and will be removed in a future PyTorch release. Use torch.scatter_reduce instead for more reduction options. (function operator())
[W129 14:58:54.649151820 TensorAdvancedIndexing.cpp:231] Warning: The reduce argument of torch.scatter with Tensor src is deprecated and will be removed in a future PyTorch release. Use torch.scatter_reduce instead for more reduction options. (function operator())
Per MPI rank memory allocation (min/avg/max) = 3.344 | 3.344 | 3.344 Mbytes
   Step         T/CPU          PotEng         KinEng         Volume         Press           Temp
         0   0             -22222.717      49.571266      8625.7383      4085.2407      500
         1   0.054480366   -22222.114      48.973063      8625.7383      4088.391       493.96624
         2   0.051336446   -22220.725      47.595317      8625.7383      4248.8111      480.06961
         3   0.052770978   -22218.757      45.647225      8625.7383      4532.3253      460.42021
         4   0.051208429   -22216.527      43.440153      8625.7383      4863.6884      438.15861
         5   0.05160953    -22214.399      41.33311       8625.7383      5139.0678      416.90593
Loop time of 0.192437 on 2 procs for 5 steps with 768 atoms

Performance: 4.490 ns/day, 5.345 hours/ns, 25.983 timesteps/s, 19.955 katom-step/s
98.8% CPU use with 2 MPI tasks x 1 OpenMP threads

Maybe I could test multi-node with this, but I recommend you consult with your server administrator. e3gnn/parallel relies on the MPI backend for communication and has no code or assumptions specific to multi-node setups.

Any recommended OpenMPI or other MPI modules for SevenNet?

I don't have a specific recommendation. One day I tried to compile OpenMPI from source as an exercise and failed :P It was out of my domain.
However, if you have an 'NVHPC' module installed on your cluster, try it. It is a development kit distributed by NVIDIA and includes an MPI compiled with CUDA support, and I'm sure they have handled all the dependency issues since they offer it as a single package (they provide CUDA, the compiler, and MPI). Note that it may be tricky to compile LAMMPS with it. In my case, its CUDA was hidden in some folder and LAMMPS failed to detect it automatically, so I configured it manually.

@turbosonics
Author

turbosonics commented Jan 29, 2025

Maybe I could test multi-node with this, but I recommend you consult with your server administrator. e3gnn/parallel relies on the MPI backend for communication and has no code or assumptions specific to multi-node setups.

Thanks for your test. Could you perform tests with 2 nodes while utilizing only 1 GPU on each node? That would be similar to our environment.

I don't have a specific recommendation. One day I tried to compile OpenMPI from source as an exercise and failed :P It was out of my domain.
However, if you have an 'NVHPC' module installed on your cluster, try it. It is a development kit distributed by NVIDIA and includes an MPI compiled with CUDA support, and I'm sure they have handled all the dependency issues since they offer it as a single package (they provide CUDA, the compiler, and MPI). Note that it may be tricky to compile LAMMPS with it. In my case, its CUDA was hidden in some folder and LAMMPS failed to detect it automatically, so I configured it manually.

I will contact server admin about NVHPC. Thanks.

@anveshnathaniou

Hello,
I successfully completed a simulation with 12,000 atoms. However, when I attempt to run it with 50,000 atoms, the following error persists. Do you have any suggestions on how to scale up to 100,000 atoms using a GPU? I can run the same simulation on a CPU without errors, but it is impractical due to the extremely slow performance.

##############################################################################################

RuntimeError: CUDA out of memory. Tried to allocate 13.71 GiB. GPU 0 has a total capacity of 79.10 GiB of which 5.27 GiB is free. Including non-PyTorch memory, this process has 73.82 GiB memory in use. Of the allocated memory 69.16 GiB is allocated by PyTorch, and 3.89 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

##############################################################################################
Reading data file ...
orthogonal box = (0 0 0) to (75.442637 130.67048 61.598655)
2 by 2 by 1 MPI processor grid
reading atoms ...
54000 atoms
read_data CPU = 0.149 seconds
cuda-aware mpi not found, communicate via host device
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: False
cuda-aware mpi not found, communicate via host device
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: False
cuda-aware mpi not found, communicate via host device
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: False
cuda-aware mpi not found, communicate via host device
PairE3GNNParallel using device : CUDA
PairE3GNNParallel cuda-aware mpi: False
Changing box ...
orthogonal box = (7.5442637 0 0) to (67.898373 130.67048 61.598655)
orthogonal box = (7.5442637 13.067048 0) to (67.898373 117.60343 61.598655)
orthogonal box = (7.5442637 13.067048 6.1598655) to (67.898373 117.60343 55.438789)
Changing box ...
orthogonal box = (7.5065424 13.067048 6.1598655) to (67.936094 117.60343 55.438789)
orthogonal box = (7.5065424 13.001713 6.1598655) to (67.936094 117.66877 55.438789)
orthogonal box = (7.5065424 13.001713 6.1290662) to (67.936094 117.66877 55.469589)
WARNING: No fixes with time integration, atoms won't move (src/verlet.cpp:60)
Generated 0 of 3 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 1 steps, delay = 0 steps, check = yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 5.3
ghost atom cutoff = 5.3
binsize = 2.65, bins = 23 40 19
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair e3gnn/parallel, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.001
lammps out

@anveshnathaniou

These are my configs
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
mpirun lmp_mpi -in lmp_0K.in

@turbosonics
Author

These are my configs #SBATCH --nodes=2 #SBATCH --gres=gpu:2 #SBATCH --ntasks-per-node=2 mpirun lmp_mpi -in lmp_0K.in

I think you should open a new issue about this instead of replying to someone else's issue on a different topic. But let me answer anyway. An OOM crash means you need to request more GPU nodes for that job. Each GPU node has a limited amount of memory, and the memory from two nodes isn't enough to describe all the atoms and their interactions.
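
As a rough sketch only (keeping your lmp_mpi binary and lmp_0K.in input, and assuming 2 GPUs per node), requesting more nodes would look something like this; the PYTORCH_CUDA_ALLOC_CONF line is just what the OOM message itself suggests trying:

#SBATCH --nodes=4                # more nodes -> more total GPU memory
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2      # one MPI rank per GPU

# Optional: suggested by the CUDA OOM message to reduce allocator fragmentation.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

mpirun lmp_mpi -in lmp_0K.in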

@turbosonics
Author

One more question: instead of "best" or "optimal" options, could you just share which versions of OpenMPI (or another MPI) and cuDNN are used in any working SevenNet-LAMMPS build? I hope to test different OpenMPI versions, but blind-testing everything is crazy; I hope to have some starting point...

@YutackPark
Member

Hi @turbosonics

In my case, OpenMPI versions were fine all the time. What you used in this issue is also fine except for CUDA-aware MPI.
For the version with which I successfully ran CUDA-aware MPI with multiple GPUs (not multi-node though; our local cluster):

[parkyutack@odin ~]$ module load NV_HPC/22.7
[parkyutack@odin ~]$ which mpirun
/TGM/Apps/NVHPC/2022_227/Linux_x86_64/22.7/comm_libs/mpi/bin/mpirun
[parkyutack@odin ~]$ mpirun --version
mpirun (Open MPI) 3.1.5

Report bugs to http://www.open-mpi.org/community/help/

As the path shows, this is the OpenMPI embedded in the NV_HPC kit. For the version I used for the experiments in the paper (where multi-node worked successfully), the paper states the following:

All benchmarks utilize a GPU cluster system comprising HPE Apollo 6500 Gen10 nodes interconnected via InfiniBand HDR200. Each node is equipped with 8 NVIDIA A100 80 GB GPUs interconnected with NVLink. LAMMPS (version 23Jun2022 - Update 4) is compiled using OpenMPI/4.1.2, configured for CUDA-aware MPI and integrated with the LibTorch library from PyTorch/1.12.0. Compilation utilizes the nvcc compiler of CUDA/11.6.2 and gcc/9.4.0. One CPU core is assigned to one GPU card and MPI rank.
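
If your cluster staff ever rebuild MPI, the CUDA-aware part of that setup corresponds roughly to configuring OpenMPI against the CUDA toolkit. This is only a sketch; the paths and versions are placeholders, and an InfiniBand system may additionally need a CUDA-enabled UCX:

# Build OpenMPI with CUDA support (all paths/versions are placeholders).
./configure --prefix=$HOME/openmpi-4.1.2-cuda --with-cuda=/usr/local/cuda-11.6
make -j8 && make install

# Verify the resulting build:
$HOME/openmpi-4.1.2-cuda/bin/ompi_info --parsable --all | grep mpi_built_with_cuda_support:value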

@turbosonics
Author

Hi @turbosonics

In my case, OpenMPI versions were fine all the time. What you used in this issue is also fine except for CUDA-aware MPI. For the version with which I successfully ran CUDA-aware MPI with multiple GPUs (not multi-node though; our local cluster):

[parkyutack@odin ~]$ module load NV_HPC/22.7
[parkyutack@odin ~]$ which mpirun
/TGM/Apps/NVHPC/2022_227/Linux_x86_64/22.7/comm_libs/mpi/bin/mpirun
[parkyutack@odin ~]$ mpirun --version
mpirun (Open MPI) 3.1.5

Thank you so much for the additional info. Could you conduct the same LAMMPS-SevenNet tests with 2 or 4 GPU nodes, using 1 GPU on each node? I just hope to isolate the cause further.

Given how crazily expensive GPUs are these days, I'm sure some people will try to use SevenNet in a similar or worse environment, and I think it would be great for SevenNet to make sure this case works...

As the path shows, this is the OpenMPI embedded in the NV_HPC kit. For the version I used for the experiments in the paper (where multi-node worked successfully), the paper states the following:

All benchmarks utilize a GPU cluster system comprising HPE Apollo 6500 Gen10 nodes interconnected via InfiniBand HDR200. Each node is equipped with 8 NVIDIA A100 80 GB GPUs interconnected with NVLink. LAMMPS (version 23Jun2022 - Update 4) is compiled using OpenMPI/4.1.2, configured for CUDA-aware MPI and integrated with the LibTorch library from PyTorch/1.12.0. Compilation utilizes the nvcc compiler of CUDA/11.6.2 and gcc/9.4.0. One CPU core is assigned to one GPU card and MPI rank.

Thanks for the reminder. I will discuss NVHPC with our server admin.

On our cluster, I've been testing with OpenMPI 4.1.1 & 4.1.7 + CUDA 12.1 + PyTorch 2.2.2 (the prebuilt one, and I also tried a manually compiled one) + cuDNN 8.1.1 + gcc/11.2.0.

I tested both OpenMPI 4.1.1 and 4.1.7 (installed by the server admin), and I tried both the prebuilt and a manually compiled PyTorch 2.2.2 under each OpenMPI. All tests were conducted from separate venvs, and I saw the same thing in every result: multi-GPU-node runs only work when the CPU performs the node-to-node communication.

Since the two OpenMPI versions show the same issue, I suspect the problem is on the OpenMPI side, but I'm not sure of the exact reason...

@YutackPark YutackPark changed the title LAMMPS parallel MD simulations crashes soon after begin LAMMPS parallel MD simulations crashes soon after begin [multi-node + CUDA-aware MPI] Jan 31, 2025
@turbosonics
Author

@YutackPark

Sorry to ask this again, but could you please test the example on 2 or 4 GPU nodes with one GPU each, and see whether the current version of SevenNet-LAMMPS works in your environment or not?

For the last 2 weeks, I've been testing various combinations of OpenMPI and PyTorch versions, but all of them failed. It feels like putting in more effort is becoming a waste of time.

All I hope to see is confirmation: confirmation that this same crash happens under different hardware conditions but with a similar GPU setup. Then I can peacefully stop this endless testing, and I can tell our server people that this is not a problem on our side.

@turbosonics
Author

@YutackPark
We now have an NVHPC module available on our local cluster. As expected, the crash occurs during LAMMPS-SevenNet configuration. May I ask how you set up your CMake variables?

I loaded the nvhpc and cmake modules, then modified the nvhpc.cmake preset for nvc++, nvc, nvfortran, and mpicxx, and used that preset.

But the configuration error says nvcc is broken, even though $ nvcc --version prints proper output:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

Would it be possible that the NVHPC on our cluster uses CUDA 10 instead of CUDA 11 or 12?
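
For what it's worth, the nvcc banner above reports CUDA 10.2 (built in 2019), which looks like a system-wide CUDA rather than the one bundled inside NVHPC, so my next attempt is to point CMake at the NVHPC-bundled toolkit explicitly. This is only a guess, every path and version below is a placeholder for our actual install, and the exact variable names may differ depending on the CMake/Torch version:

# Locate the CUDA shipped inside the NVHPC install (layout/path is a placeholder).
ls /path/to/nvhpc/Linux_x86_64/22.7/cuda/          # e.g. 11.7/

# Point the LAMMPS CMake build at that toolkit and at LibTorch explicitly.
cmake ../cmake -C ../cmake/presets/nvhpc.cmake \
  -D CMAKE_CUDA_COMPILER=/path/to/nvhpc/Linux_x86_64/22.7/cuda/11.7/bin/nvcc \
  -D CUDA_TOOLKIT_ROOT_DIR=/path/to/nvhpc/Linux_x86_64/22.7/cuda/11.7 \
  -D CMAKE_PREFIX_PATH="$(python -c 'import torch; print(torch.utils.cmake_prefix_path)')"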
