Benchmark_ITT: Example output
The important score is at the end of the output of Benchmark_ITT:
```
Grid : Message : 153.294206 s : ==================================================================================
Grid : Message : 153.294221 s : Per Node Summary table Ls=12
Grid : Message : 153.294235 s : ==================================================================================
Grid : Message : 153.294247 s : L       Wilson          DWF4            Staggered
Grid : Message : 153.294258 s : 8       78414.022       801369.528      15600.473
Grid : Message : 153.294278 s : 12      365420.689      2042350.152     125976.526
Grid : Message : 153.294297 s : 16      936905.225      3940917.595     146384.177
Grid : Message : 153.294316 s : 24      2456508.219     4861140.649     265750.399
Grid : Message : 153.294335 s : 32      2257740.031     5776951.106     285434.293
Grid : Message : 153.294354 s : ==================================================================================
Grid : Message : 153.294366 s : ==================================================================================
Grid : Message : 153.294378 s : Comparison point result: 5319045.877 Mflop/s per node
Grid : Message : 153.294393 s : Comparison point is 0.5*(5776951.106+4861140.649)
Grid : Message : 153.294410 s : ==================================================================================
```
The single-node result is 5.319 TF/s:
```
Comparison point result: 5319045.877 Mflop/s per node
```
The comparison point is the average of the DWF4 figures at L=24 and L=32 in the table above.
This system (Summit) has 6 V100 GPUs per node.
Invocation:
```
jsrun --smpiargs=-gpu --nrs 6 --rs_per_host 6 --tasks_per_rs 1 --cpu_per_rs 6 --gpu_per_rs 1 ./Benchmark_ITT --mpi 1.1.1.6 --shm 2048
```
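A brief gloss on these flags (my reading of jsrun and Grid conventions, not from the original page):
```bash
# --smpiargs=-gpu               : enable CUDA-aware Spectrum MPI
# --nrs 6 --rs_per_host 6       : six resource sets on the single host
# --tasks_per_rs 1              : one MPI rank per resource set
# --cpu_per_rs 6 --gpu_per_rs 1 : six cores and one V100 per rank
# --mpi 1.1.1.6                 : Grid's rank decomposition (x.y.z.t)
# --shm 2048                    : shared-memory window per rank, in MB
```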
Single node, 6 GPU, run log (5.3 TF/s result).
This log gave the above result. Per-node performance drops when more than one node is used: Summit has only dual-rail EDR and is network-bandwidth limited, giving around 1.2 TF/s per node on 8 or more nodes. Increased interconnect provision would make sense.
The A100 is expected to perform best with a 1:1 ratio of GPUs to 200 Gbit/s network interfaces.
Configuration
```
../configure --enable-comms=mpi-auto --enable-simd=AVX2 --prefix /home/dp008/dp008/paboyle/prefix-cpu \
CXX=clang++ MPICXX=mpiicpc \
LDFLAGS=-L/home/dp008/dp008/paboyle/prefix/lib/ \
CXXFLAGS="-I/home/dp008/dp008/paboyle/prefix/include/ -std=c++11 -fpermissive"
```
Invocation
```
mpirun -np 2 ./Benchmark_ITT --mpi 1.1.1.2 --threads 12
```
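My reading of the run parameters (the rank placement is an assumption, not stated on the page):
```bash
# -np 2         : two MPI ranks, presumably one per socket
# --mpi 1.1.1.2 : Grid decomposition, splitting the t-direction in two
# --threads 12  : twelve worker threads per rank
```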
The result is 231 Gflop/s per node.
We only had temporary access to dual-socket Rome CPUs during a Summer 2020 AMD Hackathon and the following weeks. Unfortunately, this predated the freezing of our Benchmark_ITT. We benchmarked the nodes using Benchmark_dwf on a different volume, and the following slides were produced at the time.
Rome, 2 CPUs, 64+64 cores: Benchmark_dwf slides (PDF)
The results (up to 1.9 TF/s on a carefully chosen volume) likely overestimate the Benchmark_ITT figure, as the ITT volume is larger and will likely spill out of cache residency. We are unable to access appropriate nodes to verify this. The cache edge is visible in the synthetic benchmark on page 2 of the slides above.
Single-node results from the Juelich Booster system, which has 4 x A100 GPUs per node.
Configuration
```
MPICXX=mpicxx ../configure \
--enable-unified=yes \
--enable-accelerator=cuda \
--enable-comms=mpi-auto \
--enable-simd=GPU \
CXX=nvcc \
CXXFLAGS="-ccbin g++ -gencode arch=compute_80,code=sm_80 -std=c++14" \
LIBS="-lrt -lmpi "
```
Invocation, wrapped in a NUMA-aware script carefully matched to the lstopo output:
```
srun -n 4 ./rungrid.sh
```
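The rungrid.sh script itself is not reproduced on this page. Below is a minimal sketch of what such a NUMA-aware wrapper might look like; every binding in it (rank-to-GPU, rank-to-NUMA-domain, and the --mpi decomposition) is illustrative and must be matched to the lstopo output of the actual node.
```bash
#!/bin/bash
# Illustrative NUMA-aware wrapper (a sketch, not the actual rungrid.sh).
# Assumes four MPI ranks on the node, one per A100, launched by srun.
lrank=$SLURM_LOCALID                 # rank index within this node: 0..3
export CUDA_VISIBLE_DEVICES=$lrank   # pin one GPU to each rank
# Illustrative assumption: NUMA domain number equals the local rank;
# on a real node this mapping must be read off from lstopo.
exec numactl --cpunodebind=$lrank --membind=$lrank \
     ./Benchmark_ITT --mpi 1.1.1.4 --shm 2048
```
The essential point is that each rank is pinned to the GPU, cores, and memory of a single NUMA domain.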
We have been informed that Grid hits issues for multinode GPU running on at least some systems and software versions. We reproduced this on the Booster system at Juelich with UCX v1.8.1 and OpenMPI 4.0.1rc1. At the time of writing we have good reason to believe these are issues with the UCX software rather than with Grid, but the problem has not been conclusively resolved.
We have been told that
LDFLAGS="--cudart shared"
CXXFLAGS="--cudart shared"
addresses the issue with UCX: UCX intercepts CUDA calls at runtime to track memory regions, but can only do so with dynamic linking. We have not been able to verify this due to machine availability. This page will be updated when we are able to check this solution.
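For concreteness, folding these flags into the Booster configure line above would give something like the following (a sketch assembled from the two snippets on this page, not a tested build line):
```bash
# Booster configure with the UCX workaround flags folded in; the
# --cudart shared options force dynamic linking of the CUDA runtime,
# which lets UCX intercept CUDA allocations at runtime.
MPICXX=mpicxx ../configure \
--enable-unified=yes \
--enable-accelerator=cuda \
--enable-comms=mpi-auto \
--enable-simd=GPU \
CXX=nvcc \
CXXFLAGS="-ccbin g++ -gencode arch=compute_80,code=sm_80 -std=c++14 --cudart shared" \
LDFLAGS="--cudart shared" \
LIBS="-lrt -lmpi "
```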
**Update (18 Nov): confirmed that this works on the Booster at Juelich.**
The workaround has since been deprecated, as it is no longer needed.