Dirac ITT 2020 Benchmarks
The two Grid-based benchmarks for the 2020 Dirac ITT are located in the build directories of the Grid package.
Build instructions for Grid are available on the Wiki at https://github.com/paboyle/Grid.
The key Grid benchmarks are located in branch:
release/dirac-ITT-2020
and in the corresponding release:
https://github.com/paboyle/Grid/releases
with binaries found under the build directories as:
benchmarks/Benchmark_ITT
and
benchmarks/Benchmark_IO
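For reference, a minimal way to obtain the benchmark sources at the correct release (assuming git is available) is:
git clone https://github.com/paboyle/Grid
cd Grid
git checkout release/dirac-ITT-2020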
The code can be compiled for two major node types: multicore CPU and GPU.
Compile instructions/scripts are provided for
- Intel Skylake/IceLake processors (Tesseract)
- AMD Rome processors, single node, dual socket (Booster)
- Nvidia V100 + Skylake nodes (Tesseract)
- Nvidia V100 + IBM Power9 nodes (Summit)
- Nvidia A100 + AMD Rome nodes (Booster)
- AMD MI50 + AMD Rome nodes (Durham)
- A64FX
However, details will depend on cluster software configurations; the scripts are illustrative only and are provided as examples to help vendors prepare submissions. Each script compiles the code twice (with build-Nc3 and build-Nc4); results for both scenarios must be submitted, but only Nc=3 contributes to the DR1 measure of system throughput.
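As a purely illustrative sketch of a multicore CPU build (the provided scripts and the Wiki set the definitive options; the --enable-comms=mpi3 and --enable-simd=AVX512 choices and the mpicxx wrapper are assumptions here):
./bootstrap.sh                  # generate the configure script
mkdir build && cd build
../configure --enable-comms=mpi3 --enable-simd=AVX512 CXX=mpicxx
make -j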
A64FX nodes are supported with grateful thanks to contributed work by Nils Meyer at Regensburg University. The compile target is
--enable-simd=A64FX
The code is claimed to work efficiently on Fujitsu ARM systems, but we have not been able to verify this ourselves. This target should enable both Fujitsu and Cray A64FX nodes to operate efficiently.
See: Nils Meyer's APLAT 2020 talk
We have compiled and run the code on MI50 GPUs using the HIP environment. The compile target is
--enable-accelerator=hip
We have not had access to MI100 GPUs, but anticipate that these may be cost effective.
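A hedged sketch of a corresponding HIP configure line (the --enable-simd=GPU target and the hipcc wrapper are assumptions; consult the provided scripts for working settings):
../configure --enable-accelerator=hip --enable-simd=GPU --enable-comms=mpi3 CXX=hipcc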
Benchmark_ITT must be run and submitted for both the Nc=3 and Nc=4 compiles, with only Nc=3 contributing to the system throughput metric.
- Single node runs on CPU and (if GPUs are included in the proposal) on a single multi-GPU node and on a single GPU.
2a) 64 node run on CPU nodes with a (2Nx2x4x4) MPI decomposition, where N is the number of sockets on a node
OR
2b) 16 node run on (multi-)GPU nodes with a (2x2x2x2N) MPI decomposition over 2x2x2x2N GPUs in total, where N is the number of GPUs on one node.
Optionally:
- 2, 4, 8, 16, 32 and 64 node runs. Larger job sizes will give us greater confidence in scaling extrapolations and are advantageous.
Log files should be collected after compile options, run parameters and threading parameters have been optimised. These log files must be submitted in their entirety, organised so that the run invocation parameters are clearly communicated.
Single CPU node (if 2 socket)
Benchmark_ITT --mpi 2.1.1.1 --shm 2048
Single CPU node (if 1 socket)
Benchmark_ITT --mpi 1.1.1.1 --shm 2048
Single GPU node, 4 GPUs
Benchmark_ITT --mpi 1.1.1.4 --shm 2048
Single GPU node, 1 GPU
Benchmark_ITT --mpi 1.1.1.1 --shm 2048
64 node CPU (if 1 socket)
Benchmark_ITT --mpi 2.2.4.4 --shm 2048
64 node CPU (if 2 socket)
Benchmark_ITT --mpi 4.2.4.4 --shm 2048
16 node GPU, 4 GPUs per node
Benchmark_ITT --mpi 2.2.2.8 --shm 2048
A baseline run must be submitted based on measurements of real hardware at the target node count. However, the measured throughput may be adjusted to allow for differences between the node parts used in the baseline run and those in the proposed node specification.
For example, runs on parts with fewer cores could be synthetically projected by running on a full core-count part with partial thread counts, which would be a high-confidence projection. Another example could be the estimated effect of increasing or decreasing the number or type of fabric interfaces, based on measurements at smaller node counts, where more detailed discussion of timing would be required.
Any such corrections to the throughput must be clearly indicated, and the supporting evidence upon which they are estimated must be provided in the proposal. The quality of the supporting evidence for such projections will be scored, and that score may be used to form a weighted average of the baseline and predicted runs, with the weighting reflecting the confidence in the evidence.
The intent is to give vendors the ability to fine tune system configuration without imposing a large cost or disruption in reconfiguring their internal clusters. The rules are designed to allow maximum flexibility while minimising the risk of substantially incorrect extrapolations propagating into the scoring.
The aggregate system performance score should be estimated as the sum of the estimated CPU and (if present) GPU partition throughputs, scaled from the benchmark runs to the size of each proposed partition. Naturally, the result will be an aggregate total system Gflop/s throughput.
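As an illustration of the intended scaling (the notation here is ours, not taken from the benchmark documentation), for a proposal with N_CPU CPU nodes and N_GPU GPU nodes:
Total Gflop/s ≈ (N_CPU / 64) × F_CPU + (N_GPU / 16) × F_GPU
where F_CPU and F_GPU are the measured (or corrected) throughputs of the required 64 node CPU and 16 node GPU runs.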
We recommend running with one MPI rank per GPU, dividing the CPU cores evenly between ranks.
Compute nodes without an NVLink or Infinity Fabric based GPU interconnect will not deliver good performance and will not be considered.
While AVX2 and AVX512 are both options, even on processors that support AVX512 we sometimes find AVX2 runs as fast or faster, so both should be compared. Thread counts are controlled using the standard OMP_NUM_THREADS environment variable.
NUMA affinity options on multicore nodes can make a large difference to delivered performance. We recommend one MPI rank per NUMA domain, with OMP_NUM_THREADS set to the number of cores in each NUMA domain.
Examples include
I_MPI_PIN=1
for Intel MPI, and
--map-by numa
under OpenMPI.
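Putting these together, a launch on a single dual-socket CPU node might look like the following sketch (the 32 cores per NUMA domain and the use of OpenMPI's mpirun are assumptions for illustration):
export OMP_NUM_THREADS=32       # assumed cores per NUMA domain
mpirun -np 2 --map-by numa ./Benchmark_ITT --mpi 2.1.1.1 --shm 2048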
If your MPI does not support process binding, invoke the benchmark through a wrapper script to give MPI and CUDA no choice but to do the right thing. The following wrapper script, from the Booster system, proved useful for CPU and GPU execution for NUMA and GPU-HFI proximity reasons.
#!/bin/sh
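# Bind each local MPI rank (SLURM_LOCALID) to a nearby NUMA domain, InfiniBand HCA and GPU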
if [ $SLURM_LOCALID = "0" ]
then
NUMA=1
MLX=mlx5_0:1
GPU=1
fi
if [ $SLURM_LOCALID = "1" ]
then
NUMA=3
MLX=mlx5_1:1
GPU=0
fi
if [ $SLURM_LOCALID = "2" ]
then
NUMA=5
MLX=mlx5_2:1
GPU=3
fi
if [ $SLURM_LOCALID = "3" ]
then
NUMA=7
MLX=mlx5_3:1
GPU=2
fi
CMD="env CUDA_VISIBLE_DEVICES=$GPU UCX_NET_DEVICES=$MLX numactl -m $NUMA ./a.out"
echo $CMD
$CMD
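For illustration, saving the wrapper as (say) wrapper.sh, a hypothetical name, the 16 node GPU run would then be launched with one task per GPU, along the lines of:
srun -N 16 --ntasks-per-node=4 ./wrapper.sh
with ./a.out inside the wrapper replaced by the appropriate invocation, e.g. ./Benchmark_ITT --mpi 2.2.2.8 --shm 2048.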
On CPU compiles, a global comms buffer is allocated with either MMAP (default) or SHMGET. If
--shm-hugepages
is specified, the software requests that Linux provide 2MB huge pages. This requires system administrator assistance to reserve the pages and enable the user to map them. Hugetlbfs is also an appropriate solution. Any performance gain will vary with MPI fabric type.
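As an illustration only (the number of pages is arbitrary and reserving them requires administrator privileges), 2MB huge pages can typically be made available with:
sysctl -w vm.nr_hugepages=2048       # reserve 2048 x 2MB pages (illustrative value)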
For GPU compiles this comms buffer is device memory.
The I/O benchmark can be found under benchmarks/Benchmark_IO in your Grid build directory.
This benchmark will write and read back large arrays of random numbers in parallel for a range of local volumes. For a given local volume L^4, the size of the written array is proc*12*L^4*16 bytes, where proc is the number of MPI processes.
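For instance, under this reading of the formula, at the largest local volume 32^4 each process writes 12 × 16 × 32^4 bytes, i.e. about 192 MiB per process.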
The program uses two strategies for I/O:
- Each MPI process reads/writes its local array into a separate file using the C++ standard library.
- MPI processes collectively read/write in one single file using Grid's routine based on MPI-2 I/O.
When a file is read back, it undergoes a checksum verification, and the program aborts in case of checksum mismatch. Each file contains different random data. The benchmark will loop over local volumes between 8^4 and 32^4, and repeat the whole measurement 10 times.
At the end of the execution the benchmark will print summary tables on the standard output. These tables contain
- a summary of individual results averaged over the 10 passes, with their standard deviations;
- the robustness of individual results over the 10 passes in %, closer to 100% is better (robustness = 100% - standard deviation/mean);
- a summary of results averaged over the volume range 24^4-32^4, with their standard deviations;
- the robustness of the volume-averaged results.
Similarly to Benchmark_ITT, it must be run on 64 CPU nodes using a (2x2x4x4) MPI decomposition.
Log files should be collected after compile options, run parameters and threading parameters have been optimised. These log files must be submitted in their entirety. Several results can be submitted for different file system parameters.
Identical to Benchmark_ITT.
64 node CPU (if 1 socket)
Benchmark_IO --mpi 2.2.4.4
64 node CPU (if 2 socket)
Benchmark_IO --mpi 4.2.4.4
With the Lustre file system, striping has a very significant impact on the results. In particular, optimal performance for strategies 1. and 2. might not be achieved with the same striping parameters, due to the different number of files. The benchmark writes files in the directory from which Benchmark_IO was executed. To create a directory with specific striping parameters, one uses
mkdir dir
lfs setstripe -c <no> -S <size> dir
where <no> is the number of stripes and <size> is the stripe size. Running Benchmark_IO from within dir will benchmark this specific choice of parameters.
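For illustration (the stripe count and size here are arbitrary assumptions; optimal values are file-system dependent):
mkdir striped_dir
lfs setstripe -c 16 -S 4m striped_dir     # 16 stripes of 4 MiB each
cd striped_dir
mpirun -np 64 ../Benchmark_IO --mpi 2.2.4.4   # assuming Benchmark_IO sits one level up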