Measuring memory bandwidth using streaming kernels
TheBandwidthBenchmark as a Julia package.
Quoting from the original source:

It contains the following streaming kernels with the corresponding data access patterns (notation: S - store, L - load, WA - write allocate). All variables are vectors, `s` is a scalar:
- init (S1, WA): Initialize an array: `a = s`. Store only.
- sum (L1): Vector reduction: `s += a`. Load only.
- copy (L1, S1, WA): Classic memcopy: `a = b`.
- update (L1, S1): Update vector: `a = a * scalar`. Also load + store but without write allocate.
- triad (L2, S1, WA): Stream triad: `a = b + c * scalar`.
- daxpy (L2, S1): Daxpy (without write allocate): `a = a + b * scalar`.
- striad (L3, S1, WA): Schoenauer triad: `a = b + c * d`.
- sdaxpy (L3, S1): Schoenauer triad without write allocate: `a = a + b * c`.
Note that we do not (yet) include the sum reduction in this Julia package.
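For illustration, a plain serial Julia version of two of these kernels could look as follows (a sketch only, not the implementation used by this package, whose kernels are multithreaded):

```julia
# Naive reference versions of two of the kernels listed above (sketch only).
function striad!(a, b, c, d)     # striad (L3, S1, WA): a = b + c * d
    @inbounds @simd for i in eachindex(a, b, c, d)
        a[i] = b[i] + c[i] * d[i]
    end
    return a
end

function sdaxpy!(a, b, c)        # sdaxpy (L3, S1): a = a + b * c
    @inbounds @simd for i in eachindex(a, b, c)
        a[i] = a[i] + b[i] * c[i]
    end
    return a
end
```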
The package can be installed with the Julia package manager:

```
] add BandwidthBenchmark
```
All benchmarks:
- Emmy @ FAU: `benchmarks/emmy_fau`
- Noctua1 @ PC²: `benchmarks/noctua_pc2`
- Noctua2 @ PC²: `benchmarks/noctua2_pc2`
The benchmarks showcased in this README have been conducted on the Emmy cluster at NHR@FAU because we could fix the CPU frequencies on this system.
The main function `bwbench` supports the following keyword arguments:
- `N` (default: `120_000_000`): length of the vectors
- `nthreads` (default: `Threads.nthreads()`): number of Julia threads to use
- `niter` (default: `10`): number of times we repeat the measurement
- `alignment` (default: `64`): array alignment
- `verbose` (default: `false`): print result table, thread information, etc.
- `write_allocate` (default: `false`): include write-allocate compensation factors
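For instance, a shorter run with write-allocate compensation enabled could look like this (a sketch; the values are chosen arbitrarily):

```julia
using BandwidthBenchmark

# Smaller vectors, fewer repetitions, write-allocate compensation enabled.
bwbench(; N=50_000_000, niter=5, write_allocate=true)
```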
It is highly recommended(!) to pin the Julia threads to specific cores (according to the architecture at hand). The simplest way is probably to set `JULIA_EXCLUSIVE=1`, which will pin the Julia threads to the first `N` cores of the system. For more specific pinning, ThreadPinning.jl, LIKWID.jl, or other tools like `numactl` may be useful.
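For example, with ThreadPinning.jl the threads can be pinned right before running the benchmark. A minimal sketch (the `:cores` strategy is an assumption; the available strategy names depend on the ThreadPinning.jl version, see its documentation):

```julia
using ThreadPinning
using BandwidthBenchmark

pinthreads(:cores)       # pin the Julia threads to physical cores, one after another
bwbench(; verbose=true)
```

A verbose run then produces output like the following: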
```
julia> using BandwidthBenchmark
julia> bwbench(; verbose=true);
Threading enabled, using 8 (of 8) Julia threads
Total allocated datasize: 3840.0 MB
Thread 6 running on core 5.
Thread 5 running on core 4.
Thread 2 running on core 1.
Thread 3 running on core 2.
Thread 7 running on core 6.
Thread 8 running on core 7.
Thread 1 running on core 0.
Thread 4 running on core 3.
┌──────────┬─────────────┬────────────────┬───────────┬───────────┬───────────┐
│ Function │ Rate (MB/s) │ Rate (MFlop/s) │ Avg time │ Min time │ Max time │
│ String │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├──────────┼─────────────┼────────────────┼───────────┼───────────┼───────────┤
│ Init │ 18972.0 │ 0.0 │ 0.0508247 │ 0.0506008 │ 0.0510763 │
│ Copy │ 27137.6 │ 0.0 │ 0.0710205 │ 0.0707507 │ 0.0715388 │
│ Update │ 37939.9 │ 2371.24 │ 0.0509081 │ 0.0506063 │ 0.0512734 │
│ Triad │ 30753.6 │ 2562.8 │ 0.0943514 │ 0.0936477 │ 0.094945 │
│ Daxpy │ 40548.7 │ 3379.05 │ 0.071467 │ 0.0710258 │ 0.0719743 │
│ STriad │ 33344.3 │ 2084.02 │ 0.115823 │ 0.115162 │ 0.11737 │
│ SDaxpy │ 41370.0 │ 2585.62 │ 0.0935033 │ 0.0928209 │ 0.0941559 │
└──────────┴─────────────┴────────────────┴───────────┴───────────┴───────────┘
```

When LIKWID.jl is loaded, `bwbench` will automatically try to use LIKWID's Marker API! Running, e.g.,

```
JULIA_EXCLUSIVE=1 likwid-perfctr -c 0-7 -g MEM_DP -m julia --project=. --math-mode=fast -t8 bwbench_likwid.jl
```

one gets detailed information from hardware performance counters: example output. Among other things, we can use these values to verify that write allocates have happened. Inspecting the memory bandwidth associated with reads and writes in the STRIAD region, extracted here for convenience,
```
Memory read bandwidth [MBytes/s] | 31721.7327
Memory write bandwidth [MBytes/s] | 8014.1479
```
we see that 31721.7327 / 8014.1479 ≈ 4 times more reads than writes have happened. Naively, one would expect 3 loads and 1 store, i.e. a factor of 3 instead of 4, for the Schönauer triad `a[i] = b[i] + c[i] * d[i]`. The additional load is due to the write allocate (`a` must be loaded before it can be written to).
If we know that write allocates happen (as is usually the case), we can pass `write_allocate=true` to `bwbench` to account for the extra loads. In this case, we obtain the following table:

```
┌──────────┬─────────────┬────────────────┬───────────┬───────────┬───────────┐
│ Function │ Rate (MB/s) │ Rate (MFlop/s) │ Avg time │ Min time │ Max time │
│ String │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├──────────┼─────────────┼────────────────┼───────────┼───────────┼───────────┤
│ Init │ 38150.7 │ 0.0 │ 0.0506281 │ 0.0503267 │ 0.0508563 │
│ Copy │ 40560.1 │ 0.0 │ 0.0712144 │ 0.0710057 │ 0.0714202 │
│ Update │ 37862.3 │ 2366.39 │ 0.0512318 │ 0.0507101 │ 0.05211 │
│ Triad │ 40979.8 │ 2561.24 │ 0.0945374 │ 0.0937046 │ 0.0949424 │
│ Daxpy │ 40347.0 │ 3362.25 │ 0.0719729 │ 0.0713808 │ 0.07248 │
│ STriad │ 41754.5 │ 2087.73 │ 0.115761 │ 0.114958 │ 0.117923 │
│ SDaxpy │ 41276.1 │ 2579.75 │ 0.093708 │ 0.0930321 │ 0.0949028 │
└──────────┴─────────────┴────────────────┴───────────┴───────────┴───────────┘
```

Note that there are no write allocates for `Update`, `Daxpy`, and `SDaxpy`.
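As a rough cross-check of the compensation: for the kernels with write allocate the counted data volume per iteration grows by one stream, while the measured time stays the same, so the reported rate scales by the corresponding factor. A sketch of the arithmetic using the numbers from the two tables above (which come from separate runs and therefore only match approximately):

```julia
# Streams counted per kernel iteration (without -> with write-allocate compensation):
#   Init  : 1 (store)             -> 2
#   Copy  : 2 (load + store)      -> 3
#   Triad : 3 (2 loads + 1 store) -> 4
init_wa  = 18972.0 * 2 / 1   # ≈ 37944 MB/s (second table: 38150.7)
copy_wa  = 27137.6 * 3 / 2   # ≈ 40706 MB/s (second table: 40560.1)
triad_wa = 30753.6 * 4 / 3   # ≈ 41005 MB/s (second table: 40979.8)
```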
Use `bwscaling()` to measure the memory bandwidth for an increasing number of threads (`1:max_nthreads`).
Keyword arguments:
- `max_nthreads` (default: `Threads.nthreads()`): upper limit for the number of threads to be used
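For example, when Julia is started with 10 threads, a call like the following (a sketch) produces output like the one shown below:

```julia
using BandwidthBenchmark

# Measure the SDaxpy bandwidth for 1, 2, ..., 10 threads
# (requires starting Julia with at least 10 threads, e.g. `julia -t 10`).
bwscaling(; max_nthreads=10)
```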
```
1 Thread(s): SDaxpy Bandwidth (MB/s) is 13047.85
Threading enabled, using 2 (of 10) Julia threads
2 Thread(s): SDaxpy Bandwidth (MB/s) is 24023.27
Threading enabled, using 3 (of 10) Julia threads
3 Thread(s): SDaxpy Bandwidth (MB/s) is 31636.02
Threading enabled, using 4 (of 10) Julia threads
4 Thread(s): SDaxpy Bandwidth (MB/s) is 36304.15
Threading enabled, using 5 (of 10) Julia threads
5 Thread(s): SDaxpy Bandwidth (MB/s) is 38594.68
Threading enabled, using 6 (of 10) Julia threads
6 Thread(s): SDaxpy Bandwidth (MB/s) is 40556.33
Threading enabled, using 7 (of 10) Julia threads
7 Thread(s): SDaxpy Bandwidth (MB/s) is 40692.84
Threading enabled, using 8 (of 10) Julia threads
8 Thread(s): SDaxpy Bandwidth (MB/s) is 39040.98
Threading enabled, using 9 (of 10) Julia threads
9 Thread(s): SDaxpy Bandwidth (MB/s) is 39417.47
Threading enabled, using 10 (of 10) Julia threads
10 Thread(s): SDaxpy Bandwidth (MB/s) is 39913.76
Scaling results:
┌───────────┬─────────────────────────┐
│ # Threads │ SDaxpy Bandwidth (MB/s) │
├───────────┼─────────────────────────┤
│ 1.0 │ 13047.8 │
│ 2.0 │ 24023.3 │
│ 3.0 │ 31636.0 │
│ 4.0 │ 36304.1 │
│ 5.0 │ 38594.7 │
│ 6.0 │ 40556.3 │
│ 7.0 │ 40692.8 │
│ 8.0 │ 39041.0 │
│ 9.0 │ 39417.5 │
│ 10.0 │ 39913.8 │
└───────────┴─────────────────────────┘
Bandwidth Scaling
+----------------------------------------+
50000 | |
| |
| |
| ._.------__. ._____|
| ._-*"' '"""' |
| _*"` |
| .r/ |
MB/s | ./ |
| ./` |
| .' |
| .` |
| .` |
| /` |
|/ |
10000 | |
+----------------------------------------+
1 10
# cores
```

Use `bwscaling_memory_domains()` (instead of `bwscaling` above) to measure the memory bandwidth across memory domains (NUMA). It returns a `DataFrame` in which each row contains the kernel name, the number of threads per memory domain, the number of domains considered, and the measured memory bandwidth (in MB/s).
Keyword arguments:
- `max_nnuma` (default: `ThreadPinning.nnuma()`): maximum number of memory domains to consider
- `max_nthreads` (default: `maximum(ThreadPinning.ncores_per_numa())`): maximum number of threads per memory domain to consider
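A sketch of how this could be called (the keyword values below are arbitrary; the defaults scan the full system):

```julia
using BandwidthBenchmark

# Scan up to 2 memory domains with up to 4 threads per domain (arbitrary values).
df = bwscaling_memory_domains(; max_nnuma=2, max_nthreads=4)
```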
Since we also estimate the MFlop/s for our streaming kernels, we can also investigate the scaling of the floating-point performance for an increasing number of threads (`1:max_nthreads`). The function `flopsscaling` does exactly that, based on the Triad kernel.
Keyword arguments:
- `max_nthreads` (default: `Threads.nthreads()`): upper limit for the number of threads to be used
Using `flopsscaling`, we can, for example, benchmark and compare the performance of different thread pinning scenarios:
- Compact Pinning: fill the physical cores consecutively, i.e. first socket first, then second socket.
- Scattered Pinning: fill both sockets simultaneously, i.e. alternate between the first and the second socket.
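A sketch of how such a comparison could be set up with ThreadPinning.jl (the strategy names `:cores` and `:sockets` are assumptions and depend on the ThreadPinning.jl version; `:cores` fills the cores one after another, `:sockets` alternates between the sockets):

```julia
using ThreadPinning
using BandwidthBenchmark

# Compact pinning: fill the cores of the first socket first.
pinthreads(:cores)
flopsscaling()

# Scattered pinning: distribute the threads round-robin across the sockets.
pinthreads(:sockets)
flopsscaling()
```

The resulting scaling plots for the two scenarios are shown below.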
```
Compact Pinning
+----------------------------------------+
10000 | |
| ./|
| ./ |
| ..-" |
| .*' |
| .-/` |
| _*` |
MFlops/s | .r" |
| .. .__._.-' |
| _-"'"""` ` |
| ./ |
| .` |
| r` |
| / |
1000 | / |
+----------------------------------------+
0 20
# cores
Scattered Pinning
+----------------------------------------+
10000 | |
| . ., /\..*\. .\..r|
| .'**`\..` \/ "' |
| ,\ .` \` |
| . \. |
| . ,` ` |
| ./'.. |
MFlops/s | ./ ` |
| ./ |
| .` |
| .` |
| /` |
| / |
| / |
1000 | / |
+----------------------------------------+
0 20
# cores
```

(For a discussion, see https://discourse.julialang.org/t/compact-vs-scattered-pinning/69722)
- TheBandwidthBenchmark by RRZE-HPC Erlangen
- Sister package STREAMBenchmark.jl
- LIKWID and LIKWID.jl