Performance variation in single thread benchmark execution #893

rengolin · 2024-02-29T22:51:07Z

Need to profile what's going on here. 99% of the time is spent in libxsmm calls, so why is there such a large variation, and why is the compiler "faster" on Zen and "slower" on Lake?

These numbers are consistent across multiple runs on our cluster, AWS virtual and AWS metal.

FP32:

ZEN3	DNN	TPP-MLIR	Delta
FP32	105.2	113.0	107%
BF16	91.5	92.6	101%
MLP32	105.6	112.3	106%
MLP16	92.0	93.1	101%

CLX	DNN	TPP-MLIR	Delta
FP32	172.8	165.3	96%
BF16	131.9	131.5	100%
MLP32	172.2	164.8	96%
MLP16	131.5	131.4	100%
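
The Delta column appears to be the TPP-MLIR figure divided by the DNN figure, shown as a rounded percentage. The thread states neither the formula nor the units, so this is an inference from the numbers. A minimal C sketch, where `delta_percent` is a hypothetical helper name:

```c
#include <assert.h>

/* Delta as inferred from the tables: TPP-MLIR relative to DNN, in percent.
   Assumption: both inputs are positive numbers in the same unit. */
static int delta_percent(double dnn, double tpp_mlir) {
    assert(dnn > 0.0 && tpp_mlir > 0.0);
    double ratio = 100.0 * tpp_mlir / dnn;
    return (int)(ratio + 0.5); /* round to nearest, valid for positive values */
}
```

Under that reading, `delta_percent(105.2, 113.0)` corresponds to the ZEN3 FP32 row and `delta_percent(172.8, 165.3)` to the CLX FP32 row.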

BF16 on SPR:

rengolin · 2024-02-29T23:35:55Z

Some ideas:

Libxsmm-dnn uses mmap to allocate temporary buffers on 2M page boundaries, while the LLVM JITter probably doesn't.
This would explain the ICX 4% slowdown, but not the Zen3 7% speedup.
Maybe Zen3 doesn't work well with that practice?
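
To make the allocation difference concrete, here is a rough C sketch of placing a buffer on a 2 MiB boundary. It is not libxsmm's code and it does not use mmap: it stands in C11 `aligned_alloc` purely for illustration, and the names `PAGE_2M`, `round_up_2m`, `is_aligned` and `alloc_on_2m_boundary` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* 2 MiB, the "2M page" size mentioned in the thread. */
#define PAGE_2M ((size_t)2 * 1024 * 1024)

/* Returns nonzero if p sits on a multiple of `alignment` bytes. */
static int is_aligned(const void *p, size_t alignment) {
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}

/* Rounds `bytes` up to the next multiple of PAGE_2M. */
static size_t round_up_2m(size_t bytes) {
    return (bytes + PAGE_2M - 1) / PAGE_2M * PAGE_2M;
}

/* Illustrative stand-in for an allocator that places buffers on 2 MiB
   boundaries. Uses C11 aligned_alloc, NOT mmap and NOT libxsmm's code.
   The size is rounded up because C11 requires it to be a multiple of
   the alignment. Release with free(). Returns NULL on failure. */
static void *alloc_on_2m_boundary(size_t bytes) {
    assert(bytes > 0);
    return aligned_alloc(PAGE_2M, round_up_2m(bytes));
}
```

A plain `malloc` only guarantees the platform's fundamental alignment (typically 16 bytes with glibc on x86-64), so checking `is_aligned(ptr, PAGE_2M)` on the buffers each code path actually uses would be one way to confirm whether this is where the two diverge. Note that a 2 MiB-aligned address does not by itself mean the kernel backs the buffer with huge pages.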

alheinecke · 2024-03-01T17:02:29Z

#895 is the same problem, let's merge issues.

I have debugged it to the following extent:

It's only a single-thread issue for very small problems.
When changing libxsmm-dnn to use malloc instead of libxsmm_aligned_malloc, the gap gets much smaller (libxsmm-dnn performance drops).
For larger sizes, e.g. C=K=2048 or Minibatch=1024, single-thread performance of libxsmm-dnn and tpp-mlir is identical.

--> "solution": let's run the benchmarks on some slightly larger problem sizes, where the data is larger than 2M pages etc.
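
As a back-of-the-envelope check of what "larger than 2M pages" means, the following C sketch counts how many 2 MiB pages a dense matrix occupies. The helper names are hypothetical, and the 4-byte element size assumes FP32.

```c
#include <assert.h>
#include <stddef.h>

/* 2 MiB, the "2M page" size mentioned in the thread. */
#define PAGE_2M ((size_t)2 * 1024 * 1024)

/* Bytes occupied by a dense rows x cols matrix. */
static size_t matrix_bytes(size_t rows, size_t cols, size_t elem_size) {
    assert(elem_size > 0);
    return rows * cols * elem_size;
}

/* Number of 2 MiB pages needed to hold `bytes` (rounded up). */
static size_t pages_2m(size_t bytes) {
    return (bytes + PAGE_2M - 1) / PAGE_2M;
}
```

With C=K=2048, an FP32 weight matrix is `matrix_bytes(2048, 2048, 4)`, i.e. 16 MiB or 8 pages of 2 MiB. By contrast, a hypothetical 256 x 256 FP32 matrix (a size not taken from the thread) is 256 KiB and fits well inside a single 2 MiB page.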

rengolin · 2024-03-01T17:41:24Z

We can also try the alignment attribute on memref alloc. Doesn't hurt.

alheinecke mentioned this issue Mar 1, 2024

SPR scalability issue #895

Closed

alheinecke changed the title ~~Performance variation on FP32 single-thread~~ Performance variation in single thread benchmark exeuction Mar 1, 2024

rengolin changed the title ~~Performance variation in single thread benchmark exeuction~~ Performance variation in single thread benchmark execution Dec 18, 2024
