Performance variation in single thread benchmark execution #893

Open
rengolin opened this issue Feb 29, 2024 · 3 comments

Comments

@rengolin
Contributor

rengolin commented Feb 29, 2024

Need to profile what's going on here. 99% of the time is spent in libxsmm calls, so why is there such a large variation, and why is the compiler "faster" on Zen and "slower" on Lake?

These numbers are consistent across multiple runs on our cluster, on AWS virtual instances, and on AWS metal instances.

FP32:

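| ZEN3  | DNN   | TPP-MLIR | Delta (TPP-MLIR / DNN) |
|-------|-------|----------|------------------------|
| FP32  | 105.2 | 113.0    | 107%                   |
| BF16  | 91.5  | 92.6     | 101%                   |
| MLP32 | 105.6 | 112.3    | 106%                   |
| MLP16 | 92.0  | 93.1     | 101%                   |

| CLX   | DNN   | TPP-MLIR | Delta (TPP-MLIR / DNN) |
|-------|-------|----------|------------------------|
| FP32  | 172.8 | 165.3    | 96%                    |
| BF16  | 131.9 | 131.5    | 100%                   |
| MLP32 | 172.2 | 164.8    | 96%                    |
| MLP16 | 131.5 | 131.4    | 100%                   |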

BF16 on SPR:
[image attachment: BF16 results on SPR]

@rengolin
Contributor Author

Some ideas:

  • Libxsmm-dnn uses mmap to allocate temporary buffers on 2M page boundaries, while the LLVM JITter probably doesn't.
  • This would explain the ICX 4% slowdown, but not the Zen3 7% speedup
  • Maybe Zen3 doesn't work well with that practice?
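To make the allocation difference in the list above concrete, here is a minimal sketch in C of placing a buffer on a 2 MiB boundary. It is only an illustration: it uses `posix_memalign` as a stand-in rather than `mmap`, it is not how libxsmm-dnn or the LLVM JIT actually allocate, and the helper names `alloc_2m_aligned` and `is_2m_aligned` are hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdlib.h>

/* 2 MiB, the page size mentioned in the discussion above. */
#define HUGE_PAGE_BYTES ((size_t)2 << 20)

/* Hypothetical helper: allocate `bytes` starting on a 2 MiB boundary.
 * Uses posix_memalign as a stand-in for an mmap-based allocator.
 * Returns NULL on failure; release the buffer with free(). */
static void *alloc_2m_aligned(size_t bytes) {
  void *ptr = NULL;
  if (posix_memalign(&ptr, HUGE_PAGE_BYTES, bytes) != 0) {
    return NULL;
  }
  return ptr;
}

/* Hypothetical helper: returns 1 if `ptr` starts on a 2 MiB boundary. */
static int is_2m_aligned(const void *ptr) {
  return ((uintptr_t)ptr % HUGE_PAGE_BYTES) == 0;
}
```

Calling `is_2m_aligned` on a pointer returned by plain `malloc` will usually report 0 for small buffers, which is one way to check whether two code paths really differ in buffer placement before profiling further.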

@alheinecke
Contributor

#895 is the same problem, let's merge issues.

I have debugged it to the following extent:

  • it's only a single-thread issue, and only for very small problems
  • when changing libxsmm-dnn to use malloc instead of libxsmm_aligned_malloc the gap gets much smaller (libxsmm-dnn performance drops)
  • For larger sizes, e.g. C=K=2048 or Minibatch=1024, single-thread performance of libxsmm-dnn and tpp-mlir is identical.

--> "solution": let's run benchmarks on slightly larger problem sizes, where the data is larger than 2M pages etc.
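As a rough sanity check on what "larger than 2M pages" means for these sizes, the sketch below computes the footprint of a single C x K weight matrix and how many 2 MiB pages it spans. The helper names are hypothetical, and the C=K=256 case is an assumed "small" size for contrast, not one of the benchmarked configurations.

```c
#include <stddef.h>

/* 2 MiB, the page size mentioned in the discussion above. */
#define HUGE_PAGE_BYTES ((size_t)2 << 20)

/* Hypothetical helper: bytes occupied by a dense C x K weight matrix. */
static size_t weight_bytes(size_t c, size_t k, size_t elem_bytes) {
  return c * k * elem_bytes;
}

/* Hypothetical helper: number of 2 MiB pages needed to hold `bytes`,
 * rounded up. */
static size_t pages_spanned(size_t bytes) {
  return (bytes + HUGE_PAGE_BYTES - 1) / HUGE_PAGE_BYTES;
}
```

With C=K=2048, an FP32 weight matrix is 2048 * 2048 * 4 bytes = 16 MiB (8 pages) and a BF16 one is 8 MiB (4 pages). With the assumed C=K=256, the FP32 matrix is only 256 KiB, a fraction of one page, so where the buffer starts could plausibly matter much more.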

@alheinecke alheinecke changed the title Performance variation on FP32 single-thread Performance variation in single thread benchmark exeuction Mar 1, 2024
@rengolin
Contributor Author

rengolin commented Mar 1, 2024

We can also try the alignment attribute on the memref alloc. It doesn't hurt.

@rengolin rengolin changed the title Performance variation in single thread benchmark exeuction Performance variation in single thread benchmark execution Dec 18, 2024