Fix last_cachesize to be the total size. #29

Open
wants to merge 1 commit into master
Conversation

chriselrod
Contributor

`VectorizationBase.cache_size` returns the size per core, so for shared caches the number is much smaller than the cache's actual size.

My update therefore assumes that it is shared.

What is the desired behavior?
With this commit, it'll give a size much larger than any actual cache size on many Ryzen systems, but I think that may be desirable; we don't want the arrays, divided among cores, to fit in the L3s local to the CCXs.
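
For concreteness, the sizing change amounts to roughly the following sketch (the per-core figure and the core-count source here are simplified stand-ins, not the exact code in this commit):

```julia
# Sketch: treat the reported last-level cache figure as a per-core share and
# scale it by the core count, so the benchmark vectors are sized against the
# full shared cache rather than a single core's slice.
per_core_llc   = 4 * 2^20           # stand-in for the per-core value (4 MiB)
ncores         = Sys.CPU_THREADS    # crude proxy for the physical core count
last_cachesize = per_core_llc * ncores
```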

@carstenbauer
Member

carstenbauer commented May 22, 2024

`VectorizationBase.cache_size` returns the size per core, so for shared caches the number is much smaller than the cache's actual size.

Thanks for pointing this out. We should change the default then.

My update therefore assumes that it is shared.

IIUC, it assumes that the (last-level) cache is shared among all cores. While this is typically true for consumer machines, on clusters it surely isn't. There, multiplying by the number of cores per CPU/socket is probably more appropriate. On the other hand, given that there are almost always only two sockets, being a factor of two off is perhaps fine.

Of course, we could probably use Hwloc to get more detailed information about the system.
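
A rough sketch of what querying that via Hwloc.jl could look like (the accessor names `num_physical_cores` and `l3cache_sizes` are assumptions here and may differ between Hwloc.jl versions):

```julia
using Hwloc

ncores   = Hwloc.num_physical_cores()  # physical cores, ignoring SMT (assumed accessor)
l3_sizes = Hwloc.l3cache_sizes()       # one entry per L3 cache, i.e. per CCX/socket (assumed accessor)

# Counting every last-level cache exactly once avoids both the per-core
# underestimate and a cores-times-per-core overestimate on multi-socket nodes.
total_llc = sum(l3_sizes)
```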

What is the desired behavior?
With this commit, it'll give a size much larger than any actual cache size on many Ryzen systems, but I think that may be desirable; we don't want the arrays, divided among cores, to fit in the L3s local to the CCXs.

Yes, we don't. But on the other end of the spectrum, we also want to make sure that the arrays still fit into memory. In this regard, note that there is also a (somewhat artificial) factor of 4 in https://github.com/JuliaPerf/STREAMBenchmark.jl/blob/master/src/benchmarks.jl#L4
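
Put together, the default sizing could look roughly like this sketch (`total_llc` stands for the total last-level cache size however it ends up being obtained; the factor of 4 mirrors the existing margin in benchmarks.jl):

```julia
# The vectors must be well outside the combined last-level caches so that we
# measure memory bandwidth, yet small enough that all three STREAM vectors
# still fit into RAM.
total_llc     = 64 * 2^20                         # example: 64 MiB of combined L3
vector_bytes  = 4 * total_llc                     # existing factor-of-4 safety margin
@assert 3 * vector_bytes < Sys.total_memory()     # a, b, and c must still fit in memory
vector_length = vector_bytes ÷ sizeof(Float64)
```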

@carstenbauer carstenbauer added the enhancement New feature or request label May 22, 2024
@chriselrod
Contributor Author

chriselrod commented May 22, 2024

I think it's unlikely that fitting into memory will be a problem.

FWIW, I currently get an extreme bandwidth value when multithreaded, because the array fits in the local caches.

I think you should use the number of cores * last cache per core, not the total last cache size.

For example, an AMD system with 8 L3s will be able to comfortably fit the vector into its L3 caches when running the benchmark multithreaded: each chiplet needs only 1/8th of the array, so that's all each L3 needs to store. See the worked numbers below.
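
To make the arithmetic concrete (hypothetical numbers; the 32 MiB-per-CCX figure and the factor of 4 are illustrative only):

```julia
# Hypothetical 8-CCX AMD CPU: 8 L3 slices of 32 MiB each.
n_ccx      = 8
l3_per_ccx = 32 * 2^20

# Sizing the array from a single L3's size (rather than cores × per-core share)
# and splitting the multithreaded benchmark across chiplets leaves each CCX
# holding only 1/8 of the array:
array_bytes   = 4 * l3_per_ccx        # 128 MiB with the factor-of-4 margin
slice_per_ccx = array_bytes ÷ n_ccx   # 16 MiB, which fits comfortably in a 32 MiB L3
slice_per_ccx < l3_per_ccx            # true ⇒ we'd be measuring cache, not memory, bandwidth
```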

The problem would be less extreme than what my desktop has on the master branch (the array currently fits in the CPU's L2 caches), but it would still give invalid results.
