Fix last_cachesize to be the total size. #29

Open
wants to merge 1 commit into master
Conversation

chriselrod
Contributor

`VectorizationBase.cache_size` returns the size per core, so for shared caches the number is much smaller than the cache's actual size.

My update therefore assumes that it is shared.

What is the desired behavior?
With this commit, it'll give a size much larger than any actual cache size on many Ryzen systems, but I think that may be desirable; we don't want the arrays, divided among cores, to fit in the L3s local to the CCXs.
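
For concreteness, the sizing change amounts to roughly the following sketch (the per-core figure and the core-count source here are simplified stand-ins, not the exact code in this commit):

```julia
# Sketch: treat the reported last-level cache figure as a per-core share and
# scale it by the core count, so the benchmark vectors are sized against the
# full shared cache rather than a single core's slice.
per_core_llc   = 4 * 2^20           # stand-in for the per-core value (4 MiB)
ncores         = Sys.CPU_THREADS    # crude proxy for the physical core count
last_cachesize = per_core_llc * ncores
```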

@carstenbauer
Member

carstenbauer commented May 22, 2024

`VectorizationBase.cache_size` returns the size per core, so for shared caches the number is much smaller than the cache's actual size.

Thanks for pointing this out. We should change the default then.

My update therefore assumes that it is shared.

IIUC, it assumes that the (last-level) cache is shared among all cores. While this is typically true for consumer machines, on clusters it surely isn't. There, multiplying by the number of cores per CPU/socket is probably more appropriate. On the other hand, given that there are almost always only two sockets, being a factor of two off is perhaps fine.

Of course, we could probably use Hwloc to get more detailed information about the system.
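
A rough sketch of what querying that via Hwloc.jl could look like (the accessor names `num_physical_cores` and `l3cache_sizes` are assumptions here and may differ between Hwloc.jl versions):

```julia
using Hwloc

ncores   = Hwloc.num_physical_cores()  # physical cores, ignoring SMT (assumed accessor)
l3_sizes = Hwloc.l3cache_sizes()       # one entry per L3 cache, i.e. per CCX/socket (assumed accessor)

# Counting every last-level cache exactly once avoids both the per-core
# underestimate and a cores-times-per-core overestimate on multi-socket nodes.
total_llc = sum(l3_sizes)
```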

What is the desired behavior?
With this commit, it'll give a size much larger than any actual cache size on many Ryzen systems, but I think that may be desirable; we don't want the arrays, divided among cores, to fit in the L3s local to the CCXs.

Yes, we don't. But on the other end of the spectrum, we also want to make sure that the arrays still fit into memory. In this regard, note that there is also a (somewhat artificial) factor of 4 in https://github.com/JuliaPerf/STREAMBenchmark.jl/blob/master/src/benchmarks.jl#L4
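
Put together, the default sizing could look roughly like this sketch (`total_llc` stands for the total last-level cache size however it ends up being obtained; the factor of 4 mirrors the existing margin in benchmarks.jl):

```julia
# The vectors must be well outside the combined last-level caches so that we
# measure memory bandwidth, yet small enough that all three STREAM vectors
# still fit into RAM.
total_llc     = 64 * 2^20                         # example: 64 MiB of combined L3
vector_bytes  = 4 * total_llc                     # existing factor-of-4 safety margin
@assert 3 * vector_bytes < Sys.total_memory()     # a, b, and c must still fit in memory
vector_length = vector_bytes ÷ sizeof(Float64)
```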

@carstenbauer carstenbauer added the enhancement New feature or request label May 22, 2024
@chriselrod
Contributor Author

chriselrod commented May 22, 2024

I think it's unlikely that fitting into memory will be a problem.

FWIW, I currently get an extreme bandwidth value when multithreaded, because the array fits in the local caches.

I think you should use the number of cores * last cache per core, not the total last cache size.

For example, an AMD system with 8 L3s will be able to comfortably fit the vector into its L3 caches when running the benchmark multithreaded: each chiplet needs only 1/8th of the array, so that's all each L3 needs to store. See the worked numbers below.
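
To make the arithmetic concrete (hypothetical numbers; the 32 MiB-per-CCX figure and the factor of 4 are illustrative only):

```julia
# Hypothetical 8-CCX AMD CPU: 8 L3 slices of 32 MiB each.
n_ccx      = 8
l3_per_ccx = 32 * 2^20

# Sizing the array from a single L3's size (rather than cores × per-core share)
# and splitting the multithreaded benchmark across chiplets leaves each CCX
# holding only 1/8 of the array:
array_bytes   = 4 * l3_per_ccx        # 128 MiB with the factor-of-4 margin
slice_per_ccx = array_bytes ÷ n_ccx   # 16 MiB, which fits comfortably in a 32 MiB L3
slice_per_ccx < l3_per_ccx            # true ⇒ we'd be measuring cache, not memory, bandwidth
```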

The problem would be less extreme than what my desktop has on the master branch (the array currently fits in the CPU's L2 caches), but it would still give invalid results.
