Problem
LLVM lowers memory operations into loads and stores up to a threshold, after which it calls the corresponding library function instead. Currently, the threshold is four stores, but this number needs further fine-tuning.
It is straightforward to see that for larger byte counts, calling a syscall is cheaper, since we only pay the maximum of 10 CUs and 1 CU per 250 bytes.
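As a rough illustration of that cost rule, here is a minimal sketch in Rust. The function name, the constants' names, and the use of integer division for the per-byte part are assumptions made for this example, not the runtime's actual implementation.

```rust
/// Hypothetical cost model for a memory syscall, in compute units (CUs).
/// Assumption: the per-byte part is computed with integer division.
fn syscall_cost_cus(num_bytes: u64) -> u64 {
    const BASE_COST_CUS: u64 = 10; // flat minimum quoted in the text
    const BYTES_PER_CU: u64 = 250; // 1 CU per 250 bytes, as quoted in the text
    std::cmp::max(BASE_COST_CUS, num_bytes / BYTES_PER_CU)
}
```

Under these assumptions, any operation of up to 2,500 bytes costs the flat 10 CUs, and the per-byte term only dominates beyond that size.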
LLVM can't lower the operation when the number of bytes is unknown at compile time. In this case, we always call the library function. The functions in compiler builtins then implement the operations manually, in an attempt to avoid the syscall when fewer than 15 stores are needed.
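The decision described above can be sketched as follows. This is only an illustration: the names `Path` and `choose_path` are hypothetical, and the 8-byte store width is an assumption, since the text only gives the threshold in stores.

```rust
/// Threshold from the text: the manual path is used for fewer than 15 stores.
const MAX_INLINE_STORES: usize = 15;
/// Assumption for this sketch: each store writes 8 bytes.
const STORE_WIDTH_BYTES: usize = 8;

/// Hypothetical label for the two code paths a library function can take.
#[derive(Debug, PartialEq)]
enum Path {
    /// Expand the operation manually into loads and stores.
    Inline,
    /// Hand the operation over to the syscall.
    Syscall,
}

/// Picks a path from the runtime size, rounding up to whole stores.
fn choose_path(num_bytes: usize) -> Path {
    let stores = (num_bytes + STORE_WIDTH_BYTES - 1) / STORE_WIDTH_BYTES;
    if stores < MAX_INLINE_STORES {
        Path::Inline
    } else {
        Path::Syscall
    }
}
```

With these assumed values, a 5-byte operation takes the manual path, which is the case the benchmark compares against a direct syscall.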
For comparison, I benchmarked the memory operations with 5 bytes, using a variable as the size so that we always call the compiler builtins functions. These are the results of CU consumption:
Legend:
Test 1: Calling the existing compiler builtins function, which will manually expand the memory operation without invoking the syscall.
Test 2: Invoking the syscall directly.
Conclusion:
Invoking the syscall directly is more efficient than lowering the operation manually. Additionally, the extra code from the manual implementation increases program size.
Solution
Invoke the syscalls directly in the library functions. This change will improve the performance of native Rust functions like .clone(), slice::fill, and copy_from_slice.
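For context, the snippet below is a small sketch of the kind of Rust code affected. The function `demo` is hypothetical. Because the length is only known at run time, operations like these are typically compiled to calls into the memory library functions rather than to inline loads and stores, although the exact lowering depends on the compiler and optimization level.

```rust
/// Hypothetical helper exercising the three operations with a runtime length.
fn demo(len: usize) -> (Vec<u8>, Vec<u8>, Vec<u8>) {
    let src: Vec<u8> = (0..len).map(|i| i as u8).collect();

    // Cloning a byte vector copies its contents (a memcpy-style operation).
    let cloned = src.clone();

    // Filling a byte slice with one value is a memset-style operation.
    let mut filled = vec![0u8; len];
    filled.fill(0xAB);

    // copy_from_slice copies between slices of equal length (memcpy-style).
    let mut copied = vec![0u8; len];
    copied.copy_from_slice(&src);

    (cloned, filled, copied)
}
```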
Disclaimer
As we are on a tight deadline, this improvement will not be shipped with platform tools v1.44, but instead with v1.45.
I had this benchmark in my list for a while, but thanks to @febo bringing up the problem, I took the time to analyze it.