MDEV-36184 - Optimise dot_product on the Power9 and Power10 architectures #3850
Conversation
Signed-off-by: Manjul Mohan <[email protected]>
Looks good, thanks! Although it needs some polishing.
Do you have any numbers? How does your implementation compare with auto-vectorisation? Did you benchmark it? On Power10 and Power9? Auto-vectorisation at -O3 by what compiler version?
I conducted benchmark tests on both Power9 and Power10 machines, comparing the time taken by the original (auto-vectorised) code and the new vectorised code. I used GCC 11.5.0 on RHEL 9.5 with -O3. The benchmarks were performed using a sample test program with a vector size of 4096 and 10⁷ loop iterations.
Power10:
The final results of the dot product remained the same before and after the change, confirming functional correctness.
- Removed space before '='
- Removed POWER_IMPLEMENTATION macro from before function definition
- Using int64_t and vector long long for handling data overflow
- Removed code for // Process remaining elements

Signed-off-by: Manjul Mohan <[email protected]>
Nothing else on my mind, just some minor tweaks.
We will have to clean up the commit history (so that there is just one commit), but we should be able to do that on our side.
static FVector *align_ptr(void *ptr)
{
  return (FVector *)(MY_ALIGN(((intptr)ptr) + alloc_header, POWER_bytes) - alloc_header);
This line should be under 80 characters.
}
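The pointer arithmetic in align_ptr can be modelled in portable C++. MY_ALIGN in MariaDB rounds its argument up to the next multiple of a power of two; the alloc_header and POWER_bytes values below are illustrative assumptions, not the actual constants from the patch.

```cpp
#include <cstdint>

// Stand-in for MariaDB's MY_ALIGN(x, n): round x up to the next multiple
// of n, where n is a power of two.
static inline uintptr_t my_align(uintptr_t x, uintptr_t n)
{
  return (x + n - 1) & ~(n - 1);
}

// Hypothetical constants standing in for the ones used in the patch.
static const uintptr_t alloc_header= 8;  // bytes of header before the data
static const uintptr_t POWER_bytes= 16;  // VSX register width in bytes

// Returns a pointer p such that p + alloc_header is POWER_bytes-aligned,
// i.e. the vector data that follows the header starts on a register-width
// boundary.
static void *align_ptr(void *ptr)
{
  return (void *)(my_align((uintptr_t) ptr + alloc_header, POWER_bytes) -
                  alloc_header);
}
```

The subtraction of alloc_header after rounding is what makes the data after the header, rather than the header itself, land on the alignment boundary.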
// Sum the accumulated vector long long values into a scalar int64_t sum
sum+= static_cast<int64_t>(ll_sum[0]) + static_cast<int64_t>(ll_sum[1]);
With this code it feels like sum is redundant. At least sum+= definitely is.
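The line under review is a two-lane horizontal reduction; a scalar sketch (not the VSX code itself) makes the reviewer's point visible:

```cpp
#include <cstdint>

// Scalar model of collapsing the two 64-bit lanes of the accumulator
// (the "vector long long ll_sum") into one scalar result.  When this is
// the only assignment, initialising sum to 0 and then writing sum+= is
// redundant: a plain return (or plain =) suffices.
static int64_t reduce_lanes(const int64_t ll_sum[2])
{
  return ll_sum[0] + ll_sum[1];
}
```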
{
  int64_t sum= 0;
  vector long long ll_sum= {0, 0}; // Using vector long long for int64_t accumulation
  size_t base= ((len + POWER_dims - 1) / POWER_dims) * POWER_dims; // Round up to process full vector, including padding
These lines should be under 80 characters, e.g. move comments up front.
for (size_t i= 0; i < base; i+= 8)
Should be i+= POWER_dims.
// Vectorized multiplication
vector int product_hi= x_hi * y_hi;
vector int product_lo= x_lo * y_lo;
Can't we make use of vec_mule() / vec_mulo() here? They seem to perform a widening multiply. There seems to be nothing for a widening add, indeed.
Would it make sense to replace the builtins with vec_unpackh / vec_unpackl at least? The mix looks really disturbing.
This patch optimises the dot_product function by leveraging vectorisation through SIMD intrinsics. Specifically, the function now uses __builtin_vec_vupkhsh and __builtin_vec_vupklsh to efficiently widen input values from int16_t to int32_t.
This transformation enables parallel execution of multiple operations, significantly improving the performance of dot product computation on supported architectures.
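The widening scheme described above can be sketched as a portable scalar model. POWER_dims= 8 and zero-padded tails are assumptions mirroring the patch; the real code performs the widening with unpack intrinsics on VSX registers rather than with scalar loops.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of the vectorised kernel: each iteration takes 8 int16_t
// values per input, widens them to int32_t (the role of vec_unpackh /
// vec_unpackl, i.e. __builtin_vec_vupkhsh / __builtin_vec_vupklsh),
// multiplies element-wise, and accumulates into a 64-bit sum.
static int64_t dot_product_widened(const int16_t *x, const int16_t *y,
                                   size_t len)
{
  const size_t POWER_dims= 8;
  // Round len up so the loop also covers the final, padded partial vector.
  size_t base= ((len + POWER_dims - 1) / POWER_dims) * POWER_dims;
  int64_t sum= 0;
  for (size_t i= 0; i < base; i+= POWER_dims)
  {
    int32_t x_w[8], y_w[8]; // widened "hi" and "lo" halves, merged here
    for (size_t j= 0; j < POWER_dims; j++)
    {
      // The real buffers are zero-padded by the allocator; this model
      // substitutes explicit zeros for the tail instead.
      x_w[j]= (i + j < len) ? x[i + j] : 0;
      y_w[j]= (i + j < len) ? y[i + j] : 0;
    }
    for (size_t j= 0; j < POWER_dims; j++)
      sum+= (int64_t) (x_w[j] * y_w[j]); // int32 products, 64-bit sum
  }
  return sum;
}
```

Accumulating the int32_t products into 64 bits is what the vector long long accumulator in the patch provides, avoiding overflow for long inputs.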
Performance Analysis:
The original dot_product function does undergo auto-vectorisation when compiled with -O3. However, performance analysis has shown that the newly optimised implementation performs better on Power10 and achieves comparable performance on Power9 machines.
Output Changes:
The logical output of the dot_product function remains unchanged (i.e., it still computes the correct dot product).
With this patch, computations utilise vector registers, leading to improved performance. These optimisations are internal and do not alter any user-visible behaviour.
Potential Side Effects:
Release Notes: