
MDEV-36184 - To optimise dot_product in Power9 and Power10 architecture #3850

Open · wants to merge 3 commits into base: main
Conversation

mikejuliet13

This patch optimises the dot_product function by leveraging vectorisation through SIMD intrinsics. Specifically, the function now uses __builtin_vec_vupkhsh and __builtin_vec_vupklsh to efficiently widen input values from a narrower to a wider element type.
This transformation enables parallel execution of multiple operations, significantly improving the performance of the dot product computation on supported architectures.
Performance Analysis:
The original dot_product function does undergo auto-vectorisation when compiled with -O3. However, performance analysis has shown that the newly optimised implementation performs better on Power10 and achieves comparable performance on Power9 machines.

Output Changes:
The logical output of the dot_product function remains unchanged (i.e., it still computes the correct dot product).
With this patch, computations utilise vector registers, leading to improved performance. These optimisations are internal and do not alter any user-visible behaviour.

Potential Side Effects:

  • This patch introduces architecture-specific optimisations targeted at Power9 and Power10 systems.
  • The function has been extensively tested on both Power9 and Power10, where it demonstrates correctness and performance improvements.
  • If executed on an older architecture (e.g., Power8 or below) that lacks support for these vector instructions, the implementation automatically falls back to DEFAULT_IMPLEMENTATION, ensuring broader compatibility.

Release Notes:

  • Optimised the dot_product function using SIMD vectorisation for improved performance.
  • Introduces architecture-specific optimisations for Power9 and Power10 systems.
  • No changes to observable output; improvements are purely in internal computation efficiency.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Contributor @svoj left a comment

Looks good, thanks! Although needs some polishing.

@svoj svoj added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Feb 21, 2025
@vuvova
Member

vuvova commented Feb 21, 2025

The original dot_product function does undergo auto-vectorisation when compiled with -O3. However, performance analysis has shown that the newly optimised implementation performs better on Power10 and achieves comparable performance on Power9 machines.

Do you have any numbers? How does your implementation compare with auto-vectorization? Did you benchmark it? On Power 10 and Power 9? Auto-vectorization at -O3 by what compiler version?

@mikejuliet13 (Author)

The original dot_product function does undergo auto-vectorisation when compiled with -O3. However, performance analysis has shown that the newly optimised implementation performs better on Power10 and achieves comparable performance on Power9 machines.

Do you have any numbers? How does your implementation compare with auto-vectorization? Did you benchmark it? On Power 10 and Power 9? Auto-vectorization at -O3 by what compiler version?

I conducted benchmark tests on both Power9 and Power10 machines, comparing the time taken by the original (auto-vectorised) code and the new vectorised code. I used GCC 11.5.0 on RHEL 9.5 with -O3. The benchmarks were performed using a sample test code with a vector size of 4096 and 10⁷ loop iterations.
Here are the average execution times (in seconds) over multiple runs:
Power9:

  • Before change: ~16.364 s
  • After change: ~16.180 s
  • Performance gain is modest but measurable.

Power10:

  • Before change: ~8.989 s
  • After change: ~6.446 s
  • Significant improvement, roughly 28–30% faster.

The final results of the dot product remained the same before and after the change, confirming functional correctness.

@svoj svoj changed the title To optimise dot_product in Power9 and Power10 architecture MDEV-36184 - To optimise dot_product in Power9 and Power10 architecture Feb 26, 2025
mikejuliet13 and others added 2 commits February 27, 2025 18:19
Removed space before '='
Removed POWER_IMPLEMENTATION macro from before function definition
Using int64_t and vector long long for handling data overflow
Removed code for // Process remaining elements

Signed-off-by: Manjul Mohan <[email protected]>
Contributor @svoj left a comment

Nothing else on my mind, just some minor tweaks.
We will have to clean up the commit history (so that there is just one commit), but we should be able to do that on our side.


static FVector *align_ptr(void *ptr)
{
return (FVector *)(MY_ALIGN(((intptr)ptr) + alloc_header, POWER_bytes) - alloc_header);
Contributor:

This line should be under 80 characters.

}

// Sum the accumulated vector long long values into a scalar int64_t sum
sum+= static_cast<int64_t>(ll_sum[0]) + static_cast<int64_t>(ll_sum[1]);
Contributor:

With this code it feels like sum is redundant. At least sum+= definitely is.

{
int64_t sum= 0;
vector long long ll_sum= {0, 0}; // Using vector long long for int64_t accumulation
size_t base= ((len + POWER_dims - 1) / POWER_dims) * POWER_dims; // Round up to process full vector, including padding
Contributor:

These lines should be under 80 characters, e.g. move comments up front.

vector long long ll_sum= {0, 0}; // Using vector long long for int64_t accumulation
size_t base= ((len + POWER_dims - 1) / POWER_dims) * POWER_dims; // Round up to process full vector, including padding

for (size_t i= 0; i < base; i+= 8)
Contributor:

Should be i+= POWER_dims.


// Vectorized multiplication
vector int product_hi= x_hi * y_hi;
vector int product_lo= x_lo * y_lo;
Contributor:

Can't we make use of vec_mule() / vec_mulo() here? They seem to perform a widening multiply.
There seems to be nothing for a widening add, indeed.

Would it make sense to replace builtins with vec_unpackh / vec_unpackl at least? The mix looks really disturbing.

4 participants