
Rearranged struct fields to prevent ldp page crossings #78

Open: wants to merge 3 commits into master

dorukkarademirler

Explanation of Structural Modifications

To enhance performance and reduce data cache pressure, the following structural modifications have been implemented:

Reordering Elements: The structure has been modified to prevent LDP instructions from crossing a 4K page boundary.

  1. Poly Array: Placed at the beginning of the structure.
  2. invln2_scaled: Positioned immediately after the poly array and aligned to 16 bytes.
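The reordering described above can be sketched in C. This is only an illustration: the struct name `example_data`, the array sizes, and the exact field types are assumptions made for the sketch and are not taken from the patch. It places `poly` first, puts `invln2_scaled` on a 16-byte boundary right after it, leaves `tab` last, and uses compile-time checks to pin the resulting offsets.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>   /* uint64_t */

/* Hypothetical sizes, chosen only for illustration. */
#define EXAMPLE_POLY_ORDER 3
#define EXAMPLE_TABLE_SIZE 32

/* Hypothetical layout: poly first, invln2_scaled next, tab last.
   Over-aligning the first member raises the alignment of the whole
   struct to 64 bytes.  */
struct example_data
{
  _Alignas (64) double poly[EXAMPLE_POLY_ORDER];
  /* With 3 doubles in poly (24 bytes), the 16-byte alignment below
     inserts 8 bytes of padding, so the field lands at offset 32.  */
  _Alignas (16) double invln2_scaled;
  uint64_t tab[EXAMPLE_TABLE_SIZE];
};

/* Compile-time checks: the build fails if the layout drifts.  */
static_assert (offsetof (struct example_data, poly) == 0,
               "poly must be the first member");
static_assert (offsetof (struct example_data, invln2_scaled) % 16 == 0,
               "invln2_scaled must be 16-byte aligned");
static_assert (_Alignof (struct example_data) == 64,
               "struct must be 64-byte aligned");
```

Note that in this sketch the 16-byte alignment separates `invln2_scaled` from the last element of `poly` by 8 bytes of padding, which is the kind of trade-off worth measuring on each target.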

Alignment Adjustments:

  1. The entire structure is now aligned to 64 bytes to further prevent page crossings.
  2. The tab variable has been moved by 256 bytes. Aligning the entire structure to a relatively small value fixes the page-crossing problem while minimizing the bytes wasted in the worst case.
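As a rough way to reason about why base alignment matters, the sketch below checks whether a 16-byte access (the width of an LDP that loads two 64-bit registers) spans two 4K pages. The helper names `crosses_4k_page` and `count_crossings` are hypothetical and are not part of the library; the offsets used are illustrative only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K 4096u

/* True if the byte range [addr, addr + len) spans two 4K pages.
   Assumes len > 0.  */
static bool
crosses_4k_page (uintptr_t addr, size_t len)
{
  return (addr / PAGE_SIZE_4K) != ((addr + len - 1) / PAGE_SIZE_4K);
}

/* Counts how many base addresses in a two-page window, stepping by
   base_align, make an access of len bytes at base + offset cross a
   4K page boundary.  */
static unsigned
count_crossings (size_t base_align, size_t offset, size_t len)
{
  unsigned n = 0;
  for (uintptr_t base = 0; base < 2 * PAGE_SIZE_4K; base += base_align)
    if (crosses_4k_page (base + offset, len))
      n++;
  return n;
}
```

For example, `count_crossings (64, 32, 16)` finds no crossing placement, because a 16-byte access at a 16-byte-aligned address always stays inside one page, whereas `count_crossings (8, 32, 16)` finds placements where the access straddles the boundary (for instance an access starting at address 4088).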

These changes collectively contribute to improved performance and reduced data cache pressure.

…p instructions from crossing page boundaries
@dorukkarademirler
Author

For any issues or further communication related to this repository, please use my open source development email at Qualcomm: [email protected].

@joeramsay
Contributor

Thanks for your interest in contributing! Please could you provide some details of measured speedup, with your architecture and compiler? In case you don't know, you can use the mathbench binary to get microbench numbers.

Is there some way of achieving what you want without aligning invln2_scaled by 16? I see a small (2-3%) performance regression on Neoverse V1 with GCC 14 from this patch. I think this is because the alignment prevents LDP fusion with the last element of poly.

To merge this we need a signed contribution agreement, so that we can update GLIBC under our FSF copyright assignment. When the PR is ready to merge, please could you fill out https://github.com/ARM-software/optimized-routines/blob/master/contributor-agreement.pdf and email it to [email protected]? Printed/scanned is fine.

@dorukkarademirler
Author

dorukkarademirler commented Jan 29, 2025

This issue was fixed on Qualcomm's Android build for arm64. The root cause was that, after updating to LLVM 18, the LDP instructions crossed a page boundary with the original structure layout. These changes improve performance and reduce data cache pressure. The goal is not a speedup but to prevent anomalies and significant performance loss. I am attaching an image of the performance loss observed.

Looking at the Geekbench results, libm.so's CPU usage was approximately 3% without page crossings and rose to around 11% with page crossings.

simpleperf record -e cpu-cycles results:
LLVM17: no page crossings.

[image]

LLVM18: second ldp crosses the page.
[image: ADD]

Performance Comparison

[image: cycles final]

Regarding the small (2-3%) performance regression on Neoverse V1 with GCC 14, I committed a version where there isn't any alignment for invln2_scaled. You can check that version as well.

As for the contribution agreement, Qualcomm might already have an agreement with ARM. If I need to do this individually as well, I will send it.
