Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aes-gcm: Enable AVX-512 implementation. #2444

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

briansmith
Copy link
Owner

No description provided.

@briansmith briansmith self-assigned this Mar 5, 2025
// Intel: "15.3 DETECTION OF 512-BIT INSTRUCTION GROUPS OF THE INTEL
// AVX-512 FAMILY".
// `OPENSSL_cpuid_setup` clears these bits when XCR0[7:5] isn't 0b111.
// doesn't AVX-512 state.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming PR #2439 is merged before this, then this will need to be updated.

Copy link

codecov bot commented Mar 5, 2025

Codecov Report

Attention: Patch coverage is 39.86014% with 86 lines in your changes missing coverage. Please review.

Project coverage is 96.24%. Comparing base (bcf68dd) to head (7ca1e37).

Files with missing lines Patch % Lines
src/aead/aes_gcm/vaesclmulavx512.rs 0.00% 67 Missing ⚠️
src/aead/gcm/vclmulavx512.rs 0.00% 15 Missing ⚠️
src/aead/gcm.rs 0.00% 3 Missing ⚠️
src/cpu.rs 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2444      +/-   ##
==========================================
- Coverage   96.61%   96.24%   -0.37%     
==========================================
  Files         180      182       +2     
  Lines       21820    21963     +143     
  Branches      539      544       +5     
==========================================
+ Hits        21081    21138      +57     
- Misses        623      709      +86     
  Partials      116      116              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@briansmith
Copy link
Owner Author

This is blocked on code coverage testing.

See also these pending changes upstream:
https://boringssl-review.googlesource.com/c/boringssl/+/77168
https://boringssl-review.googlesource.com/c/boringssl/+/77167

See also issue #2469 regarding developing a workaround for this code to work in ancient binutils.

@briansmith
Copy link
Owner Author

The code coverage aspect of this comprises two parts:

  • Ensure that the aes-gcm-avx2 implementation is tested in the coverage job regardless of whether the GItHub Actions runner would normally use the avx512 version.
  • Ensure that the aes-gcm-avx512 implementation is tested in the coverage job regardless of whether the GItHub Actions runner would normally use the avx512 version.

I think GitHub is experimenting with some AVX-512-enabled actions runners so the tests might be flaky in the interim without explicitly using QEMU to target specific CPUs that would choose each implementation.

In PR #2464 I am experimenting with QEMU 9.2.2, which adds newer CPUs than are available in QEMU 8.2.2 used in GitHub Actions Ubuntu 24.04 runners.

@briansmith briansmith added this to the 0.17.15 milestone Mar 11, 2025
# Issue the vzeroupper that is needed after using ymm or zmm registers.
# Do it here instead of at the end, to minimize overhead for small AADLEN.
vzeroupper

# GHASH the remaining data 16 bytes at a time, using xmm registers only.
.Laad_blockbyblock$local_label_suffix:
test $AADLEN, $AADLEN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you're updating this function to support only len=16 again? It might be a good idea to remove this check for len==0, and update the comment "|len| must be a multiple of 16" which this change makes outdated.

But, please note that if someone actually passes in a large amount of AAD (which can happen if someone uses the AES-GCM API to compute GMAC for an authentication-only use case), breaking it into 16-byte chunks is very bad for performance.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, please note that if someone actually passes in a large amount of AAD (which can happen if someone uses the AES-GCM API to compute GMAC for an authentication-only use case), breaking it into 16-byte chunks is very bad for performance.

Yes, I'm aware, but I don't know of any use cases for that at all that would be relevant to ring users. I only know that Google does it for some unknown reason.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you're updating this function to support only len=16 again? It might be a good idea to remove this check for len==0, and update the comment "|len| must be a multiple of 16" which this change makes outdated.

Thanks. I will make those changes and also rebase this on top of the BoringSSL changes from upstream.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't expect any users of large amounts of AAD either, but it turns out that with enough users there will be someone doing something unusual :(

If you're only doing 16 bytes at a time anyway, did you also consider just using gcm_gmult_vpclmulqdq_avx10()? If you XOR the 16 bytes of data into the GHASH accumulator ("Xi") and call gcm_gmult_vpclmulqdq_avx10(), that is equivalent to gcm_ghash_vpclmulqdq_avx10_512() with len=16.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're only doing 16 bytes at a time anyway, did you also consider just using gcm_gmult_vpclmulqdq_avx10()? If you XOR the 16 bytes of data into the GHASH accumulator ("Xi") and call gcm_gmult_vpclmulqdq_avx10(), that is equivalent to gcm_ghash_vpclmulqdq_avx10_512() with len=16.

That is how we were doing things before with pre-VAES implementations, but we've been switching over to the (tweaked) ghash implementations because it's less fighting the rustc optimizer on the Rust side.

We had trouble, for example, getting rustc to always use SSE XOR instead of byte-by-byte XOR, in some cases, thought that might be resolved now. Also, we had trouble getting rustc to assume that the partial/single-block case is more likely than the multi-block case. In later rustc versions it will matter less once we can use likely/unlikely.

Regardless, in PR #2478 I tweaked the AVX2 version of this function to be based on the gmult implementation instead of the ghash implementation. (See also PR #2477, which attempts the same tweaks still based on the ghash implementation).

I didn't expect any users of large amounts of AAD either, but it turns out that with enough users there will be someone doing something unusual :(

I don't think ring has any unusual users. We rely on people telling us what they need and we try to optimize for what people are actually using.

@briansmith briansmith force-pushed the b/aes-gcm-avx512 branch 2 times, most recently from a20770f to 34889f5 Compare March 19, 2025 23:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants