aes-gcm: Enable AVX-512 implementation. #2444
// Intel: "15.3 DETECTION OF 512-BIT INSTRUCTION GROUPS OF THE INTEL
// AVX-512 FAMILY".
// `OPENSSL_cpuid_setup` clears these bits when XCR0[7:5] isn't 0b111.
// doesn't AVX-512 state.
Assuming PR #2439 is merged before this, then this will need to be updated.
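For readers unfamiliar with the XCR0 check mentioned in the code comment above, here is a minimal Rust sketch of the bit test. It is not the code from `OPENSSL_cpuid_setup`; the constant and function names are hypothetical, and the XCR0 value is assumed to have been read elsewhere (for example with XGETBV). Bits 5, 6, and 7 of XCR0 correspond to the opmask, ZMM_Hi256, and Hi16_ZMM state components.

```rust
/// XCR0 bits the OS must set before 512-bit AVX-512 state is usable:
/// bit 5 = opmask (k0-k7), bit 6 = ZMM_Hi256, bit 7 = Hi16_ZMM.
const XCR0_AVX512_STATE: u64 = 0b111 << 5;

/// Returns true when all of XCR0[7:5] are set.
/// `xcr0` is assumed to have been obtained elsewhere (e.g. via XGETBV).
fn os_enables_avx512_state(xcr0: u64) -> bool {
    (xcr0 & XCR0_AVX512_STATE) == XCR0_AVX512_STATE
}

/// Hypothetical helper: drop the AVX-512 feature bits from a feature word
/// when the OS does not save/restore the AVX-512 register state.
fn mask_avx512(features: u32, avx512_bits: u32, xcr0: u64) -> u32 {
    if os_enables_avx512_state(xcr0) {
        features
    } else {
        features & !avx512_bits
    }
}
```

With this sketch, an XCR0 value of `0xE7` passes the check while `0x07` (x87, SSE, and AVX state only) does not, so `mask_avx512` would clear the AVX-512 bits in the second case.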
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2444      +/-   ##
==========================================
- Coverage   96.61%   96.24%   -0.37%
==========================================
  Files         180      182       +2
  Lines       21820    21963     +143
  Branches      539      544       +5
==========================================
+ Hits        21081    21138      +57
- Misses        623      709      +86
  Partials      116      116
f9ead65 to 159aa07
This is blocked on code coverage testing. See also these pending changes upstream:
See also issue #2469 regarding a workaround to make this code work with ancient binutils.
The code coverage aspect of this comprises two parts:
I think GitHub is experimenting with some AVX-512-enabled Actions runners, so the tests might be flaky in the interim unless we explicitly use QEMU to target specific CPUs that would choose each implementation. In PR #2464 I am experimenting with QEMU 9.2.2, which adds newer CPUs than are available in the QEMU 8.2.2 used in GitHub Actions Ubuntu 24.04 runners.
# Issue the vzeroupper that is needed after using ymm or zmm registers.
# Do it here instead of at the end, to minimize overhead for small AADLEN.
vzeroupper

# GHASH the remaining data 16 bytes at a time, using xmm registers only.
.Laad_blockbyblock$local_label_suffix:
test $AADLEN, $AADLEN
It looks like you're updating this function to support only len=16 again? It might be a good idea to remove this check for len==0, and update the comment "|len| must be a multiple of 16" which this change makes outdated.
But, please note that if someone actually passes in a large amount of AAD (which can happen if someone uses the AES-GCM API to compute GMAC for an authentication-only use case), breaking it into 16-byte chunks is very bad for performance.
> But, please note that if someone actually passes in a large amount of AAD (which can happen if someone uses the AES-GCM API to compute GMAC for an authentication-only use case), breaking it into 16-byte chunks is very bad for performance.
Yes, I'm aware, but I don't know of any use cases for that at all that would be relevant to ring users. I only know that Google does it for some unknown reason.
> It looks like you're updating this function to support only len=16 again? It might be a good idea to remove this check for len==0, and update the comment "|len| must be a multiple of 16" which this change makes outdated.
Thanks. I will make those changes and also rebase this on top of the BoringSSL changes from upstream.
I didn't expect any users of large amounts of AAD either, but it turns out that with enough users there will be someone doing something unusual :(
If you're only doing 16 bytes at a time anyway, did you also consider just using gcm_gmult_vpclmulqdq_avx10()? If you XOR the 16 bytes of data into the GHASH accumulator ("Xi") and call gcm_gmult_vpclmulqdq_avx10(), that is equivalent to gcm_ghash_vpclmulqdq_avx10_512() with len=16.
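The equivalence described here follows from the GHASH update rule Xi = (Xi XOR block) · H. The following is a slow, bit-by-bit software sketch in Rust, not the assembly under review: `gmult` and `ghash` are hypothetical stand-ins for gcm_gmult_vpclmulqdq_avx10() and gcm_ghash_vpclmulqdq_avx10_512(), and the key H is assumed to be passed as a plain `u128` rather than as a precomputed table.

```rust
/// Reduction constant for GHASH's bit-reflected GF(2^128) (NIST SP 800-38D).
const R: u128 = 0xE1u128 << 120;

/// Multiply two field elements, bit by bit (SP 800-38D, Algorithm 1).
/// Blocks are interpreted as big-endian 128-bit integers.
fn gf128_mul(x: u128, y: u128) -> u128 {
    let mut z = 0u128;
    let mut v = y;
    for i in 0..128u32 {
        // Bit i of x, counting from the most significant bit.
        if ((x >> (127 - i)) & 1) == 1 {
            z ^= v;
        }
        let carry = v & 1;
        v >>= 1;
        if carry == 1 {
            v ^= R;
        }
    }
    z
}

/// Stand-in for a "gmult" routine: Xi = Xi * H.
fn gmult(xi: &mut [u8; 16], h: u128) {
    *xi = gf128_mul(u128::from_be_bytes(*xi), h).to_be_bytes();
}

/// Stand-in for a "ghash" routine: absorb `data`, whose length must be a
/// multiple of 16, one block at a time.
fn ghash(xi: &mut [u8; 16], h: u128, data: &[u8]) {
    assert!(data.len() % 16 == 0, "len must be a multiple of 16");
    for block in data.chunks_exact(16) {
        for (x, d) in xi.iter_mut().zip(block.iter()) {
            *x ^= *d;
        }
        gmult(xi, h);
    }
}

/// The alternative from the comment: XOR one block into Xi, then gmult.
fn xor_then_gmult(xi: &mut [u8; 16], h: u128, block: &[u8; 16]) {
    for (x, d) in xi.iter_mut().zip(block.iter()) {
        *x ^= *d;
    }
    gmult(xi, h);
}
```

Calling `ghash` with a single 16-byte block and calling `xor_then_gmult` with the same block leave the accumulator in the same state. The same update rule also means that one multi-block `ghash` call and a sequence of 16-byte calls agree on the result, so the concern about chunking large AAD is about speed only.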
> If you're only doing 16 bytes at a time anyway, did you also consider just using gcm_gmult_vpclmulqdq_avx10()? If you XOR the 16 bytes of data into the GHASH accumulator ("Xi") and call gcm_gmult_vpclmulqdq_avx10(), that is equivalent to gcm_ghash_vpclmulqdq_avx10_512() with len=16.
That is how we were doing things before with pre-VAES implementations, but we've been switching over to the (tweaked) ghash implementations because it's less fighting the rustc optimizer on the Rust side.
We had trouble, for example, getting rustc to always use SSE XOR instead of byte-by-byte XOR in some cases, though that might be resolved now. Also, we had trouble getting rustc to assume that the partial/single-block case is more likely than the multi-block case. In later rustc versions it will matter less once we can use likely/unlikely.
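To illustrate the two optimizer concerns, here is a minimal Rust sketch of one common way to nudge rustc, with no claim that ring does it this way. All names are hypothetical, and the "absorb" step is a plain XOR placeholder rather than a real GHASH update. The block XOR goes through `u128` so the compiler sees one 128-bit operation, and the multi-block path is moved into a `#[cold]` function so the single-block path is treated as the likely one.

```rust
/// XOR two 16-byte blocks as single 128-bit integers rather than byte by byte.
#[inline]
fn xor_block(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    (u128::from_ne_bytes(a) ^ u128::from_ne_bytes(b)).to_ne_bytes()
}

/// Multi-block path, marked cold to hint that it is rarely taken.
/// Placeholder logic: XORs each full block into the accumulator and
/// ignores any trailing partial block.
#[cold]
#[inline(never)]
fn absorb_many(acc: [u8; 16], data: &[u8]) -> [u8; 16] {
    data.chunks_exact(16).fold(acc, |acc, chunk| {
        let mut block = [0u8; 16];
        block.copy_from_slice(chunk);
        xor_block(acc, block)
    })
}

/// Dispatcher: the single-block case stays on the hot path.
fn absorb(acc: [u8; 16], data: &[u8]) -> [u8; 16] {
    if data.len() == 16 {
        let mut block = [0u8; 16];
        block.copy_from_slice(data);
        xor_block(acc, block)
    } else {
        absorb_many(acc, data)
    }
}
```

Whether this actually produces a single SSE XOR or the desired branch layout depends on the rustc version and target, so the generated assembly would still need to be inspected.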
Regardless, in PR #2478 I tweaked the AVX2 version of this function to be based on the gmult implementation instead of the ghash implementation. (See also PR #2477, which attempts the same tweaks still based on the ghash implementation).
> I didn't expect any users of large amounts of AAD either, but it turns out that with enough users there will be someone doing something unusual :(
I don't think ring has any unusual users. We rely on people telling us what they need and we try to optimize for what people are actually using.
a20770f to 34889f5
34889f5 to 7ca1e37