Expose algebraic floating point intrinsics #136457
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @thomcc (or someone else) some time within the next two weeks. Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (
Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter is adapted for these changes. cc @rust-lang/miri, @rust-lang/wg-const-eval
Thanks for the speedy review @saethlin! I had a couple questions:
Thanks for the PR!
Going through the usual process, the first step would be a t-libs-api ACP to gauge the team's thinking on whether and how this should be exposed. This PR cannot land before a corresponding FCP has been accepted.
In terms of tests, usually we have doc tests. Also, given that the semantics are far from obvious (these operations are nondeterministic!), they need to be documented more carefully - probably in some central location, which is then referenced from everywhere.
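The nondeterminism here is just license to reassociate, contract, and so on: the same expression may round differently depending on which order the compiler picks. A minimal stable-Rust illustration (plain `f64` arithmetic, no algebraic intrinsics involved) of why reassociation is observable:

```rust
fn main() {
    // IEEE 754 addition is not associative: the two groupings round
    // differently, so a compiler that is free to reassociate may
    // legitimately return either value.
    let left = (0.1f64 + 0.2) + 0.3;
    let right = 0.1f64 + (0.2 + 0.3);
    assert_ne!(left, right);
    println!("{left:.17} vs {right:.17}");
}
```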
@RalfJung: Thanks for the quick response!
https://doc.rust-lang.org/nightly/std/primitive.f32.html seems like a good place; that's where we already document everything special around NaNs. So a new section on algebraic operations there would probably be a good fit.
I don't see an existing codegen test for the intrinsics, so these should probably get one. https://github.com/rust-lang/rust/tree/3f33b30e19b7597a3acbca19e46d9e308865a0fe/tests/codegen/float would be a reasonable home. For reference, this would just be a file containing a function like this for each of the new methods, in order to verify that the flags we expect are getting set:

```
// CHECK-LABEL: float @f32_algebraic_add(
#[no_mangle]
pub fn f32_algebraic_add(a: f32, b: f32) -> f32 {
    // CHECK: fadd reassoc nsz arcp contract float %a, %b
    a.algebraic_add(b)
}
```
The Miri subtree was changed cc @rust-lang/miri
```
// CHECK-LABEL: fp128 @f128_algebraic_add(
#[no_mangle]
pub fn f128_algebraic_add(a: f128, b: f128) -> f128 {
    // CHECK: fadd reassoc nsz arcp contract fp128 {{(%a, %b)|(%b, %a)}}
```
The addition and multiplication cases both end up as `%b, %a` rather than `%a, %b`, which surprised me but isn't incorrect. I opted to allow either in case the behavior changes in the future.
This looks pretty reasonable to me, but all the public functions should get some examples. I think it may also be good to give a small demo of how this may work at the end of the new "Algebraic operators" section. For example, the below:

```
x = x.algebraic_add(a.algebraic_mul(b));
```

may be rewritten as either of the following:

```
x = x + (a * b); // As written
x = (a * b) + x; // Reordered to allow using a single `fma`
```
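The second rewrite is exactly the contraction an `fma` performs. A small stable-Rust sketch, using `f64::mul_add` (already in std), showing that the fused form can round differently from the unfused one:

```rust
fn main() {
    // a = 1 + 2^-30. Computing a*a - 1 in two steps loses the 2^-60
    // term when the multiply rounds first; the fused multiply-add
    // rounds only once and keeps it.
    let a = 1.0f64 + (2.0f64).powi(-30);
    let fused = a.mul_add(a, -1.0);
    let unfused = a * a - 1.0;
    assert_ne!(fused, unfused);
}
```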
**Per function examples:** Did you have specific examples in mind? I'm struggling to think of ones that aren't repetitive / low signal-to-noise (i.e. assert this algebraic add is approximately equal to a normal add). Should we instead link to the central documentation and test approximate equality of each of the ops in normal tests like this so they don't clutter the documentation?

**Central example:** Made a couple small edits for brevity. How's this look?
@bors try

You can run this locally with
Expose algebraic floating point intrinsics

# Problem

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on an i7-10875H.

### C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:

```cc
float dot(float *a, float *b, size_t len) {
#pragma clang fp reassociate(on)
  float sum = 0.0;
  for (size_t i = 0; i < len; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

Assembly: https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68

### Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`:

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
```

Assembly: https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9

### Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```

Assembly: https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca

# Proposed Change

Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128`, gated on a new `float_algebraic` feature.

# Alternatives Considered

rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles. In the meantime, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

# References

* rust-lang#21690
* rust-lang/libs-team#532
* rust-lang#136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps

try-job: x86_64-gnu-nopt
try-job: x86_64-gnu-aux
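To illustrate what the reassociation license buys: a stable-Rust sketch of the transformation the compiler is allowed to perform on the algebraic version, accumulating into independent partial sums that the backend can map onto SIMD lanes. (The function names here are illustrative, not part of the PR.)

```rust
// Scalar reference: sums strictly left to right, as stable `+` requires.
fn dot_strict(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// What reassociation permits: four independent accumulators, which
// vectorize naturally, followed by a final horizontal reduction.
fn dot_reassociated(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    for (i, (x, y)) in a.iter().zip(b).enumerate() {
        acc[i % 4] += x * y;
    }
    acc.iter().sum()
}

fn main() {
    let a: Vec<f32> = (0..8).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..8).map(|i| (i + 1) as f32).collect();
    // The two orderings agree here, but in general they may round
    // differently -- that difference is what the algebraic ops permit.
    println!("{} {}", dot_strict(&a, &b), dot_reassociated(&a, &b));
}
```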
Yeah, that's what I've been running but the last failure was on

x86_64-gnu-aux is the name of a test runner. That runner runs many tests. Among them, it runs

Gotcha, thanks! Confirmed I could reproduce with:

```
MIRIFLAGS="-Zmiri-many-seeds" ./x miri std -- algebraic
```

which is now passing at 9602bbb.
☀️ Try build successful - checks-actions

Could the tests get something like

Such logic should be accompanied by comments explaining that this test does not imply any form of guarantee; it is merely a sentinel for us to notice when the real precision of these operations goes down.
Updated all tests to assert exact equality outside of Miri. To make this work I had to change:

```diff
--- a/library/std/tests/floats/lib.rs
+++ b/library/std/tests/floats/lib.rs
@@ -10,7 +10,7 @@ macro_rules! assert_approx_eq {
         let (a, b) = (&$a, &$b);
         let diff = (*a - *b).abs();
         assert!(
-            diff < $lim,
+            diff <= $lim,
             "{a:?} is not approximately equal to {b:?} (threshold {lim:?}, difference {diff:?})",
             lim = $lim
         );
```

266ca8b#diff-a4ab8b2578f382b20890fc918f6cdbffcaa057078f8793f0d974c011003a2b01R13

If we don't want to loosen that we'll need to do something like this instead:

```diff
--- a/library/std/tests/floats/f32.rs
+++ b/library/std/tests/floats/f32.rs
@@ -926,13 +926,17 @@ fn test_algebraic() {
     // This is a check of current implementations and does NOT imply any form of
     // guarantee about future behavior. The compiler reserves the right to make
     // these operations inexact matches in the future.
-    let eps_add = if cfg!(miri) { 1e-3 } else { 0.0 };
-    let eps_mul = if cfg!(miri) { 1e-1 } else { 0.0 };
-    let eps_div = if cfg!(miri) { 1e-4 } else { 0.0 };
-
-    assert_approx_eq!(a.algebraic_add(b), a + b, eps_add);
-    assert_approx_eq!(a.algebraic_sub(b), a - b, eps_add);
-    assert_approx_eq!(a.algebraic_mul(b), a * b, eps_mul);
-    assert_approx_eq!(a.algebraic_div(b), a / b, eps_div);
-    assert_approx_eq!(a.algebraic_rem(b), a % b, eps_div);
+    if cfg!(miri) {
+        assert_approx_eq!(a.algebraic_add(b), a + b, 1e-3);
+        assert_approx_eq!(a.algebraic_sub(b), a - b, 1e-3);
+        assert_approx_eq!(a.algebraic_mul(b), a * b, 1e-1);
+        assert_approx_eq!(a.algebraic_div(b), a / b, 1e-4);
+        assert_approx_eq!(a.algebraic_rem(b), a % b, 1e-4);
+    } else {
+        assert_eq!(a.algebraic_add(b), a + b);
+        assert_eq!(a.algebraic_sub(b), a - b);
+        assert_eq!(a.algebraic_mul(b), a * b);
+        assert_eq!(a.algebraic_div(b), a / b);
+        assert_eq!(a.algebraic_rem(b), a % b);
+    }
 }
```
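For readers outside the tree, the `assert_approx_eq!` helper the diff touches boils down to something like this standalone sketch (the macro body mirrors the diff above; the usage in `main` is illustrative):

```rust
// Minimal standalone version of the test helper: asserts |a - b| <= lim.
// The inclusive bound is what lets lim = 0.0 express exact equality.
macro_rules! assert_approx_eq {
    ($a:expr, $b:expr, $lim:expr) => {{
        let (a, b) = (&$a, &$b);
        let diff = (*a - *b).abs();
        assert!(
            diff <= $lim,
            "{a:?} is not approximately equal to {b:?} (threshold {lim:?}, difference {diff:?})",
            lim = $lim
        );
    }};
}

fn main() {
    // Approximate comparison tolerates the usual binary rounding error.
    assert_approx_eq!(0.1f64 + 0.2, 0.3, 1e-9);
    // Exact comparison also works, since the bound is inclusive.
    assert_approx_eq!(1.5f64, 1.5, 0.0);
}
```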
Latest changes lgtm. r=me with a squash and a passing try

@bors try
⌛ Testing commit 8ff7052 with merge 994f291efdb7969b06dff1ade574e447dbfacd91...
Expose algebraic floating point intrinsics
☀️ Try build successful - checks-actions

@bors r+
Rollup of 11 pull requests

Successful merges:

- rust-lang#136457 (Expose algebraic floating point intrinsics)
- rust-lang#137880 (Autodiff batching)
- rust-lang#137897 (fix pthread-based tls on apple targets)
- rust-lang#138024 (Allow optimizing out `panic_bounds_check` in Unicode checks.)
- rust-lang#138546 (Add integer to string formatting tests)
- rust-lang#138826 (StableMIR: Add `associated_items`.)
- rust-lang#138950 (replace extra_filename with strict version hash in metrics file names)
- rust-lang#139274 (Rustdoc: typecheck settings.js)
- rust-lang#139285 (use lower case to match other error messages)
- rust-lang#139341 (Apply `Recovery::Forbidden` when reparsing pasted macro fragments.)
- rust-lang#139389 (make `Arguments::as_statically_known_str` doc(hidden))

r? `@ghost`
`@rustbot` modify labels: rollup
Rollup merge of rust-lang#136457 - calder:master, r=tgross35

Expose algebraic floating point intrinsics