BFloat16(::BigFloat) and BigFloat(::BFloat16) #63
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #63      +/-   ##
==========================================
+ Coverage   66.29%   66.66%   +0.36%
==========================================
  Files           3        3
  Lines         181      183       +2
==========================================
+ Hits          120      122       +2
  Misses         61       61
This seems sensible, especially with bfloat being a truncated float, but I'm no floating-point expert so there may be an improved conversion. cc @oscardssmith maybe?
Maybe there's a way to do this faster; I don't know how Float32(::BigFloat) and the inverse are implemented. But Float32 is a superset of BFloat16, so you don't lose any precision if you convert to Float32 first.
The BFloat16(::BigFloat) version of this will differ due to double rounding. I suggest looking at how the Float16 conversion works in Julia.
I see: in cases like a-------x-b--------c, where a and c are adjacent BFloat16s and b = (a+c)/2 is a Float32, x could first be rounded to b::Float32 and subsequently to c, even though it would round to a in a direct conversion.
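The a, x, b, c case above can be checked with exact arithmetic. The following is a minimal sketch in Python rather than Julia, and it is not the package's implementation: `round_to_bits` is a hypothetical helper that rounds a positive rational to a given number of significand bits (round to nearest, ties to even) and ignores exponent range, subnormals, and special values. It assumes 8 significand bits for BFloat16 and 24 for Float32, and picks a so that its last significand bit is odd, which makes the tie at b resolve upward to c.

```python
from fractions import Fraction

def round_to_bits(x: Fraction, p: int) -> Fraction:
    """Round positive x to p significand bits, nearest with ties to even.

    Hypothetical helper for illustration only: exponent range,
    subnormals, zero, and negative inputs are not handled.
    """
    # Find e with 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    # Spacing between neighbouring p-bit values in [2**e, 2**(e + 1)).
    ulp = Fraction(2) ** (e - p + 1)
    # round() on a Fraction rounds half to even.
    return round(x / ulp) * ulp

BF16_BITS = 8   # assumed: 7 stored bits + 1 implicit bit
F32_BITS = 24   # assumed: 23 stored bits + 1 implicit bit

a = 1 + Fraction(1, 2**7)      # a BFloat16 value with an odd last bit
c = 1 + Fraction(2, 2**7)      # the next BFloat16 value (even last bit)
b = (a + c) / 2                # midpoint, representable with 24 bits
x = b - Fraction(1, 2**30)     # slightly below b, needs more than 24 bits

direct = round_to_bits(x, BF16_BITS)
via_f32 = round_to_bits(round_to_bits(x, F32_BITS), BF16_BITS)

print("direct conversion gives a:", direct == a)
print("conversion via Float32 gives c:", via_f32 == c)
```

In this model the direct conversion returns a, while the two-step conversion first lands exactly on the midpoint b and then resolves the tie to c.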
Wouldn't this happen here too? Lines 400 to 410 in ba5104d
Yes, but getting the elementary functions correctly rounded isn't a requirement (or easy), whereas correct rounding for arithmetic and conversion is relatively easy.
Merge?
As previously mentioned, this version has double rounding, which can be avoided by using the same algorithm Float16 uses. Still, this is better than not having the capability.
Thanks; so let's merge this and open an issue to track the improvement.
We currently have
and
This PR just defines those conversions via Float32, outside of the llvm_arithmetic blocks, because I believe there's no LLVM functionality for BigFloat, hence this needs to happen independently anyway.