Performance investigation #37

Open · 1 of 3 tasks
NHDaly (Member) commented Nov 5, 2018

Splitting out the benchmark discussion from #36.


Here's the benchmark in a gist:
https://gist.github.com/NHDaly/a8fae0d1d65ab1066c585c27e54146fa

And the results in a google spreadsheet:
https://docs.google.com/spreadsheets/d/1Lc3ughgwwK25cpwbnuLRxw19EtaMxdCsMmtnspT_4H8/edit?usp=sharing


Here are the results:

| Category | Type | `identity` time (ms) | allocs | `÷` time (ms) | allocs | `+` time (ms) | allocs | `/` time (ms) | allocs | `*` time (ms) | allocs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int | Int32 | 1.35 | 0 | 5.16 | 0 | 1.47 | 0 | 2.35 | 0 | 1.61 | 0 |
|  | Int64 | 1.86 | 0 | 18.66 | 0 | 1.89 | 0 | 2.46 | 0 | 1.95 | 0 |
|  | Int128 | 3.77 | 0 | 26.07 | 0 | 3.85 | 0 | 16.74 | 0 | 3.95 | 0 |
| Float | Float32 | 1.35 | 0 | 28.97 | 0 | 1.47 | 0 | 1.75 | 0 | 1.47 | 0 |
|  | Float64 | 1.85 | 0 | 27.37 | 0 | 1.88 | 0 | 2.45 | 0 | 1.89 | 0 |
| FixedDecimal | FD{Int32,2} | 1.35 | 0 | 5.16 | 0 | 1.48 | 0 | 38.20 | 0 | 31.73 | 0 |
|  | FD{Int64,2} | 1.86 | 0 | 18.75 | 0 | 1.89 | 0 | 59.18 | 0 | 47.03 | 0 |
|  | FD{Int128,2} | 1320.01 | 14000000 | 26.19 | 0 | 1324.35 | 14000000 | 7267.32 | 72879639 | 6139.13 | 62000000 |

Here are my current questions:

- For some reason, even just element-wise copying an array of `FixedDecimal{Int128, 2}` into another array allocates like crazy (14,000,000 allocations / 1,000,000 elements).
  - Where do these allocations come from?
- `/` and `*` of `FixedDecimal{Int64, 2}` are more expensive than for `FixedDecimal{Int32, 2}`, by a factor of around 1.5x each, whereas `/` and `*` for `Int64` and `Int32` are almost identical.
  - My current guess is that this might be related to promoting to `Int128` during those operations (due to `widemul`), which seems to be slower than `Int64` across the board.
- `/` for `Int128` is like 6x slower than for `Int32`! Where does that come from?
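For reference, the first question can be reduced to a minimal element-wise copy (array size taken from the benchmark above; this is a sketch, not the exact gist code):

```julia
using FixedPointDecimals

const FD128 = FixedPointDecimals.FixedDecimal{Int128,2}

# Element-wise "identity" copy. Each setindex! goes through convert,
# which is where the surprising allocations show up for Int128-backed values.
src = [FD128(i) for i in 1:1_000_000]
dst = similar(src)
@time for i in eachindex(src)
    @inbounds dst[i] = src[i]
end
```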
NHDaly (Member, Author) commented Nov 7, 2018

Ah, okay, so I think this is a clue...

For some reason, even simple assignment is calling `convert`. Does this make sense?

```julia
julia> x = FixedPointDecimals.FixedDecimal{Int128,2}(3)
FixedDecimal{Int128,2}(3.00)

julia> @code_warntype Ref(x)[] = FixedPointDecimals.FixedDecimal{Int128,2}(5)
Body::Base.RefValue{FixedDecimal{Int128,2}}
33 1 ─      goto #3 if not false
   2 ─      nothing
   3 ┄ %3 = invoke Base.convert(FixedDecimal{Int128,2}::Type{FixedDecimal{Int128,2}}, _3::FixedDecimal{Int128,2})::FixedDecimal{Int128,2}
   │        (Base.setfield!)(b, :x, %3)
   └──      return b
```

And then, `convert` for `Int128` is expensive, because it widens to a `BigInt` for some reason:

```julia
julia> @code_warntype Base.convert(FixedPointDecimals.FixedDecimal{Int128,2}, FixedPointDecimals.FixedDecimal{Int128,2}(3))
Body::FixedDecimal{Int128,2}
282 1 ─       (Base.checked_sdiv_int)(100, 100)
283 │   %2  = (Base.getfield)(x, :i)::Int128
    │   %3  = invoke BigInt(%2::Int128)::BigInt
    │         (Base.slt_int)(1, 0)
    │         (Base.slt_int)(1, 0)
    │         (Base.ule_int)(0x00000000000000000000000000000001, 0x0000000000000000ffffffffffffffff)
    │   %7  = Base.GMP.MPZ.set_ui::typeof(Base.GMP.MPZ.set_ui)
    │   %8  = invoke %7(0x0000000000000001::UInt64)::BigInt
    │   %9  = Base.GMP.MPZ.mul::typeof(Base.GMP.MPZ.mul)
    │   %10 = invoke %9(%3::BigInt, %8::BigInt)::BigInt
    │   %11 = invoke $(Expr(:static_parameter, 1))(%10::BigInt)::Int128
    │   %12 = %new(FixedDecimal{Int128,2}, %11)::FixedDecimal{Int128,2}
    └──       return %12
    2 ─       $(Expr(:unreachable))
```

The problem is it's calling this `convert` here:

```julia
julia> @which Base.convert(FixedPointDecimals.FixedDecimal{Int128,2}, FixedPointDecimals.FixedDecimal{Int128,2}(3))
convert(::Type{FixedDecimal{T,f}}, x::FixedDecimal{U,g}) where {T, f, U, g} in FixedPointDecimals at /Users/nathan.daly/.julia/dev/FixedPointDecimals/src/FixedPointDecimals.jl:289
```

So I think the solution is to add an overload for the case where the types are the same. I'll try that and post back.
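A minimal sketch of what such an overload could look like (the `FD` alias is the package's internal name for `FixedDecimal`; the exact placement in the source is an assumption):

```julia
using FixedPointDecimals
const FD = FixedPointDecimals.FixedDecimal  # the package's internal alias

# No-op convert when source and target FixedDecimal types are identical,
# bypassing the general f ≥ g path and its widemul/BigInt detour.
Base.convert(::Type{FD{T, f}}, x::FD{T, f}) where {T, f} = x
```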

NHDaly (Member, Author) commented Nov 7, 2018

Ah, okay, this issue exactly describes the problem. We're missing a `convert(::Type{FD{T,f}}, x::FD{T,f})`:
JuliaLang/julia#17559

omus (Contributor) commented Nov 8, 2018

> And then, convert for Int128 is expensive, because it widens to a BigInt for some reason

The only `widen` call during construction is from `max_exp10`. If you're performing math with these types we do end up using `widemul`.

NHDaly (Member, Author) commented Nov 8, 2018

Ah, actually, no, I think it was coming from this `widemul`:

```julia
function convert(::Type{FD{T, f}}, x::FD{U, g}) where {T, f, U, g}
    if f ≥ g
        # Compute `10^(f - g)` without overflow
        powt = div(coefficient(FD{T, f}), coefficient(FD{U, g}))
        reinterpret(FD{T, f}, T(widemul(x.i, powt)))
```

Because `f == g` matches `f ≥ g`.

In this case, where the types are identical, no conversion was necessary. It may also be worth investigating whether we can make other simplifications when `f` and `g` are different but `T` and `U` are the same, though I think conversion is probably not that common an operation.
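For context on why that `widemul` allocates specifically for `Int128`: `widen(Int128)` is `BigInt` in Base Julia, so the widened multiply lands on a heap-allocated type. A quick REPL check (this is Base behavior, not anything in this package):

```julia
# widen steps each fixed-size integer up one size; Int128 has no larger
# fixed-size integer, so it widens to the heap-allocated BigInt.
@assert widen(Int32)  === Int64
@assert widen(Int64)  === Int128
@assert widen(Int128) === BigInt

# Hence widemul on Int128 operands produces (and allocates) a BigInt:
@assert widemul(Int128(3), Int128(100)) isa BigInt
```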

NHDaly (Member, Author) commented Dec 6, 2018

Hi again!

So we've been poking at this for the last couple weeks, and we've got some significant performance improvements! :)

First, though, we've improved the benchmarks we're using to judge these changes. I think we've managed to cut out any of the extra costs that were in the old benchmark (like iterating through an array), so that we can show just the cost of each operation. Here's the new file:
https://gist.github.com/NHDaly/4bccca3bc44485d1b352c8a5137654c0

And here are the results for master, currently at 1768c58, on my machine:

| Category | Type | `*` time (ns) | allocs | `/` time (ns) | allocs | `+` time (ns) | allocs | `div` time (ns) | allocs | `identity` time (ns) | allocs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int | Int32 | 0.35 | 0 | 3.78 | 0 | 0.35 | 0 | 0.35 | 0 | 0.35 | 0 |
|  | Int64 | 0.35 | 0 | 3.78 | 0 | 0.35 | 0 | 0.35 | 0 | 0.35 | 0 |
|  | Int128 | 0.42 | 0 | 12.92 | 0 | 0.35 | 0 | 0.35 | 0 | 0.35 | 0 |
| Float | Float32 | 0.35 | 0 | 0.35 | 0 | 0.35 | 0 | 2.68 | 0 | 0.35 | 0 |
|  | Float64 | 0.35 | 0 | 0.35 | 0 | 0.35 | 0 | 2.58 | 0 | 0.35 | 0 |
| FixedDecimal | FD{Int32,2} | 10.30 | 0 | 8.70 | 0 | 0.35 | 0 | 0.35 | 0 | 0.35 | 0 |
|  | FD{Int64,2} | 18.16 | 0 | 21.00 | 0 | 0.35 | 0 | 2.77 | 0 | 0.35 | 0 |
|  | FD{Int128,2} | 12037.26 | 639000000 | 13702.41 | 639000000 | 10925.18 | 615000000 | 14801.96 | 622000000 | 1080.33 | 463000000 |
| Big | BigFloat | 70.83 | 8000000 | 140.22 | 8000000 | 42.25 | 8000000 | 153.84 | 8000000 | 0.35 | 6000000 |
|  | BigInt | 202.63 | 11000000 | 487.38 | 18000000 | 182.90 | 11000000 | 200.24 | 10000000 | 0.35 | 8000000 |

With the following caveat:
* note: The times in each column are truncated to a minimum of 0.345ns, which is the estimated time for a single clock cycle on my machine.

What would you all think about merging some version of this benchmark file into this repo? It would be nice to have it running automatically and, ideally, detecting regressions between commits. With something like this, where performance is a feature, it's nice to be able to make informed statements about it!
I've just opened #42 to discuss it. :)
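As a rough illustration of the per-operation style of measurement, timings like these can be taken with BenchmarkTools (a sketch in that spirit, not the exact gist code):

```julia
using BenchmarkTools, FixedPointDecimals

const FD64 = FixedPointDecimals.FixedDecimal{Int64,2}
x, y = FD64(1.23), FD64(4.56)

# Interpolating with $ keeps the values from being treated as untyped
# globals, so @btime reports only the cost of the operation itself.
@btime $x * $y
@btime $x / $y
@btime $x + $y
```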
