Performance investigation #37

Open · 1 of 3 tasks
NHDaly (Member) commented Nov 5, 2018

Splitting out the benchmark discussion from #36.


Here's the benchmark in a gist:
https://gist.github.com/NHDaly/a8fae0d1d65ab1066c585c27e54146fa

And the results in a google spreadsheet:
https://docs.google.com/spreadsheets/d/1Lc3ughgwwK25cpwbnuLRxw19EtaMxdCsMmtnspT_4H8/edit?usp=sharing


Here are the results:

| Category | Type | `identity` time (ms) | allocs | `÷` time (ms) | allocs | `+` time (ms) | allocs | `/` time (ms) | allocs | `*` time (ms) | allocs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int | Int32 | 1.35 | 0 | 5.16 | 0 | 1.47 | 0 | 2.35 | 0 | 1.61 | 0 |
|  | Int64 | 1.86 | 0 | 18.66 | 0 | 1.89 | 0 | 2.46 | 0 | 1.95 | 0 |
|  | Int128 | 3.77 | 0 | 26.07 | 0 | 3.85 | 0 | 16.74 | 0 | 3.95 | 0 |
| Float | Float32 | 1.35 | 0 | 28.97 | 0 | 1.47 | 0 | 1.75 | 0 | 1.47 | 0 |
|  | Float64 | 1.85 | 0 | 27.37 | 0 | 1.88 | 0 | 2.45 | 0 | 1.89 | 0 |
| FixedDecimal | FD{Int32,2} | 1.35 | 0 | 5.16 | 0 | 1.48 | 0 | 38.20 | 0 | 31.73 | 0 |
|  | FD{Int64,2} | 1.86 | 0 | 18.75 | 0 | 1.89 | 0 | 59.18 | 0 | 47.03 | 0 |
|  | FD{Int128,2} | 1320.01 | 14000000 | 26.19 | 0 | 1324.35 | 14000000 | 7267.32 | 72879639 | 6139.13 | 62000000 |

Here are my current questions:

- For some reason, even just element-wise copying an array of `FixedDecimal{Int128, 2}` into another array allocates like crazy (14,000,000 allocations / 1,000,000 elements).
  - Where do these allocations come from?
- `/` and `*` of `FixedDecimal{Int64, 2}` are more expensive than for `FixedDecimal{Int32, 2}`, by a factor of around 1.5x each, whereas `/` and `*` for `Int64` and `Int32` are almost identical.
  - My current guess is that this might be related to promoting to `Int128` during those operations (due to `widemul`), which seems to be slower than `Int64` across the board.
- `/` for `Int128` is like 6x slower than for `Int32`! Where does that come from?
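For reference, the first question can be reduced to a minimal element-wise copy (array size taken from the benchmark above; this is a sketch, not the exact gist code):

```julia
using FixedPointDecimals

const FD128 = FixedPointDecimals.FixedDecimal{Int128,2}

# Element-wise "identity" copy. Each setindex! goes through convert,
# which is where the surprising allocations show up for Int128-backed values.
src = [FD128(i) for i in 1:1_000_000]
dst = similar(src)
@time for i in eachindex(src)
    @inbounds dst[i] = src[i]
end
```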
NHDaly (Member, Author) commented Nov 7, 2018

Ah, okay, so I think this is a clue...

For some reason, even simple assignment is calling `convert`. Does this make sense?

```julia
julia> x = FixedPointDecimals.FixedDecimal{Int128,2}(3)
FixedDecimal{Int128,2}(3.00)

julia> @code_warntype Ref(x)[] = FixedPointDecimals.FixedDecimal{Int128,2}(5)
Body::Base.RefValue{FixedDecimal{Int128,2}}
33 1 ─      goto #3 if not false
   2 ─      nothing
   3 ┄ %3 = invoke Base.convert(FixedDecimal{Int128,2}::Type{FixedDecimal{Int128,2}}, _3::FixedDecimal{Int128,2})::FixedDecimal{Int128,2}
   │        (Base.setfield!)(b, :x, %3)
   └──      return b
```

And then, `convert` for `Int128` is expensive, because it widens to a `BigInt` for some reason:

```julia
julia> @code_warntype Base.convert(FixedPointDecimals.FixedDecimal{Int128,2}, FixedPointDecimals.FixedDecimal{Int128,2}(3))
Body::FixedDecimal{Int128,2}
282 1 ─       (Base.checked_sdiv_int)(100, 100)
283 │   %2  = (Base.getfield)(x, :i)::Int128
    │   %3  = invoke BigInt(%2::Int128)::BigInt
    │         (Base.slt_int)(1, 0)
    │         (Base.slt_int)(1, 0)
    │         (Base.ule_int)(0x00000000000000000000000000000001, 0x0000000000000000ffffffffffffffff)
    │   %7  = Base.GMP.MPZ.set_ui::typeof(Base.GMP.MPZ.set_ui)
    │   %8  = invoke %7(0x0000000000000001::UInt64)::BigInt
    │   %9  = Base.GMP.MPZ.mul::typeof(Base.GMP.MPZ.mul)
    │   %10 = invoke %9(%3::BigInt, %8::BigInt)::BigInt
    │   %11 = invoke $(Expr(:static_parameter, 1))(%10::BigInt)::Int128
    │   %12 = %new(FixedDecimal{Int128,2}, %11)::FixedDecimal{Int128,2}
    └──       return %12
    2 ─       $(Expr(:unreachable))
```

The problem is it's calling this `convert` here:

```julia
julia> @which Base.convert(FixedPointDecimals.FixedDecimal{Int128,2}, FixedPointDecimals.FixedDecimal{Int128,2}(3))
convert(::Type{FixedDecimal{T,f}}, x::FixedDecimal{U,g}) where {T, f, U, g} in FixedPointDecimals at /Users/nathan.daly/.julia/dev/FixedPointDecimals/src/FixedPointDecimals.jl:289
```

So I think the solution is to add an overload for the case where the types are the same. I'll try that and post back.
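A minimal sketch of what such an overload could look like (the `FD` alias is the package's internal name for `FixedDecimal`; the exact placement in the source is an assumption):

```julia
using FixedPointDecimals
const FD = FixedPointDecimals.FixedDecimal  # the package's internal alias

# No-op convert when source and target FixedDecimal types are identical,
# bypassing the general f ≥ g path and its widemul/BigInt detour.
Base.convert(::Type{FD{T, f}}, x::FD{T, f}) where {T, f} = x
```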

NHDaly (Member, Author) commented Nov 7, 2018

Ah, okay, this issue exactly describes the problem. We're missing a `convert(::Type{FD{T,f}}, x::FD{T,f})`:
JuliaLang/julia#17559

omus (Contributor) commented Nov 8, 2018

> And then, convert for Int128 is expensive, because it widens to a BigInt for some reason

The only `widen` call during construction is from `max_exp10`. If you're performing math with these types we do end up using `widemul`.

NHDaly (Member, Author) commented Nov 8, 2018

Ah, actually, no, I think it was coming from this `widemul`:

```julia
function convert(::Type{FD{T, f}}, x::FD{U, g}) where {T, f, U, g}
    if f ≥ g
        # Compute `10^(f - g)` without overflow
        powt = div(coefficient(FD{T, f}), coefficient(FD{U, g}))
        reinterpret(FD{T, f}, T(widemul(x.i, powt)))
```

Because `f == g` matches `f ≥ g`.

In this case, where the types are identical, no conversion was necessary. It may also be worth investigating whether we can make other simplifications when `f` and `g` are different but `T` and `U` are the same, though I think conversion is probably not that common an operation.
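For context on why that `widemul` allocates specifically for `Int128`: `widen(Int128)` is `BigInt` in Base Julia, so the widened multiply lands on a heap-allocated type. A quick REPL check (this is Base behavior, not anything in this package):

```julia
# widen steps each fixed-size integer up one size; Int128 has no larger
# fixed-size integer, so it widens to the heap-allocated BigInt.
@assert widen(Int32)  === Int64
@assert widen(Int64)  === Int128
@assert widen(Int128) === BigInt

# Hence widemul on Int128 operands produces (and allocates) a BigInt:
@assert widemul(Int128(3), Int128(100)) isa BigInt
```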

NHDaly (Member, Author) commented Dec 6, 2018

Hi again!

So we've been poking at this for the last couple weeks, and we've got some significant performance improvements! :)

First, though, we've improved the benchmarks we're using to judge these changes. I think we've managed to cut out any of the extra costs that were in the old benchmark (like iterating through an array), so that we can show just the cost of each operation. Here's the new file:
https://gist.github.com/NHDaly/4bccca3bc44485d1b352c8a5137654c0

And here are the results for master, currently at 1768c58, on my machine:

| Category | Type | `*` time (ns) | allocs | `/` time (ns) | allocs | `+` time (ns) | allocs | `div` time (ns) | allocs | `identity` time (ns) | allocs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int | Int32 | 0.35 | 0 | 3.78 | 0 | 0.35 | 0 | 0.35 | 0 | 0.35 | 0 |
|  | Int64 | 0.35 | 0 | 3.78 | 0 | 0.35 | 0 | 0.35 | 0 | 0.35 | 0 |
|  | Int128 | 0.42 | 0 | 12.92 | 0 | 0.35 | 0 | 0.35 | 0 | 0.35 | 0 |
| Float | Float32 | 0.35 | 0 | 0.35 | 0 | 0.35 | 0 | 2.68 | 0 | 0.35 | 0 |
|  | Float64 | 0.35 | 0 | 0.35 | 0 | 0.35 | 0 | 2.58 | 0 | 0.35 | 0 |
| FixedDecimal | FD{Int32,2} | 10.30 | 0 | 8.70 | 0 | 0.35 | 0 | 0.35 | 0 | 0.35 | 0 |
|  | FD{Int64,2} | 18.16 | 0 | 21.00 | 0 | 0.35 | 0 | 2.77 | 0 | 0.35 | 0 |
|  | FD{Int128,2} | 12037.26 | 639000000 | 13702.41 | 639000000 | 10925.18 | 615000000 | 14801.96 | 622000000 | 1080.33 | 463000000 |
| Big | BigFloat | 70.83 | 8000000 | 140.22 | 8000000 | 42.25 | 8000000 | 153.84 | 8000000 | 0.35 | 6000000 |
|  | BigInt | 202.63 | 11000000 | 487.38 | 18000000 | 182.90 | 11000000 | 200.24 | 10000000 | 0.35 | 8000000 |

With the following caveat:
* note: The times in each column are truncated to a minimum of 0.345ns, which is the estimated time for a single clock cycle on my machine.

What would you all think about merging some version of this benchmark file into this repo? It would be nice to have it running automatically and, ideally, detecting regressions between commits. With something like this, where performance is a feature, it's nice to be able to make informed statements about it!
I've just opened #42 to discuss it. :)
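As a rough illustration of the per-operation style of measurement, timings like these can be taken with BenchmarkTools (a sketch in that spirit, not the exact gist code):

```julia
using BenchmarkTools, FixedPointDecimals

const FD64 = FixedPointDecimals.FixedDecimal{Int64,2}
x, y = FD64(1.23), FD64(4.56)

# Interpolating with $ keeps the values from being treated as untyped
# globals, so @btime reports only the cost of the operation itself.
@btime $x * $y
@btime $x / $y
@btime $x + $y
```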
