Reducing time to first model #448

nalimilan · 2021-09-12T20:29:26Z

I've tried reducing latency ("time to first model") by enabling precompilation. Following the strategy adopted for DataFrames, I used SnoopCompile to extract all methods that are compiled when running the test suite:

using SnoopCompileCore
inf_timing = @snoopi include("test/runtests.jl")
using SnoopCompile
pc = SnoopCompile.parcel(inf_timing)
SnoopCompile.write("precompile_tmp.jl", pc[:GLM], always=true)

Unfortunately, the result is disappointing even for examples that are in the tests:

julia> using DataFrames, GLM
julia> @time begin
           data = DataFrame(X=[1,2,3], Y=[2,4,7])
           ols = lm(@formula(Y ~ X), data)
           show(stdout, "text/plain", ols)
       end
# Before
10.970611 seconds (28.70 M allocations: 1.659 GiB, 5.06% gc time, 2.04% compilation time)
# After
7.954707 seconds (19.61 M allocations: 1.109 GiB, 4.68% gc time, 2.84% compilation time)

julia> using DataFrames, GLM

julia> @time begin
           data = DataFrame(X=[1,2,2], Y=[1,0,1])
           probit = glm(@formula(Y ~ X), data, Binomial(), ProbitLink())
           show(stdout, "text/plain", probit)
       end
# Before
11.608224 seconds (29.52 M allocations: 1.704 GiB, 5.59% gc time, 7.11% compilation time)
# After
9.601293 seconds (23.09 M allocations: 1.319 GiB, 5.27% gc time, 10.48% compilation time)

This is probably due to two reasons:

Precompilation doesn't save everything (not machine code currently). Indeed, if I run the precompile directives directly in the session, I get slightly better timings (about 5s for the first example).
Most of the methods to precompile are in other modules. Actually, only 86 methods are in GLM, versus 280 in Base and 221 in StatsModels (other modules are marginal). Maybe precompiling some common methods in StatsModels would help a lot here.

The text was updated successfully, but these errors were encountered:

kleinschmidt · 2021-09-13T12:24:01Z

I suspect that there's going to be some work needed to get StatsModels to a point where that's possible. I think there's a lot of unnecessary specialization due to, ahem overzealous use of type parameters and tuples to represent collections of terms, and coercing all tables to Tables.ColumnTable (e.g., named tuple of vectors). I think that judicious use of wrappers will help with both of those (e.g. Columns instead of ColumnTable, MultiTerm instead of NTuple{<:AbstractTerm}), or extensive use of @nospecialize, but my attempts at the latter didn't yield much benefits in some informal latency testing so not sure.

ViralBShah · 2022-04-29T02:31:56Z

Just tried it. Still 10 seconds.

rafaqz · 2022-07-01T18:30:28Z

I think this is largely a type stability problem. There is a lot of type instability both here and in StatsModels.jl. Objects like ModelFrame and Schema, Set{AbstractTerm}() have abstract fields that propagate instability everywhere.

This comment in StatsModels - statsmodels.jl speaks to at least part of the problem:

# ## TODO: consider doing this manually, without the ModelFrame/ModelMatrix

Also this is a better test, as DataFrames.jl using a Dict for keys complicates things and adds complications and compile time of it's own:

@time begin
    data = (X=[1.0,2.0,3.0], Y=[2.0,4.0,7.0])
    ols = lm(@formula(Y ~ X), data)
    show(stdout, "text/plain", ols)
end

This could clearly be type stable as everything is known at compile time. The instabilities are introduced later by ModelFrame and Schema having abstract fields amd AbstractTerm being used everywhere.

I don't think any of that code will precompile.

Additionally, Term stores the X and Y as a Symbol in a field, not a type parameter. But in many places they are coerced into NamedTuple keys. They could just stay types, or never be types, but mixing both will cause problems.

rafaqz · 2022-07-01T19:50:39Z

Just moving sym to a type parameter in Term{S} (and fixing some fields on ModelFrame that didn't do much on their own) cuts this timing from 7.7 to 4.4 seconds for me. This is even with termvars treating them as a Vector{Symbol} and splatting to NamedTuple arguments.

@kleinschmidt I guess I'm suggesting the package could do with more specialisation, rather than less.

nalimilan · 2022-07-01T19:53:58Z

Additionally, Term stores the X and Y as a Symbol in a field, not a type parameter. But in many places they are coerced into NamedTuple keys. They could just stay types, or never be types, but mixing both will cause problems.

I think this is the key problem, as @kleinschmidt noted. Type instability is probably not an issue as stability is only useful for large datasets, while the one in the OP is super small. Anyway the most costly core operations are in GLM and are type-stable.

This comment in StatsModels - statsmodels.jl speaks to at least part of the problem:

# ## TODO: consider doing this manually, without the ModelFrame/ModelMatrix

#339 will get rid of this. But I don't expect it make a big difference.

rafaqz · 2022-07-01T23:20:20Z

Im talking about type stability as a compilation cost, not a run-time problem. Everything is boxed and unstable here. This is slow to compile and won't be saved in precompilation. Its probably the main reason your precompile attempts dont work.

In my experience using NamedTuple is fine as long as the keys always come from types. Its totally possible to do that here.

nalimilan mentioned this issue Mar 28, 2023

I propose to consider adding precompilation to GLM.jl #513

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reducing time to first model #448

Reducing time to first model #448

nalimilan commented Sep 12, 2021

kleinschmidt commented Sep 13, 2021

ViralBShah commented Apr 29, 2022

rafaqz commented Jul 1, 2022 •

edited

Loading

rafaqz commented Jul 1, 2022 •

edited

Loading

nalimilan commented Jul 1, 2022

rafaqz commented Jul 1, 2022

Reducing time to first model #448

Reducing time to first model #448

Comments

nalimilan commented Sep 12, 2021

kleinschmidt commented Sep 13, 2021

ViralBShah commented Apr 29, 2022

rafaqz commented Jul 1, 2022 • edited Loading

rafaqz commented Jul 1, 2022 • edited Loading

nalimilan commented Jul 1, 2022

rafaqz commented Jul 1, 2022

rafaqz commented Jul 1, 2022 •

edited

Loading

rafaqz commented Jul 1, 2022 •

edited

Loading