Caching error on linux when removing cache files #64

Open
krynju opened this issue Jan 14, 2023 · 3 comments

@krynju
Contributor

krynju commented Jan 14, 2023

This error appears intermittently when the Julia process exits:

IOError: unlink("/home/krynju/.mempool/sess-utvz1V-1/h2x1LD/jl_N2bctMjqbi"): no such file or directory (ENOENT)
Stacktrace:
 [1] uv_error
   @ ./libuv.jl:97 [inlined]
 [2] unlink(p::String)
   @ Base.Filesystem ./file.jl:972
 [3] rm(path::String; force::Bool, recursive::Bool)
   @ Base.Filesystem ./file.jl:283
 [4] rm(path::String; force::Bool, recursive::Bool) (repeats 2 times)
   @ Base.Filesystem ./file.jl:294
 [5] (::MemPool.var"#203#206"{Int64})()
   @ MemPool ~/.julia/packages/MemPool/Ggdm4/src/MemPool.jl:163
 [6] _atexit()
   @ Base ./initdefs.jl:372
@jpsamaroo jpsamaroo added the bug label Jan 19, 2023
@jpsamaroo
Collaborator

This sounds like the rm(...; recursive=true) call in our atexit cleanup hook is racing with the eviction process. We can't guarantee that all files are cleaned up in time, so we could pass force=true to ignore these errors, although that makes me slightly uncomfortable for reasons I can't pin down. Thoughts?
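
To make the force=true option concrete, here is a minimal sketch of how Julia's rm behaves when a file has already disappeared. The helper names cleanup_cache_dir and strict_rm_fails and the file name jl_example are made up for illustration; this is not MemPool's actual exit hook, and it only simulates the effect of the race (a file missing at removal time) rather than reproducing the race itself.

```julia
# Hypothetical helper for illustration; this is not MemPool's actual exit hook.
function cleanup_cache_dir(dir::AbstractString)
    # force=true tells rm to ignore paths that no longer exist (ENOENT),
    # such as files a concurrent eviction task has already deleted.
    rm(dir; force=true, recursive=true)
    return !ispath(dir)
end

# Returns true if a plain rm (force=false) throws an IOError for `path`.
function strict_rm_fails(path::AbstractString)
    try
        rm(path)
        return false
    catch err
        return err isa Base.IOError
    end
end

dir = mktempdir()
file = joinpath(dir, "jl_example")   # made-up file name
write(file, "cached data")
rm(file)                             # simulate eviction deleting the file first

println(strict_rm_fails(file))       # strict rm on the already-deleted file throws
println(cleanup_cache_dir(dir))      # forced recursive removal still succeeds
println(cleanup_cache_dir(dir))      # and is safe to repeat on a missing directory
```

As documented, force=true only suppresses the error for a non-existing path, so other failures such as permission problems should still surface.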

@StevenWhitaker

StevenWhitaker commented Nov 13, 2023

FYI I have also sometimes seen this issue on WSL 2 Ubuntu when exiting Julia.

Also, it might actually be reproducible: I've gotten this error three times in a row using the MWE in JuliaParallel/DTables.jl#60 (comment) with enable_disk_caching!(50, 10^2 * 20) inserted after loading packages (I just realized my typo; I meant to write 2^10 * 20):

julia> include("mwe.jl")

julia> for i = 1:100 main() end
      From worker 2:    ┌ Info:
      From worker 2:length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:length(dt3) = 233930
ERROR: On worker 2:
AssertionError: Failed to migrate 183.839 MiB for ref 349
Stacktrace:
  [1] #105
    @ ~/.julia/packages/MemPool/l9nLj/src/storage.jl:887
  [2] with_lock
    @ ~/.julia/packages/MemPool/l9nLj/src/lock.jl:80
  [3] #sra_migrate!#103
    @ ~/.julia/packages/MemPool/l9nLj/src/storage.jl:849
  [4] sra_migrate!
    @ ~/.julia/packages/MemPool/l9nLj/src/storage.jl:826 [inlined]
  [5] write_to_device!
    @ ~/.julia/packages/MemPool/l9nLj/src/storage.jl:817
  [6] #poolset#160
    @ ~/.julia/packages/MemPool/l9nLj/src/datastore.jl:386
  [7] #tochunk#139
    @ ~/.julia/packages/Dagger/M13n0/src/chunks.jl:267
  [8] tochunk (repeats 2 times)
    @ ~/.julia/packages/Dagger/M13n0/src/chunks.jl:259 [inlined]
  [9] #DTable#1
    @ ~/.julia/packages/DTables/BjdY2/src/table/dtable.jl:38
 [10] DTable
    @ ~/.julia/packages/DTables/BjdY2/src/table/dtable.jl:28
 [11] #create_dt_from_cols#9
    @ ~/tmp/mwe.jl:76
 [12] create_dt_from_cols
    @ ~/tmp/mwe.jl:68 [inlined]
 [13] update_value_col!
    @ ~/tmp/mwe.jl:88
 [14] query
    @ ~/tmp/mwe.jl:27
 [15] #invokelatest#2
    @ ./essentials.jl:819 [inlined]
 [16] invokelatest
    @ ./essentials.jl:816
 [17] #110
    @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:285
 [18] run_work_thunk
    @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:70
 [19] macro expansion
    @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:285 [inlined]
 [20] #109
    @ ./task.jl:514
Stacktrace:
 [1] remotecall_fetch(::Function, ::Distributed.Worker; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:465
 [2] remotecall_fetch(::Function, ::Distributed.Worker)
   @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:454
 [3] #remotecall_fetch#162
   @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
 [4] remotecall_fetch
   @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
 [5] main
   @ ~/tmp/mwe.jl:19 [inlined]
 [6] top-level scope
   @ ./REPL[2]:1

julia> # Exit Julia
┌ Warning: Worker 3 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:529
┌ Warning: Worker 5 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:529
┌ Warning: Worker 4 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:529
      From worker 2:    IOError: unlink("/home/steven/.mempool/sess-Qsvl77-2/RHtbsR/jl_JWnIX2z29e"): no such file or directory (ENOENT)
      From worker 2:    Stacktrace:
      From worker 2:      [1]┌ Error: Fatal error on process 2
      From worker 2:    │   exception =
      From worker 2:    │    attempt to send to unknown socket
      From worker 2:    │    Stacktrace:
      From worker 2:    │     [1] error(s::String)
      From worker 2:    │       @ Base ./error.jl:35
      From worker 2:    │     [2] send_msg_unknown(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
      From worker 2:    │       @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/messages.jl:99
      From worker 2:    │     [3] send_msg_now(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
      From worker 2:    │       @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/messages.jl:115
      From worker 2:    │     [4] deliver_result(sock::Sockets.TCPSocket, msg::Symbol, oid::Distributed.RRID, value::Nothing)
      From worker 2:    │       @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:102
      From worker 2:    │     [5] macro expansion
      From worker 2:    │       @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:302 [inlined]
      From worker 2:    │     [6] (::Distributed.var"#113#115"{Distributed.CallWaitMsg, Distributed.MsgHeader, Sockets.TCPSocket})()
      From worker 2:    │       @ Distributed ./task.jl:514
      From worker 2:    └ @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:106
      From worker 2:     uv_error
      From worker 2:        @ ./libuv.jl:100 [inlined]
      From worker 2:      [2] unlink(p::String)
      From worker 2:        @ Base.Filesystem ./file.jl:972
      From worker 2:      [3] rm(path::String; force::Bool, recursive::Bool)
      From worker 2:        @ Base.Filesystem ./file.jl:283
      From worker 2:      [4] rm(path::String; force::Bool, recursive::Bool) (repeats 2 times)
      From worker 2:        @ Base.Filesystem ./file.jl:294
      From worker 2:      [5] rm
      From worker 2:        @ ./file.jl:273 [inlined]
      From worker 2:      [6] exit_hook()
      From worker 2:        @ MemPool ~/.julia/packages/MemPool/l9nLj/src/MemPool.jl:152
      From worker 2:      [7] _atexit(exitcode::Int32)
      From worker 2:        @ Base ./initdefs.jl:416
      From worker 2:      [8] exit
      From worker 2:        @ ./initdefs.jl:28 [inlined]
      From worker 2:      [9] exit()
      From worker 2:        @ Base ./initdefs.jl:29
      From worker 2:     [10] #invokelatest#2
      From worker 2:        @ ./essentials.jl:819 [inlined]
      From worker 2:     [11] invokelatest(::Any)
      From worker 2:        @ Base ./essentials.jl:816
      From worker 2:     [12] (::Distributed.var"#118#120"{Distributed.RemoteDoMsg})()
      From worker 2:        @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:308
      From worker 2:     [13] run_work_thunk(thunk::Distributed.var"#118#120"{Distributed.RemoteDoMsg}, print_error::Bool)
      From worker 2:        @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:70
      From worker 2:     [14] (::Distributed.var"#117#119"{Distributed.RemoteDoMsg})()
      From worker 2:        @ Distributed ./task.jl:514
┌ Warning: Worker 2 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:529

EDIT: I corrected my typo. Now I don't get the AssertionError, but I still get the IOError when exiting Julia.
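
For reference, a quick check of how far apart the mistyped and intended values are. The variable names are made up for illustration, and the thread does not state what unit the second argument of enable_disk_caching! uses, so this only compares the raw numbers.

```julia
# Values taken from the comment above; names are hypothetical.
typo_limit = 10^2 * 20       # what was originally passed
intended_limit = 2^10 * 20   # what was meant

println(typo_limit)
println(intended_limit)
println(intended_limit / typo_limit)  # how many times larger the intended limit is
```

The mistyped limit is roughly a tenth of the intended one, which fits with the AssertionError disappearing once the typo was corrected.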

@jpsamaroo
Collaborator

The IOError is generally harmless; the file will be removed one way or the other (if it isn't, let me know!). The AssertionError should be mostly "fixed" on master, but we might need to be a bit more eager about freeing data to stay within the size bounds we've set.
