Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UDPSocket#close hangs #368

Open
jscheid opened this issue Jan 10, 2025 · 6 comments
Open

UDPSocket#close hangs #368

jscheid opened this issue Jan 10, 2025 · 6 comments

Comments

@jscheid
Copy link

jscheid commented Jan 10, 2025

Hi, I'm trying to debug an issue where a call to statsd.timing (see https://github.com/reinh/statsd) sometimes hangs when invoked from within an Async task, with latest everything (see environment below).

I've tracked it down to the following Fiber backtrace:

/usr/local/bundle/gems/async-2.21.1/lib/async/scheduler.rb:157:in `transfer'
/usr/local/bundle/gems/async-2.21.1/lib/async/scheduler.rb:157:in `transfer'
/usr/local/bundle/gems/async-2.21.1/lib/async/scheduler.rb:245:in `kernel_sleep'
/usr/local/bundle/gems/statsd-ruby-1.5.0/lib/statsd.rb:453:in `close'
/usr/local/bundle/gems/statsd-ruby-1.5.0/lib/statsd.rb:453:in `block in connect'
(SNIP)
/usr/local/bundle/gems/statsd-ruby-1.5.0/lib/statsd.rb:451:in `connect'
/usr/local/bundle/gems/statsd-ruby-1.5.0/lib/statsd.rb:485:in `send_to_socket'
/usr/local/bundle/gems/statsd-ruby-1.5.0/lib/statsd.rb:502:in `send_stats'
/usr/local/bundle/gems/statsd-ruby-1.5.0/lib/statsd.rb:413:in `timing'

(This is the last line executed in "user" code.)

It just sits there indefinitely. I'm not sure that this happens every time that close is called in an Async block, but I can reproduce it reliably. Outside of Async it never hangs.

The UDPSocket connects to a port without listener. Looking at statsd source, the sequence of events should be something like:

# The following three lines are executed outside the Async block (possibly on a different thread)
socket = UDPSocket.new(Socket::AF_INET)
socket.connect('127.0.0.1', 45678) # any port that's not listening
socket.write('...') # first message goes OK (socket doesn't know yet nobody is listening)

# possibly more writes/reconnects outside Async block here (from multiple threads)

Sync do
  Async do
    socket.write('...') rescue nil # second message raises Errno::ECONNREFUSED
    # the error causes statsd to try and reconnect
    socket.flush
    socket.close # hangs here
  end
end

I've tried creating a test case based on the above but so far I can't reproduce it in isolation.

I'm a bit lost now - I might keep trying to reduce it to an isolated repro. In the meantime, do you have any ideas that could point me in the right direction?

My environment:
async (2.21.1)
io-event (1.7.5) (also tested with 1.7.4)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [aarch64-linux] (also tested with ruby 3.3.1) - via ruby:3.4.1-bullseye Docker image
Docker version 27.3.1, build ce12230
Linux df0dd8712c2d 6.12.5-orbstack-00287-gf8da5d508983 #19 SMP Tue Dec 17 08:07:20 UTC 2024 aarch64 GNU/Linux

@jscheid
Copy link
Author

jscheid commented Jan 10, 2025

Managed to repro 🎉

It seems to be caused by a combination of sync and async invocations - at least with 3.4.1.

https://github.com/jscheid/socketry-async-368

Please let me know what else I can do to help find a fix? I could go deep myself but I'm hoping that you, being way more familiar with all the internals, can pinpoint the problem much more quickly than me.

@ioquatix
Copy link
Member

I'll investigate.

cc @KJTsanaktsidis - do you have any thoughts?

@jscheid
Copy link
Author

jscheid commented Jan 13, 2025

The test case (https://github.com/jscheid/socketry-async-368) appears to fail regardless of selector implementation (tested with select, epoll, uring, kqueue).

@jscheid
Copy link
Author

jscheid commented Jan 13, 2025

When it hangs, I can observe the following behaviour in Scheduler#unblock after adding some debug output (writing straight to stderr for the debug output, bypassing Ruby IO): immediately before or after the line if selector = @selector an IOError is raised: stream closed in another thread.

With this my latest theory is: when there is a write in progress in one thread as the fd is closed in another thread, the close operation will wait for the write to complete. This mechanism, used to notify of write completion, is aware of fibers but appears to be prone to race conditions involving the special ruby_error_stream_closed exception.

@ioquatix
Copy link
Member

int rb_notify_fd_close(int fd, struct rb_io_close_wait_list *busy) {
...
                err = th->vm->special_exceptions[ruby_error_stream_closed];
                rb_threadptr_pending_interrupt_enque(th, err);
                rb_threadptr_interrupt(th);

This code looks problematic.

@ioquatix
Copy link
Member

Similar issue: ruby/net-http#197.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants