Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent deadlock due to system call error #159

Merged
merged 1 commit into from
Apr 2, 2024

Conversation

leoarnold
Copy link
Contributor

The MQTT client runs in the main thread which consumes incoming MQTT messages from a Thread::Queue called @read_queue. This queue is fed by the @read_thread, a child thread which reads from a socket in an infinite loop.

We noticed that our application stopped processing incoming MQTT messages, but did not seem to exit or throw an exception either. Upon inspecting the logs, we saw that the @read_thread had crashed due to an unhandled Errno::ECONNRESET while reading from the socket. This had the consuming thread then sleep forever while waiting for new meassages on the @read_queue.

It turned out that the MQTT::Client#receive_packet method had appropriate error handling in place

def receive_packet
  # ...
rescue Exception
  # ...
end

but this did not rescue Errno::ECONNRESET even though Exception is at the top of the hierarchy of
Ruby's built-in exception classes.

The root cause was that the library also defines a class MQTT::Exception and in the context of #receive_packet the constant Exception refers only to MQTT::Exception (which does not cover Errno::ECONNRESET) where it actually should rescue any subclass of ::Exception.

The MQTT client runs in the main thread which consumes incoming
MQTT messages from a `Thread::Queue` called `@read_queue`.
This queue is fed by the `@read_thread`, a child thread which
reads from a socket in an infinite loop.

We noticed that our application stopped processing incoming
MQTT messages, but did not seem to exit or throw an exception either.
Upon inspecting the logs, we saw that the `@read_thread` had crashed
due to an unhandled `Errno::ECONNRESET` while reading from the socket.
This had the consuming thread then sleep forever while waiting for
new meassages on the `@read_queue`.

It turned out that the `MQTT::Client#receive_packet` method had
appropriate error handling in place

```ruby
def receive_packet
  # ...
rescue Exception
  # ...
end
```

but this did not rescue `Errno::ECONNRESET` even though `Exception`
is at the top of the hierarchy of
[Ruby's built-in exception classes][builtin-exceptions].

The root cause was that the library also defines a class `MQTT::Exception`
and in the context of `#receive_packet` the constant `Exception`
refers only to `MQTT::Exception` (which does not cover `Errno::ECONNRESET`)
where it actually should rescue any subclass of `::Exception`.

[builtin-exceptions]: https://docs.ruby-lang.org/en/3.2/Exception.html#class-Exception-label-Built-In+Exception+Classes
@leoarnold
Copy link
Contributor Author

@phlegx FYI

@blmundie
Copy link

I'm running into a threaded processing lockup that seems to be a similar issue

@bmorrall
Copy link

bmorrall commented Mar 5, 2024

Any follow up on getting this merged? This was a real problem for us a few months ago.

We found setting Thread.current.abort_on_exception = true would at least kill the listening process for us (and allow our system to reboot and rebuild the connection), but this fix is much cleaner and less janky.

@njh njh merged commit 0b967ff into njh:main Apr 2, 2024
@njh
Copy link
Owner

njh commented Apr 2, 2024

Thanks for this. The exception code has been wrong for a very long time and it is long overdue getting sorted out.

According to Semver, do you think this change is worthy of a Major, Minor or Patch increment? 🤔

@leoarnold
Copy link
Contributor Author

Hey @njh, good to see this progressing!

Major version zero (0.y.z) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable. https://semver.org/#spec-item-4

Since you're just merging fixes (I assume) I'd argue that is only a PATCH increment. If you're also taking #112, that one mandates a cautionary note in CHANGELOG.md.

@leoarnold leoarnold deleted the leoarnold/readthread-crash branch April 2, 2024 23:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants