Manual Offset Commit/Heartbeat Deadlock #2413

Open

mro-rhansen2 opened this issue Nov 14, 2023 · 1 comment

@mro-rhansen2

I believe that I have found the reason for the deadlock that has been alluded to in a few other issues on the board.

#2373
#2099
#2042
#1989

The offset commit appears to block by design, on the assumption that the operation will resume without issue once the underlying network problem has been resolved. The issue appears to be that the consumer does not hold an exclusive lock on the KafkaClient while it waits. This leads to a race condition between the main thread and the heartbeat thread, because the two threads do not maintain a consistent lock ordering.

The order of operations is as follows:

  1. Consumer handles a message.
  2. Network error occurs.
  3. Consumer tries to commit offset. Commit blocks in an infinite loop and releases the KafkaClient lock on each attempt:
    self.ensure_coordinator_ready()
  4. While trying to identify a new coordinator, BaseCoordinator takes the KafkaClient lock and then the internal coordinator lock:
    with self._client._lock, self._lock:
  5. The HeartbeatThread runs and takes only the coordinator lock:
    with self.coordinator._lock:
  6. The HeartbeatThread detects a consumer timeout and tries to shut down the coordinator, taking the client lock only after having already taken the coordinator lock in the previous step (the inverse of the order in which the main thread takes the locks during the blocking commit; see the sketch after this list):
    self.coordinator.maybe_leave_group()
    with self._client._lock, self._lock:
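
Below is a minimal, self-contained sketch (not kafka-python code) of the inverted lock ordering described in the steps above. Two plain threading.Lock objects stand in for KafkaClient._lock and BaseCoordinator._lock, and the sleeps only widen the race window so the deadlock reproduces reliably:

    import threading
    import time

    client_lock = threading.Lock()       # stands in for KafkaClient._lock
    coordinator_lock = threading.Lock()  # stands in for BaseCoordinator._lock

    def commit_path():
        # Main thread: client lock first, then coordinator lock,
        # as in ensure_coordinator_ready().
        with client_lock:
            time.sleep(0.1)              # widen the race window
            with coordinator_lock:
                pass

    def heartbeat_path():
        # Heartbeat thread: coordinator lock first, then client lock,
        # as in the HeartbeatThread calling maybe_leave_group().
        with coordinator_lock:
            time.sleep(0.1)              # widen the race window
            with client_lock:
                pass

    t1 = threading.Thread(target=commit_path, daemon=True)
    t2 = threading.Thread(target=heartbeat_path, daemon=True)
    t1.start(); t2.start()
    t1.join(timeout=2); t2.join(timeout=2)
    print("deadlocked:", t1.is_alive() and t2.is_alive())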

It is admittedly a very tight window for a race condition, but it does exist, based on my own experience as well as that of others in the community. The problem can be avoided either by giving the consumer exclusive access to the KafkaClient while it is trying to commit the offset, or by ensuring that the heartbeat thread holds the client lock for the duration of its checks.
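
Continuing the toy sketch above (only as an illustration of the suggested fix, not a patch against kafka-python), making both paths acquire the locks in the same order removes the deadlock:

    def heartbeat_path_fixed():
        # Same order as the commit path: client lock first, then the
        # coordinator lock, so the two threads can no longer deadlock.
        with client_lock:
            with coordinator_lock:
                pass                     # leave-group / heartbeat work here

Either thread may still block briefly waiting for the other, but neither can end up holding one lock while waiting forever on the second.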

It should also be noted that, while I have only spelled out the race condition as it exists between the commit and heartbeat operations, I wouldn't be surprised if the heartbeat was also interfering with other operations because of this issue.

@mro-rhansen2 (Author)

After further investigation, I found another avenue through which this same deadlock can occur, during consumer.poll(). If the timing is just right, the HeartbeatThread can deadlock with the invocation of ensure_coordinator_ready() that consumer.poll() makes immediately before trying to read data from the buffer.

Is there any particular reason for the fine-grained locking scheme within the HeartbeatThread after line 958? It feels more like code bumming than anything else. The HeartbeatThread isn't doing much, so it is probably fine to be safe and hold onto the client lock for the whole iteration rather than release it to other threads on the chance that the HeartbeatThread doesn't actually need it. Clients would be better off looking to asyncio if they needed the degree of concurrency that such an optimization seems to cater to.
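
As a rough illustration of that coarser scheme (again in terms of the toy locks above, not kafka-python's actual HeartbeatThread code), a heartbeat iteration would take both locks up front, in the same order as the commit/poll path, and hold them for the whole check:

    def heartbeat_iteration(consumer_timed_out: bool) -> None:
        # Coarse-grained: both locks up front, client lock first, so there is
        # never a moment where the coordinator lock is held while the client
        # lock still has to be acquired.
        with client_lock, coordinator_lock:
            if consumer_timed_out:
                pass  # leave-group / coordinator shutdown would run here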
