
KAFKA-18641: AsyncKafkaConsumer could lose records with auto offset commit #18737

Merged: 14 commits into apache:trunk on Feb 20, 2025

Conversation

frankvicky
Collaborator

@frankvicky frankvicky commented Jan 29, 2025

JIRA: KAFKA-18641
Please refer to jira ticket for further details.
The application thread advances positions, but SubscriptionState#allConsumed() is called by a background thread. In the current architecture, there is no way to synchronize the offsets between the two threads, which leads to inconsistency between the committed offsets and the actually consumed records.

This patch implements a waiting mechanism to ensure the background thread has generated commit requests before allowing the application thread to start a new poll cycle and fetch new messages.
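The handshake described above can be sketched as a simplified, self-contained simulation (hypothetical names such as `PollEvent` and `commitRequestsGenerated`; not the actual Kafka classes): the application thread enqueues a poll event and blocks, bounded by a timeout, until the background thread has generated the commit requests; only then does it proceed to fetch.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Simplified simulation of the handshake: the app thread submits a poll
// event and waits (bounded) until the background thread has generated
// the commit requests, before it is allowed to fetch new records.
public class PollHandshakeSketch {

    static final class PollEvent {
        final CompletableFuture<Void> commitRequestsGenerated = new CompletableFuture<>();
    }

    // Returns true once the background thread has signalled that commit
    // requests were generated, i.e. the app thread may start fetching.
    static boolean pollOnce(long timeoutMs) {
        BlockingQueue<PollEvent> eventQueue = new LinkedBlockingQueue<>();
        Thread background = new Thread(() -> {
            try {
                PollEvent event = eventQueue.take();
                // ... generate auto-commit requests from consumed offsets ...
                event.commitRequestsGenerated.complete(null);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        background.start();
        try {
            PollEvent event = new PollEvent();
            eventQueue.put(event);
            event.commitRequestsGenerated.get(timeoutMs, TimeUnit.MILLISECONDS);
            background.join();
            return true; // safe to fetch now
        } catch (Exception e) {
            return false; // timed out or interrupted
        }
    }

    public static void main(String[] args) {
        System.out.println("may fetch: " + pollOnce(30_000L));
    }
}
```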

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@frankvicky
Collaborator Author

Hi @lianetm

I am currently moving the invocation from the background thread to the application thread. This ensures there will be no gap between committed offsets and actually consumed records.

However, this change raises some considerations. If we want to ensure SubscriptionState#allConsumed() is invoked by the application thread, we need to rely on events or event processor helper methods to deliver the offsets.

One last consideration is regarding AsyncKafkaConsumer#commitSync(final Duration timeout). Currently, it always passes Optional#empty() as an argument, which causes the background thread to invoke SubscriptionState#allConsumed(). Since this patch prevents invoking SubscriptionState#allConsumed() from the background thread, I think we should update AsyncKafkaConsumer#commitSync(final Duration timeout) to pass SubscriptionState#allConsumed() as an argument instead. WDYT?

Map<TopicPartition, OffsetAndMetadata> commitOffsets = offsets.orElseGet(subscriptions::allConsumed);


SyncCommitEvent syncCommitEvent = new SyncCommitEvent(offsets, calculateDeadlineMs(time, timeout));

@lianetm
Member

lianetm commented Jan 29, 2025

Hey @frankvicky , thanks for tackling this promptly!

High-level comment: we've been intentionally keeping most of the subscription state actions in the background thread, to avoid race conditions that we had in the past (ex. app thread trying to access subscription state info for a partition that was no longer assigned because the background had removed it with a reconciliation). The trick is that the subscription state changes in the background (ex. when reconciliation completes and updates the assignment, or when events like seek or updateFetchPositions are processed and update positions).

So thinking about this case, what about the option of having the PollEvent processing (in the background thread) trigger all actions the CommitMgr has to at the beginning of each poll iteration:

  1. update auto-commit timer (already done)
  2. take a snapshot of the subscriptionState.allConsumed to be used on any commit of all consumed until the next poll (new, to fix the gap)

With that, all commits will continue to retrieve the allConsumed in the background as they do now, and the fix is more about "when" to retrieve them.

  • before this PR the allConsumed are retrieved freely when needed to be used for committing (and that's wrong because there could be a fetch half-way)
  • with this fix, the allConsumed to commit would only be retrieved at the beginning of the poll loop before fetching (but in the background, via a PollEvent event).

I'm thinking out loud here, could be missing stuff, let me know what do you think ;)
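For illustration, the snapshot idea above might look roughly like this (a minimal sketch with hypothetical names, not the real CommitRequestManager; `String` stands in for `TopicPartition` and `Long` for `OffsetAndMetadata`):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the snapshot proposal: the background thread captures
// the consumed offsets once per poll iteration (while processing the
// PollEvent), and every "commit all consumed" until the next poll reads
// from that snapshot instead of the live subscription state.
public class CommitSnapshotSketch {
    private Map<String, Long> snapshot = Collections.emptyMap();

    // Called while processing a PollEvent in the background thread.
    void onPollEvent(Map<String, Long> allConsumed) {
        // ... update the auto-commit timer (already done today) ...
        this.snapshot = Collections.unmodifiableMap(new HashMap<>(allConsumed));
    }

    // Any commit of "all consumed" uses the snapshot, not the live state.
    Map<String, Long> offsetsToCommit() {
        return snapshot;
    }

    public static void main(String[] args) {
        CommitSnapshotSketch mgr = new CommitSnapshotSketch();
        mgr.onPollEvent(Map.of("topic-0", 5L));
        // Positions may advance in the app thread afterwards; the commit
        // still uses the offsets captured at the start of the poll.
        System.out.println(mgr.offsetsToCommit());
    }
}
```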

@kirktrue kirktrue added KIP-848 The Next Generation of the Consumer Rebalance Protocol ctr Consumer Threading Refactor (KIP-848) labels Jan 29, 2025
Contributor

@kirktrue kirktrue left a comment


Thanks for the PR @frankvicky!

I left a few comments/questions, but I think this is headed in the right direction.

QQ: the offsets map isn't cleared after it's used. Should it be?

Thanks!

@@ -88,6 +87,7 @@ public class CommitRequestManager implements RequestManager, MemberStateListener
private final boolean throwOnFetchStableOffsetUnsupported;
final PendingRequests pendingRequests;
private boolean closing = false;
private Map<TopicPartition, OffsetAndMetadata> allConsumed = Map.of();
Contributor

I find the name a little confusing. (I know we get it from SubscriptionState.) What about latestPartitionOffsets or something? If nothing else, please add a comment to make it clear what this map contains.

This might be a stylistic choice, but what's the advantage of this over using Collections.emptyMap()?

Comment on lines 637 to 653
public Map<TopicPartition, OffsetAndMetadata> allConsumed() {
return allConsumed;
}

public void allConsumed(Map<TopicPartition, OffsetAndMetadata> allConsumed) {
this.allConsumed = allConsumed;
}

Contributor

We should probably stick to the convention of using set for setters. Also, it would be prudent to make the given map immutable so that when other code invokes the getter they're not able to modify it.

Suggested change
public Map<TopicPartition, OffsetAndMetadata> allConsumed() {
return allConsumed;
}
public void allConsumed(Map<TopicPartition, OffsetAndMetadata> allConsumed) {
this.allConsumed = allConsumed;
}
public Map<TopicPartition, OffsetAndMetadata> latestPartitionOffsets() {
return latestPartitionOffsets;
}
public void setLatestPartitionOffsets(Map<TopicPartition, OffsetAndMetadata> offsets) {
this.latestPartitionOffsets = Collections.unmodifiableMap(offsets);
}

Comment on lines 747 to 748
applicationEventHandler.add(new PollEvent(timer.currentTimeMs()));
applicationEventHandler.add(new PollEvent(timer.currentTimeMs(), subscriptions.allConsumed()));
Contributor

We've done a lot of work to ensure the background thread "owns" the current subscription state. This seems to go against those efforts.

Can we add a comment to this line that explains why we don't want the background thread to "own" the set of consumed partitions? This would be helpful since later on during the poll process we do implicitly let the background thread determine the partitions to fetch (rather than having the application thread pass in the partitions).

Comment on lines 207 to 212
commitRequestManager.updateAutoCommitTimer(event.pollTimeMs());
commitRequestManager.allConsumed(event.allConsumed());
Contributor

Can we consolidate the auto-commit timer and the offsets map in a single method call that's more clearly named? Are there other places that call updateAutoCommitTimer() other than this path?

Comment on lines 574 to 586
requestManagers.commitRequestManager.ifPresent(commitRequestManager ->
commitRequestManager.allConsumed(subscriptions.allConsumed()));
Contributor

Why is allConsumed() called here again, but this time with the data from SubscriptionState.allConsumed()? Please add some comments to explain. Thanks.

super(Type.POLL);
this.pollTimeMs = pollTimeMs;
this.allConsumed = allConsumed;
Contributor

Please wrap this in a call to Collections.unmodifiableMap().

@github-actions github-actions bot removed the triage PRs from the community label Jan 30, 2025
@frankvicky
Collaborator Author

frankvicky commented Jan 30, 2025

Hi @lianetm

take a snapshot of the subscriptionState.allConsumed to be used on any commit of all consumed until the next poll (new, to fix the gap)

I think poll is not the only thing that will affect the offsets. Following this logic, shouldn't this also apply to seek and assign?
Since operations other than poll can affect the offsets, if we go this way we should review those operations as well.

For example:
kafka.api.PlaintextConsumerCommitTest#testAutoCommitOnClose.
Currently, we rely on commitSync and commitAsync, which have invoked subscriptionState.allConsumed behind the scenes to get the offset.

@lianetm
Member

lianetm commented Jan 30, 2025

Hey @frankvicky , good point (but you mean seek and position, I guess?). Those are the ones, other than poll, that can update the positions. Agreed, we need to consider them too.

In the end it's truly only 2 events behind the 3 calls (SeekUnvalidatedEvent and CheckAndUpdatePositionsEvent), so one option would be just to make sure that when we process those events in the background, the commitMgr updates the positions snapshot to commit (after the event completes, to make sure it has the positions updated). With this, we wouldn't need to change anything in the PollEvent I expect. Thoughts?
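A rough sketch of that option (hypothetical, simplified names; the real events and managers differ): the background thread refreshes the offsets snapshot in the `whenComplete` callback of the event's internal future, i.e. only after the positions were updated, and then completes the event's own future.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: after the background thread finishes an event that can move
// positions (seek / check-and-update-positions), it refreshes the commit
// manager's snapshot before completing the event's future, so the
// snapshot always reflects the updated positions.
public class EventSnapshotRefreshSketch {

    static Map<String, Long> process(CompletableFuture<Boolean> internalResult,
                                     Map<String, Long> allConsumedAfterEvent) {
        AtomicReference<Map<String, Long>> snapshot = new AtomicReference<>(Map.of());
        CompletableFuture<Boolean> eventFuture = new CompletableFuture<>();
        internalResult.whenComplete((value, error) -> {
            if (error == null) {
                // Positions are updated now; refresh the snapshot to commit.
                snapshot.set(allConsumedAfterEvent);
                eventFuture.complete(value);
            } else {
                eventFuture.completeExceptionally(error);
            }
        });
        internalResult.complete(true); // simulate the event finishing
        return snapshot.get();
    }

    public static void main(String[] args) {
        Map<String, Long> snap = process(new CompletableFuture<>(), Map.of("t-0", 42L));
        System.out.println(snap);
    }
}
```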

@frankvicky frankvicky force-pushed the KAFKA-18641 branch 2 times, most recently from 1117b1f to 970d2a2 on January 31, 2025 06:19
@github-actions github-actions bot removed the small Small PRs label Jan 31, 2025
Comment on lines 254 to 257
CompletableFuture<Map<TopicPartition, OffsetAndMetadata>> future = manager.commitSync(
event.offsets().orElseGet(subscriptions::allConsumed),
event.deadlineMs()
);
Collaborator Author

Hi @lianetm
I believe commitSync also needs to get the latest consumed offsets, so I added a subscriptions::allConsumed invocation here.

Member

uhm not sure, relates to comment above

@frankvicky
Collaborator Author

Hi @kirktrue
Thanks for the review.
Since the current version is hugely different from the previous one, please take another look.

Member

@lianetm lianetm left a comment

Hey @frankvicky ! Thanks for the updates! Most important comment to discuss is #18737 (comment)

@@ -414,7 +418,16 @@ private void process(final ResetOffsetEvent event) {
*/
private void process(final CheckAndUpdatePositionsEvent event) {
CompletableFuture<Boolean> future = requestManagers.offsetsRequestManager.updateFetchPositions(event.deadlineMs());
future.whenComplete(complete(event.future()));
final CompletableFuture<Boolean> b = event.future();
future.whenComplete((BiConsumer<? super Boolean, ? super Throwable>) (value, exception) -> {
Member

do we really need to cast the (value, exception) here?

Collaborator Author

No, I should have cleaned up after using the inline-refactor function of the IDE. I will remove this unneeded cast in the next commit.

}

public void setLatestPartitionOffsets(Map<TopicPartition, OffsetAndMetadata> offsets) {
this.latestPartitionOffsets = Collections.unmodifiableMap(offsets);
Member

should we add a debug log here to know that we're updating the all consumed positions to be committed? (I expect it will be helpful to track the flow if needed)

Collaborator Author

Yes, it would be helpful

verify(metadata).updateLastSeenEpochIfNewer(tp, 1);
assertTrue(future.isDone());
Map<TopicPartition, OffsetAndMetadata> commitOffsets = assertDoesNotThrow(() -> future.get());
assertEquals(offsets, commitOffsets);
}

@Test
public void testCommitAsyncWithEmptyAllConsumedOffsets() {
public void testCommitAsyncWithEmptyLatestPartitionOffsetsOffsets() {
Member

I would say the test name still applies as it was (just that allConsumed is taken when we know it has been returned). The new one seems a bit confusing

@@ -414,7 +418,16 @@ private void process(final ResetOffsetEvent event) {
*/
private void process(final CheckAndUpdatePositionsEvent event) {
CompletableFuture<Boolean> future = requestManagers.offsetsRequestManager.updateFetchPositions(event.deadlineMs());
future.whenComplete(complete(event.future()));
final CompletableFuture<Boolean> b = event.future();
Member

is there a reason why we need this var? (vs using event.future directly to complete below)

Collaborator Author

related comment: #18737 (comment)

@@ -389,7 +388,7 @@ private void autoCommitSyncBeforeRevocationWithRetries(OffsetCommitRequestState
* future will be completed with a {@link RetriableCommitFailedException}.
*/
public CompletableFuture<Map<TopicPartition, OffsetAndMetadata>> commitAsync(final Optional<Map<TopicPartition, OffsetAndMetadata>> offsets) {
Map<TopicPartition, OffsetAndMetadata> commitOffsets = offsets.orElseGet(subscriptions::allConsumed);
Map<TopicPartition, OffsetAndMetadata> commitOffsets = offsets.orElseGet(this::latestPartitionOffsets);
Member

uhm I don't think we should change here, and it's actually dangerous I believe. This is my reasoning (please correct me at any point): we have 2 kinds of commit operations in this manager:

  1. commits triggered automatically in the background (commit before rebalance and auto-commit on the interval)
  2. commits triggered by API calls (commitSync and commitAsync, which are only triggered by a consumer.commitSync/Async call or consumer.close. Note that these could be for specific offsets, or for allConsumed)

My take is that with this PR we need to change only 1, which are the ones affected by the race condition with the fetch happening within a consumer poll iteration. Those commits that happen automatically cannot take the allConsumed from the subscription state because we could be in the middle of a consumer poll iteration in the app thread (with positions advanced but the records not returned yet). So agree with the changes to maybeAutoCommitSyncBeforeRevocation and maybeAutoCommitAsync to not use subscriptionState.allConsumed.

But the commits grouped in 2 (triggered by consumer API calls) can and should use the allConsumed from the subscriptionState, I expect, as they happen outside of the poll loop. So first, they don't land in the race we're targeting, and most importantly, we cannot even ensure that the commitMgr latestPartitionOffsets has the positions returned when they are called (this is the dangerous part).

Ex. single call to poll that returns 5 records + commitSync()/commitAsync()
If that commit takes the latestPartitionOffsets from the commitReqMgr, wouldn't that be 0? The latestPartitionOffsets is only updated on the next call to poll (if any), which makes sense, because that's the only time, when running a continuous poll, that we can safely assume that the records have been returned (on the previous iteration). Makes sense?
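The split described above could be sketched like this (a hypothetical, simplified manager, not the real CommitRequestManager; `String`/`Long` stand in for `TopicPartition`/`OffsetAndMetadata`): automatic background commits use the per-poll snapshot, while API-triggered commits without explicit offsets read the live subscription state, since they run outside the poll loop's race window.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

// Sketch of the two kinds of commits: background-triggered commits use
// the per-poll snapshot; consumer-API commits (commitSync/commitAsync)
// use explicit offsets if given, else the live allConsumed.
public class CommitSourceSketch {
    private final Supplier<Map<String, Long>> liveAllConsumed;
    private Map<String, Long> latestPartitionOffsets = Map.of();

    CommitSourceSketch(Supplier<Map<String, Long>> liveAllConsumed) {
        this.liveAllConsumed = liveAllConsumed;
    }

    void setLatestPartitionOffsets(Map<String, Long> offsets) {
        this.latestPartitionOffsets = offsets;
    }

    // Auto-commit on the interval / before revocation: snapshot only.
    Map<String, Long> offsetsForAutoCommit() {
        return latestPartitionOffsets;
    }

    // commitSync()/commitAsync(): explicit offsets, else live allConsumed.
    Map<String, Long> offsetsForApiCommit(Optional<Map<String, Long>> explicit) {
        return explicit.orElseGet(liveAllConsumed);
    }

    public static void main(String[] args) {
        CommitSourceSketch mgr = new CommitSourceSketch(() -> Map.of("t-0", 5L));
        System.out.println("auto-commit uses: " + mgr.offsetsForAutoCommit());
        System.out.println("api commit uses: " + mgr.offsetsForApiCommit(Optional.empty()));
    }
}
```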

Collaborator Author

Makes sense.
For commitSync()/commitAsync() calls without arguments, we need a way to get the consumed offsets.

To ensure we always have the most up-to-date offset information at the time of commit, we should directly access subscriptionState.allConsumed. This gives us the current state rather than a potentially stale snapshot.


Contributor

@junrao junrao left a comment

@frankvicky : Thanks for the PR. Left a comment.

@@ -324,7 +323,7 @@ public CompletableFuture<Void> maybeAutoCommitSyncBeforeRevocation(final long de

CompletableFuture<Void> result = new CompletableFuture<>();
OffsetCommitRequestState requestState =
createOffsetCommitRequest(subscriptions.allConsumed(), deadlineMs);
createOffsetCommitRequest(latestPartitionOffsets, deadlineMs);
Contributor

There is a subtle issue here. AbstractMembershipManager does the following steps when revoking a partition.

        // Mark partitions as pending revocation to stop fetching from the partitions (no new
        // fetches sent out, and no in-flight fetches responses processed).
        markPendingRevocationToPauseFetching(revokedPartitions);

        // Commit offsets if auto-commit enabled before reconciling a new assignment. Request will
        // be retried until it succeeds, fails with non-retriable error, or timer expires.
        CompletableFuture<Void> commitResult;

        commitResult = signalReconciliationStarted();

The first step marks the revoked partition as pendingRevocation, which prevents the partition's data from being returned in future consumer.poll() calls. However, when we get here, it's possible that a batch of records have just been returned to the application thread before the first step, but those records haven't been processed yet. So latestPartitionOffsets is not up to date yet. We need to wait for the next setLatestPartitionOffsets() call to happen. At that point, we know any record returned to the application will have been processed and no more records can be given to the application. So, it's safe to commit the offset at that point.

Collaborator Author

Hi @junrao
Thanks for the review.

Yes, this is indeed a problem. Since assignment reconciliation is triggered from a different path (ConsumerGroupHeartbeat) and not in the normal user app consume loop, I think we could update latestPartitionOffsets in ConsumerMembershipManager#signalReconciliationStarted().

In this way, we could get the following benefits:

  • Getting the latest latestPartitionOffsets after marking the revoked partition as pendingRevocation.
  • We could avoid createOffsetCommitRequest and autoCommitSyncBeforeRevocationWithRetries always invoking subscription#allConsumed, which will lead to the gap between the app thread and the background thread.

WDYT?

Member

if we get subscriptions.allConsumed() on signalReconciliationStarted, I'm afraid we could be retrieving positions that have been advanced in the app thread but not processed by the app yet? I believe this is what @junrao was referring to with:

We need to wait for the next setLatestPartitionOffsets() call to happen. At that point, we know any record returned to the application will have been processed and no more records can be given to the application. So, it's safe to commit the offset at that point.

Collaborator Author

Oops, it seems I misunderstood it 😓

Member

This is a tricky one, but I wonder if there is a simple fix at a higher level. At the moment we're triggering reconciliation freely in the background (when polling all managers, polling the membershipMgr is the one triggering it), and as I see it, that's probably what's conceptually wrong here? Should we consider triggering reconciliations only when processing a PollEvent?

With that this situation here disappears, because we would be generating the commit to revoke before any fetching happens (and even considering that the commit needs to be retried, at that point we know we had marked the partition as pending for revocation already, so no new fetches for it).

Seems conceptually right and simple, but I could be missing something. What do you think?

Collaborator Author

It's a solution worth trying since we could see that every poll is the point at which the latest offsets have just been committed.
However, one concern is that the reconciliation process could be delayed if the user application's per-loop time is unstable or slow. The original purpose of the reconciliation mechanism was to allow the background thread to process reconciliation immediately and effectively, so I'm uncertain if this trade-off would be worthwhile. 🤔

Member

@lianetm lianetm Feb 6, 2025

yes, I get your concern (and I'm still going over all this), but this is how I see it now:

  1. we would be only waiting to trigger the reconciliation, not to process it until the end (once triggered, it will carry on in the background as always, not blocking on anything new, just the commit request to complete and the callbacks, as always, we should keep this behaviour)
  2. we're just saying we will align the reconciliation triggering with the consumer poll (like the classic does btw) because we need to wait for stable positions to start reconciling a new assignment. So yes, there is a delay to start reconciling, but it's for correctness: we have to commit before a rebalance, but we cannot guarantee we can commit correctly the consumed positions if we don't have stable positions.

Looking at the poll loop from a high level, these are the 3 main blocks:

  1. app thread poll start (PollEvent)
  2. update fetch positions
  3. fetch

So we're saying we change to trigger a reconciliation only on 1, when we have stable positions, so we know the allConsumed to commit (and then rebalance). Of course that means that if we get a HB response with a new assignment right after the PollEvent (1), we would have to wait until the next PollEvent to start reconciling that assignment. But with the current version of triggering reconciliations freely in the background, that's exactly the root cause of the problem imo: we start a reconciliation when 2/3 are happening, and it's a mess because we cannot determine the allConsumed to commit, it's a moving target (until we know the records have been returned, and that's on the next PollEvent).

Thoughts?

Member

More brainstorming to improve this idea a bit: all the above matters only if auto-commit is enabled. Also, it's important to keep resolving the topic names for the assigned topic IDs asap (to request the metadata needed; that part doesn't need to wait for any in-flight fetch), so even though we need the above, we shouldn't completely throw away the attempt to reconcile on the membershipMgr.poll.

So one option that comes to mind is to keep calling maybeReconcile(canCommit=false) from the membershipMgr.poll, so we don't delay the resolution of the topics, and don't delay reconciliation when auto-commit is disabled (the new param would simply short-circuit right after resolving metadata, before markReconciliationInProgress, if autoCommitEnabled && !canCommit). Then we also need a call to maybeReconcile(canCommit=true) when a PollEvent is processed (this is what the above comment is about: basically the one to reconcile with auto-commit enabled, only possible when

we know any record returned to the application will have been processed and no more records can be given to the application. So, it's safe to commit the offset at that point.

Collaborator Author

Makes sense to me.
In this way, we have two scenarios:

autoCommitEnabled=true

ConsumerMembershipManager.poll will invoke maybeReconcile(canCommit=false), which will only resolve the topic names for the assigned topic IDs and return early before markReconciliationInProgress due to canCommit=false. The PollEvent will then invoke maybeReconcile(canCommit=true) and complete the entire reconciliation process.

autoCommitEnabled=false

This follows a similar pattern to autoCommitEnabled=true, with the key difference that we don't handle auto-commits here; instead, committing should be managed through onPartitionsRevoked (users should invoke commitSync in their listener implementation).

This approach allows us to ensure certain positions before reconciliation, with only a minimal delay as trade-off. Sounds good to me.
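A minimal sketch of the two-phase idea (`maybeReconcile` and the `canCommit` parameter come from the discussion above; the step names are invented for illustration): the background poll resolves topic names early but, with auto-commit enabled, defers the actual reconciliation until the PollEvent grants `canCommit=true`.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: membershipMgr.poll calls maybeReconcile(false) to resolve
// topic names asap; the PollEvent calls maybeReconcile(true) once
// positions are stable, the only point at which an auto-commit-enabled
// member may start reconciling (and committing).
public class ReconcileSketch {
    final boolean autoCommitEnabled;
    final List<String> steps = new ArrayList<>();

    ReconcileSketch(boolean autoCommitEnabled) {
        this.autoCommitEnabled = autoCommitEnabled;
    }

    void maybeReconcile(boolean canCommit) {
        steps.add("resolveTopicNames"); // always safe, do it asap
        if (autoCommitEnabled && !canCommit) {
            return; // wait for the next PollEvent to reconcile
        }
        steps.add("markReconciliationInProgress");
        steps.add("commitAndReconcile");
    }

    public static void main(String[] args) {
        ReconcileSketch mgr = new ReconcileSketch(true);
        mgr.maybeReconcile(false); // background poll: names only
        mgr.maybeReconcile(true);  // PollEvent: full reconciliation
        System.out.println(mgr.steps);
    }
}
```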

@github-actions github-actions bot added the small Small PRs label Feb 2, 2025
@frankvicky frankvicky force-pushed the KAFKA-18641 branch 5 times, most recently from bc643a9 to 7a229d9 on February 5, 2025 03:00
Contributor

@junrao junrao left a comment

@frankvicky : Thanks for the updated PR. One minor comment.

commitRequestManager.updateTimerAndMaybeCommit(event.pollTimeMs());
// all commit request generation points have been passed,
// so it's safe to notify the app thread that it can proceed and start fetching
event.future().complete(null);
Contributor

Technically, the event hasn't been fully processed yet. So, it's a bit weird to complete the event midway. Perhaps we could introduce a separate future like reconcileAndAutoCommit and wait for that.
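A tiny sketch of that suggestion (hypothetical names): the PollEvent carries a dedicated `reconcileAndAutoCommit` future that is completed midway through processing, while the event's own completion happens only at the end.

```java
import java.util.concurrent.CompletableFuture;

// Sketch: two futures on one event, so the app thread can wait on the
// intermediate step without the event being "completed" prematurely.
public class PollEventFuturesSketch {

    static final class PollEvent {
        // Completed as soon as reconciliation + auto-commit are triggered.
        final CompletableFuture<Void> reconcileAndAutoCommit = new CompletableFuture<>();
        // Completed only when the whole event has been processed.
        final CompletableFuture<Void> processed = new CompletableFuture<>();
    }

    // Background-thread processing: signal the intermediate step first,
    // finish the event afterwards.
    static void process(PollEvent event) {
        // ... trigger reconciliation and auto-commit request generation ...
        event.reconcileAndAutoCommit.complete(null);
        // ... remaining PollEvent processing ...
        event.processed.complete(null);
    }

    public static void main(String[] args) {
        PollEvent event = new PollEvent();
        process(event);
        System.out.println("both done: "
                + (event.reconcileAndAutoCommit.isDone() && event.processed.isDone()));
    }
}
```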

@frankvicky frankvicky force-pushed the KAFKA-18641 branch 2 times, most recently from dfd031e to 7468c53 on February 17, 2025 03:38
Member

@lianetm lianetm left a comment

Thanks @frankvicky ! Some minor comments aligning the docs with the changes

// This will trigger async auto-commits of consumed positions when hitting
// the interval time or on partition revocation
applicationEventHandler.add(event);
// Wait for reconciliation and auto-commit to complete to ensure all commit requests are processed
Member

Suggested change
// Wait for reconciliation and auto-commit to complete to ensure all commit requests are processed
// Wait for reconciliation and auto-commit to be triggered, to ensure all commit requests retrieve the positions to commit

// Make sure to let the background thread know that we are still polling.
applicationEventHandler.add(new PollEvent(timer.currentTimeMs()));
// This will trigger async auto-commits of consumed positions when hitting
// the interval time or on partition revocation
Member

Suggested change
// the interval time or on partition revocation
// the interval time or reconciling new assignments

@@ -186,7 +186,6 @@ public NetworkClientDelegate.PollResult poll(final long currentTimeMs) {
return drainPendingOffsetCommitRequests();
}

maybeAutoCommitAsync();
Member

The java doc stayed behind after this change (still reads The function will also try to autocommit the offsets, if feature is enabled.)

Comment on lines 31 to 32
* <li>auto-commit on revocation</li>
* <li>auto-commit</li>
Member

Suggested change
* <li>auto-commit on revocation</li>
* <li>auto-commit</li>
* <li>auto-commit on rebalance</li>
* <li>auto-commit on the interval</li>

@frankvicky
Collaborator Author

Hi @lianetm
I have just updated the patch based on the latest comments.
PTAL 🙇🏼

Member

@lianetm lianetm left a comment

Thanks! One more comment

applicationEventHandler.add(event);
// Wait for reconciliation and auto-commit to be triggered, to ensure all commit requests
// retrieve the positions to commit before proceeding with fetching new records
ConsumerUtils.getResult(event.reconcileAndAutoCommit());
Member

@lianetm lianetm Feb 17, 2025

should we pass the default api timeout here? Under normal execution, this will just complete right away (local actions in the background), but if the background thread is faulty (ie. died) and the event can't be processed, the consumer would hang here indefinitely (instead of timing out). Note that I suggest the default api timeout and not the timeout from param because we could have poll(ZERO), and that 0 shouldn't apply to the inter-thread communication, which is what we're doing here. Same for the blocking call we added for offsetsReady.

We do this same approach in other api calls btw, ex. seek. Makes sense?

Collaborator Author

Makes sense.
Previously, I thought it was crucial for the app thread to wait for reconciliation and auto-commit completion, but I didn't consider the possibility of a faulty background thread.
Given that, we also need to apply this timeout to offsetsReady, right?

ConsumerUtils.getResult(commitEvent.offsetsReady());
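A simplified version of such a bounded wait (not the real ConsumerUtils; the exception handling is a stand-in): inter-thread waits use the default API timeout rather than the poll timeout, so a poll(ZERO) still waits long enough for the background thread, while a dead background thread cannot hang the consumer forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: bound every inter-thread wait by the default API timeout, so
// a faulty background thread surfaces as a timeout instead of a hang.
public class BoundedWaitSketch {

    static <T> T getResult(CompletableFuture<T> future, long defaultApiTimeoutMs) {
        try {
            return future.get(defaultApiTimeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new RuntimeException("background thread did not respond in time", e);
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Normal case: the background completed the future, returns right away.
        Integer value = getResult(CompletableFuture.completedFuture(7), 1_000L);
        System.out.println("value: " + value);
    }
}
```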

Member

@lianetm lianetm left a comment

Thanks @frankvicky ! Looks good to me. Let's just wait to see if @junrao has any other comments.

@chia7712
Member

@frankvicky could you please fix the conflicts?

Contributor

@junrao junrao left a comment

@frankvicky : Thanks for the updated PR. LGTM. Just a minor comment.

@@ -123,8 +123,7 @@ public class ConsumerMembershipManager extends AbstractMembershipManager<Consume
private final Optional<String> serverAssignor;

/**
* Manager to perform commit requests needed before revoking partitions (if auto-commit is
* enabled)
* Manager to perform commit requests needed before rebalance (if auto-commit is enabled)
Contributor

This makes it sound like that commitRequestManager is only used for auto offset commit, but it's used for commit calls from the users too.

Collaborator Author

Hi @junrao
IMHO, this javadoc actually describes the purpose of commitRequestManager in this specific class, where it's only used for auto-commit before rebalance.
While you're correct that commitRequestManager handles user commit calls too, that functionality isn't used here.
Does it make sense?

Contributor

@frankvicky : Thanks for the explanation. This looks good to me then.

Contributor

@junrao junrao left a comment

@frankvicky : Thanks for the explanation. LGTM


@lianetm lianetm merged commit 709bfc5 into apache:trunk Feb 20, 2025
9 checks passed
lianetm pushed a commit that referenced this pull request Feb 20, 2025
Labels: Blocker (identified as solving a blocker for a release), ci-approved, clients, consumer, ctr (Consumer Threading Refactor, KIP-848), KIP-848 (The Next Generation of the Consumer Rebalance Protocol)
6 participants