KAFKA-16792: Enable consumer unit tests that fail to fetch offsets only for new consumer with poll(0) #16982

FrankYang0529 · 2024-08-23T10:05:50Z

Fix the following tests for the CONSUMER group protocol:

testCurrentLag
testFetchStableOffsetThrowInPoll
testListOffsetShouldUpdateSubscriptions

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

lianetm · 2024-08-23T11:53:07Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java

@@ -2471,6 +2477,8 @@ public void testCurrentLag(GroupProtocol groupProtocol) {

        // poll once to update with the current metadata
        consumer.poll(Duration.ofMillis(0));
+        TestUtils.waitForCondition(() -> client.requests().stream().anyMatch(request -> request.requestBuilder().apiKey().equals(ApiKeys.FIND_COORDINATOR)),


Interesting, it makes sense how you keep a single poll, and we just have to wait a bit for the background thread to actually generate the request. What about encapsulating this bit to make it clearer everywhere it's used, to end up with something like:
TestUtils.waitForCondition(() -> requestGenerated(ApiKeys.FIND_COORDINATOR))

Updated it. Thanks for the suggestion.
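For readers following along, the encapsulation discussed above can be sketched outside of Kafka as follows. This is a minimal illustration only: `ApiKey`, `REQUESTS`, and this `waitForCondition` are hypothetical stand-ins for `ApiKeys`, `MockClient#requests()`, and `TestUtils.waitForCondition`, not the code that landed in the PR.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BooleanSupplier;

final class RequestGeneratedSketch {
    // Hypothetical stand-in for Kafka's ApiKeys enum.
    enum ApiKey { FIND_COORDINATOR, LIST_OFFSETS, FETCH }

    // Hypothetical stand-in for the request list a mock client exposes.
    // Thread-safe because a background thread appends to it.
    static final List<ApiKey> REQUESTS = new CopyOnWriteArrayList<>();

    // The small predicate that keeps the call sites readable.
    static boolean requestGenerated(ApiKey apiKey) {
        return REQUESTS.contains(apiKey);
    }

    // Simplified polling wait (assumption: the real test utility behaves similarly).
    static void waitForCondition(BooleanSupplier condition, long maxWaitMs, String failureMessage)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                throw new AssertionError(failureMessage);
            }
            Thread.sleep(10);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulates a background thread that generates the request a little later.
        Thread background = new Thread(() -> {
            try {
                Thread.sleep(50);
                REQUESTS.add(ApiKey.FIND_COORDINATOR);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        background.start();

        waitForCondition(() -> requestGenerated(ApiKey.FIND_COORDINATOR), 5_000,
                "No FIND_COORDINATOR request was generated");
        background.join();
        System.out.println("FIND_COORDINATOR observed: " + requestGenerated(ApiKey.FIND_COORDINATOR));
    }
}
```

The point of the helper is that the test states what it is waiting for (a request of a given type) rather than how the request list is scanned.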

TaiJuWu

LGTM

frankvicky

LGTM

lianetm · 2024-08-26T20:42:39Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java

        client.respond(FindCoordinatorResponse.prepareResponse(Errors.NONE, groupId, metadata.fetch().nodes().get(0)));

        consumer.seek(tp0, 50L);
+
+        if (groupProtocol == GroupProtocol.CONSUMER) {


I see how you make the test pass, and that is fine and works, but I believe we could greatly simplify this test if we get the FindCoordinator, OffsetFetch and Fetch requests out of the picture, since they don't seem relevant to this test.

This test is all about partition offsets (end offsets stored in the leader), compared to a position set manually with seek, so not related at all with committed offsets or affected by fetch.

So, this is the simplification I'm thinking about:

we create the consumer without groupId -> this will ensure we don't send any OffsetFetch request. We don't really need them given that we're only playing with the partition offsets

we pause the partition right after assign -> this ensures that we don't issue fetch requests. We don't really need them for this test, and having them makes the test different for both consumers

With those 2 small changes, I would expect we can keep the same test for both consumers, without any specifics for the new consumer, and it would still be true to its purpose, testing what it has always tested. What do you think?

lianetm · 2024-08-26T21:05:30Z

Hey @FrankYang0529 , I took another look, left a comment for consideration. Thanks!

lianetm · 2024-08-27T16:01:07Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java

+            } catch (UnsupportedVersionException e) {
+                return true;
+            }
+        }, "Failed to fetch stable offset");


I think this msg is not quite right. If this condition fails, it means that the "consumer failed to throw UnsupportedVersionException on poll" (we actually expect that it fails to fetch the offsets)
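To make the point about the failure message concrete, here is a minimal sketch of the "retry poll until it throws" pattern. It assumes a hypothetical `FlakyPoller` standing in for the consumer and uses `UnsupportedOperationException` in place of Kafka's `UnsupportedVersionException`, so that it runs without any Kafka dependency.

```java
final class ExpectedExceptionSketch {
    // Hypothetical stand-in for a consumer whose poll only surfaces the error
    // once the background work has progressed far enough.
    static final class FlakyPoller {
        private final int throwOnCall;
        private int calls;

        FlakyPoller(int throwOnCall) {
            this.throwOnCall = throwOnCall;
        }

        void poll() {
            calls++;
            if (calls >= throwOnCall) {
                // Stand-in for UnsupportedVersionException.
                throw new UnsupportedOperationException("stable offset fetch not supported");
            }
        }
    }

    // Returns the attempt on which poll threw. If it never throws, the failure
    // message describes what was expected: that the exception is thrown.
    static int pollUntilThrows(FlakyPoller poller, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                poller.poll();
            } catch (UnsupportedOperationException expected) {
                return attempt;
            }
        }
        throw new AssertionError("Consumer failed to throw the expected exception on poll");
    }

    public static void main(String[] args) {
        System.out.println("Threw on attempt " + pollUntilThrows(new FlakyPoller(3), 10));
    }
}
```

Because the condition succeeds when the exception is thrown, the message attached to the wait should describe the missing exception, not a failed offset fetch.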

lianetm · 2024-08-27T17:09:38Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java


        // no error for no end offset (so unknown lag)
        assertEquals(OptionalLong.empty(), consumer.currentLag(tp0));

        // poll once again, which should return the list-offset response
        // and hence next call would return correct lag result
-        client.respond(listOffsetsResponse(singletonMap(tp0, 90L)));
+        Optional<ClientRequest> listOffsetRequest = client.requests().stream().filter(request -> request.requestBuilder().apiKey().equals(ApiKeys.LIST_OFFSETS)).findFirst();


this line seems tricky to read and we need it twice (for now). Would it be clearer to have a helper along the lines of findRequest(client, ApiKeys.LIST_OFFSETS)?

lianetm · 2024-08-27T17:15:01Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java

        final FetchInfo fetchInfo = new FetchInfo(1L, 99L, 50L, 5);
-        client.respond(fetchResponse(singletonMap(tp0, fetchInfo)));
+        assertTrue(fetchRequest.isPresent());


we probably don't need this assertion given that we already checked that the request was generated on the waitForCondition on ln 2510?

Hi @lianetm, thanks for the review and suggestion. I addressed the other comments. For this one, I would like to keep it. If we remove it, there may be a warning like 'Optional.get()' without 'isPresent()' check.

lianetm · 2024-08-27T17:16:36Z

Hey @FrankYang0529 , some other minor comments left, almost there. Thanks!

lianetm

Thanks for the updates @FrankYang0529 ! Couple of minor comments left.

lianetm · 2024-08-28T19:22:47Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java

-    @EnumSource(value = GroupProtocol.class, names = "CLASSIC")
-    public void testListOffsetShouldUpdateSubscriptions(GroupProtocol groupProtocol) {
+    @EnumSource(GroupProtocol.class)
+    public void testListOffsetShouldUpdateSubscriptions(GroupProtocol groupProtocol) throws InterruptedException {


I guess we don't need this throws InterruptedException here anymore?

Thanks for the reminder. Removed it.

lianetm · 2024-08-28T19:26:34Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java

+    }
+
+    private boolean requestGenerated(MockClient client, ApiKeys apiKey) {
+        return client.requests().stream().anyMatch(request -> request.requestBuilder().apiKey().equals(apiKey));


could we maybe reuse and simplify here to return findRequest(client, apiKey).isPresent()?

Sorry, I applied @kirktrue's suggestion, so we can't use findRequest(client, apiKey).isPresent() here.

kirktrue

Thanks for the updates, @FrankYang0529!

Just a couple of questions and a minor readability change request.

Thanks!

kirktrue · 2024-08-28T23:52:29Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java

+    private Optional<ClientRequest> findRequest(MockClient client, ApiKeys apiKey) {
+        return client.requests().stream().filter(request -> request.requestBuilder().apiKey().equals(apiKey)).findFirst();
+    }


It appears that the calls to this method quickly fail if there wasn't a match for the request type. How about something like this:

Suggested change

-    private Optional<ClientRequest> findRequest(MockClient client, ApiKeys apiKey) {
-        return client.requests().stream().filter(request -> request.requestBuilder().apiKey().equals(apiKey)).findFirst();
-    }
+    private ClientRequest findRequest(MockClient client, ApiKeys apiKey) {
+        Optional<ClientRequest> request = client.requests().stream().filter(request -> request.requestBuilder().apiKey().equals(apiKey)).findFirst();
+        assertTrue(request.isPresent(), "No " + apiKey + " request was submitted to the client... or whatever");
+        return request.get();
+    }
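A self-contained sketch of the asserting variant suggested above, for illustration only. `ApiKey` and `Request` are hypothetical stand-ins for `ApiKeys` and `ClientRequest`. Note that the lambda parameter is named `candidate`: in Java a lambda parameter cannot reuse the name of a local variable that is in scope, so the `request -> ...` lambda in the suggestion would need a different name (or the `orElseThrow` form below) to compile.

```java
import java.util.Arrays;
import java.util.List;

final class FindRequestSketch {
    // Hypothetical stand-in for Kafka's ApiKeys enum.
    enum ApiKey { FIND_COORDINATOR, LIST_OFFSETS, FETCH }

    // Hypothetical stand-in for ClientRequest.
    static final class Request {
        final ApiKey apiKey;
        final int correlationId;

        Request(ApiKey apiKey, int correlationId) {
            this.apiKey = apiKey;
            this.correlationId = correlationId;
        }
    }

    // Returns the first request of the given type, failing fast with a
    // descriptive message when no such request was recorded.
    static Request findRequest(List<Request> requests, ApiKey apiKey) {
        return requests.stream()
                .filter(candidate -> candidate.apiKey == apiKey)
                .findFirst()
                .orElseThrow(() -> new AssertionError(
                        "No " + apiKey + " request was submitted to the client"));
    }

    public static void main(String[] args) {
        List<Request> requests = Arrays.asList(
                new Request(ApiKey.FIND_COORDINATOR, 1),
                new Request(ApiKey.LIST_OFFSETS, 2));

        Request found = findRequest(requests, ApiKey.LIST_OFFSETS);
        System.out.println("Found " + found.apiKey + " with correlation id " + found.correlationId);

        try {
            findRequest(requests, ApiKey.FETCH);
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Returning the request directly removes the `Optional.get()` without `isPresent()` pattern from every call site.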

kirktrue · 2024-08-28T23:55:30Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java

-
-        // poll once to update with the current metadata
-        consumer.poll(Duration.ofMillis(0));
-        client.respond(FindCoordinatorResponse.prepareResponse(Errors.NONE, groupId, metadata.fetch().nodes().get(0)));
-


Out of curiosity, why do we no longer want to ensure a find coordinator request/response occurred?

This follows @lianetm's suggestion in #16982 (comment). In testListOffsetShouldUpdateSubscriptions, we want to check the endOffsets function, which doesn't need the group coordinator.

lianetm · 2024-08-30T17:03:04Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java

+    @EnumSource(GroupProtocol.class)
+    public void testFetchStableOffsetThrowInPoll(GroupProtocol groupProtocol) throws InterruptedException {
+        setupThrowableConsumer(groupProtocol);
+        TestUtils.waitForCondition(() -> {


this change still has me thinking. This test is about a single call to poll(ZERO), that is expected to throw an exception, but interesting fact is that the exception is generated when building the request here (it does not require the actual send or response). So I wonder if the async consumer should somehow ensure that when poll returns (even with low time), it has allowed for at least one run of the background thread runOnce?

Hi @lianetm, ApplicationEventProcessor#process(PollEvent) only resets the timer for some request managers. Even if we use ApplicationEventHandler#addAndGet, we still can't guarantee that runOnce is executed. We may need ConsumerNetworkThread to send a BackgroundEvent to make sure runOnce has happened. However, this approach may make the process more complex. If we want to do this, we can probably create another Jira to handle it. WDYT?

Agree that it's better to discuss separately. We could definitely consider aligning the behaviour of the 2 consumers a bit more regarding poll with low timeouts and the guarantees about requests sent, but there are trade-offs to consider, as @chia7712 pointed out in his comment on the Jira (totally agree on the trade-offs).

TaiJuWu · 2024-08-31T17:08:21Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java


+        client.prepareResponse(listOffsetsResponse(singletonMap(tp0, 90L)));


Could client.prepareResponse be moved before newConsumer?

@FrankYang0529 could you address this? I guess the suggestion comes from the recent fix to some tests that were flaky for a similar situation (#17056)

Hi @lianetm and @TaiJuWu, thanks for the suggestion, but I think we can keep this line just before consumer.endOffsets.

In AsyncKafkaConsumer#endOffsets, it sends ListOffsetsEvent and ApplicationEventProcessor#process(ListOffsetsEvent) calls OffsetsRequestManager#fetchOffsets.

In OffsetsRequestManager#fetchOffsets, it builds the request and AsyncKafkaConsumer#endOffsets uses ApplicationEventHandler#addAndGet to wait for the result, so I think it's safe to put it just before consumer.endOffsets. WDYT? Thank you.

Thanks for your explanations.

Makes sense, but I still have a concern with this (I wonder if it would make this test flaky). Let's say that by the moment we prepare this response, another request has already been generated. Wouldn't that mean the endOffsets request never finds its response? (Not sure if I'm missing details of how prepareResponse works.)

for the record, I was concerned here mostly about fetch requests (since we have a partition assigned and not paused), but just noticed we're not calling consumer.poll (so we shouldn't expect FETCH requests to be generated)

Still, worth mentioning that without this in-flight fix #17035 for Fetch, we could wrongly generate fetch requests even without polling. So enabling this test like this would be flaky for the new consumer without that fix I guess.
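The ordering concern raised in this thread can be illustrated with a toy model. This is a deliberately simplified sketch, not MockClient's actual implementation: it assumes prepared responses are handed out in FIFO order to whichever request is sent next, and `ToyClient` and `ApiKey` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Queue;

final class PreparedResponseSketch {
    // Hypothetical stand-in for Kafka's ApiKeys enum.
    enum ApiKey { LIST_OFFSETS, FETCH }

    // Toy client: a prepared response is consumed by the next request sent,
    // regardless of that request's type (assumption made for this sketch).
    static final class ToyClient {
        private final Queue<String> prepared = new ArrayDeque<>();

        void prepareResponse(String response) {
            prepared.add(response);
        }

        // Returns the response paired with this request, or null if none is queued.
        String send(ApiKey apiKey) {
            return prepared.poll();
        }
    }

    public static void main(String[] args) {
        ToyClient client = new ToyClient();
        client.prepareResponse("listOffsets(tp0 -> 90)");

        // An unrelated request generated first takes the prepared response...
        String forFetch = client.send(ApiKey.FETCH);
        // ...so the request the test actually cares about finds nothing queued.
        String forListOffsets = client.send(ApiKey.LIST_OFFSETS);

        System.out.println("FETCH received: " + forFetch);
        System.out.println("LIST_OFFSETS received: " + forListOffsets);
    }
}
```

Under this model, preparing the response is only safe when the test can be sure that no other request is sent between `prepareResponse` and the call that should consume it, which is why the absence of poll (and of stray FETCH requests) matters here.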

lianetm · 2024-09-10T10:43:40Z

Hey @FrankYang0529 , there are some related build failures on testCurrentLag, could you please take a look? Also please rebase to get the latest changes from trunk (we got a lot of unrelated failures too). Thanks!
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-16982/7/tests

FrankYang0529 · 2024-09-10T11:54:50Z

Hi @lianetm, thanks for the reminder. Rebased it to see latest result.

lianetm · 2024-09-10T18:49:56Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java

@@ -2479,23 +2487,29 @@ public void testCurrentLag(GroupProtocol groupProtocol) {
        consumer.seek(tp0, 50L);
        consumer.poll(Duration.ofMillis(0));
        // requests: list-offset, fetch
-        assertEquals(2, client.inFlightRequestCount());


similar to the reasoning that led to this fix, there is a check above for:

assertEquals(0, client.inFlightRequestCount())

I would say that may be flaky for the new consumer, right?

We cannot ensure that no requests will be generated just because the last api we called shouldn't generate one. Given that we called poll before (and there is a background thread running) I expect there could be requests generated for fetching and fetching offsets maybe? The point of that assert is to ensure there was no ListOffsets request generated, so I would say that we should look for that exact request type, and assert there is none. Makes sense?

Hi @lianetm, thanks for catching this. I tried adding Thread.sleep(1000) before assertEquals(0, client.inFlightRequestCount()). It looks like both ListOffsetsRequest and OffsetFetchRequest are sent after getting the FindCoordinatorResponse. I have removed the assertion for now. Do you think we should keep this assertion for the CLASSIC group protocol only? Thank you.

Well, this is definitely a place where the internals of the 2 consumers differ, even though I see them as conceptually aligned:

what's the same in both consumers: a call to consumer lag will "signal" that it needs to retrieve the log endOffsets

what's done differently internally in each consumer: classic will only generate the request on the next poll (on the single thread it has, where it doesn't want to block waiting for the offsets) vs. the async consumer, where the background thread poll picks up the intention already expressed in the app thread and generates the request to get the end offsets.

So I would say we keep the assertion (for the classic consumer, as you suggested), and it will actually be helpful to show this difference in the test. I would add an explanation for it too: the classic consumer does not send the LIST_OFFSETS request right away (it requires an explicit poll), unlike the new async consumer, which sends the LIST_OFFSETS request on the next background thread poll. Makes sense?

With all these tests we're enabling, worth running them repeatedly locally to try to spot any other flakiness similar to the ones we've been catching.

Hi @lianetm, I added the assertion back, along with comments. I ran 50 rounds on my laptop with no errors. Let's check the latest CI result. Thank you.

Hey @FrankYang0529, this testCurrentLag still seems flaky even after the latest changes, I filed https://issues.apache.org/jira/browse/KAFKA-17560 with what I see and where I imagine the flakiness may be, but it needs more thinking probably. Here's a suggestion to make progress:

we could leave testCurrentLag disabled for the new consumer on this PR (just as it was before)

we unblock the other 2 tests, which seem to pass consistently after your changes

we address testCurrentLag in a separate PR, with that jira I created (including the changes you had here, just that I think it needs more)

What do you think?

Hi @lianetm, thanks for filing the Jira. I would like to give "eventually updated" a try in this PR. If it still can't make the test stable, we can put it in next Jira. What do you think? Thank you.

With the latest update, I ran I=0; while ./gradlew cleanTest :clients:test --tests KafkaConsumerTest --rerun --fail-fast; do (( I=$I+1 )); echo "Completed run: $I"; sleep 1; done for 100 iterations without error.

Hi @lianetm, do you think I need to rebase the code so we can get more CI results? Or is it fine to keep the current state? Thank you.

Hi @lianetm, I rebased the code again today and latest CI result looks good. Could you help me review it when you have time? Thank you.

FrankYang0529 · 2024-09-15T03:39:13Z

Hi @lianetm, I think the test result is stable now. Could you take a look when you have time? Thank you.

lianetm

Hey @FrankYang0529, one more comment here. Almost there! Thanks!

lianetm · 2024-09-26T10:25:07Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java


        final ConsumerRecords<String, String> records = (ConsumerRecords<String, String>) consumer.poll(Duration.ofMillis(1));
        assertEquals(5, records.count());
        assertEquals(55L, consumer.position(tp0));

        // correct lag result
-        assertEquals(OptionalLong.of(45L), consumer.currentLag(tp0));
+        // For AsyncKafkaConsumer, subscription sate is updated in background, so the result will eventually be updated.


lianetm · 2024-09-26T10:27:37Z

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java


        final ConsumerRecords<String, String> records = (ConsumerRecords<String, String>) consumer.poll(Duration.ofMillis(1));
        assertEquals(5, records.count());
        assertEquals(55L, consumer.position(tp0));

        // correct lag result
-        assertEquals(OptionalLong.of(45L), consumer.currentLag(tp0));
+        // For AsyncKafkaConsumer, subscription sate is updated in background, so the result will eventually be updated.
+        TestUtils.waitForCondition(() -> {


Is this change really needed? In this case we just did a successful fetch, so the position is updated to 55 (ln 2541). We should be able to retrieve the lag of 45 (the end offset is already known to be 100). (It's not exactly the same case as above, where we needed to allow for the ListOffsets response to be processed in the background.) Makes sense?

Yes, you're right: if consumer.position already returns 55, then the subscription state has already been updated. Removed TestUtils.waitForCondition here. Thanks.

…ly for new consumer with poll(0) Signed-off-by: PoAn Yang <[email protected]>

lianetm

Thanks @FrankYang0529 ! LGTM.

…ly for new consumer with poll(0) (apache#16982) Reviewers: Lianet Magrans <[email protected]>, TaiJuWu <[email protected]>, Kirk True <[email protected]>, TengYao Chi <[email protected]>

lianetm reviewed Aug 23, 2024

FrankYang0529 force-pushed the KAFKA-16792 branch from 1de6793 to 5581e1b Compare August 23, 2024 15:34

TaiJuWu approved these changes Aug 24, 2024

frankvicky approved these changes Aug 24, 2024

lianetm added consumer tests Test fixes (including flaky tests) ctr Consumer Threading Refactor (KIP-848) labels Aug 26, 2024

lianetm reviewed Aug 26, 2024

FrankYang0529 force-pushed the KAFKA-16792 branch from 5581e1b to bcbc49d Compare August 27, 2024 10:30

kirktrue added the KIP-848 The Next Generation of the Consumer Rebalance Protocol label Aug 27, 2024

lianetm reviewed Aug 27, 2024

FrankYang0529 force-pushed the KAFKA-16792 branch from bcbc49d to 1e821a1 Compare August 28, 2024 08:42

lianetm reviewed Aug 28, 2024

kirktrue suggested changes Aug 28, 2024

FrankYang0529 force-pushed the KAFKA-16792 branch from 1e821a1 to 15c902c Compare August 29, 2024 14:08

lianetm reviewed Aug 30, 2024

TaiJuWu reviewed Aug 31, 2024

FrankYang0529 force-pushed the KAFKA-16792 branch from 15c902c to 7037ef9 Compare September 10, 2024 11:54

lianetm reviewed Sep 10, 2024

FrankYang0529 force-pushed the KAFKA-16792 branch 5 times, most recently from a10b657 to daf9513 Compare September 14, 2024 03:16

FrankYang0529 force-pushed the KAFKA-16792 branch 2 times, most recently from fcac466 to 9cd97d3 Compare September 19, 2024 01:20

FrankYang0529 requested a review from lianetm September 25, 2024 00:27

kirktrue added the clients label Sep 26, 2024

lianetm reviewed Sep 26, 2024

KAFKA-16792: Enable consumer unit tests that fail to fetch offsets on…

b914fba

…ly for new consumer with poll(0) Signed-off-by: PoAn Yang <[email protected]>

FrankYang0529 force-pushed the KAFKA-16792 branch from 9cd97d3 to b914fba Compare September 26, 2024 11:41

lianetm approved these changes Sep 26, 2024

lianetm merged commit 3a1465e into apache:trunk Sep 26, 2024
7 of 9 checks passed

FrankYang0529 deleted the KAFKA-16792 branch September 27, 2024 00:17
