
KAFKA-14572: Migrate EmbeddedKafkaCluster used by Streams integration tests from EmbeddedZookeeper to KRaft #17016

Merged · 12 commits merged into apache:trunk on Sep 27, 2024

Conversation

@OmniaGM (Collaborator) commented Aug 27, 2024

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@OmniaGM (Collaborator, Author) commented Aug 27, 2024

Still fixing a few test cases; please don't review yet.

@OmniaGM added the streams and tests (Test fixes, including flaky tests) labels on Aug 27, 2024
@OmniaGM force-pushed the KAFKA-14572-stream-kraft branch 4 times, most recently from 95080bb to 0ff51bd on August 28, 2024 at 13:38
@OmniaGM (Collaborator, Author) commented Aug 29, 2024

@lucasbru I have 4 test cases failing, and most of them seem to be related to the AT_LEAST_ONCE guarantee. I can see that when we try to consume data from the output topic we don't get all the records back. During debugging I noticed that all records eventually make it to the output topic, but it takes a long time and I don't want to increase the timeout. Is there any way in Streams to check whether all records have made it before polling them? Or any suggestions here?

@OmniaGM (Collaborator, Author) commented Aug 29, 2024

> @lucasbru I have 4 test cases failing, and most of them seem to be related to the AT_LEAST_ONCE guarantee. I can see that when we try to consume data from the output topic we don't get all the records back. During debugging I noticed that all records eventually make it to the output topic, but it takes a long time and I don't want to increase the timeout. Is there any way in Streams to check whether all records have made it before polling them? Or any suggestions here?

I changed the order from {StreamsConfig.AT_LEAST_ONCE, StreamsConfig.EXACTLY_ONCE, StreamsConfig.EXACTLY_ONCE_V2} to {StreamsConfig.EXACTLY_ONCE, StreamsConfig.AT_LEAST_ONCE, StreamsConfig.EXACTLY_ONCE_V2}, and I don't think it's related to AT_LEAST_ONCE per se; whichever test comes first has a problem polling the records correctly.
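
For illustration, a minimal sketch of that reordering as a JUnit 5 parameterized test; the parameterization mechanism and test names here are assumptions, not the actual Streams test code:

```java
import org.apache.kafka.streams.StreamsConfig;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class ProcessingGuaranteeOrderingTest {

    // EXACTLY_ONCE now runs first; if the first guarantee to run is the one
    // that fails, the culprit is cluster readiness rather than AT_LEAST_ONCE.
    @ParameterizedTest
    @ValueSource(strings = {
        StreamsConfig.EXACTLY_ONCE,
        StreamsConfig.AT_LEAST_ONCE,
        StreamsConfig.EXACTLY_ONCE_V2
    })
    void shouldCompleteWithGuarantee(final String processingGuarantee) {
        // A real test would set StreamsConfig.PROCESSING_GUARANTEE_CONFIG to
        // processingGuarantee, run the topology, and verify the output topic.
    }
}
```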

@lucasbru (Member) commented

@OmniaGM Looks like something in KRaft is just taking longer? Have we not observed this problem with the Connect unit tests? I wonder if there are some settings we could use to speed things up.

@lucasbru (Member) commented Sep 2, 2024

Could we try non-combined mode with this?

@OmniaGM (Collaborator, Author) commented Sep 2, 2024

> @OmniaGM Looks like something in KRaft is just taking longer? Have we not observed this problem with the Connect unit tests? I wonder if there are some settings we could use to speed things up.

It did, and I believe the solution was to wait for a warm-up producer and consumer to succeed before running any tests (sketched below).

> Could we try non-combined mode with this?

Will give it a try.
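
A minimal sketch of the warm-up approach mentioned above, using plain Kafka clients; the bootstrap address handling, topic name, and configs are illustrative assumptions, and the warm-up topic is assumed to exist or be auto-created:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public final class ClusterWarmup {

    public static void warmUp(final String bootstrapServers) throws Exception {
        final Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        // Block until the broker acknowledges a write, i.e. the cluster can serve produces.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("warmup", "k", "v")).get();
        }

        final Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "warmup-group");
        consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        // Poll until the warm-up record is readable, i.e. the cluster can serve fetches.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("warmup"));
            while (consumer.poll(Duration.ofMillis(200)).isEmpty()) {
                // keep polling until the record produced above is visible
            }
        }
    }
}
```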

@OmniaGM (Collaborator, Author) commented Sep 2, 2024

> Could we try non-combined mode with this?

I have tried non-combined mode, but nothing changed. I also tried to make sure the cluster was warmed up by producing to each broker, but that didn't make any difference either. I am just confused: in a test like JoinGracePeriodDurabilityIntegrationTest we can publish and consume normally until we restart the Streams application, and then we fail because we expect 3 records but get only 2. So I'm confused why everything passes before the restart, yet after the restart it fails, and only on the first run of the test case.

@lucasbru (Member) commented Sep 3, 2024

EDIT: that's probably not the problem :) but I don't quite understand the problem either.

I checked JoinGracePeriodDurabilityIntegrationTest. This one seems to test something specific about how time progresses. I haven't found the problem yet, but I wonder if it's somehow related to MockTime... I'll have another look later.

@OmniaGM (Collaborator, Author) commented Sep 5, 2024

> EDIT: that's probably not the problem :) but I don't quite understand the problem either.
>
> I checked JoinGracePeriodDurabilityIntegrationTest. This one seems to test something specific about how time progresses. I haven't found the problem yet, but I wonder if it's somehow related to MockTime... I'll have another look later.

I'll try to have another look later this week.

@bbejeck (Contributor) commented Sep 19, 2024

Hi @OmniaGM - I took a pass and found the problem.

For the failing tests, the EmbeddedKafkaCluster started with time 0; subsequently, the log cleaner would run and delete the log. If we use the constructor overload of EmbeddedKafkaCluster that doesn't take a time parameter, it initializes to the current time. The other fix, for tests with low timestamp values (0, 5000, etc.), was to create a static variable holding the current time and add it to each timestamp, preserving the test logic. I've pushed a PR to demonstrate the changes for updating this one.

I'm still looking at one failure in RestoreIntegrationTest that I think is unrelated to timestamps. Should be fixed now.
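
For illustration, a minimal sketch of the fix described above; the class name matches the Streams test utility, but the constructor arity and the ts() helper are assumptions based on this description rather than the exact diff:

```java
import org.apache.kafka.streams.integration.utils.EmbeddedKafkaCluster;

public class TimestampAnchoredIntegrationTest {

    // Omitting the time argument lets the cluster initialize to the current
    // wall-clock time, so the log cleaner no longer treats every segment as
    // already expired and deletes the log out from under the test.
    public static final EmbeddedKafkaCluster CLUSTER = new EmbeddedKafkaCluster(1);

    // For tests that hard-coded low timestamps (0, 5000, ...), anchor them to
    // a fixed "now" so the relative time logic the assertions rely on is kept.
    private static final long NOW = System.currentTimeMillis();

    private static long ts(final long offsetMs) {
        return NOW + offsetMs; // e.g. ts(5000L) where the test used a bare 5000L
    }
}
```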

@OmniaGM (Collaborator, Author) commented Sep 20, 2024

> Hi @OmniaGM - I took a pass and found the problem.
>
> For the failing tests, the EmbeddedKafkaCluster started with time 0; subsequently, the log cleaner would run and delete the log. If we use the constructor overload of EmbeddedKafkaCluster that doesn't take a time parameter, it initializes to the current time. The other fix, for tests with low timestamp values (0, 5000, etc.), was to create a static variable holding the current time and add it to each timestamp, preserving the test logic. I've pushed a PR to demonstrate the changes for updating this one.
>
> I'm still looking at one failure in RestoreIntegrationTest that I think is unrelated to timestamps. Should be fixed now.

Thanks @bbejeck for spotting this; I've updated the PR now.

@bbejeck (Contributor) commented Sep 20, 2024

@OmniaGM I just looked at the build report from my sample PR, and there were a bunch of failures in the integration tests, even though they were all passing locally, so we'll see how it goes for you.

@bbejeck (Contributor) commented Sep 26, 2024

Hey @OmniaGM - great news - I made a few tweaks, and the dummy PR I have now has 2 green builds for Java 11 and Java 17. I just noticed the same here; we've done away with the Jenkins build, so if you update I think we'll have a clean build.

Ping me after updating and we'll get this reviewed and merged!

@OmniaGM (Collaborator, Author) commented Sep 27, 2024

> Hey @OmniaGM - great news - I made a few tweaks, and the dummy PR I have now has 2 green builds for Java 11 and Java 17. I just noticed the same here; we've done away with the Jenkins build, so if you update I think we'll have a clean build.
>
> Ping me after updating and we'll get this reviewed and merged!

Thanks @bbejeck, I have pulled in your fixes with some changes. I used IntegrationTestUtils.startApplicationAndWaitUntilRunning instead of:

```java
streams.start();
TestUtils.waitForCondition(() -> streams.state() == KafkaStreams.State.RUNNING, "KafkaStreams not running in time");
```
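
A minimal sketch of the replacement, assuming the List-plus-timeout overload of IntegrationTestUtils.startApplicationAndWaitUntilRunning and a 60-second timeout:

```java
// Assumed overload: startApplicationAndWaitUntilRunning(List<KafkaStreams>, Duration).
// Starts the application and blocks until it reports RUNNING, replacing the
// manual start() + waitForCondition() pair above.
IntegrationTestUtils.startApplicationAndWaitUntilRunning(
        Collections.singletonList(streams), Duration.ofSeconds(60));
```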

Tests are passing locally; I'll wait for Jenkins.

@bbejeck (Contributor) commented Sep 27, 2024

@OmniaGM SGTM, thanks for updating

@bbejeck (Contributor) left a comment


Thanks for the contribution @OmniaGM! LGTM

@bbejeck merged commit 1854d4b into apache:trunk on Sep 27, 2024 (9 checks passed)
@bbejeck (Contributor) commented Sep 27, 2024

Merged #17016 to trunk

bbejeck pushed a commit to bbejeck/kafka that referenced this pull request on Sep 28, 2024

… tests from EmbeddedZookeeper to KRaft (apache#17016)

Migrate the EmbeddedKafkaCluster from the EmbeddedZookeeper to KRaft

Reviewers: Bill Bejeck <[email protected]>
airlock-confluentinc bot pushed a commit to confluentinc/kafka that referenced this pull request on Sep 30, 2024

… tests from EmbeddedZookeeper to KRaft (apache#17016)

Migrate the EmbeddedKafkaCluster from the EmbeddedZookeeper to KRaft

Reviewers: Bill Bejeck <[email protected]>
Labels: streams · tests (Test fixes, including flaky tests)
3 participants