[HUDI-8940] Fix Bloom Index Partitioner to distribute keys uniformly across partitions #12741

Merged (4 commits), Feb 26, 2025

@vamsikarnika vamsikarnika commented Jan 30, 2025

Change Logs

With bucketized partitioning, the Bloom index can cause data skew in the repartition-and-sort stage when hollow buckets are created. This happens when one partition receives a lot of writes and the other partitions receive few.

In this PR, we partition using a sort partitioner keyed on fileId + recordKey, which distributes the keys evenly while keeping the same fileIds together.
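To make the idea concrete, here is a minimal plain-Java sketch of sorting comparisons by fileId and then recordKey and cutting the sorted run into near-equal contiguous slices. It is an illustration under assumptions, not the code in this PR: the class `FileIdKeySortSketch`, the `Comparison` holder, and the `partition` method are hypothetical names, and the actual change runs on Spark rather than on in-memory lists.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; hypothetical names, not Hudi's actual partitioner.
final class FileIdKeySortSketch {

    // Hypothetical holder for one (fileId, recordKey) comparison.
    static final class Comparison {
        final String fileId;
        final String recordKey;

        Comparison(String fileId, String recordKey) {
            this.fileId = fileId;
            this.recordKey = recordKey;
        }
    }

    // Sorts by fileId, then recordKey, and slices the sorted list into
    // numPartitions contiguous ranges whose sizes differ by at most one.
    static List<List<Comparison>> partition(List<Comparison> input, int numPartitions) {
        List<Comparison> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing((Comparison c) -> c.fileId)
                .thenComparing(c -> c.recordKey));

        List<List<Comparison>> partitions = new ArrayList<>();
        int n = sorted.size();
        for (int p = 0; p < numPartitions; p++) {
            // Range boundaries are computed in long arithmetic to avoid overflow.
            int from = (int) ((long) n * p / numPartitions);
            int to = (int) ((long) n * (p + 1) / numPartitions);
            partitions.add(new ArrayList<>(sorted.subList(from, to)));
        }
        return partitions;
    }
}
```

Because the slices are contiguous in sorted order, comparisons for the same fileId stay adjacent, and a heavily written file group is spread over several neighbouring slices instead of landing in one. For example, nine comparisons against one file group plus one against another, split five ways, give five slices of two comparisons each.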

Impact

The Bloom index can now use a fileId- and recordKey-based partitioner to distribute comparisons evenly across partitions in cases where bucketized partitioning causes data skew.

Risk level (write none, low medium or high below)

Medium

Added Functional Tests

Documentation Update

public static final ConfigProperty<String> BLOOM_INDEX_FILE_GROUP_ID_KEY_SORT_PARTITIONER = ConfigProperty
      .key("hoodie.bloom.index.fileId.key.sort.partitioner")
      .defaultValue("false")
      .markAdvanced()
      .withDocumentation("Only applies if index type is BLOOM. "
          + "When true, fileId and key sort based partitioning is enabled. "
          + "This reduces skew seen in bucket based bloom index lookup");
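
As a usage sketch, the option could be switched on through writer properties along these lines. This assumes the key exactly as quoted in the description above; the review below mentions a naming fix, so the key in the merged code may differ. The class and method names here are hypothetical, and only `java.util.Properties` is used.

```java
import java.util.Properties;

// Illustrative sketch only; the class and helper names are hypothetical.
final class BloomIndexSortPartitionerConfigSketch {

    // Key as quoted in the PR description; the merged name may differ.
    static final String SORT_PARTITIONER_KEY =
            "hoodie.bloom.index.fileId.key.sort.partitioner";

    // Builds writer properties with the BLOOM index and the new option enabled.
    static Properties writerProps() {
        Properties props = new Properties();
        props.setProperty("hoodie.index.type", "BLOOM");
        props.setProperty(SORT_PARTITIONER_KEY, "true");
        return props;
    }

    // Reads the flag, falling back to the documented default of "false".
    static boolean isEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(SORT_PARTITIONER_KEY, "false"));
    }
}
```

Since the default is "false", existing pipelines keep their current partitioning unless the flag is set explicitly.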

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@vamsikarnika vamsikarnika force-pushed the fix_bloom_index_partitioner_v3 branch from 81853e0 to 666815b Compare January 30, 2025 11:54
@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Jan 30, 2025
@vamsikarnika vamsikarnika changed the title Implement FileId + RecordKey based sort partitioning to reduce skew i… [HUDI-8940] Fix Bloom Index Partitioner to distribute keys uniformly across partitions Jan 30, 2025
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Feb 24, 2025
@vamsikarnika vamsikarnika force-pushed the fix_bloom_index_partitioner_v3 branch from ebfb115 to 3a5e401 Compare February 24, 2025 19:21
@github-actions github-actions bot added size:S PR with lines of changes in (10, 100] and removed size:M PR with lines of changes in (100, 300] labels Feb 24, 2025
@yihua yihua self-assigned this Feb 24, 2025
@yihua yihua force-pushed the fix_bloom_index_partitioner_v3 branch from 3a5e401 to 5a8ec1d Compare February 26, 2025 00:46
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Feb 26, 2025

@yihua yihua left a comment

LGTM after I fixed the naming.

@apache apache deleted a comment from hudi-bot Feb 26, 2025
@apache apache deleted a comment from hudi-bot Feb 26, 2025
@yihua yihua merged commit 62af211 into apache:master Feb 26, 2025
43 checks passed
Labels
release-1.0.2 size:M PR with lines of changes in (100, 300]
3 participants