Always persist a high watermark for source table #140

Open
valiantljk opened this issue Jun 20, 2023 · 2 comments

Comments

@valiantljk
Collaborator

valiantljk commented Jun 20, 2023

Currently, the high watermark is linked to the partition locator. During incremental compaction, we expect two sources of input data: the compacted table and the new delta. In rare cases, no new delta exists, so only the compacted table goes through delta discovery and the rest of compaction. As a result, the high watermark recorded in the round completion file comes only from the compacted table.

In the next round, when we retrieve the high watermark from the round completion file, we are not able to get the high watermark of the source table, which in some cases we call old_parent_stream_position.

Two options:

  • get the high watermark from the delta property when the round completion file (rcf) doesn't have it
  • persist the high watermark for the source table in the rcf
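
The two options above can be sketched roughly as follows. This is only an illustration, not the project's actual code: `RoundCompletionInfo`, `resolve_source_high_watermark`, `record_round`, and the `"high_watermark"` property key are all hypothetical names, and stream positions are assumed to be plain integers.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class RoundCompletionInfo:
    """Hypothetical stand-in for the contents of a round completion file (rcf)."""

    # High watermark keyed by partition locator (assumed here to be a string).
    high_watermarks: Dict[str, int] = field(default_factory=dict)


def resolve_source_high_watermark(
    rcf: RoundCompletionInfo,
    source_locator: str,
    delta_properties: Optional[Dict[str, int]] = None,
) -> Optional[int]:
    """Return the source table's high watermark, preferring the rcf."""
    if source_locator in rcf.high_watermarks:
        return rcf.high_watermarks[source_locator]
    # Option 1: fall back to a (hypothetical) delta property when the rcf
    # has no entry for the source table.
    if delta_properties is not None:
        return delta_properties.get("high_watermark")
    return None


def record_round(
    rcf: RoundCompletionInfo,
    source_locator: str,
    previous_source_hwm: Optional[int],
    new_delta_positions: Iterable[int],
) -> RoundCompletionInfo:
    """Option 2: always persist a source high watermark in the rcf."""
    candidates = list(new_delta_positions)
    # Carry the previous watermark forward so a round with no new delta
    # does not lose the source table's position.
    if previous_source_hwm is not None:
        candidates.append(previous_source_hwm)
    if candidates:
        rcf.high_watermarks[source_locator] = max(candidates)
    return rcf
```

With option 2, a round that sees no new delta still writes the previous source position, so the next round never needs the fallback in option 1.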
@pdames
Member

pdames commented Jun 20, 2023

Since there's nothing to update aside from metadata in the case of no new deltas, are we also ensuring that we're not running through all data processing steps of hash bucketing, dedupe, and materialize?

@valiantljk
Collaborator Author

Currently, it'll still go through the steps. We don't have a direct copy route yet. It seems to be a corner case.
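
For reference, a direct route for the no-new-delta case might look something like the sketch below. It is purely hypothetical: `run_compaction_round` and `full_pipeline` are made-up names, stream positions are assumed to be integers, and `full_pipeline` stands in for the hash bucketing, dedupe, and materialize steps.

```python
from typing import Callable, Optional, Sequence


def run_compaction_round(
    new_delta_positions: Sequence[int],
    previous_source_hwm: Optional[int],
    full_pipeline: Callable[[Sequence[int]], None],
) -> Optional[int]:
    """Return the source high watermark to record for this round."""
    if not new_delta_positions:
        # No new delta: skip all data processing steps and only carry the
        # previous source high watermark forward as metadata.
        return previous_source_hwm
    # New deltas exist: run the (hypothetical) full pipeline, then advance
    # the watermark to the latest processed stream position.
    full_pipeline(new_delta_positions)
    return max(new_delta_positions)
```

The point of the sketch is only that the empty-delta check happens before any data processing step, so the round reduces to a metadata update.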
