Fix a correctness issue around referenceless expressions being evaluated as partition filters #4069
Which Delta project/connector is this regarding?
Description
Fixes a data correctness issue that occurs when non-deterministic expressions that reference no columns, such as rand(), are used as a filter on a Delta table. These filters were being treated as partition filters and evaluated twice. As a result, a filter such as
rand() < 0.5
filtered out ~75% of the data (because it was evaluated twice) instead of just 50%. Added a feature flag to restore the old behavior, just in case.
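The ~75% figure follows from simple arithmetic: if the two evaluations draw independent random values, a row survives only when both draws fall below 0.5, which happens with probability 0.5 × 0.5 = 0.25. The sketch below is a plain Python simulation of that effect, not Delta or Spark code; the function name `fraction_kept` and the independence of the two draws are assumptions made for illustration.

```python
import random


def fraction_kept(num_rows, evaluations, threshold=0.5, seed=42):
    """Simulate a `rand() < threshold` filter that is evaluated
    `evaluations` times per row, each time with a fresh random draw.

    Hypothetical helper for illustration only; it does not model how
    Delta actually plans or executes filters.
    """
    rng = random.Random(seed)
    kept = 0
    for _ in range(num_rows):
        # A row is kept only if every evaluation of the predicate passes.
        if all(rng.random() < threshold for _ in range(evaluations)):
            kept += 1
    return kept / num_rows


# Intended behavior: the predicate is evaluated once per row.
single = fraction_kept(100_000, evaluations=1)
# Buggy behavior: the same predicate is evaluated twice per row.
double = fraction_kept(100_000, evaluations=2)

print(f"evaluated once:  kept {single:.1%} of rows")
print(f"evaluated twice: kept {double:.1%} of rows")
```

Evaluated once, the filter keeps roughly half of the rows; evaluated twice, it keeps roughly a quarter, which matches the ~75% of data being filtered out as described above.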
How was this patch tested?
Added a unit test, and also tested the old behavior with the feature flag enabled.
Does this PR introduce any user-facing changes?
Yes. Filters that reference no columns, such as rand(), are no longer evaluated twice.