Add null checks to generated pruning predicates #14297
Conversation
/// * Rows that evaluate to `NULL` are **NOT** returned (also “filtered out”).
///   Note: *this treatment of `NULL` is **DIFFERENT** than how `NULL` is treated
///   in the rewritten predicate described below.*
/// * No rows should evaluate to `null`, even if statistics are missing (the statistics are themselves `null`).
I have many more docstrings to update, including examples, but I will hold off on that until I get feedback on the rest of the PR.
fn stat_column_expr(
    &mut self,
    column: &phys_expr::Column,
    column_expr: &Arc<dyn PhysicalExpr>,
    field: &Field,
    stat_type: StatisticsType,
-) -> Result<Arc<dyn PhysicalExpr>> {
+) -> Result<(Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>)> {
The idea here is to force callers to do something with the null check expression, or explicitly discard it.
This let me update all call sites with confidence that I hadn't missed one, in addition to nicely abstracting away the logic.
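To make the intent concrete, here is a minimal, hypothetical sketch of the tuple-return pattern (a toy `Expr` enum stands in for the real `Arc<dyn PhysicalExpr>`; names and shapes are illustrative, not the PR's actual code): because the helper returns both the statistics expression and its null check, every caller has to either combine them or explicitly discard the check.

```rust
// Simplified, hypothetical expression type standing in for Arc<dyn PhysicalExpr>.
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    IsNull(Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
    Lt(Box<Expr>, i64),
}

/// Returns `(stats_expr, stats_expr IS NULL)` for a hypothetical `{column}_min` stats column.
fn stat_column_expr(column: &str) -> (Expr, Expr) {
    let stats = Expr::Column(format!("{column}_min"));
    let is_null = Expr::IsNull(Box::new(stats.clone()));
    (stats, is_null)
}

fn main() {
    // A caller must do something with the null check, e.g. OR it into the predicate...
    let (min_expr, min_is_null) = stat_column_expr("c1");
    let guarded = Expr::Or(
        Box::new(min_is_null),
        Box::new(Expr::Lt(Box::new(min_expr), 1)),
    );
    println!("{guarded:?}"); // roughly: c1_min IS NULL OR c1_min < 1

    // ...or explicitly discard it, which the tuple makes visible at the call site.
    let (_min_expr, _discarded_null_check) = stat_column_expr("c1");
}
```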
@@ -2785,7 +2893,7 @@ mod tests {
        vec![lit(1), lit(2), lit(3)],
        false,
    ));
-    let expected_expr = "c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 1 AND 1 <= c1_max@1 OR c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 2 AND 2 <= c1_max@1 OR c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 3 AND 3 <= c1_max@1";
+    let expected_expr = "(c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 1) AND (c1_max@1 IS NULL OR 1 <= c1_max@1) OR (c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 2) AND (c1_max@1 IS NULL OR 2 <= c1_max@1) OR (c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 3) AND (c1_max@1 IS NULL OR 3 <= c1_max@1)";
These expressions are long and gross... I wonder if we shouldn't do some combination of formatting the SQL and using insta for snapshotting? Or can we serde dump the expressions and pretty-format the JSON?
Totally agree!
I'm not sure if DataFusion uses insta, but running `cargo insta review` would definitely make the development experience much easier and quicker when updating tests. Serializing and pretty-printing the expressions would also significantly improve readability.
That said, I know these improvements could become a large PR on their own.
cc @alamb — maybe you can find someone in the community interested in tackling this!
insta is cool for sure. Maybe you can write up a ticket describing the need.
@alamb any appetite for us to move forward with this? I must admit that it is something that would greatly help us internally, but I did try to lay out a case above for why it might be beneficial to the wider community.
I see this -- it makes sense to me. I am interested in helping move it along, but there are currently a lot of PRs backed up on review. Any chance you want to help get some of them ready? I am struggling to make the 45 release. I will work to clear out this backlog over the next few days.
Totally understood. No rush on this. I will look through PRs and see if I can help with any.
This one in particular I think needs some serious thinking to make sure it is correct. Perhaps @appletreeisyellow has some time and interest to help out reviewing.
Thank you @adriangb for the nice add on! 💯 Very awesome work!
I carefully reviewed the logic behind adding `IS NULL` checks, and everything looks solid to me. I also tested it against InfluxDB's internal tests, and they all passed.
The core change is relatively small, with most of the updates coming from docstring improvements and query explain changes. I'm happy to take another look after you update the docstrings -- whether in this PR or a follow-up PR.
cc @alamb
ColumnarValue::Scalar(ScalarValue::Boolean(None)) => {
    // this should catch bugs in tests while keeping
    // the "old" behavior for PruningPredicate in production
    // that if any nulls are encountered they are treated as truthy
    #[cfg(debug_assertions)]
    panic!("no predicate should return null!")
}
Only panic in debug builds when null values are detected 👍 You may want to update the docstring on line 702 to reflect that:
/// # Panics in debug builds
/// If `value` is not boolean
Maybe we can also add a `warn!` log here as well for release builds.
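Something along these lines, perhaps; a hedged sketch (not the PR's actual code) that keeps the debug-build panic and adds a warning for release builds, assuming the `log` crate is available as a dependency:

```rust
// Sketch only: how the null-result arm could panic in debug builds while
// logging (and keeping the old truthy behavior) in release builds.
fn handle_null_pruning_result() -> bool {
    if cfg!(debug_assertions) {
        // Catch bugs in tests and debug builds.
        panic!("no pruning predicate should return null!");
    }
    // Release builds: fall back to the old behavior (null is treated as
    // truthy, i.e. keep the container), but leave a trace of the condition.
    log::warn!("pruning predicate returned null; treating container as kept");
    true
}

fn main() {
    // Panics in a debug build; keeps the container and warns in a release build.
    let keep = handle_null_pruning_result();
    println!("keep container: {keep}");
}
```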
/// Along with a reference to the min/max/null_count/row_count column
/// an expression for `{stats_col} IS NULL` is returned that should be `OR`ed
/// with any expressions applied to the column itself such that if any value in `{stats_col}`
/// is `null` we return `true` and not `null` because `null` is falsy, but we'd need to scan that
/// container, causing confusion and preventing optimizations because of the inverted logic!
///
/// Thus for the example above we'd arrive at the final expression `foo_max is null or foo_max > 5`.
This is helpful!
// If either the null_count or the row_count are null we can't know anything, so return true
let any_null_expr = Arc::new(phys_expr::BinaryExpr::new(
    row_count_is_null_expr,
    Operator::Or,
    null_count_column_is_null_expr,
));
Aggressively returning true if either of the stats is not known 👍
@@ -2894,7 +3003,7 @@ mod tests {
    #[test]
    fn row_group_predicate_cast() -> Result<()> {
        let schema = Schema::new(vec![Field::new("c1", DataType::Int32, false)]);
-        let expected_expr = "c1_null_count@2 != c1_row_count@3 AND CAST(c1_min@0 AS Int64) <= 1 AND 1 <= CAST(c1_max@1 AS Int64)";
+        let expected_expr = "(c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (CAST(c1_min@0 AS Int64) IS NULL OR CAST(c1_min@0 AS Int64) <= 1) AND (CAST(c1_max@1 AS Int64) IS NULL OR 1 <= CAST(c1_max@1 AS Int64))";
I verified in datafusion-cli ✅
> SELECT CAST(NULL AS int) IS NULL;
+--------------+
| NULL IS NULL |
+--------------+
| true |
+--------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> SELECT CAST('2' AS int) IS NULL;
+-------------------+
| Utf8("2") IS NULL |
+-------------------+
| false |
+-------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
@@ -2910,7 +3019,7 @@ mod tests {
        assert_eq!(predicate_expr.to_string(), expected_expr);

        let expected_expr =
-            "c1_null_count@1 != c1_row_count@2 AND TRY_CAST(c1_max@0 AS Int64) > 1";
+            "(c1_null_count@1 IS NULL OR c1_row_count@2 IS NULL OR c1_null_count@1 != c1_row_count@2) AND (TRY_CAST(c1_max@0 AS Int64) IS NULL OR TRY_CAST(c1_max@0 AS Int64) > 1)";
I verified in datafusion-cli ✅
> SELECT TRY_CAST(NULL AS int) IS NULL;
+--------------+
| NULL IS NULL |
+--------------+
| true |
+--------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> SELECT TRY_CAST(1234.56 AS int) IS NULL;
+--------------------------+
| Float64(1234.56) IS NULL |
+--------------------------+
| false |
+--------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
Thank you @adriangb and @appletreeisyellow
I am concerned about making the predicates more complicated for the reasons I outlined below.
I realize that you can't use `(<expr>) IS NOT FALSE` when pushing a pruning predicate directly into postgres. Could you potentially just add it again, with an `IS NOT FALSE`?
So instead of
select file_id from .... WHERE <pruning_predicate_expr>
add something like
select file_id from .... WHERE <pruning_predicate_expr> OR (<pruning_predicate_expr> IS NOT NULL)
🤔
@@ -287,11 +285,6 @@ pub trait PruningStatistics {
/// predicate can never possibly be true). The container can be pruned (skipped)
/// entirely.
///
/// While `PruningPredicate` will never return a `NULL` value, the
This is a nice simplification for sure 👍
/// the min/max values are not known. *Note that this is different than
/// the SQL filter semantics where `NULL` means the row is filtered
/// out.*
///
Can you also please update the examples in the table below to include the new predicates?
/// Here are some examples of the rewritten predicates:
...
@@ -2475,7 +2587,7 @@ mod tests {
    fn row_group_predicate_eq() -> Result<()> {
        let schema = Schema::new(vec![Field::new("c1", DataType::Int32, false)]);
        let expected_expr =
-            "c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 1 AND 1 <= c1_max@1";
+            "(c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 1) AND (c1_max@1 IS NULL OR 1 <= c1_max@1)";
I am worried that turning a simple single-column `col <op> const` predicate like `c1_null_count@2 != c1_row_count@3` into a more complex one like `c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3` will make it hard for these predicates to be pushed down in some cases.
At the moment the expressions look like
expr AND expr AND ...
and each `expr` is typically a simple single-column `col <op> constant`. After this change the expressions would look like
(expr OR expr OR ...) AND (expr OR expr ...)
I thought about using `IS DISTINCT FROM` instead, but I don't think that yields the correct value in all cases:
datafusion/datafusion/expr-common/src/operator.rs
Lines 49 to 52 in 7f9a8c0
/// `IS DISTINCT FROM` (see [`distinct`])
///
/// [`distinct`]: arrow::compute::kernels::cmp::distinct
IsDistinctFrom,
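To illustrate that concern, here is a hedged sketch (not DataFusion code) that models SQL three-valued logic with `Option`: when both the null_count and row_count statistics are missing, `IS DISTINCT FROM` evaluates to false (which would prune the container), while the generated `IS NULL OR ... != ...` guard evaluates to true (keep the container), the safe answer when nothing is known.

```rust
/// SQL-style `a != b` with three-valued logic: NULL if either side is NULL.
fn sql_ne(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    Some(a? != b?)
}

/// SQL `a IS DISTINCT FROM b`: never NULL, treats NULL as a comparable value.
fn is_distinct_from(a: Option<i64>, b: Option<i64>) -> bool {
    a != b
}

/// The guard this PR generates: `a IS NULL OR b IS NULL OR a != b`.
fn guarded_ne(a: Option<i64>, b: Option<i64>) -> bool {
    a.is_none() || b.is_none() || a != b
}

fn main() {
    // Statistics entirely missing for this container.
    let (null_count, row_count) = (None, None);
    assert!(guarded_ne(null_count, row_count)); // true -> container is kept
    assert!(!is_distinct_from(null_count, row_count)); // false -> would prune
    assert_eq!(sql_ne(null_count, row_count), None); // the old NULL result
    println!("missing stats: keep = {}", guarded_ne(null_count, row_count));
}
```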
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look.
Currently pruning predicates may return `NULL` to indicate "this container should be included", thus using `NULL` as a truthy value. That is quite confusing, as explained in the various comments addressing it.

Additionally this is a big inconvenience for anything using `PredicateRewriter`, because you have to handle nulls yourself, i.e. if you pipe the result into a `WHERE` clause you get the wrong result (silently!!). The workaround is to wrap the expression returned by `PredicateRewriter` with `(<expr>) IS NOT FALSE`, which makes `NULL` truthy. This has the unfortunate consequence of breaking down a simple binary expression into a non-sargable one. This poses a problem for systems that may want to store statistics in a DBMS with indexes. For example, if I add an index on `col1_min` it can't be used because the `(...) IS NOT FALSE` prevents anything from being pushed down into indexes.

This PR addresses both problems by introducing checks for nulls in the stats columns in the right places, such that we can now promise that the predicates always return `true` instead of `NULL`. Since we make no promises about the produced predicate, this should not be a breaking change.

This does not read any extra columns and the null checks should be very cheap, so I do not expect this to have any performance impact on systems evaluating statistics in memory (like DataFusion does internally for parquet row group and page statistics).