
Add null checks to generated pruning predicates #14297

Draft
wants to merge 1 commit into main
Conversation

adriangb
Contributor

Currently, pruning predicates may return NULL to indicate "this container should be included", effectively treating NULL as a truthy value. That is quite confusing, as the various comments addressing it explain.

Additionally, this is a big inconvenience for anything using PredicateRewriter, because you have to handle NULLs yourself: if you pipe the result into a WHERE clause you silently get the wrong result. The workaround is to wrap the expression returned by PredicateRewriter in (<expr>) IS NOT FALSE, which makes NULL truthy. This has the unfortunate consequence of turning a simple binary expression into a non-sargable one. That poses a problem for systems that want to store statistics in a DBMS with indexes. For example, if I add an index on col1_min it cannot be used, because the (...) IS NOT FALSE wrapper prevents anything from being pushed down into indexes.
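To make the NULL-as-truthy hazard concrete, here is a minimal sketch (not DataFusion code) of SQL three-valued logic modeled with `Option<bool>`. It shows how a NULL-producing pruning predicate piped into a WHERE clause silently drops a container that should be scanned, and how the `(<expr>) IS NOT FALSE` workaround keeps it. `min_stat` is a hypothetical stand-in for a `col1_min` statistics column.

```rust
// Models `col1_min <= 5`: a missing (NULL) stat yields a NULL result.
fn pruning_predicate(min_stat: Option<i64>) -> Option<bool> {
    min_stat.map(|m| m <= 5)
}

// SQL WHERE keeps a row only when the predicate is TRUE; NULL is dropped.
fn where_clause_keeps(p: Option<bool>) -> bool {
    p == Some(true)
}

// The workaround: `(<expr>) IS NOT FALSE` treats NULL as truthy.
fn is_not_false(p: Option<bool>) -> bool {
    p != Some(false)
}

fn main() {
    // Stats missing: the container must be scanned, but a plain WHERE prunes it.
    assert_eq!(where_clause_keeps(pruning_predicate(None)), false); // silently wrong
    assert_eq!(is_not_false(pruning_predicate(None)), true);        // workaround keeps it
    // Known stats behave the same either way.
    assert_eq!(where_clause_keeps(pruning_predicate(Some(3))), true);
    assert_eq!(where_clause_keeps(pruning_predicate(Some(9))), false);
}
```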

This PR addresses both problems by introducing checks for NULLs in the stats columns in the right places, such that we can now promise that the predicates always return true or false, never NULL.

Since we make no promises about the exact shape of the produced predicate, this should not be a breaking change.

This does not read any extra columns and the null checks should be very cheap, so I do not expect this to have any performance impact on systems evaluating statistics in memory (like DataFusion does internally for parquet row group and page statistics).

@github-actions github-actions bot added the optimizer Optimizer rules label Jan 25, 2025
/// * Rows that evaluate to `NULL` are **NOT** returned (also “filtered out”).
/// Note: *this treatment of `NULL` is **DIFFERENT** than how `NULL` is treated
/// in the rewritten predicate described below.*
/// * No rows should evaluate to `null`, even if statistics are missing (the statistics are themselves `null`).
Contributor Author

I have many more docstrings to update, including examples, but I will hold off on those until I get feedback on the rest of the PR.

 fn stat_column_expr(
     &mut self,
     column: &phys_expr::Column,
     column_expr: &Arc<dyn PhysicalExpr>,
     field: &Field,
     stat_type: StatisticsType,
-) -> Result<Arc<dyn PhysicalExpr>> {
+) -> Result<(Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>)> {
Contributor Author

The idea here is to force callers to do something with the null check expression, or explicitly discard it.
This let me update all call sites confident that I had not missed one, in addition to nicely abstracting away the logic.

@@ -2785,7 +2893,7 @@ mod tests {
vec![lit(1), lit(2), lit(3)],
false,
));
-let expected_expr = "c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 1 AND 1 <= c1_max@1 OR c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 2 AND 2 <= c1_max@1 OR c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 3 AND 3 <= c1_max@1";
+let expected_expr = "(c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 1) AND (c1_max@1 IS NULL OR 1 <= c1_max@1) OR (c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 2) AND (c1_max@1 IS NULL OR 2 <= c1_max@1) OR (c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 3) AND (c1_max@1 IS NULL OR 3 <= c1_max@1)";
Contributor Author

These expressions are long and gross... I wonder if we shouldn't do some combination of formatting the SQL and using insta for snapshotting? Or could we serde-dump the expressions and pretty-print the JSON?
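As a throwaway sketch (not part of the PR) of how readable these strings could become even without insta or serde: break the expression at top-level " OR " boundaries while ignoring ORs nested inside parentheses. The function name `pretty` is hypothetical.

```rust
// Split a predicate string on " OR " only at parenthesis depth 0.
fn pretty(expr: &str) -> String {
    let mut depth = 0u32;
    let mut out = String::new();
    let bytes = expr.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'(' => { depth += 1; out.push('('); i += 1; }
            b')' => { depth = depth.saturating_sub(1); out.push(')'); i += 1; }
            b' ' if depth == 0 && expr[i..].starts_with(" OR ") => {
                out.push_str("\nOR\n");
                i += 4; // skip " OR "
            }
            c => { out.push(c as char); i += 1; }
        }
    }
    out
}

fn main() {
    // ORs inside parentheses stay on one line; top-level ORs become line breaks.
    println!("{}", pretty("(a IS NULL OR a <= 1) AND b <= 2 OR (a IS NULL OR a <= 3) AND b <= 4"));
}
```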

Contributor

Totally agree!

I'm not sure if DataFusion uses insta, but running cargo insta review would definitely make the development experience much easier and quicker when updating tests. Serializing and pretty-printing the expressions would also significantly improve readability.

That said, I know these improvements could become a large PR on their own.

cc @alamb — maybe you can find someone in the community interested in tackling this!

Contributor

insta is cool for sure. Maybe you can write up a ticket describing the need.

@adriangb adriangb force-pushed the null-checks-pruning branch from d68df19 to e0d52d0 Compare January 25, 2025 19:55
@github-actions github-actions bot added the core Core DataFusion crate label Jan 25, 2025
@adriangb adriangb force-pushed the null-checks-pruning branch from e0d52d0 to 4900b24 Compare January 25, 2025 19:56
@adriangb
Contributor Author

@alamb any appetite for us to move forward with this? I must admit that it is something that would greatly help us internally, but I did try to lay out a case above for why it might be beneficial to the wider community.

@adriangb adriangb force-pushed the null-checks-pruning branch from 4900b24 to e3da849 Compare January 25, 2025 20:08
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 25, 2025
@alamb
Contributor

alamb commented Jan 27, 2025

I see this -- it makes sense to me. I am interested in helping move it along, but there are currently a lot of PRs backed up on review. Any chance you want to help get some of them ready? Review the PR and then just @alamb me when it is ready, if you can.

And I am struggling to make the 45 release. I will work to clear out this backlog over the next few days.

[Screenshot: PR review backlog, January 27, 2025]

@adriangb
Contributor Author

Totally understood. No rush on this. I will look through PRs and see if I can help with any.

@alamb
Contributor

alamb commented Jan 30, 2025

This one in particular I think needs some serious thinking to make sure it is correct. Perhaps @appletreeisyellow has some time and interest to help out reviewing.

@appletreeisyellow
Contributor

@adriangb @alamb The PR description makes sense to me. I plan to give it a deeper review either later today or tomorrow

Contributor

@appletreeisyellow appletreeisyellow left a comment


Thank you @adriangb for the nice addition! 💯 Very awesome work!

I carefully reviewed the logic behind adding IS NULL checks, and everything looks solid to me. I also tested it against InfluxDB's internal tests, and they all passed.

The core change is relatively small, with most of the updates coming from docstring improvements and query explain changes. I'm happy to take another look after you update the docstrings -- whether in this PR or a follow-up PR.

cc @alamb

Comment on lines +712 to +718
ColumnarValue::Scalar(ScalarValue::Boolean(None)) => {
// this should catch bugs in tests while keeping
// the "old" behavior for PruningPredicate in production
// that if any nulls are encountered they are treated as truthy
#[cfg(debug_assertions)]
panic!("no predicate should return null!")
}
Contributor

Only panicking in debug builds when null values are detected 👍 You may want to update the docstring on line 702 to reflect that:

    /// # Panics in debug builds
    /// If `value` is not boolean

Contributor

Maybe we can also add a `warn!` log here for release builds

Comment on lines +820 to +826
/// Along with a reference to the min/max/null_count/row_count column
/// an expression for `{stats_col} IS NULL` is returned that should be `OR`ed
/// with any expressions applied to the column itself such that if any value in `{stats_col}`
/// is `null` we return `true` and not `null` because `null` is falsy, but we'd need to scan that
/// container, causing confusion and preventing optimizations because of the inverted logic!
///
/// Thus for the example above we'd arrive at the final expression `foo_max is null or foo_max > 5`.
Contributor

This is helpful!
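The rewrite that docstring describes can be sketched outside DataFusion with `Option<i64>` standing in for the `foo_max` statistics column: `foo_max IS NULL OR foo_max > 5` always yields a concrete boolean, keeping containers with missing stats instead of letting NULL fall through as falsy.

```rust
// Before the rewrite: a NULL stat yields NULL, which is falsy => wrongly pruned.
fn original(foo_max: Option<i64>) -> Option<bool> {
    foo_max.map(|m| m > 5)
}

// After the rewrite: models `foo_max IS NULL OR foo_max > 5`.
fn rewritten(foo_max: Option<i64>) -> bool {
    foo_max.is_none() || foo_max.map(|m| m > 5) == Some(true)
}

fn main() {
    assert_eq!(original(None), None);      // ambiguous: treated as "prune"
    assert_eq!(rewritten(None), true);     // missing stats => must scan
    assert_eq!(rewritten(Some(3)), false); // max <= 5 => safe to prune
    assert_eq!(rewritten(Some(9)), true);  // max > 5 => must scan
}
```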

Comment on lines +1350 to +1355
// If either the null_count or the row_count are null we can't know anything, so return true
let any_null_expr = Arc::new(phys_expr::BinaryExpr::new(
row_count_is_null_expr,
Operator::Or,
null_count_column_is_null_expr,
));
Contributor

Aggressively returning true if either of the stats is not known 👍
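The guarded check assembled above can be sketched as a plain function (the name `not_all_null` is hypothetical): scan (true) when either stat is missing, otherwise prune only when every row in the container is NULL.

```rust
// Models `(null_count IS NULL OR row_count IS NULL) OR null_count != row_count`.
fn not_all_null(null_count: Option<u64>, row_count: Option<u64>) -> bool {
    match (null_count, row_count) {
        (Some(n), Some(r)) => n != r, // the original c1_null_count != c1_row_count
        _ => true,                    // either stat unknown => cannot prune
    }
}

fn main() {
    assert_eq!(not_all_null(None, Some(100)), true);       // missing stat: keep
    assert_eq!(not_all_null(Some(100), Some(100)), false); // all nulls: prune
    assert_eq!(not_all_null(Some(3), Some(100)), true);    // some non-null values: keep
}
```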


@@ -2894,7 +3003,7 @@ mod tests {
#[test]
fn row_group_predicate_cast() -> Result<()> {
let schema = Schema::new(vec![Field::new("c1", DataType::Int32, false)]);
-let expected_expr = "c1_null_count@2 != c1_row_count@3 AND CAST(c1_min@0 AS Int64) <= 1 AND 1 <= CAST(c1_max@1 AS Int64)";
+let expected_expr = "(c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (CAST(c1_min@0 AS Int64) IS NULL OR CAST(c1_min@0 AS Int64) <= 1) AND (CAST(c1_max@1 AS Int64) IS NULL OR 1 <= CAST(c1_max@1 AS Int64))";
Contributor

I verified in datafusion-cli ✅

> SELECT CAST(NULL AS int) IS NULL;
+--------------+
| NULL IS NULL |
+--------------+
| true         |
+--------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

> SELECT CAST('2' AS int) IS NULL;
+-------------------+
| Utf8("2") IS NULL |
+-------------------+
| false             |
+-------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

@@ -2910,7 +3019,7 @@ mod tests {
assert_eq!(predicate_expr.to_string(), expected_expr);

let expected_expr =
-    "c1_null_count@1 != c1_row_count@2 AND TRY_CAST(c1_max@0 AS Int64) > 1";
+    "(c1_null_count@1 IS NULL OR c1_row_count@2 IS NULL OR c1_null_count@1 != c1_row_count@2) AND (TRY_CAST(c1_max@0 AS Int64) IS NULL OR TRY_CAST(c1_max@0 AS Int64) > 1)";
Contributor

I verified in datafusion-cli ✅

> SELECT TRY_CAST(NULL AS int) IS NULL;
+--------------+
| NULL IS NULL |
+--------------+
| true         |
+--------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

> SELECT TRY_CAST(1234.56 AS int) IS NULL;
+--------------------------+
| Float64(1234.56) IS NULL |
+--------------------------+
| false                    |
+--------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

Contributor

@alamb alamb left a comment


Thank you @adriangb and @appletreeisyellow

I am concerned about making the predicates more complicated for the reasons I outlined below

I realize that you can't use (<expr>) IS NOT FALSE when pushing a pruning predicate directly into postgres.

Could you potentially just add it again, with an IS NOT FALSE?

So instead of

select file_id from .... WHERE <pruning_predicate_expr>

Add something like

select file_id from .... WHERE <pruning_predicate_expr> OR (<pruning_predicate_expr> IS NOT NULL)

🤔

@@ -287,11 +285,6 @@ pub trait PruningStatistics {
/// predicate can never possibly be true). The container can be pruned (skipped)
/// entirely.
///
/// While `PruningPredicate` will never return a `NULL` value, the
Contributor

This is a nice simplification for sure 👍

/// the min/max values are not known. *Note that this is different than
/// the SQL filter semantics where `NULL` means the row is filtered
/// out.*
///
Contributor

Can you also please update the examples in the table below to include the new predicates?

/// Here are some examples of the rewritten predicates:
...


@@ -2475,7 +2587,7 @@ mod tests {
fn row_group_predicate_eq() -> Result<()> {
let schema = Schema::new(vec![Field::new("c1", DataType::Int32, false)]);
let expected_expr =
-    "c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 1 AND 1 <= c1_max@1";
+    "(c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 1) AND (c1_max@1 IS NULL OR 1 <= c1_max@1)";
Contributor

I am worried that turning a simple single-column col <op> const predicate like c1_null_count@2 != c1_row_count@3 into a more complex one like c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3 will make it hard for these predicates to be pushed down in some cases.

At the moment the expressions look like this:

expr AND expr AND ...

And each expr is typically a simple single column col <op> constant

After this change the expressions would be like

(expr OR expr OR ...) AND (expr OR expr ...)

I thought about using IS DISTINCT FROM instead but I don't think that yields the correct value in all cases

/// `IS DISTINCT FROM` (see [`distinct`])
///
/// [`distinct`]: arrow::compute::kernels::cmp::distinct
IsDistinctFrom,
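One case where `IS DISTINCT FROM` (null-safe inequality) plausibly gives the wrong answer for pruning can be sketched outside DataFusion: when BOTH stats are missing, `null_count IS DISTINCT FROM row_count` evaluates to false, so the container would be pruned even though nothing is known about it.

```rust
// Models SQL `a IS DISTINCT FROM b`: like !=, but never NULL.
fn is_distinct_from(a: Option<u64>, b: Option<u64>) -> bool {
    match (a, b) {
        (None, None) => false,                     // NULL vs NULL: not distinct
        (None, Some(_)) | (Some(_), None) => true, // NULL vs value: distinct
        (Some(x), Some(y)) => x != y,
    }
}

fn main() {
    assert_eq!(is_distinct_from(None, None), false); // both stats missing: would wrongly prune
    assert_eq!(is_distinct_from(None, Some(10)), true);
    assert_eq!(is_distinct_from(Some(10), Some(10)), false); // all-null container: prune is correct
}
```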


@alamb
Contributor

alamb commented Feb 6, 2025

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

@alamb alamb marked this pull request as draft February 6, 2025 19:43
Labels
core Core DataFusion crate optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)