
Add null checks to generated pruning predicates #14297

Draft
wants to merge 1 commit into main
Conversation

adriangb
Contributor

Currently, pruning predicates may return NULL to indicate "this container should be included", effectively treating NULL as a truthy value. That is quite confusing, as the various comments addressing it explain.

Additionally, this is a big inconvenience for anything using PredicateRewriter, because you have to handle NULLs yourself: if you pipe the result into a WHERE clause you silently get the wrong result. The workaround is to wrap the expression returned by PredicateRewriter in (<expr>) IS NOT FALSE, which makes NULL truthy. This has the unfortunate consequence of turning a simple binary expression into a non-sargable one. That poses a problem for systems that want to store statistics in a DBMS with indexes. For example, if I add an index on col1_min it cannot be used, because the (...) IS NOT FALSE wrapper prevents anything from being pushed down into indexes.
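To make the NULL-as-truthy hazard concrete, here is a minimal sketch (not DataFusion code) of SQL three-valued logic modeled with `Option<bool>`. It shows how a NULL-producing pruning predicate piped into a WHERE clause silently drops a container that should be scanned, and how the `(<expr>) IS NOT FALSE` workaround keeps it. `min_stat` is a hypothetical stand-in for a `col1_min` statistics column.

```rust
// Models `col1_min <= 5`: a missing (NULL) stat yields a NULL result.
fn pruning_predicate(min_stat: Option<i64>) -> Option<bool> {
    min_stat.map(|m| m <= 5)
}

// SQL WHERE keeps a row only when the predicate is TRUE; NULL is dropped.
fn where_clause_keeps(p: Option<bool>) -> bool {
    p == Some(true)
}

// The workaround: `(<expr>) IS NOT FALSE` treats NULL as truthy.
fn is_not_false(p: Option<bool>) -> bool {
    p != Some(false)
}

fn main() {
    // Stats missing: the container must be scanned, but a plain WHERE prunes it.
    assert_eq!(where_clause_keeps(pruning_predicate(None)), false); // silently wrong
    assert_eq!(is_not_false(pruning_predicate(None)), true);        // workaround keeps it
    // Known stats behave the same either way.
    assert_eq!(where_clause_keeps(pruning_predicate(Some(3))), true);
    assert_eq!(where_clause_keeps(pruning_predicate(Some(9))), false);
}
```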

This PR addresses both problems by introducing checks for NULLs in the stats columns in the right places, such that we can now promise that the predicates always return true or false, never NULL.

Since we make no promises about the exact shape of the produced predicate, this should not be a breaking change.

This does not read any extra columns and the null checks should be very cheap, so I do not expect this to have any performance impact on systems evaluating statistics in memory (like DataFusion does internally for parquet row group and page statistics).

@github-actions github-actions bot added the optimizer Optimizer rules label Jan 25, 2025
/// * Rows that evaluate to `NULL` are **NOT** returned (also “filtered out”).
/// Note: *this treatment of `NULL` is **DIFFERENT** than how `NULL` is treated
/// in the rewritten predicate described below.*
/// * No rows should evaluate to `null`, even if statistics are missing (the statistics are themselves `null`).
Contributor Author

I have many more docstrings to update, including examples, but I will hold off on those until I get feedback on the rest of the PR.

 fn stat_column_expr(
     &mut self,
     column: &phys_expr::Column,
     column_expr: &Arc<dyn PhysicalExpr>,
     field: &Field,
     stat_type: StatisticsType,
-) -> Result<Arc<dyn PhysicalExpr>> {
+) -> Result<(Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>)> {
Contributor Author

The idea here is to force callers to do something with the null check expression, or explicitly discard it.
This let me update all call sites confident that I had not missed one, in addition to nicely abstracting away the logic.

@@ -2785,7 +2893,7 @@ mod tests {
vec![lit(1), lit(2), lit(3)],
false,
));
-let expected_expr = "c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 1 AND 1 <= c1_max@1 OR c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 2 AND 2 <= c1_max@1 OR c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 3 AND 3 <= c1_max@1";
+let expected_expr = "(c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 1) AND (c1_max@1 IS NULL OR 1 <= c1_max@1) OR (c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 2) AND (c1_max@1 IS NULL OR 2 <= c1_max@1) OR (c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 3) AND (c1_max@1 IS NULL OR 3 <= c1_max@1)";
Contributor Author

These expressions are long and gross... I wonder if we shouldn't do some combination of formatting the SQL and using insta for snapshotting? Or could we serde-dump the expressions and pretty-print the JSON?
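As a throwaway sketch (not part of the PR) of how readable these strings could become even without insta or serde: break the expression at top-level " OR " boundaries while ignoring ORs nested inside parentheses. The function name `pretty` is hypothetical.

```rust
// Split a predicate string on " OR " only at parenthesis depth 0.
fn pretty(expr: &str) -> String {
    let mut depth = 0u32;
    let mut out = String::new();
    let bytes = expr.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'(' => { depth += 1; out.push('('); i += 1; }
            b')' => { depth = depth.saturating_sub(1); out.push(')'); i += 1; }
            b' ' if depth == 0 && expr[i..].starts_with(" OR ") => {
                out.push_str("\nOR\n");
                i += 4; // skip " OR "
            }
            c => { out.push(c as char); i += 1; }
        }
    }
    out
}

fn main() {
    // ORs inside parentheses stay on one line; top-level ORs become line breaks.
    println!("{}", pretty("(a IS NULL OR a <= 1) AND b <= 2 OR (a IS NULL OR a <= 3) AND b <= 4"));
}
```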

Contributor

Totally agree!

I'm not sure if DataFusion uses insta, but running cargo insta review would definitely make the development experience much easier and quicker when updating tests. Serializing and pretty-printing the expressions would also significantly improve readability.

That said, I know these improvements could become a large PR on their own.

cc @alamb — maybe you can find someone in the community interested in tackling this!

Contributor

insta is cool for sure. Maybe you can write up a ticket describing the need.

@adriangb adriangb force-pushed the null-checks-pruning branch from d68df19 to e0d52d0 Compare January 25, 2025 19:55
@github-actions github-actions bot added the core Core DataFusion crate label Jan 25, 2025
@adriangb adriangb force-pushed the null-checks-pruning branch from e0d52d0 to 4900b24 Compare January 25, 2025 19:56
@adriangb
Contributor Author

@alamb any appetite for us to move forward with this? I must admit that it is something that would greatly help us internally, but I did try to lay out a case above for why it might be beneficial to the wider community.

@adriangb adriangb force-pushed the null-checks-pruning branch from 4900b24 to e3da849 Compare January 25, 2025 20:08
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 25, 2025
@alamb
Contributor

alamb commented Jan 27, 2025

I see this -- it makes sense to me. I am interested in helping move it along, but there are currently a lot of PRs backed up on review. Any chance you want to help get some of them ready? Review the PR and then just @alamb me when it is ready, if you can.

And I am struggling to make the 45 release. I will work to clear out this backlog over the next few days.

[Screenshot: PR review backlog, January 27, 2025]

@adriangb
Contributor Author

Totally understood. No rush on this. I will look through PRs and see if I can help with any.

@alamb
Contributor

alamb commented Jan 30, 2025

This one in particular I think needs some serious thinking to make sure it is correct. Perhaps @appletreeisyellow has some time and interest to help out reviewing.

@appletreeisyellow
Contributor

@adriangb @alamb The PR description makes sense to me. I plan to give it a deeper review either later today or tomorrow

Contributor

@appletreeisyellow appletreeisyellow left a comment


Thank you @adriangb for the nice addition! 💯 Very awesome work!

I carefully reviewed the logic behind adding IS NULL checks, and everything looks solid to me. I also tested it against InfluxDB's internal tests, and they all passed.

The core change is relatively small, with most of the updates coming from docstring improvements and query explain changes. I'm happy to take another look after you update the docstrings -- whether in this PR or a follow-up PR.

cc @alamb

Comment on lines +712 to +718
ColumnarValue::Scalar(ScalarValue::Boolean(None)) => {
// this should catch bugs in tests while keeping
// the "old" behavior for PruningPredicate in production
// that if any nulls are encountered they are treated as truthy
#[cfg(debug_assertions)]
panic!("no predicate should return null!")
}
Contributor

Only panicking in debug builds when null values are detected 👍 You may want to update the docstring on line 702 to reflect that:

    /// # Panics in debug builds
    /// If `value` is not boolean

Contributor

Maybe we can also add a `warn!` log here for release builds

Comment on lines +820 to +826
/// Along with a reference to the min/max/null_count/row_count column
/// an expression for `{stats_col} IS NULL` is returned that should be `OR`ed
/// with any expressions applied to the column itself such that if any value in `{stats_col}`
/// is `null` we return `true` and not `null` because `null` is falsy, but we'd need to scan that
/// container, causing confusion and preventing optimizations because of the inverted logic!
///
/// Thus for the example above we'd arrive at the final expression `foo_max is null or foo_max > 5`.
Contributor

This is helpful!
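The rewrite that docstring describes can be sketched outside DataFusion with `Option<i64>` standing in for the `foo_max` statistics column: `foo_max IS NULL OR foo_max > 5` always yields a concrete boolean, keeping containers with missing stats instead of letting NULL fall through as falsy.

```rust
// Before the rewrite: a NULL stat yields NULL, which is falsy => wrongly pruned.
fn original(foo_max: Option<i64>) -> Option<bool> {
    foo_max.map(|m| m > 5)
}

// After the rewrite: models `foo_max IS NULL OR foo_max > 5`.
fn rewritten(foo_max: Option<i64>) -> bool {
    foo_max.is_none() || foo_max.map(|m| m > 5) == Some(true)
}

fn main() {
    assert_eq!(original(None), None);      // ambiguous: treated as "prune"
    assert_eq!(rewritten(None), true);     // missing stats => must scan
    assert_eq!(rewritten(Some(3)), false); // max <= 5 => safe to prune
    assert_eq!(rewritten(Some(9)), true);  // max > 5 => must scan
}
```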

Comment on lines +1350 to +1355
// If either the null_count or the row_count are null we can't know anything, so return true
let any_null_expr = Arc::new(phys_expr::BinaryExpr::new(
row_count_is_null_expr,
Operator::Or,
null_count_column_is_null_expr,
));
Contributor

Aggressively returning true if either of the stats is not known 👍
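The guarded check assembled above can be sketched as a plain function (the name `not_all_null` is hypothetical): scan (true) when either stat is missing, otherwise prune only when every row in the container is NULL.

```rust
// Models `(null_count IS NULL OR row_count IS NULL) OR null_count != row_count`.
fn not_all_null(null_count: Option<u64>, row_count: Option<u64>) -> bool {
    match (null_count, row_count) {
        (Some(n), Some(r)) => n != r, // the original c1_null_count != c1_row_count
        _ => true,                    // either stat unknown => cannot prune
    }
}

fn main() {
    assert_eq!(not_all_null(None, Some(100)), true);       // missing stat: keep
    assert_eq!(not_all_null(Some(100), Some(100)), false); // all nulls: prune
    assert_eq!(not_all_null(Some(3), Some(100)), true);    // some non-null values: keep
}
```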


@@ -2894,7 +3003,7 @@ mod tests {
#[test]
fn row_group_predicate_cast() -> Result<()> {
let schema = Schema::new(vec![Field::new("c1", DataType::Int32, false)]);
-let expected_expr = "c1_null_count@2 != c1_row_count@3 AND CAST(c1_min@0 AS Int64) <= 1 AND 1 <= CAST(c1_max@1 AS Int64)";
+let expected_expr = "(c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (CAST(c1_min@0 AS Int64) IS NULL OR CAST(c1_min@0 AS Int64) <= 1) AND (CAST(c1_max@1 AS Int64) IS NULL OR 1 <= CAST(c1_max@1 AS Int64))";
Contributor

I verified in datafusion-cli ✅

> SELECT CAST(NULL AS int) IS NULL;
+--------------+
| NULL IS NULL |
+--------------+
| true         |
+--------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

> SELECT CAST('2' AS int) IS NULL;
+-------------------+
| Utf8("2") IS NULL |
+-------------------+
| false             |
+-------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

@@ -2910,7 +3019,7 @@ mod tests {
assert_eq!(predicate_expr.to_string(), expected_expr);

let expected_expr =
-    "c1_null_count@1 != c1_row_count@2 AND TRY_CAST(c1_max@0 AS Int64) > 1";
+    "(c1_null_count@1 IS NULL OR c1_row_count@2 IS NULL OR c1_null_count@1 != c1_row_count@2) AND (TRY_CAST(c1_max@0 AS Int64) IS NULL OR TRY_CAST(c1_max@0 AS Int64) > 1)";
Contributor

I verified in datafusion-cli ✅

> SELECT TRY_CAST(NULL AS int) IS NULL;
+--------------+
| NULL IS NULL |
+--------------+
| true         |
+--------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

> SELECT TRY_CAST(1234.56 AS int) IS NULL;
+--------------------------+
| Float64(1234.56) IS NULL |
+--------------------------+
| false                    |
+--------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

Contributor

@alamb alamb left a comment


Thank you @adriangb and @appletreeisyellow

I am concerned about making the predicates more complicated for the reasons I outlined below

I realize that you can't use (<expr>) IS NOT FALSE when pushing a pruning predicate directly into postgres.

Could you potentially just add it again, with an IS NOT FALSE?

So instead of

select file_id from .... WHERE <pruning_predicate_expr>

Add something like

select file_id from .... WHERE <pruning_predicate_expr> OR (<pruning_predicate_expr> IS NOT NULL)

🤔

@@ -287,11 +285,6 @@ pub trait PruningStatistics {
/// predicate can never possibly be true). The container can be pruned (skipped)
/// entirely.
///
/// While `PruningPredicate` will never return a `NULL` value, the
Contributor

This is a nice simplification for sure 👍

/// the min/max values are not known. *Note that this is different than
/// the SQL filter semantics where `NULL` means the row is filtered
/// out.*
///
Contributor

Can you also please update the examples in the table below to include the new predicates?

/// Here are some examples of the rewritten predicates:
...


@@ -2475,7 +2587,7 @@ mod tests {
fn row_group_predicate_eq() -> Result<()> {
let schema = Schema::new(vec![Field::new("c1", DataType::Int32, false)]);
let expected_expr =
-    "c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 1 AND 1 <= c1_max@1";
+    "(c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 1) AND (c1_max@1 IS NULL OR 1 <= c1_max@1)";
Contributor

I am worried that turning a simple single-column col <op> const predicate like c1_null_count@2 != c1_row_count@3 into a more complex one like c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3 will make it hard for these predicates to be pushed down in some cases.

At the moment the expressions look like this:

expr AND expr AND ...

And each expr is typically a simple single column col <op> constant

After this change the expressions would be like

(expr OR expr OR ...) AND (expr OR expr ...)

I thought about using IS DISTINCT FROM instead but I don't think that yields the correct value in all cases

/// `IS DISTINCT FROM` (see [`distinct`])
///
/// [`distinct`]: arrow::compute::kernels::cmp::distinct
IsDistinctFrom,
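One case where `IS DISTINCT FROM` (null-safe inequality) plausibly gives the wrong answer for pruning can be sketched outside DataFusion: when BOTH stats are missing, `null_count IS DISTINCT FROM row_count` evaluates to false, so the container would be pruned even though nothing is known about it.

```rust
// Models SQL `a IS DISTINCT FROM b`: like !=, but never NULL.
fn is_distinct_from(a: Option<u64>, b: Option<u64>) -> bool {
    match (a, b) {
        (None, None) => false,                     // NULL vs NULL: not distinct
        (None, Some(_)) | (Some(_), None) => true, // NULL vs value: distinct
        (Some(x), Some(y)) => x != y,
    }
}

fn main() {
    assert_eq!(is_distinct_from(None, None), false); // both stats missing: would wrongly prune
    assert_eq!(is_distinct_from(None, Some(10)), true);
    assert_eq!(is_distinct_from(Some(10), Some(10)), false); // all-null container: prune is correct
}
```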


@alamb
Contributor

alamb commented Feb 6, 2025

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

@alamb alamb marked this pull request as draft February 6, 2025 19:43
Labels
core Core DataFusion crate optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)