-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support simple/cross lateral joins #14595
base: main
Are you sure you want to change the base?
Conversation
cc @alamb would you please help take a look as you were engaged on lateral join discussions before? Thanks :) |
1054e96
to
ab3832f
Compare
should pass all sqllogictests :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for starting this work @skyzh 🙏
Do you know of a reference / description somewhere for how to plan LATERAL joins? I am not familar with their semantics nor how we can rewrite them to subqueries.
My one smoke test was to get an explain plan in DuckDB and it produced a plan with 2 joins which seems diffrent than what this PR does 🤔
I also left some code suggestions, which are minor compared to the plan
Basically I don't know what the plans for such queries should look like so I am not sure if this code does the right thing
struct Visitor { | ||
contains: bool, | ||
} | ||
impl<'n> TreeNodeVisitor<'n> for Visitor { | ||
type Node = LogicalPlan; | ||
fn f_down(&mut self, plan: &'n LogicalPlan) -> Result<TreeNodeRecursion> { | ||
if plan.contains_outer_reference() { | ||
self.contains = true; | ||
Ok(TreeNodeRecursion::Stop) | ||
} else { | ||
Ok(TreeNodeRecursion::Continue) | ||
} | ||
} | ||
} | ||
let mut visitor = Visitor { contains: false }; | ||
plan.visit_with_subqueries(&mut visitor).unwrap(); | ||
visitor.contains |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can do this more simply using apply_subqueries
as described here: https://datafusion.apache.org/library-user-guide/query-optimizer.html#recursively-walk-an-expression-tree
Something like (untested):
fn plan_contains_outer_reference(plan: &LogicalPlan) -> bool {
let mut contains = false;
plan.apply_subqueries(|plan| {
if plan.contains_outer_reference() {
self.contains = true;
Ok(TreeNodeRecursion::Stop)
} else {
Ok(TreeNodeRecursion::Continue)
}
}).unwrap();
contains
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not equivalent to the current visitor unfortunately because apply_subqueries
only visits the subqueries 🤣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use apply_with_subqueries()
.
if join.join_type != JoinType::Inner { | ||
return Ok(Transformed::no(plan)); | ||
} | ||
// TODO: this makes the rule to be quadratic to the number of nodes, in theory, we can build this property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you can cause rewrite
to be called in a bottom up fashion by using:
https://docs.rs/datafusion/latest/datafusion/optimizer/trait.OptimizerRule.html#method.apply_order
Then you could maybe add a contains_outer_reference
type flag to DecorrelateLateralJoin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
24a375d I decided to use TreeNodeRecursion::Jump
to skip all children to make it O(n).
statement ok | ||
CREATE TABLE t0(v0 BIGINT, v1 BIGINT); | ||
|
||
statement ok |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Could we please add some negative cases to these tests too such as:
- When there are two subqueries
- When the subqueries have no correlation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only added (2) in c353246, I have some test cases for (1) in the comments but I don't plan to support them right now.
Okay, I just realized that the plan is not correct. I'll need to do more transformation to make it work :( So back to the semantics of the lateral join,
is basically saying that "for every row in t0, run the lateral query", which is equivalent to the following program:
Note that sum will produce 1 row of null even when the right table doesn't contain any matching row. The produced plan with this patch is:
If t0 = (0, 0), (1, 1) and t1 = (0, 0), the output should have 2 rows, while this plan will only produce 1 row. The correct plan should be:
And aggregations like Back to the DuckDB plan. They are using the Hyper subquery unnesting. The idea of Hyper is that it will first eliminate duplicates on the join column. Therefore, in the case where the left table has a lot of duplicated rows, it only needs to compute the subquery once for all such rows (if it turns out to be a nested loop join). So what they are doing is that: eliminate the duplicates from the left table, then compute the subquery result, and finally join it back to the original one, which will be planned like:
which computes the right side row for every value in the left table. We will get (distinct t0.*, sum(t1.v1)) at this step. And then, deduplicate the result of the subquery (this step seems unnecessary?)
And finally, join the result back to the original table,
The result should be the same as the corrected plan I mentioned before. |
Thank you @skyzh -- this is an amazing description. THank you -- hopefully we can include that in the source code eventually once we work through this issue Ideally in my mind when someone found the lateral join support in DataFusion they would also have enough background via comments / links to understand what was going on. |
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
449ae6b
to
c353246
Compare
The current one-pass subquery unnesting implementation in SELECT * FROM t0, LATERAL (SELECT 1), LATERAL (SELECT * FROM t1 WHERE t0.v0 = t1.v0); Just note it down here and we can improve it in the future. We could implement a rule-based unnesting to push down the lateral join operator. |
For multiple subqueries like:
I'm not planning to support it in this patch now as I don't think the current one-pass de-correlate process can handle it correctly without significant refactor :( |
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
I think this patch will not regress existing supported cases as they get fully decorrelated with the scalar query rule and the predicate subquery rule. The patch unlocks a new set of plans produced by lateral joins, though it is possible that we could have missed some cases and generated incorrect plans for those newly-supported one. For queries that can be optimized into a physical plan: |
And ready for review again :) After trying understanding what's happening in If we are going towards making the optimizer able to unnest any subqueries and lateral joins, then we will likely have a meta rule that recursively apply the following rules top-down:
We can either use the Hyper unnesting rules (we implemented it in CMU-DB's optd optimizer) or the SQL server unnesting rules (which we've implemented in risinglight). This meta unnesting rule is more powerful than what we have right now (decorrelate predicate subquery + scalar subquery unnesting rule) and we can eventually replace these two rules with the new meta unnesting rule in the future. |
And FYI, the Hyper paper https://dl.gi.de/server/api/core/bitstreams/10f3ad20-e9ae-4876-abdf-5c8e83e4c595/content, and the SQL server paper https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2000-31.pdf |
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
Which issue does this PR close?
Rationale for this change
Add partial lateral join support.
What changes are included in this PR?
The cross lateral join logical plan is like:
We added a new rule
DecorrelateLateralJoin
to handle this case. We reuse the scalar/predicate subquery handling code for decorrelating lateral joins.Are these changes tested?
New test case in sqllogictest.
Are there any user-facing changes?
Some lateral joins are now supported. Note that we still have some of them in
joins.slt
that cannot be planned, and can be fixed gradually.