Since #565 was merged, reordering join conditions no longer produces an incorrect query plan, but it can still force an unnecessary shuffle (which happens in the referenced integration test). When one side of a join is already partitioned on the join attributes and the other input is not partitioned, the shuffle generated for the unpartitioned input should order its partitioning attributes compatibly with the already-partitioned input. If we allow the order of join conditions to determine the order of partitioning attributes (which we do now), this will generally not be the case, and we will generate an unnecessary shuffle for the already-partitioned input. Since the ordering of partitioning attributes has no user-visible significance, there is no reason to let it be determined by the order of conditions in a query.
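To illustrate why attribute order matters, here is a minimal sketch of an order-sensitive partition function. It is not the actual Myria hash function: `stable_hash`, `ordered_partition`, and the worker count of 8 are hypothetical names and values chosen for illustration. It only shows that hashing the same join key as `(a, b)` versus `(b, a)` generally sends a tuple to different workers, which is what makes the two partitionings incompatible.

```python
import hashlib


def stable_hash(value):
    """Deterministic 64-bit hash of one value (hypothetical helper, not Myria's)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def ordered_partition(values, num_workers):
    """Order-sensitive partitioning: per-attribute hashes are combined positionally."""
    h = 17
    for v in values:
        h = (h * 31 + stable_hash(v)) % (1 << 64)
    return h % num_workers


# Join keys with two distinct attribute values.
rows = [(x, y) for x in range(20) for y in range(20) if x != y]

# Count keys that land on a different worker when the attribute order is swapped.
mismatches = sum(
    1
    for (x, y) in rows
    if ordered_partition((x, y), 8) != ordered_partition((y, x), 8)
)
print(f"{mismatches} of {len(rows)} keys change worker when the order is swapped")
```

Any key counted in `mismatches` would have to be moved by an extra shuffle if the two join inputs were partitioned with different attribute orders.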
The fix for this should be simple to test without needing to extend FakeDB to include physical representation, just by generating the query plan for the second query in uwescience/myria#905 and verifying it does not contain a shuffle above the already-partitioned table's scan.
Perhaps we could entirely avoid the need for ensuring compatible ordering by changing our hashing method to be independent of attribute order. One simple approach would be to separately hash each attribute, then XOR the results. This would be incompatible with existing hash-partitioned relations, of course (if updated code were applied to an existing cluster). This approach seems much simpler than trying to ensure a canonical ordering of attributes (which is impossible in general since different relations may not share the names of corresponding attributes in their join keys).
Actually, XOR is probably not a great choice, since any key in which each value is repeated an even number of times would hash to 0. We only need the combining operation to be commutative, and unlike XOR it should not be self-inverse, so something like the sum of all per-attribute hashes modulo a large prime should work better.
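A rough sketch of the two combiners discussed above, for comparison. This is only an illustration under assumptions, not the hash Myria uses: `stable_hash`, `xor_hash`, `sum_hash`, and the choice of the Mersenne prime 2^61 - 1 as the modulus are all hypothetical. `hashlib` is used instead of Python's built-in `hash()` because the latter is salted per process for strings.

```python
import hashlib

# Assumed modulus: the Mersenne prime 2^61 - 1 (any large prime would do).
PRIME = (1 << 61) - 1


def stable_hash(value):
    """Deterministic 64-bit hash of one value (hypothetical helper)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def xor_hash(values):
    """Order-independent, but equal values cancel each other out."""
    h = 0
    for v in values:
        h ^= stable_hash(v)
    return h


def sum_hash(values):
    """Order-independent: sum of per-attribute hashes modulo a large prime."""
    return sum(stable_hash(v) for v in values) % PRIME


# Both combiners ignore attribute order.
print(xor_hash((1, 2, 3)) == xor_hash((3, 1, 2)))
print(sum_hash((1, 2, 3)) == sum_hash((3, 1, 2)))

# XOR collapses keys whose values each appear an even number of times.
print(xor_hash((5, 5)), xor_hash(("a", "b", "a", "b")))

# The modular sum is not forced to zero by repeated values.
print(sum_hash((5, 5)))
```

With the sum, a key such as `(5, 5)` hashes to twice the hash of `5` modulo the prime, so repeated values no longer cancel. As noted above, either scheme would be incompatible with relations already hash-partitioned under the existing order-dependent function.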