Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate shuffles compatible with existing partitioning order #566

Open
senderista opened this issue Jul 20, 2017 · 3 comments
Open

Generate shuffles compatible with existing partitioning order #566

senderista opened this issue Jul 20, 2017 · 3 comments
Assignees

Comments

@senderista
Copy link
Contributor

Since #565 was merged, reordering join conditions will no longer produce an incorrect query plan, but it could force an unnecessary shuffle (which happens in the referenced integration test). When one side of a join is already partitioned on the join attributes, and the other input is not partitioned, the generated shuffle for the unpartitioned input should ensure its partitioning order is compatible with the already-partitioned input. If we allow the order of join conditions to determine the order of partitioning attributes (which we do now), then this will generally not be the case, and we will generate an unnecessary shuffle for the already-partitioned input. Since the ordering of partitioning attributes has no user-visible significance, there is no reason to let it be determined by the order of conditions in a query.

@senderista
Copy link
Contributor Author

The fix for this should be simple to test without needing to extend FakeDB to include physical representation, just by generating the query plan for the second query in uwescience/myria#905 and verifying it does not contain a shuffle above the already-partitioned table's scan.

@senderista
Copy link
Contributor Author

Perhaps we could entirely avoid the need for ensuring compatible ordering by changing our hashing method to be independent of attribute order. One simple approach would be to separately hash each attribute, then XOR the results. This would be incompatible with existing hash-partitioned relations, of course (if updated code were applied to an existing cluster). This approach seems much simpler than trying to ensure a canonical ordering of attributes (which is impossible in general since different relations may not share the names of corresponding attributes in their join keys).

@senderista
Copy link
Contributor Author

Actually XOR is probably not a great choice since any key with each of its values repeated an even number of times would hash to 0. We only need the hash function to be commutative (not an involution like XOR), so something like sum of all hashes modulo a large prime should work better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant