Filtering file: Parquet and Feather twice as fast #738
Comments
Thanks @lucazanna! Lance is currently optimized for filtering on large blob columns. We have not yet optimized filtering or selecting small columns, as your benchmark shows, and there are a few techniques we can use to speed this up.
Since Parquet uses similar techniques, we expect Lance to eventually reach filter performance similar to Parquet's.
Hi @changhiskhan and @eddyxu, thank you for your reply. You are right about the code: I had changed it without saving. When you talk about large blob columns, I am guessing you mean columns with high cardinality (a large number of possible values)? For example, if I filter latitude and longitude values to those within 100 metres of a given point, should I get better performance from Lance (similar to the attr example)? Noted on the improvements you have planned for filtering. You are building an exciting project, so I am trying to better understand what it can already do and what it will be able to do in the future.
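To make the latitude/longitude filter in this question concrete, here is a minimal pure-Python sketch of a "within 100 metres of a point" predicate using the haversine formula. It does not use Lance, Parquet, or Feather; the names `haversine_m` and `within_radius` and the sample coordinates are hypothetical and only illustrate the kind of filter being asked about.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(points, center, radius_m=100.0):
    """Keep only the (lat, lon) pairs no farther than radius_m from center."""
    clat, clon = center
    return [
        (lat, lon)
        for lat, lon in points
        if haversine_m(lat, lon, clat, clon) <= radius_m
    ]


# Hypothetical sample data: one point roughly 56 m away, one roughly 1.1 km away.
center = (48.8584, 2.2945)
points = [(48.8589, 2.2945), (48.8684, 2.2945)]
nearby = within_radius(points, center)
```

A filter like this touches two small numeric columns for every row, which is the small-column case discussed in the reply above rather than the large-blob case.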
Hey @lucazanna, good questions. By
Keeping this open for us to re-run benchmarks once the partitioning and stats work is complete.
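For readers unfamiliar with how statistics can speed up filtering: one common form, used by Parquet row groups, stores the minimum and maximum value of each chunk so that a filter can skip chunks that cannot contain a match. The sketch below assumes that this is roughly what "stats" refers to here; it is not Lance's implementation, and `Chunk`, `build_chunks`, and `filter_greater_than` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """A block of column values plus its precomputed min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int


def build_chunks(values: List[int], chunk_size: int) -> List[Chunk]:
    """Split a column into fixed-size chunks and record min/max for each."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        part = values[start:start + chunk_size]
        chunks.append(Chunk(part, min(part), max(part)))
    return chunks


def filter_greater_than(chunks: List[Chunk], threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and the number of chunks actually scanned."""
    matches: List[int] = []
    scanned = 0
    for chunk in chunks:
        if chunk.max_value <= threshold:
            continue  # no value in this chunk can match, so skip it entirely
        scanned += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, scanned


# Hypothetical sorted column: 100 values in 10 chunks of 10.
chunks = build_chunks(list(range(100)), chunk_size=10)
matches, scanned = filter_greater_than(chunks, threshold=84)
```

With sorted or clustered data most chunks are skipped without being read, which is why the benefit depends on how the data is partitioned as well as on the statistics themselves.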
I look forward to moving some workloads to Lance, so I ran a test of its filtering performance.
(For retrieving data by row number, Lance is way better than Parquet or Feather.)
When filtering a dataset, I measured this performance:
Is it expected that the other formats are about twice as fast for filtering?
Here is the benchmark code on Google Colab:
https://colab.research.google.com/drive/1iG1MXJV-9hrqm4YcO_MzR0Qt67sMFfEj?usp=sharing
It is benchmark 2 in the notebook.