
Filtering file: Parquet and Feather twice as fast #738

Open
lucazanna opened this issue Mar 28, 2023 · 5 comments

@lucazanna

I'm looking forward to moving some workloads to Lance, so I ran a test of filtering performance.
(For retrieving data by row number, Lance is already much faster than Parquet or Feather.)

When filtering a dataset, I measured the following:

  • Lance avg processing time 33 seconds
  • Feather avg processing time 14 seconds
  • Parquet avg processing time 19 seconds

Is it expected that the other formats are about twice as fast for filtering?

Here is the benchmark code on Google Colab:
https://colab.research.google.com/drive/1iG1MXJV-9hrqm4YcO_MzR0Qt67sMFfEj?usp=sharing

The relevant test is benchmark 2 in the notebook.
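
For reference, the filter timing is essentially of this shape (a minimal sketch, not the exact notebook code; the paths are placeholders and the Lance `to_table` call is my assumption about the pylance API):

```python
import time

import lance
import pyarrow.dataset as pads

def avg_seconds(read_fn, n_runs=5):
    """Average wall-clock seconds for a filtered read over n_runs."""
    start = time.perf_counter()
    for _ in range(n_runs):
        read_fn()
    return (time.perf_counter() - start) / n_runs

# Placeholder paths; the notebook uses the NYC taxi data.
parquet_ds = pads.dataset("trips.parquet", format="parquet")
lance_ds = lance.dataset("trips.lance")
predicate = pads.field("pickup_minute") == 0

print("parquet:", avg_seconds(lambda: parquet_ds.to_table(filter=predicate)))
print("lance:  ", avg_seconds(lambda: lance_ds.to_table(filter=predicate)))
```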

@changhiskhan
Contributor

Thanks Luca. It's very helpful to have concrete numbers here.

We haven't implemented stats-based pruning in Lance yet, which would make the filtering path faster.
Also, what's the result-set size after filtering? If you're looking for, say, 10 rows out of 1M, Lance lets you read much less data. But if you're retrieving 100K out of 1M, that's much closer to a regular scan.
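
A quick way to check the predicate's selectivity, sketched with pyarrow's dataset API (the path and column name are assumptions):

```python
import pyarrow.dataset as pads

# Assumed path/column; adjust to the notebook's actual data.
ds = pads.dataset("trips.parquet", format="parquet")
n_total = ds.count_rows()
n_match = ds.count_rows(filter=pads.field("pickup_minute") == 0)
print(f"{n_match}/{n_total} rows match ({n_match / n_total:.1%} selectivity)")
```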

Lastly, I'm not sure if there's a version mismatch, but when I tried to run the Colab notebook I got an error: [error screenshot]

@eddyxu
Contributor

eddyxu commented Mar 28, 2023

Thanks @lucazanna!

Lance's filtering is currently optimized for large blob columns. For example, with a dataset like <image: binary, attr: int>, a query like SELECT image FROM ... WHERE attr > 10 AND attr < 50 is much faster than Parquet, thanks to the scans it avoids over the blob data.
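
In Python that query looks roughly like this (a sketch; the dataset path is hypothetical and the call assumes pylance's `to_table` API):

```python
import lance
import pyarrow.dataset as pads

ds = lance.dataset("images.lance")  # hypothetical dataset path

# Project only `image` and push the predicate down: the large binary
# column is read only for rows that pass the filter on `attr`.
tbl = ds.to_table(
    columns=["image"],
    filter=(pads.field("attr") > 10) & (pads.field("attr") < 50),
)
```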

We have not optimized filtering/selecting small scalar columns yet, which is what your benchmark exercises. There are a few techniques we can use to speed this up (a sketch of the pruning idea in item 2 follows below):

  1. Better compression and encodings (Support RLE encoding with point query capability #352, plus other compression schemes that don't sacrifice random access)
  2. Row group level filtering (Compute min/max values for columns to support file-level and chunk-level pruning. #11)
  3. Partition pruning ([Rust] Partition Support #458)

Since Parquet uses similar techniques, we expect Lance to eventually reach filter performance similar to Parquet's.
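
To illustrate item 2, here is a minimal sketch of min/max (zone-map) chunk pruning. Everything below is illustrative, not Lance's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class ChunkStats:
    min_val: int  # smallest value of the column within the chunk
    max_val: int  # largest value of the column within the chunk

def chunks_to_scan(stats: list[ChunkStats], lo: int, hi: int) -> list[int]:
    """Indices of chunks whose [min, max] range can satisfy lo < x < hi."""
    return [
        i for i, s in enumerate(stats)
        if s.max_val > lo and s.min_val < hi
    ]

# A filter like `attr > 10 AND attr < 50` only touches chunks that can
# possibly contain matches; the others are skipped without any I/O.
stats = [ChunkStats(0, 9), ChunkStats(10, 49), ChunkStats(50, 99)]
assert chunks_to_scan(stats, lo=10, hi=50) == [1]
```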

@lucazanna
Author

Hi @changhiskhan and @eddyxu ,

Thank you for your reply.

For the code, you are right: I had changed the code without saving it.
It should have been .sink_parquet() instead of .write_parquet(). It's fixed now.
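
For anyone else reading: sink_parquet streams a polars LazyFrame to disk, while write_parquet is the eager DataFrame method. A minimal sketch with a placeholder source file:

```python
import polars as pl

lf = pl.scan_csv("trips.csv")      # placeholder source file
lf.sink_parquet("trips.parquet")   # streaming write from a LazyFrame

# The eager DataFrame equivalent would be:
# pl.read_csv("trips.csv").write_parquet("trips.parquet")
```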

When you talk about large blob columns, I'm guessing you mean columns with high cardinality (a high number of possible values)?
Indeed, in the example the pickup_minute column only has 60 possible values.

So, for example, if I use it on latitude and longitude values (to filter for points within 100 metres of a given location), should I get better performance from Lance? (similar to the attr example)
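
For concreteness, I mean a predicate of this shape (the column names are my assumption from the NYC taxi schema):

```python
import math

import pyarrow.dataset as pads

# Centre point and a ~100 m box around it (equirectangular approximation).
lat0, lon0 = 40.7484, -73.9857
dlat = 100 / 111_320                          # metres per degree of latitude
dlon = 100 / (111_320 * math.cos(math.radians(lat0)))

predicate = (
    (pads.field("pickup_latitude") > lat0 - dlat)
    & (pads.field("pickup_latitude") < lat0 + dlat)
    & (pads.field("pickup_longitude") > lon0 - dlon)
    & (pads.field("pickup_longitude") < lon0 + dlon)
)
```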

Noted on the improvements you have planned for the filtering capabilities. You are building an exciting project, so I'm trying to understand better what it can already do and what it will be able to do in the future.

@eddyxu
Contributor

eddyxu commented Mar 29, 2023

Hey @lucazanna, good questions.

By large blobs, we mean that a single cell of data is likely larger than 1 KB, e.g. an image, lidar points, or a big JSON string.
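
In pyarrow terms, a schema like this (illustrative only):

```python
import pyarrow as pa

# Each `image` cell is a multi-kilobyte binary blob; `attr` is a small scalar.
schema = pa.schema([
    ("image", pa.binary()),
    ("attr", pa.int32()),
])
```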

@changhiskhan
Contributor

Keeping this open for us to re-run benchmarks once the partitioning and stats work is complete
