
Filtering file: Parquet and Feather twice as fast #738

Open
lucazanna opened this issue Mar 28, 2023 · 5 comments

@lucazanna

I'm looking forward to moving some workloads to Lance, so I ran a test of filtering performance.
(For retrieving data by row number, Lance is already much faster than Parquet or Feather.)

When filtering a dataset, I measured the following:

  • Lance avg processing time 33 seconds
  • Feather avg processing time 14 seconds
  • Parquet avg processing time 19 seconds

Is it expected that the other formats are about twice as fast for filtering?

Here is the benchmark code on Google Colab:
https://colab.research.google.com/drive/1iG1MXJV-9hrqm4YcO_MzR0Qt67sMFfEj?usp=sharing

The relevant test is benchmark 2 in the notebook.
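
For reference, the filter timing is essentially of this shape (a minimal sketch, not the exact notebook code; the paths are placeholders and the Lance `to_table` call is my assumption about the pylance API):

```python
import time

import lance
import pyarrow.dataset as pads

def avg_seconds(read_fn, n_runs=5):
    """Average wall-clock seconds for a filtered read over n_runs."""
    start = time.perf_counter()
    for _ in range(n_runs):
        read_fn()
    return (time.perf_counter() - start) / n_runs

# Placeholder paths; the notebook uses the NYC taxi data.
parquet_ds = pads.dataset("trips.parquet", format="parquet")
lance_ds = lance.dataset("trips.lance")
predicate = pads.field("pickup_minute") == 0

print("parquet:", avg_seconds(lambda: parquet_ds.to_table(filter=predicate)))
print("lance:  ", avg_seconds(lambda: lance_ds.to_table(filter=predicate)))
```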

@changhiskhan
Contributor

Thanks Luca. It's very helpful to have concrete numbers here.

We haven't implemented stats-based pruning in Lance yet, which would make the filtering path faster.
Also, what's the result-set size after filtering? If you're looking for, say, 10 rows out of 1M, Lance lets you read much less data. But if you're retrieving 100K out of 1M, that's much closer to a regular scan.
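
A quick way to check the predicate's selectivity, sketched with pyarrow's dataset API (the path and column name are assumptions):

```python
import pyarrow.dataset as pads

# Assumed path/column; adjust to the notebook's actual data.
ds = pads.dataset("trips.parquet", format="parquet")
n_total = ds.count_rows()
n_match = ds.count_rows(filter=pads.field("pickup_minute") == 0)
print(f"{n_match}/{n_total} rows match ({n_match / n_total:.1%} selectivity)")
```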

Lastly, I'm not sure if there's a version mismatch, but when I tried to run the Colab notebook I got an error: [error screenshot]

@eddyxu
Contributor

eddyxu commented Mar 28, 2023

Thanks @lucazanna!

Lance's filtering is currently optimized for large blob columns. For example, with a dataset like <image: binary, attr: int>, a query like SELECT image FROM ... WHERE attr > 10 AND attr < 50 is much faster than Parquet, thanks to the scans it avoids over the blob data.
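
In Python that query looks roughly like this (a sketch; the dataset path is hypothetical and the call assumes pylance's `to_table` API):

```python
import lance
import pyarrow.dataset as pads

ds = lance.dataset("images.lance")  # hypothetical dataset path

# Project only `image` and push the predicate down: the large binary
# column is read only for rows that pass the filter on `attr`.
tbl = ds.to_table(
    columns=["image"],
    filter=(pads.field("attr") > 10) & (pads.field("attr") < 50),
)
```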

We have not optimized filtering/selecting small scalar columns yet, which is what your benchmark exercises. There are a few techniques we can use to speed this up (a sketch of the pruning idea in item 2 follows below):

  1. Better compression and encodings (Support RLE encoding with point query capability #352, plus other compression schemes that don't sacrifice random access)
  2. Row group level filtering (Compute min/max values for columns to support file-level and chunk-level pruning. #11)
  3. Partition pruning ([Rust] Partition Support #458)

Since Parquet uses similar techniques, we expect Lance to eventually reach filter performance similar to Parquet's.
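
To illustrate item 2, here is a minimal sketch of min/max (zone-map) chunk pruning. Everything below is illustrative, not Lance's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class ChunkStats:
    min_val: int  # smallest value of the column within the chunk
    max_val: int  # largest value of the column within the chunk

def chunks_to_scan(stats: list[ChunkStats], lo: int, hi: int) -> list[int]:
    """Indices of chunks whose [min, max] range can satisfy lo < x < hi."""
    return [
        i for i, s in enumerate(stats)
        if s.max_val > lo and s.min_val < hi
    ]

# A filter like `attr > 10 AND attr < 50` only touches chunks that can
# possibly contain matches; the others are skipped without any I/O.
stats = [ChunkStats(0, 9), ChunkStats(10, 49), ChunkStats(50, 99)]
assert chunks_to_scan(stats, lo=10, hi=50) == [1]
```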

@lucazanna
Author

Hi @changhiskhan and @eddyxu ,

Thank you for your reply.

For the code, you are right: I had changed the code without saving it.
It should have been .sink_parquet() instead of .write_parquet(). It's fixed now.
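
For anyone else reading: sink_parquet streams a polars LazyFrame to disk, while write_parquet is the eager DataFrame method. A minimal sketch with a placeholder source file:

```python
import polars as pl

lf = pl.scan_csv("trips.csv")      # placeholder source file
lf.sink_parquet("trips.parquet")   # streaming write from a LazyFrame

# The eager DataFrame equivalent would be:
# pl.read_csv("trips.csv").write_parquet("trips.parquet")
```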

When you talk about large blob columns, I'm guessing you mean columns with high cardinality (a high number of possible values)?
Indeed, in the example the pickup_minute column only has 60 possible values.

So, for example, if I use it on latitude and longitude values (to filter for points within 100 metres of a given location), should I get better performance from Lance? (similar to the attr example)
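
For concreteness, I mean a predicate of this shape (the column names are my assumption from the NYC taxi schema):

```python
import math

import pyarrow.dataset as pads

# Centre point and a ~100 m box around it (equirectangular approximation).
lat0, lon0 = 40.7484, -73.9857
dlat = 100 / 111_320                          # metres per degree of latitude
dlon = 100 / (111_320 * math.cos(math.radians(lat0)))

predicate = (
    (pads.field("pickup_latitude") > lat0 - dlat)
    & (pads.field("pickup_latitude") < lat0 + dlat)
    & (pads.field("pickup_longitude") > lon0 - dlon)
    & (pads.field("pickup_longitude") < lon0 + dlon)
)
```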

Noted on the improvements you have planned for the filtering capabilities. You are building an exciting project, so I'm trying to understand better what it can already do and what it will be able to do in the future.

@eddyxu
Contributor

eddyxu commented Mar 29, 2023

Hey @lucazanna, good questions.

By large blobs, we mean that a single cell of data is likely larger than 1 KB, e.g. an image, lidar points, or a big JSON string.
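
In pyarrow terms, a schema like this (illustrative only):

```python
import pyarrow as pa

# Each `image` cell is a multi-kilobyte binary blob; `attr` is a small scalar.
schema = pa.schema([
    ("image", pa.binary()),
    ("attr", pa.int32()),
])
```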

@changhiskhan
Contributor

Keeping this open for us to re-run benchmarks once the partitioning and stats work is complete
