Problem when using optimized Reader #396

tom-slb · 2022-10-31T14:21:53Z

Hi, I'm trying to optimize reading from Parquet files as outlined here: https://github.com/segmentio/parquet-go#optimizing-reads
I am using a schema-less reading approach, so there are no data classes or schemas defined in my application. I am trying to read columns (pages) of data in the form of Go slices, e.g. []float64.

My problem is this: In my parquet file, all columns are defined as optional, so I'm getting *parquet.optionalPageValues when reading a page. This does not implement DoubleReader and there does not seem to be any way to get the underlying "base" ValueReader (which supposedly is a DoubleReader). So at present, it is not possible to use optimized reads into Go slices directly.

Are there any overrides that could be used, for example to ignore that the columns are optional in parquet?

Many thanks.

achille-roussel · 2022-10-31T16:43:42Z

Hello @tom-slb, thanks for opening the issue.

This is a known limitation of the current APIs: the optimized read path only works for required columns, not for optional or repeated ones.

We have plans to revisit these APIs but nothing has yet been implemented.
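For context on why optional columns differ: in the Parquet format, an optional column stores only its non-null values, plus a definition level per row that says whether the row holds a value. A dense `[]float64` therefore cannot be mapped onto rows without also consulting the levels. The following is a minimal stdlib-only sketch of that expansion, not parquet-go's implementation; `expandOptional` and its parameters are hypothetical names, and it assumes a flat optional column whose maximum definition level is 1.

```go
package main

import (
	"errors"
	"fmt"
)

// expandOptional is a hypothetical helper (not part of parquet-go).
// It spreads densely packed non-null values over all rows of a column,
// using one definition level per row. A row is non-null only when its
// level equals maxDefLevel. Null rows get the zero value and valid=false.
func expandOptional(defLevels []byte, maxDefLevel byte, dense []float64) ([]float64, []bool, error) {
	values := make([]float64, len(defLevels))
	valid := make([]bool, len(defLevels))
	next := 0 // index of the next unused entry in dense
	for i, level := range defLevels {
		if level != maxDefLevel {
			continue // null row: nothing was stored for it
		}
		if next >= len(dense) {
			return nil, nil, errors.New("fewer dense values than non-null rows")
		}
		values[i] = dense[next]
		valid[i] = true
		next++
	}
	if next != len(dense) {
		return nil, nil, errors.New("more dense values than non-null rows")
	}
	return values, valid, nil
}

func main() {
	// Four rows, the second one is null; only three values are stored.
	values, valid, err := expandOptional([]byte{1, 0, 1, 1}, 1, []float64{1.5, 2.5, 3.5})
	if err != nil {
		panic(err)
	}
	fmt.Println(values, valid)
}
```

Returning a separate validity mask rather than `[]*float64` keeps the values in one contiguous slice, which is usually the point of reading columns into plain Go slices.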

hhoughgg · 2022-10-31T21:52:40Z

Is the same true for optimizing writes? It seems that it's not possible to write optional fields even with parquet.Value, since the underlying type of the ColumnBuffers for an int64 is []int64 rather than []*int64.
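To illustrate the mismatch on the write side: an optional int64 column expressed as `[]*int64` has to be split into per-row definition levels and a dense `[]int64` of the non-null values before it matches how Parquet stores it. Below is a rough stdlib-only sketch of that split, assuming a flat optional column with maximum definition level 1; `splitOptional` is a hypothetical name and not a parquet-go function.

```go
package main

import "fmt"

// splitOptional is a hypothetical helper (not part of parquet-go).
// It turns nullable rows into one definition level per row
// (1 = value present, 0 = null) and a dense slice of the present values.
func splitOptional(rows []*int64) (defLevels []byte, dense []int64) {
	defLevels = make([]byte, len(rows))
	dense = make([]int64, 0, len(rows))
	for i, p := range rows {
		if p == nil {
			continue // level stays 0, no value is stored
		}
		defLevels[i] = 1
		dense = append(dense, *p)
	}
	return defLevels, dense
}

func main() {
	a, b := int64(7), int64(9)
	levels, dense := splitOptional([]*int64{&a, nil, &b})
	fmt.Println(levels, dense)
}
```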

achille-roussel self-assigned this Oct 31, 2022

achille-roussel added the "question" label Oct 31, 2022
