Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DictionaryEncodedColumnPartSerde: Read values using smooshReader. #17688

Closed
wants to merge 1 commit into from

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Jan 30, 2025

The write side uses a FileSmoosher for valueWriter, which means that the value component may be split over multiple files. In this case, the read side needs a SmooshedFileMapper.

It is plausible for multi-valued columns to have value sections in excess of 2 GB: imagine a segment with 10 million rows and an average of 50 values per row. The value section would have 500 million ints, which is 2 GB.

I believe this is the cause of #7943.

The write side uses a FileSmoosher for valueWriter, which means that the
value component may be split over multiple files. In this case, the
read side needs a SmooshedFileMapper.

It is plausible for multi-valued columns to have value sections in excess
of 2 GB: imagine a segment with 10 million rows and an average of 50 values
per row. The value section would have 500 million ints, which is 2 GB.
@gianm
Copy link
Contributor Author

gianm commented Jan 31, 2025

Marking draft since the patch should have a test.

@gianm gianm marked this pull request as draft January 31, 2025 00:00
@gianm
Copy link
Contributor Author

gianm commented Jan 31, 2025

Btw, I think the same basic problem exists in BlockLayoutColumnarDoublesSupplier. The corresponding writer can write multi-file, but the reader can only read single-file. Probably exists other places too. We should scan for all of them and come up with a way to test them all.

@gianm
Copy link
Contributor Author

gianm commented Jan 31, 2025

Closing in favor of #17691.

@gianm gianm closed this Jan 31, 2025
@gianm gianm deleted the fix-large-value-sections branch January 31, 2025 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant