Question: Too much memory to write parquet files #412

Open
ycyang-26 opened this issue Nov 22, 2022 · 5 comments
@ycyang-26

ycyang-26 commented Nov 22, 2022

When I write a new parquet file (about 100 MB of data before compression), roughly 1 GB of memory is allocated. I'm wondering why it takes so much memory.
I write the file with the following call, and the struct of the data is as below:

err := parquet.Write[*model.x](data, rawParquetData, compressionType)

// Field and column names are anonymized; types and encodings are as in the original.
type x struct {
	Field1  int      `parquet:"field1,delta"`
	Field2  string   `parquet:"field2,dict"`
	Field3  string   `parquet:"field3,dict"`
	Field4  string   `parquet:"field4"`
	Field5  int      `parquet:"field5"`
	Field6  int      `parquet:"field6"`
	Field7  string   `parquet:"field7,dict"`
	Field8  string   `parquet:"field8,dict"`
	Field9  string   `parquet:"field9"`
	Field10 int      `parquet:"field10,dict"`
	Field11 []string `parquet:"field11,list"`
	Field12 string   `parquet:"field12,dict"`
	Field13 []string `parquet:"field13,list"`
}
@kevinburkesegment
Contributor

Are you familiar with using pprof to profile a Go program? The first thing we would do is try to reproduce the results you're seeing and then analyze the quantity and size of memory allocations, but given you can reproduce it consistently, it may be easier for you to generate a profile. https://pkg.go.dev/runtime/pprof
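
If it helps, a minimal sketch of writing a heap profile around the parquet write looks roughly like this (the file name and placement are just examples):

package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// ... run the code that calls parquet.Write here ...

	f, err := os.Create("heap.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	runtime.GC() // get up-to-date heap statistics before writing the profile
	if err := pprof.WriteHeapProfile(f); err != nil {
		panic(err)
	}
	// Inspect allocations with: go tool pprof -alloc_space heap.pprof
}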

@ycyang-26
Author

ycyang-26 commented Nov 22, 2022

Actually, I have already used pprof to analyze the program. The alloc_space profile of parquet.Write is shown below; the node highlighted in red accounts for about 700 MB of allocated space.

[screenshot: pprof alloc_space graph for parquet.Write]

@vbmithr

vbmithr commented Nov 22, 2022

I had similar issues and was never able to determine where the memory went, even with those traces. I never managed to fix them, either. See #118.

Playing with the GOGC environment variable did seem to help a bit, which may indicate that the problem lies in how the Go runtime handles garbage collection rather than in this library; I'm definitely not an expert in this. But in the end, the amount of RAM needed was on the order of the combined (uncompressed!) size of all the data going into the file, whereas I thought you could theoretically write a parquet file using very little memory by flushing often.
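
For illustration, the in-program equivalent of what I tried with GOGC looks roughly like this (the value 50 is arbitrary):

package main

import "runtime/debug"

func main() {
	// Roughly equivalent to launching the program with GOGC=50 (the default is 100):
	// the collector runs more often, trading CPU time for a smaller peak heap.
	debug.SetGCPercent(50)

	// ... write the parquet file here ...
}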

@kevinburkesegment
Contributor

Thank you, that's really helpful.

@achille-roussel

I believe the issue may come from using append, which stops exponentially increasing the slice capacity for large slices (around 1MiB if I remember correctly). This results in reallocating memory buffers that grow very slowly, greatly increasing the memory footprint.

We could try modifying the plain.AppendByteArrayString method to always grow the slice capacity manually by 2x, which would amortize the reallocations better.
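
To illustrate the idea, here is a rough standalone sketch of 2x growth (not the actual plain.AppendByteArrayString implementation):

// appendGrow2x appends value to buf, but when the backing array has to grow it
// at least doubles the capacity, so repeated appends stay amortized O(1)
// instead of reallocating in small increments once the slice gets large.
func appendGrow2x(buf, value []byte) []byte {
	need := len(buf) + len(value)
	if need > cap(buf) {
		newCap := 2 * cap(buf)
		if newCap < need {
			newCap = need
		}
		grown := make([]byte, len(buf), newCap)
		copy(grown, buf)
		buf = grown
	}
	return append(buf, value...)
}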

I'm also curious whether you are calling parquet.Write repeatedly in your application (e.g. to produce multiple parquet files). If that's the case, you might be able to gain much greater memory efficiency by reusing a parquet.GenericWriter instead.
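
For example, a rough sketch of what that reuse could look like (assuming the usual imports and your model package; the paths and batching are illustrative):

// writeFiles reuses one GenericWriter (and its internal buffers) across many
// output files instead of calling parquet.Write once per file.
func writeFiles(batches map[string][]*model.x, options ...parquet.WriterOption) error {
	writer := parquet.NewGenericWriter[*model.x](io.Discard, options...)

	for path, rows := range batches {
		f, err := os.Create(path)
		if err != nil {
			return err
		}

		writer.Reset(f) // point the existing writer at the new file

		if _, err := writer.Write(rows); err != nil {
			f.Close()
			return err
		}
		if err := writer.Close(); err != nil { // flushes row groups and writes the footer
			f.Close()
			return err
		}
		if err := f.Close(); err != nil {
			return err
		}
	}
	return nil
}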
