This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Write Structs to ColumnBuffer #385

Open
hhoughgg opened this issue Oct 24, 2022 · 2 comments
Assignees
Labels
question Further information is requested

Comments

@hhoughgg
Contributor

hhoughgg commented Oct 24, 2022

type ParquetTestSchemaObj struct {
	Field1      string
	Field2      int
	Nested      Nested
	NestedSlice []Nested
}
type Nested struct {
	Name string
	Age  int
}

nestedCol := []Nested{{"jimmy", 50}, {"billy", 10}, {"bobby", 99}, {"tommy", 78}}
strCol := []string{"a", "b", "c", "d"}
intCol := []int{1, 2, 3, 4}

strColByteArray := make([]byte, 8)

ps := parquet.SchemaOf(ParquetTestSchemaObj{})
b := parquet.NewBuffer(ps)
pCols := b.ColumnBuffers()

if _, err := pCols[0].(parquet.ByteArrayWriter).WriteByteArrays(ConvertSliceStringToParquetByteArray(strCol)); err != nil {
	return err
}

I am trying to write parquet files column by column; previously I was writing row by row using structs. For columns that are nested structs, I saw that the row writer runs a deconstruct function to produce []parquet.Value for building the nested rows. These functions all look to be private, unless I am missing something? Right now my data for such a column looks like []Struct{} or [][]Struct.

Does it make sense to make some of these public? In my use case I would have to convert to structs manually anyway, since the data already has that shape. It would be easier to just look for columns that are structs, convert them to [][]parquet.Value or something similar, and write each one. Obviously the performance benefit is lost, but that seems OK when it's only 1 column of, say, 20. Hopefully I am not misunderstanding how this works!

@achille-roussel

Hello @hhoughgg, thanks for starting this conversation!

Would this code snippet be helpful to highlight how to write struct values to a parquet buffer?

rows := []ParquetTestSchemaObj{
  ...
}

buffer := parquet.NewGenericBuffer[ParquetTestSchemaObj]()
buffer.Write(rows)

@achille-roussel achille-roussel self-assigned this Oct 25, 2022
@achille-roussel achille-roussel added the question Further information is requested label Oct 25, 2022
@hhoughgg
Contributor Author

hhoughgg commented Oct 25, 2022

Edit: Are you suggesting using multiple buffers? Perhaps I could just pull the columns out of each one and concatenate them into a final buffer? The struct type is built with reflection, so in my case I won't have the actual type needed for generics.

Ah sorry, I think my explanation was poor. I have some structs I want to write as a parquet group type (multiple columns), along with other columns that are plain types such as int64. Some of my custom types can be written directly, and others, such as MyCustomColumn3, cannot.

The columns below would all end up in one parquet file:

MyCustomColumn1 []int64
MyCustomColumn2 []string
MyCustomColumn3 []struct{
	A string
	B int
}

After SchemaOf, the column buffer would have 4 columns, but I only have three source columns in this case. Is there some existing function to convert MyCustomColumn3 into the two columns I need to write? My assumption was that the row/struct writer has some deconstruction logic that converts []struct into multiple columns.
