This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Write Structs to ColumnBuffer #385

Open
hhoughgg opened this issue Oct 24, 2022 · 2 comments
Assignees
Labels
question Further information is requested

Comments

@hhoughgg
Contributor

hhoughgg commented Oct 24, 2022

type ParquetTestSchemaObj struct {
	Field1      string
	Field2      int
	Nested      Nested
	NestedSlice []Nested
}
type Nested struct {
	Name string
	Age  int
}

nestedCol := []Nested{{"jimmy", 50}, {"billy", 10}, {"bobby", 99}, {"tommy", 78}}
strCol := []string{"a", "b", "c", "d"}
intCol := []int{1, 2, 3, 4}

strColByteArray := make([]byte, 8)

ps := parquet.SchemaOf(ParquetTestSchemaObj{})
b := parquet.NewBuffer(ps)
pCols := b.ColumnBuffers()

if _, err := pCols[0].(parquet.ByteArrayWriter).WriteByteArrays(ConvertSliceStringToParquetByteArray(strCol)); err != nil {
	return err
}

I am trying to write parquet files column by column; previously I was writing row by row using structs. For columns that are nested structs, I saw that the row writer runs a deconstruct function to produce []parquet.Value for building the nested rows. These functions all look to be private, unless I am missing something? Right now my data for such a column looks like []Struct{} or [][]Struct.

Does it make sense to make some of these public? In my use case I would have to convert to structs manually anyway, since the data already has that shape. It would be easier to just look for columns that are structs, convert them to [][]parquet.Value or something similar, and write each one. Obviously the performance benefit is lost, but that seems OK when it's only 1 column of, say, 20. Hopefully I am not misunderstanding how this works!

@achille-roussel

Hello @hhoughgg, thanks for starting this conversation!

Would this code snippet be helpful to highlight how to write struct values to a parquet buffer?

rows := []ParquetTestSchemaObj{
  ...
}

buffer := parquet.NewGenericBuffer[ParquetTestSchemaObj]()
buffer.Write(rows)

@achille-roussel achille-roussel self-assigned this Oct 25, 2022
@achille-roussel achille-roussel added the question Further information is requested label Oct 25, 2022
@hhoughgg
Contributor Author

hhoughgg commented Oct 25, 2022

Edit: Are you suggesting using multiple buffers? Perhaps I could just pull the columns out of each one and concatenate them into a final buffer? The struct type is built with reflection, so in my case I won't have the actual type needed for generics.

Ah sorry, I think my explanation was poor. I have some structs I want to write as a parquet group type (multiple columns), along with other columns that are plain types such as int64. Some of my custom types can be written directly, and others, such as MyCustomColumn3, cannot.

The columns below would all end up in one parquet file:

MyCustomColumn1 []int64
MyCustomColumn2 []string
MyCustomColumn3 []struct{
	A string
	B int
}

After SchemaOf, the column buffer would have 4 columns, but I only have three source columns in this case. Is there some existing function to convert MyCustomColumn3 into the two columns I need to write? My assumption was that the row/struct writer has some deconstruction logic that converts []struct into multiple columns.
