type ParquetTestSchemaObj struct {
	Field1      string
	Field2      int
	Nested      Nested
	NestedSlice []Nested
}

type Nested struct {
	Name string
	Age  int
}

nestedCol := []Nested{{"jimmy", 50}, {"billy", 10}, {"bobby", 99}, {"tommy", 78}}
strCol := []string{"a", "b", "c", "d"}
intCol := []int{1, 2, 3, 4}
strColByteArray := make([]byte, 8) // scratch buffer, unused below

ps := parquet.SchemaOf(ParquetTestSchemaObj{})
b := parquet.NewBuffer(ps)
pCols := b.ColumnBuffers()

// ConvertSliceStringToParquetByteArray is my own helper, not part of parquet-go.
if _, err := pCols[0].(parquet.ByteArrayWriter).WriteByteArrays(ConvertSliceStringToParquetByteArray(strCol)); err != nil {
	return err
}
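For what it's worth, I would expect to write the int column the same way. This is only a sketch: I am assuming Go int maps to INT64, that a parquet.Int64Writer interface exists alongside parquet.ByteArrayWriter, and that Field2 ends up as the second leaf column (the same ordering assumption pCols[0] already makes above):

// Sketch (assumptions noted above): widen the Go ints to int64 and write
// them through the typed Int64Writer interface on the second column buffer.
int64Col := make([]int64, len(intCol))
for i, v := range intCol {
	int64Col[i] = int64(v)
}
if _, err := pCols[1].(parquet.Int64Writer).WriteInt64s(int64Col); err != nil {
	return err
}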
I am trying to write parquet files column by column; previously I was writing row by row using structs. For rows containing nested structs, I saw that the row write runs a deconstruct function to get []parquet.Value and build the nested rows. Those functions all appear to be private, unless I am missing something? Right now my data looks like []Struct{} or [][]Struct per column.
Would it make sense to make some of these public? In my use case I will have to convert to structs manually anyway, since the data already has that shape. It would be easier to just detect when a column is a struct, convert it to [][]parquet.Value or something similar, and write each one. Obviously the performance benefit is lost, but that seems acceptable when it is only 1 column of, say, 20. Hopefully I am not misunderstanding how this works!
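For a single nested column, the closest thing I can see to doing this by hand is a sketch like the following, assuming Schema.Deconstruct is exported the way the docs suggest and that Value.Column() reports the leaf-column index:

// Sketch: flatten each Nested struct into a parquet.Row ([]parquet.Value),
// one value per leaf column of the Nested schema.
nestedSchema := parquet.SchemaOf(Nested{})
for _, n := range nestedCol {
	row := nestedSchema.Deconstruct(nil, n)
	for _, v := range row {
		fmt.Printf("leaf %d: %v\n", v.Column(), v)
	}
}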
Edit: are you suggesting using multiple buffers? Perhaps I could just pull the columns out of each one and concatenate them into a final buffer? The struct type is built with reflection, so in my case I won't have the actual type for generics.
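If it is multiple buffers, I could imagine stitching them together row-wise with something like the sketch below, assuming parquet.CopyRows and (*Buffer).Rows() work the way I read them and that a Buffer accepts rows as a RowWriter (buffers and final are hypothetical names here):

// Sketch: concatenate the rows of several intermediate buffers into one
// final buffer sharing the same schema.
final := parquet.NewBuffer(ps)
for _, buf := range buffers {
	if _, err := parquet.CopyRows(final, buf.Rows()); err != nil {
		return err
	}
}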
Ah, sorry, I think my explanation was poor. I have some structs I want to write as a parquet group type (multiple columns), along with other columns that are plain values such as int64. Some of my custom types can be written directly; others, such as MyCustomColumn3, cannot.
The columns below would all end up in one parquet file:
MyCustomColumn1 []int64
MyCustomColumn2 []string
MyCustomColumn3 []struct {
	A string
	B int
}
After SchemaOf I would have a parquet column buffer with four columns, but I only have three Go columns in this case. Is there an existing function to convert MyCustomColumn3 into the two leaf columns I need to write? My assumption was that the row/struct writer has some deconstruction logic that converts []struct into multiple columns, as in the sketch below.
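The closest workaround I can picture is deconstructing each element myself and routing the values by leaf index. A sketch, under the same Schema.Deconstruct and Value.Column() assumptions as above; the element type, myCustomColumn3 slice, and colOffset (the index of the struct's first leaf column within the full schema) are all hypothetical names:

// Sketch: split each struct element of MyCustomColumn3 into its two leaf
// columns and append the values to the matching column buffers. Assumes
// column buffers can be written through the parquet.ValueWriter interface.
type myCustomColumn3Elem struct {
	A string
	B int
}
structSchema := parquet.SchemaOf(myCustomColumn3Elem{})
for _, s := range myCustomColumn3 {
	row := structSchema.Deconstruct(nil, s)
	for _, v := range row {
		dst := pCols[colOffset+v.Column()].(parquet.ValueWriter)
		if _, err := dst.WriteValues([]parquet.Value{v}); err != nil {
			return err
		}
	}
}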