Add more examples for reading parquet files #487

cmackenzie1 · 2023-03-17T04:42:28Z

I am looking for more guidance on how to read parquet files, especially dynamic parquet files that have no corresponding Go struct to read the data into. I looked through the unit tests for examples, but could not find many that access values without first reading the data into a Go struct.

So far I've got the following:

reader := parquet.NewGenericReader[any](bytes.NewReader(data))
defer reader.Close()

schema := reader.Schema()

rows := make([]parquet.Row, reader.NumRows())
n, err := reader.ReadRows(rows)
if err != nil && !errors.Is(err, io.EOF) {
    return err
}

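idCol, _ := schema.Lookup("id")     // known columns can be looked up by path
abCol, _ := schema.Lookup("a", "b") // and nested columns like this

for _, row := range rows[:n] {
    for _, value := range row { // a parquet.Row is a []parquet.Value
        switch value.Column() {
        case idCol.ColumnIndex:
            // value belongs to column "id"
        case abCol.ColumnIndex:
            // value belongs to column "a.b"
        }
    }
}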

Is that the best way to read them? Are there already existing methods to determine if one column path is a subset of another?
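
One reading of "subset" in the second question is a prefix relationship: the path `a` covers the nested path `a.b`. The sketch below assumes that interpretation and assumes column paths are plain `[]string` values. The helper name `isPathPrefix` is hypothetical and not part of any parquet library.

```go
package main

import "fmt"

// isPathPrefix reports whether prefix is a leading sub-path of path,
// e.g. ["a"] is a prefix of ["a", "b"]. Hypothetical helper, not a
// library function.
func isPathPrefix(prefix, path []string) bool {
	if len(prefix) > len(path) {
		return false
	}
	for i, name := range prefix {
		if path[i] != name {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isPathPrefix([]string{"a"}, []string{"a", "b"}))      // true
	fmt.Println(isPathPrefix([]string{"a", "b"}, []string{"a"}))      // false
	fmt.Println(isPathPrefix([]string{"a", "c"}, []string{"a", "b"})) // false
}
```

Under this definition an empty prefix matches every path, which may or may not be what a caller wants.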

ljluestc · 2023-11-15T04:18:48Z

package main

import (
    "fmt"

    "github.com/xitongsys/parquet-go-source/buffer"
    "github.com/xitongsys/parquet-go/reader"
)

// MySchema is a custom struct that must match the structure of your data.
type MySchema struct {
    ID int32 `parquet:"name=id, type=INT32"`
    A  struct {
        B int32 `parquet:"name=b, type=INT32"`
    } `parquet:"name=a, repetitiontype=REQUIRED"`
}

// readRows decodes a Parquet file held in a byte slice into MySchema values.
func readRows(data []byte) ([]MySchema, error) {
    // Wrap the byte slice in an in-memory ParquetFile
    // (assumes a parquet-go-source version that provides NewBufferFileFromBytes).
    file := buffer.NewBufferFileFromBytes(data)
    defer file.Close()

    // Create a Parquet reader that decodes with 4 goroutines.
    pReader, err := reader.NewParquetReader(file, new(MySchema), 4)
    if err != nil {
        return nil, fmt.Errorf("creating Parquet reader: %w", err)
    }
    defer pReader.ReadStop()

    // Read fills the slice it is given, so size it to the row count first.
    rows := make([]MySchema, pReader.GetNumRows())
    if err := pReader.Read(&rows); err != nil {
        return nil, fmt.Errorf("reading rows: %w", err)
    }
    return rows, nil
}

func main() {
    var data []byte // assumed to hold the contents of a Parquet file

    rows, err := readRows(data)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    // Now, you can access the data
    for _, row := range rows {
        fmt.Println("ID:", row.ID)
        fmt.Println("A.B:", row.A.B)
        // Access other fields as needed
    }
}
