@gsanon As the code also shows, union types were never supported. In 0.x we were likewise using only the first type of the union, but there was no validation check, which is what caused the issue.
I agree it can be supported. Feel free to contribute if you are interested. Thanks.
Describe the problem you faced
Currently it is not possible to write Hudi data if the source DataFrame contains a field with a union type.
Let's say you have an Avro schema with a union type field:
If you create a Parquet file from this schema, the field is transformed into a struct with one memberX field for each type of the union:
field<struct<member0:int, member1:string, member2:boolean>>
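To make the mapping concrete, here is a minimal Python sketch of how a union's branches line up with the memberX struct fields. The Avro schema below is hypothetical: its union members are inferred from the struct shown above and are not taken from the original schema, and `union_to_struct` is an illustrative helper, not a Parquet, Avro, or Hudi API.

```python
# Hypothetical Avro schema: the union members are inferred from the struct
# shown in the issue, not copied from the reporter's actual schema.
avro_schema = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "field", "type": ["int", "string", "boolean"]},
    ],
}


def union_to_struct(union_types):
    """Render a union as struct<member0:..., member1:..., ...>.

    Illustrative helper only: one memberX entry per union branch,
    numbered by the branch's position in the union.
    """
    members = [f"member{i}:{t}" for i, t in enumerate(union_types)]
    return "struct<" + ", ".join(members) + ">"


field = avro_schema["fields"][0]
struct_repr = union_to_struct(field["type"])
print(struct_repr)  # struct<member0:int, member1:string, member2:boolean>
```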
Then, when writing the data with Hudi 0.x, the process fails because Hudi takes only the first type of the union and then tries to write the struct into that type. In our case you get something like:
java.lang.IllegalArgumentException: StructType(StructField(member0,IntegerType,true),StructField(member1,StringType,true),StructField(member2,BooleanType,true)) and IntegerType are incompatible:
This behavior was a bit opaque in 0.x, but in 1.0.0 it has been made pretty clear here
So each time we encounter Parquet files containing fields with a union type, we need to pre-process the data as an inelegant workaround (renaming the memberX fields to avoid the union type detection).
Knowing that the Parquet implementation allows this union type, and Avro as well, we could expect Hudi to be able to handle it one way or another (representing the member struct as it is?). Wdyt?
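For anyone hitting the same problem, the renaming workaround could look roughly like the sketch below. It is plain Python that only builds a struct type string with the memberX names replaced. The `looks_like_union` heuristic, the `m` prefix, and the helper names are all assumptions made for illustration and are not part of Hudi or Spark.

```python
import re

# Assumption: a struct "looks like" a converted union when its fields are
# named member0, member1, ... in order. Heuristic for illustration only.
MEMBER_PATTERN = re.compile(r"^member(\d+)$")


def looks_like_union(field_names):
    """Return True if the names are exactly member0..memberN in order."""
    if not field_names:
        return False
    for i, name in enumerate(field_names):
        match = MEMBER_PATTERN.match(name)
        if match is None or int(match.group(1)) != i:
            return False
    return True


def renamed_struct_ddl(fields, prefix="m"):
    """Build a struct type string, renaming memberX to <prefix>X.

    fields: list of (name, type) pairs. Structs that do not match the
    memberX pattern keep their names. The prefix is hypothetical.
    """
    names = [name for name, _ in fields]
    if looks_like_union(names):
        names = [f"{prefix}{i}" for i in range(len(fields))]
    parts = [f"{name}:{typ}" for name, (_, typ) in zip(names, fields)]
    return "struct<" + ",".join(parts) + ">"


fields = [("member0", "int"), ("member1", "string"), ("member2", "boolean")]
ddl = renamed_struct_ddl(fields)
print(ddl)  # struct<m0:int,m1:string,m2:boolean>
```

In PySpark, a string like this could presumably be passed to `Column.cast` (for example `df.withColumn("field", col("field").cast(ddl))`) to rename the struct fields by position before writing to Hudi. That step is not exercised here, so treat it as a starting point.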