
Core: Fix UnicodeUtil#truncateStringMax returns malformed string. #11161

Open · wants to merge 2 commits into main
Conversation

zhongyujiang
Contributor

@zhongyujiang zhongyujiang commented Sep 18, 2024

We encountered an exception while writing data; it was thrown while collecting Parquet column metrics. The stack trace is as follows:

Exception stack:

Suppressed: org.apache.iceberg.exceptions.RuntimeIOException: Failed to encode value as UTF-8: Ҋ�Qڞ<֔~�M�ECڮV?
at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:110)
at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:83)
at org.apache.iceberg.parquet.ParquetUtil.toBufferMap(ParquetUtil.java:343)
at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:174)
at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:86)
at org.apache.iceberg.parquet.ParquetWriter.metrics(ParquetWriter.java:166)
at org.apache.iceberg.io.DataWriter.close(DataWriter.java:100)
at org.apache.iceberg.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:122)
at org.apache.iceberg.io.RollingFileWriter.close(RollingFileWriter.java:147)
at org.apache.iceberg.io.RollingDataWriter.close(RollingDataWriter.java:32)
at org.apache.iceberg.io.FanoutWriter.closeWriters(FanoutWriter.java:82)
at org.apache.iceberg.io.FanoutWriter.close(FanoutWriter.java:74)
at org.apache.iceberg.io.FanoutDataWriter.close(FanoutDataWriter.java:31)
at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.close(SparkWrite.java:1162)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$9(WriteToDataSourceV2Exec.scala:423)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1496)
... 10 more
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:816)
at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:108)
... 25 more

Investigation

After some investigation, I found that when collecting Parquet column metrics, string metrics are truncated by default to a length of 16 characters. When truncating the max metric, if the truncated value is shorter than the original max value, the last character is incremented by 1 so that the truncated value still compares greater than the original max. However, this increment does not skip code points that are not valid Unicode scalar values (the surrogate range U+D800..U+DFFF, which has no UTF-8 encoding), which can lead to the exception above.

In the scenario where we hit this, a Parquet file had a column whose max metric exceeded 16 characters, and the code point of its 16th character was '\uD7FF', which is Character.MIN_SURROGATE - 1. Adding 1 to it produced Character.MIN_SURROGATE, which is not a valid Unicode scalar value, so Conversions.toByteBuffer threw a MalformedInputException when it tried to encode the truncated value as UTF-8.

This fix skips the illegal code points when incrementing the last character, so the truncated upper bound always remains UTF-8 encodable.
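A minimal sketch of the idea (the class and method names here are illustrative, not the actual patch): when incrementing the last code point of a truncated upper bound, jump over the surrogate block U+D800..U+DFFF, and give up if the code point is already at the maximum.

```java
public class TruncateSketch {

  // Returns the smallest valid Unicode scalar value greater than cp,
  // or -1 if cp is already Character.MAX_CODE_POINT (no larger code point
  // exists; a real implementation would then try an earlier position).
  static int incrementCodePoint(int cp) {
    if (cp >= Character.MAX_CODE_POINT) {
      return -1;
    }
    int next = cp + 1;
    if (next >= Character.MIN_SURROGATE && next <= Character.MAX_SURROGATE) {
      // skip U+D800..U+DFFF to the first scalar value after the surrogates
      next = Character.MAX_SURROGATE + 1;
    }
    return next;
  }

  public static void main(String[] args) {
    // The failing case from this issue: '\uD7FF' + 1 must not be '\uD800'.
    System.out.println(Integer.toHexString(incrementCodePoint(0xD7FF))); // e000
  }
}
```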

To reproduce

CREATE TABLE my_table (data string) using iceberg;
INSERT INTO my_table VALUES('abcdefghigklmno\uD7FFp');
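The underlying encode failure can also be shown outside Iceberg entirely. This standalone snippet (an illustration, not part of the patch) produces the same MalformedInputException that Conversions.toByteBuffer raises, because a UTF-8 CharsetEncoder in its default REPORT mode rejects an unpaired surrogate:

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class SurrogateEncodeDemo {
  public static void main(String[] args) throws CharacterCodingException {
    // U+D7FF (Character.MIN_SURROGATE - 1) is a valid scalar value; it encodes fine.
    StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap("\uD7FF"));

    // Incrementing it yields U+D800 (Character.MIN_SURROGATE), an unpaired
    // surrogate, which the encoder rejects as malformed input.
    char bad = (char) ('\uD7FF' + 1);
    try {
      StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(String.valueOf(bad)));
    } catch (MalformedInputException e) {
      System.out.println("MalformedInputException: Input length = " + e.getInputLength());
    }
  }
}
```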

@zhongyujiang
Contributor Author

@amogh-jahagirdar @nastra Can you please help review this? Thanks.

// surrogate code points are not Unicode scalar values,
// any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.
// see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G27288
Preconditions.checkArgument(
Member


Just a minor point here, but shouldn't this only be relevant if we somehow get non-unicode binary in a unicode string? Shouldn't be possible in a Java string right?
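It is in fact possible: a Java String is a sequence of UTF-16 code units, not of Unicode scalar values, so it can legally hold an unpaired surrogate even though that surrogate has no UTF-8 encoding. A quick standalone check (illustration only, not part of the patch):

```java
import java.nio.charset.StandardCharsets;

public class UnpairedSurrogateCheck {
  public static void main(String[] args) {
    // A lone high surrogate is a perfectly legal char inside a String.
    String s = "abc\uD800";
    System.out.println(s.length());                             // 4
    System.out.println(Character.isHighSurrogate(s.charAt(3))); // true
    // String.getBytes does not throw on it; the encoder runs in REPLACE mode
    // and substitutes a replacement for the unencodable unit.
    System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
  }
}
```

Only code paths that encode in REPORT mode, such as Conversions.toByteBuffer, turn the unpaired surrogate into an exception.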

@@ -274,4 +274,17 @@ public void testTruncateStringMax() {
"Test input with multiple 4 byte UTF-8 character where the first unicode character should be incremented")
.isEqualTo(0);
}

@Test
public void testTruncateStringMaxUpperBound() {
Member


Could we add these to the test above? I'm also fine if there is a specific reason to have them somewhere else but it seems like these would fit into the test above as just other examples. The test cases for +1 and MAX_CODE_POINT are there already right?

Member

@RussellSpitzer RussellSpitzer left a comment


I just have a few nits on this, but this makes sense to me.
