Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Fix UnicodeUtil#truncateStringMax returns malformed string. #11161

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions api/src/main/java/org/apache/iceberg/types/Comparators.java
Original file line number Diff line number Diff line change
Expand Up @@ -321,9 +321,9 @@ private CharSeqComparator() {}
* represented using two Java characters (using UTF-16 surrogate pairs). Character by character
* comparison may yield incorrect results while comparing a 4 byte UTF-8 character to a java
* char. Character by character comparison works as expected if both characters are <= 3 byte
* UTF-8 character or both characters are 4 byte UTF-8 characters.
* isCharInUTF16HighSurrogateRange method detects a 4-byte character and considers that
* character to be lexicographically greater than any 3 byte or lower UTF-8 character.
* UTF-8 character or both characters are 4 byte UTF-8 characters. isCharHighSurrogate method
* detects a high surrogate (4-byte character) and considers that character to be
* lexicographically greater than any 3 byte or lower UTF-8 character.
*/
@Override
public int compare(CharSequence s1, CharSequence s2) {
Expand Down
24 changes: 22 additions & 2 deletions api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java
Original file line number Diff line number Diff line change
Expand Up @@ -82,9 +82,9 @@ public static Literal<CharSequence> truncateStringMax(Literal<CharSequence> inpu
for (int i = length - 1; i >= 0; i--) {
// Get the offset in the truncated string buffer where the number of unicode characters = i
int offsetByCodePoint = truncatedStringBuilder.offsetByCodePoints(0, i);
int nextCodePoint = truncatedStringBuilder.codePointAt(offsetByCodePoint) + 1;
int nextCodePoint = incrementCodePoint(truncatedStringBuilder.codePointAt(offsetByCodePoint));
// No overflow
if (nextCodePoint != 0 && Character.isValidCodePoint(nextCodePoint)) {
if (nextCodePoint != 0) {
truncatedStringBuilder.setLength(offsetByCodePoint);
// Append next code point to the truncated substring
truncatedStringBuilder.appendCodePoint(nextCodePoint);
Expand All @@ -93,4 +93,24 @@ public static Literal<CharSequence> truncateStringMax(Literal<CharSequence> inpu
}
return null; // Cannot find a valid upper bound
}

private static int incrementCodePoint(int codePoint) {
// surrogate code points are not Unicode scalar values,
// any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.
// see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G27288
Preconditions.checkArgument(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor point here, but shouldn't this only be relevant if we somehow get non-unicode binary in a unicode string? Shouldn't be possible in a Java string right?

codePoint < Character.MIN_SURROGATE || codePoint > Character.MAX_SURROGATE,
"invalid code point: %s",
codePoint);

if (codePoint == Character.MIN_SURROGATE - 1) {
// increment to the next Unicode scalar value
return Character.MAX_SURROGATE + 1;
} else if (codePoint == Character.MAX_CODE_POINT) {
// overflow
return 0;
} else {
return codePoint + 1;
}
}
}
13 changes: 13 additions & 0 deletions core/src/test/java/org/apache/iceberg/TestMetricsTruncation.java
Original file line number Diff line number Diff line change
Expand Up @@ -274,4 +274,17 @@ public void testTruncateStringMax() {
"Test input with multiple 4 byte UTF-8 character where the first unicode character should be incremented")
.isEqualTo(0);
}

@Test
public void testTruncateStringMaxUpperBound() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add these to the test above? I'm also fine if there is a specific reason to have them somewhere else but it seems like these would fit into the test above as just other examples. The test cases for +1 and MAX_CODE_POINT are there already right?

String max = "abcdefghigklmno" + (char) (Character.MIN_SURROGATE - 1) + "p";
String expectedUpper = "abcdefghigklmno" + (char) (Character.MAX_SURROGATE + 1);
Comparator<CharSequence> cmp = Literal.of(max).comparator();
CharSequence truncatedUpper = truncateStringMax(Literal.of(max), 16).value();
assertThat(cmp.compare(truncatedUpper, max))
.as("Truncated upper bound should be greater than the input max")
.isGreaterThan(1);

assertThat(truncatedUpper).usingComparator(cmp).isEqualTo(expectedUpper);
}
}