-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Numeric matching to support full range of float64 #188
Changes from 5 commits
9587870
687efb2
184967e
594b03b
252d9ab
e671aef
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,104 +3,122 @@ | |
import ch.randelshofer.fastdoubleparser.JavaBigDecimalParser; | ||
|
||
import java.math.BigDecimal; | ||
|
||
import static software.amazon.event.ruler.Constants.BASE64_DIGITS; | ||
import static software.amazon.event.ruler.Constants.MIN_NUM_DIGIT; | ||
import java.nio.charset.StandardCharsets; | ||
import java.util.Arrays; | ||
import java.util.List; | ||
|
||
/** | ||
* Represents a number as a comparable string. | ||
* <br/> | ||
* Numbers are allowed in the range -500,000,000,000 to +500,000,000,000 (inclusive). | ||
* Comparisons are precise to 17 decimal places, with six to the right of the decimal. | ||
* Numbers are treated as floating-point values. | ||
* <br> | ||
* Numbers are converted to strings by: | ||
* 1. Multiplying by 1,000,000 to remove the decimal point and then adding 500,000,000,000 (to remove negatives), then | ||
* 2. Formatting to a 12-character to base64 string with padding, because the base64 string | ||
* converted from 500,000,000,000 * 1,000,000 = 500,000,000,000,000,000 has 12 characters. | ||
* All possible double numbers (IEEE-754 binary64 standard) are allowed. | ||
* Numbers are first standardized to floating-point values and then converted | ||
* to a Base128 encoded string of 10 bytes. | ||
* <br/> | ||
* Hexadecimal representation is used because: | ||
* 1. It saves 3 bytes of memory per number compared to decimal representation. | ||
* 2. It is lexicographically comparable, which is useful for maintaining sorted order of numbers. | ||
* 2. It aligns with the radix used for IP addresses. | ||
* We use Base128 encoding offers a compact representation of decimal numbers | ||
* as it preserves the lexicographical order of the numbers. See | ||
* https://github.com/aws/event-ruler/issues/179 for more context. | ||
* <br/> | ||
* The number is parsed as a Java {@code BigDecimal} to support decimal fractions. We're avoiding double as | ||
* there is a well-known issue that double numbers can lose precision when performing calculations involving | ||
* other data types. The higher the number, the lower the accuracy that can be maintained. For example, | ||
* {@code 0.30d - 0.10d = 0.19999999999999998} instead of {@code 0.2d}. When extended to {@code 1e10}, the test | ||
* results show that only 5 decimal places of precision can be guaranteed when using doubles. | ||
* The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you provide a reference or details for that? Double handling is finicky and has lots of potential foot guns. But I never ran into a loss of precision that's not mediated by decimal-binary conversion issues. If you e.g. use Double.toHexString and the "p" representation of exponents in your numerical tests, you should be able to avoid these issues. You won't lose fidelity. But you'll gain some performance. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I noticed the issue with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A double represented as a decimal stops being accurate after 15 digits (unless it ends in 5). this number has 17. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And BTW this is a problem with the citylots file, those floats in there have more precision than is safe with JSON. So it's probably not useful for supporting numeric-match testing. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh wow. I did not realize that. Let me look into changing city lots to have a healthy mix of numbers with less than 15 decimals. |
||
* where parsing directly to {@code Double} can loose precision when parsing doubles. It's | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "loose" should be "lose" |
||
* probably possible to wider ranges with our current implementation of parsing strings to | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. grammar, "possible to wider ranges" |
||
* BigDecimal, but it's not worth the effort as JSON also support upto float64 range. In | ||
* case this requirement changes, it would be advisable to move away from using {@code Doubles} | ||
* and {@code Long} in this class. | ||
* <br/> | ||
* CAVEAT: | ||
* The current range of +/- 500,000,000,000 is selected as a balance between maintaining the committed 6 | ||
* decimal places of precision and memory cost (each number is parsed into a 12-character hexadecimal string). | ||
* There are precision and memory implications of the implementation here. | ||
* When trying to increase the maximum number, PLEASE BE VERY CAREFUL TO PRESERVE THE NUMBER PRECISION AND | ||
* CONSIDER THE MEMORY COST. | ||
* <br/> | ||
* Also, while {@code BigDecimal} can ensure the precision of double calculations, it has been shown to be | ||
* 2-4 times slower for basic mathematical and comparison operations, so we turn to long integer arithmetic. | ||
* This will need to be modified if we ever need to support larger numbers. | ||
*/ | ||
class ComparableNumber { | ||
// Use scientific notation to define the double number directly to avoid losing Precision by calculation | ||
// for example 5000 * 1000 *1000 will be wrongly parsed as 7.05032704E8 by computer. | ||
static final double HALF_TRILLION = 5E11; | ||
|
||
static final int MAX_LENGTH_IN_BYTES = 16; | ||
static final int MAX_DECIMAL_PRECISON = 6; | ||
static final int MAX_LENGTH_IN_BYTES = 10; | ||
static final int BASE_128_BITMASK = 0x7f; // 127 or 01111111 | ||
|
||
public static final BigDecimal TEN_E_SIX = new BigDecimal("1E6"); // to remove decimals | ||
public static final long HALF_TRILLION_TEN_E_SIX = new BigDecimal(ComparableNumber.HALF_TRILLION).multiply(TEN_E_SIX).longValueExact(); | ||
|
||
private ComparableNumber() { | ||
} | ||
private ComparableNumber() {} | ||
|
||
/** | ||
* Generates a hexadecimal string representation of a given decimal string value, | ||
* with a maximum precision of 6 decimal places and a range between -5,000,000,000 | ||
* and 5,000,000,000 (inclusive). | ||
* Generates a comparable number string from a given string representation | ||
* using numbits representation. | ||
* | ||
* @param str the decimal string value to be converted | ||
* @return the hexadecimal string representation of the input value | ||
* @throws IllegalArgumentException if the input value has more than 6 decimal places | ||
* or is outside the allowed range | ||
* @param str the string representation of the number | ||
* @return the comparable number string | ||
* @throws NumberFormatException if the input isn't a number | ||
* @throws IllegalArgumentException if the input isn't a number we can compare | ||
*/ | ||
static String generate(final String str) { | ||
final BigDecimal number = JavaBigDecimalParser.parseBigDecimal(str).stripTrailingZeros(); | ||
if (number.scale() > MAX_DECIMAL_PRECISON) { | ||
throw new IllegalArgumentException("Only values upto 6 decimals are supported"); | ||
} | ||
|
||
final long shiftedBySixDecimals = number.multiply(TEN_E_SIX).longValueExact(); | ||
final BigDecimal bigDecimal = JavaBigDecimalParser.parseBigDecimal(str); | ||
final double doubleValue = bigDecimal.doubleValue(); | ||
|
||
// faster than doing bigDecimal comparisons | ||
if (shiftedBySixDecimals < -HALF_TRILLION_TEN_E_SIX || shiftedBySixDecimals > HALF_TRILLION_TEN_E_SIX) { | ||
throw new IllegalArgumentException("Value must be between " + -ComparableNumber.HALF_TRILLION + | ||
" and " + ComparableNumber.HALF_TRILLION + ", inclusive"); | ||
// make sure we have the comparable numbers and haven't eaten up decimals values | ||
if(Double.isNaN(doubleValue) || Double.isInfinite(doubleValue) || | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can this happen? JSON doesn't allow numeric values of NaN and Inf There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. not in JSON but there's interest in supporting other formats so I'm bullet proofing this now than be in for a surprise in future. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can check both in one go based on the bits-representation. With a constant Maybe the JVM automagically already inlines the two calls like that. If not, it could help, here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. JVM does inline it. I've left this as it for readability for now. |
||
BigDecimal.valueOf(doubleValue).compareTo(bigDecimal) != 0) { | ||
throw new IllegalArgumentException("Cannot compare number : " + str); | ||
} | ||
final long bits = Double.doubleToRawLongBits(doubleValue); | ||
|
||
return longToBase64Bytes(shiftedBySixDecimals + HALF_TRILLION_TEN_E_SIX); | ||
// if high bit is 0, we want to xor with sign bit 1 << 63, else negate (xor with ^0). Meaning, | ||
// bits >= 0, mask = 1000000000000000000000000000000000000000000000000000000000000000 | ||
// bits < 0, mask = 1111111111111111111111111111111111111111111111111111111111111111 | ||
final long mask = ((bits >>> 63) * 0xFFFFFFFFFFFFFFFFL) | (1L << 63); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Raph Levien proposed an even nicer way on Mastodon (not yet in Quamina afaik). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sweet. Thanks. If you have the link to the tweet (toot?) from Raph, I'd love to link it for attribution. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. https://mastodon.online/@raph/113071041069390831 ... his was for Go, not Java - but it's the same idea |
||
return numbits(bits ^ mask ); | ||
} | ||
|
||
public static String longToBase64Bytes(long value) { | ||
if (value < 0) { | ||
throw new IllegalArgumentException("Input value must be non-negative"); | ||
/** | ||
* Converts a long value to a Base128 encoded string representation. | ||
* <br/> | ||
* The Base128 encoding scheme is a way to represent a long value as a sequence | ||
* of bytes, where each byte encodes 7 bits of the original value. This allows for | ||
* efficient storage and transmission of large numbers. | ||
* <br/> | ||
* The method first determines the number of trailing zero bytes in the input | ||
* value by iterating over the bytes from the most significant byte to the least | ||
* significant byte, and counting the number of consecutive zero bytes at the end. | ||
* It then creates a byte array of fixed length {@code MAX_LENGTH_IN_BYTES} and | ||
* populates it with the Base128 encoded bytes of the input value, starting from | ||
* the least significant byte. | ||
* <br/> | ||
* As shown in Quamina's numbits.go, it's possible to use variable length encoding | ||
* to reduce storage for simple (often common) numbers but it's not done here to | ||
* keep range comparisons simple for now. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Haha, ever since Quamina got variable-width numbits I've been worrying about the range comparison code. sigh now I guess I have to figure that out. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah. I gave it a shot but found it hard to get right. I wanted to get back to building $and support, so took the easier route. I figured I can come back to cleaning up my branch later. |
||
* | ||
* @param value the long value to be converted | ||
* @return the Base128 encoded string representation of the input value | ||
*/ | ||
public static String numbits(long value) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Each char in a java String is 2 bytes because they are UTF-16 codepoints. You could save a lot of memory if you could work on the UTF-8 bytes. But I realize that this comment applies to all of Ruler I guess. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. that's fair. It would be worth a look when we dig into memory & cpu improvements down the line. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How are Strings-by-chars sorted in Event Ruler? Binary value or with some collation? If it's binary... you can still use the same scheme but store 2 7-bit bytes per char. Might require shuffling the bytes around if the endianness of the char 2-byte-tuples is different. But as long as leaving the 0x8080 bits out stays clear of combiners or possibly causing illegal codepoints... you should be good. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ... or use the lower 15 bits. simpler than a 0x8080 mask. Still, check that you get by with only one codepoint and that it's not reserved or illegal in UTF16 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Comparing ranges got trickier with using 2 7-bit bytes per char, so I've parked the idea for now. Will come back to it later. In case anyone takes a stab at it in future, here's the reference code for this :
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ... Tim told me quamina internally works different. Quamina needs the UTF8 validity because it uses UTF8-illegal byte values as markers. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For range comparison, Ruler has a lookup table that changes incoming numbers into a sequence of digits https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/Range.java#L128 . Changing this requires more effort and I suspect not as beneficial compared to supporting variable length for numbers. So I chose to work on variable length first and then come back to compressing the numbers later (though I'm hopeful that I'd be able to do both once I get more into refactoring this part of the code more) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I don't know the answer for this on the top of mind. Will check on it later but I was originally suspecting it should be fine as long as I address any edge cases within the look-up table part of the code. |
||
int trailingZeroes = 0; | ||
int index; | ||
// Count the number of trailing zero bytes to skip setting them | ||
for(index = MAX_LENGTH_IN_BYTES - 1; index >= 0; index--) { | ||
if((value & BASE_128_BITMASK) != 0) { | ||
break; | ||
} | ||
trailingZeroes ++; | ||
value >>= 7; | ||
} | ||
|
||
char[] bytes = new char[12]; // Maximum length of base-64 encoded long is 12 bytes | ||
int index = 11; | ||
|
||
while (value > 0) { | ||
int digit = (int) (value & 0x3F); // Get the lowest 6 bits | ||
bytes[index--] = (char) BASE64_DIGITS[digit]; | ||
value >>= 6; // Shift the value right by 6 bits | ||
} | ||
byte[] result = new byte[MAX_LENGTH_IN_BYTES]; | ||
|
||
while(index >= 0) { // left padding | ||
bytes[index--] = (char) MIN_NUM_DIGIT; | ||
// Populate the byte array with the Base128 encoded bytes of the input value | ||
for(; index >= 0; index--) { | ||
result[index] = (byte)(value & BASE_128_BITMASK); | ||
value >>= 7; | ||
} | ||
|
||
return new String(bytes); | ||
return new String(result, StandardCharsets.UTF_8); | ||
} | ||
|
||
/** | ||
* This is a utility function for debugging and tests. | ||
* Converts a given string into a list of integers, where each integer represents | ||
* the ASCII value of the corresponding character in the string. | ||
*/ | ||
static List<Integer> toIntVals(String s) { | ||
Integer[] arr = new Integer[s.length()]; | ||
for (int i=0; i<s.length(); i++) { | ||
arr[i] = (int)s.charAt(i); | ||
} | ||
return Arrays.asList(arr); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/We use/The used/