Improve Numeric matching to support full range of float64 #188

Merged
merged 6 commits into from
Sep 19, 2024

baldawar
Collaborator

@baldawar baldawar commented Sep 18, 2024

Issue #, if available:

179

Description of changes:

This change follows the guidance from #179 on using a 10-byte base-128 encoded format for numbers, similar to how Quamina does it.

I didn't see any performance impact from supporting the new range, but had to fix a bunch of tests. Before merging, I will change the numbers we use for testing to better exercise the new range.

During debugging, I found it challenging to make sense of the numbers, so I've also added a helper method in ComparableNumbers and modified toString() methods in a few places.
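
For readers unfamiliar with the encoding, here is a minimal sketch of the general idea behind a fixed-width, 10-digit base-128 representation of a float64. It is not the code in this PR: the class and method names are hypothetical, and collapsing -0.0 onto +0.0 is an assumption made for the illustration. The bit trick (flip all bits for negatives, flip only the sign bit otherwise) makes the unsigned order of the resulting long match numeric order, and emitting the 7-bit digits most significant first makes plain string comparison follow the same order.

```java
// Illustrative sketch only: class and method names are hypothetical, not Ruler's API.
final class NumbitsSketch {
    // 64 bits split into 7-bit digits needs ceil(64 / 7) = 10 digits.
    private static final int WIDTH = 10;

    /** Maps a finite double to a long whose unsigned order matches numeric order. */
    static long toOrderedBits(double d) {
        if (Double.isNaN(d) || Double.isInfinite(d)) {
            throw new IllegalArgumentException("Only finite values are supported");
        }
        if (d == 0.0d) {
            d = 0.0d; // assumption: collapse -0.0 onto +0.0 so both encode identically
        }
        long bits = Double.doubleToLongBits(d);
        // Negative values: flip every bit. Non-negative values: flip only the sign bit.
        long mask = (bits >> 63) | Long.MIN_VALUE;
        return bits ^ mask;
    }

    /** Encodes the ordered bits as 10 base-128 digits, most significant digit first. */
    static String encode(double d) {
        long ordered = toOrderedBits(d);
        char[] digits = new char[WIDTH];
        for (int i = WIDTH - 1; i >= 0; i--) {
            digits[i] = (char) (ordered & 0x7F); // lowest 7 bits become one digit
            ordered >>>= 7;
        }
        return new String(digits);
    }
}
```

Because every encoded value has the same width, `encode(a).compareTo(encode(b))` has the same sign as `Double.compare(a, b)` for finite inputs in this sketch (with the two zeros treated as equal).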

Benchmark / Performance (for source code changes):

Benchmark                                                  (benchmark)   Mode  Cnt       Score       Error  Units
CityLots2JmhBenchmarks.group01Simple                             EXACT  thrpt   10  413930.986 ± 11450.753  ops/s
CityLots2JmhBenchmarks.group01Simple                          WILDCARD  thrpt   10  352480.174 ±  6750.778  ops/s
CityLots2JmhBenchmarks.group01Simple                            PREFIX  thrpt   10  418264.642 ± 19729.068  ops/s
CityLots2JmhBenchmarks.group01Simple   PREFIX_EQUALS_IGNORE_CASE_RULES  thrpt   10  415142.905 ± 18319.570  ops/s
CityLots2JmhBenchmarks.group01Simple                            SUFFIX  thrpt   10  419590.039 ± 15463.837  ops/s
CityLots2JmhBenchmarks.group01Simple   SUFFIX_EQUALS_IGNORE_CASE_RULES  thrpt   10  410364.141 ± 12610.649  ops/s
CityLots2JmhBenchmarks.group01Simple                EQUALS_IGNORE_CASE  thrpt   10  382570.869 ±  5316.951  ops/s
CityLots2JmhBenchmarks.group01Simple                           NUMERIC  thrpt   10  256233.479 ±  3662.693  ops/s
CityLots2JmhBenchmarks.group01Simple                      ANYTHING-BUT  thrpt   10  276676.782 ±   958.215  ops/s
CityLots2JmhBenchmarks.group01Simple          ANYTHING-BUT-IGNORE-CASE  thrpt   10  274642.171 ±  5587.937  ops/s
CityLots2JmhBenchmarks.group01Simple               ANYTHING-BUT-PREFIX  thrpt   10  288814.025 ±  1302.395  ops/s
CityLots2JmhBenchmarks.group01Simple               ANYTHING-BUT-SUFFIX  thrpt   10  284559.927 ±  3258.949  ops/s
CityLots2JmhBenchmarks.group01Simple             ANYTHING-BUT-WILDCARD  thrpt   10  303269.268 ±   922.894  ops/s
CityLots2JmhBenchmarks.group02Complex                   COMPLEX_ARRAYS  thrpt    6   54274.020 ±   567.870  ops/s
CityLots2JmhBenchmarks.group02Complex                    PARTIAL_COMBO  thrpt    6  102749.040 ±  2765.021  ops/s
CityLots2JmhBenchmarks.group02Complex                            COMBO  thrpt    6   35555.689 ±   312.623  ops/s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@baldawar baldawar marked this pull request as ready for review September 19, 2024 00:38
Collaborator

@timbray timbray left a comment

LGTM, a couple of nonblocking comments to look at

* {@code 0.30d - 0.10d = 0.19999999999999998} instead of {@code 0.2d}. When extended to {@code 1e10}, the test
* results show that only 5 decimal places of precision can be guaranteed when using doubles.
* The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue
* where parsing directly to {@code Double} can loose precision when parsing doubles. It's
Collaborator

"loose" should be "lose"

* results show that only 5 decimal places of precision can be guaranteed when using doubles.
* The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue
* where parsing directly to {@code Double} can loose precision when parsing doubles. It's
* probably possible to wider ranges with our current implementation of parsing strings to
Collaborator

grammar, "possible to wider ranges"
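
For context on the Javadoc lines quoted in these two comments, a small sketch of the arithmetic they mention, comparing binary double subtraction with decimal subtraction via `BigDecimal`. The class and method names are hypothetical and only illustrate the rounding behaviour; they are not taken from the PR.

```java
import java.math.BigDecimal;

// Illustrative sketch only: names are hypothetical.
final class DecimalPrecisionSketch {
    /** Subtracts after parsing both operands as binary doubles. */
    static double viaDouble(String a, String b) {
        return Double.parseDouble(a) - Double.parseDouble(b);
    }

    /** Subtracts after parsing both operands as exact decimals. */
    static BigDecimal viaBigDecimal(String a, String b) {
        return new BigDecimal(a).subtract(new BigDecimal(b));
    }
}
```

With the inputs `"0.30"` and `"0.10"`, `viaDouble` yields `0.19999999999999998`, the value cited in the Javadoc, while `viaBigDecimal` yields a result numerically equal to `0.2`.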

throw new IllegalArgumentException("Value must be between " + -ComparableNumber.HALF_TRILLION +
" and " + ComparableNumber.HALF_TRILLION + ", inclusive");
// make sure we have the comparable numbers and haven't eaten up decimals values
if(Double.isNaN(doubleValue) || Double.isInfinite(doubleValue) ||
Collaborator

Can this happen? JSON doesn't allow numeric values of NaN and Inf

Collaborator Author

Not in JSON, but there's interest in supporting other formats, so I'm bulletproofing this now rather than being in for a surprise in the future.
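
A hedged sketch of what such a guard might look like, assuming the text is parsed through `BigDecimal` first as the Javadoc in this PR describes. The class and method names are hypothetical. One thing it illustrates: `BigDecimal.doubleValue()` returns an infinity when the magnitude is too large for a double, so a syntactically valid number such as `1e400` would reach the infinite branch even without a format that has literal NaN or Inf tokens.

```java
import java.math.BigDecimal;

// Illustrative sketch only: names are hypothetical, not the PR's actual code.
final class FiniteDoubleGuard {
    /** Parses decimal text and rejects anything that is not a finite double. */
    static double toFiniteDouble(String text) {
        // new BigDecimal(text) throws NumberFormatException for tokens like "NaN".
        double value = new BigDecimal(text).doubleValue();
        // Overflowing magnitudes convert to +/-Infinity rather than throwing.
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            throw new IllegalArgumentException("Value is not a finite double: " + text);
        }
        return value;
    }
}
```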

* <br/>
* As shown in Quamina's numbits.go, it's possible to use variable length encoding
* to reduce storage for simple (often common) numbers but it's not done here to
* keep range comparisons simple for now.
Collaborator

Haha, ever since Quamina got variable-width numbits I've been worrying about the range comparison code. *sigh* Now I guess I have to figure that out.

Collaborator Author

Yeah. I gave it a shot but found it hard to get right. I wanted to get back to building $and support, so took the easier route. I figured I can come back to cleaning up my branch later.

* @param value the long value to be converted
* @return the Base128 encoded string representation of the input value
*/
public static String numbits(long value) {
Collaborator

Each char in a Java String is 2 bytes because they are UTF-16 code units. You could save a lot of memory if you could work on the UTF-8 bytes. But I realize that this comment applies to all of Ruler, I guess.

Collaborator Author

That's fair. It would be worth a look when we dig into memory and CPU improvements down the line.
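
To put rough numbers on the size difference discussed here, a small sketch that measures the encoded length of a string in UTF-8 and in UTF-16. The names are hypothetical. It only measures serialized byte counts; actual heap usage depends on the JVM (for example, JDK 9 and later can store Latin-1-only strings internally at one byte per char), so treat it as an estimate rather than a memory profile.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: names are hypothetical.
final class EncodedSizeSketch {
    /** Number of bytes needed to represent the string as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes needed as UTF-16 (big-endian, no byte-order mark). */
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }
}
```

For a 10-character string whose chars are all below 128, as a base-128 digit string would be, `utf8Bytes` returns 10 and `utf16Bytes` returns 20.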

@baldawar baldawar merged commit 1406b64 into main Sep 19, 2024
4 checks passed