Improve Numeric matching to support full range of float64 #188

Merged

baldawar merged 6 commits into main from numbits2

Sep 19, 2024

Collaborator

baldawar commented Sep 18, 2024 (edited)

Issue #, if available: #179

Description of changes:

This change follows the guidance from #179 on using a 10-byte base-128 encoded format for numbers, similar to how Quamina does it.
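As a rough illustration of the idea (not the code from this PR), the sketch below maps a double to an order-preserving 64-bit value and writes it as 10 base-128 digits. The class and method names are hypothetical, and the bit-flipping step is an assumption based on the usual way of making IEEE 754 bit patterns sortable; the actual `ComparableNumber` implementation may differ. NaN is not handled, and `-0.0` and `0.0` encode to different values.

```java
final class NumbitsSketch {
    // 64 bits split into 7-bit groups needs ceil(64 / 7) = 10 bytes.
    private static final int ENCODED_LENGTH = 10;

    /** Maps a double to a long whose unsigned order matches the numeric order of the input. */
    static long toOrderedBits(double value) {
        long bits = Double.doubleToRawLongBits(value);
        // Negative values: flip every bit. Non-negative values: flip only the sign bit.
        return bits < 0 ? ~bits : bits ^ Long.MIN_VALUE;
    }

    /** Encodes the ordered bits as 10 big-endian base-128 digits (each byte is 0..127). */
    static byte[] encode(double value) {
        long ordered = toOrderedBits(value);
        byte[] out = new byte[ENCODED_LENGTH];
        for (int i = ENCODED_LENGTH - 1; i >= 0; i--) {
            out[i] = (byte) (ordered & 0x7F);
            ordered >>>= 7;
        }
        return out;
    }

    /** Byte-wise comparison; valid because both arrays have the same fixed length. */
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < ENCODED_LENGTH; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return 0;
    }
}
```

Because every encoded value has the same length, a range check such as `lower <= x <= upper` reduces to two byte-wise comparisons.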

Didn't see any performance implications of supporting the new range, but had to fix a bunch of tests. I will be changing the numbers we use for testing to better test the new range of numbers before merging.

During debugging, I found it challenging to make sense of the numbers, so I've also added a helper method in ComparableNumbers and modified the toString() methods in a few places.

Benchmark / Performance (for source code changes):

Benchmark                                                  (benchmark)   Mode  Cnt       Score       Error  Units
CityLots2JmhBenchmarks.group01Simple                             EXACT  thrpt   10  413930.986 ± 11450.753  ops/s
CityLots2JmhBenchmarks.group01Simple                          WILDCARD  thrpt   10  352480.174 ±  6750.778  ops/s
CityLots2JmhBenchmarks.group01Simple                            PREFIX  thrpt   10  418264.642 ± 19729.068  ops/s
CityLots2JmhBenchmarks.group01Simple   PREFIX_EQUALS_IGNORE_CASE_RULES  thrpt   10  415142.905 ± 18319.570  ops/s
CityLots2JmhBenchmarks.group01Simple                            SUFFIX  thrpt   10  419590.039 ± 15463.837  ops/s
CityLots2JmhBenchmarks.group01Simple   SUFFIX_EQUALS_IGNORE_CASE_RULES  thrpt   10  410364.141 ± 12610.649  ops/s
CityLots2JmhBenchmarks.group01Simple                EQUALS_IGNORE_CASE  thrpt   10  382570.869 ±  5316.951  ops/s
CityLots2JmhBenchmarks.group01Simple                           NUMERIC  thrpt   10  256233.479 ±  3662.693  ops/s
CityLots2JmhBenchmarks.group01Simple                      ANYTHING-BUT  thrpt   10  276676.782 ±   958.215  ops/s
CityLots2JmhBenchmarks.group01Simple          ANYTHING-BUT-IGNORE-CASE  thrpt   10  274642.171 ±  5587.937  ops/s
CityLots2JmhBenchmarks.group01Simple               ANYTHING-BUT-PREFIX  thrpt   10  288814.025 ±  1302.395  ops/s
CityLots2JmhBenchmarks.group01Simple               ANYTHING-BUT-SUFFIX  thrpt   10  284559.927 ±  3258.949  ops/s
CityLots2JmhBenchmarks.group01Simple             ANYTHING-BUT-WILDCARD  thrpt   10  303269.268 ±   922.894  ops/s
CityLots2JmhBenchmarks.group02Complex                   COMPLEX_ARRAYS  thrpt    6   54274.020 ±   567.870  ops/s
CityLots2JmhBenchmarks.group02Complex                    PARTIAL_COMBO  thrpt    6  102749.040 ±  2765.021  ops/s
CityLots2JmhBenchmarks.group02Complex                            COMBO  thrpt    6   35555.689 ±   312.623  ops/s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

baldawar added 5 commits

September 17, 2024 21:04


Initial commit for supporting float64 numbers

rev2 based on pull request diff (687efb2)

rev3 : undo bad rebase (184967e)

checkstyle errors (594b03b)

removing flaky test cases (252d9ab)

baldawar marked this pull request as ready for review

September 19, 2024 00:38

timbray approved these changes

Collaborator

timbray left a comment

LGTM, a couple of nonblocking comments to look at

src/main/software/amazon/event/ruler/ComparableNumber.java Outdated

    
    * {@code 0.30d - 0.10d = 0.19999999999999998} instead of {@code 0.2d}. When extended to {@code 1e10}, the test
    * results show that only 5 decimal places of precision can be guaranteed when using doubles.
    * The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue
    * where parsing directly to {@code Double} can loose precision when parsing doubles. It's

Collaborator

timbray Sep 19, 2024

"loose" should be "lose"
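The precision issue described in the quoted Javadoc can be reproduced with a few lines of Java. This is only a minimal sketch with a hypothetical class name; it shows the subtraction example from the comment and does not reproduce the parsing logic in `ComparableNumber`.

```java
import java.math.BigDecimal;

final class DecimalPrecisionSketch {
    /** Binary doubles cannot represent 0.3, 0.1, or 0.2 exactly, so the subtraction drifts. */
    static boolean doubleSubtractionIsExact() {
        return 0.30d - 0.10d == 0.2d;
    }

    /** BigDecimal built from decimal strings keeps the exact decimal values. */
    static boolean bigDecimalSubtractionIsExact() {
        BigDecimal difference = new BigDecimal("0.30").subtract(new BigDecimal("0.10"));
        // compareTo ignores scale, so 0.20 and 0.2 compare as equal.
        return difference.compareTo(new BigDecimal("0.2")) == 0;
    }
}
```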

src/main/software/amazon/event/ruler/ComparableNumber.java Outdated

    
    * results show that only 5 decimal places of precision can be guaranteed when using doubles.
    * The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue
    * where parsing directly to {@code Double} can loose precision when parsing doubles. It's
    * probably possible to wider ranges with our current implementation of parsing strings to

Collaborator

timbray Sep 19, 2024

grammar, "possible to wider ranges"

src/main/software/amazon/event/ruler/ComparableNumber.java

    
        throw new IllegalArgumentException("Value must be between " + -ComparableNumber.HALF_TRILLION +
                " and " + ComparableNumber.HALF_TRILLION + ", inclusive");
    // make sure we have the comparable numbers and haven't eaten up decimals values
    if(Double.isNaN(doubleValue) || Double.isInfinite(doubleValue) ||

Collaborator

timbray Sep 19, 2024

Can this happen? JSON doesn't allow numeric values of NaN and Inf

Collaborator Author

baldawar Sep 19, 2024

Not in JSON, but there's interest in supporting other formats, so I'm bulletproofing this now rather than being in for a surprise in the future.
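A guard of this kind might look like the sketch below. The class and method names are hypothetical and the real check in `ComparableNumber` has additional conditions that are cut off in the snippet above. Note that `Double.parseDouble` returns `NaN` and `Infinity` for the strings `"NaN"` and `"Infinity"`, and also returns `Infinity` for an out-of-range literal such as `"1e400"`, so a non-finite double can appear whenever text is converted this way.

```java
final class FiniteCheckSketch {
    /** Returns the value unchanged if it is finite; rejects NaN and +/- Infinity. */
    static double requireFinite(double value) {
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            throw new IllegalArgumentException("Value must be a finite number, got " + value);
        }
        return value;
    }
}
```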

src/main/software/amazon/event/ruler/ComparableNumber.java

    
    * <br/>
    * As shown in Quamina's numbits.go, it's possible to use variable length encoding
    * to reduce storage for simple (often common) numbers but it's not done here to
    * keep range comparisons simple for now.

Collaborator

timbray Sep 19, 2024

Haha, ever since Quamina got variable-width numbits I've been worrying about the range comparison code. Sigh, now I guess I have to figure that out.

Collaborator Author

baldawar Sep 19, 2024

Yeah. I gave it a shot but found it hard to get right. I wanted to get back to building $and support, so took the easier route. I figured I can come back to cleaning up my branch later.

src/main/software/amazon/event/ruler/ComparableNumber.java

    
     * @param value the long value to be converted
     * @return the Base128 encoded string representation of the input value
     */
    public static String numbits(long value) {

Collaborator

timbray Sep 19, 2024

Each char in a Java String is 2 bytes because they are UTF-16 code units. You could save a lot of memory if you could work on the UTF-8 bytes. But I realize that this comment applies to all of Ruler, I guess.

Collaborator Author

baldawar Sep 19, 2024

that's fair. It would be worth a look when we dig into memory & cpu improvements down the line.
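The size difference under discussion can be seen by encoding the same ASCII-range string both ways. This is a minimal sketch with hypothetical names; it measures encoded byte counts only, not actual heap usage. On JDK 9 and later, compact strings (JEP 254) may already store Latin-1-only strings with one byte per char internally, so the real saving depends on the runtime.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizeSketch {
    /** Number of bytes when each char is written as a 2-byte UTF-16 code unit. */
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    /** Number of bytes in UTF-8; chars in the 0..127 range take one byte each. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```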


minor javadoc fixes (e671aef)

baldawar merged commit 1406b64 into main

4 checks passed
