MQE: Add quantile aggregation #10755

Merged
merged 13 commits into from
Mar 7, 2025

jhesketh
Contributor

What this PR does

Which issue(s) this PR fixes or relates to

#10067

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@jhesketh jhesketh requested a review from a team as a code owner February 27, 2025 05:16
//
// values will be sorted in place.
// If values has zero elements, NaN is returned.
// If q==NaN, NaN is returned.
// If q<0, -Inf is returned.
// If q>1, +Inf is returned.
-func quantile(q float64, values []float64) float64 {
+func Quantile(q float64, values []float64) float64 {
Contributor Author

Note to reviewers: this is exported for now for simplicity. I will come back with a new PR to sync this file with upstream and move it someplace more common.

@jhesketh jhesketh force-pushed the jhesketh/mqe-quantile branch from 903db70 to 6b8094e Compare February 27, 2025 23:48
Contributor

@charleskorn charleskorn left a comment

Overall LGTM.

I'd like to see a benchmark comparison between this and Prometheus' engine.

@@ -218,7 +232,7 @@ func (a *Aggregation) NextSeries(ctx context.Context) (types.InstantVectorSeries
}

// Construct the group and return it
-seriesData, hasMixedData, err := thisGroup.aggregation.ComputeOutputSeries(a.TimeRange, a.MemoryConsumptionTracker)
+seriesData, hasMixedData, err := thisGroup.aggregation.ComputeOutputSeries(a.paramData, a.TimeRange, a.MemoryConsumptionTracker, a.emitAnnotationParam)
Contributor

I suspect this will cause an allocation for a.emitAnnotationParam on every call - we may need to do the same thing we do for a.emitAnnotationFunc / a.emitAnnotation in NewAggregation.

Another option would be to separate out the validation of the parameter and do that just once, rather than in every call to ComputeOutputSeries - this would save us doing the same work over and over again, and remove the need to pass the function here as well. And then we can likely just accept the single allocation and pass a.emitAnnotationParam directly.
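
To illustrate the first suggestion, a minimal sketch of binding a method value once at construction time instead of at every call site. All names here (`emitFunc`, `aggregation`, `record`, `computeOutputSeries`) are hypothetical stand-ins, not the real MQE types, and whether the per-call form actually allocates depends on the compiler's escape analysis.

```go
package main

// emitFunc is a hypothetical stand-in for the annotation callback type.
type emitFunc func(msg string)

type aggregation struct {
	annotations []string
	// emit holds the method value bound once in newAggregation, so that
	// passing it on does not create a fresh func value for each series.
	emit emitFunc
}

func newAggregation() *aggregation {
	a := &aggregation{}
	a.emit = a.record // bind the method value a single time
	return a
}

func (a *aggregation) record(msg string) {
	a.annotations = append(a.annotations, msg)
}

// computeOutputSeries stands in for the per-group computation.
func computeOutputSeries(q float64, emit emitFunc) {
	if q < 0 || q > 1 {
		emit("quantile parameter out of range")
	}
}

func (a *aggregation) nextSeries(q float64) {
	// Writing a.record here instead may allocate a new bound method
	// value on every call if the compiler decides it escapes.
	computeOutputSeries(q, a.emit)
}
```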

Contributor Author

> I suspect this will cause an allocation for a.emitAnnotationParam on every call - we may need to do the same thing we do for a.emitAnnotationFunc / a.emitAnnotation in NewAggregation.

Good catch, have fixed that.

> Another option would be to separate out the validation of the parameter and do that just once, rather than in every call to ComputeOutputSeries - this would save us doing the same work over and over again, and remove the need to pass the function here as well. And then we can likely just accept the single allocation and pass a.emitAnnotationParam directly.

The parameter is processed during the aggregation's SeriesMetadata call, so it should only happen once. The values (a.paramData on this quoted line) are then passed into the quantile's ComputeOutputSeries to be looked up.

Contributor

> The parameter is processed during the aggregation's SeriesMetadata call, so it should only happen once. The values (a.paramData on this quoted line) are then passed into the quantile's ComputeOutputSeries to be looked up.

Sorry, I should have been clearer: I was referring to the work done to validate the parameter values and emit an annotation if a value is not between 0 and 1 here - this will be repeated for every output group, and is the only reason we need to pass an emitParamAnnotationFunc to ComputeOutputSeries.

If we instead move that validation out of ComputeOutputSeries, then we'll only do it once regardless of the number of groups, and we don't need to pass emitParamAnnotationFunc to ComputeOutputSeries either.

Contributor Author

Ah, I see. Good point.

I've refactored things to avoid this. Quantile is now its own operator that is a thin wrapper around Aggregation, which avoids repeating all of the grouping logic. The downside is that Aggregation still needs to keep ParamData, which is a bit odd, but I think it's a decent trade-off.

As a side effect: previously, if no series were output, no annotation would be emitted for a bad quantile value. However, since we now process the parameter values independently of the groups, we always output an annotation. This is more consistent with Prometheus anyway.
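
A rough sketch of the shape described above, assuming a wrapper that validates the per-step parameter values once and leaves grouping to the wrapped operator. The types, field names, and warning text are hypothetical and are not the code in this PR; the point is only that validation no longer depends on how many groups exist.

```go
package main

import (
	"fmt"
	"math"
)

// groupedAggregation is a hypothetical stand-in for the shared Aggregation
// operator; it keeps the per-step parameter values and the grouping state.
type groupedAggregation struct {
	paramData []float64
	groups    int
}

// quantileOperator is a hypothetical thin wrapper that validates the
// parameter itself and leaves grouping to the wrapped operator.
type quantileOperator struct {
	inner    *groupedAggregation
	warnings []string
}

// seriesMetadata validates each per-step parameter once, before and
// independently of any group being produced, then reports the group count.
func (q *quantileOperator) seriesMetadata() int {
	for _, p := range q.inner.paramData {
		if math.IsNaN(p) || p < 0 || p > 1 {
			q.warnings = append(q.warnings,
				fmt.Sprintf("quantile parameter %v is outside [0, 1]", p))
		}
	}
	return q.inner.groups
}
```

With this shape, a query that produces zero groups still records a warning for an out-of-range parameter, which is the side effect noted above.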

jhesketh commented Mar 4, 2025

Benchmark results:


                                                                   │  Prometheus  │               Mimir                │
                                                                   │    sec/op    │   sec/op     vs base               │
Query/quantile(0.9,_a_1),_instant_query-8                             674.2µ ± 1%   639.4µ ± 1%   -5.16% (p=0.002 n=6)
Query/quantile(0.9,_a_1),_range_query_with_100_steps-8                742.9µ ± 1%   689.5µ ± 1%   -7.18% (p=0.002 n=6)
Query/quantile(0.9,_a_1),_range_query_with_1000_steps-8               1.172m ± 0%   1.040m ± 1%  -11.28% (p=0.002 n=6)
Query/quantile(0.9,_a_100),_instant_query-8                           2.285m ± 1%   2.202m ± 1%   -3.64% (p=0.002 n=6)
Query/quantile(0.9,_a_100),_range_query_with_100_steps-8              5.167m ± 1%   3.774m ± 1%  -26.97% (p=0.002 n=6)
Query/quantile(0.9,_a_100),_range_query_with_1000_steps-8             31.71m ± 2%   18.94m ± 1%  -40.29% (p=0.002 n=6)
Query/quantile(0.9,_a_2000),_instant_query-8                          26.65m ± 1%   25.56m ± 1%   -4.11% (p=0.002 n=6)
Query/quantile(0.9,_a_2000),_range_query_with_100_steps-8             85.56m ± 1%   60.01m ± 1%  -29.86% (p=0.002 n=6)
Query/quantile(0.9,_a_2000),_range_query_with_1000_steps-8            614.7m ± 1%   377.9m ± 1%  -38.52% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_1),_instant_query-8                     686.7µ ± 1%   649.9µ ± 1%   -5.36% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_1),_range_query_with_100_steps-8        748.0µ ± 1%   699.5µ ± 0%   -6.49% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_1),_range_query_with_1000_steps-8       1.189m ± 1%   1.048m ± 1%  -11.79% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_100),_instant_query-8                   2.389m ± 1%   2.403m ± 1%        ~ (p=0.132 n=6)
Query/quantile_by_(l)_(0.1,_b_100),_range_query_with_100_steps-8      5.340m ± 2%   5.320m ± 1%        ~ (p=0.093 n=6)
Query/quantile_by_(l)_(0.1,_b_100),_range_query_with_1000_steps-8     35.43m ± 6%   31.84m ± 1%  -10.14% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_2000),_instant_query-8                  28.52m ± 1%   28.93m ± 1%   +1.42% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_2000),_range_query_with_100_steps-8    106.38m ± 1%   89.76m ± 1%  -15.62% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_2000),_range_query_with_1000_steps-8   1032.4m ± 2%   613.6m ± 2%  -40.56% (p=0.002 n=6)
geomean                                                               9.291m        7.854m       -15.47%

                                                                   │   Prometheus   │                Mimir                │
                                                                   │      B/op      │     B/op      vs base               │
Query/quantile(0.9,_a_1),_instant_query-8                              23.53Ki ± 0%   20.89Ki ± 0%  -11.24% (p=0.002 n=6)
Query/quantile(0.9,_a_1),_range_query_with_100_steps-8                 33.46Ki ± 0%   24.93Ki ± 0%  -25.51% (p=0.002 n=6)
Query/quantile(0.9,_a_1),_range_query_with_1000_steps-8               116.33Ki ± 0%   55.23Ki ± 1%  -52.53% (p=0.002 n=6)
Query/quantile(0.9,_a_100),_instant_query-8                            190.1Ki ± 0%   153.1Ki ± 0%  -19.46% (p=0.002 n=6)
Query/quantile(0.9,_a_100),_range_query_with_100_steps-8              1794.3Ki ± 0%   213.6Ki ± 0%  -88.10% (p=0.002 n=6)
Query/quantile(0.9,_a_100),_range_query_with_1000_steps-8            16159.6Ki ± 0%   685.4Ki ± 1%  -95.76% (p=0.002 n=6)
Query/quantile(0.9,_a_2000),_instant_query-8                           3.328Mi ± 0%   2.631Mi ± 0%  -20.95% (p=0.002 n=6)
Query/quantile(0.9,_a_2000),_range_query_with_100_steps-8             35.076Mi ± 0%   3.744Mi ± 0%  -89.33% (p=0.002 n=6)
Query/quantile(0.9,_a_2000),_range_query_with_1000_steps-8            317.39Mi ± 0%   11.90Mi ± 1%  -96.25% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_1),_instant_query-8                      23.70Ki ± 0%   22.26Ki ± 0%   -6.08% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_1),_range_query_with_100_steps-8         33.59Ki ± 0%   26.29Ki ± 0%  -21.72% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_1),_range_query_with_1000_steps-8       116.58Ki ± 0%   56.56Ki ± 1%  -51.49% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_100),_instant_query-8                    224.7Ki ± 0%   174.0Ki ± 0%  -22.58% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_100),_range_query_with_100_steps-8       988.7Ki ± 0%   552.7Ki ± 0%  -44.10% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_100),_range_query_with_1000_steps-8      7.599Mi ± 0%   3.346Mi ± 0%  -55.97% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_2000),_instant_query-8                   4.124Mi ± 0%   3.265Mi ± 1%  -20.83% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_2000),_range_query_with_100_steps-8      18.92Mi ± 0%   10.64Mi ± 1%  -43.76% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_2000),_range_query_with_1000_steps-8    214.18Mi ± 0%   69.27Mi ± 1%  -67.66% (p=0.002 n=6)
geomean                                                                1.223Mi        486.0Ki       -61.21%

                                                                   │  Prometheus   │               Mimir                │
                                                                   │   allocs/op   │  allocs/op   vs base               │
Query/quantile(0.9,_a_1),_instant_query-8                               420.0 ± 0%    356.0 ± 0%  -15.24% (p=0.002 n=6)
Query/quantile(0.9,_a_1),_range_query_with_100_steps-8                  630.0 ± 0%    365.0 ± 0%  -42.06% (p=0.002 n=6)
Query/quantile(0.9,_a_1),_range_query_with_1000_steps-8                2457.0 ± 0%    392.0 ± 0%  -84.05% (p=0.002 n=6)
Query/quantile(0.9,_a_100),_instant_query-8                            2.341k ± 0%   2.264k ± 0%   -3.29% (p=0.002 n=6)
Query/quantile(0.9,_a_100),_range_query_with_100_steps-8               3.669k ± 0%   2.676k ± 0%  -27.06% (p=0.002 n=6)
Query/quantile(0.9,_a_100),_range_query_with_1000_steps-8             14.480k ± 0%   5.304k ± 0%  -63.37% (p=0.002 n=6)
Query/quantile(0.9,_a_2000),_instant_query-8                           38.95k ± 0%   38.87k ± 0%   -0.22% (p=0.002 n=6)
Query/quantile(0.9,_a_2000),_range_query_with_100_steps-8              50.45k ± 0%   46.90k ± 0%   -7.02% (p=0.002 n=6)
Query/quantile(0.9,_a_2000),_range_query_with_1000_steps-8            115.27k ± 0%   99.03k ± 0%  -14.09% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_1),_instant_query-8                       428.0 ± 0%    363.0 ± 0%  -15.19% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_1),_range_query_with_100_steps-8          638.0 ± 0%    372.0 ± 0%  -41.69% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_1),_range_query_with_1000_steps-8        2465.0 ± 0%    399.0 ± 0%  -83.81% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_100),_instant_query-8                    2.664k ± 0%   2.579k ± 0%   -3.19% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_100),_range_query_with_100_steps-8      23.086k ± 0%   3.107k ± 0%  -86.54% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_100),_range_query_with_1000_steps-8    205.724k ± 0%   5.785k ± 0%  -97.19% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_2000),_instant_query-8                   45.07k ± 0%   44.95k ± 0%   -0.27% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_2000),_range_query_with_100_steps-8     453.27k ± 0%   55.25k ± 0%  -87.81% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_2000),_range_query_with_1000_steps-8    4115.0k ± 0%   109.2k ± 0%  -97.35% (p=0.002 n=6)
geomean                                                                11.54k        4.245k       -63.21%

                                                                   │  Prometheus   │                Mimir                │
                                                                   │       B       │      B        vs base               │
Query/quantile(0.9,_a_1),_instant_query-8                             69.27Mi ± 1%   68.80Mi ± 2%        ~ (p=0.310 n=6)
Query/quantile(0.9,_a_1),_range_query_with_100_steps-8                69.24Mi ± 1%   69.59Mi ± 1%        ~ (p=0.394 n=6)
Query/quantile(0.9,_a_1),_range_query_with_1000_steps-8               68.30Mi ± 1%   67.81Mi ± 1%        ~ (p=0.310 n=6)
Query/quantile(0.9,_a_100),_instant_query-8                           66.78Mi ± 1%   66.62Mi ± 1%        ~ (p=0.699 n=6)
Query/quantile(0.9,_a_100),_range_query_with_100_steps-8              67.27Mi ± 1%   67.25Mi ± 1%        ~ (p=1.000 n=6)
Query/quantile(0.9,_a_100),_range_query_with_1000_steps-8             70.37Mi ± 1%   68.24Mi ± 1%   -3.03% (p=0.002 n=6)
Query/quantile(0.9,_a_2000),_instant_query-8                          68.12Mi ± 1%   67.51Mi ± 1%        ~ (p=0.180 n=6)
Query/quantile(0.9,_a_2000),_range_query_with_100_steps-8             75.13Mi ± 1%   69.97Mi ± 2%   -6.87% (p=0.002 n=6)
Query/quantile(0.9,_a_2000),_range_query_with_1000_steps-8           131.66Mi ± 1%   95.29Mi ± 1%  -27.63% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_1),_instant_query-8                     69.64Mi ± 1%   69.04Mi ± 2%        ~ (p=0.457 n=6)
Query/quantile_by_(l)_(0.1,_b_1),_range_query_with_100_steps-8        69.09Mi ± 2%   69.29Mi ± 1%        ~ (p=0.589 n=6)
Query/quantile_by_(l)_(0.1,_b_1),_range_query_with_1000_steps-8       67.82Mi ± 1%   67.50Mi ± 1%        ~ (p=0.485 n=6)
Query/quantile_by_(l)_(0.1,_b_100),_instant_query-8                   66.91Mi ± 1%   66.76Mi ± 2%        ~ (p=0.485 n=6)
Query/quantile_by_(l)_(0.1,_b_100),_range_query_with_100_steps-8      67.41Mi ± 2%   67.63Mi ± 1%        ~ (p=0.699 n=6)
Query/quantile_by_(l)_(0.1,_b_100),_range_query_with_1000_steps-8     73.80Mi ± 1%   73.80Mi ± 1%        ~ (p=0.937 n=6)
Query/quantile_by_(l)_(0.1,_b_2000),_instant_query-8                  68.50Mi ± 2%   68.52Mi ± 2%        ~ (p=0.853 n=6)
Query/quantile_by_(l)_(0.1,_b_2000),_range_query_with_100_steps-8     83.99Mi ± 1%   87.72Mi ± 1%   +4.44% (p=0.002 n=6)
Query/quantile_by_(l)_(0.1,_b_2000),_range_query_with_1000_steps-8    188.4Mi ± 1%   196.7Mi ± 7%        ~ (p=0.394 n=6)
geomean                                                               76.59Mi        75.03Mi        -2.03%
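
For reading the "vs base" column: a minimal sketch of the arithmetic, assuming the delta is the relative change of the Mimir figure against the Prometheus figure. `pctDelta` is a hypothetical helper; recomputing from the rounded values printed in the tables can differ from the printed delta in the last digit.

```go
package main

// pctDelta returns the relative change from base to current, in percent.
// A negative result means current is lower (faster, or fewer bytes/allocs).
func pctDelta(base, current float64) float64 {
	return (current - base) / base * 100
}
```

For example, `pctDelta(674.2, 639.4)` is roughly -5.16, which matches the first sec/op row.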


Contributor

@charleskorn charleskorn left a comment

Overall LGTM. Noticed something weird in the test cases though.

}

func (q *QuantileAggregation) Close() {
if q.Aggregation.ParamData.Samples != nil {
Contributor

It feels weird that we're reaching into q.Aggregation here - I think it'd be OK to move this into Aggregation.Close().

Contributor Author

I agree it's a bit odd, but we also reach into here to set up the ParamData. The downside of moving it into Aggregation.Close is that it'll be unnecessarily checked for every other aggregation type.

defer types.PutInstantVectorSeriesData(data, memoryConsumptionTracker)

if len(data.Histograms) > 0 {
emitAnnotationFunc(func(_ string, expressionPosition posrange.PositionRange) error {
Contributor

[nit] I realise this was what it was called previously, but I think emitAnnotation would be a better name for this parameter, with EmitAnnotationFunc remaining as the type.

(could be something for a follow-up PR)

Contributor Author

Will leave this for a follow-up.

@jhesketh jhesketh merged commit cccd3cb into grafana:main Mar 7, 2025
28 checks passed