diff --git a/CHANGELOG.md b/CHANGELOG.md index efcc4dcd9ff..019e1e5bec9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -121,6 +121,8 @@ ### Documentation +* [FEATURE] Add tuning documentation. #9978 + ### Tools * [FEATURE] `splitblocks`: add new tool to split blocks larger than a specified duration into multiple blocks. #9517, #9779 @@ -657,7 +659,7 @@ * [ENHANCEMENT] Distributor: support disabling metric relabel rules per-tenant via the flag `-distributor.metric-relabeling-enabled` or associated YAML. #6970 * [ENHANCEMENT] Distributor: `-distributor.remote-timeout` is now accounted from the first ingester push request being sent. #6972 * [ENHANCEMENT] Storage Provider: `-.s3.sts-endpoint` sets a custom endpoint for AWS Security Token Service (AWS STS) in s3 storage provider. #6172 -* [ENHANCEMENT] Querier: add `cortex_querier_queries_storage_type_total ` metric that indicates how many queries have executed for a source, ingesters or store-gateways. Add `cortex_querier_query_storegateway_chunks_total` metric to count the number of chunks fetched from a store gateway. #7099,#7145 +* [ENHANCEMENT] Querier: add `cortex_querier_queries_storage_type_total` metric that indicates how many queries have executed for a source, ingesters or store-gateways. Add `cortex_querier_query_storegateway_chunks_total` metric to count the number of chunks fetched from a store gateway. #7099,#7145 * [ENHANCEMENT] Query-frontend: add experimental support for sharding active series queries via `-query-frontend.shard-active-series-queries`. #6784 * [ENHANCEMENT] Distributor: set `-distributor.reusable-ingester-push-workers=2000` by default and mark feature as `advanced`. #7128 * [ENHANCEMENT] All: set `-server.grpc.num-workers=100` by default and mark feature as `advanced`. #7131 @@ -1594,9 +1596,9 @@ Querying with using `{__mimir_storage__="ephemeral"}` selector no longer works. 
* `cortex_ingester_queried_ephemeral_samples` * `cortex_ingester_queried_ephemeral_series` * [CHANGE] Store-gateway: use mmap-less index-header reader by default and remove mmap-based index header reader. The following flags have changed: #4280 - * `-blocks-storage.bucket-store.index-header.map-populate-enabled` has been removed - * `-blocks-storage.bucket-store.index-header.stream-reader-enabled` has been removed - * `-blocks-storage.bucket-store.index-header.stream-reader-max-idle-file-handles` has been renamed to `-blocks-storage.bucket-store.index-header.max-idle-file-handles`, and the corresponding configuration file option has been renamed from `stream_reader_max_idle_file_handles` to `max_idle_file_handles` + * `-blocks-storage.bucket-store.index-header.map-populate-enabled` has been removed + * `-blocks-storage.bucket-store.index-header.stream-reader-enabled` has been removed + * `-blocks-storage.bucket-store.index-header.stream-reader-max-idle-file-handles` has been renamed to `-blocks-storage.bucket-store.index-header.max-idle-file-handles`, and the corresponding configuration file option has been renamed from `stream_reader_max_idle_file_handles` to `max_idle_file_handles` * [CHANGE] Store-gateway: the streaming store-gateway is now enabled by default. The new default setting for `-blocks-storage.bucket-store.batch-series-size` is `5000`. #4330 * [CHANGE] Compactor: the configuration parameter `-compactor.consistency-delay` has been deprecated and will be removed in Mimir 2.9. #4409 * [CHANGE] Store-gateway: the configuration parameter `-blocks-storage.bucket-store.consistency-delay` has been deprecated and will be removed in Mimir 2.9. #4409 @@ -1895,6 +1897,7 @@ Querying with using `{__mimir_storage__="ephemeral"}` selector no longer works. * Reviewing any possible extensions to `genericBlocksStorageConfig`, `rulerClientConfig` and `alertmanagerStorageClientConfig` and moving them to the corresponding new options. 
* Renaming the alertmanager's bucket name configuration from provider-specific to the new `alertmanager_storage_bucket_name` key. * [CHANGE] The `overrides-exporter.libsonnet` file is now always imported. The overrides-exporter can be enabled in jsonnet setting the following: #3379 + ```jsonnet { _config+:: { @@ -1902,7 +1905,9 @@ Querying with using `{__mimir_storage__="ephemeral"}` selector no longer works. } } ``` + * [FEATURE] Added support for experimental read-write deployment mode. Enabling the read-write deployment mode on a existing Mimir cluster is a destructive operation, because the cluster will be re-created. If you're creating a new Mimir cluster, you can deploy it in read-write mode adding the following configuration: #3379 #3475 #3405 + ```jsonnet { _config+:: { @@ -1915,6 +1920,7 @@ Querying with using `{__mimir_storage__="ephemeral"}` selector no longer works. } } ``` + * [ENHANCEMENT] Add autoscaling support to the `mimir-read` component when running the read-write-deployment model. #3419 * [ENHANCEMENT] Added `$._config.usageStatsConfig` to track the installation mode via the anonymous usage statistics. #3294 * [ENHANCEMENT] The query-tee node port (`$._config.query_tee_node_port`) is now optional. #3272 @@ -2096,11 +2102,12 @@ Querying with using `{__mimir_storage__="ephemeral"}` selector no longer works. ### Tools -- [BUGFIX] trafficdump: Fixed panic occurring when `-success-only=true` and the captured request failed. #2863 +* [BUGFIX] trafficdump: Fixed panic occurring when `-success-only=true` and the captured request failed. #2863 ## 2.3.1 ### Grafana Mimir + * [BUGFIX] Query-frontend: query sharding took exponential time to map binary expressions. #3027 * [BUGFIX] Distributor: Stop panics on OTLP endpoint when a single metric has multiple timeseries. #3040 @@ -2131,8 +2138,8 @@ Querying with using `{__mimir_storage__="ephemeral"}` selector no longer works. 
* [CHANGE] Compactor: `-compactor.partial-block-deletion-delay` must either be set to 0 (to disable partial blocks deletion) or a value higher than `4h`. #2787 * [CHANGE] Query-frontend: CLI flag `-query-frontend.align-querier-with-step` has been deprecated. Please use `-query-frontend.align-queries-with-step` instead. #2840 * [FEATURE] Compactor: Adds the ability to delete partial blocks after a configurable delay. This option can be configured per tenant. #2285 - - `-compactor.partial-block-deletion-delay`, as a duration string, allows you to set the delay since a partial block has been modified before marking it for deletion. A value of `0`, the default, disables this feature. - - The metric `cortex_compactor_blocks_marked_for_deletion_total` has a new value for the `reason` label `reason="partial"`, when a block deletion marker is triggered by the partial block deletion delay. + * `-compactor.partial-block-deletion-delay`, as a duration string, allows you to set the delay since a partial block has been modified before marking it for deletion. A value of `0`, the default, disables this feature. + * The metric `cortex_compactor_blocks_marked_for_deletion_total` has a new value for the `reason` label `reason="partial"`, when a block deletion marker is triggered by the partial block deletion delay. * [FEATURE] Querier: enabled support for queries with negative offsets, which are not cached in the query results cache. #2429 * [FEATURE] EXPERIMENTAL: OpenTelemetry Metrics ingestion path on `/otlp/v1/metrics`. #695 #2436 #2461 * [FEATURE] Querier: Added support for tenant federation to metric metadata endpoint. #2467 @@ -2209,16 +2216,16 @@ Querying with using `{__mimir_storage__="ephemeral"}` selector no longer works. * [CHANGE] query-scheduler is enabled by default. We advise to deploy the query-scheduler to improve the scalability of the query-frontend. 
#2431 * [CHANGE] Replaced anti-affinity rules with pod topology spread constraints for distributor, query-frontend, querier and ruler. #2517 - - The following configuration options have been removed: - - `distributor_allow_multiple_replicas_on_same_node` - - `query_frontend_allow_multiple_replicas_on_same_node` - - `querier_allow_multiple_replicas_on_same_node` - - `ruler_allow_multiple_replicas_on_same_node` - - The following configuration options have been added: - - `distributor_topology_spread_max_skew` - - `query_frontend_topology_spread_max_skew` - - `querier_topology_spread_max_skew` - - `ruler_topology_spread_max_skew` + * The following configuration options have been removed: + * `distributor_allow_multiple_replicas_on_same_node` + * `query_frontend_allow_multiple_replicas_on_same_node` + * `querier_allow_multiple_replicas_on_same_node` + * `ruler_allow_multiple_replicas_on_same_node` + * The following configuration options have been added: + * `distributor_topology_spread_max_skew` + * `query_frontend_topology_spread_max_skew` + * `querier_topology_spread_max_skew` + * `ruler_topology_spread_max_skew` * [CHANGE] Change `max_global_series_per_metric` to 0 in all plans, and as a default value. #2669 * [FEATURE] Memberlist: added support for experimental memberlist cluster label, through the jsonnet configuration options `memberlist_cluster_label` and `memberlist_cluster_label_verification_disabled`. #2349 * [FEATURE] Added ruler-querier autoscaling support. It requires [KEDA](https://keda.sh) installed in the Kubernetes cluster. Ruler-querier autoscaler can be enabled and configure through the following options in the jsonnet config: #2545 @@ -2263,12 +2270,12 @@ Querying with using `{__mimir_storage__="ephemeral"}` selector no longer works. * [CHANGE] Increased default configuration for `-server.grpc-max-recv-msg-size-bytes` and `-server.grpc-max-send-msg-size-bytes` from 4MB to 100MB. 
#1884 * [CHANGE] Default values have changed for the following settings. This improves query performance for recent data (within 12h) by only reading from ingesters: #1909 #1921 - - `-blocks-storage.bucket-store.ignore-blocks-within` now defaults to `10h` (previously `0`) - - `-querier.query-store-after` now defaults to `12h` (previously `0`) -* [CHANGE] Alertmanager: removed support for migrating local files from Cortex 1.8 or earlier. Related to original Cortex PR https://github.com/cortexproject/cortex/pull/3910. #2253 + * `-blocks-storage.bucket-store.ignore-blocks-within` now defaults to `10h` (previously `0`) + * `-querier.query-store-after` now defaults to `12h` (previously `0`) +* [CHANGE] Alertmanager: removed support for migrating local files from Cortex 1.8 or earlier. Related to original Cortex PR <https://github.com/cortexproject/cortex/pull/3910>. #2253 * [CHANGE] The following settings are now classified as advanced because the defaults should work for most users and tuning them requires in-depth knowledge of how the read path works: #1929 - - `-querier.query-ingesters-within` - - `-querier.query-store-after` + * `-querier.query-ingesters-within` + * `-querier.query-store-after` * [CHANGE] Config flag category overrides can be set dynamically at runtime. #1934 * [CHANGE] Ingester: deprecated `-ingester.ring.join-after`. Mimir now behaves as this setting is always set to 0s. This configuration option will be removed in Mimir 2.4.0. #1965 * [CHANGE] Blocks uploaded by ingester no longer contain `__org_id__` label. Compactor now ignores this label and will compact blocks with and without this label together. `mimirconvert` tool will remove the label from blocks as "unknown" label. #1972 @@ -2402,9 +2409,9 @@ Querying with using `{__mimir_storage__="ephemeral"}` selector no longer works. * [CHANGE] Compactor: No longer upload debug meta files to object storage.
#1257 * [CHANGE] Default values have changed for the following settings: #1547 - - `-alertmanager.alertmanager-client.grpc-max-recv-msg-size` now defaults to 100 MiB (previously was not configurable and set to 16 MiB) - - `-alertmanager.alertmanager-client.grpc-max-send-msg-size` now defaults to 100 MiB (previously was not configurable and set to 4 MiB) - - `-alertmanager.max-recv-msg-size` now defaults to 100 MiB (previously was 16 MiB) + * `-alertmanager.alertmanager-client.grpc-max-recv-msg-size` now defaults to 100 MiB (previously was not configurable and set to 16 MiB) + * `-alertmanager.alertmanager-client.grpc-max-send-msg-size` now defaults to 100 MiB (previously was not configurable and set to 4 MiB) + * `-alertmanager.max-recv-msg-size` now defaults to 100 MiB (previously was 16 MiB) * [CHANGE] Ingester: Add `user` label to metrics `cortex_ingester_ingested_samples_total` and `cortex_ingester_ingested_samples_failures_total`. #1533 * [CHANGE] Ingester: Changed `-blocks-storage.tsdb.isolation-enabled` default from `true` to `false`. The config option has also been deprecated and will be removed in 2 minor version. #1655 * [CHANGE] Query-frontend: results cache keys are now versioned, this will cause cache to be re-filled when rolling out this version. #1631 @@ -2453,15 +2460,15 @@ Querying with using `{__mimir_storage__="ephemeral"}` selector no longer works. * [FEATURE] Ingester: Active series custom trackers now supports runtime tenant-specific overrides. The configuration has been moved to limit config, the ingester config has been deprecated. #1188 * [ENHANCEMENT] Alertmanager API: Concurrency limit for GET requests is now configurable using `-alertmanager.max-concurrent-get-requests-per-tenant`. 
#1547 * [ENHANCEMENT] Alertmanager: Added the ability to configure additional gRPC client settings for the Alertmanager distributor #1547 - - `-alertmanager.alertmanager-client.backoff-max-period` - - `-alertmanager.alertmanager-client.backoff-min-period` - - `-alertmanager.alertmanager-client.backoff-on-ratelimits` - - `-alertmanager.alertmanager-client.backoff-retries` - - `-alertmanager.alertmanager-client.grpc-client-rate-limit` - - `-alertmanager.alertmanager-client.grpc-client-rate-limit-burst` - - `-alertmanager.alertmanager-client.grpc-compression` - - `-alertmanager.alertmanager-client.grpc-max-recv-msg-size` - - `-alertmanager.alertmanager-client.grpc-max-send-msg-size` + * `-alertmanager.alertmanager-client.backoff-max-period` + * `-alertmanager.alertmanager-client.backoff-min-period` + * `-alertmanager.alertmanager-client.backoff-on-ratelimits` + * `-alertmanager.alertmanager-client.backoff-retries` + * `-alertmanager.alertmanager-client.grpc-client-rate-limit` + * `-alertmanager.alertmanager-client.grpc-client-rate-limit-burst` + * `-alertmanager.alertmanager-client.grpc-compression` + * `-alertmanager.alertmanager-client.grpc-max-recv-msg-size` + * `-alertmanager.alertmanager-client.grpc-max-send-msg-size` * [ENHANCEMENT] Ruler: Add more detailed query information to ruler query stats logging. #1411 * [ENHANCEMENT] Admin: Admin API now has some styling. #1482 #1549 #1821 #1824 * [ENHANCEMENT] Alertmanager: added `insight=true` field to alertmanager dispatch logs. #1379 @@ -2506,9 +2513,9 @@ Querying with using `{__mimir_storage__="ephemeral"}` selector no longer works. 
* Writes Networking from `681cd62b680b7154811fe73af55dcfd4` to `978c1cb452585c96697a238eaac7fe2d` * Writes Resources from `c0464f0d8bd026f776c9006b0591bb0b` to `bc9160e50b52e89e0e49c840fea3d379` * [FEATURE] Alerts: added the following alerts on `mimir-continuous-test` tool: #1676 - - `MimirContinuousTestNotRunningOnWrites` - - `MimirContinuousTestNotRunningOnReads` - - `MimirContinuousTestFailed` + * `MimirContinuousTestNotRunningOnWrites` + * `MimirContinuousTestNotRunningOnReads` + * `MimirContinuousTestFailed` * [ENHANCEMENT] Added `per_cluster_label` support to allow to change the label name used to differentiate between Kubernetes clusters. #1651 * [ENHANCEMENT] Dashboards: Show QPS and latency of the Alertmanager Distributor. #1696 * [ENHANCEMENT] Playbooks: Add Alertmanager suggestions for `MimirRequestErrors` and `MimirRequestLatency` #1702 @@ -2523,6 +2530,7 @@ Querying with using `{__mimir_storage__="ephemeral"}` selector no longer works. ### Jsonnet * [FEATURE] Added support for `mimir-continuous-test`. To deploy `mimir-continuous-test` you can use the following configuration: #1675 #1850 + ```jsonnet _config+: { continuous_test_enabled: true, @@ -2531,6 +2539,7 @@ Querying with using `{__mimir_storage__="ephemeral"}` selector no longer works. continuous_test_read_endpoint: 'http://type-read-path-hostname/prometheus', }, ``` + * [ENHANCEMENT] Ingester anti-affinity can now be disabled by using `ingester_allow_multiple_replicas_on_same_node` configuration key. #1581 * [ENHANCEMENT] Added `node_selector` configuration option to select Kubernetes nodes where Mimir should run. #1596 * [ENHANCEMENT] Alertmanager: Added a `PodDisruptionBudget` of `withMaxUnavailable = 1`, to ensure we maintain quorum during rollouts. 
#1683 @@ -2733,7 +2742,7 @@ _Changes since Cortex 1.10.0._ | `//rules/{namespace}` | `/api/v1/rules/{namespace}` (see below) | `/config/v1/rules/{namespace}` | | `/ruler_ring` | `/ruler/ring` | | - > __Note:__ The `/api/v1/rules/**` endpoints are considered deprecated with Mimir 2.0.0 and will be removed + > **Note:** The `/api/v1/rules/**` endpoints are considered deprecated with Mimir 2.0.0 and will be removed in Mimir 2.2.0. After upgrading to 2.0.0 we recommend switching uses to the equivalent `//config/v1/**` endpoints that Mimir 2.0.0 introduces. @@ -3317,13 +3326,16 @@ _Changes since `grafana/cortex-jsonnet` `1.9.0`._ * `-compactor.max-closing-blocks-concurrency=2` * `-compactor.symbols-flushers-concurrency=4` * The following per-tenant overrides have been set on `super_user` and `mega_user` classes: + ``` compactor_split_and_merge_shards: 2, compactor_tenant_shard_size: 2, compactor_split_groups: 2, ``` + * [CHANGE] The entrypoint file to include has been renamed from `cortex.libsonnet` to `mimir.libsonnet`. #897 * [CHANGE] The default image config field has been renamed from `cortex` to `mimir`. #896 + ``` { _images+:: { @@ -3331,6 +3343,7 @@ _Changes since `grafana/cortex-jsonnet` `1.9.0`._ }, } ``` + * [CHANGE] Removed `cortex_` prefix from config fields. #898 * The following config fields have been renamed: * `cortex_bucket_index_enabled` renamed to `bucket_index_enabled` @@ -3367,6 +3380,7 @@ _Changes since `grafana/cortex-jsonnet` `1.9.0`._ * [CHANGE] gossip.libsonnet has been renamed to memberlist.libsonnet, and is now imported by default. Use of memberlist for ring is enabled by setting `_config.memberlist_ring_enabled` to true. #1526 * [FEATURE] Added query sharding support. It can be enabled setting `cortex_query_sharding_enabled: true` in the `_config` object. #653 * [FEATURE] Added shuffle-sharding support. 
It can be enabled and configured using the following config: #902 + ``` _config+:: { shuffle_sharding:: { @@ -3378,10 +3392,11 @@ _Changes since `grafana/cortex-jsonnet` `1.9.0`._ }, } ``` + * [FEATURE] Added multi-zone ingesters and store-gateways support. #1352 #1552 * [ENHANCEMENT] Add overrides config to compactor. This allows setting retention configs per user. [#386](https://github.com/grafana/cortex-jsonnet/pull/386) * [ENHANCEMENT] Added 256MB memory ballast to querier. [#369](https://github.com/grafana/cortex-jsonnet/pull/369) -* [ENHANCEMENT] Update `etcd-operator` to latest version (see https://github.com/grafana/jsonnet-libs/pull/480). [#263](https://github.com/grafana/cortex-jsonnet/pull/263) +* [ENHANCEMENT] Update `etcd-operator` to latest version (see <https://github.com/grafana/jsonnet-libs/pull/480>). [#263](https://github.com/grafana/cortex-jsonnet/pull/263) * [ENHANCEMENT] Add support for Azure storage in Alertmanager configuration. [#381](https://github.com/grafana/cortex-jsonnet/pull/381) * [ENHANCEMENT] Add support for running Alertmanager in sharding mode. [#394](https://github.com/grafana/cortex-jsonnet/pull/394) * [ENHANCEMENT] Allow to customize PromQL engine settings via `queryEngineConfig`. [#399](https://github.com/grafana/cortex-jsonnet/pull/399) @@ -3413,9 +3428,9 @@ _Changes since cortextool `0.10.7`._ * [CHANGE] Change `cortex` backend to `mimir`. #883 * [CHANGE] Do not publish `mimirtool` binary for 386 windows architecture. #1263 * [CHANGE] `analyse` command has been renamed to `analyze`. #1318 -* [FEATURE] Support Arm64 on Darwin for all binaries (benchtool etc). https://github.com/grafana/cortex-tools/pull/215 +* [FEATURE] Support Arm64 on Darwin for all binaries (benchtool etc). <https://github.com/grafana/cortex-tools/pull/215> * [ENHANCEMENT] Correctly support federated rules. #823 -* [BUGFIX] Fix `cortextool rules` legends displaying wrong symbols for updates and deletions. https://github.com/grafana/cortex-tools/pull/226 +* [BUGFIX] Fix `cortextool rules` legends displaying wrong symbols for updates and deletions. <https://github.com/grafana/cortex-tools/pull/226>
### Query-tee diff --git a/docs/sources/mimir/configure/tuning.md b/docs/sources/mimir/configure/tuning.md new file mode 100644 index 00000000000..daa1890aa13 --- /dev/null +++ b/docs/sources/mimir/configure/tuning.md @@ -0,0 +1,47 @@ +--- +description: Learn how to tune Grafana Mimir according to your use cases. +menuTitle: Tuning +title: Tune Grafana Mimir according to your use cases +weight: 110 +--- + +# Tune Grafana Mimir according to your use cases + +Grafana Mimir comes with sensible default settings. Those settings are a good place to start for most use cases. +However, for some use cases Grafana Mimir requires appropriate tuning to reach optimal performance. This page collects the known tuning options for those cases. + +## Heavy multi-tenancy + +For each tenant, Grafana Mimir opens and maintains a TSDB in memory. With a significant number of tenants, the memory overhead might become prohibitive. +To reduce the associated overhead, consider the following options: + +- Reduce `-blocks-storage.tsdb.head-chunks-write-buffer-size-bytes`, default `4MB`. For example, try `1MB` or `128KB`. +- Reduce `-blocks-storage.tsdb.stripe-size`, default `16384`. For example, try `256` or even `64`. +- Configure [shuffle sharding]({{< relref "./configure-shuffle-sharding" >}}). + +## Compression + +Depending on the CPU model used in the underlying infrastructure, compression for both WALs and gRPC communication might consume a significant part of the available CPU resources. +To identify such cases, profile with a tool like [Grafana Pyroscope](https://grafana.com/docs/pyroscope/latest/). + +To reduce the resource consumption, consider the following options: + +- Make sure `wal_compression_enabled` is not enabled. +- Make sure `grpc_compression` is either off, which is the default, or set to `snappy`. `gzip` consumes more CPU than `snappy`.
However, disabling `grpc_compression` implies more network traffic, which in turn might increase the total cost of ownership (TCO) of running Grafana Mimir. + +If users must use compression, for example to fit within the available network bandwidth, they might consider using nodes with more powerful CPUs. This implies an increase in the TCO. + +## Increase the cache size on a budget + +Grafana Mimir relies on Memcached for its caches. By default, Memcached stores data only in memory. +Similar to the work of the Ops team behind Grafana Loki, one could enable the `extstore` feature of Memcached, which extends the cache onto disk. + +See: [how we scaled Grafana Cloud Logs' Memcached cluster to 50TB and improved reliability](https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cloud-logs-memcached-cluster-to-50tb-and-improved-reliability/) + +## Periodic latency spikes when cutting blocks + +Depending on the workload, users might witness latency spikes when Grafana Mimir cuts blocks. +To reduce the impact of this behavior, consider the following options: + +- Upgrade to `2.15+`. See: +- Reduce `-blocks-storage.tsdb.block-ranges-period`, default `2h`. For example, try `1h`.