Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When -use-thanos-objstore=true .ruler_storage config partially overrides .ruler.storage ("Ruler storage is not configured" error); ruler must be configured in both places to start #16543

Open
ringerc opened this issue Mar 4, 2025 · 4 comments · May be fixed by #16555

Comments

@ringerc
Copy link

ringerc commented Mar 4, 2025

Describe the bug

When -use-thanos-objstore=true is passed to Loki to enable the new thanos object store client, the ruler storage configuration is read from a .ruler_storage block instead of the .ruler.storage block used when the thanos object store is not enabled. This is true even when Loki is configured to use only local storage.

Additionally, the .ruler.storage block is still used even when thanos object store is enabled (and thus .ruler_storage should be used instead):

  • to find the path to mkdir the rules directory during startup; and
  • to check whether Ruler is configured and should be started as part of the all or backend targets.

So when enabling the Thanos client the .ruler.storage block must still be present and contain a valid directory path, otherwise Loki will log "Ruler storage is not configured" and fail to start the Ruler.

So the .ruler.storage block is not wholly overridden by enabling the Thanos client.

When the Thanos client is enabled, .ruler_storage and .ruler.storage must BOTH be present, otherwise the the all and backend targets will log Ruler storage is not configured and skip ruler startup. The actual configuration is taken only from .ruler_storage. This is very confusing. There is no warning logged that .ruler.storage` is defined but will be ignored.

It looks like .ruler_storage is supposed to work like .storage_config.object_store; overriding the legacy config when -use-thanos-objstore=true. But it doesn't do so completely because the startup mkdir for the rules directory and the startup check for whether the ruler is configured both use the legacy config in .ruler.storage, even when using the thanos store.

The docs for the respective blocks are not at all clear about their mutually exclusive use with and without thanos client mode either.

The error when .ruler.storage 's path doesn't exist is:

mkdir /ruler_rules: permission denied
error initialising module: ruler-storage
github.com/grafana/dskit/modules.(*Manager).initModule
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:138
github.com/grafana/dskit/modules.(*Manager).InitModuleServices
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run
	/src/loki/pkg/loki/loki.go:495
main.main
	/src/loki/cmd/loki/main.go:129
runtime.main
	/usr/local/go/src/runtime/proc.go:272
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1700

so it's coming from https://github.com/grafana/dskit when invoked by ruler-storage.

Illustrated in config excerpts:

ruler:
  # this is mostly ignored if -use-thanos-objstore=true is passed, but for 'all' and 'backend' targets
  # it must be present and have some valid-looking storage configuration or Loki will log "Ruler storage is not configured"
  # and skip starting the Ruler. Also, Loki will make a startup-time attempt to mkdir the .ruler.storage.local.directory even
  # though the path actually used by ruler when thanos client is enabled will be the one in .ruler_storage.
  storage:
    type: local
    local:
      # if defined this must be mkdir()able even when -use-thanos-objstore=true, though this config is otherwise ignored
      directory: /ruler_rules

# this is only used if  -use-thanos-objstore=true in which case it mostly overrides .ruler.storage, except for the startup path check
# and the decision on whether to start ruler in the 'backend' and 'all' targets. This is the "real" config, but the .ruler.storage
# legacy config cannot be deleted.
ruler_storage:
   backend: local
   local:
     directory: /ruler_storage_rules

So there are a two bugs here, and some docs confusion/UX issues:

  • startup path for all and backend uses only ruler.storage when checking whether ruler is configured; ignores .ruler_storage and the thanos client mode
  • startup path for all and backend uses only ruler.storage when trying to mkdir() the rules directory
  • docs don't make it at all clear that .ruler.storage is (aside from the bugs above) ignored when thanos client is enabled, in favour of .ruler_storage
  • Docs say .ruler.storage is deprecated and to use .ruler_storage instead, but actually .ruler_storage gets ignored unless the thanos client is enabled.

To Reproduce

Use the attached shell script and loki configuration to explore the permutations of the configuration in a Docker container. Tested with Loki 3.3.2 and with Loki 3.4.2. See comments in loki-config.yaml and the script for details. It'll print the docker cli including Loki args when it runs.

Files:

Run

mv validate-config-sh.txt validate-config.sh
mv loki-config-yaml.txt loki-config.yaml

to work around GH's bizarre file extension limits. Then:

  • Run ./validate-config.sh -t all -m run to see the non-thanos-store behaviour, where .ruler.storage.local.directory is used for everything
  • Run ./validate-config.sh -t all -m -o run to add -use-thanos-objstore=true but not define a .ruler_storage block. Loki will log Ruler storage is not configured; ruler will not be started. even though .ruler.storage.config.directory is still present.
  • Run ./validate-config.sh -t all -m -o -r run to inject a .ruler_storage block. The ruler will now start, but will complain msg="unable to list rules" err="unable to read dir /ruler_storage_rules: open /ruler_storage_rules: no such file or directory" because it's using the value from .ruler_storage.local.directory not the one for .ruler.storage.local.directory, and we didn't add a tmpfs mount for that.
  • Remove the tmpfs mount for the .ruler.storage.local.directory path by omitting the -m flag: ./validate-config.sh -t all -o -r run. Loki will now error-exit with mkdir /ruler_rules: permission denied\nerror initialising module: ruler-storage because it's trying to use the .ruler.storage.local.directory path in its startup check, even though it'll use .ruler_storage.local.directory everywhere else, and should be ignoring this configuration.

Now delete the .ruler.storage block completely and re-run ./validate-config.sh -t all -o -r run. Even though it's using the thanos client and has a .ruler_storage block, Loki will log Ruler storage is not configured; ruler will not be started and will fail to start the Ruler.

But if you run ./validate-config.sh -t ruler -o -r run to request just the ruler target be run with the thanos client and ruler_storage block, it'll start fine, because it bypasses the broken logic in the all and backend targets that use the wrong configuration.

If you want to repro manually instead:

  • try starting Loki with -target=all -use-thanos-objstore=true with the original config file. It'll skip starting Ruler.
  • Add a .ruler_storage block. It'll start Ruler.
  • make .ruler.storage.local.directory point to something that Loki can't write to. Loki will try to create the dir and fail to start, even though it's otherwise unused.
  • Delete the .ruler.storage block completely. Ruler will again skip starting, even though there's a .ruler_storage block.
  • Now run --target=ruler instead of --target=all. It'll start fine, because the code that's using the wrong config block seems to be part of the all and backend targets' startup logic, not Ruler itself.

Expected behavior

I expected .ruler.storage to be used whether or not -use-thanos-objstore=true is passed, especially when I'm using local ruler storage anyway.

Alternately, when -use-thanos-objstore=true and .ruler.storage is present, I would have expected to see a "warning: .ruler.storage overridden by .ruler_storage when -use-thanos-objstore=true" and the .ruler.storage block completely ignored, rather than still being used for a mkdir on startup and to determine whether or not to start Ruler.

(A related bug is that .ruler_storage lacks support for named_stores, so it's also inconsistent with .storage_config.object_storage; see #16544 )

Environment:

  • Infrastructure: All (repro'd with local and k8s)
  • Deployment tool: N/A (repro'd with docker cli and with k8s)
  • Container image: grafana/loki:3.3.2 and grafana/loki:3.4.2 . Note that the image has Entrypoint set to /usr/bin/loki so it's not running any shell script before running Loki.

Screenshots, Promtail config, or terminal output

Inline, above.

See also a related issue with ruler storage configuration not supporting named object stores: #16544

Note that the docs say

ruler_storage:
  # The thanos_object_store_config block configures the connection to object
  # storage backend using thanos-io/objstore clients. This will become the
  # default way of configuring object store clients in future releases.
  # Currently this is opt-in and takes effect only when `-use-thanos-objstore`
  # is set to true.
  # The CLI flags prefix for this block configuration is: ruler-storage
  [<thanos_object_store_config>]

but they do not mention that this replaces the .ruler.storage block, which is ignored. Nor does .ruler.storage mention it's ignored and replaced by .ruler_storage when the thanos client is enabled.

@ringerc ringerc changed the title When -use-thanos-objstore=true .ruler_storage config partially overrides .ruler.storage ("Ruler storage is not configured" error) When -use-thanos-objstore=true .ruler_storage config partially overrides .ruler.storage ("Ruler storage is not configured" error); ruler must be configured in both places to start Mar 5, 2025
ringerc added a commit to ringerc/loki that referenced this issue Mar 5, 2025
Make the storage configuration documentation more explicit
that the Thanos object store client configuration is mututally
exclusive with the legacy configuration. When the Thanos client
is enabled, legacy configuration is all silently ignored, including
legacy named stores.

Document that the .ruler.storage configuration is only deprecated when
the thanos storage client is enabled. It is still the required, and
only, method of configuring the Ruler's storage when using the legacy
object store.

Document that .ruler.storage is mostly ignored when the thanos client is
enabled, even when only using local storage, but it must still be
present due to bug grafana#16543. .ruler_storage is where the actual
configuration is read from in thanos mode.

Warn that the ruler storage configuration does not support named stores.
@ringerc
Copy link
Author

ringerc commented Mar 5, 2025

This commit seems to try to do the right thing for the check about whether ruler is configured:

$ git show 8bca2e76089e0b9894b7a4c18a950f4baaa5a412 -- pkg/loki/modules.go
commit 8bca2e76089e0b9894b7a4c18a950f4baaa5a412
Author: Ashwanth <[email protected]>
Date:   Tue Oct 29 14:46:11 2024 +0530

    feat(ruler): enables ruler store that uses clients from thanos-io/objstore pkg (#11713)

diff --git a/pkg/loki/modules.go b/pkg/loki/modules.go
index 86b5e6681..8e5c76e88 100644
--- a/pkg/loki/modules.go
+++ b/pkg/loki/modules.go
@@ -1232,7 +1232,8 @@ func (t *Loki) initRulerStorage() (_ services.Service, err error) {
        // to determine if it's unconfigured.  the following check, however, correctly tests this.
        // Single binary integration tests will break if this ever drifts
        legacyReadMode := t.Cfg.LegacyReadTarget && t.Cfg.isTarget(Read)
-       if (t.Cfg.isTarget(All) || legacyReadMode || t.Cfg.isTarget(Backend)) && t.Cfg.Ruler.StoreConfig.IsDefaults() {
+       storageNotConfigured := (t.Cfg.StorageConfig.UseThanosObjstore && t.Cfg.RulerStorage.IsDefaults()) || t.Cfg.Ruler.StoreConfig.IsDefaults()
+       if (t.Cfg.isTarget(All) || legacyReadMode || t.Cfg.isTarget(Backend)) && storageNotConfigured {
                level.Info(util_log.Logger).Log("msg", "Ruler storage is not configured; ruler will not be started.")
                return
        }
@@ -1245,7 +1246,11 @@ func (t *Loki) initRulerStorage() (_ services.Service, err error) {
                }
        }
 
-       t.RulerStorage, err = base_ruler.NewLegacyRuleStore(t.Cfg.Ruler.StoreConfig, t.Cfg.StorageConfig.Hedging, t.ClientMetrics, ruler.GroupLoader{}, util_log.Logger)
+       if t.Cfg.StorageConfig.UseThanosObjstore {
+               t.RulerStorage, err = base_ruler.NewRuleStore(context.Background(), t.Cfg.RulerStorage, t.Overrides, ruler.GroupLoader{}, util_log.Logger)
+       } else {
+               t.RulerStorage, err = base_ruler.NewLegacyRuleStore(t.Cfg.Ruler.StoreConfig, t.Cfg.StorageConfig.Hedging, t.ClientMetrics, ruler.GroupLoader{}, util_log.Logger)
+       }
 
        return
 }

from #15345

It's not immediately clear how this would be failing. This code is in:

➜  loki git:(fix-thanos-store-docs) ✗ git --no-pager tag --contains 8bca2e76089e0b9894b7a4c18a950f4baaa5a412 'v3.*'
v3.3.0
v3.3.1
v3.3.2
v3.3.3
v3.4.0
v3.4.1
v3.4.2

so should've been working. Issue with t.Cfg.RulerStorage.IsDefaults() ?

@ringerc
Copy link
Author

ringerc commented Mar 5, 2025

The mkdir bug is in

loki/pkg/loki/modules.go

Lines 1251 to 1257 in 4fa045d

// Make sure storage directory exists if using filesystem store
if t.Cfg.Ruler.StoreConfig.Type == "local" && t.Cfg.Ruler.StoreConfig.Local.Directory != "" {
err := chunk_util.EnsureDirectory(t.Cfg.Ruler.StoreConfig.Local.Directory)
if err != nil {
return nil, err
}
}

It fails to check for t.Cfg.StorageConfig.UseThanosObjstore and use t.Cfg.RulerStorage instead of t.Cfg.Ruler.StoreConfig

I'm not sure where the startup-check bug where ruler isn't launched is coming from yet.

@ringerc
Copy link
Author

ringerc commented Mar 5, 2025

Ah, worked it out.

storageNotConfigured := (t.Cfg.StorageConfig.UseThanosObjstore && t.Cfg.RulerStorage.IsDefaults()) || t.Cfg.Ruler.StoreConfig.IsDefaults()

if t.Cfg.StorageConfig.UseThanosObjstore is true, t.Cfg.Ruler.StoreConfig.IsDefaults() is still tested, and the condition is an OR so if this is unconfigured, storageNotConfigured is true.

@ringerc
Copy link
Author

ringerc commented Mar 5, 2025

Fixes in #16555

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant