Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

default log store backend to WAL and allow disabling verification #21700

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

dhiaayachi
Copy link
Contributor

@dhiaayachi dhiaayachi commented Sep 4, 2024

Description

This PR change the default log store config to use WAL when starting with a fresh database. If a bolt db already exist bolt db will be used as a backend and a warning will be logged.

It also allow the log verifier, enabled by default, to be disabled.

Testing & Reproduction steps

Added tests to verify combination of configs.

PR Checklist

  • updated test coverage
  • external facing docs updated
  • appropriate backport labels added
  • not a security concern

@dhiaayachi dhiaayachi requested a review from a team as a code owner September 4, 2024 13:53
@github-actions github-actions bot added the theme/config Relating to Consul Agent configuration, including reloading label Sep 4, 2024
Copy link
Member

@jmurret jmurret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit: 🔥

Copy link
Member

@zalimeni zalimeni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few docs ❓ and minor suggestions, but otherwise LGTM! 🚀

.changelog/21700.txt Outdated Show resolved Hide resolved
Comment on lines 1059 to 1060
if s.config.LogStoreConfig.Backend == LogStoreBackendDefault && !boltFileExists {
if (s.config.LogStoreConfig.Backend == LogStoreBackendDefault || s.config.LogStoreConfig.Backend == LogStoreBackendWAL) && !boltFileExists {
s.config.LogStoreConfig.Backend = LogStoreBackendWAL
Copy link
Member

@zalimeni zalimeni Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

~ Should we consider moving the original

if s.config.LogStoreConfig.Backend == LogStoreBackendDefault && !boltFileExists {
  s.config.LogStoreConfig.Backend = LogStoreBackendWAL
}

bit up above the rest of this if block, and just check explicitly for WAL (not default) after?

Main thought that crossed my mind is we're treating default and WAL as equivalent in these checks once we get past the BoltDB detection gate, so normalizing in one place is less error-prone in case of future changes and separates the defaulting from the business logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that what you had in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more thinking we could simplify the checks following the first defaulting block, so that we aren't repeating stuff like "default or WAL" and "!boltDB". Maybe something like this? (also switches the warning to "using BoltDB" since "ignoring 'wal'" might be confusing when default is used)

- Take a snapshot prior to testing.
- Monitor Consul server metrics and logs, and set an alert on specific log events that occur when WAL is enabled. Refer to [Monitor Raft metrics and logs for WAL](/consul/docs/agent/wal-logstore/monitoring) for more information.
- Enable WAL in a pre-production environment and run it for a several days before enabling it in production.
WAL LogStore is now enabled by default
Copy link
Member

@zalimeni zalimeni Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're still ignoring config if there's a BoltDB file found - should we call that out here (new installs only) similar to the main doc, and keep some instructions to transition existing servers if desired?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@boruszak Can you please check the wording in here?

website/content/docs/agent/wal-logstore/enable.mdx Outdated Show resolved Hide resolved
Comment on lines -52 to -74
## Enable log verification

You must enable log verification on all voting servers in Enterprise and all servers in CE because the leader writes verification checkpoints.

1. On each voting server, add the following to the server's configuration file:

```hcl
raft_logstore {
verification {
enabled = true
interval = "60s"
}
}
```

1. Restart the server to apply the changes. The `consul reload` command is not sufficient to apply `raft_logstore` configuration changes.
1. Run the `consul operator raft list-peers` command to wait for each server to become a healthy voter before moving on to the next. This may take a few minutes for large snapshots.

When complete, the server's logs should contain verifier reports that appear like the following example:

```log hideClipboard
2023-01-31T14:44:31.174Z [INFO] agent.server.raft.logstore.verifier: verification checksum OK: elapsed=488.463268ms leaderChecksum=f15db83976f2328c rangeEnd=357802 rangeStart=298132 readChecksum=f15db83976f2328c
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this section on log verification still relevant info even when WAL is defaulted on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log verification is enabled by default now as part of WAL. The reasoning is that it have minimal impact but great benefits in case of bugs.

I will double check if it's documented as part of the logstore config properly.

website/content/docs/agent/wal-logstore/index.mdx Outdated Show resolved Hide resolved
website/content/docs/agent/wal-logstore/index.mdx Outdated Show resolved Hide resolved
Copy link
Contributor

@boruszak boruszak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be deleting the instructions in the wal-logstore/enable page. We should rephrase the language around "experimental" and callout that it's the default. But what if someone changes from WAL to BoltDB and then wants to change back?

website/content/docs/agent/wal-logstore/index.mdx Outdated Show resolved Hide resolved
website/content/docs/agent/wal-logstore/monitoring.mdx Outdated Show resolved Hide resolved
website/content/docs/agent/wal-logstore/index.mdx Outdated Show resolved Hide resolved
Co-authored-by: Michael Zalimeni <[email protected]>
Co-authored-by: Jeff Boruszak <[email protected]>
@dhiaayachi
Copy link
Contributor Author

I don't think we should be deleting the instructions in the wal-logstore/enable page. We should rephrase the language around "experimental" and callout that it's the default. But what if someone changes from WAL to BoltDB and then wants to change back?

@boruszak I'm not sure I get your point 🤔. The aim of that page is to help users enable an experimental feature and make sure that it's working safely for them. Now that WAL is default that logic don't hold anymore as by making it default we implicitly admit to it being stable enough to make it default.

I agree on your point about reverting from WAL to boltdb being important but it's a simple configuration change and don't need any extra steps. The only thing I can think of and that we should call-out, and we can probably document in that page is that:

  • if you activate WAL with an existing BoltDB db the boltDB db will be used and WAL will not be activated
  • if you activate boltdb while a WAL db is present the WAL db will be ignored and the server will start with a new DB from scratch and the existing WAL db will be ignored.

So to sum it up, when changing the log store backend it's always recommended to:

  • Create a snapshot and verify it's not corrupted
  • Gracefully stop the server
  • change the log store config
  • delete the existing DB (wal or boltdb)
  • start the server
  • Wait for it to get its data replicated or restore the snapshot

WYT?

@JadhavPoonam
Copy link
Contributor

@dhiaayachi For someone who already has BoltDB what steps would they have to take to make the switch? Do we need/have some migration docs somewhere? 🤔

@dhiaayachi
Copy link
Contributor Author

@dhiaayachi For someone who already has BoltDB what steps would they have to take to make the switch? Do we need/have some migration docs somewhere? 🤔

@JadhavPoonam the procedure I highlighted in the comment above would be needed. We can add that as documentation.

@@ -1055,26 +1055,23 @@ func (s *Server) setupRaft() error {
stable = wal
return nil
}
// Only use WAL if there is no existing raft.db, even if it's enabled.

// Default to WAL if there is no existing raft.db, even if it's enabled. Log a warning otherwise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Default to WAL if there is no existing raft.db, even if it's enabled. Log a warning otherwise
// Default to WAL. Only use WAL if there is no existing raft.db, even if it's enabled. Log a warning otherwise

~ "even if it's enabled part" doesn't make sense following "Default to WAL". IMO both clarifications are valuable, so we could just combine them.

Comment on lines +1062 to +1066
} else if s.config.LogStoreConfig.Backend == LogStoreBackendWAL || s.config.LogStoreConfig.Backend == LogStoreBackendDefault {
// User configured the new storage, but still has old raft.db. Warn
// them!
s.logger.Warn("BoltDB file raft.db found, IGNORING raft_logstore.backend which is set to 'wal'")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As written, I think this will execute in an unwanted scenario if the following are true:

  • s.config.LogStoreConfig.Backend == LogStoreBackendWAL
  • !boltFileExists
    because the first if condition would fail due to WAL being explicitly configured, but the second condition doesn't assert that the boltDB file is present.

IMO it'd be easier to understand this flow and avoid bugs if we separate the condition for the warning from the condition for setting WAL as the default. We could introduce a useWAL var if it seems that'd read more cleanly than this approach. Open to your ideas though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/no-backport theme/config Relating to Consul Agent configuration, including reloading
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants