Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tt status: print detailed alerts for instances #927

Conversation

mandesero
Copy link
Contributor

This patch improves the tt status command by displaying all collected errors, warnings, and the current instance status. Instances with warnings or errors will now be highlighted in yellow.

Closes #836

@oleg-jukovec
Copy link
Contributor

Please provide examples of how it looks like.

@mandesero mandesero marked this pull request as draft August 20, 2024 12:35
@mandesero
Copy link
Contributor Author

Please provide examples of how it looks like.

There are some examples:

  • call config:reload() (for example at app2:storage-001-a) with incorrect field in config:
 INSTANCE            STATUS   PID   MODE  ALERTS
 app2:storage-001-a  RUNNING  4552  RW    [error] [cluster_config] Unexpected field "invalid_opt"
 app2:storage-001-b  RUNNING  4553  RO
 app2:storage-001-c  RUNNING  4557  RO
 app2:router-001-a   RUNNING  4560  RW
  • set iproto.advertise.peer.login to a user that does not have the replication role.
 INSTANCE            STATUS   PID   MODE  ALERTS
 app2:storage-001-b  RUNNING  4887  RO    [Status]: orphan
                                          [Instance: storage-001-a] Upstream status: loading; Message: Read access to universe '' is denied for user 'alice'; Peer: alice@localhost:3301
                                          [Instance: storage-001-c] Upstream status: loading; Message: Read access to universe '' is denied for user 'alice'; Peer: alice@localhost:3303
 app2:storage-001-c  RUNNING  4891  RO    [Status]: orphan
                                          [Instance: storage-001-a] Upstream status: loading; Message: Read access to universe '' is denied for user 'alice'; Peer: alice@localhost:3301
                                          [Instance: storage-001-b] Upstream status: loading; Message: Read access to universe '' is denied for user 'alice'; Peer: alice@localhost:3302
 app2:router-001-a   RUNNING  4885  RW    [Status]: startup_in_progress
 app2:storage-001-a  RUNNING  4886  RO    [Status]: orphan
                                          [Instance: storage-001-c] Upstream status: loading; Message: Read access to universe '' is denied for user 'alice'; Peer: alice@localhost:3303
                                          [Instance: storage-001-b] Upstream status: loading; Message: Read access to universe '' is denied for user 'alice'; Peer: alice@localhost:3302

@mandesero mandesero force-pushed the mandesero/gh-836-make-tt-status-informative branch 3 times, most recently from 8166e11 to 23ff5a5 Compare August 20, 2024 13:46
@mandesero mandesero marked this pull request as ready for review August 21, 2024 08:56
cli/status/status.go Outdated Show resolved Hide resolved
cli/status/status.go Outdated Show resolved Hide resolved
cli/status/status.go Outdated Show resolved Hide resolved
@oleg-jukovec
Copy link
Contributor

oleg-jukovec commented Aug 21, 2024

Perhaps it's better to add an additional flag for this feature (-v/--verbose)? It looks like the output is now becoming less readable, and at the same time much more difficult to parse it in scripts.

An another idea: print alerts with the verbose flag, but the instance status by default. Something like this:

INSTANCE STATE   MODE PID STATUS
foo      RUNNING RW   123 orphan
bar      RUNNING RO   123 ready

I think we need @psergee opinion here.

@oleg-jukovec
Copy link
Contributor

oleg-jukovec commented Aug 21, 2024

image
status/status for the same instance looks confusing. I think a one needs to be clarified.

@Totktonada
Copy link
Member

I would discuss the output format in context of a human reader, because a machine-readable output is easier to implement separately (with some --json or --output json CLI option).

I think that if something is suspectful on an instance, we should let a user know about it by default. OTOH, we're not obligated to print anything if everything is OK. We can hide the column, for example. Or we can print these warnings on separate lines under the instance name if any.

The key idea is the following:

  • Don't flood with extra output if everything is OK.
  • Let a user know if something potentially goes wrong with the instances (by default, wthout extra flags).

@oleg-jukovec
Copy link
Contributor

oleg-jukovec commented Aug 21, 2024

I would discuss the output format in context of a human reader, because a machine-readable output is easier to implement separately (with some --json or --output json CLI option).

Now the output can be easily parsed using regular shell utilities. We can make things complicated here, or we can just keep it simple.

I think that if something is suspectful on an instance, we should let a user know about it by default. OTOH, we're not obligated to print anything if everything is OK.

An instance status looks interesting in most cases from my point of view. Even if it is running. The column appearing and disappearing time to time looks strange here.

@Totktonada
Copy link
Member

I would discuss the output format in context of a human reader, because a machine-readable output is easier to implement separately (with some --json or --output json CLI option).

Now the output can be easily parsed using regular shell utilities. We can make things complicated here, or we can just keep it simple.

I would separate goals of the tt status command and its properties.

In my understanding the goal is to show a human if something is wrong with the instances. Without the config:info().{status,alerts}, box.info.status, box.info.replication information the goal is not reached in typical scenarios. All the cases from #836 are from my experience with users who reached me during investigation of a problem and in all the cases tt status says nothing suspectful -- the instances are running.

In contrary, ability to parse the output using simple grep/awk expressions is a property, not the goal. And I guess that the expressions are more or less simple anyway (just add | grep -v '^ ' to exclude extra rows if we're about the currently implemented output format). awk's { print $2; } can be kept as before.

I think that if something is suspectful on an instance, we should let a user know about it by default. OTOH, we're not obligated to print anything if everything is OK.

An instance status looks interesting in most cases from my point of view. Even if it is running. The column appearing and disappearing time to time looks strange here.

Hm. I didn't mean to touch the existing status anyhow (which is RUNNING if the process exists AFAIU).

However, if we're discussing it, it actually brings not much information. It would be better to give some cumulative 'overview' status that takes into account the config:info() and the box.info information. For example, UNCALLABLE, ORPHAN, RO, RW, RO/HAS WARNINGS (RW/HAS WARNINGS), RO/HAS ERRORS (RW/HAS ERRORS). OTOH, it is not where I insist anyhow.

In the quoted sentence I meant to hide the extra information we're adding in this pull request if there is nothing important in it. I'm OK with showing the new column always or to hide it if it is empty. I'm OK to place all this information to its own rows (start it with some indent) and don't add any column.

IOW, I'm interested in this information to be present and it is not really important for me how exactly it is formatted.

I disagree with your proposal to hide the health information under a CLI option, because it is counter-intuitive to have status command, which just says whether a process is running. It is natural expectation that it checks the well known box.info indicators at least.

Copy link
Collaborator

@psergee psergee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the patch.

cli/status/status.go Outdated Show resolved Hide resolved
cli/status/status.go Outdated Show resolved Hide resolved
@oleg-jukovec
Copy link
Contributor

oleg-jukovec commented Aug 22, 2024

In my understanding the goal is to show a human if something is wrong with the instances. All the cases from #836 are from my experience with users who reached me during investigation of a problem and in all the cases tt status says nothing suspectful -- the instances are running.

I agree, but show a brief status of instances. To be able to see at a glance whether there are problems.

Without the config:info().{status,alerts}, box.info.status, box.info.replication information the goal is not reached in typical scenarios.

Wouldn't it be enough to create a new column just with box.info.status? This way, at a quick glance it will be immediately clear that something is going wrong.

The rest can be hidden under some verbose option to allow a user to make a brief diagnostic if he wants to.

tt is actively used in development. The fact that some instances did not in the running status - may be normal at some stages/debugging a problem. And each time going through many messages can be difficult and unnecessary.

@psergee
Copy link
Collaborator

psergee commented Aug 22, 2024

Perhaps it's better to add an additional flag for this feature (-v/--verbose)? It looks like the output is now becoming less readable, and at the same time much more difficult to parse it in scripts.

An another idea: print alerts with the verbose flag, but the instance status by default. Something like this:

INSTANCE STATE   MODE PID STATUS
foo      RUNNING RW   123 orphan
bar      RUNNING RO   123 ready

I think we need @psergee opinion here.

tt verbose mode is used to enable debug-level logging that is already present in the tt code, making it easier to troubleshoot issues. This is not about controlling the command output.
Yes, the output can become less readable with long lines. We need some way to control the tt status output. If the long line is not the last field, we should truncate it, so we need a maximum width for each field.

@Totktonada
Copy link
Member

Without the config:info().{status,alerts}, box.info.status, box.info.replication information the goal is not reached in typical scenarios.

Wouldn't it be enough to create a new column just with box.info.status? This way, at a quick glance it will be immediately clear that something is going wrong.

I agree that it would be nice to show some overall health status. Sadly, tarantool has no such field. box.info.status may report running, while there are problems with some of the upstreams, last configuration is not applied with an error and so on. If we want a short status, we should deduce it from a few different fields.

@mandesero mandesero force-pushed the mandesero/gh-836-make-tt-status-informative branch from 23ff5a5 to de57025 Compare August 23, 2024 12:57
@mandesero
Copy link
Contributor Author

I suggest the following output format, here are examples.


Снимок экрана 2024-08-23 в 15 44 26 Снимок экрана 2024-08-23 в 15 44 44 Снимок экрана 2024-08-23 в 16 15 18

@mandesero mandesero force-pushed the mandesero/gh-836-make-tt-status-informative branch 4 times, most recently from ca6f4fe to a1b35c4 Compare August 23, 2024 19:16
@mandesero mandesero force-pushed the mandesero/gh-836-make-tt-status-informative branch 2 times, most recently from df23a3b to 6d22e37 Compare September 20, 2024 13:47
@mandesero mandesero force-pushed the mandesero/gh-836-make-tt-status-informative branch 5 times, most recently from fae308c to b9e14c9 Compare September 30, 2024 14:12
cli/status/status.go Outdated Show resolved Hide resolved
@mandesero mandesero force-pushed the mandesero/gh-836-make-tt-status-informative branch 4 times, most recently from a666c0b to 4aa7237 Compare October 1, 2024 10:13
@oleg-jukovec oleg-jukovec added the full-ci Enables full ci tests label Oct 1, 2024
@oleg-jukovec
Copy link
Contributor

Please, rebase to the master branch.

This patch enhances the `tt status` command to display all collected
errors, warnings, and the current status of instances.

Closes tarantool#836

@TarantoolBot document
Title: Enhanced `tt status` with detailed information.

The tt status command now displays statuses reported by `config:info()`,
`box.info`, and `box.info.replication*.upstream`. Additionally, you can
now view detailed reports of errors and warnings using the
`tt status --details` option.

Example:

You have an invalid config:

```yaml
...
invalid_option:
...
```

When you try to apply this config, you can check the status of the Tarantool
instance using `tt status`. You'll see that there is an error in the instance:

```
 INSTANCE            STATUS   PID     MODE  CONFIG        BOX      UPSTREAM
 app2:storage-001-b  RUNNING  168040  RO    check_errors  running  follow
```

To view detailed alerts, you can use the `--details` option with `tt status`:

```
Instance app2:storage-001-b:
    [config][error]: [cluster_config] Unexpected field "invalid_option"

 INSTANCE            STATUS   PID     MODE  CONFIG        BOX      UPSTREAM
 app2:storage-001-b  RUNNING  168040  RO    check_errors  running  follow
```
Add test to verify config, box, and upstream statuses and alerts
returned by the new `tt status [--details]` command.
@mandesero mandesero force-pushed the mandesero/gh-836-make-tt-status-informative branch from 4aa7237 to 19224a7 Compare October 1, 2024 13:55
@mandesero
Copy link
Contributor Author

Please, rebase to the master branch.

Rebased.

Copy link
Contributor

@oleg-jukovec oleg-jukovec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the patch!

Copy link
Collaborator

@psergee psergee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@oleg-jukovec oleg-jukovec merged commit 2a1ebb6 into tarantool:master Oct 2, 2024
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
full-ci Enables full ci tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tt status: give more status information
5 participants