TRACING-4752: Add OpenTelemetry-Collector as optional sub-package #4281
Conversation
Skipping CI for Draft Pull Request.
@@ -0,0 +1,27 @@
[Unit]
Description=MicroShift Observability
BindsTo=microshift.service
Do we want to run the collector even when MicroShift fails?
I'd think yes. If MicroShift fails to start, the metrics and log data should still be collectable by the metrics/logging backend remotely.
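A minimal sketch of that alternative, assuming the intent is for the collector to start with MicroShift but keep running if MicroShift fails; the unit names come from this PR, the rest is illustrative:

```ini
# Illustrative sketch only, not the unit shipped in this PR.
# Wants=/After= starts the collector alongside MicroShift but, unlike
# BindsTo=, does not stop it when microshift.service fails or exits,
# so metrics and log data remain collectable by a remote backend.
[Unit]
Description=MicroShift Observability
Wants=microshift.service
After=microshift.service
```

The trade-off: BindsTo= couples the collector's lifetime to MicroShift's, while Wants= only expresses a startup preference.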
/retitle NO-ISSUE: OpenTelemetry certificates and service for MicroShift
@copejon: This pull request explicitly references no jira issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@copejon: This pull request references TRACING-4752, which is a valid jira issue. Warning: the referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.
/jira refresh
@copejon: This pull request references TRACING-4752, which is a valid jira issue. Warning: the referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.
Force-pushed fa4f579 to fede276.
Force-pushed 2042714 to ccfea22.
packaging/rpm/microshift.spec (outdated)
Requires: opentelemetry-collector

%description observability
Deploys the Red Hat build of Opentelemetry-collector as a systemd service on host. MicroShift provides client
We need to be consistent in the naming case. Either fix this, or the Summary section, please.
Suggested change:
- Deploys the Red Hat build of Opentelemetry-collector as a systemd service on host. MicroShift provides client
+ Deploys the Red Hat build of OpenTelemetry-Collector as a systemd service on host. MicroShift provides client
packaging/rpm/microshift.spec (outdated)
certificates to permit access to the kube-apiserver metrics endpoints. If a user defined opentelemetry-collector exists
at /etc/microshift/opentelemetry-collector.yaml, this config is used. Otherwise, a default config is provided. Note that
the default configuration requires the backend endpoint be set by the user.
Suggested change:
- certificates to permit access to the kube-apiserver metrics endpoints. If a user defined opentelemetry-collector exists
- at /etc/microshift/opentelemetry-collector.yaml, this config is used. Otherwise, a default config is provided. Note that
- the default configuration requires the backend endpoint be set by the user.
+ certificates to permit access to the kube-apiserver metrics endpoints. If a user-defined configuration file exists
+ at /etc/microshift/opentelemetry-collector.yaml, this configuration is used. Otherwise, a default configuration is provided.
+ Note that the default configuration requires the backend endpoint be set by the user.
For the backend endpoint, should we be specific on what we expect users to set? I mean, should we say the exporters.otlp section must be edited by users?
Added more specific instructions
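As an illustration of the kind of edit being discussed, a hedged sketch of an exporters.otlp section; the endpoint value is a placeholder the user must replace, and the key layout follows the collector's usual OTLP exporter configuration:

```yaml
# Illustrative sketch: the endpoint must be replaced with a real
# OTLP-capable backend before enabling the exporter.
exporters:
  otlp:
    endpoint: otlp-backend.example.com:4317  # placeholder, user-provided
    tls:
      insecure: false

service:
  pipelines:
    metrics:
      exporters: [ otlp ]
```

A complete pipeline would also declare receivers and processors; this fragment shows only the section users are expected to edit.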
# EXAMPLE OTLP (Prometheus) ENDPOINT CONFIG
# The otlp exporter requires an endpoint listening for OTLP connections. To prevent spamming the log with Go
# stack traces, the exporter is disabled. The endpoint is not known at installation, thus a tire-kicking of the
# microshift-observability package would result in stack traces spam in logs.
Let's think about what we can do so that the logs are not "spammed" when the default configuration is used. It sounds as if we should copy this file with a .example suffix so that users would have to explicitly rename the file when they enable the collector service.
In any case, the "style" of this comment should be reworded.
I've tweaked the comment and made it a little more informational
@@ -0,0 +1,20 @@
[Unit]
Description=MicroShift Observability
After=microshift.service
Should we use ConditionPathExists here for all the files the service expects to have before it starts?
The opentelemetry-collector performs that check for us each time it starts.
Right, but the point of the condition in systemd is not to attempt starting the service if the path does not exist.
Could this help to avoid unnecessary restarts?
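A sketch of what the suggestion might look like in the unit file. When a ConditionPathExists= check fails, systemd skips the start without marking the unit failed, which is what avoids the restart churn; the checked path below is the configuration file named elsewhere in this PR:

```ini
# Illustrative sketch: skip startup until the expected file exists.
[Unit]
Description=MicroShift Observability
After=microshift.service
ConditionPathExists=/etc/microshift/opentelemetry-collector.yaml
```

Further ConditionPathExists= lines could be added for the client certificates once their final paths are settled.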
# It takes a bit for the certs to be created. This service will reach its burst limit almost immediately, pretty much
# guaranteeing that it will reach the restart limit before it can possibly succeed.
RestartSec=200ms
Do we really need this? We've configured the service to start After microshift, so microshift must report readiness to systemd before the current service startup is attempted. MicroShift only reports readiness after creating all certificates.
What am I missing?
In earlier tests this was necessary to keep the service from crash looping, but that doesn't seem to be an issue in the latest opentelemetry-collector. Will remove
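For background on the burst limit mentioned above: systemd rate-limits unit starts via StartLimitIntervalSec=/StartLimitBurst= (by default 5 starts within 10 seconds), so a 200 ms RestartSec can exhaust the limit almost immediately. A hedged sketch of the knobs involved, with illustrative values:

```ini
# Illustrative values only, not the settings in this PR.
[Unit]
# Allow more start attempts before systemd gives up on the unit.
StartLimitIntervalSec=30
StartLimitBurst=10

[Service]
Restart=on-failure
# A longer delay between restarts also avoids tripping the burst limit.
RestartSec=5s
```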
auth_type: tls
ca_file: /etc/pki/microshift-opentelemetry-collector-client/client-ca.crt
key_file: /etc/pki/microshift-opentelemetry-collector-client/client.key
cert_file: /etc/pki/microshift-opentelemetry-collector-client/client.crt
These paths need to be updated too -> /var/lib/microshift/..../
packaging/rpm/microshift.spec (outdated)
certificates to permit access to the kube-apiserver metrics endpoints. If a user defined Opentelemetry-Collector exists
at /etc/microshift/opentelemetry-collector.yaml, this config is used. Otherwise, a default config is provided. Note that
the default configuration requires the backend endpoint be set by the user. The otlp export must also be specified as
.service.pipelines.$RECIEVER.exporter: "otlp". The specification for the otlp config is:
Let's not use shortened words because it's a user-facing RPM description.
Force-pushed aee833a to ad4892d.
packaging/rpm/microshift.spec (outdated)
Requires: opentelemetry-collector

%description observability
Deploys the Red Hat build of Opentelemetry-Collector as a systemd service on host. MicroShift provides client
Please fix the case of Opentelemetry -> OpenTelemetry to make it consistent with the summary text.
Force-pushed 33de178 to 2996e90.
[Documentation] The service starts after MicroShift starts and thus will start generating pertinent log data
... right away. When the suite is executed, immediately get the cursor for the current
Setup Suite
${cur} Get Journal Cursor
Should we specify a unit here?
There seems to be a race condition here. Since we're testing a new unit, should we not just parse the logs from the beginning after we get enough lines there?
The missing unit was an oversight on my part. Fixed now. I don't understand what you mean about the race condition though. Can you elaborate?
Force-pushed 23a7929 to 32155a5.
# workload resource usage for CPU, Memory, Disk, and Network. Included also are all cluster events of "Warning" type.

# This configuration exports:
# - Contain, Pod, and Node metrics
What is "Contain"?
oops, lost an "er" there :D
%dir %{_prefix}/lib/microshift/manifests.d/003-microshift-observability
%dir %{_sharedstatedir}/microshift-observability
%{_unitdir}/microshift-observability.service
%{_presetdir}/90-enable-microshift-observability.preset
Why do we need this preset?
If the observability rpm were installed without the preset, then it's disabled by default, and the systemctl enable microshift command doesn't propagate to dependencies. That, coupled with the service setting RequiredBy=microshift.service, means that microshift itself can't start without the user manually enabling the observability service.
This way, the user isn't required to manage the service, as it's handled automatically.
What about the other way around? When microshift+observability are installed, microshift is disabled, but observability is enabled by default. So, the observability service will fail on reboot.
We already have the following in the observability unit configuration.
Isn't it enough to couple service start / stop?
[Unit]
PartOf=microshift.service
After=microshift.service
Isn't it enough to couple service start / stop?

Well, no. In order for the RequiredBy= directive to take effect, the service must be enabled. But as you said, if it's enabled, then after a reboot it would start and enter a fail loop. This is fixed by applying After=microshift.service.

So the .preset file + the RequiredBy= is what activates the dependency. The After= ensures that the service doesn't start until microshift is active. PartOf= propagates stops/restarts of the microshift service to the observability service.

But I think in the end this ended up being a long walk to accomplishing (basically) the same thing as adding microshift-observability to the microshift.service's Wants= directive.
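The moving parts in that explanation can be sketched as follows; the file contents are illustrative reconstructions based on the directives named in this thread, not the exact files in the PR:

```ini
# 90-enable-microshift-observability.preset (illustrative content):
#   enable microshift-observability.service

# microshift-observability.service, relevant directives:
[Unit]
# Propagate stop/restart of microshift to this service, and order after it.
PartOf=microshift.service
After=microshift.service

[Install]
# Takes effect only once the unit is enabled (hence the preset):
# microshift.service then requires this service.
RequiredBy=microshift.service
```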
Setup Suite
${cur} Get Journal Cursor unit=microshift-observability
Set Suite Variable ${JOURNAL_CUR} ${cur}
Wait Until Keyword Succeeds 1 min 5 sec
We may want to increase the wait interval to, say, 5m?
If we've made it to this point, then the observability service is running and generating output. This check is more of a precaution to ensure the test is only checking output that's been generated very recently. Otherwise we may pick up errors or failures caused by service restarts or reboots. That can happen, for instance, if another test set is run before the observability tests.
@@ -0,0 +1,96 @@
# Opentelemetry-collector-small.yaml provides a minimal set of metrics and logs for monitoring system, node, and
File name is different from what's in the comment
- microshift-observability
- microshift-etcd
- crio
- openvswitch.service
There are also ovsdb-server.service and ovs-vswitchd.service - should we include them?
In my experience, I think I saw ovsdb-server failing more often than the others.
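For illustration, the journald unit filter under discussion might look like the sketch below, assuming the collector's journald receiver (units and priority are fields of that receiver's configuration; the exact unit list is the point under review):

```yaml
# Illustrative sketch of a journald receiver filter including the OVS
# daemons suggested in review.
receivers:
  journald:
    units:
      - microshift
      - microshift-observability
      - microshift-etcd
      - crio
      - openvswitch
      - ovsdb-server
      - ovs-vswitchd
    priority: info
```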
metrics/kubeletstats:
  receivers: [ kubeletstats ]
  processors: [ batch ]
  exporters: [ otlp ] # Uncomment to enable OTLP
It looks uncommented to me :)
So... This is a scenario for ostree tests, but you changed the bootc containerfile (test/image-blueprints-bootc/layer2-source/group2/rhel96-bootc-source-optionals.containerfile).
And that explains why:
- setup of THIS scenario (ostree el94 src optional) failed
- the robot test cannot find the logs (it obtained a cursor because the unit is up, but probably, because of the missing configuration in bootc x el96 x optional, it didn't contain the logs it expected)
🤦🏻 good catch!
Force-pushed f342f6f to 35b1abe.
Force-pushed 35b1abe to 305e6dd.
/test all
Force-pushed 974fdf9 to dff2d54.
/test all
Force-pushed dff2d54 to 8464679.
…elemetry-collector, preconfigured for microshift Signed-off-by: Jon Cope <[email protected]>
Force-pushed cdc0d7c to 689986c.
Gateway API test failed, rerunning the test.
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: copejon, ggiguash. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/test e2e-aws-tests-bootc
@copejon: The following tests failed.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.