TRACING-4752: Add OpenTelemetry-Collector as optional sub-package #4281

Merged

@copejon copejon commented Dec 6, 2024

Which issue(s) this PR addresses:

Closes #

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 6, 2024
Contributor

openshift-ci bot commented Dec 6, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 6, 2024
@@ -0,0 +1,27 @@
[Unit]
Description=MicroShift Observability
BindsTo=microshift.service
Contributor

Do we want to run the collector even when MicroShift fails?

Contributor Author

I'd think yes. If MicroShift fails to start, the metrics and log data should still be collectable by the metrics/logging backend remotely.

@ggiguash
Contributor

ggiguash commented Dec 9, 2024

/retitle NO-ISSUE: OpenTelemetry certificates and service for MicroShift

@openshift-ci openshift-ci bot changed the title from "No issue generate otel cert" to "NO-ISSUE: OpenTelemetry certificates and service for MicroShift" Dec 9, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 9, 2024
@openshift-ci-robot

@copejon: This pull request explicitly references no jira issue.

In response to this:

Which issue(s) this PR addresses:

Closes #

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@copejon copejon changed the title from "NO-ISSUE: OpenTelemetry certificates and service for MicroShift" to "TRACING-4752: Add OpenTelemetry-Collector as optional sub-package" Dec 12, 2024
@openshift-ci-robot

openshift-ci-robot commented Dec 12, 2024

@copejon: This pull request references TRACING-4752 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Which issue(s) this PR addresses:

Closes #


@copejon
Contributor Author

copejon commented Dec 12, 2024

/jira refresh

@openshift-ci-robot

openshift-ci-robot commented Dec 12, 2024

@copejon: This pull request references TRACING-4752 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

/jira refresh


@copejon copejon force-pushed the no-issue-generate-otel-cert branch from fa4f579 to fede276 Compare December 12, 2024 21:05
@copejon copejon marked this pull request as ready for review January 21, 2025 16:45
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 21, 2025
@openshift-ci openshift-ci bot requested review from agullon and eslutsky January 21, 2025 16:46
@copejon copejon force-pushed the no-issue-generate-otel-cert branch from 2042714 to ccfea22 Compare January 22, 2025 23:20
Requires: opentelemetry-collector

%description observability
Deploys the Red Hat build of Opentelemetry-collector as a systemd service on host. MicroShift provides client
Contributor

We need to be consistent in the naming case. Either fix this, or the Summary section, please.

Suggested change
- Deploys the Red Hat build of Opentelemetry-collector as a systemd service on host. MicroShift provides client
+ Deploys the Red Hat build of OpenTelemetry-Collector as a systemd service on host. MicroShift provides client

Comment on lines 232 to 234
certificates to permit access to the kube-apiserver metrics endpoints. If a user defined opentelemetry-collector exists
at /etc/microshift/opentelemetry-collector.yaml, this config is used. Otherwise, a default config is provided. Note that
the default configuration requires the backend endpoint be set by the user.
Contributor

Suggested change
- certificates to permit access to the kube-apiserver metrics endpoints. If a user defined opentelemetry-collector exists
- at /etc/microshift/opentelemetry-collector.yaml, this config is used. Otherwise, a default config is provided. Note that
- the default configuration requires the backend endpoint be set by the user.
+ certificates to permit access to the kube-apiserver metrics endpoints. If a user-defined configuration file exists
+ at /etc/microshift/opentelemetry-collector.yaml, this configuration is used. Otherwise, a default configuration is provided.
+ Note that the default configuration requires the backend endpoint be set by the user.

Contributor

For the backend endpoint, should we be specific on what we expect users to set?
I mean, should we say exporters.otlp section must be edited by users?

Contributor Author

Added more specific instructions

# EXAMPLE OTLP (Prometheus) ENDPOINT CONFIG
# The otlp exporter requires an endpoint listening for OTLP connections. To prevent spamming the log with Go
# stack traces, the exporter is disabled. The endpoint is not known at installation, thus a tire-kicking of the
# microshift-observability package would result in stack traces spam in logs.
Contributor

Let's think what we can do so that the logs are not "spammed" when the default configuration is used. It sounds as if we should copy this file with .example suffix so that users would have to explicitly rename the file when they enable the collector service.

Contributor

@ggiguash ggiguash Jan 23, 2025

In any case, the "style" of this comment should be reworded.

Contributor Author

I've tweaked the comment and made it a little more informational

@@ -0,0 +1,20 @@
[Unit]
Description=MicroShift Observability
After=microshift.service
Contributor

Should we use ConditionPathExists here for all the files the service expects to have before it starts?

Contributor Author

The opentelemetry-collector performs that check for us each time it starts.

Contributor

Right, but the point of the condition in systemd is to not attempt starting the service if the path does not exist.

Contributor

Could this help to avoid unnecessary restarts?
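
A minimal sketch of what the reviewers describe, assuming the collector reads the configuration path named earlier in this thread. The certificate path is a placeholder (the final location under /var/lib/microshift is not spelled out in this conversation), and the fragment is illustrative, not the unit file that was merged:

```ini
[Unit]
Description=MicroShift Observability
After=microshift.service
# systemd evaluates conditions before starting the unit. If a path is missing,
# the start is skipped and the unit is not marked as failed.
ConditionPathExists=/etc/microshift/opentelemetry-collector.yaml
# Hypothetical certificate path, shown only to illustrate the pattern.
ConditionPathExists=/path/to/client.crt
```

Because a skipped start never launches the process, there is no failed exit for Restart= to react to, which is how a condition could cut down on restart attempts. The trade-off is that a missing file shows up as a skipped start instead of a visible failure.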


# It takes a bit for the certs to be created. This service will reach it's burst limit almost immediately, pretty much
# guaranteeing that it will reach the restart limit before it can possibly succeed.
RestartSec=200ms
Contributor

Do we really need this? We've configured the service to start After microshift, so microshift must report readiness to systemd before the current service startup is attempted. MicroShift only reports readiness after creating all certificates.
What am I missing?

Contributor Author

In earlier tests this was necessary to keep the service from crash looping, but that doesn't seem to be an issue in the latest opentelemetry-collector. Will remove
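
For context, a hedged sketch of how restart pacing interacts with systemd's start rate limit. The values are illustrative assumptions, not settings from this PR:

```ini
[Unit]
After=microshift.service
# Allow at most 5 start attempts within any 60 second window.
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
Restart=on-failure
# Delay between restart attempts. A short delay uses up the burst allowance
# quickly; a longer one spreads the attempts out.
RestartSec=5s
```

If the service fails right away on every attempt, a short RestartSec= lets it exhaust the allowed starts within the interval, after which systemd stops retrying until the limit is reset.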

auth_type: tls
ca_file: /etc/pki/microshift-opentelemetry-collector-client/client-ca.crt
key_file: /etc/pki/microshift-opentelemetry-collector-client/client.key
cert_file: /etc/pki/microshift-opentelemetry-collector-client/client.crt
Contributor

These paths need to be updated too -> /var/lib/microshift/..../

certificates to permit access to the kube-apiserver metrics endpoints. If a user defined Opentelemetry-Collector exists
at /etc/microshift/opentelemetry-collector.yaml, this config is used. Otherwise, a default config is provided. Note that
the default configuration requires the backend endpoint be set by the user. The otlp export must also be specified as
.service.pipelines.$RECIEVER.exporter: "otlp". The specification for the otlp config is:
Contributor

Let's not use shortened words because it's a user-facing RPM description.

@copejon copejon force-pushed the no-issue-generate-otel-cert branch from aee833a to ad4892d Compare January 30, 2025 13:13
Requires: opentelemetry-collector

%description observability
Deploys the Red Hat build of Opentelemetry-Collector as a systemd service on host. MicroShift provides client
Contributor

Please fix the case of Opentelemetry -> OpenTelemetry to make it consistent with the summary text.

@copejon copejon force-pushed the no-issue-generate-otel-cert branch from 33de178 to 2996e90 Compare February 11, 2025 19:58
[Documentation] The service starts after MicroShift starts and thus will start generating pertinant log data
... right away. When the suite is executed, immediately get the cursor for the current
Setup Suite
${cur} Get Journal Cursor
Contributor

@ggiguash ggiguash Mar 6, 2025

Should we specify a unit here?
There seems to be a race condition here. Since we're testing a new unit, should we not just parse the logs from the beginning after we get enough lines there?

Contributor Author

The missing unit was an oversight on my part. Fixed now. I don't understand what you mean about the race condition though. Can you elaborate?

@copejon copejon force-pushed the no-issue-generate-otel-cert branch 3 times, most recently from 23a7929 to 32155a5 Compare March 17, 2025 22:13
# workload resource usage for CPU, Memory, Disk, and Network. Included also are all cluster events of "Warning" type.

# This configuration exports:
# - Contain, Pod, and Node metrics
Contributor

What is "Contain"?

Contributor Author

oops, lost an "er" there :D

%dir %{_prefix}/lib/microshift/manifests.d/003-microshift-observability
%dir %{_sharedstatedir}/microshift-observability
%{_unitdir}/microshift-observability.service
%{_presetdir}/90-enable-microshift-observability.preset
Contributor

Why do we need this preset?

Contributor Author

If the observability RPM were installed without the preset, the service would be disabled by default, and the systemctl enable microshift command doesn't propagate to dependencies. That, coupled with the service setting RequiredBy=microshift.service, means that microshift itself can't start without the user manually enabling the observability service.

This way, the user isn't required to manage the service, as it's handled automatically.

Contributor

What about the other way around? When microshift+observability are installed, microshift is disabled, but observability enabled by default. So, the observability service will fail on reboot.

Contributor

@ggiguash ggiguash Mar 26, 2025

We already have the following in the observability unit configuration.
Isn't it enough to couple service start / stop?

[Unit]
PartOf=microshift.service
After=microshift.service

Contributor Author

> Isn't it enough to couple service start / stop?

Well, no. In order for the RequiredBy= directive to take effect, the service must be enabled. But as you said, if it's enabled, then after a reboot it would start and enter a fail loop. This is fixed by applying the After=microshift.service.

So the .preset file + the RequiredBy= is what activates the dependency. The After= ensures that the service doesn't start until microshift is active. PartOf= propagates stops/restarts of the microshift service to the observability service.

But I think in the end this turned out to be a long walk to accomplish (basically) the same thing as adding microshift-observability to the microshift.service's Wants= directive.
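
The relationships described above can be sketched as follows. This is an illustrative reconstruction from the directives named in this thread, not the merged files, and the drop-in path is hypothetical:

```ini
# microshift-observability.service
[Unit]
Description=MicroShift Observability
# Order this unit after microshift.service when both are being started.
After=microshift.service
# Propagate stop and restart of microshift.service to this unit.
PartOf=microshift.service

[Install]
# Takes effect only once this unit is enabled (here, through the preset):
# enabling creates the symlink under microshift.service.requires/.
RequiredBy=microshift.service

# Alternative mentioned in the last paragraph, as a drop-in for MicroShift
# (hypothetical path: /etc/systemd/system/microshift.service.d/10-observability.conf)
# [Unit]
# Wants=microshift-observability.service
```

The preset file would then hold a single line, `enable microshift-observability.service`, so that the unit is enabled when presets are applied at install time. Note that Wants= is a weaker dependency than Requires=: with Wants=, MicroShift would still start if the collector failed to.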

Setup Suite
${cur} Get Journal Cursor unit=microshift-observability
Set Suite Variable ${JOURNAL_CUR} ${cur}
Wait Until Keyword Succeeds 1 min 5 sec
Contributor

We may want to increase the wait interval to say 5m?

Contributor Author

If we've made it to this point, then the observability service is running and generating output. This check is more of a precaution to ensure the test only checks output that was generated very recently. Otherwise, we may pick up errors or failures caused by service restarts or reboots. That can happen, for instance, if another test set is run before the observability tests.

@@ -0,0 +1,96 @@
# Opentelemetry-collector-small.yaml provides a minimal set of metrics and logs for monitoring system, node, and
Member

File name is different from what's in the comment

- microshift-observability
- microshift-etcd
- crio
- openvswitch.service
Member

There are also ovsdb-server.service and ovs-vswitchd.service - should we include them?
In my experience, I've seen ovsdb-server fail more often than the others.

metrics/kubeletstats:
receivers: [ kubeletstats ]
processors: [ batch ]
exporters: [ otlp ] # Uncomment to enable OTLP
Member

It looks uncommented to me :)

Member

So... This is a scenario for ostree tests, but you changed bootc containerfile (test/image-blueprints-bootc/layer2-source/group2/rhel96-bootc-source-optionals.containerfile).
And that explains why:

  • setup of THIS scenario (ostree el94 src optional) failed
  • robot test cannot find the logs (it obtained cursor because the unit is up, but probably because of missing configuration in bootc x el96 x optional it didn't contain the logs it expected)

Contributor Author

🤦🏻 good catch!

@copejon copejon force-pushed the no-issue-generate-otel-cert branch from f342f6f to 35b1abe Compare March 20, 2025 21:56
@pmtk
Member

pmtk commented Mar 21, 2025

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_microshift/4281/pull-ci-openshift-microshift-main-e2e-aws-tests-bootc/1902841887996252160:

+ 23:06:18.550008348 ./bin/scenario.sh:123 	scp -P 22 /tmp/tmp.lwpuNyOYUt [email protected]:/etc/microshift/opentelemetry-collector.yaml
dest open("/etc/microshift/opentelemetry-collector.yaml"): Permission denied
failed to upload file /tmp/tmp.lwpuNyOYUt to /etc/microshift/opentelemetry-collector.yaml

@copejon copejon force-pushed the no-issue-generate-otel-cert branch from 35b1abe to 305e6dd Compare March 21, 2025 21:17
@ggiguash
Contributor

/test all

@copejon copejon force-pushed the no-issue-generate-otel-cert branch 3 times, most recently from 974fdf9 to dff2d54 Compare March 25, 2025 16:56
@copejon
Contributor Author

copejon commented Mar 25, 2025

/test all

@copejon copejon force-pushed the no-issue-generate-otel-cert branch from dff2d54 to 8464679 Compare March 25, 2025 21:45
…elemetry-collector, preconfigured for microshift

Signed-off-by: Jon Cope <[email protected]>
@copejon copejon force-pushed the no-issue-generate-otel-cert branch from cdc0d7c to 689986c Compare March 26, 2025 14:29
@copejon
Contributor Author

copejon commented Mar 26, 2025

gateway api test failed, rerunning the test

/retest

@ggiguash
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 28, 2025
Contributor

openshift-ci bot commented Mar 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: copejon, ggiguash

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 2e7d5dd and 2 for PR HEAD 689986c in total

@ggiguash
Contributor

/test e2e-aws-tests-bootc

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD bfe386c and 1 for PR HEAD 689986c in total

Contributor

openshift-ci bot commented Mar 28, 2025

@copejon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-tests-bootc-periodic-arm | e759ad1 | link | true | /test e2e-aws-tests-bootc-periodic-arm |
| ci/prow/e2e-aws-tests-bootc-periodic | e759ad1 | link | true | /test e2e-aws-tests-bootc-periodic |
| ci/prow/e2e-aws-tests-periodic | e759ad1 | link | true | /test e2e-aws-tests-periodic |
| ci/prow/e2e-aws-tests-periodic-arm | e759ad1 | link | true | /test e2e-aws-tests-periodic-arm |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD bfe386c and 2 for PR HEAD 689986c in total

@openshift-merge-bot openshift-merge-bot bot merged commit 5aeeb9f into openshift:main Mar 28, 2025
13 checks passed
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
7 participants