TRACING-4752: Add OpenTelemetry-Collector as optional sub-package #4281

copejon · 2024-12-06T14:46:46Z

Which issue(s) this PR addresses:

Closes #

openshift-ci · 2024-12-06T14:46:50Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

packaging/rpm/microshift.spec

packaging/observability/opentelemetry-collector.yaml

packaging/rpm/microshift.spec

packaging/observability/microshift-observability.service

ggiguash · 2024-12-09T07:37:27Z

packaging/observability/microshift-observability.service

@@ -0,0 +1,27 @@
+[Unit]
+Description=MicroShift Observability
+BindsTo=microshift.service


Do we want to run the collector even when MicroShift fails?

I'd think yes. If MicroShift fails to start, the metrics and log data should still be collectable by the metrics/logging backend remotely.
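For reference, here is a minimal sketch of how the two choices would look in the unit file. It is an illustration under stated assumptions, not the unit shipped by this PR: Option A repeats the directive from the diff above, and Option B is a hypothetical alternative that only orders the collector after MicroShift.

```ini
# Option A (as in the diff above): bind the collector's lifetime to MicroShift.
# With BindsTo=, systemd stops this unit whenever microshift.service stops
# or fails, so no data is collected while MicroShift is down.
[Unit]
Description=MicroShift Observability
BindsTo=microshift.service

# Option B (hypothetical alternative, shown commented out): ordering only.
# After= delays startup until the microshift.service start job has finished,
# but does not stop the collector if MicroShift later fails.
# [Unit]
# Description=MicroShift Observability
# After=microshift.service
```

If the goal is to keep host metrics and logs flowing when MicroShift is down, an ordering-only dependency like Option B seems closer to that intent.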

packaging/observability/microshift-observability.service

ggiguash · 2024-12-09T07:49:41Z

/retitle NO-ISSUE: OpenTelemetry certificates and service for MicroShift

openshift-ci-robot · 2024-12-09T07:49:47Z

@copejon: This pull request explicitly references no jira issue.

In response to this:

Which issue(s) this PR addresses:

Closes #

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

packaging/observability/opentelemetry-collector.yaml

openshift-ci-robot · 2024-12-12T20:47:24Z

@copejon: This pull request references TRACING-4752 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Which issue(s) this PR addresses:

Closes #

copejon · 2024-12-12T20:47:28Z

/jira refresh

openshift-ci-robot · 2024-12-12T20:47:31Z

@copejon: This pull request references TRACING-4752 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

/jira refresh

ggiguash · 2025-01-23T08:30:03Z

packaging/rpm/microshift.spec

+Requires: opentelemetry-collector
+
+%description observability
+Deploys the Red Hat build of Opentelemetry-collector as a systemd service on host. MicroShift provides client


We need to be consistent in the naming case. Either fix this, or the Summary section, please.

Suggested change

-Deploys the Red Hat build of Opentelemetry-collector as a systemd service on host. MicroShift provides client
+Deploys the Red Hat build of OpenTelemetry-Collector as a systemd service on host. MicroShift provides client

ggiguash · 2025-01-23T08:32:37Z

packaging/rpm/microshift.spec

+certificates to permit access to the kube-apiserver metrics endpoints. If a user defined opentelemetry-collector exists
+at /etc/microshift/opentelemetry-collector.yaml, this config is used. Otherwise, a default config is provided. Note that
+the default configuration requires the backend endpoint be set by the user.


Suggested change

-certificates to permit access to the kube-apiserver metrics endpoints. If a user defined opentelemetry-collector exists
-at /etc/microshift/opentelemetry-collector.yaml, this config is used. Otherwise, a default config is provided. Note that
-the default configuration requires the backend endpoint be set by the user.
+certificates to permit access to the kube-apiserver metrics endpoints. If a user-defined configuration file exists
+at /etc/microshift/opentelemetry-collector.yaml, this configuration is used. Otherwise, a default configuration is provided.
+Note that the default configuration requires the backend endpoint be set by the user.

For the backend endpoint, should we be specific about what we expect users to set?
I mean, should we say that the exporters.otlp section must be edited by users?

Added more specific instructions.

ggiguash · 2025-01-23T08:39:33Z

packaging/observability/opentelemetry-collector.yaml

+# EXAMPLE OTLP (Prometheus) ENDPOINT CONFIG
+# The otlp exporter requires an endpoint listening for OTLP connections. To prevent spamming the log with Go
+# stack traces, the exporter is disabled. The endpoint is not known at installation, thus a tire-kicking of the
+# microshift-observability package would result in stack traces spam in logs.


Let's think about what we can do so that the logs are not "spammed" when the default configuration is used. It sounds as if we should ship this file with an .example suffix, so that users have to rename it explicitly when they enable the collector service.

In any case, the "style" of this comment should be reworded.

I've tweaked the comment and made it a little more informational

ggiguash · 2025-01-23T08:48:46Z

packaging/observability/microshift-observability.service

@@ -0,0 +1,20 @@
+[Unit]
+Description=MicroShift Observability
+After=microshift.service


Should we use ConditionPathExists here for all the files the service expects to have before it starts?

The opentelemetry-collector performs that check for us each time it starts.

Right, but the point of the condition in systemd is not to attempt starting the service if the path does not exist.

Could this help to avoid unnecessary restarts?
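As a minimal sketch of the suggestion, the condition could look like the fragment below. The path is the user configuration location named in the RPM description; which files the shipped unit should actually check is an assumption here, not something settled in this thread.

```ini
[Unit]
Description=MicroShift Observability
After=microshift.service
# If the file is missing, systemd skips the start attempt and does not mark
# the unit as failed, so no restart cycle is triggered by the missing file.
# Assumption: only the configuration file is checked in this sketch.
ConditionPathExists=/etc/microshift/opentelemetry-collector.yaml
```

Note that conditions are evaluated only when a start is attempted. The unit would not start on its own once the file appears later.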

ggiguash · 2025-01-23T08:53:35Z

packaging/observability/microshift-observability.service

+
+# It takes a bit for the certs to be created. This service will reach it's burst limit almost immediately, pretty much
+# guaranteeing that it will reach the restart limit before it can possibly succeed.
+RestartSec=200ms


Do we really need this? We've configured the service to start After microshift, so microshift must report readiness to systemd before the current service startup is attempted. MicroShift only reports readiness after creating all certificates.
What am I missing?

In earlier tests this was necessary to keep the service from crash looping, but that doesn't seem to be an issue in the latest opentelemetry-collector. Will remove

packaging/observability/microshift-observability.service

ggiguash · 2025-01-28T16:57:52Z

packaging/observability/opentelemetry-collector.yaml

+    auth_type: tls
+    ca_file: /etc/pki/microshift-opentelemetry-collector-client/client-ca.crt
+    key_file: /etc/pki/microshift-opentelemetry-collector-client/client.key
+    cert_file: /etc/pki/microshift-opentelemetry-collector-client/client.crt


These paths need to be updated too -> /var/lib/microshift/..../

ggiguash · 2025-01-28T17:01:59Z

packaging/rpm/microshift.spec

+certificates to permit access to the kube-apiserver metrics endpoints. If a user defined Opentelemetry-Collector exists
+at /etc/microshift/opentelemetry-collector.yaml, this config is used. Otherwise, a default config is provided. Note that
+the default configuration requires the backend endpoint be set by the user. The otlp export must also be specified as
+ .service.pipelines.$RECIEVER.exporter: "otlp".  The specification for the otlp config is:


Let's not use shortened words because it's a user-facing RPM description.

ggiguash · 2025-01-30T15:04:46Z

packaging/rpm/microshift.spec

+Requires: opentelemetry-collector
+
+%description observability
+Deploys the Red Hat build of Opentelemetry-Collector as a systemd service on host. MicroShift provides client


Please fix the case of Opentelemetry -> OpenTelemetry to make it consistent with the summary text.

ggiguash · 2025-03-06T11:40:26Z

test/suites/optional/observability.robot

+    [Documentation]    The service starts after MicroShift starts and thus will start generating pertinant log data
+    ...    right away. When the suite is executed, immediately get the cursor for the current
+    Setup Suite
+    ${cur}    Get Journal Cursor


Should we specify a unit here?
There seems to be a race condition here. Since we're testing a new unit, should we not just parse the logs from the beginning after we get enough lines there?

The missing unit was an oversight on my part. Fixed now. I don't understand what you mean about the race condition though. Can you elaborate?

test/suites/optional/observability.robot

assets/optional/observability/02-cluster-role.yaml

ggiguash · 2025-03-19T09:24:25Z

packaging/observability/opentelemetry-collector.yaml

+# workload resource usage for CPU, Memory, Disk, and Network. Included also are all cluster events of "Warning" type.
+
+# This configuration exports:
+# - Contain, Pod, and Node metrics


What is "Contain"?

Oops, lost an "er" there :D

packaging/observability/opentelemetry-collector.yaml

packaging/rpm/microshift.spec

ggiguash · 2025-03-19T09:29:33Z

packaging/rpm/microshift.spec

+%dir %{_prefix}/lib/microshift/manifests.d/003-microshift-observability
+%dir %{_sharedstatedir}/microshift-observability
+%{_unitdir}/microshift-observability.service
+%{_presetdir}/90-enable-microshift-observability.preset


Why do we need this preset?

If the observability RPM were installed without the preset, the service would be disabled by default, and the systemctl enable microshift command doesn't propagate to dependencies. That, coupled with the service setting RequiredBy=microshift.service, means that microshift itself can't start without the user manually enabling the observability service.

This way, the user isn't required to manage the service, as it's handled automatically.

What about the other way around? When microshift+observability are installed, microshift is disabled, but observability enabled by default. So, the observability service will fail on reboot.

We already have the following in the observability unit configuration.
Isn't it enough to couple service start / stop?

[Unit]
PartOf=microshift.service
After=microshift.service

Isn't it enough to couple service start / stop?

Well, no. In order for the RequiredBy= directive to take effect, the service must be enabled. But as you said, if it's enabled, then after a reboot it would start and enter a fail loop. This is fixed by applying the After=microshift.service.

So the .preset file + the RequiredBy= is what activates the dependency. The After= ensures that the service doesn't start until microshift is active. PartOf= propagates stops/restarts of the microshift service to the observability service.

But I think in the end this turned out to be a long walk to accomplish (basically) the same thing as adding microshift-observability to the microshift.service's Wants= directive.
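The directives discussed in this thread can be put together as the sketch below. It is not the exact unit merged in this PR: the preset file content and the drop-in file name are assumptions made for illustration.

```ini
# microshift-observability.service (sketch of the directives discussed above)
[Unit]
Description=MicroShift Observability
# Ordering only: do not start until microshift.service has finished starting.
After=microshift.service
# Propagate stops and restarts of microshift.service to this unit.
PartOf=microshift.service

[Install]
# Takes effect only once this unit is enabled: enabling creates a symlink in
# microshift.service.requires/, so starting MicroShift pulls this unit in.
RequiredBy=microshift.service

# 90-enable-microshift-observability.preset (assumed content, a single line):
#   enable microshift-observability.service

# Alternative mentioned at the end of the thread, written as a drop-in for
# microshift.service (hypothetical file name:
# /etc/systemd/system/microshift.service.d/10-observability.conf):
#   [Unit]
#   Wants=microshift-observability.service
```

The Wants= variant is a weaker dependency than RequiredBy=: MicroShift would still start if the observability unit were missing or failed to start.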

test/suites/optional/observability.robot

ggiguash · 2025-03-19T09:31:07Z

test/suites/optional/observability.robot

+    Setup Suite
+    ${cur}    Get Journal Cursor    unit=microshift-observability
+    Set Suite Variable    ${JOURNAL_CUR}    ${cur}
+    Wait Until Keyword Succeeds    1 min    5 sec


We may want to increase the wait interval to, say, 5m?

If we've made it to this point, then the observability service is running and generating output. This check is more of a precaution to ensure the test only checks output that was generated very recently. Otherwise we may pick up errors or failures caused by service restarts or reboots. That can happen, for instance, if another test set is run before the observability tests.

pmtk · 2025-03-20T08:37:57Z

packaging/observability/opentelemetry-collector.yaml

@@ -0,0 +1,96 @@
+# Opentelemetry-collector-small.yaml provides a minimal set of metrics and logs for monitoring system, node, and


File name is different from what's in the comment

pmtk · 2025-03-20T08:44:46Z

packaging/observability/opentelemetry-collector.yaml

+      - microshift-observability
+      - microshift-etcd
+      - crio
+      - openvswitch.service


There are also ovsdb-server.service and ovs-vswitchd.service. Should we include them?
In my experience, ovsdb-server fails more often than the others.

packaging/observability/opentelemetry-collector.yaml

pmtk · 2025-03-20T08:46:03Z

packaging/observability/opentelemetry-collector.yaml

+    metrics/kubeletstats:
+      receivers: [ kubeletstats ]
+      processors: [ batch ]
+      exporters: [ otlp ]  # Uncomment to enable OTLP


It looks uncommented to me :)

packaging/observability/opentelemetry-collector.yaml

test/scenarios/periodics/[email protected]

pmtk · 2025-03-20T09:10:18Z

test/scenarios/periodics/[email protected]

So... this is a scenario for ostree tests, but you changed the bootc containerfile (test/image-blueprints-bootc/layer2-source/group2/rhel96-bootc-source-optionals.containerfile).
That explains why:

- setup of THIS scenario (ostree el94 src optional) failed
- the robot test cannot find the logs (it obtained the cursor because the unit is up, but, probably because of missing configuration in bootc x el96 x optional, the journal did not contain the logs the test expected)

🤦🏻 good catch!

test/suites/optional/observability.robot

pmtk · 2025-03-21T12:17:50Z

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_microshift/4281/pull-ci-openshift-microshift-main-e2e-aws-tests-bootc/1902841887996252160:

+ 23:06:18.550008348 ./bin/scenario.sh:123 	scp -P 22 /tmp/tmp.lwpuNyOYUt [email protected]:/etc/microshift/opentelemetry-collector.yaml
dest open("/etc/microshift/opentelemetry-collector.yaml"): Permission denied
failed to upload file /tmp/tmp.lwpuNyOYUt to /etc/microshift/opentelemetry-collector.yaml

ggiguash · 2025-03-22T08:05:10Z

/test all

copejon · 2025-03-25T16:56:39Z

/test all

copejon · 2025-03-26T17:08:31Z

gateway api test failed, rerunning the test

/retest

ggiguash · 2025-03-28T08:46:06Z

/lgtm

openshift-ci · 2025-03-28T08:46:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: copejon, ggiguash

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [copejon,ggiguash]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2025-03-28T10:58:02Z

/retest-required

Remaining retests: 0 against base HEAD 2e7d5dd and 2 for PR HEAD 689986c in total

ggiguash · 2025-03-28T13:49:57Z

/test e2e-aws-tests-bootc

openshift-ci-robot · 2025-03-28T15:58:10Z

/retest-required

Remaining retests: 0 against base HEAD bfe386c and 1 for PR HEAD 689986c in total

openshift-ci · 2025-03-28T17:30:04Z

@copejon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-tests-bootc-periodic-arm | `e759ad1` | link | true | `/test e2e-aws-tests-bootc-periodic-arm` |
| ci/prow/e2e-aws-tests-bootc-periodic | `e759ad1` | link | true | `/test e2e-aws-tests-bootc-periodic` |
| ci/prow/e2e-aws-tests-periodic | `e759ad1` | link | true | `/test e2e-aws-tests-periodic` |
| ci/prow/e2e-aws-tests-periodic-arm | `e759ad1` | link | true | `/test e2e-aws-tests-periodic-arm` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2025-03-28T19:47:16Z

/retest-required

Remaining retests: 0 against base HEAD bfe386c and 2 for PR HEAD 689986c in total

openshift-ci bot added the do-not-merge/work-in-progress label Dec 6, 2024

openshift-ci bot added the approved label Dec 6, 2024

ggiguash reviewed Dec 9, 2024

openshift-ci bot changed the title ~~No issue generate otel cert~~ NO-ISSUE: OpenTelemetry certificates and service for MicroShift Dec 9, 2024

openshift-ci-robot added the jira/valid-reference label Dec 9, 2024

pavolloffay reviewed Dec 12, 2024

copejon changed the title ~~NO-ISSUE: OpenTelemetry certificates and service for MicroShift~~ TRACING-4752: Add OpenTelemetry-Collector as optional sub-package Dec 12, 2024

copejon force-pushed the no-issue-generate-otel-cert branch from fa4f579 to fede276 Compare December 12, 2024 21:05

copejon marked this pull request as ready for review January 21, 2025 16:45

openshift-ci bot removed the do-not-merge/work-in-progress label Jan 21, 2025

openshift-ci bot requested review from agullon and eslutsky January 21, 2025 16:46

copejon force-pushed the no-issue-generate-otel-cert branch from 2042714 to ccfea22 Compare January 22, 2025 23:20

ggiguash reviewed Jan 23, 2025

ggiguash reviewed Jan 28, 2025

copejon force-pushed the no-issue-generate-otel-cert branch from aee833a to ad4892d Compare January 30, 2025 13:13

ggiguash reviewed Jan 30, 2025

copejon force-pushed the no-issue-generate-otel-cert branch from 33de178 to 2996e90 Compare February 11, 2025 19:58

ggiguash reviewed Mar 6, 2025

copejon force-pushed the no-issue-generate-otel-cert branch 3 times, most recently from 23a7929 to 32155a5 Compare March 17, 2025 22:13

ggiguash reviewed Mar 19, 2025

pmtk reviewed Mar 20, 2025

agullon reviewed Mar 20, 2025

copejon force-pushed the no-issue-generate-otel-cert branch from f342f6f to 35b1abe Compare March 20, 2025 21:56

copejon force-pushed the no-issue-generate-otel-cert branch from 35b1abe to 305e6dd Compare March 21, 2025 21:17

copejon force-pushed the no-issue-generate-otel-cert branch 3 times, most recently from 974fdf9 to dff2d54 Compare March 25, 2025 16:56

copejon force-pushed the no-issue-generate-otel-cert branch from dff2d54 to 8464679 Compare March 25, 2025 21:45

Commit 689986c: add microshift sub-package to enable the optional deployment of opentelemetry-collector, preconfigured for microshift

Signed-off-by: Jon Cope <[email protected]>

copejon force-pushed the no-issue-generate-otel-cert branch from cdc0d7c to 689986c Compare March 26, 2025 14:29

openshift-ci bot assigned ggiguash Mar 28, 2025

openshift-ci bot added the lgtm label Mar 28, 2025

openshift-merge-bot bot merged commit 5aeeb9f into openshift:main Mar 28, 2025
13 checks passed
