
OCPBUGS-52280, SDN-5330: Add ipsec connect wait service #4854

Merged
merged 2 commits into openshift:main from the ipsec-connect-wait branch on Mar 18, 2025

Conversation

@pperiyasamy (Member) commented Feb 14, 2025

The IPsec upgrade CI job (e2e-aws-ovn-ipsec-upgrade) never passes: it fails with API connection disruption events, and some cluster operators are unavailable for longer than the allowed period. This PR addresses those issues with two fixes.

Add ipsec connect wait service

When a node on an IPsec-enabled cluster reboots, the ovs-monitor-ipsec process (deployed by the ovn-ipsec-host pod) parses the /etc/ipsec.d/openshift.conf file and makes the pluto daemon establish IKE SAs with peer nodes, but only after the pod has been deployed. Because this happens after kubelet has started, workload pods scheduled on the node fail to communicate with pods on other nodes until the IPsec SAs are established.

So the commit e96fe31 adds a wait-for-ipsec-connect.service systemd service, which depends on the ipsecenabler.service created by the IPsec machine config. The new service loads the existing IPsec connections created by OVN/OVS into the pluto daemon with the "auto=start" option and waits up to 3 minutes for the IPsec tunnels to be established. This gives the IPsec SAs a better chance of being established before kubelet starts, and when the ovn-ipsec-host pod comes up later it does not have to do anything for the existing IPsec connections.

The wait-for-ipsec-connect.service unit is added to the base template; putting it into the IPsec machine configs rendered by the network operator instead would cause two reboots during upgrade.
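For illustration only, this is roughly what such a unit could look like, based on the ordering described here and discussed in the review below; the script path, the exact dependency directive on ipsecenabler.service, and the install section are assumptions, not the literal merged unit:

[Unit]
Description=Ensure IKE SA established for existing IPsec connections.
Requires=ipsecenabler.service
After=ipsec.service ipsecenabler.service
Before=kubelet.service

[Service]
Type=oneshot
# Hypothetical helper script: flips the OVN connections to auto=start,
# restarts pluto, then waits (bounded) for the IPsec SAs to come up.
ExecStart=/usr/local/bin/wait-for-ipsec-connect.sh

[Install]
WantedBy=multi-user.target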

Add crio dependency for ipsec.service

We noticed that pluto tears down established IPsec connections in parallel with crio stopping all pod containers, including the API server pod container. This happens when a node reboot is initiated to render new machine configs during an OCP upgrade.

This causes API connection disruptions in the cluster; the resulting events are caught by the origin monitor tests and fail the IPsec upgrade CI lane, and it may also cause noticeable temporary pod traffic failures during the upgrade of an IPsec-enabled cluster.

Hence the commit fcf67a9 adds a Before=crio.service ordering dependency to ipsec.service, so that the pluto daemon is stopped only after the crio service has shut down and all pod containers on the node are stopped. This gives clients enough room to gracefully move their API connections to another control plane node.
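Conceptually, the ordering fix amounts to the following, shown here as an illustrative systemd drop-in; the file path is made up, and the PR delivers the directive through the machine config rather than necessarily this way:

# /etc/systemd/system/ipsec.service.d/10-before-crio.conf (illustrative path)
[Unit]
# systemd stops units in the reverse of their start order, so declaring that
# ipsec.service starts before crio.service guarantees it is stopped only
# after crio.service (and therefore every pod container) has shut down.
Before=crio.service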

Big thanks to @igsilya for helping to troubleshoot and fix this problem.

@pperiyasamy (Member Author)

/assign @tssurya

@pperiyasamy (Member Author)

/assign @jcaamano

@pperiyasamy (Member Author)

/assign @huiran0826

@pperiyasamy force-pushed the ipsec-connect-wait branch 2 times, most recently from 158c96f to 07dd5a4, on March 10, 2025 14:18
@pperiyasamy changed the title from "Add ipsec connect wait service" to "SDN-5330: Add ipsec connect wait service" on Mar 11, 2025
@openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) on Mar 11, 2025
@openshift-ci-robot (Contributor) commented Mar 11, 2025

@pperiyasamy: This pull request references SDN-5330 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

When a node on an IPsec-enabled cluster reboots, libreswan parses the /etc/ipsec.d/openshift.conf file once the node comes up and establishes SAs with its peers; this may still be in progress even after kubelet has started, so pods scheduled on this node fail to communicate with other pods until the IPsec tunnels are established.
So this commit adds a wait-for-ipsec-connect.service systemd service which depends on the ipsecenabler.service created by the IPsec machine config. The new service loads the existing connections into libreswan with the auto=start option for every connection and waits up to 3 minutes until the IPsec tunnels are established. The service is added to the base template to avoid two reboots during upgrade, which would happen if it went into the IPsec machine configs rendered by CNO.

TODO: observe the ipsec-upgrade behavior with this change in CI and revisit the logic, as it needs to be enabled only on IPsec-enabled clusters.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot (Contributor) commented Mar 12, 2025

@pperiyasamy: This pull request references SDN-5330 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

The IPsec upgrade CI job (e2e-aws-ovn-ipsec-upgrade) never passes: it fails with API connection disruption events, and some cluster operators are unavailable for longer than the allowed period. This PR addresses those issues with two fixes.

Add ipsec connect wait service

When a node on an IPsec-enabled cluster reboots, the ovs-monitor-ipsec process (deployed by the ovn-ipsec-host pod) parses the /etc/ipsec.d/openshift.conf file and makes the pluto daemon establish IKE SAs with peer nodes, but only after the pod has been deployed. Because this happens after kubelet has started, workload pods scheduled on the node fail to communicate with pods on other nodes until the IPsec SAs are established.

So the commit e96fe31 adds a wait-for-ipsec-connect.service systemd service, which depends on the ipsecenabler.service created by the IPsec machine config. The new service loads the existing IPsec connections created by OVN/OVS into the pluto daemon with the "auto=start" option and waits up to 3 minutes for the IPsec tunnels to be established. This gives the IPsec SAs a better chance of being established before kubelet starts, and when the ovn-ipsec-host pod comes up later it does not have to do anything for the existing IPsec connections.

The wait-for-ipsec-connect.service unit is added to the base template; putting it into the IPsec machine configs rendered by the network operator instead would cause two reboots during upgrade.

Add crio dependency for ipsec.service

We noticed that pluto tears down established IPsec connections in parallel with crio stopping all pod containers, including the API server pod container. This happens when a node reboot is initiated to render new machine configs during an OCP upgrade.

This causes API connection disruptions in the cluster; the resulting events are caught by the origin monitor tests and fail the IPsec upgrade CI lane, and it may also cause noticeable temporary pod traffic failures during the upgrade of an IPsec-enabled cluster.

Hence the commit fcf67a9 adds a Before=crio.service ordering dependency to ipsec.service, so that the pluto daemon is stopped only after the crio service has shut down and all pod containers on the node are stopped. This gives clients enough room to gracefully move their API connections to another control plane node.

Big thanks to @igsilya for helping to troubleshoot and fix this problem.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@pperiyasamy (Member Author)

/assign @trozet @igsilya

@pperiyasamy (Member Author)

/retest-required

@pperiyasamy (Member Author)

/retest

@pperiyasamy (Member Author)

/assign @yuqi-zhang

@tssurya (Contributor) left a comment

overall looks good
I just have some silly questions for my own understanding

Contributor

Great commit message and description of the problem; it made it easy for me to understand.
Quick question re:
The wait-for-ipsec-connect.service unit is added to the base template; putting it into the IPsec machine configs rendered by the network operator instead would cause two reboots during upgrade.
I am new to MCO, so is it the case that if you add something to the base template it kicks in during the main upgrade reboot itself, whereas there is another path that acts like day-2 and causes a second set of reboots? Sorry for the stupid question.

Member Author

@tssurya good question. If we define this service as part of the base template, MCO renders it as part of the final rendered machine config for the pool (so there is a single node reboot during upgrade). If we add it to the IPsec machine configs in CNO (https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/ovn-kubernetes/common/80-ipsec-master-extensions.yaml), it causes two node reboots: one for MCO rendering the updated IPsec machine configs on the node, and another for MCO rendering its own final rendered machine config. That's what we are avoiding here.

Contributor

thank you makes total sense!

# Append auto=start to every "conn ovn..." entry if not already present.
if ! grep -q "auto=start" /etc/ipsec.d/openshift.conf; then
  sed -i '/^.*conn ovn.*$/a\ auto=start' /etc/ipsec.d/openshift.conf
fi
# Restart the host's ipsec.service so pluto reloads the connections.
chroot /proc/1/root ipsec restart
Contributor

so this is what loads up all the connections based on ocp.conf file above?
question out of ignorance.. but ipsec here is "ipsec": {"NetworkManager-libreswan", "libreswan"},
?

Member Author

The chroot /proc/1/root ipsec restart command restarts the ipsec.service systemd service, which in turn restarts the pluto daemon (this loads up all the connections from the /etc/ipsec.d/openshift.conf file).

Contributor

ah ipsec == pluto gotcha

# Modify existing IPsec connection entries with "auto=start"
# option and restart ipsec systemd service. This helps to
# establish IKE SAs for the existing IPsec connections with
# peer nodes. This option will be deleted from connections
Contributor

Are these the new changes coming in OVS 3.5, where we plan to run it as a systemd process on the node?
So this fix is kinda temporary?

Member Author

Yes, this is a temporary fix for 4.19. Once the openvswitch-ipsec systemd service is consumed, this can be removed and only the 60s wait logic is needed.

# establish IKE SAs for the existing IPsec connections with
# peer nodes. This option will be deleted from connections
# once ovs-monitor-ipsec process spinned up on the node by
# ovn-ipsec-host pod, but still it won't reestablish IKE SAs
Contributor

So I assume that is also the case today: when the ipsec-host pod comes up it sees which connections are already established and does nothing?
This is the script you gave me that does that in OVS: https://github.com/openvswitch/ovs/blob/main/ipsec/ovs-monitor-ipsec.in

Member Author

Right, the ovs-monitor-ipsec process just comes up and does nothing for existing peer nodes (except deleting the auto=start parameter from every connection entry in the /etc/ipsec.d/openshift.conf file).

Contributor

auto=route is the default, which means "the connection will be automatically started when traffic matching its policy is detected".
So, given that kubelet has not started yet, we won't have IKE SAs established here; hence we need auto=start.
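For illustration, a connection entry in /etc/ipsec.d/openshift.conf ends up looking roughly like this after the wait service's sed edit; the conn name and addresses below are made up, only the auto= line is the point:

conn ovn-node2-in-1      # hypothetical tunnel name written by ovs-monitor-ipsec
    left=10.0.0.5        # illustrative local node address
    right=10.0.0.6       # illustrative peer node address
    auto=start           # appended by the wait service; with the default
                         # auto=route the tunnel only comes up once matching
                         # traffic is seen, which is too late here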

chroot /proc/1/root ipsec restart

# Wait for upto 60s to get IPsec SAs to establish with peer nodes.
timeout=60
Contributor

nit: update the commit msg/PR description to say 1 min; perhaps add the reasoning you mentioned in the review comments for why you chose 1 min?

Also, does the extra wait of 1 min per node (assuming reboots happen serially?) increase overall upgrade time enough that we need to worry about it?

Member Author

Sure, will do.

Reboots within the control plane nodes are serial.
Reboots within the worker nodes are serial.
Reboots of worker and control plane nodes happen in parallel.

On a 6-node cluster it took up to 2s (worst case 4s); on a 32-node cluster it took about 4s (worst case 14s). So this should not noticeably increase upgrade time, considering the script only waits up to 60s.
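For illustration, a bounded wait like the one being discussed can be sketched in bash as below; the probe used to decide that the SAs are established (ipsec whack --trafficstatus) is an assumption here, not necessarily what the merged script uses:

#!/bin/bash
# Number of tunnels we expect = number of "conn ovn..." entries in the config.
expected=$(grep -c 'conn ovn' /etc/ipsec.d/openshift.conf)
timeout=60
elapsed=0
while [ "$elapsed" -lt "$timeout" ]; do
    # Assumed probe: count the established SAs that libreswan reports.
    established=$(chroot /proc/1/root ipsec whack --trafficstatus | wc -l)
    if [ "$established" -ge "$expected" ]; then
        break
    fi
    sleep 2
    elapsed=$((elapsed + 2))
done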

contents: |
  [Unit]
  Description=Ensure IKE SA established for existing IPsec connections.
  After=ipsec.service
Contributor

So the ipsec service starts first, then we run this, and then we restart the ipsec service to load the connections?
I'm confused by the Before/After semantics here.
So Before=kubelet means this wait service will run before kubelet starts, right?
And the ipsec service will run before the wait service, i.e. the wait service runs after the ipsec service?

Member Author

Yes, right, that is the startup ordering of the services when the node comes up; while the node is being shut down, the services are stopped in the reverse order.
Because of the auto=start change in the connection entries, we must restart the pluto daemon for it to take effect.

Contributor

gotcha thank you

When a node on an IPsec-enabled cluster reboots, the ovs-monitor-ipsec
process (deployed by the ovn-ipsec-host pod) parses the
/etc/ipsec.d/openshift.conf file and makes the pluto daemon establish IKE
SAs with peer nodes, but only after the pod has been deployed. Because this
happens after kubelet has started, workload pods scheduled on the node fail
to communicate with pods on other nodes until the IPsec SAs are established.

So this commit adds a wait-for-ipsec-connect.service systemd service, which
depends on the ipsecenabler.service created by the IPsec machine config. The
new service loads the existing IPsec connections created by OVN/OVS into the
pluto daemon with the "auto=start" option and waits up to 60s for the IPsec
tunnels to be established. This gives the IPsec SAs a better chance of being
established before kubelet starts, and when the ovn-ipsec-host pod comes up
later it does not have to do anything for the existing IPsec connections.

We derived the total wait time of 60s from testing on 6-node and 32-node
clusters. On the 6-node cluster it mostly took about 2s (worst case 4s); on
the 32-node cluster it mostly took about 2s or 4s (worst case 14s) to get
the IPsec connections up with peer nodes.

The wait-for-ipsec-connect.service unit is added to the base template;
putting it into the IPsec machine configs rendered by the network operator
instead would cause two reboots during upgrade.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
We noticed that pluto tears down established IPsec connections in parallel
with crio stopping all pod containers, including the API server pod
container. This happens when a node reboot is initiated to render new
machine configs during an OCP upgrade.

This causes API connection disruptions in the cluster; the resulting events
are caught by the origin monitor tests and fail the IPsec upgrade CI lane,
and it may also cause noticeable temporary pod traffic failures during the
upgrade of an IPsec-enabled cluster.

Hence this commit adds a Before=crio.service ordering dependency to
ipsec.service, so that the pluto daemon is stopped only after the crio
service has shut down and all pod containers on the node are stopped. This
gives clients enough room to gracefully move their API connections to
another control plane node.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
@tssurya (Contributor) left a comment

/lgtm

I'm happy with how it looks and my questions were answered.
I'll let @trozet take another pass for the approval.

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Mar 17, 2025
#!/bin/bash
set -x

if [ ! -e "/etc/ipsec.d/openshift.conf" ]; then
Contributor

@pperiyasamy on a fresh cluster install I assume this will be the case?

Member Author

@tssurya right, and this will also be the case for IPsec external mode and disabled mode.

@pperiyasamy
Copy link
Member Author

/retest

@trozet (Contributor) commented Mar 17, 2025

/lgtm

@trozet (Contributor) commented Mar 17, 2025

@yuqi-zhang can you please approve?

@pperiyasamy changed the title from "SDN-5330: Add ipsec connect wait service" to "OCPBUGS-52280, SDN-5330: Add ipsec connect wait service" on Mar 17, 2025
@openshift-ci-robot added the jira/severity-moderate (Referenced Jira bug's severity is moderate for the branch this PR is targeting.) and jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) labels on Mar 17, 2025
@openshift-ci-robot (Contributor) commented Mar 17, 2025

@pperiyasamy: This pull request references Jira Issue OCPBUGS-52280, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references SDN-5330 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

The IPsec upgrade CI job (e2e-aws-ovn-ipsec-upgrade) never passes: it fails with API connection disruption events, and some cluster operators are unavailable for longer than the allowed period. This PR addresses those issues with two fixes.

Add ipsec connect wait service

When a node on an IPsec-enabled cluster reboots, the ovs-monitor-ipsec process (deployed by the ovn-ipsec-host pod) parses the /etc/ipsec.d/openshift.conf file and makes the pluto daemon establish IKE SAs with peer nodes, but only after the pod has been deployed. Because this happens after kubelet has started, workload pods scheduled on the node fail to communicate with pods on other nodes until the IPsec SAs are established.

So the commit e96fe31 adds a wait-for-ipsec-connect.service systemd service, which depends on the ipsecenabler.service created by the IPsec machine config. The new service loads the existing IPsec connections created by OVN/OVS into the pluto daemon with the "auto=start" option and waits up to 3 minutes for the IPsec tunnels to be established. This gives the IPsec SAs a better chance of being established before kubelet starts, and when the ovn-ipsec-host pod comes up later it does not have to do anything for the existing IPsec connections.

The wait-for-ipsec-connect.service unit is added to the base template; putting it into the IPsec machine configs rendered by the network operator instead would cause two reboots during upgrade.

Add crio dependency for ipsec.service

We noticed that pluto tears down established IPsec connections in parallel with crio stopping all pod containers, including the API server pod container. This happens when a node reboot is initiated to render new machine configs during an OCP upgrade.

This causes API connection disruptions in the cluster; the resulting events are caught by the origin monitor tests and fail the IPsec upgrade CI lane, and it may also cause noticeable temporary pod traffic failures during the upgrade of an IPsec-enabled cluster.

Hence the commit fcf67a9 adds a Before=crio.service ordering dependency to ipsec.service, so that the pluto daemon is stopped only after the crio service has shut down and all pod containers on the node are stopped. This gives clients enough room to gracefully move their API connections to another control plane node.

Big thanks to @igsilya for helping to troubleshoot and fix this problem.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci bot requested a review from anuragthehatter on March 17, 2025 21:45
@yuqi-zhang (Contributor) left a comment

Logically this seems fine. The main thing that has bitten us in the past with similar services was systemd dependency loops, but as I understand it there is strict ordering here.

@openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Mar 18, 2025
@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD d7ef09e and 2 for PR HEAD 1fa5eaa in total

@igsilya left a comment

FWIW, the change looks fine to me. The auto=start part is a little dangerous as it increases the chances of crossing streams at scale, so the sooner we can get rid of it (by switching to running openvswitch-ipsec.service on the host) the better, but it should be OK for now with the pinned Libreswan 4.6.

openshift-ci bot (Contributor) commented Mar 18, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: igsilya, pperiyasamy, trozet, tssurya, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 1bffe82 and 1 for PR HEAD 1fa5eaa in total

openshift-ci bot (Contributor) commented Mar 18, 2025

@pperiyasamy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ipi-ovn-ipv6 07dd5a4 link false /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-metal-ipi-ovn-dualstack 07dd5a4 link false /test e2e-metal-ipi-ovn-dualstack
ci/prow/cluster-bootimages 07dd5a4 link true /test cluster-bootimages
ci/prow/e2e-azure 07dd5a4 link false /test e2e-azure
ci/prow/e2e-azure-ovn-upgrade 07dd5a4 link false /test e2e-azure-ovn-upgrade
ci/prow/e2e-ovirt 07dd5a4 link false /test e2e-ovirt
ci/prow/okd-e2e-vsphere 07dd5a4 link false /test okd-e2e-vsphere
ci/prow/okd-e2e-upgrade 07dd5a4 link false /test okd-e2e-upgrade
ci/prow/okd-e2e-aws 07dd5a4 link false /test okd-e2e-aws
ci/prow/4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade 07dd5a4 link false /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
ci/prow/e2e-openstack-parallel 07dd5a4 link false /test e2e-openstack-parallel
ci/prow/e2e-openstack-externallb 07dd5a4 link false /test e2e-openstack-externallb
ci/prow/e2e-aws-disruptive 07dd5a4 link false /test e2e-aws-disruptive
ci/prow/e2e-gcp-ovn-rt-upgrade 07dd5a4 link false /test e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-aws-serial 07dd5a4 link false /test e2e-aws-serial
ci/prow/4.12-upgrade-from-stable-4.11-images 07dd5a4 link true /test 4.12-upgrade-from-stable-4.11-images
ci/prow/okd-e2e-gcp-op 07dd5a4 link false /test okd-e2e-gcp-op
ci/prow/e2e-aws-workers-rhel8 07dd5a4 link false /test e2e-aws-workers-rhel8
ci/prow/e2e-ovirt-upgrade 07dd5a4 link false /test e2e-ovirt-upgrade
ci/prow/okd-images 07dd5a4 link true /test okd-images
ci/prow/e2e-gcp-single-node 07dd5a4 link false /test e2e-gcp-single-node
ci/prow/e2e-aws-upgrade-single-node 07dd5a4 link false /test e2e-aws-upgrade-single-node
ci/prow/e2e-aws-ovn-workers-rhel8 07dd5a4 link false /test e2e-aws-ovn-workers-rhel8
ci/prow/e2e-gcp-op-ocl 1fa5eaa link false /test e2e-gcp-op-ocl
ci/prow/e2e-gcp-op-techpreview 1fa5eaa link false /test e2e-gcp-op-techpreview
ci/prow/okd-scos-e2e-aws-ovn 1fa5eaa link false /test okd-scos-e2e-aws-ovn
ci/prow/unit 1fa5eaa link true /test unit
ci/prow/e2e-hypershift 1fa5eaa link unknown /test e2e-hypershift
ci/prow/e2e-gcp-op 1fa5eaa link unknown /test e2e-gcp-op

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 1bffe82 and 2 for PR HEAD 1fa5eaa in total

@openshift-merge-bot bot merged commit d96d261 into openshift:main on Mar 18, 2025
14 of 17 checks passed
@openshift-ci-robot (Contributor)

@pperiyasamy: Jira Issue OCPBUGS-52280: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-52280 has not been moved to the MODIFIED state.

In response to this:

The IPsec upgrade CI job (e2e-aws-ovn-ipsec-upgrade) never passes: it fails with API connection disruption events, and some cluster operators are unavailable for longer than the allowed period. This PR addresses those issues with two fixes.

Add ipsec connect wait service

When a node on an IPsec-enabled cluster reboots, the ovs-monitor-ipsec process (deployed by the ovn-ipsec-host pod) parses the /etc/ipsec.d/openshift.conf file and makes the pluto daemon establish IKE SAs with peer nodes, but only after the pod has been deployed. Because this happens after kubelet has started, workload pods scheduled on the node fail to communicate with pods on other nodes until the IPsec SAs are established.

So the commit e96fe31 adds a wait-for-ipsec-connect.service systemd service, which depends on the ipsecenabler.service created by the IPsec machine config. The new service loads the existing IPsec connections created by OVN/OVS into the pluto daemon with the "auto=start" option and waits up to 3 minutes for the IPsec tunnels to be established. This gives the IPsec SAs a better chance of being established before kubelet starts, and when the ovn-ipsec-host pod comes up later it does not have to do anything for the existing IPsec connections.

The wait-for-ipsec-connect.service unit is added to the base template; putting it into the IPsec machine configs rendered by the network operator instead would cause two reboots during upgrade.

Add crio dependency for ipsec.service

We noticed that pluto tears down established IPsec connections in parallel with crio stopping all pod containers, including the API server pod container. This happens when a node reboot is initiated to render new machine configs during an OCP upgrade.

This causes API connection disruptions in the cluster; the resulting events are caught by the origin monitor tests and fail the IPsec upgrade CI lane, and it may also cause noticeable temporary pod traffic failures during the upgrade of an IPsec-enabled cluster.

Hence the commit fcf67a9 adds a Before=crio.service ordering dependency to ipsec.service, so that the pluto daemon is stopped only after the crio service has shut down and all pod containers on the node are stopped. This gives clients enough room to gracefully move their API connections to another control plane node.

Big thanks to @igsilya for helping to troubleshoot and fix this problem.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot (Contributor)

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.19.0-202503182309.p0.gd96d261.assembly.stream.el9.
All builds following this will include this PR.

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants