OCPBUGS-48469: Fix CoreDNS static pod bring-up on cloud platforms #4830

Open
wants to merge 4 commits into
base: main
from

Conversation

sadasu
Contributor

@sadasu sadasu commented Feb 3, 2025

Fixes: OCPBUGS-48469

- What I did

  1. Fixed CoreDNS Corefile errors for cloud platforms that need alternate in-cluster DNS when UserProvisionedDNS is enabled via the install-config
  2. Updated the list of directories to include the location of the CoreDNS files for cloud platforms
  3. Updated test data and unit tests for UserProvisionedDNS enabled on GCP.

- How to verify it
Set UserProvisionedDNS to Enabled for GCP via install-config and start installation

- Description for the changelog
Fixed issues with CoreDNS Corefile and template path for cloud platforms when UserProvisionedDNS is enabled.
Added test_data for GCP with all the UserProvisionedDNS configuration to better test this path.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2025
@sadasu sadasu changed the title WIP: Fix dot within cloud platform coredns Corefile WIP: Fix CoreDNS static pod bring-up on cloud platforms (GCP and AWS) Feb 4, 2025
@sadasu sadasu force-pushed the fix-cloud-platform-corefile branch 2 times, most recently from 71cc2fd to 34e381d Compare February 5, 2025 17:16
@sadasu sadasu changed the title WIP: Fix CoreDNS static pod bring-up on cloud platforms (GCP and AWS) GCP: Fix issues and update tests when UserProvisionedDNS is enabled Feb 5, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 5, 2025
@sadasu sadasu changed the title GCP: Fix issues and update tests when UserProvisionedDNS is enabled OCPBUGS-48469: Fix issues and update tests when UserProvisionedDNS is enabled Feb 5, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Feb 5, 2025
@openshift-ci-robot
Contributor

@sadasu: This pull request references Jira Issue OCPBUGS-48469, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jianli-wei

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from jianli-wei February 5, 2025 17:18
@sadasu sadasu force-pushed the fix-cloud-platform-corefile branch 2 times, most recently from 513164e to 1bb98dc Compare February 5, 2025 18:33
@sadasu sadasu changed the title OCPBUGS-48469: Fix issues and update tests when UserProvisionedDNS is enabled OCPBUGS-48469: Fix CoreDNS static pod bring-up on cloud platforms when UserProvisionedDNS is enabled Feb 5, 2025
@sadasu
Contributor Author

sadasu commented Feb 5, 2025

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Feb 5, 2025
@sadasu
Contributor Author

sadasu commented Feb 5, 2025

/retest

// If this is a cloud platform with DNSType set to `ClusterHosted` with
// LB IPs provided, include path for their CoreDNS files
if cloudPlatformLoadBalancerIPState(*config) == availableLBIPState {
    platformBasedPaths = append(platformBasedPaths, cloudPlatformAltDNS)
Contributor

Before this, was anything processing the templates in cloud-platform-alt-dns? Was it being done in bootstrap only?

Contributor Author

It was being done successfully on the bootstrap node. I found this issue while debugging why the CoreDNS pod was not starting on the master nodes.
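
For readers following this thread, here is a minimal sketch of the conditional path selection being discussed. Only the names `cloudPlatformLoadBalancerIPState`, `availableLBIPState`, and `cloudPlatformAltDNS` come from the diff above; the `renderConfig` type, its fields, `noLBIPState`, and `platformTemplatePaths` are hypothetical stand-ins, and the body of the state function is an assumption rather than the PR's implementation.

```go
package main

import "fmt"

// lbIPState is a hypothetical stand-in for the state type used in the PR.
type lbIPState int

const (
	// noLBIPState is a hypothetical "not applicable" state.
	noLBIPState lbIPState = iota
	// availableLBIPState is the name used in the diff above.
	availableLBIPState
)

// cloudPlatformAltDNS is the template directory named in the review thread.
const cloudPlatformAltDNS = "cloud-platform-alt-dns"

// renderConfig is a hypothetical, trimmed-down config for illustration only.
type renderConfig struct {
	Platform      string
	ClusterHosted bool     // assumed to mirror DNSType == ClusterHosted
	LBIPs         []string // assumed field carrying the load balancer IPs
}

// cloudPlatformLoadBalancerIPState reports whether LB IPs are available.
// The logic here is an assumption made for this sketch.
func cloudPlatformLoadBalancerIPState(cfg renderConfig) lbIPState {
	if cfg.ClusterHosted && len(cfg.LBIPs) > 0 {
		return availableLBIPState
	}
	return noLBIPState
}

// platformTemplatePaths returns the template directories to process and
// appends the alternate DNS directory only when LB IPs are available.
func platformTemplatePaths(cfg renderConfig) []string {
	paths := []string{cfg.Platform}
	if cloudPlatformLoadBalancerIPState(cfg) == availableLBIPState {
		paths = append(paths, cloudPlatformAltDNS)
	}
	return paths
}

func main() {
	gcp := renderConfig{Platform: "gcp", ClusterHosted: true, LBIPs: []string{"10.0.0.2"}}
	fmt.Println(platformTemplatePaths(gcp))
}
```

The point of the sketch is only that the extra directory is included on a condition, so platforms without cluster-hosted DNS are unaffected.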

@sadasu
Contributor Author

sadasu commented Feb 6, 2025

/retest-required

@mkowalski
Contributor

mkowalski commented Feb 6, 2025

OK from the on-prem team. MCO can do the honours of merging if the order of merging templates makes sense.

@sadasu sadasu force-pushed the fix-cloud-platform-corefile branch from 1bb98dc to 3a4599e Compare February 7, 2025 00:55
Contributor

@mkowalski mkowalski left a comment

One new comment for the changed invocation of runtimecfg.

Plus, a test would be good. MCO has everything needed to test whether the rendered config looks good.

@sadasu sadasu changed the title OCPBUGS-48469: Fix CoreDNS static pod bring-up on cloud platforms when UserProvisionedDNS is enabled OCPBUGS-48469: Fix CoreDNS static pod bring-up on cloud platforms Feb 8, 2025
@sadasu sadasu force-pushed the fix-cloud-platform-corefile branch 4 times, most recently from e8004f2 to f8223aa Compare February 13, 2025 05:07
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 17, 2025
Contributor

openshift-ci bot commented Mar 17, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mkowalski, sadasu, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 13ad337 and 2 for PR HEAD 8faf96e in total

@sadasu
Contributor Author

sadasu commented Mar 18, 2025

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD d7ef09e and 1 for PR HEAD 8faf96e in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 1bffe82 and 0 for PR HEAD 8faf96e in total

@openshift-ci-robot
Contributor

/hold

Revision 8faf96e was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 18, 2025
@sadasu
Contributor Author

sadasu commented Mar 18, 2025

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 18, 2025
@sadasu
Contributor Author

sadasu commented Mar 18, 2025

/retest-required

ci/prow/e2e-gcp-op-single-node has passed earlier and no code changes were made after that.

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD d96d261 and 2 for PR HEAD 8faf96e in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 684886b and 2 for PR HEAD 8faf96e in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD fe8353e and 1 for PR HEAD 8faf96e in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD fe8353e and 2 for PR HEAD 8faf96e in total

@sadasu
Contributor Author

sadasu commented Mar 19, 2025

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 19, 2025
@sadasu sadasu force-pushed the fix-cloud-platform-corefile branch from 8faf96e to b44669a Compare March 19, 2025 18:54
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 19, 2025
Contributor

openshift-ci bot commented Mar 19, 2025

New changes are detected. LGTM label has been removed.

@sadasu
Contributor Author

sadasu commented Mar 19, 2025

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 19, 2025
@sadasu sadasu force-pushed the fix-cloud-platform-corefile branch from b44669a to ace725e Compare March 19, 2025 20:10
Provide the path to the resolv.conf file on the master node.
The cluster-config.yaml file is not present on the master nodes, so
explicitly pass platformType to the baremetal-runtimecfg containers
in the cloud platform CoreDNS pod.
@sadasu sadasu force-pushed the fix-cloud-platform-corefile branch from ace725e to 0a425b0 Compare March 20, 2025 02:51
@sadasu
Contributor Author

sadasu commented Mar 20, 2025

/test e2e-hypershift

Contributor

openshift-ci bot commented Mar 20, 2025

@sadasu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-openstack-externallb | 8faf96e | link | false | /test e2e-openstack-externallb |
| ci/prow/e2e-ovirt | 8faf96e | link | false | /test e2e-ovirt |
| ci/prow/okd-e2e-aws | 8faf96e | link | false | /test okd-e2e-aws |
| ci/prow/4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade | 8faf96e | link | false | /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade |
| ci/prow/okd-e2e-gcp-op | 8faf96e | link | false | /test okd-e2e-gcp-op |
| ci/prow/e2e-aws-disruptive | 8faf96e | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-aws-single-node | 8faf96e | link | false | /test e2e-aws-single-node |
| ci/prow/okd-e2e-vsphere | 8faf96e | link | false | /test okd-e2e-vsphere |
| ci/prow/e2e-ovirt-upgrade | 8faf96e | link | false | /test e2e-ovirt-upgrade |
| ci/prow/okd-images | 8faf96e | link | true | /test okd-images |
| ci/prow/e2e-aws-upgrade-single-node | 8faf96e | link | false | /test e2e-aws-upgrade-single-node |
| ci/prow/e2e-gcp-single-node | 8faf96e | link | false | /test e2e-gcp-single-node |
| ci/prow/e2e-aws-ovn-workers-rhel8 | 8faf96e | link | false | /test e2e-aws-ovn-workers-rhel8 |
| ci/prow/4.12-upgrade-from-stable-4.11-images | 8faf96e | link | true | /test 4.12-upgrade-from-stable-4.11-images |
| ci/prow/e2e-aws-workers-rhel8 | 8faf96e | link | false | /test e2e-aws-workers-rhel8 |
| ci/prow/e2e-azure-ovn-upgrade | 8faf96e | link | false | /test e2e-azure-ovn-upgrade |
| ci/prow/e2e-aws-serial | 8faf96e | link | false | /test e2e-aws-serial |
| ci/prow/cluster-bootimages | 8faf96e | link | true | /test cluster-bootimages |
| ci/prow/e2e-openstack-parallel | 8faf96e | link | false | /test e2e-openstack-parallel |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 8faf96e | link | false | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/okd-e2e-upgrade | 8faf96e | link | false | /test okd-e2e-upgrade |
| ci/prow/e2e-gcp-op-ocl | 0a425b0 | link | false | /test e2e-gcp-op-ocl |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 0a425b0 | link | false | /test e2e-azure-ovn-upgrade-out-of-change |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sadasu
Contributor Author

sadasu commented Mar 21, 2025

Results of pre-merge testing with openshift/baremetal-runtimecfg#345.

  1. CoreDNS pod started successfully on masters with Corefile:
time="2025-03-20T03:57:17Z" level=info msg="Adding 10.0.0.6 as DNS Upstream"
time="2025-03-20T03:57:17Z" level=info msg="Adding 169.254.169.254 as DNS Upstream"
time="2025-03-20T03:57:17Z" level=info msg=". {"
time="2025-03-20T03:57:17Z" level=info msg="    errors"
time="2025-03-20T03:57:17Z" level=info msg="    bufsize 512"
time="2025-03-20T03:57:17Z" level=info msg="    health :18080"
time="2025-03-20T03:57:17Z" level=info msg="    forward . 10.0.0.6 169.254.169.254 {"
time="2025-03-20T03:57:17Z" level=info msg="        policy sequential"
time="2025-03-20T03:57:17Z" level=info msg="    }"
time="2025-03-20T03:57:17Z" level=info msg="    cache 30"
time="2025-03-20T03:57:17Z" level=info msg="    reload"
time="2025-03-20T03:57:17Z" level=info msg="    template IN A jiwei-0320b.qe.gcp.devcluster.openshift.com {"
time="2025-03-20T03:57:17Z" level=info msg="        match .*[.]apps.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:17Z" level=info msg="        answer \"{{ .Name }} 60 in {{ .Type }} 34.56.17.189\""
time="2025-03-20T03:57:17Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:17Z" level=info msg="    }"
time="2025-03-20T03:57:17Z" level=info msg="    template IN AAAA jiwei-0320b.qe.gcp.devcluster.openshift.com {"
time="2025-03-20T03:57:17Z" level=info msg="        match .*[.]apps.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:17Z" level=info msg="        "
time="2025-03-20T03:57:17Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:17Z" level=info msg="    }"
time="2025-03-20T03:57:17Z" level=info msg="    template IN A jiwei-0320b.qe.gcp.devcluster.openshift.com {"
time="2025-03-20T03:57:17Z" level=info msg="        match ^api.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:17Z" level=info msg="        answer \"{{ .Name }} 60 in {{ .Type }} 34.8.101.56\""
time="2025-03-20T03:57:17Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:17Z" level=info msg="    }"
time="2025-03-20T03:57:17Z" level=info msg="    template IN AAAA jiwei-0320b.qe.gcp.devcluster.openshift.com {"
time="2025-03-20T03:57:17Z" level=info msg="        match ^api.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:17Z" level=info msg="        "
time="2025-03-20T03:57:17Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:17Z" level=info msg="    }"
time="2025-03-20T03:57:17Z" level=info msg="    template IN A jiwei-0320b.qe.gcp.devcluster.openshift.com {"
time="2025-03-20T03:57:17Z" level=info msg="        match ^api-int.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:17Z" level=info msg="        answer \"{{ .Name }} 60 in {{ .Type }} 10.0.0.2\""
time="2025-03-20T03:57:17Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:17Z" level=info msg="    }"
time="2025-03-20T03:57:17Z" level=info msg="    template IN AAAA jiwei-0320b.qe.gcp.devcluster.openshift.com {"
time="2025-03-20T03:57:17Z" level=info msg="        match ^api-int.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:17Z" level=info msg="        "
time="2025-03-20T03:57:17Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:17Z" level=info msg="    }"
time="2025-03-20T03:57:17Z" level=info msg="    hosts {"
time="2025-03-20T03:57:17Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:17Z" level=info msg="    }"
time="2025-03-20T03:57:17Z" level=info msg="}"
time="2025-03-20T03:57:17Z" level=info
time="2025-03-20T03:57:17Z" level=info msg="Runtimecfg rendering template" path=/etc/coredns/Corefile

@sadasu
Contributor Author

sadasu commented Mar 21, 2025

  1. CoreDNS pod continuing to run with updated Corefile:
time="2025-03-20T03:57:18Z" level=info msg="Adding 10.0.0.6 as DNS Upstream"
time="2025-03-20T03:57:18Z" level=info msg="Adding 169.254.169.254 as DNS Upstream"
time="2025-03-20T03:57:18Z" level=error msg="Failed to get node list: Get \"https://api-int.jiwei-0320b.qe.gcp.devcluster.openshift.com:6443/api/v1/nodes\": dial tcp: lookup api-int.jiwei-0320b.qe.gcp.devcluster.openshift.com on 169.254.169.254:53: no such host"
time="2025-03-20T03:57:48Z" level=info msg="Adding 10.0.0.6 as DNS Upstream"
time="2025-03-20T03:57:48Z" level=info msg="Adding 169.254.169.254 as DNS Upstream"
time="2025-03-20T03:57:49Z" level=info msg="Node change detected, rendering Corefile" Node Addresses="[{10.0.0.5 jiwei-0320b-rlfw2-master-0 false} {10.0.0.3 jiwei-0320b-rlfw2-master-1 false} {10.0.0.6 jiwei-0320b-rlfw2-master-2 false}]"
time="2025-03-20T03:57:49Z" level=info msg=". {"
time="2025-03-20T03:57:49Z" level=info msg="    errors"
time="2025-03-20T03:57:49Z" level=info msg="    bufsize 512"
time="2025-03-20T03:57:49Z" level=info msg="    health :18080"
time="2025-03-20T03:57:49Z" level=info msg="    forward . 10.0.0.6 169.254.169.254 {"
time="2025-03-20T03:57:49Z" level=info msg="        policy sequential"
time="2025-03-20T03:57:49Z" level=info msg="    }"
time="2025-03-20T03:57:49Z" level=info msg="    cache 30"
time="2025-03-20T03:57:49Z" level=info msg="    reload"
time="2025-03-20T03:57:49Z" level=info msg="    template IN A jiwei-0320b.qe.gcp.devcluster.openshift.com {"
time="2025-03-20T03:57:49Z" level=info msg="        match .*[.]apps.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:49Z" level=info msg="        answer \"{{ .Name }} 60 in {{ .Type }} 34.56.17.189\""
time="2025-03-20T03:57:49Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:49Z" level=info msg="    }"
time="2025-03-20T03:57:49Z" level=info msg="    template IN AAAA jiwei-0320b.qe.gcp.devcluster.openshift.com {"
time="2025-03-20T03:57:49Z" level=info msg="        match .*[.]apps.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:49Z" level=info msg="        "
time="2025-03-20T03:57:49Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:49Z" level=info msg="    }"
time="2025-03-20T03:57:49Z" level=info msg="    template IN A jiwei-0320b.qe.gcp.devcluster.openshift.com {"
time="2025-03-20T03:57:49Z" level=info msg="        match ^api.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:49Z" level=info msg="        answer \"{{ .Name }} 60 in {{ .Type }} 34.8.101.56\""
time="2025-03-20T03:57:49Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:49Z" level=info msg="    }"
time="2025-03-20T03:57:49Z" level=info msg="    template IN AAAA jiwei-0320b.qe.gcp.devcluster.openshift.com {"
time="2025-03-20T03:57:49Z" level=info msg="        match ^api.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:49Z" level=info msg="        "
time="2025-03-20T03:57:49Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:49Z" level=info msg="    }"
time="2025-03-20T03:57:49Z" level=info msg="    template IN A jiwei-0320b.qe.gcp.devcluster.openshift.com {"
time="2025-03-20T03:57:49Z" level=info msg="        match ^api-int.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:49Z" level=info msg="        answer \"{{ .Name }} 60 in {{ .Type }} 10.0.0.2\""
time="2025-03-20T03:57:49Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:49Z" level=info msg="    }"
time="2025-03-20T03:57:49Z" level=info msg="    template IN AAAA jiwei-0320b.qe.gcp.devcluster.openshift.com {"
time="2025-03-20T03:57:49Z" level=info msg="        match ^api-int.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:49Z" level=info msg="        "
time="2025-03-20T03:57:49Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:49Z" level=info msg="    }"
time="2025-03-20T03:57:49Z" level=info msg="    hosts {"
time="2025-03-20T03:57:49Z" level=info msg="        10.0.0.5 jiwei-0320b-rlfw2-master-0 jiwei-0320b-rlfw2-master-0.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:49Z" level=info msg="        10.0.0.3 jiwei-0320b-rlfw2-master-1 jiwei-0320b-rlfw2-master-1.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:49Z" level=info msg="        10.0.0.6 jiwei-0320b-rlfw2-master-2 jiwei-0320b-rlfw2-master-2.jiwei-0320b.qe.gcp.devcluster.openshift.com"
time="2025-03-20T03:57:49Z" level=info msg="        fallthrough"
time="2025-03-20T03:57:49Z" level=info msg="    }"
time="2025-03-20T03:57:49Z" level=info msg="}"
time="2025-03-20T03:57:49Z" level=info
time="2025-03-20T03:57:49Z" level=info msg="Runtimecfg rendering template" path=/etc/coredns/Corefile

Labels
acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. qe-approved Signifies that QE has signed off on this PR
5 participants