OCPBUGS-14346: Fix when DNS operator reports Degraded #373
Conversation
@candita: This pull request references Jira Issue OCPBUGS-14346, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from e077885 to e6cc811.
/jira refresh
@candita: This pull request references Jira Issue OCPBUGS-14346, which is invalid.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/jira refresh
@candita: This pull request references Jira Issue OCPBUGS-14346, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug. Requesting review from QA contact.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from ad5a5f1 to bfd73b8.
@candita: This pull request references Jira Issue OCPBUGS-14346, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @gcs278
Force-pushed from bfd73b8 to 950992b.
I'm following discussion in #wg-operator-degraded-condition, but wanted to add this one comment initially.
Still trying to wrap my head around the issue
```go
name:         "would return Degraded=ConditionTrue, but Degraded was set to false within tolerated duration, so returns Degraded=ConditionFalse",
clusterIP:    "1.2.3.4",
dnsDaemonset: makeDaemonSet(6, 1, 6),
nrDaemonset:  makeDaemonSet(6, 6, 6),
```
I think you are missing a unit test: when I ran with code coverage and set a breakpoint, https://github.com/openshift/cluster-dns-operator/pull/373/files#diff-32495132facf7e0819a407af732514958b198d9e40236656bc43b093345d1539R116-R118 was never hit.
The existing test here, "should return Degraded=ConditionTrue", doesn't cover the case where want > have.
A test like this would do:
```go
{
	name:         "should return Degraded=ConditionTrue > 1 available",
	clusterIP:    "1.2.3.4",
	dnsDaemonset: makeDaemonSet(6, 1, 6),
	nrDaemonset:  makeDaemonSet(6, 6, 6),
	oldCondition: operatorv1.OperatorCondition{
		Type:               operatorv1.OperatorStatusTypeDegraded,
		Status:             operatorv1.ConditionFalse,
		LastTransitionTime: metav1.NewTime(time.Date(2022, time.Month(5), 19, 1, 9, 50, 0, time.UTC)),
	},
	currentTime: time.Date(2022, time.Month(5), 19, 1, 11, 50, 0, time.UTC),
	// last-curr = 1m, tolerate 1m, so should prevent the flap.
	toleration: 1 * time.Minute,
	expected:   operatorv1.ConditionFalse,
},
```
I took out my trusty truth table and this is definitely an issue. If it starts Degraded=True, it goes to Degraded=False and starts over. We want to keep it in Degraded=True in that case. I'll keep working.
@gcs278 the example code you have here makes current time - last transition time = 2m, and the toleration is 1m, so the expected condition would be true. The link doesn't resolve anymore, can you tell me which breakpoint you're referring to?
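To make the time arithmetic concrete, here is a minimal sketch of the damping check under discussion, using the timestamps from the proposed test case; the variable names are illustrative and not the operator's actual code:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Values taken from the proposed test case above.
	lastTransition := time.Date(2022, time.May, 19, 1, 9, 50, 0, time.UTC)
	currentTime := time.Date(2022, time.May, 19, 1, 11, 50, 0, time.UTC)
	toleration := 1 * time.Minute

	// Elapsed time since the last Degraded transition: 2 minutes.
	elapsed := currentTime.Sub(lastTransition)
	if elapsed <= toleration {
		fmt.Println("within toleration: keep Degraded=False to damp the flap")
	} else {
		// This branch is taken (2m > 1m), so the expected condition is True.
		fmt.Println("toleration exceeded: report Degraded=True")
	}
}
```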
/jira refresh
@candita: This pull request references Jira Issue OCPBUGS-14346, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
/remove-lifecycle rotten
/remove-lifecycle stale
/tide refresh
The PR seems good and sufficient. It fixes the root cause AND hardens the status reconciliation. To make sure I had the logic down I tried adding an e2e test, see below. Overall this seems like a sensible, conservative repair.
Good things:
- Centralizes the upgrade detection and uses it in both the Progressing and Degraded logic.
- Avoids showing both `Progressing=True` and `Degraded=True` during upgrades: the operator-level Degraded condition is suppressed while upgrading (the operand, DNS-level degraded condition still exists). This follows OpenShift guidance and addresses the user-visible noise that motivated OCPBUGS-14346. (See the sketch after this list.)
- Better surfacing of the underlying DNS degraded reasons when not suppressing while upgrading.
- Hardens status reconciliation: `computeDNSStatusConditions` is now able to return an error; the reconciler treats some errors as retryable and requeues with backoff. This makes status sync more robust against transient read errors.
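To pin down what "suppressed while upgrading" means here, a rough sketch of the rule described in the list above; the helper name and shape are hypothetical, not the PR's literal code:

```go
package status

import (
	operatorv1 "github.com/openshift/api/operator/v1"
)

// operatorDegradedStatus is a hypothetical helper illustrating the rule: while
// an upgrade is in progress the ClusterOperator-level Degraded condition stays
// False, while the operand (DNS-level) degraded detail remains on the DNS CR.
func operatorDegradedStatus(upgrading, operandDegraded bool) operatorv1.ConditionStatus {
	if upgrading {
		return operatorv1.ConditionFalse
	}
	if operandDegraded {
		return operatorv1.ConditionTrue
	}
	return operatorv1.ConditionFalse
}
```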
Stuff for possible follow-ups:
- The PR introduces `pkg/util/retryableerror`, which could probably be proposed as an addition to `library-go` or our own central lib location. I can't find anything just like it on the interwebs, yet it seems pretty useful generally. (A sketch of the pattern follows this list.)
- Test changes / expectations: the PR changes unit test expectations (some tests previously expected Degraded=True while Progressing=True). That's appropriate, but this is why I was thinking about adding an e2e to avoid operational surprises.
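For reference, a sketch of what such a retryable-error helper typically looks like; names and details are approximate and may differ from the PR's `pkg/util/retryableerror`:

```go
package retryableerror

import "time"

// Error is an error that suggests how long to wait before retrying, so the
// reconciler can requeue with backoff instead of reporting a hard failure.
type Error interface {
	error
	After() time.Duration
}

type retryable struct {
	err   error
	after time.Duration
}

func (r retryable) Error() string        { return r.err.Error() }
func (r retryable) After() time.Duration { return r.after }

// New wraps err as retryable after the given duration.
func New(err error, after time.Duration) Error {
	return retryable{err: err, after: after}
}
```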
Trying to add an e2e to test this new behavior
- I was trying to simulate an upgrade-in-progress so DNS reports Progressing=True/Degraded=True and verify the operator shows Progressing=True but Degraded=False while the DNS CR still contains the degraded detail
- I backed off on this e2e for now because the operator immediately recomputes and overwrites any patched DNS.Status from real cluster state, so one-shot status patches are ephemeral; reliably producing a visible Progressing state requires making a real state change which is intrusive, timing-sensitive, and likely can get stuck in a race.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: rfredette. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest-required
@candita: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
```diff
 if len(progressingReasons) != 0 {
 	progressingCondition.Status = status
-	progressingCondition.Reason = strings.Join(progressingReasons, "And")
+	progressingCondition.Reason = strings.Join(progressingReasons, "And ")
```
@bentito found this can produce a reason like `DNSReportsProgressingIsTrueAnd Upgrading` (extra space).
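A tiny repro of the separator issue, assuming the reason is built from the two parts visible in that string:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	reasons := []string{"DNSReportsProgressingIsTrue", "Upgrading"}
	// "And " (trailing space) leaves a space inside the CamelCase Reason.
	fmt.Println(strings.Join(reasons, "And ")) // DNSReportsProgressingIsTrueAnd Upgrading
	// "And" without the space produces the intended form.
	fmt.Println(strings.Join(reasons, "And")) // DNSReportsProgressingIsTrueAndUpgrading
}
```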
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade openshift/origin#30296
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command.
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dc28bfe0-a919-11f0-8524-9129321d16ea-0
This PR tested out (by me, against OCP 4.19) as a fix for https://issues.redhat.com/browse/OCPBUGS-62623 as well.
In the last run, the update of /payload-job-with-prs periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade openshift/origin#30296
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command.
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d3908620-a94f-11f0-97bc-4bfd2ef5ba80-0
Same failure in the last run. Feel like something is wrong with the pull here.


Don't allow the cluster operator status to be Progressing while Degraded.
Update when the DNS operator reports Progressing. Formerly it compared the daemonset's NumberAvailable to DesiredNumberScheduled for an available status and, if they weren't equal, then compared UpdatedNumberScheduled to DesiredNumberScheduled for an up-to-date status. Now it only compares UpdatedNumberScheduled to DesiredNumberScheduled for an up-to-date status, and instead of requiring equality, it requires the updated count to be greater than or equal to the desired number scheduled (see the sketch after these notes).
Fix mergeConditions lastTransitionTime updates.
Use the same heuristics on the node resolver pod count as on the DNS pod count.
Add a unit test for computing the degraded condition. Fix unit tests, especially those that expect Degraded to be true while Progressing is true, making sure that some of them observe a sense of time by adding varied previous conditions.
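A rough sketch of the revised up-to-date comparison described above (illustrative helper name, not the PR's literal code):

```go
package status

import appsv1 "k8s.io/api/apps/v1"

// daemonSetUpToDate is a hypothetical helper: Progressing is now driven only by
// UpdatedNumberScheduled versus DesiredNumberScheduled, and the updated count
// merely needs to be at least the desired count rather than exactly equal.
func daemonSetUpToDate(ds *appsv1.DaemonSet) bool {
	return ds.Status.UpdatedNumberScheduled >= ds.Status.DesiredNumberScheduled
}
```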