OCPBUGS-14346: Fix when DNS operator reports Degraded #373
Conversation
@candita: This pull request references Jira Issue OCPBUGS-14346, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from e077885 to e6cc811.
/jira refresh
@candita: This pull request references Jira Issue OCPBUGS-14346, which is invalid.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/jira refresh
@candita: This pull request references Jira Issue OCPBUGS-14346, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug. Requesting review from QA contact.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from ad5a5f1 to bfd73b8.
@candita: This pull request references Jira Issue OCPBUGS-14346, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @gcs278
Force-pushed from bfd73b8 to 950992b.
I'm following discussion in #wg-operator-degraded-condition, but wanted to add this one comment initially.
Still trying to wrap my head around the issue
```go
name:         "would return Degraded=ConditionTrue, but Degraded was set to false within tolerated duration, so returns Degraded=ConditionFalse",
clusterIP:    "1.2.3.4",
dnsDaemonset: makeDaemonSet(6, 1, 6),
nrDaemonset:  makeDaemonSet(6, 6, 6),
```
I think you are missing a unit test: when I ran with code coverage and set a breakpoint, https://github.com/openshift/cluster-dns-operator/pull/373/files#diff-32495132facf7e0819a407af732514958b198d9e40236656bc43b093345d1539R116-R118 was never hit.
The existing test here, "should return Degraded=ConditionTrue", doesn't cover the case where want > have.
A test like this would do:
```go
{
	name:         "should return Degraded=ConditionTrue > 1 available",
	clusterIP:    "1.2.3.4",
	dnsDaemonset: makeDaemonSet(6, 1, 6),
	nrDaemonset:  makeDaemonSet(6, 6, 6),
	oldCondition: operatorv1.OperatorCondition{
		Type:               operatorv1.OperatorStatusTypeDegraded,
		Status:             operatorv1.ConditionFalse,
		LastTransitionTime: metav1.NewTime(time.Date(2022, time.Month(5), 19, 1, 9, 50, 0, time.UTC)),
	},
	currentTime: time.Date(2022, time.Month(5), 19, 1, 11, 50, 0, time.UTC),
	// last-curr = 1m, tolerate 1m, so should prevent the flap.
	toleration: 1 * time.Minute,
	expected:   operatorv1.ConditionFalse,
},
```
I took out my trusty truth table and this is definitely an issue. If it starts Degraded=True, it goes to Degraded=False and starts over. We want to keep it in Degraded=True in that case. I'll keep working.
@gcs278 the example code you have here makes current time - last transition time = 2m, and the toleration is 1m, so the expected condition would be true. The link doesn't resolve anymore, can you tell me which breakpoint you're referring to?
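To make the time arithmetic concrete, here is a minimal sketch of the damping check under discussion, using the timestamps from the proposed test case; the variable names are illustrative and not the operator's actual code:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Values taken from the proposed test case above.
	lastTransition := time.Date(2022, time.May, 19, 1, 9, 50, 0, time.UTC)
	currentTime := time.Date(2022, time.May, 19, 1, 11, 50, 0, time.UTC)
	toleration := 1 * time.Minute

	// Elapsed time since the last Degraded transition: 2 minutes.
	elapsed := currentTime.Sub(lastTransition)
	if elapsed <= toleration {
		fmt.Println("within toleration: keep Degraded=False to damp the flap")
	} else {
		// This branch is taken (2m > 1m), so the expected condition is True.
		fmt.Println("toleration exceeded: report Degraded=True")
	}
}
```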
/jira refresh
@candita: This pull request references Jira Issue OCPBUGS-14346, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
/remove-lifecycle rotten
/remove-lifecycle stale
/tide refresh
The PR seems good and sufficient. It fixes the root cause AND hardens the status reconciliation. To make sure I had the logic down I tried adding an e2e test, see below. Overall this seems like a sensible, conservative repair.
Good things:
- Centralizes the upgrade detection and uses it in both the Progressing and Degraded logic.
- Avoids showing both `Progressing=True` and `Degraded=True` during upgrades: the operator-level Degraded condition is suppressed while upgrading (the operand, DNS-level degraded condition still exists). This follows OpenShift guidance and addresses the user-visible noise that motivated OCPBUGS-14346. (See the sketch after this list.)
- Better surfacing of the underlying DNS degraded reasons when not suppressing while upgrading.
- Hardens status reconciliation: `computeDNSStatusConditions` is now able to return an error; the reconciler treats some errors as retryable and requeues with backoff. This makes status sync more robust against transient read errors.
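To pin down what "suppressed while upgrading" means here, a rough sketch of the rule described in the list above; the helper name and shape are hypothetical, not the PR's literal code:

```go
package status

import (
	operatorv1 "github.com/openshift/api/operator/v1"
)

// operatorDegradedStatus is a hypothetical helper illustrating the rule: while
// an upgrade is in progress the ClusterOperator-level Degraded condition stays
// False, while the operand (DNS-level) degraded detail remains on the DNS CR.
func operatorDegradedStatus(upgrading, operandDegraded bool) operatorv1.ConditionStatus {
	if upgrading {
		return operatorv1.ConditionFalse
	}
	if operandDegraded {
		return operatorv1.ConditionTrue
	}
	return operatorv1.ConditionFalse
}
```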
Stuff for possible follow-ups:
- The PR introduces `pkg/util/retryableerror`, which could probably be proposed as an addition to `library-go` or our own central lib location. I can't find anything just like it on the interwebs, yet it seems pretty useful generally. (A sketch of the pattern follows this list.)
- Test changes / expectations: the PR changes unit test expectations (some tests previously expected Degraded=True while Progressing=True). That's appropriate, but this is why I was thinking about adding an e2e to avoid operational surprises.
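For reference, a sketch of what such a retryable-error helper typically looks like; names and details are approximate and may differ from the PR's `pkg/util/retryableerror`:

```go
package retryableerror

import "time"

// Error is an error that suggests how long to wait before retrying, so the
// reconciler can requeue with backoff instead of reporting a hard failure.
type Error interface {
	error
	After() time.Duration
}

type retryable struct {
	err   error
	after time.Duration
}

func (r retryable) Error() string        { return r.err.Error() }
func (r retryable) After() time.Duration { return r.after }

// New wraps err as retryable after the given duration.
func New(err error, after time.Duration) Error {
	return retryable{err: err, after: after}
}
```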
Trying to add an e2e to test this new behavior
- I was trying to simulate an upgrade-in-progress so DNS reports Progressing=True/Degraded=True and verify the operator shows Progressing=True but Degraded=False while the DNS CR still contains the degraded detail
- I backed off on this e2e for now because the operator immediately recomputes and overwrites any patched DNS.Status from real cluster state, so one-shot status patches are ephemeral; reliably producing a visible Progressing state requires making a real state change which is intrusive, timing-sensitive, and likely can get stuck in a race.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: rfredette. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest-required
@candita: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
```diff
 if len(progressingReasons) != 0 {
 	progressingCondition.Status = status
-	progressingCondition.Reason = strings.Join(progressingReasons, "And")
+	progressingCondition.Reason = strings.Join(progressingReasons, "And ")
```
@bentito found this can produce a reason like `DNSReportsProgressingIsTrueAnd Upgrading` (extra space).
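A tiny repro of the separator issue, assuming the reason is built from the two parts visible in that string:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	reasons := []string{"DNSReportsProgressingIsTrue", "Upgrading"}
	// "And " (trailing space) leaves a space inside the CamelCase Reason.
	fmt.Println(strings.Join(reasons, "And ")) // DNSReportsProgressingIsTrueAnd Upgrading
	// "And" without the space produces the intended form.
	fmt.Println(strings.Join(reasons, "And")) // DNSReportsProgressingIsTrueAndUpgrading
}
```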
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade openshift/origin#30296
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command.
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dc28bfe0-a919-11f0-8524-9129321d16ea-0
This PR tested out (by me, against OCP 4.19) as a fix for https://issues.redhat.com/browse/OCPBUGS-62623 as well.
In the last run, the update of /payload-job-with-prs periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade openshift/origin#30296
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command.
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d3908620-a94f-11f0-97bc-4bfd2ef5ba80-0
Same failure in the last run. Feel like something is wrong with the pull here.


Don't allow the cluster operator status to be Progressing while Degraded.
Update when the DNS operator reports Progressing. Formerly it compared the daemonset's NumberAvailable to DesiredNumberScheduled for an available status and, if they weren't equal, then compared UpdatedNumberScheduled to DesiredNumberScheduled for an up-to-date status. Now it only compares UpdatedNumberScheduled to DesiredNumberScheduled for an up-to-date status, and instead of requiring equality, it requires the updated count to be greater than or equal to the desired number scheduled (see the sketch after these notes).
Fix mergeConditions lastTransitionTime updates.
Use the same heuristics on the node resolver pod count as on the DNS pod count.
Add a unit test for computing the degraded condition. Fix unit tests, especially those that expect Degraded to be true while Progressing is true, making sure that some of them observe a sense of time by adding varied previous conditions.
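A rough sketch of the revised up-to-date comparison described above (illustrative helper name, not the PR's literal code):

```go
package status

import appsv1 "k8s.io/api/apps/v1"

// daemonSetUpToDate is a hypothetical helper: Progressing is now driven only by
// UpdatedNumberScheduled versus DesiredNumberScheduled, and the updated count
// merely needs to be at least the desired count rather than exactly equal.
func daemonSetUpToDate(ds *appsv1.DaemonSet) bool {
	return ds.Status.UpdatedNumberScheduled >= ds.Status.DesiredNumberScheduled
}
```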