demonstrate how to check if a node is stuck draining #514
base: master
Conversation
Adds a StuckNode condition on the ManageUpgradeOperator. This is slightly more than simply "has it taken a while". It does the following:
1. Check whether the same node has been observed draining for more than a period of time.
2. If it has, report:
   1. all still-running pods with very long terminationGracePeriodSeconds
   2. all still-running pods protected by PDBs that are preventing eviction, indicating which PDBs are blocking.
We have a choice about whether to report pods with long terminationGracePeriodSeconds under StuckNode or under a different condition, since we know they will eventually be cleaned up.
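A minimal sketch of the first step, using the same MCO annotation check as the PR's code. This is not the PR's implementation: stuckDrainingNodes, drainFirstObserved, and stuckDrainThreshold are hypothetical names, and a real controller would persist the first-observed time (for example in a condition's lastTransitionTime) rather than in process memory.

package stucknode

import (
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// stuckDrainThreshold is an illustrative value, not taken from the PR.
const stuckDrainThreshold = 30 * time.Minute

// drainFirstObserved remembers when each node was first seen draining.
var drainFirstObserved = map[string]time.Time{}

// stuckDrainingNodes returns the nodes that have been observed draining for
// longer than the threshold.
func stuckDrainingNodes(nodes []corev1.Node, now time.Time) []corev1.Node {
	var stuck []corev1.Node
	for i := range nodes {
		node := nodes[i]
		if !strings.HasSuffix(node.Annotations["machineconfiguration.openshift.io/desiredDrain"], "drain") {
			delete(drainFirstObserved, node.Name) // not draining: forget any prior observation
			continue
		}
		first, seen := drainFirstObserved[node.Name]
		if !seen {
			drainFirstObserved[node.Name] = now
			continue
		}
		if now.Sub(first) > stuckDrainThreshold {
			stuck = append(stuck, node)
		}
	}
	return stuck
}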
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #514      +/-   ##
==========================================
- Coverage   54.22%   52.92%   -1.31%
==========================================
  Files         123      124       +1
  Lines        6124     6275     +151
==========================================
  Hits         3321     3321
- Misses       2599     2750     +151
  Partials      204      204
@deads2k: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
	continue
}
// the MCO uses this annotation with a suffix of drain to indicate draining.
if !strings.HasSuffix(node.Annotations["machineconfiguration.openshift.io/desiredDrain"], "drain") {
Checking in case this is too new for our management clusters:
$ for Y in 10 11; do git --no-pager grep 'machineconfiguration.openshift.io/desiredDrain' "origin/release-4.${Y}"; done
origin/release-4.11:pkg/daemon/constants/constants.go: DesiredDrainerAnnotationKey = "machineconfiguration.openshift.io/desiredDrain"
So all 4.(y>=11) management clusters should have an MCO that sets this annotation, and our management clusters are all on 4.15 or 4.16 or newer. Should be fine as you have it; commenting in case other folks also wonder about this.
}
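A side note on the annotation check above: if one also wanted to distinguish "drain requested" from "drain finished", my understanding is that the MCO mirrors the desiredDrain value into a lastAppliedDrain annotation once the requested action completes. A hedged sketch under that assumption (nodeStillDraining is a hypothetical helper, not part of the PR; verify the annotation keys against the MCO constants before relying on this):

package stucknode

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// nodeStillDraining treats a drain as in progress when it has been requested
// (desiredDrain ends in "drain") but not yet acknowledged as applied.
func nodeStillDraining(node *corev1.Node) bool {
	desired := node.Annotations["machineconfiguration.openshift.io/desiredDrain"]
	lastApplied := node.Annotations["machineconfiguration.openshift.io/lastAppliedDrain"]
	return strings.HasSuffix(desired, "drain") && desired != lastApplied
}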
// TODO option on caching, might not matter
allPDBsInNamespace, err := kubeClient.PolicyV1().PodDisruptionBudgets(pod.Namespace).List(ctx, metav1.ListOptions{})
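For the second report in the description (pods protected by PDBs that prevent eviction), here is a minimal sketch of how the PDBs listed above might be matched against a pod. This is not the PR's implementation; pdbsBlockingEviction is a hypothetical helper, and it treats a PDB as blocking when its selector matches the pod's labels and status.disruptionsAllowed is zero (the state in which the eviction API refuses to evict covered pods).

package stucknode

import (
	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// pdbsBlockingEviction returns the names of PDBs that select the pod and
// currently allow no disruptions.
func pdbsBlockingEviction(pod *corev1.Pod, pdbs []policyv1.PodDisruptionBudget) []string {
	var blocking []string
	for _, pdb := range pdbs {
		selector, err := metav1.LabelSelectorAsSelector(pdb.Spec.Selector)
		if err != nil {
			continue // malformed selector; skip rather than fail the whole check
		}
		if !selector.Matches(labels.Set(pod.Labels)) {
			continue
		}
		if pdb.Status.DisruptionsAllowed == 0 {
			blocking = append(blocking, pdb.Name)
		}
	}
	return blocking
}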
There's likely some overlap between this kind of check and what the managed-upgrade operator is already doing, but I'll leave it up to actual MUO maintainers to decide if/how to integrate this with their existing code.
+1 We have something similar as part of the nodekeeper controller logic. I am wondering if we can merge these.
Also, we disabled these drain strategies for management clusters so we could identify whether there is any issue with the workloads we are running.
// +patchStrategy=merge
// +listType=map
// +listMapKey=type
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type" protobuf:"bytes,1,rep,name=conditions"`
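For the conditions field above, a minimal sketch of how a StuckNode condition could be recorded using the standard apimachinery helper. The type, reason, and message strings are illustrative only, not taken from the PR, and setStuckNodeCondition is a hypothetical helper.

package stucknode

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setStuckNodeCondition records a StuckNode condition on a status object's
// conditions slice, updating lastTransitionTime only when the status changes.
func setStuckNodeCondition(conditions *[]metav1.Condition, nodeName, detail string, generation int64) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "StuckNode",
		Status:             metav1.ConditionTrue,
		Reason:             "NodeDrainTooLong", // hypothetical reason string
		Message:            "node " + nodeName + " appears stuck draining: " + detail,
		ObservedGeneration: generation,
	})
}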
Thanks for the PR @deads2k.
Currently one of the PR checks is failing because the CRD was not regenerated after this change. We use controller-gen to generate the CRD:
make container-generate
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: a7vicky, deads2k
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Assuming we agree on the utility of the change, things left to do: