-
Notifications
You must be signed in to change notification settings - Fork 167
QE/DNM; Gate ovn-controller start until ovnkube controller syncs #2626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: martinkennelly The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
04700a1 to
e7c07a1
Compare
c7d93cc to
8628540
Compare
|
/retest |
|
/test ? |
|
@pperiyasamy: The following commands are available to trigger required jobs: The following commands are available to trigger optional jobs: Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/testwith openshift/ovn-kubernetes/master/e2e-aws-ovn openshift/machine-config-operator#5123 |
|
/testwith openshift/ovn-kubernetes/master/e2e-aws-ovn-upgrade openshift/machine-config-operator#5123 |
8628540 to
da7f6f0
Compare
|
/testwith openshift/ovn-kubernetes/master/e2e-aws-ovn-upgrade openshift/machine-config-operator#5123 |
|
/testwith openshift/ovn-kubernetes/master/e2e-aws-ovn openshift/machine-config-operator#5123 |
da7f6f0 to
af5fdfd
Compare
|
/testwith openshift/ovn-kubernetes/master/e2e-aws-ovn openshift/cluster-network-operator#2722 |
af5fdfd to
81d769d
Compare
|
/test qe-perfscale-aws-ovn-small-cluster-density |
|
@martinkennelly: The specified target(s) for The following commands are available to trigger optional jobs: Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/test perfscale-aws-ovn-small-cluster-density-v2 |
|
/test perfscale-aws-ovn-small-node-density-cni |
|
/testwith openshift/ovn-kubernetes/master/e2e-aws-ovn openshift/cluster-network-operator#2722 |
1 similar comment
|
/testwith openshift/ovn-kubernetes/master/e2e-aws-ovn openshift/cluster-network-operator#2722 |
Initial implementations erroneously assumed a CIDR for NATs logicalIP. Also, eip controller expects all OVN constructs that support EIP to have this metadata so if we cannot build this metadata then add dummy data so its cleaned up later by EIP controller. This was not caught by unit tests because the unit test also contained the assumption of only logical IP with no mask. It was not caught by upstream CI because we have no reboot tests. Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit 7588fd3)
The startup syncer was removing OVN constructs due to logic bugs introduced when EIP code was refactored for UDN. The are added again when eip controller syncs but this causes interruption. 1. Due to poor naming, enforcement of types and programmer error we were mixing up variables between a pod IP and an EIP IP. See: nodeName, ok := cache.egressIPIPToNodeCache[parsedLogicalIP.String()] parsedLogicalIP is a pod IP and not an EIP IP. 2. When iterating over the existing config for an EIP, we should delete config for LRPs where an EIP doesn't exist. 3. Remove LRPs when a network isnt found Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit 68db55e)
…readability No func changes. Check if obj is nil post parsing IP. Improve logging of stale OVN config. Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit 41a9151)
Removes config for deleted nodes/pods while controller was down and ensures ovn config is removed while preserving valid config. Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit fba4233d08052bb4753a6c15f496369a8f47b577)
For EIP, when a Node is assigned an EIP and if the EIP moves when ovnkube-controller is down and SB DB contains stale SNAT entries, then when ovn-controller starts again, it will GARP. Ideally, ovn would have support for suppressing GARPs for LRPs, but that is not available. I took a more general solution to fix the issue. For IC, gate ovn-controller until SB DB is not stale. The previous implementation of ovnkube-controller sync+startup depended on ovn-controller running. I removed this dependency. ovnkube controller writes a file when startup is complete and changes flow to SB DB. External entities can predicate on this file to determine if the ovnkube controller has sync. ovnkube controller will remove this file on exit. ovn-controller start is now gated on the presence of this file for IC. We do not support multi Node per Zone therefore ovnkube controller is co-located on the same name as ovn-controller. If theres a pod delete and recreate, ovn-controller will now not start until SB DB is not stale. Before, we accepted this and got incremental updates during the ovnkube controllers sync+start. if ovn-controller container restarts and ovnkube controller did not, ovn-controller starts because the SB DB hot file exists. if ovnkube-controller restarts and ovn-controller did not, then ovn-controller will continue running. Signed-off-by: Martin Kennelly <[email protected]>
81d769d to
191fd17
Compare
|
@martinkennelly: The specified target(s) for The following commands are available to trigger optional jobs: Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@martinkennelly: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
closed in favor of #2664 |
For IC mode, after a Node reboot, SB DB may contain stale control plane data because ovnkube-controller has not config'd OVN and changes propagated to SB DB. The issue we see downstream is that when an egress Node which hosts an EgressIP IP, and when it reboots, ovn-controller will connect to SB DB which will contain stale SNAT to support EgressIP. ovn-controller will GARP for the EgressIP IP even though the EgressIP IP may have already been assigned to another Node.
Therefore, gate ovn-controller until ovnkube controller has syncd and the changes have reached SB DB for IC mode only.
(cherry picked from commit 4a63750e4fce2f17597e665ddd2c7c2a73c3f82f)
📑 Description
Fixes #
Additional Information for reviewers
✅ Checks
How to verify it