Better compatibility with in-place upgrades #491
Conversation
For in-place upgrades, the manifests of the new version are applied in a cluster where KServe may already be installed. The old version of the kserve-controller is terminated and the new version is deployed.

The default Deployment strategy is RollingUpdate. As a consequence, during an in-place upgrade there will be two different versions of kserve-controller running in parallel. Most importantly, the two instances will be running with different configs, which include references to different versions of kserve-agent, kserve-router, storage-initializer, etc. During an in-place upgrade, the new version of the controller will upgrade the deployed models to use the new versions of the images. However, since the old version of kserve-controller is running in parallel, it will roll those changes back. The two instances will conflict with each other, causing Deployments/KSVCs to change very quickly, which leads to the cluster spawning pods without control.

Eventually, the cluster will terminate the old version of kserve-controller. However, the very rapid updates to Deployments can exhaust the cluster's resources, leading to instability.

By changing the `strategy` to `Recreate`, during an in-place upgrade the cluster will make sure that the old version of kserve-controller is fully terminated before starting the new version. This prevents cluster instability.

Signed-off-by: Edgar Hernández <[email protected]>
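For reference, the change boils down to the following stanza in the controller's Deployment manifest (a minimal sketch; the namespace, labels, and image below are placeholders, not the exact values in the KServe manifests):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kserve-controller-manager
  namespace: kserve  # placeholder namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      control-plane: kserve-controller-manager
  # Recreate terminates the old pod completely before the new one starts,
  # so two controller versions never run (and fight) in parallel.
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        control-plane: kserve-controller-manager
    spec:
      containers:
        - name: manager
          image: kserve/kserve-controller:latest  # placeholder image
```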
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: israel-hdez

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.

Sent to upstream: kserve#4234
```diff
@@ -12,6 +12,9 @@ spec:
     matchLabels:
       control-plane: kserve-controller-manager
       controller-tools.k8s.io: "1.0"
+  strategy:
+    type: Recreate
+    rollingUpdate: nil
```
is this nil field needed?
I guess there is no difference between being nil or not set.
I tried without it, and odh-operator gives an error during upgrade. It tries to apply the manifests, but the old Deployment in the cluster has the defaulted fields set. So, the update is rejected because, for `Recreate`, you need to omit `rollingUpdate`.

I found that by setting it to nil, the upgrade would succeed.
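To illustrate what's going on (a sketch, with assumed file and resource names): the live Deployment already has `rollingUpdate` filled in by API-server defaulting, so the upgrade has to actively clear that field. In merge-patch semantics, a `null` value means "delete this field":

```yaml
# Hypothetical patch file (recreate-patch.yaml): null deletes the
# defaulted rollingUpdate block from the live object, which is required
# because rollingUpdate may not be set when the type is Recreate.
spec:
  strategy:
    type: Recreate
    rollingUpdate: null
```

Applied with something like `kubectl patch deployment kserve-controller-manager -n kserve --patch-file recreate-patch.yaml` (names assumed).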
hum, but with nil, it seems we cannot even deploy kserve-controller any more.
what error does the odh operator throw?
does it work with kubectl?
Indeed, I'm having issues with the upstream PR. Their CI is using kubectl, and it doesn't like the nil.

I remember the operator is doing a PATCH request, which seems to work OK with the nil, but I understand it doesn't work with the CREATE one.

I'll check if it would work with `{}`.
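One possible explanation for the asymmetry (not verified against the operator's code): in a merge PATCH, `null` has the special meaning "remove this field", so the patch succeeds, whereas on CREATE the value is decoded as part of the object itself. Note also that in YAML, `nil` is an ordinary string — YAML's null literal is `null` or `~` — so a created manifest containing `rollingUpdate: nil` would be expected to fail schema decoding regardless:

```yaml
# YAML null spellings vs. the plain string "nil":
strategy:
  type: Recreate
  rollingUpdate: nil     # parses as the STRING "nil" -> decode error on create
  # rollingUpdate: null  # actual YAML null; in a merge PATCH this deletes the field
  # rollingUpdate: ~     # same as null
```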
@israel-hdez somewhat related, but shouldn't enabling leader election solve the original problem?
@israel-hdez ^ :D
@lburgazzoli @zdtsw Let me try the leader election. This was also suggested in the upstream PR.
maybe try to remove the `rollingUpdate: nil` first with a PR
so the operator won't get blocked
then you can continue with leader election? :D
/lgtm
Merged commit 2a917ea into opendatahub-io:release-v0.14
Revert "Better compatibility with in-place upgrades (#491)" (#492)

* Revert "Better compatibility with in-place upgrades (#491)"

This reverts commit 2a917ea.

* Use leader election for better compatibility with in-place upgrades

The previous commit reverts 2a917e, which changed the configuration of the Deployment for the manager to use the `Recreate` strategy. That change was for better compatibility with in-place upgrades (see the commit message of the reverted one).

Instead of changing the Deployment strategy, this commit enables leader election in the manager. The leader election also solves the issues mentioned in commit 2a917e. Also, not changing the Deployment strategy works better with odh-operator.

Signed-off-by: Edgar Hernández <[email protected]>
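For context, in kubebuilder-style managers leader election is usually switched on through a manager flag wired into the Deployment's container args — something like the sketch below (the exact flag name and container layout in kserve-controller may differ):

```yaml
# Hypothetical fragment of the manager Deployment: with leader election
# enabled, a newly started controller blocks on the leader lease until the
# old instance releases it, so old and new versions never reconcile at once.
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --leader-elect  # assumed flag; kubebuilder scaffolds expose this
```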
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):

Fixes https://issues.redhat.com/browse/RHOAIENG-18977
Feature/Issue validation/testing:

Manual validation: With kserve-controller installed, scale odh-operator down to zero. Then, modify the kserve-controller Deployment to use `strategy: Recreate`. Although the image remains unchanged, the new strategy will lead the cluster to re-deploy the kserve-controller pod. Observe that, this time, the old pod is removed first and the new pod is started afterwards.