
Better compatibility with in-place upgrades #491

Conversation

israel-hdez

What this PR does / why we need it:

For in-place upgrades, the manifests of the new version are applied to a cluster where KServe may already be installed. The old version of kserve-controller is terminated and the new version is deployed.

The default Deployment strategy is `RollingUpdate`. As a consequence, during an in-place upgrade two different versions of kserve-controller run in parallel. Most importantly, the two instances run with different configs, which reference different versions of kserve-agent, kserve-router, storage-initializer, etc. During an in-place upgrade, the new version of the controller upgrades the deployed models to use the new image versions. However, since the old version of kserve-controller is still running in parallel, it rolls those changes back. The two instances conflict with each other, changing Deployments/KSVCs in rapid succession and causing the cluster to spawn pods uncontrollably.

Eventually, the cluster terminates the old version of kserve-controller, but the rapid updates to Deployments can exhaust cluster resources in the meantime, leading to instability.

By changing the `strategy` to `Recreate`, the cluster ensures during an in-place upgrade that the old version of kserve-controller is fully terminated before the new version starts. This prevents the instability.
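Concretely, the change is a small addition to the controller's Deployment spec. A minimal sketch of what it looks like, assuming the kserve-controller-manager Deployment (`apps/v1` field names):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kserve-controller-manager
spec:
  strategy:
    type: Recreate  # fully terminate the old pod before starting the new one
```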

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Fixes https://issues.redhat.com/browse/RHOAIENG-18977

Type of changes
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing:

Manual validation: with kserve-controller installed, scale odh-operator down to zero. Then, modify the kserve-controller Deployment to use `strategy: Recreate`. Although the image remains unchanged, the new strategy makes the cluster re-deploy the kserve-controller pod. Observe that, this time, the old pod is removed first and only afterwards is the new pod started. A hypothetical patch for this step is sketched below.
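For reference, the manual change above can be made with a strategic-merge patch; a hypothetical invocation (the Deployment name and namespace are assumptions):

```yaml
# recreate-patch.yaml; apply with, for example:
#   kubectl -n kserve patch deployment kserve-controller-manager \
#     --patch-file recreate-patch.yaml
spec:
  strategy:
    type: Recreate
    rollingUpdate: null  # explicit null clears the server-defaulted field
```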

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?


openshift-ci bot commented Feb 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: israel-hdez


openshift-ci bot added the approved label Feb 7, 2025
@israel-hdez (Author)

Sent to upstream: kserve#4234

```diff
@@ -12,6 +12,9 @@ spec:
     matchLabels:
       control-plane: kserve-controller-manager
       controller-tools.k8s.io: "1.0"
+  strategy:
+    type: Recreate
+    rollingUpdate: nil
```
Member

Is this nil field needed?
I guess there is no difference between it being nil and not set.

Author

I tried without it, and odh-operator gives an error during upgrade.
It tries to apply the manifests, but the old Deployment in the cluster has defaulted fields. So, it rejects the update, because for `Recreate` you need to omit `rollingUpdate`.

I found that by setting it to nil, the upgrade would succeed.
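For context: in a Kubernetes strategic-merge patch, an explicit `null` deletes a field that the API server has already defaulted, whereas a create/apply must simply omit the field (the API rejects `rollingUpdate` parameters when `type` is `Recreate`). A hedged illustration, not the exact manifests from this PR:

```yaml
# PATCH body: explicit null removes the defaulted rollingUpdate block.
spec:
  strategy:
    type: Recreate
    rollingUpdate: null
---
# CREATE manifest: the field is simply left out.
spec:
  strategy:
    type: Recreate
```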

Member

Hum, but with nil, it seems we cannot even deploy kserve-controller any more.


What error does the odh operator throw?
Does it work with kubectl?

Author

Indeed, I'm having issues with the upstream PR. Their CI is using kubectl, and it doesn't like the nil.

I remember the operator is doing a PATCH request, which seems to work OK with the nil, but I understand it doesn't work with the CREATE one.

I'll check if it would work with `{}`.

lburgazzoli commented Feb 10, 2025

@israel-hdez somewhat related: shouldn't enabling leader election solve the original problem?

Author

@lburgazzoli @zdtsw Let me try the leader election. This was also suggested in the upstream PR.

Member

Maybe try to remove the `rollingUpdate: nil` first with a PR,
so the operator won't get blocked;
then you can continue with leader election? :D

Author

@zdtsw all done in one PR: #492


mholder6 commented Feb 7, 2025

/lgtm

openshift-ci bot added the lgtm label Feb 7, 2025
openshift-merge-bot bot merged commit 2a917ea into opendatahub-io:release-v0.14 Feb 7, 2025
20 checks passed
israel-hdez added a commit to israel-hdez/kserve that referenced this pull request Feb 10, 2025
openshift-merge-bot bot pushed a commit that referenced this pull request Feb 11, 2025
…492)

* Revert "Better compatibility with in-place upgrades (#491)"

This reverts commit 2a917ea.

* Use leader election for better compatibility with in-place upgrades

The previous commit reverts 2a917e, which changed the configuration of the manager's Deployment to use the `Recreate` strategy. That change was meant to improve compatibility with in-place upgrades (see the commit message of the reverted commit).

Instead of changing the Deployment strategy, this commit enables leader election in the manager. Leader election also solves the issues described in commit 2a917e. Also, not changing the Deployment strategy works better with odh-operator.

Signed-off-by: Edgar Hernández <[email protected]>

---------

Signed-off-by: Edgar Hernández <[email protected]>
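For readers unfamiliar with the mechanism: with leader election enabled, only one controller replica actively reconciles at any time, so two versions briefly running side by side during an upgrade no longer fight over the same Deployments/KSVCs. A hypothetical sketch of how such an option is commonly wired on a kubebuilder-style manager (the exact flag used in #492 may differ):

```yaml
# Hypothetical Deployment fragment, not the actual #492 change:
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --leader-elect  # kubebuilder convention: only the elected
                              # leader runs the reconcile loops
```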