
Better compatibility with in-place upgrades #491

Conversation

israel-hdez

What this PR does / why we need it:

For in-place upgrades, the manifests of the new version are applied to a cluster where KServe may already be installed. The old version of kserve-controller is terminated and the new version is deployed.

The default Deployment strategy is `RollingUpdate`. As a consequence, during an in-place upgrade two different versions of kserve-controller run in parallel. Most importantly, the two instances run with different configs, which reference different versions of kserve-agent, kserve-router, storage-initializer, etc. During an in-place upgrade, the new version of the controller upgrades the deployed models to use the new image versions. However, since the old version of kserve-controller is still running in parallel, it rolls those changes back. The two instances conflict with each other, changing Deployments/KSVCs in rapid succession and causing the cluster to spawn pods uncontrollably.

Eventually, the cluster terminates the old version of kserve-controller, but the rapid updates to Deployments can exhaust cluster resources in the meantime, leading to instability.

By changing the `strategy` to `Recreate`, the cluster ensures during an in-place upgrade that the old version of kserve-controller is fully terminated before the new version starts. This prevents the instability.
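Concretely, the change is a small addition to the controller's Deployment spec. A minimal sketch of what it looks like, assuming the kserve-controller-manager Deployment (`apps/v1` field names):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kserve-controller-manager
spec:
  strategy:
    type: Recreate  # fully terminate the old pod before starting the new one
```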

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Fixes https://issues.redhat.com/browse/RHOAIENG-18977

Type of changes
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing:

Manual validation: with kserve-controller installed, scale odh-operator down to zero. Then, modify the kserve-controller Deployment to use `strategy: Recreate`. Although the image remains unchanged, the new strategy makes the cluster re-deploy the kserve-controller pod. Observe that, this time, the old pod is removed first and only afterwards is the new pod started. A hypothetical patch for this step is sketched below.
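For reference, the manual change above can be made with a strategic-merge patch; a hypothetical invocation (the Deployment name and namespace are assumptions):

```yaml
# recreate-patch.yaml; apply with, for example:
#   kubectl -n kserve patch deployment kserve-controller-manager \
#     --patch-file recreate-patch.yaml
spec:
  strategy:
    type: Recreate
    rollingUpdate: null  # explicit null clears the server-defaulted field
```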

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?


openshift-ci bot commented Feb 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: israel-hdez


openshift-ci bot added the approved label Feb 7, 2025
@israel-hdez (Author)

Sent to upstream: kserve#4234

```diff
@@ -12,6 +12,9 @@ spec:
     matchLabels:
       control-plane: kserve-controller-manager
       controller-tools.k8s.io: "1.0"
+  strategy:
+    type: Recreate
+    rollingUpdate: nil
```
Member

Is this nil field needed?
I guess there is no difference between it being nil and not set.

Author

I tried without it, and odh-operator gives an error during upgrade.
It tries to apply the manifests, but the old Deployment in the cluster has defaulted fields. So, it rejects the update, because for `Recreate` you need to omit `rollingUpdate`.

I found that by setting it to nil, the upgrade would succeed.
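For context: in a Kubernetes strategic-merge patch, an explicit `null` deletes a field that the API server has already defaulted, whereas a create/apply must simply omit the field (the API rejects `rollingUpdate` parameters when `type` is `Recreate`). A hedged illustration, not the exact manifests from this PR:

```yaml
# PATCH body: explicit null removes the defaulted rollingUpdate block.
spec:
  strategy:
    type: Recreate
    rollingUpdate: null
---
# CREATE manifest: the field is simply left out.
spec:
  strategy:
    type: Recreate
```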

Member

Hum, but with nil, it seems we cannot even deploy kserve-controller any more.


What error does the odh operator throw?
Does it work with kubectl?

Author

Indeed, I'm having issues with the upstream PR. Their CI is using kubectl, and it doesn't like the nil.

I remember the operator is doing a PATCH request, which seems to work OK with the nil, but I understand it doesn't work with the CREATE one.

I'll check if it would work with `{}`.

lburgazzoli commented Feb 10, 2025

@israel-hdez somewhat related: shouldn't enabling leader election solve the original problem?

Author

@lburgazzoli @zdtsw Let me try the leader election. This was also suggested in the upstream PR.

Member

Maybe try to remove the `rollingUpdate: nil` first with a PR,
so the operator won't get blocked;
then you can continue with leader election? :D

Author

@zdtsw all done in one PR: #492


mholder6 commented Feb 7, 2025

/lgtm

openshift-ci bot added the lgtm label Feb 7, 2025
openshift-merge-bot bot merged commit 2a917ea into opendatahub-io:release-v0.14 Feb 7, 2025
20 checks passed
israel-hdez added a commit to israel-hdez/kserve that referenced this pull request Feb 10, 2025
openshift-merge-bot bot pushed a commit that referenced this pull request Feb 11, 2025
…492)

* Revert "Better compatibility with in-place upgrades (#491)"

This reverts commit 2a917ea.

* Use leader election for better compatibility with in-place upgrades

The previous commit reverts 2a917e, which changed the configuration of the manager's Deployment to use the `Recreate` strategy. That change was meant to improve compatibility with in-place upgrades (see the commit message of the reverted commit).

Instead of changing the Deployment strategy, this commit enables leader election in the manager. Leader election also solves the issues described in commit 2a917e. Also, not changing the Deployment strategy works better with odh-operator.

Signed-off-by: Edgar Hernández <[email protected]>

---------

Signed-off-by: Edgar Hernández <[email protected]>
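For readers unfamiliar with the mechanism: with leader election enabled, only one controller replica actively reconciles at any time, so two versions briefly running side by side during an upgrade no longer fight over the same Deployments/KSVCs. A hypothetical sketch of how such an option is commonly wired on a kubebuilder-style manager (the exact flag used in #492 may differ):

```yaml
# Hypothetical Deployment fragment, not the actual #492 change:
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --leader-elect  # kubebuilder convention: only the elected
                              # leader runs the reconcile loops
```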