Use leader election for better compatibility with in-place upgrades #492


israel-hdez

What this PR does / why we need it:

Following up with comments here: #491 (comment)

This PR reverts commit 2a917ea (PR #491), which changed the manager's Deployment to use the Recreate strategy. That change was meant to improve compatibility with in-place upgrades (see the description of PR #491).

Instead of changing the Deployment strategy, this PR enables leader election in the manager. Leader election also solves the issues mentioned in PR #491, and leaving the Deployment strategy unchanged works better with odh-operator.

Which issue(s) this PR fixes
Fixes https://issues.redhat.com/browse/RHOAIENG-18977

Type of changes
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing:

Test as in PR #491. In addition, when the kserve-controller Deployment is duplicated, the logs should show the second deployment waiting until it can acquire the lease.

Special notes for your reviewer:

On the very first upgrade, despite the updated configuration, the new version of the manager will still run in parallel with the old version. This is expected, because the older version does not have leader election enabled. This should be OK, as InferenceGraphs are not yet promoted as supported in ODH. Once this change is released, subsequent ODH upgrades should show the expected behavior: the new version does not fully boot until it acquires the lease and becomes the leader. This is why the PR is tested by duplicating the deployment.
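The difference between the first upgrade and later ones can be sketched as follows. This is an illustrative model only: `Manager` and `ActiveManagers` are hypothetical names, and the sketch assumes that the first lease-aware manager in the slice is the one holding the lease.

```go
package main

import "fmt"

// Manager is a hypothetical model of one controller manager replica.
type Manager struct {
	Name           string
	LeaderElection bool // false models the older release, which ignores the lease
}

// ActiveManagers returns the names of the managers that would reconcile at the
// same time. Managers without leader election always run. Among those that use
// it, only the first in the slice (assumed to hold the lease) runs.
func ActiveManagers(managers []Manager) []string {
	var active []string
	leaderChosen := false
	for _, m := range managers {
		switch {
		case !m.LeaderElection:
			active = append(active, m.Name)
		case !leaderChosen:
			leaderChosen = true
			active = append(active, m.Name)
		}
	}
	return active
}

func main() {
	// First upgrade: the old release has no leader election, so both run.
	fmt.Println(ActiveManagers([]Manager{
		{Name: "old", LeaderElection: false},
		{Name: "new", LeaderElection: true},
	}))
	// Later upgrades: both releases use leader election, so only one runs.
	fmt.Println(ActiveManagers([]Manager{
		{Name: "old", LeaderElection: true},
		{Name: "new", LeaderElection: true},
	}))
}
```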

Checklist:

  • [N/A] Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • [N/A] Has code been commented, particularly in hard-to-understand areas?
  • [N/A] Have you made corresponding changes to the documentation?

The previous commit reverts 2a917e, which changed the manager's Deployment to use the `Recreate` strategy. That change was meant to improve compatibility with in-place upgrades (see the commit message of the reverted commit).

Instead of changing the Deployment strategy, this commit enables leader election in the manager. Leader election also solves the issues mentioned in commit 2a917e, and leaving the Deployment strategy unchanged works better with odh-operator.

Signed-off-by: Edgar Hernández <[email protected]>
@lburgazzoli

lburgazzoli commented Feb 10, 2025

I'm not familiar with all the details of kserve, but looks good to me

@danielezonca

@israel-hdez
Do you plan to move this change/fix to upstream?
As far as I see it is not ODH specific


openshift-ci bot commented Feb 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielezonca, israel-hdez, spolti, zdtsw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [danielezonca,israel-hdez,spolti]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@spolti
Member

spolti commented Feb 11, 2025

> @israel-hdez
> Do you plan to move this change/fix to upstream?
> As far as I see it is not ODH specific

There is a PR upstream.

@spolti
Member

spolti commented Feb 11, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Feb 11, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 5bfa8ea into opendatahub-io:release-v0.14 Feb 11, 2025
20 checks passed
@israel-hdez
Author

> @israel-hdez
> Do you plan to move this change/fix to upstream?
> As far as I see it is not ODH specific

@danielezonca Yes, it is here: kserve#4234

@israel-hdez israel-hdez deleted the j18977-leader-election branch February 11, 2025 16:38