Reduce network interruption of quorum migration #1406

Draft
wants to merge 1 commit into
base: stackhpc/2023.1

Conversation

priteau (Member) commented Dec 5, 2024

We can reduce the potential network connectivity interruption caused by quorum migration by stopping Keystone and Neutron last and starting them first, at the expense of longer API downtime (because each kayobe invocation first generates configuration).
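
To illustrate the ordering described above, here is a minimal sketch in Python. It is not the code in this change: the service names, the `NETWORK_CRITICAL` tuple, and the `split` flag are all hypothetical and only model the idea of stopping Keystone and Neutron last and starting them first.

```python
# Illustrative only: names and the `split` flag are assumptions,
# not part of kayobe or of this pull request.
NETWORK_CRITICAL = ("keystone", "neutron")


def stop_order(services, split=True):
    """Return the order in which services would be stopped.

    With split=True, Keystone and Neutron are moved to the end so
    they stay up for as long as possible.
    """
    if not split:
        return list(services)
    others = [s for s in services if s not in NETWORK_CRITICAL]
    critical = [s for s in NETWORK_CRITICAL if s in services]
    return others + critical


def start_order(services, split=True):
    """Return the order in which services would be started.

    With split=True, Keystone and Neutron are moved to the front so
    they come back before everything else.
    """
    if not split:
        return list(services)
    critical = [s for s in NETWORK_CRITICAL if s in services]
    others = [s for s in services if s not in NETWORK_CRITICAL]
    return critical + others


if __name__ == "__main__":
    example = ["nova", "keystone", "cinder", "neutron"]
    print("stop: ", stop_order(example))
    print("start:", start_order(example))
```

In this sketch, setting `split=False` keeps the original order, which models running everything in a single pass with a shorter overall API downtime.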

priteau self-assigned this Dec 5, 2024

MoteHue (Contributor) commented Jan 6, 2025

Any reason this is still a draft @priteau? It looks very helpful for OVS-based systems, would be good to get it in :)

priteau (Member, Author) commented Jan 6, 2025

I set it to draft because I wanted to hear thoughts on the approach. The code change itself is ready.

MoteHue (Contributor) commented Jan 9, 2025

I think this approach is fine, although it does mean there will be a longer downtime on the non-critical services.
Perhaps we can just split the tasks when OVS is in use?

Alex-Welsh (Member) commented

@MoteHue didn't you try this recently? Can we merge it now?

MoteHue (Contributor) commented Mar 3, 2025

> @MoteHue didn't you try this recently? Can we merge it now?

I haven't tried this change personally. I'd prefer that this split be controlled by a flag, or by looking up whether OVS is in use. Otherwise we're just extending the outage window on OVN systems with no benefit.

Alex-Welsh (Member) commented

@grzegorzkoper You have an upcoming OVS upgrade, right? Would you mind testing this change when you do that?

grzegorzkoper (Contributor) commented Mar 3, 2025

@Alex-Welsh: I can discuss it with our client; if they don't mind, I can run it like that.

grzegorzkoper (Contributor) commented

> @Alex-Welsh: I can discuss it with our client; if they don't mind, I can run it like that.

They prefer shorter networking downtime vs API downtime. I can test it next week.

assumptionsandg (Contributor) commented

Potentially unrelated, but this is the approach we used for the quorum migration for OVS at Cambridge. https://gitlab.developers.cam.ac.uk/rcs/platforms/cloud-services/arcus-kayobe-config/-/merge_requests/537/diffs#699dcf399709f4a7d55ccaeed5717da26b459bb8

grzegorzkoper (Contributor) commented Mar 14, 2025

> @grzegorzkoper You have an upcoming OVS upgrade, right? Would you mind testing this change when you do that?

Tested, worked like a charm.
We did hit an issue while upgrading RMQ: API downtime was long enough to trigger DHCP issues (leases expired, and no new leases were issued since services were down). It was nice to be able to bring Keystone and Neutron back faster and resolve them.

👍

5 participants