
Two nodes openshift #1675

Open · wants to merge 57 commits into master

Conversation

@mshitrit (Contributor) commented Sep 5, 2024

This enhancement describes a high-level design concept for building a new two-node OpenShift.

For additional info/discussion:

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 5, 2024
openshift-ci bot commented Sep 5, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@carbonin (Member):

I think this will require some changes to InstallConfig eventually.

It would be good to at least list out the additional information that will be needed (fencing info at least) if not propose the new API.

@mshitrit mshitrit marked this pull request as ready for review September 19, 2024 14:20

#### Cluster Creator Role:
* The Cluster Creator will automatically install the 2NO (using an installer); the installation process will include the following steps:
  * Deploys a two-node OpenShift cluster
Contributor:

Nit: Is it a 2NO if the steps after this have not been completed? I feel like this line greatly simplifies the workflow here, and maybe we should go into more detail about what this actually means.

Contributor:

Should be clearer now.
I think the options are that a cluster is 2NO when:

  • 2 nodes, and
  • the proposed externallyManagedEtcd field is true, and
  • (optional?) a new feature gate is enabled

Contributor:

+1 to using a feature gate, but that only flags that the cluster is not upgradeable because it's using a tech preview or even dev preview feature. By the time it goes GA that will be promoted and the cluster will be upgradeable

Still useful to avoid changes breaking the main payload though

2. Etcd loses internal quorum (E-quorum) and goes read-only
3. Both sides retain C-quorum and initiate fencing of the other side. There is a different delay between the two nodes for executing the fencing operation, to avoid both fencing operations succeeding in parallel and thus shutting down the system completely.
4. One side wins, pre-configured as Node1
5. Pacemaker on Node1 forces E-quorum (etcd promotion event)
Contributor:

Were both members already voting members of the etcd cluster, or was one just a follower?

Contributor:

voting

Contributor:

When etcd loses quorum and you have two halves, both are in read-only states.

Have you spoken to the etcd folks/documented what is actually involved in this step?

It's been a while since I've had to do this, so things may have changed, but my understanding is that the only way to recover from this is to either reconnect them, or force a "new cluster" using the old state, which removes all members and starts the cluster again. On the other member you'd also need to remove state, have it come up as a fresh etcd, resync across, and then promote again.

"forcing quorum" makes this sound easy, but I think the document currently underplays exactly what's involved in this process

Contributor:

Extensive consultation with the etcd team :-)
The updated proposal should make things clearer, or do you think we need the exact commands?

Contributor:

Maybe not exact commands, but a quick delve into the process and risks of forcing etcd quorum might be useful context?

Contributor (Author):

IIUC the process of forcing e-quorum boils down to triggering the MemberPromote API.

@tjungblu please let me know if I got it right and if there are particular risks we should consider.

@JoelSpeed, you are right, it sounds easier than it is.
We are considering and discussing the cases and states together with the Etcd team.

> quick delve into the process

It is not quick 😄, but it mostly boils down to the two ways a node can go down:

  • gracefully
  • non-gracefully

In the first, the shutting-down node can remove itself from the member list, which lets the surviving node continue standalone without rebooting into a new cluster.

In the second, the surviving node must force a new cluster, instead.

The restarting node will decide what to do based on the status of the other node (running, starting, or stopped) and on data about Cluster ID and Revision from both nodes.

We haven't completed the full picture of the cases yet, but I wonder what level of detail we want to report in this enhancement.

Contributor:

Personally, I'd be keen to see the process documented fairly thoroughly, as it's rather key to the project working and being safe

Fair enough, I will add a PR with more details

updated 156c11a
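
For reference, here is a minimal sketch of what "forcing a new cluster" on the surviving member could look like at the etcd level in the non-graceful case. This is an editor's illustration under assumptions, not text from the EP or this thread: the pod name, image, and address placeholders are invented, and in TNF the equivalent configuration would be driven by the RHEL-HA resource agent rather than hand-written.

```yaml
# Sketch only: the surviving node restarts etcd once with --force-new-cluster,
# which rebuilds a single-member cluster from the existing data directory.
# Names, image, and addresses are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: etcd-force-new-cluster        # hypothetical name
  namespace: openshift-etcd
spec:
  hostNetwork: true
  containers:
  - name: etcd
    image: quay.io/example/etcd:placeholder    # placeholder image reference
    command:
    - etcd
    - --name=node1
    - --data-dir=/var/lib/etcd                 # reuse the existing data directory
    - --force-new-cluster                      # drop the old membership, keep the data
    - --advertise-client-urls=https://<node1_ip>:2379
    - --listen-client-urls=https://<node1_ip>:2379
    volumeMounts:
    - name: data
      mountPath: /var/lib/etcd
  volumes:
  - name: data
    hostPath:
      path: /var/lib/etcd
```

Once the peer returns, it would be wiped, re-added as a new member, resynced, and promoted, as described in the comment above.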

@jerpeter1 jerpeter1 assigned jerpeter1 and unassigned jerpeter1 Oct 1, 2024
* Minimize recovery-caused unavailability, e.g. by avoiding fencing loops, wherein each node power-cycles its peer after booting, reducing the cluster's availability.
* Recover the API server in less than 120s, as measured by the surviving node's detection of a failure
* Minimize any differences to existing OpenShift topologies
* Avoid any decisions that would prevent future implementation and support for upgrade/downgrade paths between two-node and traditional architectures
Contributor:

This means turning a two node cluster into a three node cluster? Are we sure this tallies with the rest of the doc? It seems to contradict the final non-goal

Contributor:

Yes - it's specifically calling out going from 2NO to HA-Compact.

This is because while support for that is a non-goal, PM has made it clear that there is a strong interest in this as a follow-on feature. Since we don't have requirements for this, we wanted to point out that while it's a secondary (or tertiary) goal, we'd like to avoid implementations that will create lots of tech debt if requirements firm up for us.

Contributor:

Standalone compact being a three node control plane with no workers?

What controlPlaneTopology is a three-node compact? And does that mean we then need to explore more about the issues with operators that are not compatible today with changing topology?

@jaypoulz (Contributor), Dec 13, 2024:

Yes, to standalone compact being 3 control plane nodes and 0 compute.
The topology used for anything with 3 or more control plane nodes is HighlyAvailable

I think we need to explore topology transitions as their own enhancement proposal.

Contributor:

Yep, agreed. For now, I think that as a product, and in this EP, we should assume transitions are not supported.

Fencing setup is the last important aspect of the cluster installation. For the cluster installation to be successful, fencing should be configured and active before we declare the installation successful. To do this, baseboard management controller (BMC) credentials need to be made available to the control-plane nodes as part of pacemaker initialization.
To ensure rapid fencing using pacemaker, we will collect Redfish details (address, username, and **password**) for each node via the install-config (see proposed install-config changes).
This will take a format similar to that of the [Baremetal Operator](https://docs.openshift.com/container-platform/4.17/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#bmc-addressing_ipi-install-installation-workflow).
We will create a new MachineConfig that writes BMC credentials to the control-plane disks. This will resemble the BMC specification used by the [BareMetalHost](https://docs.openshift.com/container-platform/4.17/rest_api/provisioning_apis/baremetalhost-metal3-io-v1alpha1.html#spec-bmc) CRD.
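
For illustration only, a minimal sketch of what such a MachineConfig could look like. The file path, field names, and encoding are assumptions by the editor, not part of this EP:

```yaml
# Sketch only: writes BMC details to the control-plane disks for pacemaker
# initialization; the path and payload format are placeholders.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-fencing-credentials      # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.4.0
    storage:
      files:
      - path: /etc/fencing/bmc-credentials.yaml   # assumed location
        mode: 0600
        contents:
          # base64 of e.g. "address: redfish://<bmc_ip>/redfish/v1/Systems/1", username, password
          source: data:text/plain;charset=utf-8;base64,<base64-encoded-credentials>
```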
Contributor:

Are these already written to disk on control plane nodes in similar baremetal clusters, or do they only exist today inside the cluster etcd for use by the BMO controllers?

Contributor:

The latter. BMO stores the URL and a link to a credentials secret in a BareMetalHost CRD.
For our own implementation, it is simpler to load these on disk, since we don't have anything to consume these in-cluster.

@JoelSpeed (Contributor), Dec 13, 2024:

Are there any security concerns tied to adding this new credentials file to disk? Does this increase our exposure?

@jaypoulz (Contributor), Dec 13, 2024:

This is something we need to do more work on. Writing credentials to disk is inherently risky. The observation I'd make is that even if you forced credentials to be loaded in through some kind of vault, the nodes would still have a credential they can use to get a credential - and a compromised system could be at risk.

I think what we need to do is figure out how to lock down these credentials so they're as secure as possible at rest on the system. We could use something like the GNOME Keyring to lock them up when at rest, but pacemaker already has a utility for caching the credentials securely. So then the question is, do we delete them after pacemaker has been initialized?

And how do we ensure there is still a way for a user to log in to the machines and update the credentials? These all need to be nailed down with specifics.

Member:

I think we want to rely on k8s's Secret management here rather than roll a new thing. MachineConfigs themselves are not encrypted at rest AFAIK.
Can the process that needs the credentials be deployed as a DaemonSet with a Secret volume mounted into it?

Contributor:

I'll look into the specifics of using a DaemonSet. The concern I have is: if the cluster enters a state where etcd quorum is lost, is that volume still mounted on disk? I think the answer is yes, since that shouldn't affect the scheduling of existing containers.

I think this would work for the initial setup. It would probably also be possible to handle secret updates with this, by allowing the user to update the secrets file when the cluster is healthy.

The biggest concern is when the configuration becomes invalid when the cluster is unhealthy. In this case, you need to run a reinitialization process directly on the nodes to update the secret. This may be fine, but I want to review with the SMEs to ensure that the situations where this happens and where a full recovery would be impossible without manual intervention are exactly 1 to 1.

Member:

Running workloads can keep running without an API server. Although if the Pod is dead at the same time as the API server goes down you are in trouble.

A static Pod would be better but you can't mount Secrets in them 🙁
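
A minimal sketch of the DaemonSet-plus-Secret alternative discussed above; the namespace, names, image, and mount path are assumptions, not part of the proposal:

```yaml
# Sketch only: a per-control-plane-node agent that consumes BMC credentials
# from a Secret volume instead of a file written by a MachineConfig.
apiVersion: v1
kind: Secret
metadata:
  name: fencing-credentials            # assumed name
  namespace: openshift-two-node        # assumed namespace
stringData:
  node1: |
    address: redfish://<bmc_ip>/redfish/v1/Systems/1
    username: <user>
    password: <password>
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: pacemaker-init                 # assumed name
  namespace: openshift-two-node
spec:
  selector:
    matchLabels:
      app: pacemaker-init
  template:
    metadata:
      labels:
        app: pacemaker-init
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: init
        image: quay.io/example/pacemaker-init:placeholder   # placeholder image
        volumeMounts:
        - name: creds
          mountPath: /etc/fencing
          readOnly: true
      volumes:
      - name: creds
        secret:
          secretName: fencing-credentials
```

As noted in the comments above, an already-running pod keeps its mounted Secret even if the API server is unavailable, but a pod that dies while the control plane is down cannot be rescheduled.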


A mechanism is needed for components of the cluster to understand that this is a two-node control-plane topology that may require different handling.
We will define a new value for the `TopologyMode` enum: `DualReplica`.
The enum is used for the `controlPlaneTopology` and `infrastructureTopology` fields, and the currently supported values are `HighlyAvailable`, `SingleReplica`, and `External`.
Contributor:

And HighlyAvailableArbiter. We probably want somewhere in this doc to mention any connection between DualReplica and HighlyAvailableArbiter

Contributor:

I have set aside a task to go through the arbiter EP and port over any relevant/related content. It is strangely absent from this document. 😅

@zaneb (Member), Dec 16, 2024:

Note that if you create a new topology then all OLM operators have to be updated to understand what to do with it.
For 2NO I think the appropriate infrastructureTopology is still Highly Available, because a regular HA cluster has a minimum of only 2 workers anyway. At least, that's the argument I made on the arbiter EP 🙂
Arguably there is a need for a separate controlPlaneTopology here, since up to now HA has always meant 3 control plane nodes, and some operators may be relying on that. Hopefully fewer operators are looking at controlPlaneTopology than would be looking at infrastructureTopology.

Contributor:

Yes, I think it's appropriate that we add a new controlPlaneTopology (which is what arbiter did), but I don't think the infrastructure topology needs to be changed; that makes sense, as you say, since things expect 2 schedulable workers.
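
For reference, a hedged sketch of how the `Infrastructure` status could look under this proposal; `DualReplica` is the proposed (not yet existing) value, and keeping `infrastructureTopology: HighlyAvailable` reflects the discussion above rather than a settled decision:

```yaml
# Sketch only: DualReplica is a proposed enum value for controlPlaneTopology.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  controlPlaneTopology: DualReplica        # proposed: exactly two control-plane nodes
  infrastructureTopology: HighlyAvailable  # unchanged: workloads still see two schedulable nodes
```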

Initially the creation of an etcd cluster will be driven in the same way as other platforms.
Once the cluster has two members, the etcd daemon will be removed from the static pod definition and recreated as a resource controlled by RHEL-HA.
At this point, the Cluster Etcd Operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled.
This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the CEO's `ConfigMap` - which can only succeed if the control-plane is healthy.
Contributor:

The disable flag is part of the UnsupportedConfigOverrides of the cluster/etcd CRD though.

So, if this EP relies on something that ends up in an UnsupportedConfigOverrides field, does this put the CO into a degraded state?

I'm thinking about the optics of a supported feature using something that is considered unsupported generally, is there a way we can improve this?

What is the specific value that is being updated? Is this the new proposal below?

> At this point, the Cluster Etcd Operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled.
> This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the CEO's `ConfigMap` - which can only succeed if the control-plane is healthy.

To enable this flow, we propose the addition of a `managedEtcdKind` field which defaults to `Cluster` but will be set to `External` during installation, and will only be respected if the `Infrastructure` CR's `TopologyMode` is `DualReplicaTopologyMode`.
Contributor:

Do we have any PRs for this API change? Where has the design of this got to?
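
To make the excerpt above concrete, a hedged sketch of the proposed field; the field name comes from the excerpt, but its placement on the `Etcd` operator resource (versus the CEO's `ConfigMap` or `unsupportedConfigOverrides`) is an assumption and is exactly what is being asked about here:

```yaml
# Sketch only: managedEtcdKind is a proposed field; its location is an assumption.
apiVersion: operator.openshift.io/v1
kind: Etcd
metadata:
  name: cluster
spec:
  managedEtcdKind: External   # proposed; defaults to Cluster and would only be
                              # respected when controlPlaneTopology is DualReplica
```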


While the target installation requires exactly 2 nodes, this will be achieved by proving out the "bootstrap plus 2 nodes" flow in the core installer and then using assisted-service-based installers to bootstrap from one of the target machines to remove the requirement for a bootstrap node.

So far, we've discovered topology-sensitive logic in ingress, authentication, CEO, and the cluster-control-plane-machineset-operator. We expect to find others once we introduce the new infrastructure topology.
Contributor:

Is there discussion somewhere about the issues found and the mitigations? In particular I'm interested in the CPMSO parts. I suspect we want to disable the feature on this topology. It relies on Machine API anyway; are we expecting MAPI to be available as part of this topology?

@jaypoulz (Contributor), Dec 13, 2024:

Machine API isn't relevant for this topology (as described in this proposal). If requirements come for compute nodes, then that could change.

I outlined the sensitivities I found in: https://docs.google.com/document/d/15NOPO50aLC5onUP-m6cVy1CguvCVfJmLYiKcREazbQE/edit?tab=t.0#bookmark=id.5y5cwxlcsndw

CPMSO can be disabled (we tested that early on). I don't fully understand what problems it's supposed to solve, so I assumed there would be some follow-up discussions with the maintainers before I declared a resolution path.

Contributor:

CPMSO is intended to provide an automated way to update control plane machines, but if there's no machine API, there's no CPMSO. Have you considered the capabilities API and just disabling Machine API completely on this topology? And potentially other capabilities that don't correlate?

Contributor:

I appreciate the suggestion about reviewing the capabilities API as a whole. I will add a task for that.

As far as disabling CPMSO, I will revisit how we disable that after reviewing the capabilities API. :)
It may be prudent to disable the CPMSO as well as the machine API in the first pass, so that there is still a level of protection if something down the line pushes us to enable machine API.
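
If the capabilities route is taken, a hedged sketch of what disabling Machine API through the install-config could look like; the capability list is illustrative only, and whether it is appropriate for this topology is exactly what the review above leaves open:

```yaml
# Sketch only: install-config capabilities mechanism; the enabled set shown
# here is illustrative and deliberately omits MachineAPI.
apiVersion: v1
baseDomain: example.com
capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - Console
  - Ingress
  - Storage
  - OperatorLifecycleManager
```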


The two-node architecture represents yet another distinct install type for users to choose from.

The existence of 1, 2, and 3+ node control-plane sizes will likely generate customer demand to move between them as their needs change.
Contributor:

I discussed this on the control plane arch call today, we can add a validation to make this immutable, but first must make sure the field is always populated. I need to discuss with @deads2k about how this has been handled in the past

Either way, I think we should assume that transitioning is not supported and document that here

@openshift-bot:

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 14, 2025
@jaypoulz (Contributor):

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 17, 2025
jaypoulz and others added 3 commits January 17, 2025 20:11
…OLA EP.

In this commit, I ran through all of the content in the OLA EP
(https://github.com/openshift/enhancements/pull/1674/files) and
brought over any details or structure that I found relevant or helpful
for the 2-node proposal. I also tweaked some sections with formatting
updates and updated some of the overall content to reflect the current
plan and understanding.
OCPBUGS-1460: Update 2NO EP with relevant structure and content from OLA EP.
openshift-ci bot commented Jan 22, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jerpeter1. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines 374 to 375
Layering on top of the enhancement proposal for [Two Node with Arbiter (TNA)](../arbiter-clusters.md#olm-filter-addition), it would be ideal to include
the `DualReplica` infrastructureTopology as an option that operators can leverage to communicate cluster-compatibility.
Member:

This sounds good to me. However, I'd just like to call out that OLM itself doesn't implement or even know about the infrastructureTopology annotations. The filtering happens only in the OCP console, and there is validation (I think in CVP?) that ensures that all operators explicitly declare support (or not) for each supported infrastructure feature.
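
For illustration, operators today advertise infrastructure support through CSV annotations that the console uses for filtering; a hypothetical extension for this topology might look like the following, where the `dual-replica` value is invented for this sketch and does not exist today:

```yaml
# Sketch only: the "dual-replica" feature value is hypothetical.
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: example-operator.v1.0.0
  annotations:
    operators.openshift.io/infrastructure-features: '["disconnected", "dual-replica"]'
```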

Comment on lines 299 to 301
Unfortunately, the Baremetal Operator already has a place to specify BMC credentials. However, providing credentials like this will result in conflicts as both the
Baremetal Operator and the pacemaker fencing agent will have control over the machine state. In short, this example shows an invalid configuration that we must check for
in the installer.
Contributor:

A few comments on this:

  1. I don't see how duplicating the same information solves the conflict between BMO and the pacemaker fencing agent.
  2. The fencingCredentials is per node; this example and the ones above show a single set of BMC credentials.

Contributor:

BMO is using these credentials to manage the machine state; I'm unsure why we need another way to provide the same information twice in the install-config.
I think it does make sense to add the fencingCredentials for the None platform, and in the case of the baremetal platform to just get the BMC creds from the hosts list and create additional secrets (or some other resource) for the pacemaker fencing agent (in addition to the secrets in the openshift-machine-api namespace).

Member:

We also have to not add the BMC credentials to the BMH CRs in the installer (i.e. they should come out as unmanaged, like they do from the assisted SaaS), otherwise ironic will try to fight the fencing agent.

If we were targeting just the baremetal platform, as in previous versions of this enhancement, I think we could reuse the existing fields for different purposes in different topologies.

One of the major design questions of two-node OpenShift is whether to target support for `platform: none` or `platform: baremetal`. The advantage of selecting `platform: baremetal` is that we can leverage the benefits of deploying an ingress-VIP out of the box using keepalived and haproxy. After some discussion with the metal networking team, it is expected that this might work without modifications as long as pacemaker fencing doesn't remove nodes from the node list so that both keepalived instances are always peers. Furthermore, it was noted that this might be solved more simply without keepalived at all by using the ipaddr2 resource agent for pacemaker to run the `ip addr add` and `ip addr remove` commands for the VIP.
The bottom line is that it will take some engineering effort to modify the out-of-the-box in-cluster networking feature for two-node OpenShift.

Outside of potentially reusing the networking bits of `platform: baremetal`, we discussed potentially reusing its API for collecting BMC credentials for fencing. In this approach, the `platform: baremetal` BMC entries would be loaded into BareMetalHost CRDs, and we'd extend BMO to initialize pacemaker instead of creating a new operator. After a discussion with the Baremetal Platform team, we were advised against using the Baremetal Operator as an inventory. Its purpose/scope is provisioning nodes.
Member:

This is not at all the reason.
The reason is that HA of the baremetal platform components is very much dependent on a working read-write control plane.
Fencing is absolutely part of the purpose/scope of BMO. However, users have already noted that for a 3-node (compact) cluster failover is too slow, and for a 2-node cluster it provably cannot work at all.

Contributor:

I think I misinterpreted the intention of this comment.
This section should explain that the existing BMO-based fencing is insufficient to meet our requirements and that integrating with BMO to reuse the existing APIs that BMO exposes would add a dependency on BMO to behave like a baremetal inventory - which is not its purpose.


Given the likelihood of customers wanting flexibility over the footprint and capabilities of the platform operators running on the cluster, the safest path forward is to target TNF clusters on both `platform: none` and `platform: baremetal` clusters.

For `platform: none` clusters, this will require customers to provide an ingress load balancer. That said, if in-cluster networking becomes a feature customers request for `platform: none` we can work with the Metal Networking team to prioritize this as a feature for this platform in the future.
Member:

Both an Ingress and an API load balancer.

Member:

BTW #1666 is the enhancement proposal for this feature, but it seems to be stalled at the moment.

none:
  fencingCredentials:
    bmc:
      address: ipmi://<out_of_band_ip>
Member:

We are only going to support redfish here I thought?

Contributor:

For TP, redfish only. For future releases, we plan on adding IPMI.
I'll update the example.
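
For reference, a hedged sketch of what the Redfish-only variant of the excerpt above might look like; the field layout mirrors the excerpt, while the address format follows BMO-style Redfish addressing and is an assumption rather than the final schema:

```yaml
# Sketch only: Redfish variant of the fencingCredentials example; exact schema TBD.
none:
  fencingCredentials:
    bmc:
      address: redfish://<out_of_band_ip>/redfish/v1/Systems/1
      username: <bmc_username>
      password: <bmc_password>
```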

Signed-off-by: Michael Shitrit <[email protected]>
## Version Skew Strategy

The biggest concern with version skew would be incompatibilities between a new version of pacemaker and the currently running resource agents.
Upgrades will not atomically replace both the RPM and the resource agent configuration, not are there any guarantees that both nodes will be running
@dhensel-rh, Feb 4, 2025:

Suggested change
Upgrades will not atomically replace both the RPM and the resource agent configuration, not are there any guarantees that both nodes will be running
Upgrades will not atomically replace both the RPM and the resource agent configuration, nor are there any guarantees that both nodes will be running

Contributor (Author):

@dhensel-rh I think "atomically" is correct here, as in something being done in an atomic way, but the "nor" is a good fix.
Can you please modify the suggestion?

**Note:** *Section not required until targeted at a release.*

### CI
The initial release of TNF should aim to build a regression baseline.


Is the initial release considered TechPreview or is it GA?

Contributor (Author):

TechPreview

mshitrit and others added 12 commits February 5, 2025 10:28
Managing Fencing credentials
Updates include:
- Clarifications about the fencing network needing to be separate
- Calling out unsuitable workloads (e.g. safety-critical)
- Explaining why upgrades only work when both nodes are healthy
- Comparing TNF to active-passive SNO
- Added test for verifying certificate rotation with an unhealthy node
- Updated baremetal usage to match https://github.com/metal3-io/metal3-docs/blob/master/design/bare-metal-style-guide.md
…node

- Topology transitions may be discussed later.
- Running TNF with a failed node is not the same as Single Node OpenShift
OCPEDGE-1458: Addressing post-architecture review feedback
Note about how we'd manage pacemaker configuration
openshift-ci bot commented Feb 17, 2025

@mshitrit: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
