Two nodes openshift #1675
base: master
Conversation
Skipping CI for Draft Pull Request.
I think this will require some changes to […]. It would be good to at least list out the additional information that will be needed (fencing info at least), if not propose the new API.
Signed-off-by: Michael Shitrit <[email protected]>
#### Cluster Creator Role:
* The Cluster Creator will automatically install the 2NO (by using an installer); the installation process will include the following steps:
* Deploys a two-node OpenShift cluster
Nit: Is it a 2NO if the steps after this have not been completed? I feel like this line greatly simplifies the workflow here, and maybe we should go into more detail about what this actually means.
Should be clearer now.
I think the options are that a cluster is 2NO when:
- 2 nodes, and
- the proposed externallyManagedEtcd field is true, and
- (optional?) a new feature gate is enabled
+1 to using a feature gate, but that only flags that the cluster is not upgradeable because it's using a tech preview or even dev preview feature. By the time it goes GA that will be promoted and the cluster will be upgradeable
Still useful to avoid changes breaking the main payload though
2. Etcd loses internal quorum (E-quorum) and goes read-only
3. Both sides retain C-quorum and initiate fencing of the other side. There is a different delay between the two nodes for executing the fencing operation, to prevent both fencing operations from succeeding in parallel and thus shutting down the system completely.
4. One side wins, pre-configured as Node1
5. Pacemaker on Node1 forces E-quorum (etcd promotion event)
Were both members already voting members of the etcd cluster, or was one just a follower?
voting
When etcd loses quorum and you have two halves, both are in read-only states.
Have you spoken to the etcd folks/documented what is actually involved in this step?
It's been a while since I've had to do this, so things may have changed, but my understanding is that the only way to recover from this is to either reconnect them, or force a "new cluster" using the old state, which removes all members and starts the cluster again. On the other member you'd also need to remove state and have it come up as a fresh etcd, resync across, and then promote again.
"forcing quorum" makes this sound easy, but I think the document currently underplays exactly what's involved in this process
Extensive consultation with the etcd team :-)
The updated proposal should make things clearer, or do you think we need the exact commands?
Maybe not exact commands, but a quick delve into the process and risks of forcing etcd quorum might be useful context?
@JoelSpeed, you are right, it sounds easier than it is.
We are considering and discussing the cases and states together with the Etcd team.
quick delve into the process
It is not quick 😄, but it boils down mostly to the two ways a node can go down:
- gracefully
- non-gracefully
In the first, the shutting-down node can remove itself from the member list, which lets the surviving node continue standalone without rebooting into a new cluster.
In the second, the surviving node must instead force a new cluster.
The restarting node will decide what to do based on the status of the other node (running, starting, or stopped) and on data about Cluster ID and Revision from both nodes.
We haven't completed the picture of the cases yet, but I wonder what level of detail we want to report in this enhancement.
Personally, I'd be keen to see the process documented fairly thoroughly, as it's rather key to the project working and being safe
Fair enough, I will add a PR with more details
updated 156c11a
…eams are most likely to be interested in and adjust the level of detail
* Minimize recovery-caused unavailability, e.g. by avoiding fencing loops, wherein each node power-cycles its peer after booting, reducing the cluster's availability.
* Recover the API server in less than 120s, as measured by the surviving node's detection of a failure
* Minimize any differences to existing OpenShift topologies
* Avoid any decisions that would prevent future implementation and support for upgrade/downgrade paths between two-node and traditional architectures
This means turning a two node cluster into a three node cluster? Are we sure this tallies with the rest of the doc? It seems to contradict the final non-goal
Yes - it's specifically calling out going from 2NO to HA-Compact.
This is because while support for that is a non-goal, PM has made it clear that there is a strong interest in this as a follow-on feature. Since we don't have requirements for this, we wanted to point out that while it's a secondary (or tertiary) goal, we'd like to avoid implementations that will create lots of tech debt if requirements firm up for us.
Standalone compact being a three node control plane with no workers?
What `controlPlaneTopology` is a three-node compact? And does that mean we then need to explore more about the issues with operators that are not compatible today with changing topology?
Yes, to standalone compact being 3 control plane nodes and 0 compute.
The topology used for anything with 3 or more control plane nodes is HighlyAvailable
I think we need to explore topology transitions as their own enhancement proposal.
Yep, agreed. For now, I think as a product, and in this EP, we should assume transitions are not supported.
Fencing setup is the last important aspect of the cluster installation. For the cluster installation to be successful, fencing should be configured and active before we declare the installation successful. To do this, baseboard management controller (BMC) credentials need to be made available to the control-plane nodes as part of pacemaker initialization.
To ensure rapid fencing using pacemaker, we will collect Redfish details (address, username, and **password**) for each node via the install-config (see proposed install-config changes).
This will take a format similar to that of the [Baremetal Operator](https://docs.openshift.com/container-platform/4.17/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#bmc-addressing_ipi-install-installation-workflow).
We will create a new MachineConfig that writes BMC credentials to the control-plane disks. This will resemble the BMC specification used by the [BareMetalHost](https://docs.openshift.com/container-platform/4.17/rest_api/provisioning_apis/baremetalhost-metal3-io-v1alpha1.html#spec-bmc) CRD.
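To make that mechanism concrete, here is a minimal sketch of what such a MachineConfig might look like; the file path, mode, and payload format are illustrative assumptions, not the finalized design from this proposal.

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-fencing-credentials   # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.4.0
    storage:
      files:
        - path: /etc/fencing/bmc-credentials.json   # assumed location on the control-plane disks
          mode: 0600
          overwrite: true
          contents:
            # base64-encoded payload resembling the BareMetalHost BMC spec,
            # e.g. {"address": "redfish://...", "username": "...", "password": "..."}
            source: data:text/plain;base64,<base64-encoded-credentials>
```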
Are these already written to disk on control plane nodes in similar baremetal clusters, or do they only exist today inside the cluster etcd for use by the BMO controllers?
The latter. BMO stores the URL and a link to a credentials secret in a BareMetalHost CRD.
For our own implementation, it is simpler to load these on disk, since we don't have anything to consume these in-cluster.
Are there any security concerns tied to adding this new credentials file to disk? Does this increase our exposure?
This is something we need to do more work on. Writing credentials to disk is inherently risky. The observation I'd make is that even if you forced credentials to be loaded in through some kind of vault, the nodes would still have a credential they can use to get a credential - and a compromised system could be at risk.
I think what we need to do is figure out how to lock down these credentials so they're as secure as possible at rest on the system. We could use something like the Gnome Keyring to lock them up when at rest, but pacemaker already has a utility for caching the credentials securely. So then the question is, do we delete them after pacemaker has been initialized?
And how do we ensure there is still a way for a user to login to the machines and update the credentials? These all need to be nailed down with specifics.
I think we want to rely on k8s's Secret management here rather than roll a new thing. MachineConfigs themselves are not encrypted at rest AFAIK.
Can the process that needs the credentials be deployed as a DaemonSet with a Secret volume mounted into it?
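As a rough illustration of that suggestion (not part of the proposal), a DaemonSet pinned to the control-plane nodes could mount the credentials Secret; all names and the namespace below are placeholders.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fencing-agent                     # placeholder name
  namespace: openshift-two-node-fencing   # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: fencing-agent
  template:
    metadata:
      labels:
        app: fencing-agent
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""   # run only on control-plane nodes
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      containers:
        - name: init-pacemaker
          image: <pacemaker-init-image>      # placeholder image
          volumeMounts:
            - name: bmc-credentials
              mountPath: /etc/fencing-credentials
              readOnly: true
      volumes:
        - name: bmc-credentials
          secret:
            secretName: fencing-credentials  # placeholder Secret holding the BMC details
```

This keeps the credentials in etcd-backed Secrets rather than in a MachineConfig, at the cost of depending on the API server being reachable whenever the agent needs to re-read them, which is exactly the trade-off discussed in the replies below.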
I'll look into the specifics of using a daemonset. The concern I have is: if the cluster enters a state where etcd quorum is lost, is that volume still mounted on disk? I think the answer is yes, since that shouldn't affect the scheduling of existing containers.
I think this would work for the initial setup. It would probably also be possible to handle secret updates with this, by allowing the user to update the secrets file when the cluster is healthy.
The biggest concern is when the configuration becomes invalid when the cluster is unhealthy. In this case, you need to run a reinitialization process directly on the nodes to update the secret. This may be fine, but I want to review with the SMEs to ensure that the situations where this happens and where a full recovery would be impossible without manual intervention are exactly 1 to 1.
Running workloads can keep running without an API server. Although if the Pod is dead at the same time as the API server goes down you are in trouble.
A static Pod would be better but you can't mount Secrets in them 🙁
A mechanism is needed for components of the cluster to understand that this is a two-node control-plane topology that may require different handling.
We will define a new value for the `TopologyMode` enum: `DualReplica`.
The enum is used for the `controlPlaneTopology` and `infrastructureTopology` fields, and the currently supported values are `HighlyAvailable`, `SingleReplica`, and `External`. |
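For illustration, this is roughly how the proposed value would surface on the cluster's Infrastructure CR; the `infrastructureTopology` shown follows the discussion in this thread rather than a settled decision.

```yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  controlPlaneTopology: DualReplica        # proposed new value for two-node clusters
  infrastructureTopology: HighlyAvailable  # per the thread below, likely left unchanged
```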
And HighlyAvailableArbiter. We probably want somewhere in this doc to mention any connection between DualReplica and HighlyAvailableArbiter
I have set aside a task to go through the arbiter EP and port over any relevant/related content. It is strangely absent from this document. 😅
Note that if you create a new topology then all OLM operators have to be updated to understand what to do with it.
For 2NO I think the appropriate `infrastructureTopology` is still Highly Available, because a regular HA cluster has a minimum of only 2 workers anyway. At least, that's the argument I made on the arbiter EP 🙂
Arguably there is a need for a separate `controlPlaneTopology` here, since up to now HA has always meant 3 control plane nodes, and some operators may be relying on that. Hopefully fewer operators are looking at `controlPlaneTopology` than would be looking at `infrastructureTopology`.
Yes, I think it's appropriate that we add a new `controlPlaneTopology` value (which is what arbiter did), but I don't think the infrastructure topology needs to be changed; that would make sense, as you say, as things expect 2 schedulable workers.
Initially the creation of an etcd cluster will be driven in the same way as other platforms.
Once the cluster has two members, the etcd daemon will be removed from the static pod definition and recreated as a resource controlled by RHEL-HA.
At this point, the Cluster Etcd Operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled.
This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the CEO's `ConfigMap` - which can only succeed if the control-plane is healthy.
The disable flag is part of the UnsupportedConfigOverrides of the cluster/etcd CRD though.
So, if this EP relies on something that ends up in an UnsupportedConfigOverrides field, does this put the CO into a degraded state?
I'm thinking about the optics of a supported feature using something that is considered unsupported generally, is there a way we can improve this?
What is the specific value that is being updated? Is this the new proposal below?
At this point, the Cluster Etcd Operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled.
This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the CEO's `ConfigMap` - which can only succeed if the control-plane is healthy.
To enable this flow, we propose the addition of a `managedEtcdKind` field which defaults to `Cluster` but will be set to `External` during installation, and will only be respected if the `Infrastructure` CR's `TopologyMode` is `DualReplicaTopologyMode`.
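A sketch of how that proposed field might look; the proposal names `managedEtcdKind` and its values but not its exact home, so placing it on the `Etcd` operator CR spec here is an assumption for illustration only.

```yaml
apiVersion: operator.openshift.io/v1
kind: Etcd
metadata:
  name: cluster
spec:
  # Proposed field: defaults to Cluster; set to External at install time, and
  # only honored when the Infrastructure CR reports DualReplicaTopologyMode.
  managedEtcdKind: External
```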
Do we have any PRs for this API change? Where has the design of this got to?
While the target installation requires exactly 2 nodes, this will be achieved by proving out the "bootstrap plus 2 nodes" flow in the core installer and then using assisted-service-based installers to bootstrap from one of the target machines to remove the requirement for a bootstrap node.
So far, we've discovered topology-sensitive logic in ingress, authentication, CEO, and the cluster-control-plane-machineset-operator. We expect to find others once we introduce the new infrastructure topology. |
Is there discussion somewhere about the issues found and the mitigations? In particular I'm interested in the CPMSO parts. I suspect we want to disable the feature on this topology. It relies on Machine API anyway; are we expecting MAPI to be available as part of this topology?
Machine API isn't relevant for this topology (as described in this proposal). If requirements come for compute nodes, then that could change.
I outlined the sensitivities I found in: https://docs.google.com/document/d/15NOPO50aLC5onUP-m6cVy1CguvCVfJmLYiKcREazbQE/edit?tab=t.0#bookmark=id.5y5cwxlcsndw
CPMSO can be disabled (we tested that early on). I don't fully understand what problems it's supposed to solve, so I assumed we'd have some follow-up discussions with the maintainers before I declared a resolution path.
CPMSO is intended to provide an automated way to update control plane machines, but if there's no machine API, there's no CPMSO. Have you considered the capabilities API and just disabling Machine API completely on this topology? And potentially other capabilities that don't correlate?
I appreciate the suggestion about reviewing the capabilities API as a whole. I will add a task for that.
As far as disabling CPMSO, I will revisit how we disable that after reviewing the capabilities API. :)
It may be prudent to disable the CPMSO as well as the machine API in the first pass, so that there is still a level of protection if something down the line pushes us to enable machine API.
The two-node architecture represents yet another distinct install type for users to choose from.
The existence of 1, 2, and 3+ node control-plane sizes will likely generate customer demand to move between them as their needs change. |
I discussed this on the control plane arch call today, we can add a validation to make this immutable, but first must make sure the field is always populated. I need to discuss with @deads2k about how this has been handled in the past
Either way, I think we should assume that transitioning is not supported and document that here
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
/remove-lifecycle stale
…OLA EP. In this commit, I ran through all of the content in the OLA EP (https://github.com/openshift/enhancements/pull/1674/files) and brought over any details or structure that I found relevant or helpful for the 2-node proposal. I also tweaked some sections with formatting updates and updated some of the overall content to reflect the current plan and understanding.
OCPBUGS-1460: Update 2NO EP with relevant structure and content from OLA EP.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
enhancements/two-node-fencing/tnf.md
Outdated
Layering on top of the enhancement proposal for [Two Node with Arbiter (TNA)](../arbiter-clusters.md#olm-filter-addition), it would be ideal to include
the `DualReplica` infrastructureTopology as an option that operators can leverage to communicate cluster-compatibility.
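As a rough sketch of what that could mean for an operator author, assuming the same CSV annotation mechanism the console uses for infrastructure-feature filtering; the annotation key and value below are assumptions, not something this proposal defines.

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: example-operator.v1.0.0   # placeholder operator
  annotations:
    # Hypothetical infrastructure-feature annotation declaring support for
    # DualReplica clusters; the key is illustrative only.
    features.operators.openshift.io/dual-replica: "true"
```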
This sounds good to me. However, I'd just like to call out that OLM itself doesn't implement or even know about the infrastructureTopology annotations. The filtering happens only in the OCP console, and there is validation (I think in CVP?) that ensures that all operators explicitly declare support (or not) for each supported infrastructure feature.
enhancements/two-node-fencing/tnf.md
Outdated
Unfortunately, the Baremetal Operator already has a place to specify BMC credentials. However, providing credentials like this will result in conflicts, as both the
Baremetal Operator and the pacemaker fencing agent will have control over the machine state. In short, this example shows an invalid configuration that we must check for
in the installer.
A few comments on this:
- I don't see how duplicating the same information solves the conflict between BMO and the pacemaker fencing agent.
- The fencingCredentials is per node; this example and the ones above show a single set of BMC credentials.
BMO is using these credentials to manage the machine state; I'm unsure why we need another way to provide the same information twice in the install-config.
I think it does make sense to add the fencingCredentials for the None platform, and in the case of the baremetal platform just get the BMC creds from the hosts list and create additional secrets (or some other resource) for the pacemaker fencing agent (in addition to the secrets in the openshift-machine-api namespace).
We also have to not add the BMC credentials to the BMH CRs in the installer (i.e. they should come out as `unmanaged`, like they do from the assisted SaaS), otherwise ironic will try to fight the fencing agent.
If we were targeting just the `baremetal` platform, as in previous versions of this enhancement, I think we could reuse the existing fields for different purposes in different topologies.
enhancements/two-node-fencing/tnf.md
Outdated
One of the major design questions of two-node OpenShift is whether to target support for `platform: none` or `platform: baremetal`. The advantage of selecting `platform: baremetal` is that we can leverage the benefits of deploying an ingress-VIP out of the box using keepalived and haproxy. After some discussion with the metal networking team, it is expected that this might work without modifications as long as pacemaker fencing doesn't remove nodes from the node list so that both keepalived instances are always peers. Furthermore, it was noted that this might be solved more simply without keepalived at all by using the ipaddr2 resource agent for pacemaker to run the `ip addr add` and `ip addr remove` commands for the VIP.
The bottom line is that it will take some engineering effort to modify the out-of-the-box in-cluster networking feature for two-node OpenShift.
Outside of potentially reusing the networking bits of `platform: baremetal`, we discussed potentially reusing its API for collecting BMC credentials for fencing. In this approach, the `platform: baremetal` BMC entries would be loaded into BareMetalHost CRDs and we'd extend BMO to initialize pacemaker instead of writing a new operator. After a discussion with the Baremetal Platform team, we were advised against using the Baremetal Operator as an inventory. Its purpose/scope is provisioning nodes.
This is not at all the reason.
The reason is that HA of the baremetal platform components is very much dependent on a working read-write control plane.
Fencing is absolutely part of the purpose/scope of BMO. However, users have already noted that for a 3-node (compact) cluster failover is too slow, and for a 2-node cluster it provably cannot work at all.
I think I misinterpreted the intention of this comment.
This section should explain that the existing BMO-based fencing is insufficient to meet our requirements and that integrating with BMO to reuse the existing APIs that BMO exposes would add a dependency on BMO to behave like a baremetal inventory - which is not its purpose.
enhancements/two-node-fencing/tnf.md
Outdated
Given the likelihood of customers wanting flexibility over the footprint and capabilities of the platform operators running on the cluster, the safest path forward is to target TNF clusters on both `platform: none` and `platform: baremetal` clusters.
For `platform: none` clusters, this will require customers to provide an ingress load balancer. That said, if in-cluster networking becomes a feature customers request for `platform: none`, we can work with the Metal Networking team to prioritize this as a feature for this platform in the future.
Both an Ingress and an API load balancer.
BTW #1666 is the enhancement proposal for this feature, but it seems to be stalled at the moment.
none:
  fencingCredentials:
    bmc:
      address: ipmi://<out_of_band_ip>
We are only going to support redfish here I thought?
For TP, redfish only. For future releases, we plan on adding IPMI.
I'll update the example.
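For reference, a Redfish-only variant of the example above might look roughly like this; the per-node layout and field names are assumptions, since the install-config schema is still being settled in this proposal.

```yaml
platform:
  none:
    fencingCredentials:
      # Assumed per-node entries, following the earlier review comment that
      # credentials need to be provided for each control-plane node.
      - nodeName: master-0
        bmc:
          address: redfish://<out_of_band_ip>/redfish/v1/Systems/<system_id>
          username: <bmc_username>
          password: <bmc_password>
      - nodeName: master-1
        bmc:
          address: redfish://<out_of_band_ip>/redfish/v1/Systems/<system_id>
          username: <bmc_username>
          password: <bmc_password>
```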
Signed-off-by: Michael Shitrit <[email protected]>
enhancements/two-node-fencing/tnf.md
Outdated
## Version Skew Strategy
The biggest concern with version skew would be incompatibilities between a new version of pacemaker and the currently running resource agents.
Upgrades will not atomically replace both the RPM and the resource agent configuration, not are there any guarantees that both nodes will be running
Suggested change:
- Upgrades will not atomically replace both the RPM and the resource agent configuration, not are there any guarantees that both nodes will be running
+ Upgrades will not atomically replace both the RPM and the resource agent configuration, nor are there any guarantees that both nodes will be running
@dhensel-rh I think atomically is correct here, as in something being done in an atomic way, but the `nor` is a good fix.
Can you please modify the suggestion?
**Note:** *Section not required until targeted at a release.*
### CI
The initial release of TNF should aim to build a regression baseline.
Is the initial release considered TechPreview or is it GA?
TechPreview
Co-authored-by: Douglas Hensel <[email protected]>
Co-authored-by: Douglas Hensel <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
Managing Fencing credentials
Updates include:
- Clarifications about the fencing network needing to be separate
- Calling out unsuitable workloads (e.g. safety-critical)
- Explaining why upgrades only work when both nodes are healthy
- Comparing TNF to active-passive SNO
- Added test for verifying certificate rotation with an unhealthy node
- Updated baremetal usage to match https://github.com/metal3-io/metal3-docs/blob/master/design/bare-metal-style-guide.md
…node
- Topology transitions may be discussed later.
- Running TNF with a failed node is not the same as Single Node OpenShift
OCPEDGE-1458: Addressing post-architecture review feedback
Signed-off-by: Michael Shitrit <[email protected]>
Note about how we'd manage pacemaker configuration
@mshitrit: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This enhancement describes a high-level design concept for building a new two-node OpenShift.
For additional info/discussion: