From a9b71abbbf51fa1a42242ae8785bb89adc304618 Mon Sep 17 00:00:00 2001 From: Michael Shitrit Date: Wed, 4 Sep 2024 17:58:26 +0300 Subject: [PATCH 01/49] Adding the 2no - two nodes openshift template Signed-off-by: Michael Shitrit --- enhancements/two-nodes-openshift/2no.md | 618 ++++++++++++++++++++++++ 1 file changed, 618 insertions(+) create mode 100644 enhancements/two-nodes-openshift/2no.md diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md new file mode 100644 index 0000000000..53ba92413f --- /dev/null +++ b/enhancements/two-nodes-openshift/2no.md @@ -0,0 +1,618 @@ +--- +title: 2no +authors: + - "@mshitrit" +reviewers: + - "@rwsu" + - "@fabbione" + - "@carbonin" + - "@thomasjungblut" + - "@brandisher" + - "@DanielFroehlich" + - "@jerpeter1" + - "@slintes" + - "@beekhof" + - "@eranco74" + - "@yuqi-zhang" + - "@gamado" + - "@razo7" + - "@frajamomo" + - "@clobrano" + +approvers: + - "@rwsu" + - "@fabbione" + - "@carbonin" + - "@thomasjungblut" + - "@brandisher" + - "@DanielFroehlich" + - "@jerpeter1" + - "@slintes" + - "@beekhof" + - "@eranco74" + - "@yuqi-zhang" +api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None" + - "@jerpeter1" +creation-date: 2024-09-05 +last-updated: 2024-09-22 +tracking-link: + - https://issues.redhat.com/browse/OCPSTRAT-1514 +--- + +# Two Nodes Openshift (2NO) + +## Terms + +RHEL-HA - a general purpose clustering stack shipped by Red Hat (and others) primarily consisting of Corosync and Pacemaker. Known to be in use by airports, financial exchanges, and defense organizations, as well as used on trains, satellites, and expeditions to Mars. + +Corosync - a Red Hat led [open-source project](https://corosync.github.io/corosync/) that provides a consistent view of cluster membership, reliable ordered messaging, and flexible quorum capabilities. + +Pacemaker - a Red Hat led [open-source project](https://clusterlabs.org/pacemaker/doc/) that works in conjunction with Corosync to provide general purpose fault tolerance and automatic failover for critical services and applications. + +Resource Agent - A resource agent is an executable that manages a cluster resource. No formal definition of a cluster resource exists, other than "anything a cluster manages is a resource." Cluster resources can be as diverse as IP addresses, file systems, database services, and entire virtual machines - to name just a few examples. +
[more context here](https://github.com/ClusterLabs/resource-agents/blob/main/doc/dev-guides/ra-dev-guide.asc) + +Fencing - the process of “somehow” isolating or powering off malfunctioning or unresponsive nodes to prevent them from causing further harm or interference with the rest of the cluster. + +Fence Agent - Fence agents were developed as device "drivers" which are able to prevent computers from destroying data on shared storage. Their aim is to isolate a corrupted computer, using one of three methods: +* Power - A computer that is switched off cannot corrupt data, but it is important to not do a "soft-reboot" as we won't know if this is possible. This also works for virtual machines when the fence device is a hypervisor. +* Network - Switches can prevent routing to a given computer, so even if a computer is powered on it won't be able to harm the data. +* Configuration - Fibre-channel switches or SCSI devices allow us to limit who can write to managed disks. +
[more context here](https://github.com/ClusterLabs/fence-agents/) + +Quorum - having the minimum number of members required for decision-making. The most common threshold is 1 plus half the total number of members, though more complicated algorithms predicated on fencing are also possible. +C-quorum: quorum as determined by Corosync members and algorithms +E-quorum: quorum as determined by etcd members and algorithms + +Split-brain - a scenario where a set of peers are separated into groups smaller than the quorum threshold AND peers decide to host services already running by other groups. Typically results in data loss or corruption unless state is stored outside of the cluster. + +MCO - Machine Config Operator. This operator manages updates to systemd, cri-o/kubelet, kernel, NetworkManager, etc. It also offers a new MachineConfig CRD that can write configuration files onto the host. + +ABI - Agent-Based Installer. + +ZTP - Zero-Touch Provisioning. + + +## Summary + +The Two Nodes OpenShift (2NO) initiative aims to provide a container management solution with a minimal footprint suitable for customers with numerous geographically dispersed locations. +Traditional three-node setups represent significant infrastructure costs, making them cost-prohibitive at retail and telco scale. This proposal outlines how we can implement a two-node OpenShift cluster while retaining the ability to survive a node failure. + +## Motivation + +Customers with tens-of-thousands of geographically dispersed locations seek a container management solution that retains some level of resilience but does not come with a traditional three-node footprint. Even "cheap" third nodes represent a significant cost at this scale. +The benefits of the cloud-native approach to developing and deploying applications are increasingly being adopted in edge computing. As the distance between a site and the central management hub grows, the number of servers at the site tends to shrink. The most distant sites often have physical space for only one or two servers. +We are seeing an emerging pattern where some infrastructure providers and application owners desire a consistent deployment approach for their workloads across these disparate environments. They also require that the edge sites operate independently from the central management hub. Users who have adopted Kubernetes at their central management sites wish to extend this independence to remote sites through the deployment of independent Kubernetes clusters. +For example, in the telecommunications industry, particularly within 5G Radio Access Networks (RAN), there is a growing trend toward cloud-native implementations of the 5G Distributed Unit (DU) component. This component, due to latency constraints, must be deployed close to the radio antenna, sometimes on a single server at remote locations like the base of a cell tower or in a datacenter-like environment serving multiple base stations. +A hypothetical DU might require 20 dedicated cores, 24 GiB of RAM consumed as huge pages, multiple SR-IOV NICs carrying several Gbps of traffic each, and specialized accelerator devices. The node hosting this workload must run a real-time kernel, be carefully tuned to meet low-latency requirements, and support features like Precision Timing Protocol (PTP). Crucially, the "cloud" hosting this workload must be autonomous, capable of continuing to operate with its existing configuration and running workloads even when centralized management functionality is unavailable. 
+Given these factors, a two-node deployment of OpenShift offers a consistent, reliable solution that meets the needs of customers across all their sites, from central management hubs to the most remote edge locations. + + +### User Stories + +* As a large enterprise with multiple remote sites, I want a cost-effective OpenShift cluster solution so that I can manage containers without the overhead of a third node. +* As a support engineer, I want an automated method for handling the failure of a single node so that I can quickly restore service and maintain system integrity. +* As an infrastructure administrator, I want to ensure seamless failover for virtual machines (VMs) so that in the event of a node failure, the VMs are automatically migrated to a healthy node with minimal downtime and no data loss. +* As a network operator, I want my Cloud-Native Network Functions (CNFs) to be orchestrated consistently using OpenShift, regardless of whether they are in datacenters, or at the far edge where physical space is limited. + +### Goals + +* Implement a highly available two-node OpenShift cluster. +* Ensure cluster stability and operational efficiency. +* Provide clear methods for node failure management and recovery. +* Identify and integrate with a technology or partner that can provide storage in a two-node environment. + +### Non-Goals + +* Reliance on traditional third-node or SNO setups. +* Make sure we don't prevent upgrade/downgrade paths between two-node and traditional architectures +* Adding worker nodes +* Support for platforms other than bare metal including automated ci testing +* Support for other topologies (eg. hypershift) +* Failover time: if the leading node goes down, the remaining nodes takes over and gains operational state (writable) in less than 60s +* support full recovery of the workload when the node comes back online after restoration - total time under 15 minutes + + +## Proposal + + +To achieve a two-node OpenShift cluster, we are leveraging traditional high-availability concepts and technologies. The proposed solution includes: + +1. Leverage of the Full RHEL-HA Stack: + * Run the RHEL-HA stack “under to kubelet” (directly on the hardware, not as an OpenShift workload) + * Corosync for super fast failure detection, membership calculations, which in turn will trigger Pacemaker to apply Fencing based on Corosync quorum/membership information. + * Pacemaker for integrating membership and quorum information, driving fencing, and managing if/when kubelet and etcd can be started + * Pacemaker models kubelet and cri-o as a clone (much like a ReplicaSet) and etcd as a “promotable clone” (think a construct designed for leader/follower style services). Together with fencing and quorum, this ensures that an isolated node that reboots is inert and can do no harm. + * Pacemaker is [configured](//TODO mshitrit add link) to manage etcd/cri-o/kubelet, it will start/stop/restart those services using a script or an executable. + * Pacemaker does not understand what it is managing, and expects an executable or script that knows how to start/stop/monitor (and optionally promote/demote) the service. + 1. Likely we would need to create one for etcd + 2. For kubelet and cri-o we can likely use the existing systemd unit file +2. Failure Scenarios: + * Implement detailed handling procedures for cold boots, network failures, node failures, kubelet failures, and etcd failures using the RHEL-HA stack. + [see examples](#failure-handling) +3. 
Fencing Methods: + * We plan to use Baseboard Management Controller (BMC) as our primary fencing method, the premise of using BMC for fencing is that a node that is powered off, or was previously powered off and configured to be inert until quorum forms, is not in a position to cause corruption or diverging datasets. Sending power-off (or reboot) commands to the peer’s BMC achieves this goal. + + +### Workflow Description + +#### Cluster Creator Role: +* The Cluster Creator will automatically install the 2NO (by using an installer), installation process will include the following steps: + * Deploys a two-node OpenShift cluster + * Configures cluster membership and quorum using Corosync. + * Sets up Pacemaker for resource management and fencing. + +#### Application Administrator Role: +* Receives cluster credentials. +* Deploys applications within the two-node cluster environment. + +#### Failure Handling: + +1. Cold Boot + 1. One node (Node1) boots + 2. Node1 does have “corosync quorum” (C-quorum) (requires forming a membership with it’s peer) + 3. Node1 does not start etcd or kubelet, remains inert waiting for Node2 + 4. Peer (Node2) boots + 5. Corosync membership containing both nodes forms + 6. Pacemaker “starts” etcd on both nodes + * Detail, this could be a “soft”-start which allows us to determine which node has the most recent dataset. + 7. Pacemaker “promotes” etcd on whichever node has the most recent dataset + 8. Pacemaker “promotes” etcd on the peer once it has caught up + 9. Pacemaker starts kubelet on both nodes + 10. Fully functional cluster +2. Network Failure + 1. Corosync on both nodes detects separation + 2. Etcd loses internal quorum (E-quorum) and goes read-only + 3. Both sides retain C-quorum and initiate fencing of the other side. There is a different delay between the two nodes for executing the fencing operation to avoid both fencing operations to succeed in parallel and thus shutting down the system completely. + 4. One side wins, pre-configured as Node1 + 5. Pacemaker on Node1 forces E-quorum (etcd promotion event) + 6. Cluster continues with no redundancy + 7. … time passes … + 8. Node2 boots - persistent network failure + * Node2 does not have C-quorum (requires forming a membership with it’s peer) + * Node2 does not start etcd or kubelet, remains inert waiting for Node1 + 9. Network is repaired + 10. Corosync membership containing both nodes forms + 11. Pacemaker “starts” etcd on Node2 as a follower of Node1 + 12. Pacemaker “promotes” etcd on Node2 as full replica of Node1 + 13. Pacemaker starts kubelet + 14. Cluster continues with 1+1 redundancy +3. Node Failure + 1. Corosync on the survivor (Node1) + 2. Etcd loses internal quorum (E-quorum) and goes read-only + 3. Node1 retains “corosync quorum” (C-quorum) and initiates fencing of Node2 + 4. Pacemaker on Node1 forces E-quorum (etcd promotion event) + 5. Cluster continues with no redundancy + 6. … time passes … + 7. Node2 has a persistent failure that prevents communication with Node1 + * Node2 does not have C-quorum (requires forming a membership with it’s peer) + * Node2 does not start etcd or kubelet, remains inert waiting for Node1 + 8. Persistent failure on Node2 is repaired + 9. Corosync membership containing both nodes forms + 10. Pacemaker “starts” etcd on Node2 as a follower of Node1 + 11. Pacemaker “promotes” etcd on Node2 as full replica of Node1 + 12. Pacemaker starts kubelet + 13. Cluster continues with 1+1 redundancy +4. Two Failures + 1. Node2 failure (1st failure) + 2. 
Corosync on the survivor (Node1) + 3. Etcd loses internal quorum (E-quorum) and goes read-only + 4. Node1 retains “corosync quorum” (C-quorum) and initiates fencing of Node2 + 5. Pacemaker on Node1 forces E-quorum (etcd promotion event) + 6. Cluster continues with no redundancy + 7. Node1 experience a power failure (2nd Failure) + 8. … time passes … + 9. Node1 Power restored + 10. Node1 boots but can not gain quorum before Node2 joins the cluster due to a risk of fencing loop + * Mitigation (Phase 1): manual intervention (possibly a script) in case admin can guarantee Node2 is down, which will grant Node1 quorum and restore cluster limited (none HA) functionality. + * Mitigation (Phase 2): limited automatic intervention for some use cases: for example Node1 will gain quorum only if Node2 can be verified to be down by successfully querying its BMC status. +5. Kubelet Failure + 1. Pacemaker’s monitoring detects the failure + 2. Pacemaker restarts kubelet + 3. Stop failure is optionally escalated to a node failure (fencing) + 4. Start failure defaults to leaving the service offline +6. Etcd Failure + 1. Pacemaker’s monitoring detects the failure + 2. Pacemaker either demotes etcd so it can resync, or restarts and promotes etcd + 3. Stop failure is optionally escalated to a node failure (fencing) + 4. Start failure defaults to leaving the service offline + + +### API Extensions + +API Extensions are CRDs, admission and conversion webhooks, aggregated API servers, +and finalizers, i.e. those mechanisms that change the OCP API surface and behaviour. + +- Name the API extensions this enhancement adds or modifies. +- Does this enhancement modify the behaviour of existing resources, especially those owned + by other parties than the authoring team (including upstream resources), and, if yes, how? + Please add those other parties as reviewers to the enhancement. + + Examples: + - Adds a finalizer to namespaces. Namespace cannot be deleted without our controller running. + - Restricts the label format for objects to X. + - Defaults field Y on object kind Z. + +Fill in the operational impact of these API Extensions in the "Operational Aspects +of API Extensions" section. + +### Topology Considerations + +2NO represents a new topology, and is not appropriate for use with HyperShift, SNO, or MicroShift + +#### Hypershift / Hosted Control Planes + +Are there any unique considerations for making this change work with +Hypershift? + +See https://github.com/openshift/enhancements/blob/e044f84e9b2bafa600e6c24e35d226463c2308a5/enhancements/multi-arch/heterogeneous-architecture-clusters.md?plain=1#L282 + +How does it affect any of the components running in the +management cluster? How does it affect any components running split +between the management cluster and guest cluster? + +#### Standalone Clusters + +Is the change relevant for standalone clusters? + +#### Single-node Deployments or MicroShift + +How does this proposal affect the resource consumption of a +single-node OpenShift deployment (SNO), CPU and memory? + +How does this proposal affect MicroShift? For example, if the proposal +adds configuration options through API resources, should any of those +behaviors also be exposed to MicroShift admins through the +configuration file for MicroShift? + +### Implementation Details/Notes/Constraints + +#### Installation flow +1. We’ll set up Pacemaker and Corosync on RHCOS using MCO layering. + * [TBD extend more] +2. Install an “SNO like” first node using a second bootstrapped node. 
+ * This is somewhat similar what is done in SNO CI in AWS (up until the part the bootstrapped node is removed) and is possible because CEO can distinguish the bootstrapped node as a special use case, thus enabling its removal without breaking the etcd quorum for the remaining node. + We should be safe after [MGMT-13586](https://issues.redhat.com/browse/MGMT-13586) which makes the installer wait for the bootstrap etcd member to be removed first before shutting it down. +3. After the bootstrapped node is removed add it to the cluster as a “regular” node. +4. Switch CEO to “2NO” mode (where it does not manage etcd) and remove the etcd static pods + * [TBD localized/global switch] + * This is done because we want to allow simpler maintenance and keeping some of CEO functionality (defragmentation, cert rotation ect…) +5. Configure Pacemaker/Corosync to manage etcd/kubelet/cri-o +6. [TBD storage] + +#### Fencing / quorum management +Fencing quorum is managed by corosync and etcd will be managed by Pacemaker which will force the etcd quorum. + +Here is a node failure example demonstrating that: +1. Corosync on the survivor (Node1) +2. Etcd loses internal quorum (E-quorum) and goes read-only +3. Node1 retains “corosync quorum” (C-quorum) and initiates fencing of Node2 +4. Once fencing is successful Pacemaker will use a fence/resource agent (TBD) which will reschedule the workload from the fenced node +5. Pacemaker on Node1 forces E-quorum (etcd promotion event) +6. Cluster continues with no redundancy +7. … time passes … +8. Node2 has a persistent failure that prevents communication with Node1 + * Node2 does not have C-quorum (requires forming a membership with it’s peer) + * Node2 does not start etcd or kubelet, remains inert waiting for Node1 +9. Persistent failure on Node2 is repaired +10. Corosync membership containing both nodes forms +11. Pacemaker “starts” etcd on Node2 as a follower of Node1 +12. Pacemaker “promotes” etcd on Node2 as full replica of Node1 +13. Pacemaker starts kubelet +14. Cluster continues with 1+1 redundancy + +[Here](#failure-handling) is a more extensive list of failure scenarios. + +#### CEO Enhancement +1. Requires a new infrastructure type in OpenShift APIs +2. Make sure that even though CEO will not manage etcd, it will still retain other relevant capabilities (defragmentation, certificate rotation, backup/restore etc...). +3. Some functionality to “know” when to switch to “2NO mode” + +### Risks and Mitigations + +#### Risks: + +1. In the event of a node failure, Pacemaker on the survivor will fence the other node and cause etcd to recover quorum. However this will not automatically recover affected workloads. [mitigation](#scheduling-workload-on-fenced-nodes) +2. We plan to configure Pacemaker to manage etcd and give it quorum (for example after the remaining node fence its peer in a failed node use case) [mitigation](#pacemaker-controlling-key-elements) + 1. How do we plan Pacemaker to give the etcd quorum and which consideration should be taken ? + 2. How does Pacemaker giving etcd quorum affect other etcd stakeholders (etcd pod, etcd operator, etc…) ? +3. Having etcd/kubelet/cri-o managed by Pacemaker is a major change, it should be particularly considered in the installation process. Having a different process that manages those key services may cause timing issues, race conditions and potentially break some assumptions relevant to cluster installations. 
How does bootstrapping Pacemaker to manage etcd/kubelet/cri-o affect the different installer flows (e.g. Assisted Installer, Agent-Based Installer, etc.)? [mitigation](#unique-bootstrapping-affecting-installation-process)
    1. **CEO (Cluster Etcd Operator)/Pacemaker Conflict:**
    Since we plan to use Pacemaker to manage etcd, we need to make sure we prevent the current management done by the CEO.
    2. **Bootstrap Problem:** when only 2 nodes are used for the installation process, one of them serves as a bootstrap node, so once this node is no longer part of the cluster etcd will lose quorum.
    3. **Setting 2NO resources:** how do we plan to get the 2NO-specific resources (Pacemaker, Corosync, etc.) onto the node?
4. Some lifecycle events may reboot the node as part of the normal process (applying a disk image, updating SSH authorized keys, configuration changes, etc.). In a 2NO setup each node expects its peer to be up and will try to power-fence it when it is not; because of that, reboot events may trigger unnecessary fencing with unexpected consequences. [mitigation](#non-failure-node-reboots)


#### Mitigations:

##### Scheduling workload on fenced nodes
  1. **[Preferred Mitigation]** Pacemaker will utilize **resource/fence agents** to do the following (a sketch of the corresponding commands follows below):
     1. Before Pacemaker starts fencing the faulty node it would place a “No Execute” taint on that node. This taint will prevent any new workload from running on the fenced node.
     2. After fencing is successful, Pacemaker will place an “Out Of Service” taint on the faulty node, which will trigger the removal of that workload and its rescheduling on the healthy peer node.
     3. Once the unhealthy node regains health and joins the cluster, Pacemaker will remove both of these taints.
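To make the preferred mitigation concrete, here is a rough sketch of the taint handling a fence/resource agent could perform. `oc adm taint` and the `node.kubernetes.io/out-of-service` taint are existing Kubernetes/OpenShift mechanisms; the pre-fencing taint key used below is purely illustrative and not an agreed name.

```bash
# Sketch of the taint handling a Pacemaker fence/resource agent could perform.
# The out-of-service taint is the standard Kubernetes non-graceful-shutdown taint;
# the pre-fencing key "two-node-openshift.io/fencing" is illustrative only.

NODE="node2"   # example name of the peer that is about to be fenced

# 1. Before fencing: keep new workloads off the failing node.
oc adm taint nodes "${NODE}" two-node-openshift.io/fencing=pending:NoExecute --overwrite

# 2. After fencing succeeds: mark the node out of service so stateful workloads
#    are force-deleted and rescheduled on the surviving peer.
oc adm taint nodes "${NODE}" node.kubernetes.io/out-of-service=nodeshutdown:NoExecute --overwrite

# 3. Once the node is healthy and rejoins the cluster: remove both taints.
oc adm taint nodes "${NODE}" two-node-openshift.io/fencing-
oc adm taint nodes "${NODE}" node.kubernetes.io/out-of-service-
```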
**Other alternatives**
    2. After Pacemaker has successfully fenced the faulty node it can mark the fenced node, allowing a different operator to manage the rescheduling of the workload.
    3. Integrate NHC and a remediation agent? (If so, NHC needs to be coordinated with Pacemaker to make sure we don't needlessly fence the node multiple times.)

##### Pacemaker controlling key elements
 1. Consult with relevant area experts (etcd, cri-o, kubelet, etc.)
 2. Verify the solution with extensive testing

##### Unique bootstrapping affecting installation process

 1. **CEO (Cluster Etcd Operator)/Pacemaker Conflict:**

 1. **[Preferred Mitigation]** Add a “disable” or a “2NO” feature to CEO (an illustrative sketch follows below)
 1. Requires a new infrastructure type in OpenShift APIs
 2. 2NO installation needs to work with CEO up to the point where corosync wants to take over
 3. We need a signal to CEO when it should relinquish its control to corosync - a new field in the cluster/etcd CRD?
 4. How can we replicate the functionality of CEO that is tied to static pods? e.g. certificate rotation, backup/restore, the apiserver<>etcd endpoint controller
 5. Do we want this as a localized switch (e.g. a flag in the etcd CRD) or as a global option that might serve other 2NO stakeholders as well?

 **Note**: CEO alternatively could also remove ONLY the etcd container from its static pod definition
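Purely to illustrate the "localized switch" idea, the signal to CEO could take a shape similar to the sketch below. The `managedBy` field name is hypothetical and does not exist in the current etcd CRD; it only shows the intent of the preferred mitigation.

```bash
# Illustrative only: one possible shape for a localized switch on the etcd CR.
# "managedBy" is a hypothetical field, not part of today's operator.openshift.io API.
oc patch etcd/cluster --type=merge -p '{"spec":{"managedBy":"External"}}'

# CEO would then skip etcd membership management while retaining defragmentation,
# certificate rotation and backup/restore behaviour.
```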
**Other alternatives**
    2. Scale down the CEO Deployment replicas after bootstrapping. The downside is that we need to figure out how to get etcd upgrades, as they will be blocked.
    3. Add a “disable CEO” feature to CVO (the downside is that the other CEO functionalities will then need to be managed)

  2. **Bootstrap Problem:** Potential approaches to solve this:

    1. **[Preferred Mitigation]** Install an “SNO like” first node using a second bootstrapped node.
This is somewhat similar to what is done in SNO CI in AWS (up until the point where the bootstrapped node is removed) and is possible because CEO can treat the bootstrapped node as a special case, enabling its removal without breaking etcd quorum for the remaining node.
    We should be safe after [MGMT-13586](https://issues.redhat.com/browse/MGMT-13586), which makes the installer wait for the bootstrap etcd member to be removed before shutting the bootstrapped node down.
**Other alternatives**
    2. As part of the installation process, configure corosync/pacemaker to manage etcd so that we can make sure the presence of the bootstrap node does not cause etcd to lose quorum (or at least that etcd can still regain it with only one node).
    3. It is also worth mentioning that we discussed a simpler option of using 3 nodes and taking one down; however, this option was rejected because we cannot assume that a customer who wants a 2NO would have a third node available.

  3. **Setting 2NO resources:** Potential approaches to solve this:

    1. **[Preferred Mitigation]** Use MCO (Machine Config Operator) to layer RHCOS with the 2NO resources (a sketch follows below); however, this happens outside the scope of the installer, so we need to verify that there aren't any issues with that.
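A minimal sketch of what MCO-based delivery could look like, assuming the RHEL-HA packages are exposed as an RHCOS extension. The MachineConfig schema shown is the existing one; the extension name `two-node-ha` is an assumption and does not exist today.

```bash
# Sketch: deliver the RHEL-HA packages to both control-plane nodes with a MachineConfig.
# The "extensions" mechanism is how RHCOS ships optional RPMs today; the extension
# name used here (two-node-ha) is hypothetical.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-two-node-ha
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  extensions:
    - two-node-ha        # hypothetical extension bundling pacemaker, corosync and fence-agents
  config:
    ignition:
      version: 3.4.0
EOF
```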
**Other alternatives** + 2. It is also worth noting that we’ve considered modifying the RHCOS to contain 2NO resources (currently there is [another initiative](https://issues.redhat.com/browse/OCPSTRAT-1628) to do so) - At the moment this option is less preferable because it would couple 2NO to RHCOS frequent release cycle as well as add the 2NO resources in other OCP components which do not require it. + 3. RHEL extensions +##### Non failure node reboots + 1. Apply MCO Admin Defined Node Disruption [feature](https://github.com/openshift/enhancements/pull/1525) which allows os updates without node reboot. + 2. Potentially it’s a graceful reboot in which case Pacemaker will get a notification and can handle the reboot. + 3. Some delay mechanism ? + 4. Handle those specific use cases for a different behavior for a 2NO cluster ? + 5. Other alternatives ? + +General mitigation which apply to most of the risks are +* Early feedback from relevant experts +* Thorough testing of failure scenarios. +* Clear documentation and support procedures. + + +#### Appendix - Disabling CEO: +Features that the CEO currently takes care of: +* Static pod management during bootstrap, installation and runtime +* etcd Member addition/removal on lifecycle events of the node/machine (“vertical scaling”) +* Defragmentation +* Certificate creation and rotation +* Active etcd endpoint export for apiserver (etcd-endpoints configmap in openshift-config namespace) +* Installation of the Backup/Restore scripts + +Source as of 4.15: [CEO <> CEE](https://docs.google.com/presentation/d/1U_IyNGHCAZFAZXyzAs5XybR8qT91QaQ2wr3W9w9pSaw/edit#slide=id.g184d8fd7fc3_1_99) + + + + + +### Drawbacks + +The idea is to find the best form of an argument why this enhancement should +_not_ be implemented. + +What trade-offs (technical/efficiency cost, user experience, flexibility, +supportability, etc) must be made in order to implement this? What are the reasons +we might not want to undertake this proposal, and how do we overcome them? + +Does this proposal implement a behavior that's new/unique/novel? Is it poorly +aligned with existing user expectations? Will it be a significant maintenance +burden? Is it likely to be superceded by something else in the near future? + +## Open Questions [optional] + +This is where to call out areas of the design that require closure before deciding +to implement the design. For instance, + > 1. This requires exposing previously private resources which contain sensitive + information. Can we do this? + +## Test Plan + +**Note:** *Section not required until targeted at a release.* + +Consider the following in developing a test plan for this enhancement: +- Will there be e2e and integration tests, in addition to unit tests? +- How will it be tested in isolation vs with other components? +- What additional testing is necessary to support managed OpenShift service-based offerings? + +No need to outline all of the test cases, just the general strategy. Anything +that would count as tricky in the implementation and anything particularly +challenging to test should be called out. + +All code is expected to have adequate tests (eventually with coverage +expectations). + +## Graduation Criteria + +**Note:** *Section not required until targeted at a release.* + +Define graduation milestones. + +These may be defined in terms of API maturity, or as something else. Initial proposal +should keep this high-level with a focus on what signals will be looked at to +determine graduation. 
+ +Consider the following in developing the graduation criteria for this +enhancement: + +- Maturity levels + - [`alpha`, `beta`, `stable` in upstream Kubernetes][maturity-levels] + - `Dev Preview`, `Tech Preview`, `GA` in OpenShift +- [Deprecation policy][deprecation-policy] + +Clearly define what graduation means by either linking to the [API doc definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning), +or by redefining what graduation means. + +In general, we try to use the same stages (alpha, beta, GA), regardless how the functionality is accessed. + +[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions +[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/ + +**If this is a user facing change requiring new or updated documentation in [openshift-docs](https://github.com/openshift/openshift-docs/), +please be sure to include in the graduation criteria.** + +**Examples**: These are generalized examples to consider, in addition +to the aforementioned [maturity levels][maturity-levels]. + +### Dev Preview -> Tech Preview + +- Ability to utilize the enhancement end to end +- End user documentation, relative API stability +- Sufficient test coverage +- Gather feedback from users rather than just developers +- Enumerate service level indicators (SLIs), expose SLIs as metrics +- Write symptoms-based alerts for the component(s) + +### Tech Preview -> GA + +- More testing (upgrade, downgrade, scale) +- Sufficient time for feedback +- Available by default +- Backhaul SLI telemetry +- Document SLOs for the component +- Conduct load testing +- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/) + +**For non-optional features moving to GA, the graduation criteria must include +end to end tests.** + +### Removing a deprecated feature + +- Announce deprecation and support policy of the existing feature +- Deprecate the feature + +## Upgrade / Downgrade Strategy + +In-place upgrades and downgrades will not be supported for this first iteration, and will be addressed as a separate feature in another enhancement. Upgrades will initially only be achieved by redeploying the machine and its workload. + +## Version Skew Strategy + +How will the component handle version skew with other components? +What are the guarantees? Make sure this is in the test plan. + +Consider the following in developing a version skew strategy for this +enhancement: +- During an upgrade, we will always have skew among components, how will this impact your work? +- Does this enhancement involve coordinating behavior in the control plane and + in the kubelet? How does an n-2 kubelet without this feature available behave + when this feature is used? +- Will any other components on the node change? For example, changes to CSI, CRI + or CNI may require updating that component before the kubelet. + +## Operational Aspects of API Extensions + +Describe the impact of API extensions (mentioned in the proposal section, i.e. CRDs, +admission and conversion webhooks, aggregated API servers, finalizers) here in detail, +especially how they impact the OCP system architecture and operational aspects. 
+ +- For conversion/admission webhooks and aggregated apiservers: what are the SLIs (Service Level + Indicators) an administrator or support can use to determine the health of the API extensions + + Examples (metrics, alerts, operator conditions) + - authentication-operator condition `APIServerDegraded=False` + - authentication-operator condition `APIServerAvailable=True` + - openshift-authentication/oauth-apiserver deployment and pods health + +- What impact do these API extensions have on existing SLIs (e.g. scalability, API throughput, + API availability) + + Examples: + - Adds 1s to every pod update in the system, slowing down pod scheduling by 5s on average. + - Fails creation of ConfigMap in the system when the webhook is not available. + - Adds a dependency on the SDN service network for all resources, risking API availability in case + of SDN issues. + - Expected use-cases require less than 1000 instances of the CRD, not impacting + general API throughput. + +- How is the impact on existing SLIs to be measured and when (e.g. every release by QE, or + automatically in CI) and by whom (e.g. perf team; name the responsible person and let them review + this enhancement) + +- Describe the possible failure modes of the API extensions. +- Describe how a failure or behaviour of the extension will impact the overall cluster health + (e.g. which kube-controller-manager functionality will stop working), especially regarding + stability, availability, performance and security. +- Describe which OCP teams are likely to be called upon in case of escalation with one of the failure modes + and add them as reviewers to this enhancement. + +## Support Procedures + +Describe how to +- detect the failure modes in a support situation, describe possible symptoms (events, metrics, + alerts, which log output in which component) + + Examples: + - If the webhook is not running, kube-apiserver logs will show errors like "failed to call admission webhook xyz". + - Operator X will degrade with message "Failed to launch webhook server" and reason "WehhookServerFailed". + - The metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")` + will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire. + +- disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`) + + - What consequences does it have on the cluster health? + + Examples: + - Garbage collection in kube-controller-manager will stop working. + - Quota will be wrongly computed. + - Disabling/removing the CRD is not possible without removing the CR instances. Customer will lose data. + Disabling the conversion webhook will break garbage collection. + + - What consequences does it have on existing, running workloads? + + Examples: + - New namespaces won't get the finalizer "xyz" and hence might leak resource X + when deleted. + - SDN pod-to-pod routing will stop updating, potentially breaking pod-to-pod + communication after some minutes. + + - What consequences does it have for newly created workloads? + + Examples: + - New pods in namespace with Istio support will not get sidecars injected, breaking + their networking. + +- Does functionality fail gracefully and will work resume when re-enabled without risking + consistency? + + Examples: + - The mutating admission webhook "xyz" has FailPolicy=Ignore and hence + will not block the creation or updates on objects when it fails. 
When the + webhook comes back online, there is a controller reconciling all objects, applying + labels that were not applied during admission webhook downtime. + - Namespaces deletion will not delete all objects in etcd, leading to zombie + objects when another namespace with the same name is created. + +## Alternatives + +* MicroShift was considered as an alternative but it was ruled out because it does not support multi node has a very different experience then OpenShift which does not match the 2NO initiative which is on getting the OpenShift experience on two nodes + + +* 2 SNO + KCP +[KCP](https://github.com/kcp-dev/kcp/) allows you to manage multiple clusters from a single control plane, reducing the complexity of managing each cluster independently. +With kcp, you can manage the two single-node clusters, each single-node OpenShift cluster can continue to operate independently even if the central kcp management plane becomes unavailable. +The main advantage of this approach is that it doesn’t require inventing a new Openshift flavor and we don’t need to create a new installation flow to accommodate it. +Disadvantages: +* Production readiness +* KCP itself could become a single point of failure (need to configure pacemaker to manage KCP) +* KCP adds an additional layer of complexity to the architecture + + +## Infrastructure Needed [optional] + +Use this section if you need things from the project. Examples include a new +subproject, repos requested, github details, and/or testing infrastructure. From fd810fde04073742256ec79bfa3b05c7c8d18c2f Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Thu, 3 Oct 2024 21:48:19 +1000 Subject: [PATCH 02/49] Adjust the emphasis of the enhancement. Focus on the elements other teams are most likely to be interested in and adjust the level of detail --- enhancements/two-nodes-openshift/2no.md | 83 +++++++++++++------------ 1 file changed, 43 insertions(+), 40 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 53ba92413f..d765166beb 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -70,68 +70,71 @@ MCO - Machine Config Operator. This operator manages updates to systemd, cri-o/k ABI - Agent-Based Installer. -ZTP - Zero-Touch Provisioning. - - ## Summary -The Two Nodes OpenShift (2NO) initiative aims to provide a container management solution with a minimal footprint suitable for customers with numerous geographically dispersed locations. -Traditional three-node setups represent significant infrastructure costs, making them cost-prohibitive at retail and telco scale. This proposal outlines how we can implement a two-node OpenShift cluster while retaining the ability to survive a node failure. +Leverage traditional high-availability concepts and technologies to provide a container management solution suitable for customers with numerous geographically dispersed locations that has a minimal footprint but remains resilient to single node-level failures. ## Motivation -Customers with tens-of-thousands of geographically dispersed locations seek a container management solution that retains some level of resilience but does not come with a traditional three-node footprint. Even "cheap" third nodes represent a significant cost at this scale. -The benefits of the cloud-native approach to developing and deploying applications are increasingly being adopted in edge computing. 
As the distance between a site and the central management hub grows, the number of servers at the site tends to shrink. The most distant sites often have physical space for only one or two servers. -We are seeing an emerging pattern where some infrastructure providers and application owners desire a consistent deployment approach for their workloads across these disparate environments. They also require that the edge sites operate independently from the central management hub. Users who have adopted Kubernetes at their central management sites wish to extend this independence to remote sites through the deployment of independent Kubernetes clusters. -For example, in the telecommunications industry, particularly within 5G Radio Access Networks (RAN), there is a growing trend toward cloud-native implementations of the 5G Distributed Unit (DU) component. This component, due to latency constraints, must be deployed close to the radio antenna, sometimes on a single server at remote locations like the base of a cell tower or in a datacenter-like environment serving multiple base stations. -A hypothetical DU might require 20 dedicated cores, 24 GiB of RAM consumed as huge pages, multiple SR-IOV NICs carrying several Gbps of traffic each, and specialized accelerator devices. The node hosting this workload must run a real-time kernel, be carefully tuned to meet low-latency requirements, and support features like Precision Timing Protocol (PTP). Crucially, the "cloud" hosting this workload must be autonomous, capable of continuing to operate with its existing configuration and running workloads even when centralized management functionality is unavailable. -Given these factors, a two-node deployment of OpenShift offers a consistent, reliable solution that meets the needs of customers across all their sites, from central management hubs to the most remote edge locations. +Customers with hundreds, or even tens-of-thousands, of geographically dispersed locations are asking for a container management solution that retains some level of resilience to node level failures, but does not come with a traditional three-node footprint and/or price tag. + +The need for some level of fault tolerance prevents the applicability of Single Node OpenShift (SNO), and a converged 3-node cluster is cost prohibitive at the scale of retail and telcos - even when the third node is a "cheap" one that doesn't run workloads. +The benefits of the cloud-native approach to developing and deploying applications are increasingly being adopted in edge computing. +This requires our solution provide a management experience consistent with "normal" OpenShift deployments, and be compatible with the full ecosystem of Red Hat and partner workloads designed for OpenShift. ### User Stories + * As a large enterprise with multiple remote sites, I want a cost-effective OpenShift cluster solution so that I can manage containers without the overhead of a third node. -* As a support engineer, I want an automated method for handling the failure of a single node so that I can quickly restore service and maintain system integrity. -* As an infrastructure administrator, I want to ensure seamless failover for virtual machines (VMs) so that in the event of a node failure, the VMs are automatically migrated to a healthy node with minimal downtime and no data loss. 
-* As a network operator, I want my Cloud-Native Network Functions (CNFs) to be orchestrated consistently using OpenShift, regardless of whether they are in datacenters, or at the far edge where physical space is limited. +* As a support engineer, I want a safe and automated method for handling the failure of a single node so that the downtime of workloads is minimized. ### Goals -* Implement a highly available two-node OpenShift cluster. -* Ensure cluster stability and operational efficiency. -* Provide clear methods for node failure management and recovery. -* Identify and integrate with a technology or partner that can provide storage in a two-node environment. +* Provide a transparent installation experience that starts with exactly 2 blank physical nodes, and ends with a fault-tolerant two node cluster +* Provide an OpenShift cluster experience that is identical to that of a 3-node hyperconverged cluster, but with 2 nodes +* Prevent both data corruption and divergent datasets in etcd +* Prevent the possibility of fencing loops, wherein each node powers cycles it's peer after booting +* Recover the API server in less than 60s, as measured from the surviving node's detection of a failure +* Minimize any differences to the primary OpenShift platforms +* Avoid any decisions that would prevent upgrade/downgrade paths between two-node and traditional architectures ### Non-Goals -* Reliance on traditional third-node or SNO setups. -* Make sure we don't prevent upgrade/downgrade paths between two-node and traditional architectures -* Adding worker nodes +* Workload resilience - see related enhancement [link] +* Resilient storage - see follow-up enhancement * Support for platforms other than bare metal including automated ci testing * Support for other topologies (eg. hypershift) -* Failover time: if the leading node goes down, the remaining nodes takes over and gains operational state (writable) in less than 60s -* support full recovery of the workload when the node comes back online after restoration - total time under 15 minutes - +* Adding worker nodes ## Proposal +Use the RHEL-HA stack (Corosync, and Pacemaker), which has been used to delivered supported 2-node cluster experiences for multiple decades, to manage cri-o, kubelet, and the etcd daemon. +We will take advantage of RHEL-HA's native support for systemd and re-use the standard cri-o and kublet units, as well as create a new Open Cluster Framework (OCF) script for etcd. + +Use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fencing) unreachable peers to ensuring that they can do no harm. + +The delivery of RHEL-HA components will either be: + +* as an MCO Layer (targeting GA in 4.19), +* as an extension (supported today), or +* included, but inactive, in the base image + +Configuration of the RHEL-HA components will be via one or more MachineConfigs, and will require RedFish details from the installer. + + +Upon a peer failure, the RHEL-HA components on the surivor will fence the peer and restart etcd as a new cluster of one. + +Upon a network failure, the RHEL-HA components ensure that exactly one node will survive, fence it's peer, and restart etcd as a new cluster of one. + +Upon rebooting, the RHEL-HA components ensure that a node remains inert (not running cri-o, kubelet, or etcd) until it sees it's peer. +If the peer is likely to remain offline for an extended period of time, admin confirmation is required to allow the node to start OpenShift. 
+ +When starting etcd, the OCF script will use the cluster ID and version counter to determine whether the existing data directory can be reused, or must be erased before joining an active peer. + +OpenShift upgrades are not supported in a degraded state, and will only proceed when both peers are online. -To achieve a two-node OpenShift cluster, we are leveraging traditional high-availability concepts and technologies. The proposed solution includes: - -1. Leverage of the Full RHEL-HA Stack: - * Run the RHEL-HA stack “under to kubelet” (directly on the hardware, not as an OpenShift workload) - * Corosync for super fast failure detection, membership calculations, which in turn will trigger Pacemaker to apply Fencing based on Corosync quorum/membership information. - * Pacemaker for integrating membership and quorum information, driving fencing, and managing if/when kubelet and etcd can be started - * Pacemaker models kubelet and cri-o as a clone (much like a ReplicaSet) and etcd as a “promotable clone” (think a construct designed for leader/follower style services). Together with fencing and quorum, this ensures that an isolated node that reboots is inert and can do no harm. - * Pacemaker is [configured](//TODO mshitrit add link) to manage etcd/cri-o/kubelet, it will start/stop/restart those services using a script or an executable. - * Pacemaker does not understand what it is managing, and expects an executable or script that knows how to start/stop/monitor (and optionally promote/demote) the service. - 1. Likely we would need to create one for etcd - 2. For kubelet and cri-o we can likely use the existing systemd unit file -2. Failure Scenarios: - * Implement detailed handling procedures for cold boots, network failures, node failures, kubelet failures, and etcd failures using the RHEL-HA stack. - [see examples](#failure-handling) -3. Fencing Methods: - * We plan to use Baseboard Management Controller (BMC) as our primary fencing method, the premise of using BMC for fencing is that a node that is powered off, or was previously powered off and configured to be inert until quorum forms, is not in a position to cause corruption or diverging datasets. Sending power-off (or reboot) commands to the peer’s BMC achieves this goal. +MachineConfig updates are not applied in a degraded state, and will only proceed when both peers are online. ### Workflow Description From 188d9e43bbd5387467b55f391110d675cad382fd Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Mon, 7 Oct 2024 20:12:24 +1100 Subject: [PATCH 03/49] Cleanup implementation details and risks --- enhancements/two-nodes-openshift/2no.md | 233 +++++------------------- 1 file changed, 49 insertions(+), 184 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index d765166beb..65d816bb0c 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -39,7 +39,7 @@ tracking-link: - https://issues.redhat.com/browse/OCPSTRAT-1514 --- -# Two Nodes Openshift (2NO) +# Two Nodes Openshift (2NO) - Control Plane Availability ## Terms @@ -85,19 +85,19 @@ This requires our solution provide a management experience consistent with "norm ### User Stories - * As a large enterprise with multiple remote sites, I want a cost-effective OpenShift cluster solution so that I can manage containers without the overhead of a third node. 
* As a support engineer, I want a safe and automated method for handling the failure of a single node so that the downtime of workloads is minimized. ### Goals +* Provide a two-node control plane for physical hardware that is resilient to a node-level failure for either node * Provide a transparent installation experience that starts with exactly 2 blank physical nodes, and ends with a fault-tolerant two node cluster -* Provide an OpenShift cluster experience that is identical to that of a 3-node hyperconverged cluster, but with 2 nodes * Prevent both data corruption and divergent datasets in etcd -* Prevent the possibility of fencing loops, wherein each node powers cycles it's peer after booting -* Recover the API server in less than 60s, as measured from the surviving node's detection of a failure +* Maintain the existing level of availability. Eg. by avoiding fencing loops, wherein each node powers cycles it's peer after booting, reducing the cluster's availability. +* Recover the API server in less than 120s, as measured from the surviving node's detection of a failure * Minimize any differences to the primary OpenShift platforms * Avoid any decisions that would prevent upgrade/downgrade paths between two-node and traditional architectures +* Provide an OpenShift cluster experience that is identical to that of a 3-node hyperconverged cluster, but with 2 nodes ### Non-Goals @@ -112,7 +112,7 @@ This requires our solution provide a management experience consistent with "norm Use the RHEL-HA stack (Corosync, and Pacemaker), which has been used to delivered supported 2-node cluster experiences for multiple decades, to manage cri-o, kubelet, and the etcd daemon. We will take advantage of RHEL-HA's native support for systemd and re-use the standard cri-o and kublet units, as well as create a new Open Cluster Framework (OCF) script for etcd. -Use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fencing) unreachable peers to ensuring that they can do no harm. +Use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fence) unreachable peers and ensure that they can do no harm while the remaining node continues. The delivery of RHEL-HA components will either be: @@ -122,10 +122,9 @@ The delivery of RHEL-HA components will either be: Configuration of the RHEL-HA components will be via one or more MachineConfigs, and will require RedFish details from the installer. +Upon a peer failure, the RHEL-HA components on the surivor will fence the peer and use the OCF script to restart etcd as a new cluster of one. -Upon a peer failure, the RHEL-HA components on the surivor will fence the peer and restart etcd as a new cluster of one. - -Upon a network failure, the RHEL-HA components ensure that exactly one node will survive, fence it's peer, and restart etcd as a new cluster of one. +Upon a network failure, the RHEL-HA components ensure that exactly one node will survive, fence it's peer, and use the OCF script to restart etcd as a new cluster of one. Upon rebooting, the RHEL-HA components ensure that a node remains inert (not running cri-o, kubelet, or etcd) until it sees it's peer. If the peer is likely to remain offline for an extended period of time, admin confirmation is required to allow the node to start OpenShift. @@ -149,7 +148,32 @@ MachineConfig updates are not applied in a degraded state, and will only proceed * Receives cluster credentials. 
* Deploys applications within the two-node cluster environment. -#### Failure Handling: + +### API Extensions + +No new CRDs, or changes to existing CRDs, are expected at this time. + +### Topology Considerations + +2NO represents a new topology, and is not appropriate for use with HyperShift, SNO, or MicroShift + +#### Standalone Clusters + +Is the change relevant for standalone clusters? +TODO: Exactly what is the definition of a standalone cluster? Disconnected? Physical hardware? + + +### Implementation Details/Notes/Constraints + +While the target installation requires exactly 2 nodes, this will be achieved by building support in the core installer for a "bootstrap plus 2 nodes" flow, and then using Assisted Installer's ability to bootstrap-in-place to remove the requirement for a bootstrap node. + +Initially the creation of an etcd cluster will be driven in the same way as other platforms. +Once the cluster has two members, the etcd daemon will be removed from the static pod and become controlled by RHEL-HA. +At this point, the Cluster Etcd Operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled. +The exact mechanism for this communication has yet to be determined. + + +#### Failure Scenario Timelines: 1. Cold Boot 1. One node (Node1) boots @@ -221,185 +245,26 @@ MachineConfig updates are not applied in a degraded state, and will only proceed 4. Start failure defaults to leaving the service offline -### API Extensions - -API Extensions are CRDs, admission and conversion webhooks, aggregated API servers, -and finalizers, i.e. those mechanisms that change the OCP API surface and behaviour. - -- Name the API extensions this enhancement adds or modifies. -- Does this enhancement modify the behaviour of existing resources, especially those owned - by other parties than the authoring team (including upstream resources), and, if yes, how? - Please add those other parties as reviewers to the enhancement. - - Examples: - - Adds a finalizer to namespaces. Namespace cannot be deleted without our controller running. - - Restricts the label format for objects to X. - - Defaults field Y on object kind Z. - -Fill in the operational impact of these API Extensions in the "Operational Aspects -of API Extensions" section. - -### Topology Considerations - -2NO represents a new topology, and is not appropriate for use with HyperShift, SNO, or MicroShift - -#### Hypershift / Hosted Control Planes - -Are there any unique considerations for making this change work with -Hypershift? - -See https://github.com/openshift/enhancements/blob/e044f84e9b2bafa600e6c24e35d226463c2308a5/enhancements/multi-arch/heterogeneous-architecture-clusters.md?plain=1#L282 - -How does it affect any of the components running in the -management cluster? How does it affect any components running split -between the management cluster and guest cluster? - -#### Standalone Clusters - -Is the change relevant for standalone clusters? - -#### Single-node Deployments or MicroShift - -How does this proposal affect the resource consumption of a -single-node OpenShift deployment (SNO), CPU and memory? +### Risks and Mitigations -How does this proposal affect MicroShift? For example, if the proposal -adds configuration options through API resources, should any of those -behaviors also be exposed to MicroShift admins through the -configuration file for MicroShift? 
+Risk: If etcd were to be made active on both peers during a network split, divergent datasets would be created +Mitigation: RHEL-HA requires fencing of a presumed dead peer before restarting etcd as a cluster of one +Mitigation: Peers remain inert (unable to fence peers, or start cri-o, kubelet, or etcd) after rebooting until they can contact their peer -### Implementation Details/Notes/Constraints +Risk: Multiple entities (RHEL-HA, CEO) attempting to manage etcd membership would cause an internal split-brain +Mitigation: The CEO will run in a mode that does manage not etcd membership -#### Installation flow -1. We’ll set up Pacemaker and Corosync on RHCOS using MCO layering. - * [TBD extend more] -2. Install an “SNO like” first node using a second bootstrapped node. - * This is somewhat similar what is done in SNO CI in AWS (up until the part the bootstrapped node is removed) and is possible because CEO can distinguish the bootstrapped node as a special use case, thus enabling its removal without breaking the etcd quorum for the remaining node. - We should be safe after [MGMT-13586](https://issues.redhat.com/browse/MGMT-13586) which makes the installer wait for the bootstrap etcd member to be removed first before shutting it down. -3. After the bootstrapped node is removed add it to the cluster as a “regular” node. -4. Switch CEO to “2NO” mode (where it does not manage etcd) and remove the etcd static pods - * [TBD localized/global switch] - * This is done because we want to allow simpler maintenance and keeping some of CEO functionality (defragmentation, cert rotation ect…) -5. Configure Pacemaker/Corosync to manage etcd/kubelet/cri-o -6. [TBD storage] - -#### Fencing / quorum management -Fencing quorum is managed by corosync and etcd will be managed by Pacemaker which will force the etcd quorum. - -Here is a node failure example demonstrating that: -1. Corosync on the survivor (Node1) -2. Etcd loses internal quorum (E-quorum) and goes read-only -3. Node1 retains “corosync quorum” (C-quorum) and initiates fencing of Node2 -4. Once fencing is successful Pacemaker will use a fence/resource agent (TBD) which will reschedule the workload from the fenced node -5. Pacemaker on Node1 forces E-quorum (etcd promotion event) -6. Cluster continues with no redundancy -7. … time passes … -8. Node2 has a persistent failure that prevents communication with Node1 - * Node2 does not have C-quorum (requires forming a membership with it’s peer) - * Node2 does not start etcd or kubelet, remains inert waiting for Node1 -9. Persistent failure on Node2 is repaired -10. Corosync membership containing both nodes forms -11. Pacemaker “starts” etcd on Node2 as a follower of Node1 -12. Pacemaker “promotes” etcd on Node2 as full replica of Node1 -13. Pacemaker starts kubelet -14. Cluster continues with 1+1 redundancy - -[Here](#failure-handling) is a more extensive list of failure scenarios. - -#### CEO Enhancement -1. Requires a new infrastructure type in OpenShift APIs -2. Make sure that even though CEO will not manage etcd, it will still retain other relevant capabilities (defragmentation, certificate rotation, backup/restore etc...). -3. 
Some functionality to “know” when to switch to “2NO mode” - -### Risks and Mitigations +Risk: Rebooting the surviving peer would require human intervention before the cluster starts, increasing downtime and creating an admin burden at remote sites +Mitigation: Lifecycle events, such as upgrades and applying new MachineConfigs, are not permitted in a single-node degraded state +Mitigation: Usage of the MCO Admin Defined Node Disruption [feature](https://github.com/openshift/enhancements/pull/1525) will futher reduce the need for reboots. +Mitigation: The node will be reachable via SSH and the confirmation can be scripted +Mitigation: It may be possible to identify scenarios where, for a known hardware topology, it is safe to allow the node to proceed automatically. -#### Risks: - -1. In the event of a node failure, Pacemaker on the survivor will fence the other node and cause etcd to recover quorum. However this will not automatically recover affected workloads. [mitigation](#scheduling-workload-on-fenced-nodes) -2. We plan to configure Pacemaker to manage etcd and give it quorum (for example after the remaining node fence its peer in a failed node use case) [mitigation](#pacemaker-controlling-key-elements) - 1. How do we plan Pacemaker to give the etcd quorum and which consideration should be taken ? - 2. How does Pacemaker giving etcd quorum affect other etcd stakeholders (etcd pod, etcd operator, etc…) ? -3. Having etcd/kubelet/cri-o managed by Pacemaker is a major change, it should be particularly considered in the installation process. Having a different process that manages those key services may cause timing issues, race conditions and potentially break some assumptions relevant to cluster installations. How does bootstrapping Pacemaker to manage etcd/kubelet/cri-o affects different installers processes (i.e assistant installer , agent base installer, etc) ? [mitigation](#unique-bootstrapping-affecting-installation-process) - 1. **CEO (Cluster Etcd Operator)/Pacemaker Conflict:** - Since we plan to use Pacemaker to manage etcd, we need to make sure we prevent the current management done by the CEO. - 2. **Bootstrap Problem:** when only 2 nodes are used for the installation process one of them serves as a bootstrap node so once this node isn’t part of the cluster etcd will lose quorum. - 3. **Setting 2NO resources:** how do we plan to get specific 2NO resources (pacemaker, corosync, etc…) on the node ? -4. Some Lifecycle events may reboot the node as part of the normal process (applying a disk image, updating ssh auth keys, configuration changes etc…). In a 2NO setup each node expects its peer to be up and will try to power fence it in case it isn’t because of that, reboot events may trigger unnecessary fencing with unexpected consequences. [mitigation](#non-failure-node-reboots) - - -#### Mitigations: - -##### Scheduling workload on fenced nodes - 1. **[Preferred Mitigation]** Pacemaker will utilize **resource/fence agents** to do the following: - 1. Before Pacemaker starts fencing of the faulty node it would place a “No Execute” taint on that node. This taint will prevent any new workload from running on the fenced node. - 2. After fencing is successful, Pacemaker will place an “Out Of Service” taint on the faulty node, which will trigger the removal of that workload and rescheduling on the it’s healthy peer node. - 3. Once the unhealthy node regains health and joins the cluster Pacemaker will remove both of these taints. - -
**Other alternatives** - 2. After Pacemaker has successfully fenced the faulty node it can mark the fenced node thus allowing a different operator to manage the rescheduling of the workload. - 3. Integrate NHC & a remediation agent ? (if so, NHC needs to be coordinated with Pacemaker in order to make sure we don’t needlessly fence the node multiple times ) - -##### Pacemaker controlling key elements - 1. Consult with relevant area experts (etcd, cri-o , kubelet etc…) - 2. Verify solution with extensive testing - -##### Unique bootstrapping affecting installation process - - 1. **CEO (Cluster Etcd Operator)/Pacemaker Conflict:** - - 1. **[Preferred Mitigation]** Add a “disable” or a “2NO” feature to CEO - 1. Requires a new infrastructure type in OpenShift APIs - 2. 2NO installation needs to work with CEO up to the point where corosync wants to take over - 3. We need a signal to CEO when it should relinquish its control to corosync - new field in the cluster/etcd CRD? - 4. How can we replicate the functionality of CEO that is tied to static pods? e.g. Certificate rotation, backup/restore, apiserver<>etcd endpoint controller - 5. Do we want this as a localized switch (i.e for example as a flag in etcd CRD) or as a global option that might serve other 2NO stakeholders as well ? - - **Note**: CEO alternatively could also remove ONLY the etcd container from its static pod definition - -
**Other alternatives** - 2. Scale down CEO Deployment replica after bootstrapping. The downside is that we need to figure out how to get etcd upgrades, as they will be blocked. - 3. Add a “disable CEO” feature to CVO (Downside is that other CEO functionalities will be needed to be managed) - - 2. **Bootstrap Problem:** Potential approaches to solve this: - - 1. **[Preferred Mitigation]** Install an “SNO like” first node using a second bootstrapped node.
This is somewhat similar what is done in SNO CI in AWS (up until the part the bootstrapped node is removed) and is possible because CEO can distinguish the bootstrapped node as a special use case, thus enabling its removal without breaking the etcd quorum for the remaining node. - We should be safe after [MGMT-13586](https://issues.redhat.com/browse/MGMT-13586) which makes the installer wait for the bootstrap etcd member to be removed first before shutting it down. - -
**Other alternatives** - 2. As part of the installation process configure corosync/pacemaker to manage etcd so that we can make sure having the bootstrap node does not cause etcd to lose quorum (or least that etcd can still regain it with only one node) - 3. It’s also worth mentioning that we’ve discussed a more simple option of using 3 nodes and taking one down, however this option is rejected because we can’t assume that a customer that wants a 2NO would have a third available node. - - 3. **Setting 2NO resources:** Potential approaches to solve this: - - 1. **[Preferred Mitigation]** Using MCO (Machine Config Operator) to layer RHCOS with 2NO resources, however this is done out of the scope of the installer so we need to verify that there aren’t any issues with that. - -
**Other alternatives** - 2. It is also worth noting that we’ve considered modifying the RHCOS to contain 2NO resources (currently there is [another initiative](https://issues.redhat.com/browse/OCPSTRAT-1628) to do so) - At the moment this option is less preferable because it would couple 2NO to RHCOS frequent release cycle as well as add the 2NO resources in other OCP components which do not require it. - 3. RHEL extensions -##### Non failure node reboots - 1. Apply MCO Admin Defined Node Disruption [feature](https://github.com/openshift/enhancements/pull/1525) which allows os updates without node reboot. - 2. Potentially it’s a graceful reboot in which case Pacemaker will get a notification and can handle the reboot. - 3. Some delay mechanism ? - 4. Handle those specific use cases for a different behavior for a 2NO cluster ? - 5. Other alternatives ? - -General mitigation which apply to most of the risks are -* Early feedback from relevant experts -* Thorough testing of failure scenarios. -* Clear documentation and support procedures. - - -#### Appendix - Disabling CEO: -Features that the CEO currently takes care of: -* Static pod management during bootstrap, installation and runtime -* etcd Member addition/removal on lifecycle events of the node/machine (“vertical scaling”) -* Defragmentation -* Certificate creation and rotation -* Active etcd endpoint export for apiserver (etcd-endpoints configmap in openshift-config namespace) -* Installation of the Backup/Restore scripts - -Source as of 4.15: [CEO <> CEE](https://docs.google.com/presentation/d/1U_IyNGHCAZFAZXyzAs5XybR8qT91QaQ2wr3W9w9pSaw/edit#slide=id.g184d8fd7fc3_1_99) +Risk: We may not succeed in identifying all the reasons a node will reboot +Mitigation: ... testing? ... +Risk: This new platform will have a unique installation flow +Mitigation: ... CI ... From 0503553052669616f0a8acff3eae12ec6c270b6a Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Tue, 8 Oct 2024 12:29:07 +1100 Subject: [PATCH 04/49] Further refinement of workflow, proposal, and implementation details. Progress on drawbacks and questions --- enhancements/two-nodes-openshift/2no.md | 86 ++++++++++++++----------- 1 file changed, 49 insertions(+), 37 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 65d816bb0c..f8f395f22a 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -96,7 +96,7 @@ This requires our solution provide a management experience consistent with "norm * Maintain the existing level of availability. Eg. by avoiding fencing loops, wherein each node powers cycles it's peer after booting, reducing the cluster's availability. * Recover the API server in less than 120s, as measured from the surviving node's detection of a failure * Minimize any differences to the primary OpenShift platforms -* Avoid any decisions that would prevent upgrade/downgrade paths between two-node and traditional architectures +* Avoid any decisions that would prevent future implementation and support for upgrade/downgrade paths between two-node and traditional architectures * Provide an OpenShift cluster experience that is identical to that of a 3-node hyperconverged cluster, but with 2 nodes ### Non-Goals @@ -106,22 +106,16 @@ This requires our solution provide a management experience consistent with "norm * Support for platforms other than bare metal including automated ci testing * Support for other topologies (eg. 
hypershift) * Adding worker nodes +* Creation RHEL-HA events and metrics for consumption by the OpenShift monitoring stack (Deferred to post-MVP) ## Proposal Use the RHEL-HA stack (Corosync, and Pacemaker), which has been used to delivered supported 2-node cluster experiences for multiple decades, to manage cri-o, kubelet, and the etcd daemon. +etcd will run as as a voting member on both nodes. We will take advantage of RHEL-HA's native support for systemd and re-use the standard cri-o and kublet units, as well as create a new Open Cluster Framework (OCF) script for etcd. Use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fence) unreachable peers and ensure that they can do no harm while the remaining node continues. -The delivery of RHEL-HA components will either be: - -* as an MCO Layer (targeting GA in 4.19), -* as an extension (supported today), or -* included, but inactive, in the base image - -Configuration of the RHEL-HA components will be via one or more MachineConfigs, and will require RedFish details from the installer. - Upon a peer failure, the RHEL-HA components on the surivor will fence the peer and use the OCF script to restart etcd as a new cluster of one. Upon a network failure, the RHEL-HA components ensure that exactly one node will survive, fence it's peer, and use the OCF script to restart etcd as a new cluster of one. @@ -129,29 +123,38 @@ Upon a network failure, the RHEL-HA components ensure that exactly one node will Upon rebooting, the RHEL-HA components ensure that a node remains inert (not running cri-o, kubelet, or etcd) until it sees it's peer. If the peer is likely to remain offline for an extended period of time, admin confirmation is required to allow the node to start OpenShift. -When starting etcd, the OCF script will use the cluster ID and version counter to determine whether the existing data directory can be reused, or must be erased before joining an active peer. +When starting etcd, the OCF script will use etcd's cluster ID and version counter to determine whether the existing data directory can be reused, or must be erased before joining an active peer. + +### Workflow Description -OpenShift upgrades are not supported in a degraded state, and will only proceed when both peers are online. +#### Cluster Creation -MachineConfig updates are not applied in a degraded state, and will only proceed when both peers are online. +Creation of a two node control plane will be possible via the core installer (with an additional bootstrap node), and via the Assisted Installer (without an additional bootstrap node). +In the case of the core OpenShift installer, the user-facing proceedure is unchanged from a standard "IPI" installation, other than the configuration of 2 nodes instead of 3. +Internally, the RedFish details for each node will need to make their way into the RHEL-HA configuration, but this is information already required for bare-metal hosts. -### Workflow Description +In the case of the Assisted Installer, the user-facing proceedure follows the standard flow except for the configuration of 2 nodes instead of 3, and the collection of RedFish details for each node which are needed for the RHEL-HA configuration. + +Everything else about cluster creation will be an opaque implementation detail not exposed to the user. 
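To make the "configuration of 2 nodes instead of 3" concrete, a hypothetical `install-config.yaml` fragment for the bare-metal flow might look like the sketch below once the installer accepts a two-node control-plane. All names, addresses, and credentials are placeholders; only the replica count and the per-host RedFish BMC details differ from a standard bare-metal installation.

```yaml
apiVersion: v1
baseDomain: example.com
metadata:
  name: two-node-cluster        # placeholder
controlPlane:
  name: master
  replicas: 2                   # two control-plane nodes instead of three
compute:
  - name: worker
    replicas: 0
platform:
  baremetal:
    apiVIPs:
      - 192.168.111.5           # placeholder VIPs
    ingressVIPs:
      - 192.168.111.4
    hosts:
      - name: node1
        role: master
        bootMACAddress: 52:54:00:00:00:01
        bmc:                    # RedFish details, also consumed by the RHEL-HA configuration
          address: redfish-virtualmedia://192.168.111.20/redfish/v1/Systems/1
          username: admin
          password: REDACTED
      - name: node2
        role: master
        bootMACAddress: 52:54:00:00:00:02
        bmc:
          address: redfish-virtualmedia://192.168.111.21/redfish/v1/Systems/1
          username: admin
          password: REDACTED
```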
-#### Cluster Creator Role: -* The Cluster Creator will automatically install the 2NO (by using an installer), installation process will include the following steps: - * Deploys a two-node OpenShift cluster - * Configures cluster membership and quorum using Corosync. - * Sets up Pacemaker for resource management and fencing. +#### Day 2 Proceedures -#### Application Administrator Role: -* Receives cluster credentials. -* Deploys applications within the two-node cluster environment. +As per a standard 3-node control plane, OpenShift upgrades and MachineConfig changes can not be applied when the cluster is in a degraded state. +Such operations will only proceed when both peers are online and healthy. +The experience of managing a 2-node control plane should be largely indistinguishable from that of a 3-node one. ### API Extensions -No new CRDs, or changes to existing CRDs, are expected at this time. +Initially the creation of an etcd cluster will be driven in the same way as other platforms. +Once the cluster has two members, the etcd daemon will be removed from the static pod definition and recreated as a resource controlled by RHEL-HA. +At this point, the Cluster Etcd Operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled. +This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field on the Infrastructure CR - which can only succeed if the control-plane is healthy. + +To enable this flow, we propose the addition of a `externallyManagedEtcd` field to the `BareMetalPlatformSpec` which defaults to False. +This will limit the scope of CEO changes to that specific platform, and well as allow the use of a tightly scoped credential to make the change. +An alternative being to grant write access to all `ConfigMaps` in the `openshift-config` namespace. ### Topology Considerations @@ -162,16 +165,24 @@ No new CRDs, or changes to existing CRDs, are expected at this time. Is the change relevant for standalone clusters? TODO: Exactly what is the definition of a standalone cluster? Disconnected? Physical hardware? - ### Implementation Details/Notes/Constraints While the target installation requires exactly 2 nodes, this will be achieved by building support in the core installer for a "bootstrap plus 2 nodes" flow, and then using Assisted Installer's ability to bootstrap-in-place to remove the requirement for a bootstrap node. -Initially the creation of an etcd cluster will be driven in the same way as other platforms. -Once the cluster has two members, the etcd daemon will be removed from the static pod and become controlled by RHEL-HA. -At this point, the Cluster Etcd Operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled. -The exact mechanism for this communication has yet to be determined. +A mechanism is needed for other components to understand that this is a 2no architecture in which etcd is externally managed and RHEL-HA is in use. +The proposed `externallyManagedEtcd` field for `BareMetalPlatformSpec` in combination with the node count may be sufficient. +Alternatively we may wish to make this explicit by creating a new feature gate. 
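If the `externallyManagedEtcd` route is taken, the signal might be carried on the cluster-scoped `Infrastructure` resource roughly as sketched below. The field itself is only proposed in this document and does not exist in current OpenShift APIs.

```yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  platformSpec:
    type: BareMetal
    baremetal:
      # Proposed (not yet existing) field, set to true by the entity that hands
      # etcd over to RHEL-HA; CEO would disable its membership management when
      # it observes this value.
      externallyManagedEtcd: true
```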
+The delivery of RHEL-HA components will be opaque to the user and either come: + +* as an MCO Layer (this feature is targeting GA in 4.19), +* as an extension (supported today), or +* included, but inactive, in the base image + +Configuration of the RHEL-HA components will be via one or more MachineConfigs, and will require RedFish details to have been collected by the installer. +Sensible defaults will be chosen where possible, and user customization only where absolutely necessary. + +Tools for extracting support information (must-gather tarballs) will be updated to gather relevant logs for triaging issues. #### Failure Scenario Timelines: @@ -190,7 +201,8 @@ The exact mechanism for this communication has yet to be determined. 2. Network Failure 1. Corosync on both nodes detects separation 2. Etcd loses internal quorum (E-quorum) and goes read-only - 3. Both sides retain C-quorum and initiate fencing of the other side. There is a different delay between the two nodes for executing the fencing operation to avoid both fencing operations to succeed in parallel and thus shutting down the system completely. + 3. Both sides retain C-quorum and initiate fencing of the other side. + There is a different delay (configured as part of Pacemaker) between the two nodes for executing the fencing operation to avoid both fencing operations to succeed in parallel and thus shutting down the system completely. 4. One side wins, pre-configured as Node1 5. Pacemaker on Node1 forces E-quorum (etcd promotion event) 6. Cluster continues with no redundancy @@ -271,16 +283,10 @@ Mitigation: ... CI ... ### Drawbacks -The idea is to find the best form of an argument why this enhancement should -_not_ be implemented. - -What trade-offs (technical/efficiency cost, user experience, flexibility, -supportability, etc) must be made in order to implement this? What are the reasons -we might not want to undertake this proposal, and how do we overcome them? +The two-node architecture represents yet another distinct install type for users to choose from. -Does this proposal implement a behavior that's new/unique/novel? Is it poorly -aligned with existing user expectations? Will it be a significant maintenance -burden? Is it likely to be superceded by something else in the near future? +The existence of 1, 2, and 3+ node control-plane sizes will likely generate customer demand to move between them as their needs change. +Satisfying this demand would come with significant technical and support overhead. ## Open Questions [optional] @@ -289,6 +295,12 @@ to implement the design. For instance, > 1. This requires exposing previously private resources which contain sensitive information. Can we do this? +1. How to best deliver RHEL-HA components to the nodes is currently under discussion with the MCO team. + The answer may change as in-progress MCO features mature. +1. Are there any normal lifecycle events that would be interpreted by a peer as a failure, and where the resulting "recovery" would create unnecessary downtime? + How can these be avoided? 
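One possible answer to the lifecycle-event question, sketched here on the assumption that the standard `pcs` tooling is present on the nodes, is to make planned events look like a graceful departure rather than a failure:

```bash
# Hypothetical planned-reboot flow for one node; names are illustrative.
# Stopping the cluster services cleanly means the peer observes a graceful
# leave of the Corosync membership rather than a failure, so no fencing occurs.
pcs cluster stop node2

# ... apply the change and reboot node2 ...

# Rejoin the membership; Pacemaker then restarts cri-o, kubelet, and etcd.
pcs cluster start node2
```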
+ + ## Test Plan **Note:** *Section not required until targeted at a release.* From 6ca9794f7f87ece3503935a38a6ab6556cb49d77 Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Tue, 8 Oct 2024 12:32:55 +1100 Subject: [PATCH 05/49] Formating --- enhancements/two-nodes-openshift/2no.md | 39 +++++++++++-------------- 1 file changed, 17 insertions(+), 22 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index f8f395f22a..5fe9d93b2f 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -66,7 +66,7 @@ E-quorum: quorum as determined by etcd members and algorithms Split-brain - a scenario where a set of peers are separated into groups smaller than the quorum threshold AND peers decide to host services already running by other groups. Typically results in data loss or corruption unless state is stored outside of the cluster. -MCO - Machine Config Operator. This operator manages updates to systemd, cri-o/kubelet, kernel, NetworkManager, etc. It also offers a new MachineConfig CRD that can write configuration files onto the host. +MCO - Machine Config Operator. This operator manages updates to systemd, cri-o/kubelet, kernel, NetworkManager, etc. It also offers a new `MachineConfig` capability that can write configuration files onto the host. ABI - Agent-Based Installer. @@ -140,7 +140,7 @@ Everything else about cluster creation will be an opaque implementation detail n #### Day 2 Proceedures -As per a standard 3-node control plane, OpenShift upgrades and MachineConfig changes can not be applied when the cluster is in a degraded state. +As per a standard 3-node control plane, OpenShift upgrades and `MachineConfig` changes can not be applied when the cluster is in a degraded state. Such operations will only proceed when both peers are online and healthy. The experience of managing a 2-node control plane should be largely indistinguishable from that of a 3-node one. @@ -179,7 +179,7 @@ The delivery of RHEL-HA components will be opaque to the user and either come: * as an extension (supported today), or * included, but inactive, in the base image -Configuration of the RHEL-HA components will be via one or more MachineConfigs, and will require RedFish details to have been collected by the installer. +Configuration of the RHEL-HA components will be via one or more `MachineConfig`s, and will require RedFish details to have been collected by the installer. Sensible defaults will be chosen where possible, and user customization only where absolutely necessary. Tools for extracting support information (must-gather tarballs) will be updated to gather relevant logs for triaging issues. @@ -259,24 +259,24 @@ Tools for extracting support information (must-gather tarballs) will be updated ### Risks and Mitigations -Risk: If etcd were to be made active on both peers during a network split, divergent datasets would be created -Mitigation: RHEL-HA requires fencing of a presumed dead peer before restarting etcd as a cluster of one -Mitigation: Peers remain inert (unable to fence peers, or start cri-o, kubelet, or etcd) after rebooting until they can contact their peer +1. Risk: If etcd were to be made active on both peers during a network split, divergent datasets would be created + 1. Mitigation: RHEL-HA requires fencing of a presumed dead peer before restarting etcd as a cluster of one + 1. 
Mitigation: Peers remain inert (unable to fence peers, or start cri-o, kubelet, or etcd) after rebooting until they can contact their peer -Risk: Multiple entities (RHEL-HA, CEO) attempting to manage etcd membership would cause an internal split-brain -Mitigation: The CEO will run in a mode that does manage not etcd membership +1. Risk: Multiple entities (RHEL-HA, CEO) attempting to manage etcd membership would cause an internal split-brain + 1. Mitigation: The CEO will run in a mode that does manage not etcd membership -Risk: Rebooting the surviving peer would require human intervention before the cluster starts, increasing downtime and creating an admin burden at remote sites -Mitigation: Lifecycle events, such as upgrades and applying new MachineConfigs, are not permitted in a single-node degraded state -Mitigation: Usage of the MCO Admin Defined Node Disruption [feature](https://github.com/openshift/enhancements/pull/1525) will futher reduce the need for reboots. -Mitigation: The node will be reachable via SSH and the confirmation can be scripted -Mitigation: It may be possible to identify scenarios where, for a known hardware topology, it is safe to allow the node to proceed automatically. +1. Risk: Rebooting the surviving peer would require human intervention before the cluster starts, increasing downtime and creating an admin burden at remote sites + 1. Mitigation: Lifecycle events, such as upgrades and applying new `MachineConfig`s, are not permitted in a single-node degraded state + 1. Mitigation: Usage of the MCO Admin Defined Node Disruption [feature](https://github.com/openshift/enhancements/pull/1525) will futher reduce the need for reboots. + 1. Mitigation: The node will be reachable via SSH and the confirmation can be scripted + 1. Mitigation: It may be possible to identify scenarios where, for a known hardware topology, it is safe to allow the node to proceed automatically. -Risk: We may not succeed in identifying all the reasons a node will reboot -Mitigation: ... testing? ... +1. Risk: We may not succeed in identifying all the reasons a node will reboot + 1. Mitigation: ... testing? ... -Risk: This new platform will have a unique installation flow -Mitigation: ... CI ... +1. Risk: This new platform will have a unique installation flow + 1. Mitigation: ... CI ... @@ -290,11 +290,6 @@ Satisfying this demand would come with significant technical and support overhea ## Open Questions [optional] -This is where to call out areas of the design that require closure before deciding -to implement the design. For instance, - > 1. This requires exposing previously private resources which contain sensitive - information. Can we do this? - 1. How to best deliver RHEL-HA components to the nodes is currently under discussion with the MCO team. The answer may change as in-progress MCO features mature. 1. Are there any normal lifecycle events that would be interpreted by a peer as a failure, and where the resulting "recovery" would create unnecessary downtime? 
From e30f327dbcf5d55db2152893e57413b4390d5be0 Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Tue, 8 Oct 2024 12:52:55 +1100 Subject: [PATCH 06/49] Additional detail --- enhancements/two-nodes-openshift/2no.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 5fe9d93b2f..d3105243fa 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -120,6 +120,8 @@ Upon a peer failure, the RHEL-HA components on the surivor will fence the peer a Upon a network failure, the RHEL-HA components ensure that exactly one node will survive, fence it's peer, and use the OCF script to restart etcd as a new cluster of one. +In both cases, the control-plane will be unresponsive until etcd has been restarted. + Upon rebooting, the RHEL-HA components ensure that a node remains inert (not running cri-o, kubelet, or etcd) until it sees it's peer. If the peer is likely to remain offline for an extended period of time, admin confirmation is required to allow the node to start OpenShift. @@ -202,7 +204,7 @@ Tools for extracting support information (must-gather tarballs) will be updated 1. Corosync on both nodes detects separation 2. Etcd loses internal quorum (E-quorum) and goes read-only 3. Both sides retain C-quorum and initiate fencing of the other side. - There is a different delay (configured as part of Pacemaker) between the two nodes for executing the fencing operation to avoid both fencing operations to succeed in parallel and thus shutting down the system completely. + There is a different delay (configured as part of Pacemaker, usually in the order of 10s of seconds) between the two nodes for executing the fencing operation to avoid both fencing operations to succeed in parallel and thus shutting down the system completely. 4. One side wins, pre-configured as Node1 5. Pacemaker on Node1 forces E-quorum (etcd promotion event) 6. Cluster continues with no redundancy From 267d40ed7ab3a064b8de70a87f2ba37a5c469ff3 Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Tue, 8 Oct 2024 13:12:31 +1100 Subject: [PATCH 07/49] Link to draft workload availability enahncement --- enhancements/two-nodes-openshift/2no.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index d3105243fa..c3c372b471 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -101,8 +101,8 @@ This requires our solution provide a management experience consistent with "norm ### Non-Goals -* Workload resilience - see related enhancement [link] -* Resilient storage - see follow-up enhancement +* Workload resilience - see related [Pre-DRAFT enhancement](https://docs.google.com/document/d/1TDU_4I4LP6Z9_HugeC-kaQ297YvqVJQhBs06lRIC9m8/edit) +* Resilient storage - see future enhancement * Support for platforms other than bare metal including automated ci testing * Support for other topologies (eg. 
hypershift) * Adding worker nodes From d9f48345940908a675a81bba62ad6580257d9d4a Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Tue, 8 Oct 2024 13:37:27 +1100 Subject: [PATCH 08/49] Additional detail requested in enhancement threads --- enhancements/two-nodes-openshift/2no.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index c3c372b471..0cd31d2f41 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -113,6 +113,8 @@ This requires our solution provide a management experience consistent with "norm Use the RHEL-HA stack (Corosync, and Pacemaker), which has been used to delivered supported 2-node cluster experiences for multiple decades, to manage cri-o, kubelet, and the etcd daemon. etcd will run as as a voting member on both nodes. We will take advantage of RHEL-HA's native support for systemd and re-use the standard cri-o and kublet units, as well as create a new Open Cluster Framework (OCF) script for etcd. +The existing startup order of cri-o, then kubelet, then etcd will be preserved. +The `etcdctl`, `etcd-metrics`, and `etcd-readyz` containers will remain part of the static pod. Use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fence) unreachable peers and ensure that they can do no harm while the remaining node continues. @@ -184,6 +186,10 @@ The delivery of RHEL-HA components will be opaque to the user and either come: Configuration of the RHEL-HA components will be via one or more `MachineConfig`s, and will require RedFish details to have been collected by the installer. Sensible defaults will be chosen where possible, and user customization only where absolutely necessary. +The entity (likely a one-shot systemd job as part of a `MachineConfig`) that configures RHEL-HA will also configure a fencing priority. +This is usually done based on the sort-order a piece of shared info (such as IP or node name). +The priority takes the form of a delay, usually in the order of 10s of seconds, and is used to prevent parallel fencing operations during a primary-network outage where each side powers off the other - resulting in a total cluster outage. + Tools for extracting support information (must-gather tarballs) will be updated to gather relevant logs for triaging issues. #### Failure Scenario Timelines: @@ -204,7 +210,7 @@ Tools for extracting support information (must-gather tarballs) will be updated 1. Corosync on both nodes detects separation 2. Etcd loses internal quorum (E-quorum) and goes read-only 3. Both sides retain C-quorum and initiate fencing of the other side. - There is a different delay (configured as part of Pacemaker, usually in the order of 10s of seconds) between the two nodes for executing the fencing operation to avoid both fencing operations to succeed in parallel and thus shutting down the system completely. + RHEL-HA's fencing priority avoids parallel fencing operations and thus the total shutdown of the system. 4. One side wins, pre-configured as Node1 5. Pacemaker on Node1 forces E-quorum (etcd promotion event) 6. 
Cluster continues with no redundancy From 49b80de368f6850a7ba74dc890abe3ee2b6cd3ad Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Wed, 9 Oct 2024 11:51:18 +1100 Subject: [PATCH 09/49] Include the coldboot by a single peer in the workflow --- enhancements/two-nodes-openshift/2no.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 0cd31d2f41..bc39cb95cc 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -118,9 +118,9 @@ The `etcdctl`, `etcd-metrics`, and `etcd-readyz` containers will remain part of Use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fence) unreachable peers and ensure that they can do no harm while the remaining node continues. -Upon a peer failure, the RHEL-HA components on the surivor will fence the peer and use the OCF script to restart etcd as a new cluster of one. +Upon a peer failure, the RHEL-HA components on the surivor will fence the peer and use the OCF script to restart etcd as a new cluster-of-one. -Upon a network failure, the RHEL-HA components ensure that exactly one node will survive, fence it's peer, and use the OCF script to restart etcd as a new cluster of one. +Upon a network failure, the RHEL-HA components ensure that exactly one node will survive, fence it's peer, and use the OCF script to restart etcd as a new cluster-of-one. In both cases, the control-plane will be unresponsive until etcd has been restarted. @@ -148,6 +148,12 @@ As per a standard 3-node control plane, OpenShift upgrades and `MachineConfig` c Such operations will only proceed when both peers are online and healthy. The experience of managing a 2-node control plane should be largely indistinguishable from that of a 3-node one. +The primary exception is (re)booting one of the peers while the other is offline, and expected to remain so. + +As in a 3-node control-plane cluster, starting only one node is not expected to result in a functioning cluster. +Should the admin wish for the control-plane to start, the admin will need to execute a supplied confirmation command on the active cluster node. +This command will grant quorum to the RHEL-HA components, authorizing it to fence it's peer and start etcd in as a cluster-of-one read/write mode. +Confirmation can be given at any point and optionally make use of SSH to facilitate initiation by an external script. ### API Extensions @@ -268,7 +274,7 @@ Tools for extracting support information (must-gather tarballs) will be updated ### Risks and Mitigations 1. Risk: If etcd were to be made active on both peers during a network split, divergent datasets would be created - 1. Mitigation: RHEL-HA requires fencing of a presumed dead peer before restarting etcd as a cluster of one + 1. Mitigation: RHEL-HA requires fencing of a presumed dead peer before restarting etcd as a cluster-of-one 1. Mitigation: Peers remain inert (unable to fence peers, or start cri-o, kubelet, or etcd) after rebooting until they can contact their peer 1. 
Risk: Multiple entities (RHEL-HA, CEO) attempting to manage etcd membership would cause an internal split-brain From f4e461109856843db19014b55dc06aa3f9d8d6a0 Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Thu, 10 Oct 2024 14:02:17 +1100 Subject: [PATCH 10/49] Flag the risk of surprise reboots --- enhancements/two-nodes-openshift/2no.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index bc39cb95cc..936e21b820 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -286,6 +286,10 @@ Tools for extracting support information (must-gather tarballs) will be updated 1. Mitigation: The node will be reachable via SSH and the confirmation can be scripted 1. Mitigation: It may be possible to identify scenarios where, for a known hardware topology, it is safe to allow the node to proceed automatically. +1. Risk: “Something changed, lets reboot” is somewhat baked into OCP’s DNA and has the potential to be problematic when nodes are actively watching for their peer to disappear, and have an obligation to promptly act on that disappearance by power cycling them. + 1. Mitigation: Identify causes of reboots, and either avoid them or ensure they are not treated as failures. + This may require an additional enhancement. + 1. Risk: We may not succeed in identifying all the reasons a node will reboot 1. Mitigation: ... testing? ... From e959600ac36f37361506021a8e73e7d8f37d8996 Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Thu, 10 Oct 2024 14:05:27 +1100 Subject: [PATCH 11/49] Formatting --- enhancements/two-nodes-openshift/2no.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 936e21b820..59ba5e738e 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -86,11 +86,11 @@ This requires our solution provide a management experience consistent with "norm ### User Stories * As a large enterprise with multiple remote sites, I want a cost-effective OpenShift cluster solution so that I can manage containers without the overhead of a third node. -* As a support engineer, I want a safe and automated method for handling the failure of a single node so that the downtime of workloads is minimized. +* As a support engineer, I want a safe and automated method for handling the failure of a single node so that the downtime of the control-plane is minimized. ### Goals -* Provide a two-node control plane for physical hardware that is resilient to a node-level failure for either node +* Provide a two-node control-plane for physical hardware that is resilient to a node-level failure for either node * Provide a transparent installation experience that starts with exactly 2 blank physical nodes, and ends with a fault-tolerant two node cluster * Prevent both data corruption and divergent datasets in etcd * Maintain the existing level of availability. Eg. by avoiding fencing loops, wherein each node powers cycles it's peer after booting, reducing the cluster's availability. @@ -133,7 +133,7 @@ When starting etcd, the OCF script will use etcd's cluster ID and version counte #### Cluster Creation -Creation of a two node control plane will be possible via the core installer (with an additional bootstrap node), and via the Assisted Installer (without an additional bootstrap node). 
+Creation of a two node control-plane will be possible via the core installer (with an additional bootstrap node), and via the Assisted Installer (without an additional bootstrap node). In the case of the core OpenShift installer, the user-facing proceedure is unchanged from a standard "IPI" installation, other than the configuration of 2 nodes instead of 3. Internally, the RedFish details for each node will need to make their way into the RHEL-HA configuration, but this is information already required for bare-metal hosts. @@ -144,10 +144,10 @@ Everything else about cluster creation will be an opaque implementation detail n #### Day 2 Proceedures -As per a standard 3-node control plane, OpenShift upgrades and `MachineConfig` changes can not be applied when the cluster is in a degraded state. +As per a standard 3-node control-plane, OpenShift upgrades and `MachineConfig` changes can not be applied when the cluster is in a degraded state. Such operations will only proceed when both peers are online and healthy. -The experience of managing a 2-node control plane should be largely indistinguishable from that of a 3-node one. +The experience of managing a 2-node control-plane should be largely indistinguishable from that of a 3-node one. The primary exception is (re)booting one of the peers while the other is offline, and expected to remain so. As in a 3-node control-plane cluster, starting only one node is not expected to result in a functioning cluster. @@ -401,7 +401,7 @@ What are the guarantees? Make sure this is in the test plan. Consider the following in developing a version skew strategy for this enhancement: - During an upgrade, we will always have skew among components, how will this impact your work? -- Does this enhancement involve coordinating behavior in the control plane and +- Does this enhancement involve coordinating behavior in the control-plane and in the kubelet? How does an n-2 kubelet without this feature available behave when this feature is used? - Will any other components on the node change? For example, changes to CSI, CRI @@ -496,7 +496,7 @@ Describe how to * 2 SNO + KCP -[KCP](https://github.com/kcp-dev/kcp/) allows you to manage multiple clusters from a single control plane, reducing the complexity of managing each cluster independently. +[KCP](https://github.com/kcp-dev/kcp/) allows you to manage multiple clusters from a single control-plane, reducing the complexity of managing each cluster independently. With kcp, you can manage the two single-node clusters, each single-node OpenShift cluster can continue to operate independently even if the central kcp management plane becomes unavailable. The main advantage of this approach is that it doesn’t require inventing a new Openshift flavor and we don’t need to create a new installation flow to accommodate it. Disadvantages: From 3a4aeba221cc798f93c798d39d876bc419808dc5 Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Thu, 10 Oct 2024 14:07:30 +1100 Subject: [PATCH 12/49] Formatting --- enhancements/two-nodes-openshift/2no.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 59ba5e738e..620df2e797 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -61,8 +61,8 @@ Fence Agent - Fence agents were developed as device "drivers" which are able to
[more context here](https://github.com/ClusterLabs/fence-agents/) Quorum - having the minimum number of members required for decision-making. The most common threshold is 1 plus half the total number of members, though more complicated algorithms predicated on fencing are also possible. -C-quorum: quorum as determined by Corosync members and algorithms -E-quorum: quorum as determined by etcd members and algorithms + * C-quorum: quorum as determined by Corosync members and algorithms + * E-quorum: quorum as determined by etcd members and algorithms Split-brain - a scenario where a set of peers are separated into groups smaller than the quorum threshold AND peers decide to host services already running by other groups. Typically results in data loss or corruption unless state is stored outside of the cluster. From e010dfcabf6bcd078dded6ac144f4702f4476c87 Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Fri, 11 Oct 2024 09:15:27 +1100 Subject: [PATCH 13/49] Incorporate review feedback --- enhancements/two-nodes-openshift/2no.md | 20 +++++++------------- 1 file changed, 7 insertions(+), 13 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 620df2e797..b4914d83c4 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -52,21 +52,15 @@ Pacemaker - a Red Hat led [open-source project](https://clusterlabs.org/pacemake Resource Agent - A resource agent is an executable that manages a cluster resource. No formal definition of a cluster resource exists, other than "anything a cluster manages is a resource." Cluster resources can be as diverse as IP addresses, file systems, database services, and entire virtual machines - to name just a few examples.
[more context here](https://github.com/ClusterLabs/resource-agents/blob/main/doc/dev-guides/ra-dev-guide.asc) -Fencing - the process of “somehow” isolating or powering off malfunctioning or unresponsive nodes to prevent them from causing further harm or interference with the rest of the cluster. - -Fence Agent - Fence agents were developed as device "drivers" which are able to prevent computers from destroying data on shared storage. Their aim is to isolate a corrupted computer, using one of three methods: -* Power - A computer that is switched off cannot corrupt data, but it is important to not do a "soft-reboot" as we won't know if this is possible. This also works for virtual machines when the fence device is a hypervisor. -* Network - Switches can prevent routing to a given computer, so even if a computer is powered on it won't be able to harm the data. -* Configuration - Fibre-channel switches or SCSI devices allow us to limit who can write to managed disks. -
[more context here](https://github.com/ClusterLabs/fence-agents/) +Fencing - the process of “somehow” isolating or powering off malfunctioning or unresponsive nodes to prevent them from causing further harm, such as data corruption or the creation of divergent datasets. Quorum - having the minimum number of members required for decision-making. The most common threshold is 1 plus half the total number of members, though more complicated algorithms predicated on fencing are also possible. * C-quorum: quorum as determined by Corosync members and algorithms * E-quorum: quorum as determined by etcd members and algorithms -Split-brain - a scenario where a set of peers are separated into groups smaller than the quorum threshold AND peers decide to host services already running by other groups. Typically results in data loss or corruption unless state is stored outside of the cluster. +Split-brain - a scenario where a set of peers are separated into groups smaller than the quorum threshold AND peers decide to host services already running by other groups. Typically results in data loss or corruption. -MCO - Machine Config Operator. This operator manages updates to systemd, cri-o/kubelet, kernel, NetworkManager, etc. It also offers a new `MachineConfig` capability that can write configuration files onto the host. +MCO - Machine Config Operator. This operator manages updates to node's systemd, cri-o/kubelet, kernel, NetworkManager, etc., and can write custom files to it, configurable by MachineConfig custom resources. ABI - Agent-Based Installer. @@ -81,7 +75,7 @@ Customers with hundreds, or even tens-of-thousands, of geographically dispersed The need for some level of fault tolerance prevents the applicability of Single Node OpenShift (SNO), and a converged 3-node cluster is cost prohibitive at the scale of retail and telcos - even when the third node is a "cheap" one that doesn't run workloads. The benefits of the cloud-native approach to developing and deploying applications are increasingly being adopted in edge computing. -This requires our solution provide a management experience consistent with "normal" OpenShift deployments, and be compatible with the full ecosystem of Red Hat and partner workloads designed for OpenShift. +This requires our solution to provide a management experience consistent with "normal" OpenShift deployments, and be compatible with the full ecosystem of Red Hat and partner workloads designed for OpenShift. ### User Stories @@ -116,13 +110,13 @@ We will take advantage of RHEL-HA's native support for systemd and re-use the st The existing startup order of cri-o, then kubelet, then etcd will be preserved. The `etcdctl`, `etcd-metrics`, and `etcd-readyz` containers will remain part of the static pod. -Use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fence) unreachable peers and ensure that they can do no harm while the remaining node continues. +Use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fence) an unreachable peer and ensure that it can do no harm while the remaining node continues. Upon a peer failure, the RHEL-HA components on the surivor will fence the peer and use the OCF script to restart etcd as a new cluster-of-one. Upon a network failure, the RHEL-HA components ensure that exactly one node will survive, fence it's peer, and use the OCF script to restart etcd as a new cluster-of-one. 
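For readers unfamiliar with how a surviving etcd member can be returned to read/write operation on its own, the following is a rough sketch of the mechanism the OCF script could rely on, not the agent itself; the binary path, data directory, and member name are illustrative.

```bash
# Illustrative cluster-of-one restart, performed only after the peer has been
# successfully fenced. --force-new-cluster keeps the existing data directory
# but rewrites the stored membership so that only the local member remains,
# restoring quorum and read/write operation without the dead peer.
/usr/bin/etcd \
  --name "$(hostname -s)" \
  --data-dir /var/lib/etcd \
  --force-new-cluster
```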
-In both cases, the control-plane will be unresponsive until etcd has been restarted. +In both cases, the control-plane's dependance on etcd will cause it to respond with errors until etcd has been restarted. Upon rebooting, the RHEL-HA components ensure that a node remains inert (not running cri-o, kubelet, or etcd) until it sees it's peer. If the peer is likely to remain offline for an extended period of time, admin confirmation is required to allow the node to start OpenShift. @@ -160,7 +154,7 @@ Confirmation can be given at any point and optionally make use of SSH to facilit Initially the creation of an etcd cluster will be driven in the same way as other platforms. Once the cluster has two members, the etcd daemon will be removed from the static pod definition and recreated as a resource controlled by RHEL-HA. At this point, the Cluster Etcd Operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled. -This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field on the Infrastructure CR - which can only succeed if the control-plane is healthy. +This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the `BareMetalPlatformSpec` part of the Infrastructure CR - which can only succeed if the control-plane is healthy. To enable this flow, we propose the addition of a `externallyManagedEtcd` field to the `BareMetalPlatformSpec` which defaults to False. This will limit the scope of CEO changes to that specific platform, and well as allow the use of a tightly scoped credential to make the change. From 6367298c75db415065bb01203d696a85cb7778d3 Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Fri, 11 Oct 2024 09:37:23 +1100 Subject: [PATCH 14/49] Move discussion of resource agents to the implementation section --- enhancements/two-nodes-openshift/2no.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index b4914d83c4..9aacb5b242 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -49,9 +49,6 @@ Corosync - a Red Hat led [open-source project](https://corosync.github.io/corosy Pacemaker - a Red Hat led [open-source project](https://clusterlabs.org/pacemaker/doc/) that works in conjunction with Corosync to provide general purpose fault tolerance and automatic failover for critical services and applications. -Resource Agent - A resource agent is an executable that manages a cluster resource. No formal definition of a cluster resource exists, other than "anything a cluster manages is a resource." Cluster resources can be as diverse as IP addresses, file systems, database services, and entire virtual machines - to name just a few examples. -
[more context here](https://github.com/ClusterLabs/resource-agents/blob/main/doc/dev-guides/ra-dev-guide.asc) - Fencing - the process of “somehow” isolating or powering off malfunctioning or unresponsive nodes to prevent them from causing further harm, such as data corruption or the creation of divergent datasets. Quorum - having the minimum number of members required for decision-making. The most common threshold is 1 plus half the total number of members, though more complicated algorithms predicated on fencing are also possible. @@ -190,6 +187,14 @@ The entity (likely a one-shot systemd job as part of a `MachineConfig`) that con This is usually done based on the sort-order a piece of shared info (such as IP or node name). The priority takes the form of a delay, usually in the order of 10s of seconds, and is used to prevent parallel fencing operations during a primary-network outage where each side powers off the other - resulting in a total cluster outage. +RHEL-HA has no real understanding of the resources (IP addresses, file systems, databases, even virtual machines) it manages. +It relies on resource agents to understand how to check the state of a resource, as well as start and stop them to achieve the desired target state. +How a given agent uses these actions, and associated states, to model the resource is opaque to the cluster and depends on the needs of the underlying resource. + +Agents must conform to one of a variety of standards, including systemd, SYS-V, and OCF. +The latter being the most powerful, adding the concept of promotion, and demotion. +More information on creating OCF agents can be found in the upstream [developer guide](https://github.com/ClusterLabs/resource-agents/blob/main/doc/dev-guides/ra-dev-guide.asc). + Tools for extracting support information (must-gather tarballs) will be updated to gather relevant logs for triaging issues. #### Failure Scenario Timelines: From 1799260d6401d0a8e88eb1d44b5d86497a3903ab Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Sat, 12 Oct 2024 12:47:57 +1100 Subject: [PATCH 15/49] Incorporate review feedback around CRD changes --- enhancements/two-nodes-openshift/2no.md | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 9aacb5b242..a1cf1d5a1d 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -148,13 +148,26 @@ Confirmation can be given at any point and optionally make use of SSH to facilit ### API Extensions +There are two related but ultimately orthogonal capabilities that may require API extensions. + +1. Identify the cluster as having a unique topology +2. Tell CEO when it is safe for it to disable certain membership related functionalities + +#### Unique Topology + +A mechanism is needed for the installer and other components to understand that this is a 2 node control-plane topology which may require different handling. + +TODO: pros and cons of creating a new PlatformType, vs. feature gate, vs adding a new field to `PlatformSpec` or `BareMetalPlatformSpec` + +#### CEO Trigger + Initially the creation of an etcd cluster will be driven in the same way as other platforms. Once the cluster has two members, the etcd daemon will be removed from the static pod definition and recreated as a resource controlled by RHEL-HA. 
At this point, the Cluster Etcd Operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled. -This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the `BareMetalPlatformSpec` part of the Infrastructure CR - which can only succeed if the control-plane is healthy. +This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the `BareMetalPlatformSpec` portion of the `Infrastructure` CR - which can only succeed if the control-plane is healthy. To enable this flow, we propose the addition of a `externallyManagedEtcd` field to the `BareMetalPlatformSpec` which defaults to False. -This will limit the scope of CEO changes to that specific platform, and well as allow the use of a tightly scoped credential to make the change. +This will limit the scope of CEO behavioural changes to that specific platform, and well as allow the use of a tightly scoped credential to make the change. An alternative being to grant write access to all `ConfigMaps` in the `openshift-config` namespace. ### Topology Considerations @@ -170,9 +183,7 @@ TODO: Exactly what is the definition of a standalone cluster? Disconnected? Ph While the target installation requires exactly 2 nodes, this will be achieved by building support in the core installer for a "bootstrap plus 2 nodes" flow, and then using Assisted Installer's ability to bootstrap-in-place to remove the requirement for a bootstrap node. -A mechanism is needed for other components to understand that this is a 2no architecture in which etcd is externally managed and RHEL-HA is in use. -The proposed `externallyManagedEtcd` field for `BareMetalPlatformSpec` in combination with the node count may be sufficient. -Alternatively we may wish to make this explicit by creating a new feature gate. +TODO: Finalize component delivery based on MCO team guidance. The delivery of RHEL-HA components will be opaque to the user and either come: @@ -311,6 +322,8 @@ Satisfying this demand would come with significant technical and support overhea The answer may change as in-progress MCO features mature. 1. Are there any normal lifecycle events that would be interpreted by a peer as a failure, and where the resulting "recovery" would create unnecessary downtime? How can these be avoided? +1. How to best indicate that this is a unique topology. +1. The relevance of disconnected installation/functions to the proposal. ## Test Plan From b8152a3c0ca9f41c4eedd04e363fcbd9bed05c52 Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Mon, 14 Oct 2024 11:20:44 +1100 Subject: [PATCH 16/49] Incorporate guidance from MCO team around package delivery --- enhancements/two-nodes-openshift/2no.md | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index a1cf1d5a1d..3f7d0bb01c 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -183,13 +183,8 @@ TODO: Exactly what is the definition of a standalone cluster? Disconnected? Ph While the target installation requires exactly 2 nodes, this will be achieved by building support in the core installer for a "bootstrap plus 2 nodes" flow, and then using Assisted Installer's ability to bootstrap-in-place to remove the requirement for a bootstrap node. 
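As a rough illustration of what the configuration entity described in this proposal might drive, the commands below sketch how the existing cri-o, then kubelet, then etcd startup order could be expressed to RHEL-HA. The resource names, the `twonode` OCF provider, and the omission of operation timeouts are placeholders, not settled design.

```bash
# Illustrative pcs commands only; the real configuration would be generated by
# the one-shot setup entity with agreed names and operation timeouts.

# cri-o and kubelet keep their existing systemd units, cloned so that a copy
# runs on each of the two control-plane nodes.
pcs resource create crio systemd:crio clone
pcs resource create kubelet systemd:kubelet clone

# etcd is driven through the new OCF agent ("twonode" is a placeholder provider).
pcs resource create etcd ocf:twonode:etcd clone

# Preserve the existing startup order: cri-o, then kubelet, then etcd.
pcs constraint order start crio-clone then kubelet-clone
pcs constraint order start kubelet-clone then etcd-clone
```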
-TODO: Finalize component delivery based on MCO team guidance. - -The delivery of RHEL-HA components will be opaque to the user and either come: - -* as an MCO Layer (this feature is targeting GA in 4.19), -* as an extension (supported today), or -* included, but inactive, in the base image +The delivery of RHEL-HA components will be opaque to the user and be delivered as an [MCO Extension](../rhcos/extensions.md) in the 4.18 and 4.19 timeframes. +A switch to [MCO Layering](../ocp-coreos-layering/ocp-coreos-layering.md ) will be investigated once it is GA in a shipping version of OpenShift. Configuration of the RHEL-HA components will be via one or more `MachineConfig`s, and will require RedFish details to have been collected by the installer. Sensible defaults will be chosen where possible, and user customization only where absolutely necessary. @@ -318,8 +313,6 @@ Satisfying this demand would come with significant technical and support overhea ## Open Questions [optional] -1. How to best deliver RHEL-HA components to the nodes is currently under discussion with the MCO team. - The answer may change as in-progress MCO features mature. 1. Are there any normal lifecycle events that would be interpreted by a peer as a failure, and where the resulting "recovery" would create unnecessary downtime? How can these be avoided? 1. How to best indicate that this is a unique topology. From 7b24ee5496231fecc103c38526e62e0fd4ef0d91 Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Mon, 14 Oct 2024 21:27:36 +1100 Subject: [PATCH 17/49] Remove boilerplate text --- enhancements/two-nodes-openshift/2no.md | 122 ++---------------------- 1 file changed, 8 insertions(+), 114 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 3f7d0bb01c..e809d1db24 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -323,76 +323,13 @@ Satisfying this demand would come with significant technical and support overhea **Note:** *Section not required until targeted at a release.* -Consider the following in developing a test plan for this enhancement: -- Will there be e2e and integration tests, in addition to unit tests? -- How will it be tested in isolation vs with other components? -- What additional testing is necessary to support managed OpenShift service-based offerings? - -No need to outline all of the test cases, just the general strategy. Anything -that would count as tricky in the implementation and anything particularly -challenging to test should be called out. - -All code is expected to have adequate tests (eventually with coverage -expectations). +See template for guidelines/instructions. ## Graduation Criteria **Note:** *Section not required until targeted at a release.* -Define graduation milestones. - -These may be defined in terms of API maturity, or as something else. Initial proposal -should keep this high-level with a focus on what signals will be looked at to -determine graduation. - -Consider the following in developing the graduation criteria for this -enhancement: - -- Maturity levels - - [`alpha`, `beta`, `stable` in upstream Kubernetes][maturity-levels] - - `Dev Preview`, `Tech Preview`, `GA` in OpenShift -- [Deprecation policy][deprecation-policy] - -Clearly define what graduation means by either linking to the [API doc definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning), -or by redefining what graduation means. 
- -In general, we try to use the same stages (alpha, beta, GA), regardless how the functionality is accessed. - -[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions -[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/ - -**If this is a user facing change requiring new or updated documentation in [openshift-docs](https://github.com/openshift/openshift-docs/), -please be sure to include in the graduation criteria.** - -**Examples**: These are generalized examples to consider, in addition -to the aforementioned [maturity levels][maturity-levels]. - -### Dev Preview -> Tech Preview - -- Ability to utilize the enhancement end to end -- End user documentation, relative API stability -- Sufficient test coverage -- Gather feedback from users rather than just developers -- Enumerate service level indicators (SLIs), expose SLIs as metrics -- Write symptoms-based alerts for the component(s) - -### Tech Preview -> GA - -- More testing (upgrade, downgrade, scale) -- Sufficient time for feedback -- Available by default -- Backhaul SLI telemetry -- Document SLOs for the component -- Conduct load testing -- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/) - -**For non-optional features moving to GA, the graduation criteria must include -end to end tests.** - -### Removing a deprecated feature - -- Announce deprecation and support policy of the existing feature -- Deprecate the feature +See template for guidelines/instructions. ## Upgrade / Downgrade Strategy @@ -414,34 +351,24 @@ enhancement: ## Operational Aspects of API Extensions -Describe the impact of API extensions (mentioned in the proposal section, i.e. CRDs, -admission and conversion webhooks, aggregated API servers, finalizers) here in detail, -especially how they impact the OCP system architecture and operational aspects. +See template for guidelines/instructions. - For conversion/admission webhooks and aggregated apiservers: what are the SLIs (Service Level Indicators) an administrator or support can use to determine the health of the API extensions - Examples (metrics, alerts, operator conditions) - - authentication-operator condition `APIServerDegraded=False` - - authentication-operator condition `APIServerAvailable=True` - - openshift-authentication/oauth-apiserver deployment and pods health + N/A - What impact do these API extensions have on existing SLIs (e.g. scalability, API throughput, API availability) - Examples: - - Adds 1s to every pod update in the system, slowing down pod scheduling by 5s on average. - - Fails creation of ConfigMap in the system when the webhook is not available. - - Adds a dependency on the SDN service network for all resources, risking API availability in case - of SDN issues. - - Expected use-cases require less than 1000 instances of the CRD, not impacting - general API throughput. + [TODO: Expand] Toggling CEO control values with result in etcd being briefly offline. - How is the impact on existing SLIs to be measured and when (e.g. every release by QE, or automatically in CI) and by whom (e.g. perf team; name the responsible person and let them review this enhancement) - Describe the possible failure modes of the API extensions. + - Describe how a failure or behaviour of the extension will impact the overall cluster health (e.g. 
which kube-controller-manager functionality will stop working), especially regarding stability, availability, performance and security. @@ -450,51 +377,18 @@ especially how they impact the OCP system architecture and operational aspects. ## Support Procedures +See template for guidelines/instructions. + Describe how to - detect the failure modes in a support situation, describe possible symptoms (events, metrics, alerts, which log output in which component) - - Examples: - - If the webhook is not running, kube-apiserver logs will show errors like "failed to call admission webhook xyz". - - Operator X will degrade with message "Failed to launch webhook server" and reason "WehhookServerFailed". - - The metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")` - will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire. - - disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`) - - What consequences does it have on the cluster health? - - Examples: - - Garbage collection in kube-controller-manager will stop working. - - Quota will be wrongly computed. - - Disabling/removing the CRD is not possible without removing the CR instances. Customer will lose data. - Disabling the conversion webhook will break garbage collection. - - What consequences does it have on existing, running workloads? - - Examples: - - New namespaces won't get the finalizer "xyz" and hence might leak resource X - when deleted. - - SDN pod-to-pod routing will stop updating, potentially breaking pod-to-pod - communication after some minutes. - - What consequences does it have for newly created workloads? - - Examples: - - New pods in namespace with Istio support will not get sidecars injected, breaking - their networking. - - Does functionality fail gracefully and will work resume when re-enabled without risking consistency? - Examples: - - The mutating admission webhook "xyz" has FailPolicy=Ignore and hence - will not block the creation or updates on objects when it fails. When the - webhook comes back online, there is a controller reconciling all objects, applying - labels that were not applied during admission webhook downtime. - - Namespaces deletion will not delete all objects in etcd, leading to zombie - objects when another namespace with the same name is created. - ## Alternatives * MicroShift was considered as an alternative but it was ruled out because it does not support multi node has a very different experience then OpenShift which does not match the 2NO initiative which is on getting the OpenShift experience on two nodes From 4f4517e2dc548a93daf4ab91fbe19bdeff149b39 Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Tue, 15 Oct 2024 12:56:44 +1100 Subject: [PATCH 18/49] Update based on team discussion for distinguishing this topology --- enhancements/two-nodes-openshift/2no.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index e809d1db24..52992d98e9 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -155,20 +155,21 @@ There are two related but ultimately orthogonal capabilities that may require AP #### Unique Topology -A mechanism is needed for the installer and other components to understand that this is a 2 node control-plane topology which may require different handling. 
+A mechanism is needed for components of the cluster to understand that this is a 2 node control-plane topology which may require different handling. +We will define a new value for the `TopologyMode` enum: `DualReplicaTopologyMode`. -TODO: pros and cons of creating a new PlatformType, vs. feature gate, vs adding a new field to `PlatformSpec` or `BareMetalPlatformSpec` +However `TopologyMode` is not available at the point the Agent Based Installer (ABI) performs validation. +We will therefore additionally define a new feature gate `DualReplicaTopology` that can be enabled in `install-config.yaml`, and which ABI can use to validate the proposed cluster - such as the proposed node count. #### CEO Trigger Initially the creation of an etcd cluster will be driven in the same way as other platforms. Once the cluster has two members, the etcd daemon will be removed from the static pod definition and recreated as a resource controlled by RHEL-HA. At this point, the Cluster Etcd Operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled. -This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the `BareMetalPlatformSpec` portion of the `Infrastructure` CR - which can only succeed if the control-plane is healthy. +This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the CEO's `ConfigMap` - which can only succeed if the control-plane is healthy. -To enable this flow, we propose the addition of a `externallyManagedEtcd` field to the `BareMetalPlatformSpec` which defaults to False. -This will limit the scope of CEO behavioural changes to that specific platform, and well as allow the use of a tightly scoped credential to make the change. -An alternative being to grant write access to all `ConfigMaps` in the `openshift-config` namespace. +To enable this flow, we propose the addition of a `externallyManagedEtcd` field which defaults to `False`, and will only be respected if the `Infrastructure` CR's `TopologyMode` is `DualReplicaTopologyMode`. +This will allow the use of a credential scoped to `ConfigMap`s in the `openshift-etcd-operator` namespace, to make the change. ### Topology Considerations @@ -315,7 +316,6 @@ Satisfying this demand would come with significant technical and support overhea 1. Are there any normal lifecycle events that would be interpreted by a peer as a failure, and where the resulting "recovery" would create unnecessary downtime? How can these be avoided? -1. How to best indicate that this is a unique topology. 1. The relevance of disconnected installation/functions to the proposal. 
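For illustration, a two-node cluster using the proposed feature gate might be described to the Agent Based Installer with an `install-config.yaml` along the lines of the sketch below. The host, VIP, and credential values are placeholders, only one host is shown, and the exact way the `DualReplicaTopology` gate is surfaced is not finalized by this proposal.

```bash
# Hypothetical install-config.yaml fragment (other required fields omitted).
cat > install-config.yaml <<'EOF'
apiVersion: v1
metadata:
  name: two-node
controlPlane:
  name: master
  replicas: 2
compute:
  - name: worker
    replicas: 0
featureSet: CustomNoUpgrade
featureGates:
  - DualReplicaTopology=true
platform:
  baremetal:
    apiVIPs:
      - 192.168.111.5
    ingressVIPs:
      - 192.168.111.4
    hosts:
      - name: master-0
        role: master
        bmc:
          address: redfish-virtualmedia://192.168.111.20/redfish/v1/Systems/1
          username: admin
          password: changeme
EOF
```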
From 481fee2be943b7a21b15216f496c3fbcbe36124e Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Tue, 15 Oct 2024 13:30:21 +1100 Subject: [PATCH 19/49] Update enhancements/two-nodes-openshift/2no.md Co-authored-by: Zane Bitter --- enhancements/two-nodes-openshift/2no.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 52992d98e9..d8c4a160f5 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -198,7 +198,7 @@ RHEL-HA has no real understanding of the resources (IP addresses, file systems, It relies on resource agents to understand how to check the state of a resource, as well as start and stop them to achieve the desired target state. How a given agent uses these actions, and associated states, to model the resource is opaque to the cluster and depends on the needs of the underlying resource. -Agents must conform to one of a variety of standards, including systemd, SYS-V, and OCF. +Resource agents must conform to one of a variety of standards, including systemd, SYS-V, and OCF. The latter being the most powerful, adding the concept of promotion, and demotion. More information on creating OCF agents can be found in the upstream [developer guide](https://github.com/ClusterLabs/resource-agents/blob/main/doc/dev-guides/ra-dev-guide.asc). From bc0caca89a0bdebb3be252e6295d6e65f00d955e Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Tue, 15 Oct 2024 13:49:16 +1100 Subject: [PATCH 20/49] Highlight the need to collect passwords and fix the approvers list --- enhancements/two-nodes-openshift/2no.md | 16 ++-------------- 1 file changed, 2 insertions(+), 14 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index d8c4a160f5..5d85f4f14a 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -15,24 +15,12 @@ reviewers: - "@eranco74" - "@yuqi-zhang" - "@gamado" - - "@razo7" - "@frajamomo" - "@clobrano" - approvers: - - "@rwsu" - - "@fabbione" - - "@carbonin" - "@thomasjungblut" - - "@brandisher" - - "@DanielFroehlich" - - "@jerpeter1" - - "@slintes" - - "@beekhof" - - "@eranco74" - - "@yuqi-zhang" api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None" - - "@jerpeter1" + - "@deads2k" creation-date: 2024-09-05 last-updated: 2024-09-22 tracking-link: @@ -129,7 +117,7 @@ Creation of a two node control-plane will be possible via the core installer (wi In the case of the core OpenShift installer, the user-facing proceedure is unchanged from a standard "IPI" installation, other than the configuration of 2 nodes instead of 3. Internally, the RedFish details for each node will need to make their way into the RHEL-HA configuration, but this is information already required for bare-metal hosts. -In the case of the Assisted Installer, the user-facing proceedure follows the standard flow except for the configuration of 2 nodes instead of 3, and the collection of RedFish details for each node which are needed for the RHEL-HA configuration. +In the case of the Assisted Installer, the user-facing proceedure follows the standard flow except for the configuration of 2 nodes instead of 3, and the collection of RedFish details (including passwords!) for each node which are needed for the RHEL-HA configuration. 
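To illustrate where the collected RedFish details end up, the fencing side of the RHEL-HA configuration could look roughly like the sketch below. `fence_redfish` and the `pcmk_*` attributes are standard RHEL-HA pieces; the addresses, credentials, and delay value are placeholders, and the real configuration is generated by the setup entity rather than typed by the user.

```bash
# Illustrative only: one stonith device per node, built from the per-node
# RedFish (BMC) details gathered at install time.
pcs stonith create fence-master-0 fence_redfish \
    ip=192.168.111.20 username=admin password=changeme ssl_insecure=1 \
    pcmk_host_list=master-0 \
    pcmk_delay_base=15s    # static delay so that, in a fencing race, master-0 survives

pcs stonith create fence-master-1 fence_redfish \
    ip=192.168.111.21 username=admin password=changeme ssl_insecure=1 \
    pcmk_host_list=master-1
```

Giving only one device a delay is what prevents the parallel-fencing outage described elsewhere in this proposal, at the cost of a slightly longer recovery when the delayed node is the one that must be fenced.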
Everything else about cluster creation will be an opaque implementation detail not exposed to the user. From 75f8dc784fcf1d366c8076bc5dac74089cda3dc2 Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Tue, 15 Oct 2024 13:54:07 +1100 Subject: [PATCH 21/49] The use of booleans is frowned upon --- enhancements/two-nodes-openshift/2no.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 5d85f4f14a..47e122091f 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -156,7 +156,7 @@ Once the cluster has two members, the etcd daemon will be removed from the stati At this point, the Cluster Etcd Operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled. This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the CEO's `ConfigMap` - which can only succeed if the control-plane is healthy. -To enable this flow, we propose the addition of a `externallyManagedEtcd` field which defaults to `False`, and will only be respected if the `Infrastructure` CR's `TopologyMode` is `DualReplicaTopologyMode`. +To enable this flow, we propose the addition of a `managedEtcdKind` field which defaults to `Cluster` but will be set to `External` during installation, and will only be respected if the `Infrastructure` CR's `TopologyMode` is `DualReplicaTopologyMode`. This will allow the use of a credential scoped to `ConfigMap`s in the `openshift-etcd-operator` namespace, to make the change. ### Topology Considerations From 744dc8b80aef04f9b978d4b82097e4d65888620e Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Wed, 16 Oct 2024 10:26:46 +1100 Subject: [PATCH 22/49] Update enhancements/two-nodes-openshift/2no.md Co-authored-by: Joel Speed --- enhancements/two-nodes-openshift/2no.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 47e122091f..a8452f0c7d 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -114,7 +114,7 @@ When starting etcd, the OCF script will use etcd's cluster ID and version counte Creation of a two node control-plane will be possible via the core installer (with an additional bootstrap node), and via the Assisted Installer (without an additional bootstrap node). -In the case of the core OpenShift installer, the user-facing proceedure is unchanged from a standard "IPI" installation, other than the configuration of 2 nodes instead of 3. +In the case of the core OpenShift installer, the user-facing procedure is unchanged from a standard "IPI" installation, other than the configuration of 2 nodes instead of 3. Internally, the RedFish details for each node will need to make their way into the RHEL-HA configuration, but this is information already required for bare-metal hosts. In the case of the Assisted Installer, the user-facing proceedure follows the standard flow except for the configuration of 2 nodes instead of 3, and the collection of RedFish details (including passwords!) for each node which are needed for the RHEL-HA configuration. 
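The hand-off to CEO described above could then be as small as a single, tightly scoped API call. The ConfigMap name below is a placeholder; the proposal only states that the field lives in a CEO-owned `ConfigMap` and is ignored outside the `DualReplica` topology.

```bash
# Illustrative sketch: the entity that configures RHEL-HA flips the CEO field
# once both etcd members exist. The ConfigMap name is hypothetical; the
# credential only needs write access to ConfigMaps in this namespace.
oc -n openshift-etcd-operator patch configmap two-node-etcd-config \
    --type merge -p '{"data":{"managedEtcdKind":"External"}}'
```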
From ab0aef038d4fb3b64e5a0ca21dbeb35c0ecd12cc Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Wed, 16 Oct 2024 10:52:37 +1100 Subject: [PATCH 23/49] Update enhancements/two-nodes-openshift/2no.md Co-authored-by: Joel Speed --- enhancements/two-nodes-openshift/2no.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index a8452f0c7d..15b3c6cd25 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -117,7 +117,7 @@ Creation of a two node control-plane will be possible via the core installer (wi In the case of the core OpenShift installer, the user-facing procedure is unchanged from a standard "IPI" installation, other than the configuration of 2 nodes instead of 3. Internally, the RedFish details for each node will need to make their way into the RHEL-HA configuration, but this is information already required for bare-metal hosts. -In the case of the Assisted Installer, the user-facing proceedure follows the standard flow except for the configuration of 2 nodes instead of 3, and the collection of RedFish details (including passwords!) for each node which are needed for the RHEL-HA configuration. +In the case of the Assisted Installer, the user-facing procedure follows the standard flow except for the configuration of 2 nodes instead of 3, and the collection of RedFish details (including passwords!) for each node which are needed for the RHEL-HA configuration. Everything else about cluster creation will be an opaque implementation detail not exposed to the user. From cf8f12e6e91a25f85b6073bd2a461ec556f31d9c Mon Sep 17 00:00:00 2001 From: Andrew Beekhof Date: Wed, 16 Oct 2024 11:41:31 +1100 Subject: [PATCH 24/49] Small points of clarification based on review questions --- enhancements/two-nodes-openshift/2no.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 15b3c6cd25..4c61d716d6 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -93,7 +93,7 @@ Use the RHEL-HA stack (Corosync, and Pacemaker), which has been used to delivere etcd will run as as a voting member on both nodes. We will take advantage of RHEL-HA's native support for systemd and re-use the standard cri-o and kublet units, as well as create a new Open Cluster Framework (OCF) script for etcd. The existing startup order of cri-o, then kubelet, then etcd will be preserved. -The `etcdctl`, `etcd-metrics`, and `etcd-readyz` containers will remain part of the static pod. +The `etcdctl`, `etcd-metrics`, and `etcd-readyz` containers will remain part of the static pod, the contents of which remains under the exclusive control of the Cluster Etcd Operator (CEO). Use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fence) an unreachable peer and ensure that it can do no harm while the remaining node continues. @@ -104,7 +104,8 @@ Upon a network failure, the RHEL-HA components ensure that exactly one node will In both cases, the control-plane's dependance on etcd will cause it to respond with errors until etcd has been restarted. Upon rebooting, the RHEL-HA components ensure that a node remains inert (not running cri-o, kubelet, or etcd) until it sees it's peer. 
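The "remain inert until the peer is seen" behaviour, and the admin confirmation that overrides it, map onto standard Corosync and Pacemaker mechanisms. The fragment below is a sketch under that assumption; the confirmation wrapper this proposal mentions is not yet designed.

```bash
# /etc/corosync/corosync.conf (fragment, illustrative):
#   quorum {
#       provider: corosync_votequorum
#       two_node: 1       # allow the cluster to keep quorum with a single surviving node
#       wait_for_all: 1   # after (re)boot, stay inert until the peer has been seen
#   }

# Admin confirmation on the surviving node (for example invoked over SSH by an
# external script) cancels the wait for the missing peer:
pcs quorum unblock
```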
-If the peer is likely to remain offline for an extended period of time, admin confirmation is required to allow the node to start OpenShift. +If the failed peer is likely to remain offline for an extended period of time, admin confirmation is required on the remaining node to allow it to start OpenShift. +The functionality exists within RHEL-HA, but a wrapper will be provided to take care of the details. When starting etcd, the OCF script will use etcd's cluster ID and version counter to determine whether the existing data directory can be reused, or must be erased before joining an active peer. @@ -112,12 +113,10 @@ When starting etcd, the OCF script will use etcd's cluster ID and version counte #### Cluster Creation -Creation of a two node control-plane will be possible via the core installer (with an additional bootstrap node), and via the Assisted Installer (without an additional bootstrap node). +User creation of a two node control-plane will be possible via the Assisted Installer. -In the case of the core OpenShift installer, the user-facing procedure is unchanged from a standard "IPI" installation, other than the configuration of 2 nodes instead of 3. -Internally, the RedFish details for each node will need to make their way into the RHEL-HA configuration, but this is information already required for bare-metal hosts. - -In the case of the Assisted Installer, the user-facing procedure follows the standard flow except for the configuration of 2 nodes instead of 3, and the collection of RedFish details (including passwords!) for each node which are needed for the RHEL-HA configuration. +The procedure follows the standard flow except for the configuration of 2 nodes instead of 3, and the collection of RedFish details (including passwords!) for each node which are needed for the RHEL-HA configuration. +If available via the SaaS offering (not confirmed, ZTP may be the target), the offering will need to ensure passwords are appropriately handled. Everything else about cluster creation will be an opaque implementation detail not exposed to the user. @@ -144,7 +143,8 @@ There are two related but ultimately orthogonal capabilities that may require AP #### Unique Topology A mechanism is needed for components of the cluster to understand that this is a 2 node control-plane topology which may require different handling. -We will define a new value for the `TopologyMode` enum: `DualReplicaTopologyMode`. +We will define a new value for the `TopologyMode` enum: `DualReplica`. +The enum is used for the `controlPlaneTopology` and `infrastructureTopology` fields, and the currently supported values are `HighlyAvailable`, `SingleReplica` and `External`. However `TopologyMode` is not available at the point the Agent Based Installer (ABI) performs validation. We will therefore additionally define a new feature gate `DualReplicaTopology` that can be enabled in `install-config.yaml`, and which ABI can use to validate the proposed cluster - such as the proposed node count. From f9683a7d8f89aa1aa321193ebe1d337f4c3f3edd Mon Sep 17 00:00:00 2001 From: Jeremy Poulin Date: Fri, 18 Oct 2024 13:45:00 -0400 Subject: [PATCH 25/49] Added jaypoulz to authors and minor updates to wording and grammar. In this commit, I ran the proposal through grammarly and fixed the more glaring grammatical issues a few of the stylistic ones. I also softened the language for a few statements. 
--- enhancements/two-nodes-openshift/2no.md | 110 ++++++++++++------------ 1 file changed, 57 insertions(+), 53 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 4c61d716d6..3b4d4d4964 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -2,6 +2,7 @@ title: 2no authors: - "@mshitrit" + - "@jaypoulz" reviewers: - "@rwsu" - "@fabbione" @@ -19,6 +20,7 @@ reviewers: - "@clobrano" approvers: - "@thomasjungblut" + - "@jerpeter1" api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None" - "@deads2k" creation-date: 2024-09-05 @@ -31,11 +33,11 @@ tracking-link: ## Terms -RHEL-HA - a general purpose clustering stack shipped by Red Hat (and others) primarily consisting of Corosync and Pacemaker. Known to be in use by airports, financial exchanges, and defense organizations, as well as used on trains, satellites, and expeditions to Mars. +RHEL-HA - a general-purpose clustering stack shipped by Red Hat (and others) primarily consisting of Corosync and Pacemaker. Known to be in use by airports, financial exchanges, and defense organizations, as well as used on trains, satellites, and expeditions to Mars. Corosync - a Red Hat led [open-source project](https://corosync.github.io/corosync/) that provides a consistent view of cluster membership, reliable ordered messaging, and flexible quorum capabilities. -Pacemaker - a Red Hat led [open-source project](https://clusterlabs.org/pacemaker/doc/) that works in conjunction with Corosync to provide general purpose fault tolerance and automatic failover for critical services and applications. +Pacemaker - a Red Hat led [open-source project](https://clusterlabs.org/pacemaker/doc/) that works in conjunction with Corosync to provide general-purpose fault tolerance and automatic failover for critical services and applications. Fencing - the process of “somehow” isolating or powering off malfunctioning or unresponsive nodes to prevent them from causing further harm, such as data corruption or the creation of divergent datasets. @@ -43,68 +45,70 @@ Quorum - having the minimum number of members required for decision-making. The * C-quorum: quorum as determined by Corosync members and algorithms * E-quorum: quorum as determined by etcd members and algorithms -Split-brain - a scenario where a set of peers are separated into groups smaller than the quorum threshold AND peers decide to host services already running by other groups. Typically results in data loss or corruption. +Split-brain - a scenario where a set of peers are separated into groups smaller than the quorum threshold AND peers decide to host services already running in other groups. Typically results in data loss or corruption. -MCO - Machine Config Operator. This operator manages updates to node's systemd, cri-o/kubelet, kernel, NetworkManager, etc., and can write custom files to it, configurable by MachineConfig custom resources. +MCO - Machine Config Operator. This operator manages updates to the node's systemd, cri-o/kubelet, kernel, NetworkManager, etc., and can write custom files to it, configurable by MachineConfig custom resources. ABI - Agent-Based Installer. 
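Because C-quorum and E-quorum are tracked by different subsystems, triage will typically need to look at both. The commands below are standard Corosync and etcd tooling, shown only to illustrate the distinction; where `etcdctl` runs and how it is given endpoints and certificates is left open here.

```bash
# C-quorum: membership and quorum as Corosync sees it (run on a node).
corosync-quorumtool -s

# E-quorum: membership and leadership as etcd sees it (run wherever etcdctl is
# available, e.g. inside the etcdctl container mentioned in this proposal).
etcdctl endpoint status --cluster -w table
etcdctl member list -w table
```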
 ## Summary
 
-Leverage traditional high-availability concepts and technologies to provide a container management solution suitable for customers with numerous geographically dispersed locations that has a minimal footprint but remains resilient to single node-level failures.
+Leverage traditional high-availability concepts and technologies to provide a container management solution that has a minimal footprint but remains resilient to single node-level failures, suitable for customers with numerous geographically dispersed locations.
 
 ## Motivation
 
-Customers with hundreds, or even tens-of-thousands, of geographically dispersed locations are asking for a container management solution that retains some level of resilience to node level failures, but does not come with a traditional three-node footprint and/or price tag.
+Customers with hundreds, or even tens of thousands, of geographically dispersed locations are asking for a container management solution that retains some level of resilience to node-level failures but does not come with a traditional three-node footprint and/or price tag.
 The need for some level of fault tolerance prevents the applicability of Single Node OpenShift (SNO), and a converged 3-node cluster is cost prohibitive at the scale of retail and telcos - even when the third node is a "cheap" one that doesn't run workloads.
 
 The benefits of the cloud-native approach to developing and deploying applications are increasingly being adopted in edge computing.
-This requires our solution to provide a management experience consistent with "normal" OpenShift deployments, and be compatible with the full ecosystem of Red Hat and partner workloads designed for OpenShift.
+This requires our solution to provide a management experience consistent with "normal" OpenShift deployments and be compatible with the full ecosystem of Red Hat and partner workloads designed for OpenShift.
 
 ### User Stories
 
 * As a large enterprise with multiple remote sites, I want a cost-effective OpenShift cluster solution so that I can manage containers without the overhead of a third node.
 * As a support engineer, I want a safe and automated method for handling the failure of a single node so that the downtime of the control-plane is minimized.
+* As an enterprise running workloads on a minimal OpenShift footprint, I want to minimize time-to-recovery and data loss for my workloads when a node fails.
 
 ### Goals
 
 * Provide a two-node control-plane for physical hardware that is resilient to a node-level failure for either node
-* Provide a transparent installation experience that starts with exactly 2 blank physical nodes, and ends with a fault-tolerant two node cluster
+* Provide a transparent installation experience that starts with exactly 2 blank physical nodes, and ends with a fault-tolerant two-node cluster
 * Prevent both data corruption and divergent datasets in etcd
-* Maintain the existing level of availability. Eg. by avoiding fencing loops, wherein each node powers cycles it's peer after booting, reducing the cluster's availability.
+* Recover the API server in less than 120s, as measured by the surviving node's detection of a failure +* Minimize any differences to existing OpenShift topologies * Avoid any decisions that would prevent future implementation and support for upgrade/downgrade paths between two-node and traditional architectures -* Provide an OpenShift cluster experience that is identical to that of a 3-node hyperconverged cluster, but with 2 nodes +* Provide an OpenShift cluster experience that is similar to that of a 3-node hyperconverged cluster but with 2 nodes ### Non-Goals * Workload resilience - see related [Pre-DRAFT enhancement](https://docs.google.com/document/d/1TDU_4I4LP6Z9_HugeC-kaQ297YvqVJQhBs06lRIC9m8/edit) * Resilient storage - see future enhancement -* Support for platforms other than bare metal including automated ci testing +* Support for platforms other than bare metal including automated CI testing * Support for other topologies (eg. hypershift) * Adding worker nodes -* Creation RHEL-HA events and metrics for consumption by the OpenShift monitoring stack (Deferred to post-MVP) +* Creation of RHEL-HA events and metrics for consumption by the OpenShift monitoring stack (Deferred to post-MVP) +* Supporting upgrade/downgrade paths between two-node and other architectures (for initial release) ## Proposal -Use the RHEL-HA stack (Corosync, and Pacemaker), which has been used to delivered supported 2-node cluster experiences for multiple decades, to manage cri-o, kubelet, and the etcd daemon. +Use the RHEL-HA stack (Corosync, and Pacemaker), which has been used to deliver supported 2-node cluster experiences for multiple decades, to manage cri-o, kubelet, and the etcd daemon. etcd will run as as a voting member on both nodes. -We will take advantage of RHEL-HA's native support for systemd and re-use the standard cri-o and kublet units, as well as create a new Open Cluster Framework (OCF) script for etcd. +We will take advantage of RHEL-HA's native support for systemd and re-use the standard cri-o and kubelet units, as well as create a new Open Cluster Framework (OCF) script for etcd. The existing startup order of cri-o, then kubelet, then etcd will be preserved. -The `etcdctl`, `etcd-metrics`, and `etcd-readyz` containers will remain part of the static pod, the contents of which remains under the exclusive control of the Cluster Etcd Operator (CEO). +The `etcdctl`, `etcd-metrics`, and `etcd-readyz` containers will remain part of the static pod, the contents of which remain under the exclusive control of the Cluster Etcd Operator (CEO). Use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fence) an unreachable peer and ensure that it can do no harm while the remaining node continues. -Upon a peer failure, the RHEL-HA components on the surivor will fence the peer and use the OCF script to restart etcd as a new cluster-of-one. +Upon a peer failure, the RHEL-HA components on the survivor will fence the peer and use the OCF script to restart etcd as a new cluster-of-one. -Upon a network failure, the RHEL-HA components ensure that exactly one node will survive, fence it's peer, and use the OCF script to restart etcd as a new cluster-of-one. +Upon a network failure, the RHEL-HA components ensure that exactly one node will survive, fence its peer, and use the OCF script to restart etcd as a new cluster-of-one. -In both cases, the control-plane's dependance on etcd will cause it to respond with errors until etcd has been restarted. 
+In both cases, the control-plane's dependence on etcd will cause it to respond with errors until etcd has been restarted. -Upon rebooting, the RHEL-HA components ensure that a node remains inert (not running cri-o, kubelet, or etcd) until it sees it's peer. -If the failed peer is likely to remain offline for an extended period of time, admin confirmation is required on the remaining node to allow it to start OpenShift. +Upon rebooting, the RHEL-HA components ensure that a node remains inert (not running cri-o, kubelet, or etcd) until it sees its peer. +If the failed peer is likely to remain offline for an extended period, admin confirmation is required on the remaining node to allow it to start OpenShift. The functionality exists within RHEL-HA, but a wrapper will be provided to take care of the details. When starting etcd, the OCF script will use etcd's cluster ID and version counter to determine whether the existing data directory can be reused, or must be erased before joining an active peer. @@ -113,24 +117,24 @@ When starting etcd, the OCF script will use etcd's cluster ID and version counte #### Cluster Creation -User creation of a two node control-plane will be possible via the Assisted Installer. +User creation of a two-node control-plane will be possible via the Assisted Installer. The procedure follows the standard flow except for the configuration of 2 nodes instead of 3, and the collection of RedFish details (including passwords!) for each node which are needed for the RHEL-HA configuration. If available via the SaaS offering (not confirmed, ZTP may be the target), the offering will need to ensure passwords are appropriately handled. Everything else about cluster creation will be an opaque implementation detail not exposed to the user. -#### Day 2 Proceedures +#### Day 2 Procedures As per a standard 3-node control-plane, OpenShift upgrades and `MachineConfig` changes can not be applied when the cluster is in a degraded state. Such operations will only proceed when both peers are online and healthy. The experience of managing a 2-node control-plane should be largely indistinguishable from that of a 3-node one. -The primary exception is (re)booting one of the peers while the other is offline, and expected to remain so. +The primary exception is (re)booting one of the peers while the other is offline and expected to remain so. As in a 3-node control-plane cluster, starting only one node is not expected to result in a functioning cluster. Should the admin wish for the control-plane to start, the admin will need to execute a supplied confirmation command on the active cluster node. -This command will grant quorum to the RHEL-HA components, authorizing it to fence it's peer and start etcd in as a cluster-of-one read/write mode. +This command will grant quorum to the RHEL-HA components, authorizing it to fence its peer and start etcd as a cluster-of-one in read/write mode. Confirmation can be given at any point and optionally make use of SSH to facilitate initiation by an external script. ### API Extensions @@ -138,20 +142,20 @@ Confirmation can be given at any point and optionally make use of SSH to facilit There are two related but ultimately orthogonal capabilities that may require API extensions. 1. Identify the cluster as having a unique topology -2. Tell CEO when it is safe for it to disable certain membership related functionalities +2. 
Tell CEO when it is safe for it to disable certain membership-related functionalities #### Unique Topology -A mechanism is needed for components of the cluster to understand that this is a 2 node control-plane topology which may require different handling. +A mechanism is needed for components of the cluster to understand that this is a 2-node control-plane topology which may require different handling. We will define a new value for the `TopologyMode` enum: `DualReplica`. -The enum is used for the `controlPlaneTopology` and `infrastructureTopology` fields, and the currently supported values are `HighlyAvailable`, `SingleReplica` and `External`. +The enum is used for the `controlPlaneTopology` and `infrastructureTopology` fields, and the currently supported values are `HighlyAvailable`, `SingleReplica`, and `External`. However `TopologyMode` is not available at the point the Agent Based Installer (ABI) performs validation. We will therefore additionally define a new feature gate `DualReplicaTopology` that can be enabled in `install-config.yaml`, and which ABI can use to validate the proposed cluster - such as the proposed node count. #### CEO Trigger -Initially the creation of an etcd cluster will be driven in the same way as other platforms. +Initially, the creation of an etcd cluster will be driven in the same way as other platforms. Once the cluster has two members, the etcd daemon will be removed from the static pod definition and recreated as a resource controlled by RHEL-HA. At this point, the Cluster Etcd Operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled. This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the CEO's `ConfigMap` - which can only succeed if the control-plane is healthy. @@ -161,7 +165,7 @@ This will allow the use of a credential scoped to `ConfigMap`s in the `openshift ### Topology Considerations -2NO represents a new topology, and is not appropriate for use with HyperShift, SNO, or MicroShift +2NO represents a new topology and is not appropriate for use with HyperShift, SNO, or MicroShift #### Standalone Clusters @@ -170,16 +174,16 @@ TODO: Exactly what is the definition of a standalone cluster? Disconnected? Ph ### Implementation Details/Notes/Constraints -While the target installation requires exactly 2 nodes, this will be achieved by building support in the core installer for a "bootstrap plus 2 nodes" flow, and then using Assisted Installer's ability to bootstrap-in-place to remove the requirement for a bootstrap node. +While the target installation requires exactly 2 nodes, this will be achieved by building support in the core installer for a "bootstrap plus 2 nodes" flow and then using the Assisted Installer's ability to bootstrap-in-place to remove the requirement for a bootstrap node. The delivery of RHEL-HA components will be opaque to the user and be delivered as an [MCO Extension](../rhcos/extensions.md) in the 4.18 and 4.19 timeframes. A switch to [MCO Layering](../ocp-coreos-layering/ocp-coreos-layering.md ) will be investigated once it is GA in a shipping version of OpenShift. -Configuration of the RHEL-HA components will be via one or more `MachineConfig`s, and will require RedFish details to have been collected by the installer. -Sensible defaults will be chosen where possible, and user customization only where absolutely necessary. 
+Configuration of the RHEL-HA components will be via one or more `MachineConfig`s and will require RedFish details to have been collected by the installer. +Sensible defaults will be chosen where possible, and user customization only where necessary. The entity (likely a one-shot systemd job as part of a `MachineConfig`) that configures RHEL-HA will also configure a fencing priority. -This is usually done based on the sort-order a piece of shared info (such as IP or node name). +This is usually done based on the sort order of a piece of shared info (such as IP or node name). The priority takes the form of a delay, usually in the order of 10s of seconds, and is used to prevent parallel fencing operations during a primary-network outage where each side powers off the other - resulting in a total cluster outage. RHEL-HA has no real understanding of the resources (IP addresses, file systems, databases, even virtual machines) it manages. @@ -187,7 +191,7 @@ It relies on resource agents to understand how to check the state of a resource, How a given agent uses these actions, and associated states, to model the resource is opaque to the cluster and depends on the needs of the underlying resource. Resource agents must conform to one of a variety of standards, including systemd, SYS-V, and OCF. -The latter being the most powerful, adding the concept of promotion, and demotion. +The latter is the most powerful, adding the concept of promotion, and demotion. More information on creating OCF agents can be found in the upstream [developer guide](https://github.com/ClusterLabs/resource-agents/blob/main/doc/dev-guides/ra-dev-guide.asc). Tools for extracting support information (must-gather tarballs) will be updated to gather relevant logs for triaging issues. @@ -196,7 +200,7 @@ Tools for extracting support information (must-gather tarballs) will be updated 1. Cold Boot 1. One node (Node1) boots - 2. Node1 does have “corosync quorum” (C-quorum) (requires forming a membership with it’s peer) + 2. Node1 does have “corosync quorum” (C-quorum) (requires forming a membership with its peer) 3. Node1 does not start etcd or kubelet, remains inert waiting for Node2 4. Peer (Node2) boots 5. Corosync membership containing both nodes forms @@ -216,12 +220,12 @@ Tools for extracting support information (must-gather tarballs) will be updated 6. Cluster continues with no redundancy 7. … time passes … 8. Node2 boots - persistent network failure - * Node2 does not have C-quorum (requires forming a membership with it’s peer) + * Node2 does not have C-quorum (requires forming a membership with its peer) * Node2 does not start etcd or kubelet, remains inert waiting for Node1 9. Network is repaired 10. Corosync membership containing both nodes forms 11. Pacemaker “starts” etcd on Node2 as a follower of Node1 - 12. Pacemaker “promotes” etcd on Node2 as full replica of Node1 + 12. Pacemaker “promotes” etcd on Node2 as a full replica of Node1 13. Pacemaker starts kubelet 14. Cluster continues with 1+1 redundancy 3. Node Failure @@ -232,12 +236,12 @@ Tools for extracting support information (must-gather tarballs) will be updated 5. Cluster continues with no redundancy 6. … time passes … 7. Node2 has a persistent failure that prevents communication with Node1 - * Node2 does not have C-quorum (requires forming a membership with it’s peer) + * Node2 does not have C-quorum (requires forming a membership with its peer) * Node2 does not start etcd or kubelet, remains inert waiting for Node1 8. 
Persistent failure on Node2 is repaired 9. Corosync membership containing both nodes forms 10. Pacemaker “starts” etcd on Node2 as a follower of Node1 - 11. Pacemaker “promotes” etcd on Node2 as full replica of Node1 + 11. Pacemaker “promotes” etcd on Node2 as a full replica of Node1 12. Pacemaker starts kubelet 13. Cluster continues with 1+1 redundancy 4. Two Failures @@ -251,8 +255,8 @@ Tools for extracting support information (must-gather tarballs) will be updated 8. … time passes … 9. Node1 Power restored 10. Node1 boots but can not gain quorum before Node2 joins the cluster due to a risk of fencing loop - * Mitigation (Phase 1): manual intervention (possibly a script) in case admin can guarantee Node2 is down, which will grant Node1 quorum and restore cluster limited (none HA) functionality. - * Mitigation (Phase 2): limited automatic intervention for some use cases: for example Node1 will gain quorum only if Node2 can be verified to be down by successfully querying its BMC status. + * Mitigation (Phase 1): manual intervention (possibly a script) in case the admin can guarantee Node2 is down, which will grant Node1 quorum and restore cluster limited (none HA) functionality. + * Mitigation (Phase 2): limited automatic intervention for some use cases: for example, Node1 will gain quorum only if Node2 can be verified to be down by successfully querying its BMC status. 5. Kubelet Failure 1. Pacemaker’s monitoring detects the failure 2. Pacemaker restarts kubelet @@ -276,11 +280,11 @@ Tools for extracting support information (must-gather tarballs) will be updated 1. Risk: Rebooting the surviving peer would require human intervention before the cluster starts, increasing downtime and creating an admin burden at remote sites 1. Mitigation: Lifecycle events, such as upgrades and applying new `MachineConfig`s, are not permitted in a single-node degraded state - 1. Mitigation: Usage of the MCO Admin Defined Node Disruption [feature](https://github.com/openshift/enhancements/pull/1525) will futher reduce the need for reboots. + 1. Mitigation: Usage of the MCO Admin Defined Node Disruption [feature](https://github.com/openshift/enhancements/pull/1525) will further reduce the need for reboots. 1. Mitigation: The node will be reachable via SSH and the confirmation can be scripted 1. Mitigation: It may be possible to identify scenarios where, for a known hardware topology, it is safe to allow the node to proceed automatically. -1. Risk: “Something changed, lets reboot” is somewhat baked into OCP’s DNA and has the potential to be problematic when nodes are actively watching for their peer to disappear, and have an obligation to promptly act on that disappearance by power cycling them. +1. Risk: “Something changed, let's reboot” is somewhat baked into OCP’s DNA and has the potential to be problematic when nodes are actively watching for their peer to disappear, and have an obligation to promptly act on that disappearance by power cycling them. 1. Mitigation: Identify causes of reboots, and either avoid them or ensure they are not treated as failures. This may require an additional enhancement. @@ -321,7 +325,7 @@ See template for guidelines/instructions. ## Upgrade / Downgrade Strategy -In-place upgrades and downgrades will not be supported for this first iteration, and will be addressed as a separate feature in another enhancement. Upgrades will initially only be achieved by redeploying the machine and its workload. 
+In-place upgrades and downgrades will not be supported for this first iteration and will be addressed as a separate feature in another enhancement. Upgrades will initially only be achieved by redeploying the machine and its workload. ## Version Skew Strategy @@ -332,16 +336,16 @@ Consider the following in developing a version skew strategy for this enhancement: - During an upgrade, we will always have skew among components, how will this impact your work? - Does this enhancement involve coordinating behavior in the control-plane and - in the kubelet? How does an n-2 kubelet without this feature available behave + the kubelet? How does an n-2 kubelet without this feature available behave when this feature is used? -- Will any other components on the node change? For example, changes to CSI, CRI +- Will any other components on the node change? For example, changes to CSI, CRI, or CNI may require updating that component before the kubelet. ## Operational Aspects of API Extensions See template for guidelines/instructions. -- For conversion/admission webhooks and aggregated apiservers: what are the SLIs (Service Level +- For conversion/admission webhooks and aggregated API servers: what are the SLIs (Service Level Indicators) an administrator or support can use to determine the health of the API extensions N/A @@ -359,7 +363,7 @@ See template for guidelines/instructions. - Describe how a failure or behaviour of the extension will impact the overall cluster health (e.g. which kube-controller-manager functionality will stop working), especially regarding - stability, availability, performance and security. + stability, availability, performance, and security. - Describe which OCP teams are likely to be called upon in case of escalation with one of the failure modes and add them as reviewers to this enhancement. @@ -379,7 +383,7 @@ Describe how to ## Alternatives -* MicroShift was considered as an alternative but it was ruled out because it does not support multi node has a very different experience then OpenShift which does not match the 2NO initiative which is on getting the OpenShift experience on two nodes +* MicroShift was considered as an alternative but it was ruled out because it does not support multi-node and has a very different experience than OpenShift which does not match the 2NO initiative which is on getting the OpenShift experience on two nodes * 2 SNO + KCP @@ -389,10 +393,10 @@ The main advantage of this approach is that it doesn’t require inventing a new Disadvantages: * Production readiness * KCP itself could become a single point of failure (need to configure pacemaker to manage KCP) -* KCP adds an additional layer of complexity to the architecture +* KCP adds additional complexity to the architecture ## Infrastructure Needed [optional] Use this section if you need things from the project. Examples include a new -subproject, repos requested, github details, and/or testing infrastructure. +subproject, repos requested, GitHub details, and/or testing infrastructure. From 08f4c247e9cd63ca6b80c30be5ba5a84fa639116 Mon Sep 17 00:00:00 2001 From: Jeremy Poulin Date: Fri, 18 Oct 2024 14:30:57 -0400 Subject: [PATCH 26/49] Updated 2NO enhancement with a collection of open questions. 
--- enhancements/two-nodes-openshift/2no.md | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 3b4d4d4964..aa774efb08 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -308,7 +308,22 @@ Satisfying this demand would come with significant technical and support overhea 1. Are there any normal lifecycle events that would be interpreted by a peer as a failure, and where the resulting "recovery" would create unnecessary downtime? How can these be avoided? -1. The relevance of disconnected installation/functions to the proposal. +2. Is there are requirement for disconnected cluster support in the initial release? +3. Are there consequences of changing the parentage of processes running cri-o, kubelet, and etcd? (E.g. user process limits) +4. In the test plan, which subset of layered products needs to be evaluated for the initial release (if any)? +5. How are the BMC credentials getting from the install-config and onto the nodes? +6. How does the cluster know it has achieved a ready state? + From the cluster's perspective, we know that CEO needs to have `managedEtcd: External` set, and all of the operators need to be available. However, there is a time delay between when that configuration is set by CEO and when etcd comes back up as health after it's restarted under the RHEL-HA components. Is a fixed wait enough to determine that we have successfully transitioned to the new topology? If not, how do we detect this? +7. Are there any scenarios that would require signaling from the cluster to the RHEL-HA components to modify or change their behavior? +8. Are there incompatibilities between the existing design and the function of the load balancer deployed through the BareMetalPlatform spec? +9. If both nodes fail, which one should come back first? +10. Which component is responsible for stopping etcd once CEO relinquishes control of it? When is it stopped? +11. Which installers will be supported for the initial release? + The current discussions around the installer point us towards ABI for the initial release. There also seems to be interest in making this available for ZTP for managed clusters. +12. Which platform specs will be available for this topology? + As discussed, we are currently targeting the BareMetalPlatform spec, but the load-balancing component needs to be evaluated for compatibility. +13. What happens if something fails during the initial setup of the RHEL-HA stack? How will this be communicated back to the user? + For example, what happens if the setup job fails and etcd is left running? From the perspective of the user and the cluster, this would be identical to etcd being stopped and restarted under the control of the RHEL-HA components. Nothing about the cluster knows about the external entity that owns etcd. ## Test Plan From f7ffc958850110d757a79a106434ab3b24af31cb Mon Sep 17 00:00:00 2001 From: Jeremy Poulin Date: Fri, 18 Oct 2024 15:36:22 -0400 Subject: [PATCH 27/49] Added initial 2NO test plan details. 
--- enhancements/two-nodes-openshift/2no.md | 34 ++++++++++++++++++++++++- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index aa774efb08..4f339e3789 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -330,7 +330,39 @@ Satisfying this demand would come with significant technical and support overhea **Note:** *Section not required until targeted at a release.* -See template for guidelines/instructions. +### CI +The initial release of 2NO should aim to build a regression baseline. + +| Type | Name | Description | +| ----- | ----------------------------- | --------------------------------------------------------------------------- | +| Job | End-to-End tests (e2e) | The standard test suite (openshift/conformance/parallel) for establishing a regression baseline between payloads. | +| Job | Upgrade between z-streams | The standard test suite for evaluating upgrade behavior between payloads. | +| Job | Upgrade between y-streams [^1] | The standard test suite for evaluating upgrade behavior between payloads. | +| Suite | 2NO Recovery | This is a new suite consisting of the tests listed below. | +| Test | Node failure [^2] | A new 2NO test to detect if the cluster recovers if a node crashes. | +| Test | Network failure [^2] | A new 2NO test to detect if the cluster recovers if the network is disrupted such that a node is unavailable. | +| Test | Kubelet failure [^2] | A new 2NO test to detect if the cluster recovers if kubelet fails. | +| Test | Etcd failure [^2] | A new 2NO test to detect if the cluster recovers if etcd fails. | + +[^1]: This will be added after the initial release when more than one minor version of OpenShift is compatible with the +topology. +[^2]: These tests will be designed to make a component on the *other* node fail. This should prevent the test pod from +being restarted mid-test. + +### QE +This section outlines test scenarios for 2NO. + +| Scenario | Description | +| ----------------------------- | ----------------------------------------------------------------------------------- | +| Payload install | A basic evaluation that the cluster installs on supported hardware. Should be run for each supported installation method. | +| Payload upgrade | A basic evaluation that the cluster can upgrade between releases. | +| Performance | Performance metrics are gathered and compared to SNO and Compact HA | +| Scalability | Scalability metrics are gathered and compared to SNO and Compact HA | +| Cold Boot | Verify that clusters can survive a cold boot event. | +| Both nodes crash | Verify that clusters can survive an event where both nodes become unavailable. | + +As noted above, there is an open question about how layered products should be treated in the test plan. +Additionally, it would be good to have workload-specific testing once those are defined by the workload proposal. 
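As a rough illustration of what the "Node failure" recovery test in the table above could look like, the sketch below powers a node off through its RedFish BMC and then polls the API server against the 120-second recovery goal stated in this proposal. The BMC address, credentials, and pass/fail thresholds are placeholders rather than a committed test design.

```bash
#!/usr/bin/env bash
# Sketch of a 2NO node-failure recovery test; assumes a RedFish BMC and a kubeconfig for the cluster.
set -euo pipefail

VICTIM_BMC="https://192.168.111.20"   # placeholder BMC address of the node being "failed"
BMC_CREDS="admin:not-a-real-password" # placeholder credentials

# Simulate a sudden node failure by forcing the victim off through its BMC.
curl -sk -u "${BMC_CREDS}" -X POST -H 'Content-Type: application/json' \
  -d '{"ResetType": "ForceOff"}' \
  "${VICTIM_BMC}/redfish/v1/Systems/1/Actions/ComputerSystem.Reset"

# The proposal targets API recovery in under 120s from failure detection; allow some slack for detection itself.
start=$(date +%s)
until oc --request-timeout=5s get nodes >/dev/null 2>&1; do
  if (( $(date +%s) - start > 180 )); then
    echo "FAIL: API server did not recover in time"
    exit 1
  fi
  sleep 5
done
echo "PASS: API server answered after $(( $(date +%s) - start ))s"
```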
## Graduation Criteria From 276248db1d19e672e877c133c56744ab2f4ec46d Mon Sep 17 00:00:00 2001 From: Michael Shitrit Date: Thu, 31 Oct 2024 10:06:45 +0200 Subject: [PATCH 28/49] Adding Ben Nemec as a reviewer from the networking perspective Signed-off-by: Michael Shitrit --- enhancements/two-nodes-openshift/2no.md | 1 + 1 file changed, 1 insertion(+) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 4f339e3789..6edc9beaf1 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -18,6 +18,7 @@ reviewers: - "@gamado" - "@frajamomo" - "@clobrano" + - "@cybertron" approvers: - "@thomasjungblut" - "@jerpeter1" From ddbcc6577a48214f6b86eedbe963693ae6522f02 Mon Sep 17 00:00:00 2001 From: Michael Shitrit Date: Thu, 31 Oct 2024 10:41:04 +0200 Subject: [PATCH 29/49] Adding mandatory sections for markdownlint job Signed-off-by: Michael Shitrit --- enhancements/two-nodes-openshift/2no.md | 44 ++++++++++++++++++++++++- 1 file changed, 43 insertions(+), 1 deletion(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 6edc9beaf1..86a2bb71fa 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -269,6 +269,21 @@ Tools for extracting support information (must-gather tarballs) will be updated 3. Stop failure is optionally escalated to a node failure (fencing) 4. Start failure defaults to leaving the service offline +#### Hypershift / Hosted Control Planes + +Are there any unique considerations for making this change work with +Hypershift? + +See https://github.com/openshift/enhancements/blob/e044f84e9b2bafa600e6c24e35d226463c2308a5/enhancements/multi-arch/heterogeneous-architecture-clusters.md?plain=1#L282 + +How does it affect any of the components running in the +management cluster? How does it affect any components running split +between the management cluster and guest cluster? + +#### Single-node Deployments or MicroShift + +How does this proposal affect the resource consumption of a +single-node OpenShift deployment (SNO), CPU and memory? ### Risks and Mitigations @@ -371,9 +386,36 @@ Additionally, it would be good to have workload-specific testing once those are See template for guidelines/instructions. +### Dev Preview -> Tech Preview + +- Ability to utilize the enhancement end to end +- End user documentation, relative API stability +- Sufficient test coverage +- Gather feedback from users rather than just developers +- Enumerate service level indicators (SLIs), expose SLIs as metrics +- Write symptoms-based alerts for the component(s) + +### Tech Preview -> GA + +- More testing (upgrade, downgrade, scale) +- Sufficient time for feedback +- Available by default +- Backhaul SLI telemetry +- Document SLOs for the component +- Conduct load testing +- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/) + +**For non-optional features moving to GA, the graduation criteria must include +end to end tests.** + +### Removing a deprecated feature + +- Announce deprecation and support policy of the existing feature +- Deprecate the feature + ## Upgrade / Downgrade Strategy -In-place upgrades and downgrades will not be supported for this first iteration and will be addressed as a separate feature in another enhancement. Upgrades will initially only be achieved by redeploying the machine and its workload. 
+In-place upgrades and downgrades will not be supported for this first iteration, and will be addressed as a separate feature in another enhancement. Upgrades will initially only be achieved by redeploying the machine and its workload. ## Version Skew Strategy From 636f3e3512fffe7daa793a70cd48be457917019a Mon Sep 17 00:00:00 2001 From: Michael Shitrit Date: Sun, 3 Nov 2024 09:17:05 +0200 Subject: [PATCH 30/49] Small point of clarification based on review Signed-off-by: Michael Shitrit --- enhancements/two-nodes-openshift/2no.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 86a2bb71fa..c3ff0aae3a 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -205,7 +205,7 @@ Tools for extracting support information (must-gather tarballs) will be updated 3. Node1 does not start etcd or kubelet, remains inert waiting for Node2 4. Peer (Node2) boots 5. Corosync membership containing both nodes forms - 6. Pacemaker “starts” etcd on both nodes + 6. Pacemaker starts etcd on both nodes * Detail, this could be a “soft”-start which allows us to determine which node has the most recent dataset. 7. Pacemaker “promotes” etcd on whichever node has the most recent dataset 8. Pacemaker “promotes” etcd on the peer once it has caught up From d800dd5fcc8e395418b7a4e805dc5918f14ba70b Mon Sep 17 00:00:00 2001 From: Michael Shitrit Date: Tue, 5 Nov 2024 16:50:34 +0200 Subject: [PATCH 31/49] Removing some open questions based on PR feedback Signed-off-by: Michael Shitrit Signed-off-by: Michael Shitrit --- enhancements/two-nodes-openshift/2no.md | 21 +++++++++------------ 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index c3ff0aae3a..bdf443d663 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -324,21 +324,18 @@ Satisfying this demand would come with significant technical and support overhea 1. Are there any normal lifecycle events that would be interpreted by a peer as a failure, and where the resulting "recovery" would create unnecessary downtime? How can these be avoided? -2. Is there are requirement for disconnected cluster support in the initial release? -3. Are there consequences of changing the parentage of processes running cri-o, kubelet, and etcd? (E.g. user process limits) -4. In the test plan, which subset of layered products needs to be evaluated for the initial release (if any)? -5. How are the BMC credentials getting from the install-config and onto the nodes? -6. How does the cluster know it has achieved a ready state? +2. Are there consequences of changing the parentage of processes running cri-o, kubelet, and etcd? (E.g. user process limits) +3. In the test plan, which subset of layered products needs to be evaluated for the initial release (if any)? +4. How are the BMC credentials getting from the install-config and onto the nodes? +5. How does the cluster know it has achieved a ready state? From the cluster's perspective, we know that CEO needs to have `managedEtcd: External` set, and all of the operators need to be available. However, there is a time delay between when that configuration is set by CEO and when etcd comes back up as health after it's restarted under the RHEL-HA components. Is a fixed wait enough to determine that we have successfully transitioned to the new topology? 
If not, how do we detect this? -7. Are there any scenarios that would require signaling from the cluster to the RHEL-HA components to modify or change their behavior? -8. Are there incompatibilities between the existing design and the function of the load balancer deployed through the BareMetalPlatform spec? -9. If both nodes fail, which one should come back first? -10. Which component is responsible for stopping etcd once CEO relinquishes control of it? When is it stopped? -11. Which installers will be supported for the initial release? +6. Are there any scenarios that would require signaling from the cluster to the RHEL-HA components to modify or change their behavior? +7. Are there incompatibilities between the existing design and the function of the load balancer deployed through the BareMetalPlatform spec? +8. Which installers will be supported for the initial release? The current discussions around the installer point us towards ABI for the initial release. There also seems to be interest in making this available for ZTP for managed clusters. -12. Which platform specs will be available for this topology? +9. Which platform specs will be available for this topology? As discussed, we are currently targeting the BareMetalPlatform spec, but the load-balancing component needs to be evaluated for compatibility. -13. What happens if something fails during the initial setup of the RHEL-HA stack? How will this be communicated back to the user? +10. What happens if something fails during the initial setup of the RHEL-HA stack? How will this be communicated back to the user? For example, what happens if the setup job fails and etcd is left running? From the perspective of the user and the cluster, this would be identical to etcd being stopped and restarted under the control of the RHEL-HA components. Nothing about the cluster knows about the external entity that owns etcd. From 1ba70e7463ed4cb59c0b5ed35182ff15ffc75a0a Mon Sep 17 00:00:00 2001 From: Michael Shitrit Date: Tue, 5 Nov 2024 16:51:38 +0200 Subject: [PATCH 32/49] Small point of clarification based on review Signed-off-by: Michael Shitrit --- enhancements/two-nodes-openshift/2no.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index bdf443d663..86ad8f9d91 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -318,7 +318,7 @@ single-node OpenShift deployment (SNO), CPU and memory? The two-node architecture represents yet another distinct install type for users to choose from. The existence of 1, 2, and 3+ node control-plane sizes will likely generate customer demand to move between them as their needs change. -Satisfying this demand would come with significant technical and support overhead. +Satisfying this demand would come with significant technical and support overhead which is out of scope for this enhancement. 
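On the ready-state question above, one alternative to a fixed wait is to poll for the observable effects of the hand-off: the CEO-rendered static pod definition disappearing and the Podman-managed etcd answering health checks. A minimal sketch follows, assuming placeholder container names and certificate paths that are not settled interface details.

```bash
#!/usr/bin/env bash
# Sketch: poll for completion of the etcd hand-off instead of assuming a fixed wait.
# Runs on a control-plane node; container name and certificate paths are illustrative assumptions.
set -euo pipefail

ETCD_CTR="etcd"                                          # assumed name of the Podman-managed etcd container
CERTS="/etc/kubernetes/static-pod-resources/etcd-certs"  # assumed certificate location

for _ in $(seq 1 60); do
  # Once CEO honours the externally managed etcd setting, its static pod manifest should be gone
  # and the resource-agent-managed etcd should be serving on the same endpoint.
  if [[ ! -e /etc/kubernetes/manifests/etcd-pod.yaml ]] && \
     podman exec "${ETCD_CTR}" etcdctl \
       --endpoints=https://localhost:2379 \
       --cacert="${CERTS}/ca.crt" --cert="${CERTS}/peer.crt" --key="${CERTS}/peer.key" \
       endpoint health >/dev/null 2>&1; then
    echo "etcd is serving under RHEL-HA control"
    exit 0
  fi
  sleep 10
done
echo "timed out waiting for the etcd hand-off"
exit 1
```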
## Open Questions [optional] From cd46f94e626b382bb71615dea804d086eb5c82d6 Mon Sep 17 00:00:00 2001 From: Michael Shitrit Date: Mon, 11 Nov 2024 09:56:08 +0200 Subject: [PATCH 33/49] Following the discussion updated support for disconnected cluster installation as a non-goal Signed-off-by: Michael Shitrit --- enhancements/two-nodes-openshift/2no.md | 1 + 1 file changed, 1 insertion(+) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 86ad8f9d91..1d631c9a0b 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -88,6 +88,7 @@ This requires our solution to provide a management experience consistent with "n * Resilient storage - see future enhancement * Support for platforms other than bare metal including automated CI testing * Support for other topologies (eg. hypershift) +* Support disconnected cluster installation * Adding worker nodes * Creation of RHEL-HA events and metrics for consumption by the OpenShift monitoring stack (Deferred to post-MVP) * Supporting upgrade/downgrade paths between two-node and other architectures (for initial release) From 2c5dbb9bb6cf093d50ea69957f3e76f8744d767a Mon Sep 17 00:00:00 2001 From: Jeremy Poulin Date: Wed, 20 Nov 2024 15:03:19 -0500 Subject: [PATCH 34/49] Updated details of installation and unanswered questions. In this update, I went over the installation section and provided firmer details around the installation strategy. I tried to highlight where there were questions yet to be answered. I also started fleshing out the HyperShift and Version Skew sections. --- enhancements/two-nodes-openshift/2no.md | 92 ++++++++++++------------- 1 file changed, 43 insertions(+), 49 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 1d631c9a0b..5f4abc7022 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -36,17 +36,17 @@ tracking-link: RHEL-HA - a general-purpose clustering stack shipped by Red Hat (and others) primarily consisting of Corosync and Pacemaker. Known to be in use by airports, financial exchanges, and defense organizations, as well as used on trains, satellites, and expeditions to Mars. -Corosync - a Red Hat led [open-source project](https://corosync.github.io/corosync/) that provides a consistent view of cluster membership, reliable ordered messaging, and flexible quorum capabilities. +Corosync - a Red Hat led [open-source project](https://corosync.github.io/corosync/) that provides a consistent view of cluster membership, reliable ordered messaging, and flexible quorum capabilities. Pacemaker - a Red Hat led [open-source project](https://clusterlabs.org/pacemaker/doc/) that works in conjunction with Corosync to provide general-purpose fault tolerance and automatic failover for critical services and applications. -Fencing - the process of “somehow” isolating or powering off malfunctioning or unresponsive nodes to prevent them from causing further harm, such as data corruption or the creation of divergent datasets. +Fencing - the process of “somehow” isolating or powering off malfunctioning or unresponsive nodes to prevent them from causing further harm, such as data corruption or the creation of divergent datasets. Quorum - having the minimum number of members required for decision-making. The most common threshold is 1 plus half the total number of members, though more complicated algorithms predicated on fencing are also possible. 
* C-quorum: quorum as determined by Corosync members and algorithms * E-quorum: quorum as determined by etcd members and algorithms -Split-brain - a scenario where a set of peers are separated into groups smaller than the quorum threshold AND peers decide to host services already running in other groups. Typically results in data loss or corruption. +Split-brain - a scenario where a set of peers are separated into groups smaller than the quorum threshold AND peers decide to host services already running in other groups. Typically, it results in data loss or corruption. MCO - Machine Config Operator. This operator manages updates to the node's systemd, cri-o/kubelet, kernel, NetworkManager, etc., and can write custom files to it, configurable by MachineConfig custom resources. @@ -79,7 +79,7 @@ This requires our solution to provide a management experience consistent with "n * Minimize recovery-caused unavailability. Eg. by avoiding fencing loops, wherein each node powers cycles its peer after booting, reducing the cluster's availability. * Recover the API server in less than 120s, as measured by the surviving node's detection of a failure * Minimize any differences to existing OpenShift topologies -* Avoid any decisions that would prevent future implementation and support for upgrade/downgrade paths between two-node and traditional architectures +* Avoid any decisions that would prevent future implementation and support for upgrade/downgrade paths between two-node and traditional architectures * Provide an OpenShift cluster experience that is similar to that of a 3-node hyperconverged cluster but with 2 nodes ### Non-Goals @@ -119,12 +119,24 @@ When starting etcd, the OCF script will use etcd's cluster ID and version counte #### Cluster Creation -User creation of a two-node control-plane will be possible via the Assisted Installer. +User creation of a two-node control-plane will be possible via the Assisted Installer. A key requirement is that the cluster can be deployed using only 2 nodes because requiring a third baremetal server for installation can be expensive when deploying baremetal at scale. To accomplish this, deployments will take advantage of the Assisted Installer's ability to use one of the target machines as the bootstrap node before it is rebooted into a control-plane node. There is a critical transition during this process, where to maintain etcd quorum, the bootstrap node will need to be removed from the etcd cluster before it is rebooted so that quorum can be maintained as the machine reboots into a second control-plane. -The procedure follows the standard flow except for the configuration of 2 nodes instead of 3, and the collection of RedFish details (including passwords!) for each node which are needed for the RHEL-HA configuration. -If available via the SaaS offering (not confirmed, ZTP may be the target), the offering will need to ensure passwords are appropriately handled. +Otherwise, the procedure follows the standard flow except for the configuration of 2 nodes instead of 3. At this time we've discussed the collection of RedFish details (including passwords!) for each node. This is needed for the RHEL-HA configuration by leveraging the BareMetalHost CRDs populated from the baremetal platform specification in the install-config. There are open questions on how to ensure that pacemaker is the only entity responsible for fencing to prevent conflicting requests to change the machine state between pacemaker and the baremetal operator. 
Preventing conflicting fencing logic is also important for optional operators like Node Health Check, Self Node Remediation, and Fence Agents Remediation, but these should not be present during installation. -Everything else about cluster creation will be an opaque implementation detail not exposed to the user. +An important facility of the installation flow is the transition from a CEO deployed etcd to one controlled by pacemaker. The basic transition works as follows: +1. MCO Extensions are used to ensure that pacemaker, corosync, and resource agents are pre-configured on CoreOS using installation manifests. +2. Upon detection that the cluster infrastructure is using the DualReplica controlPlaneTopology in the infrastructure config, an in-cluster entity (see open questions regarding whether this should be handled by CEO or an additional operator) will run a command on one of the cluster nodes to initialize pacemaker. The outcome of this is that the resource agent will be started on both nodes. +3. The aforementioned in-cluster entity will signal CEO to relinquish control of etcd by setting CEO's `managedEtcdKind` to `External`. When this happens, CEO removes the etcd pod from the static pod configs. The resource agents for etcd are running from step 2, and they are configured to wait for etcd pods to be gone so they can restart them using Podman. +4. The installation proceeds as normal once the pods start. +If for some reason, the etcd pods cannot be started, then the installation will fail. The installer will need to be able to pull logs from the control-plane nodes to provide context for this failure. + +Fencing setup is the last important aspect of the cluster installation. In order for the cluster installation to be successful, fencing should be configured and active before we declare the installation successful. Ideally, the fencing secrets should be made available to the control-plane nodes in the initial pacemaker initialization so that fencing can be configured during step 2. There are a few more critical open questions with this: +1. Should fencing be made active during the installation or should pacemaker start with it disabled and only enable it after being signaled by the in-cluster entity when the cluster installation is detected as successful? +2. What mechanism will pacemaker use to get access to the secret linked from the BareMetalHost CRD? + +If available via the SaaS offering (not confirmed), ZTP may be evaluated as a future offering. This will need further evaluation to ensure passwords are appropriately handled. + +Everything else about cluster creation will be an opaque implementation detail not exposed to the user. #### Day 2 Procedures @@ -135,7 +147,7 @@ The experience of managing a 2-node control-plane should be largely indistinguis The primary exception is (re)booting one of the peers while the other is offline and expected to remain so. As in a 3-node control-plane cluster, starting only one node is not expected to result in a functioning cluster. -Should the admin wish for the control-plane to start, the admin will need to execute a supplied confirmation command on the active cluster node. +Should the admin wish for the control-plane to start, the admin will need to execute a supplied confirmation command on the active cluster node. This command will grant quorum to the RHEL-HA components, authorizing it to fence its peer and start etcd as a cluster-of-one in read/write mode. 
Confirmation can be given at any point and optionally make use of SSH to facilitate initiation by an external script. @@ -152,8 +164,7 @@ A mechanism is needed for components of the cluster to understand that this is a We will define a new value for the `TopologyMode` enum: `DualReplica`. The enum is used for the `controlPlaneTopology` and `infrastructureTopology` fields, and the currently supported values are `HighlyAvailable`, `SingleReplica`, and `External`. -However `TopologyMode` is not available at the point the Agent Based Installer (ABI) performs validation. -We will therefore additionally define a new feature gate `DualReplicaTopology` that can be enabled in `install-config.yaml`, and which ABI can use to validate the proposed cluster - such as the proposed node count. +We will additionally define a new feature gate `DualReplicaTopology` that can be enabled in `install-config.yaml` to ensure the feature can be set as `TechPreviewNoUpgrade`. #### CEO Trigger @@ -171,20 +182,21 @@ This will allow the use of a credential scoped to `ConfigMap`s in the `openshift #### Standalone Clusters -Is the change relevant for standalone clusters? -TODO: Exactly what is the definition of a standalone cluster? Disconnected? Physical hardware? +Two-node OpenShift is first and foremost a topology of OpenShift, so it should be able to run without any assumptions of a cluster manager. ### Implementation Details/Notes/Constraints -While the target installation requires exactly 2 nodes, this will be achieved by building support in the core installer for a "bootstrap plus 2 nodes" flow and then using the Assisted Installer's ability to bootstrap-in-place to remove the requirement for a bootstrap node. +While the target installation requires exactly 2 nodes, this will be achieved by proving out the "bootstrap plus 2 nodes" flow in the core installer and then using the Assisted Installer's ability to bootstrap from one of the target machines to remove the requirement for a bootstrap node. + +So far, we've discovered topology-sensitive logic in ingress, authentication, CEO, and the cluster-control-plane-machineset-operator. We expect to find others once we introduce the new infrastructure topology. The delivery of RHEL-HA components will be opaque to the user and be delivered as an [MCO Extension](../rhcos/extensions.md) in the 4.18 and 4.19 timeframes. A switch to [MCO Layering](../ocp-coreos-layering/ocp-coreos-layering.md ) will be investigated once it is GA in a shipping version of OpenShift. -Configuration of the RHEL-HA components will be via one or more `MachineConfig`s and will require RedFish details to have been collected by the installer. +Once installed, the configuration of the RHEL-HA components will be done via an in-cluster entity. This entity could be a dedicated in-cluster operator or a function of CEO triggering a script on one of the control-plane nodes. This initialization will require that RedFish details have been collected by the installer. Sensible defaults will be chosen where possible, and user customization only where necessary. -The entity (likely a one-shot systemd job as part of a `MachineConfig`) that configures RHEL-HA will also configure a fencing priority. +This RHEL-HA initialization script will also configure a fencing priority. This is usually done based on the sort order of a piece of shared info (such as IP or node name). 
The priority takes the form of a delay, usually in the order of 10s of seconds, and is used to prevent parallel fencing operations during a primary-network outage where each side powers off the other - resulting in a total cluster outage. @@ -198,6 +210,8 @@ More information on creating OCF agents can be found in the upstream [developer Tools for extracting support information (must-gather tarballs) will be updated to gather relevant logs for triaging issues. +As part of the fencing setup, the cri-o and kubelet services will still be owned by systemd when running under pacemaker. The main difference is that the resource agent will be responsible for signaling systemd to change their active states. The etcd pods are different in this respect since they will be restarted using Podman. It may be possible to start these with the same user account as the original pods. + #### Failure Scenario Timelines: 1. Cold Boot @@ -272,19 +286,11 @@ Tools for extracting support information (must-gather tarballs) will be updated #### Hypershift / Hosted Control Planes -Are there any unique considerations for making this change work with -Hypershift? - -See https://github.com/openshift/enhancements/blob/e044f84e9b2bafa600e6c24e35d226463c2308a5/enhancements/multi-arch/heterogeneous-architecture-clusters.md?plain=1#L282 - -How does it affect any of the components running in the -management cluster? How does it affect any components running split -between the management cluster and guest cluster? +This topology is anti-synergistic with HyperShift. As the management cluster, a cost-sensitive control-plane runs counter to the the proposition of highly-scaleable hosted control-planes since your compute resources are limited. As the hosted cluster, the benefit of hypershift is that your control-planes are running as pods in the management cluster. Reducing the number of instances of control-plane nodes would trade the minimal cost of a third set of control-plane pods at the cost of having to implement fencing between your control-plane pods. #### Single-node Deployments or MicroShift -How does this proposal affect the resource consumption of a -single-node OpenShift deployment (SNO), CPU and memory? +This proposal is an alternative architecture to Single-node and MicroShift, so it shouldn't introduce any complications for those topologies. ### Risks and Mitigations @@ -312,8 +318,6 @@ single-node OpenShift deployment (SNO), CPU and memory? 1. Mitigation: ... CI ... - - ### Drawbacks The two-node architecture represents yet another distinct install type for users to choose from. @@ -328,16 +332,14 @@ Satisfying this demand would come with significant technical and support overhea 2. Are there consequences of changing the parentage of processes running cri-o, kubelet, and etcd? (E.g. user process limits) 3. In the test plan, which subset of layered products needs to be evaluated for the initial release (if any)? 4. How are the BMC credentials getting from the install-config and onto the nodes? -5. How does the cluster know it has achieved a ready state? - From the cluster's perspective, we know that CEO needs to have `managedEtcd: External` set, and all of the operators need to be available. However, there is a time delay between when that configuration is set by CEO and when etcd comes back up as health after it's restarted under the RHEL-HA components. Is a fixed wait enough to determine that we have successfully transitioned to the new topology? If not, how do we detect this? -6. 
Are there any scenarios that would require signaling from the cluster to the RHEL-HA components to modify or change their behavior? -7. Are there incompatibilities between the existing design and the function of the load balancer deployed through the BareMetalPlatform spec? -8. Which installers will be supported for the initial release? - The current discussions around the installer point us towards ABI for the initial release. There also seems to be interest in making this available for ZTP for managed clusters. -9. Which platform specs will be available for this topology? +5. Are there incompatibilities between the existing design and the function of the load balancer deployed through the BareMetalPlatform spec? +6. Which platform specs will be available for this topology? As discussed, we are currently targeting the BareMetalPlatform spec, but the load-balancing component needs to be evaluated for compatibility. -10. What happens if something fails during the initial setup of the RHEL-HA stack? How will this be communicated back to the user? - For example, what happens if the setup job fails and etcd is left running? From the perspective of the user and the cluster, this would be identical to etcd being stopped and restarted under the control of the RHEL-HA components. Nothing about the cluster knows about the external entity that owns etcd. +7. What in-cluster entity will be responsible for initializing pacemaker? +We've narrowed this down to either CEO or a 2NO-specific operator. The advantage of accomplishing this in CEO is that it could be tested and maintained by the control-plane team, and will always need to be tested alongside etcd. The advantage of introducing a new operator is that it gives us greater flexibility over the design. +8. What in-cluster entity will be responsible for preparing fencing credentials for pacemaker to consume? +Similar to the question above, this can probably be done by CEO, BMO, or a new operator. +9. What happens if a cluster's fencing credentials are rotated after installation? ## Test Plan @@ -413,21 +415,13 @@ end to end tests.** ## Upgrade / Downgrade Strategy -In-place upgrades and downgrades will not be supported for this first iteration, and will be addressed as a separate feature in another enhancement. Upgrades will initially only be achieved by redeploying the machine and its workload. +In-place upgrades and downgrades will not be supported for this first iteration and will be addressed as a separate feature in another enhancement. Upgrades will initially only be achieved by redeploying the machine and its workload. ## Version Skew Strategy -How will the component handle version skew with other components? -What are the guarantees? Make sure this is in the test plan. - -Consider the following in developing a version skew strategy for this -enhancement: -- During an upgrade, we will always have skew among components, how will this impact your work? -- Does this enhancement involve coordinating behavior in the control-plane and - the kubelet? How does an n-2 kubelet without this feature available behave - when this feature is used? -- Will any other components on the node change? For example, changes to CSI, CRI, - or CNI may require updating that component before the kubelet. +Most components of this enhancement are external to the cluster itself. The main challenge with upgrading +is ensuring the cluster stays functional and consistent through the reboots of the upgrade. 
We may +need to revisit this if we decide to introduce our own operator. ## Operational Aspects of API Extensions From 156c11aceb22bf1b7fcfa6e82d0cdcbe49aec9b9 Mon Sep 17 00:00:00 2001 From: Carlo Lobrano Date: Wed, 27 Nov 2024 11:16:21 +0100 Subject: [PATCH 35/49] clarify etcd handling during node failures --- enhancements/two-nodes-openshift/2no.md | 34 +++++++++++++------------ 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 5f4abc7022..9a71c7387e 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -220,19 +220,21 @@ As part of the fencing setup, the cri-o and kubelet services will still be owned 3. Node1 does not start etcd or kubelet, remains inert waiting for Node2 4. Peer (Node2) boots 5. Corosync membership containing both nodes forms - 6. Pacemaker starts etcd on both nodes - * Detail, this could be a “soft”-start which allows us to determine which node has the most recent dataset. - 7. Pacemaker “promotes” etcd on whichever node has the most recent dataset - 8. Pacemaker “promotes” etcd on the peer once it has caught up - 9. Pacemaker starts kubelet on both nodes - 10. Fully functional cluster + 6. Pacemaker starts kubelet on both nodes + 7. Pacemaker starts etcd on both nodes + * if one node has a more recent dataset than the peer: + * Pacemaker starts etcd standalone on the node with the most recent dataset and adds the peer as learning member + * Pacemaker starts etcd on the peer as joining member + * otherwise, Pacemaker starts both instances as joining members + 10. CEO promotes the learning member as voting member + 11. Fully functional cluster 2. Network Failure 1. Corosync on both nodes detects separation 2. Etcd loses internal quorum (E-quorum) and goes read-only 3. Both sides retain C-quorum and initiate fencing of the other side. RHEL-HA's fencing priority avoids parallel fencing operations and thus the total shutdown of the system. 4. One side wins, pre-configured as Node1 - 5. Pacemaker on Node1 forces E-quorum (etcd promotion event) + 5. Pacemaker on Node1 restarts etcd forcing a new cluster with old state to recover E-quorum. Node2 is added to etcd members list as learning member. 6. Cluster continues with no redundancy 7. … time passes … 8. Node2 boots - persistent network failure @@ -240,15 +242,15 @@ As part of the fencing setup, the cri-o and kubelet services will still be owned * Node2 does not start etcd or kubelet, remains inert waiting for Node1 9. Network is repaired 10. Corosync membership containing both nodes forms - 11. Pacemaker “starts” etcd on Node2 as a follower of Node1 - 12. Pacemaker “promotes” etcd on Node2 as a full replica of Node1 - 13. Pacemaker starts kubelet + 11. Pacemaker starts kubelet + 12. Pacemaker detects etcd is running standalone already on the peer, it backs up the etcd data and resets the etcd state to allow Node2 to start as a follower of Node1 + 13. CEO promotes etcd on Node2 as a voting member 14. Cluster continues with 1+1 redundancy 3. Node Failure 1. Corosync on the survivor (Node1) 2. Etcd loses internal quorum (E-quorum) and goes read-only 3. Node1 retains “corosync quorum” (C-quorum) and initiates fencing of Node2 - 4. Pacemaker on Node1 forces E-quorum (etcd promotion event) + 4. Pacemaker on Node1 restarts etcd forcing a new cluster with old state to recover E-quorum. Node2 is added to etcd members list as learning member. 5. Cluster continues with no redundancy 6. 
… time passes … 7. Node2 has a persistent failure that prevents communication with Node1 @@ -256,16 +258,16 @@ As part of the fencing setup, the cri-o and kubelet services will still be owned * Node2 does not start etcd or kubelet, remains inert waiting for Node1 8. Persistent failure on Node2 is repaired 9. Corosync membership containing both nodes forms - 10. Pacemaker “starts” etcd on Node2 as a follower of Node1 - 11. Pacemaker “promotes” etcd on Node2 as a full replica of Node1 - 12. Pacemaker starts kubelet + 10. Pacemaker starts kubelet + 11. Pacemaker detects etcd is running standalone already on the peer, it backs up the etcd data and resets the etcd state to allow Node2 to start as a follower of Node1 + 12. CEO promotes etcd on Node2 as a voting member 13. Cluster continues with 1+1 redundancy 4. Two Failures 1. Node2 failure (1st failure) 2. Corosync on the survivor (Node1) 3. Etcd loses internal quorum (E-quorum) and goes read-only 4. Node1 retains “corosync quorum” (C-quorum) and initiates fencing of Node2 - 5. Pacemaker on Node1 forces E-quorum (etcd promotion event) + 5. Pacemaker on Node1 restarts Etcd forcing a new cluster with old state to recover E-quorum. Node2 is added to etcd members list as learning member. 6. Cluster continues with no redundancy 7. Node1 experience a power failure (2nd Failure) 8. … time passes … @@ -280,7 +282,7 @@ As part of the fencing setup, the cri-o and kubelet services will still be owned 4. Start failure defaults to leaving the service offline 6. Etcd Failure 1. Pacemaker’s monitoring detects the failure - 2. Pacemaker either demotes etcd so it can resync, or restarts and promotes etcd + 2. Pacemaker removes etcd from the members list and restart it, so it can resync 3. Stop failure is optionally escalated to a node failure (fencing) 4. Start failure defaults to leaving the service offline From e28d72524044a0e76c9ec7a3e06717f09c91a82e Mon Sep 17 00:00:00 2001 From: Carlo Lobrano Date: Thu, 5 Dec 2024 08:36:48 +0100 Subject: [PATCH 36/49] add handling of etcd failure by OCF script --- enhancements/two-nodes-openshift/2no.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 9a71c7387e..07543446c7 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -107,6 +107,8 @@ Upon a peer failure, the RHEL-HA components on the survivor will fence the peer Upon a network failure, the RHEL-HA components ensure that exactly one node will survive, fence its peer, and use the OCF script to restart etcd as a new cluster-of-one. +Upon an etcd failure, the OCF script will detect the issue and try to restart etcd. + In both cases, the control-plane's dependence on etcd will cause it to respond with errors until etcd has been restarted. Upon rebooting, the RHEL-HA components ensure that a node remains inert (not running cri-o, kubelet, or etcd) until it sees its peer. From 242842b8c636ae107cce3c449ddcff7929e15f14 Mon Sep 17 00:00:00 2001 From: Jeremy Poulin Date: Tue, 3 Dec 2024 15:53:50 -0500 Subject: [PATCH 37/49] OCPEDGE-1318: Filled out remaining sections and updates for fencing. This is a collection of updates which attempts to clean up some of the remaining template-boiler plate of the enhancement. I've also added sections to explain recent discussions, like: - Baremetal Platform vs. 
None - Fencing initialization via an in-cluster operator - Why the Assisted Installer SaaS is probably not compatible - Whether or note in-cluster networking is compatible --- enhancements/two-nodes-openshift/2no.md | 242 +++++++++++++++--------- 1 file changed, 154 insertions(+), 88 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index 07543446c7..c03d193cbd 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -7,7 +7,7 @@ reviewers: - "@rwsu" - "@fabbione" - "@carbonin" - - "@thomasjungblut" + - "@tjungblu" - "@brandisher" - "@DanielFroehlich" - "@jerpeter1" @@ -20,7 +20,7 @@ reviewers: - "@clobrano" - "@cybertron" approvers: - - "@thomasjungblut" + - "@tjungblu" - "@jerpeter1" api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None" - "@deads2k" @@ -52,9 +52,15 @@ MCO - Machine Config Operator. This operator manages updates to the node's syste ABI - Agent-Based Installer. +BMO - Baremetal Operator + +CEO - Cluster Etcd Operator + +BMC - Baseboard Management Console. Used to manage baremetal machines. Can modify firmware settings and machine power state. + ## Summary -Leverage traditional high-availability concepts and technologies to provide a container management solution suitable that has a minimal footprint but remains resilient to single node-level failures suitable for customers with numerous geographically dispersed locations. +Leverage traditional high-availability concepts and technologies to provide a container management solution that has a minimal footprint but remains resilient to single node-level failures suitable for customers with numerous geographically dispersed locations. ## Motivation @@ -91,17 +97,17 @@ This requires our solution to provide a management experience consistent with "n * Support disconnected cluster installation * Adding worker nodes * Creation of RHEL-HA events and metrics for consumption by the OpenShift monitoring stack (Deferred to post-MVP) -* Supporting upgrade/downgrade paths between two-node and other architectures (for initial release) +* Supporting upgrade/downgrade paths between two-node and other topologies (e.g. 3-node compact) (for initial release) ## Proposal -Use the RHEL-HA stack (Corosync, and Pacemaker), which has been used to deliver supported 2-node cluster experiences for multiple decades, to manage cri-o, kubelet, and the etcd daemon. -etcd will run as as a voting member on both nodes. +We will use the RHEL-HA stack (Corosync, and Pacemaker), which has been used to deliver supported 2-node cluster experiences for multiple decades, to manage cri-o, kubelet, and the etcd daemon. +Etcd will run as a a voting member on both nodes. We will take advantage of RHEL-HA's native support for systemd and re-use the standard cri-o and kubelet units, as well as create a new Open Cluster Framework (OCF) script for etcd. The existing startup order of cri-o, then kubelet, then etcd will be preserved. -The `etcdctl`, `etcd-metrics`, and `etcd-readyz` containers will remain part of the static pod, the contents of which remain under the exclusive control of the Cluster Etcd Operator (CEO). +The `etcdctl`, `etcd-metrics`, and `etcd-readyz` containers will remain part of the static pod definitions, the contents of which remain under the exclusive control of the Cluster Etcd Operator (CEO). 
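To make the intended division of labour concrete, the Pacemaker resource layout could look roughly like the sketch below: cri-o and kubelet reuse their systemd units, while etcd is wrapped by the new OCF agent. The resource names and the `ocf:openshift:etcd` provider are placeholders for whatever the initialization step finally creates, not a committed interface.

```bash
# Sketch of the Pacemaker resources implied by this proposal (names are illustrative, not final).
pcs resource create crio systemd:crio clone
pcs resource create kubelet systemd:kubelet clone
pcs resource create etcd ocf:openshift:etcd promotable

# Preserve the existing startup order: cri-o, then kubelet, then etcd.
pcs constraint order start crio-clone then kubelet-clone
pcs constraint order start kubelet-clone then etcd-clone
```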
-Use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fence) an unreachable peer and ensure that it can do no harm while the remaining node continues. +In the case of an unreachable peer, we will use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fence) the unreachable node and ensure that it cannot harm while the remaining node continues. Upon a peer failure, the RHEL-HA components on the survivor will fence the peer and use the OCF script to restart etcd as a new cluster-of-one. @@ -113,7 +119,7 @@ In both cases, the control-plane's dependence on etcd will cause it to respond w Upon rebooting, the RHEL-HA components ensure that a node remains inert (not running cri-o, kubelet, or etcd) until it sees its peer. If the failed peer is likely to remain offline for an extended period, admin confirmation is required on the remaining node to allow it to start OpenShift. -The functionality exists within RHEL-HA, but a wrapper will be provided to take care of the details. +This functionality exists within RHEL-HA, but a wrapper will be provided to take care of the details. When starting etcd, the OCF script will use etcd's cluster ID and version counter to determine whether the existing data directory can be reused, or must be erased before joining an active peer. @@ -121,24 +127,49 @@ When starting etcd, the OCF script will use etcd's cluster ID and version counte #### Cluster Creation -User creation of a two-node control-plane will be possible via the Assisted Installer. A key requirement is that the cluster can be deployed using only 2 nodes because requiring a third baremetal server for installation can be expensive when deploying baremetal at scale. To accomplish this, deployments will take advantage of the Assisted Installer's ability to use one of the target machines as the bootstrap node before it is rebooted into a control-plane node. There is a critical transition during this process, where to maintain etcd quorum, the bootstrap node will need to be removed from the etcd cluster before it is rebooted so that quorum can be maintained as the machine reboots into a second control-plane. +User creation of a two-node control-plane is possible via the Assisted Installer and the Agent-Based Installer (ABI). The initial implementation will focus on providing support for the Assisted Installer in managed cluster environments (i.e. ACM), followed by stand-alone cluster support via the Agent-Based Installer. +The requirement that the cluster can be deployed using only 2 nodes is key because requiring a third baremetal server for installation can be expensive when deploying baremetal at scale. To accomplish this, deployments will use one of the target machines as the bootstrap node before it is rebooted into a control-plane node. + +A critical transition during bootstrapping is when the bootstrap reboots into the control plane node. Before this reboot, it needs to be removed from the etcd cluster so that quorum can be maintained as the machine reboots into a second control-plane. + +Otherwise, the procedure follows the standard flow except for the configuration of 2 nodes instead of 3. + +Because BMC passwords are being collected to initialize fencing, the SaaS offering will not be available (to avoid storing customer BMC credentials in a Red Hat database). ZTP may be considered in the future, but this will need further evaluation to ensure passwords are appropriately handled. 
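For illustration, the RedFish details could be supplied at install time in a shape similar to the existing bare-metal platform host entries; the exact schema for a two-node deployment is still an open design point, so the fragment below is an assumption, not a committed API.

```bash
# Sketch only: a fragment of install-config.yaml carrying the BMC details for both control-plane nodes.
cat > install-config.yaml <<'EOF'
controlPlane:
  name: master
  replicas: 2
compute:
  - name: worker
    replicas: 0
platform:
  baremetal:
    hosts:
      - name: node1
        role: master
        bmc:
          address: redfish-virtualmedia://192.168.111.10/redfish/v1/Systems/1
          username: admin
          password: not-a-real-password
      - name: node2
        role: master
        bmc:
          address: redfish-virtualmedia://192.168.111.20/redfish/v1/Systems/1
          username: admin
          password: not-a-real-password
EOF
```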
+ +Everything else about cluster creation will be an opaque implementation detail not exposed to the user. + +##### Transitioning to a RHEL-HA Controlled Cluster +Three aspects of cluster creation need to happen for a vanilla two-node cluster to have RHEL-HA functioning as described in the proposal. +1. Initializing the RHEL-HA cluster +2. Transitioning control of etcd to RHEL-HA +3. Enabling fencing in RHEL-HA -Otherwise, the procedure follows the standard flow except for the configuration of 2 nodes instead of 3. At this time we've discussed the collection of RedFish details (including passwords!) for each node. This is needed for the RHEL-HA configuration by leveraging the BareMetalHost CRDs populated from the baremetal platform specification in the install-config. There are open questions on how to ensure that pacemaker is the only entity responsible for fencing to prevent conflicting requests to change the machine state between pacemaker and the baremetal operator. Preventing conflicting fencing logic is also important for optional operators like Node Health Check, Self Node Remediation, and Fence Agents Remediation, but these should not be present during installation. +We propose the inclusion of a new in-cluster operator responsible for the remediation of 2NO resource agents. -An important facility of the installation flow is the transition from a CEO deployed etcd to one controlled by pacemaker. The basic transition works as follows: +###### Transitioning Etcd Management to RHEL-HA +An important facility of the installation flow is the transition from a CEO deployed etcd to one controlled by RHEL-HA. The basic transition works as follows: 1. MCO Extensions are used to ensure that pacemaker, corosync, and resource agents are pre-configured on CoreOS using installation manifests. -2. Upon detection that the cluster infrastructure is using the DualReplica controlPlaneTopology in the infrastructure config, an in-cluster entity (see open questions regarding whether this should be handled by CEO or an additional operator) will run a command on one of the cluster nodes to initialize pacemaker. The outcome of this is that the resource agent will be started on both nodes. -3. The aforementioned in-cluster entity will signal CEO to relinquish control of etcd by setting CEO's `managedEtcdKind` to `External`. When this happens, CEO removes the etcd pod from the static pod configs. The resource agents for etcd are running from step 2, and they are configured to wait for etcd pods to be gone so they can restart them using Podman. +2. Upon detection that the cluster infrastructure is using the DualReplica controlPlaneTopology in the infrastructure config, an in-cluster entity (see open questions regarding whether this should be handled by CEO or the proposed 2NO resource agent operator) will run a command on one of the cluster nodes to initialize pacemaker. The outcome of this is that the resource agent will be started on both nodes. +3. The aforementioned in-cluster entity will signal CEO to relinquish control of etcd by setting CEO's `managedEtcdKind` to `External`. When this happens, CEO immediately removes the etcd pod from the static pod configs. The resource agents for etcd are running from step 2, and they are configured to wait for etcd pods to be gone so they can restart them using Podman. 4. The installation proceeds as normal once the pods start. -If for some reason, the etcd pods cannot be started, then the installation will fail. 
The installer will need to be able to pull logs from the control-plane nodes to provide context for this failure. +If for some reason, the etcd pods cannot be started, then the installation will fail. The installer will pull logs from the control-plane nodes to provide context for this failure. -Fencing setup is the last important aspect of the cluster installation. In order for the cluster installation to be successful, fencing should be configured and active before we declare the installation successful. Ideally, the fencing secrets should be made available to the control-plane nodes in the initial pacemaker initialization so that fencing can be configured during step 2. There are a few more critical open questions with this: -1. Should fencing be made active during the installation or should pacemaker start with it disabled and only enable it after being signaled by the in-cluster entity when the cluster installation is detected as successful? -2. What mechanism will pacemaker use to get access to the secret linked from the BareMetalHost CRD? +###### Configuring Fencing Via 2NO Resource Agent Remediation Operator +Fencing setup is the last important aspect of the cluster installation. For the cluster installation to be successful, fencing should be configured and active before we declare the installation successful. In order to do this, baseboard management console (BMC) credentials need to be made available to the control-plane nodes as part of pacemaker initialization. +To ensure rapid fencing using pacemaker, we will collect RedFish details (address, username, and **password**) for each node via the install-config. +This will take a format similar to that of the [Baremetal Operator](https://docs.openshift.com/container-platform/4.17/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#bmc-addressing_ipi-install-installation-workflow). +We will create a new CRD that stores the BMC address as well as a credentialsName to be managed by the new operator. This will resemble the BMC specification used by the [BareMetalHost](https://docs.openshift.com/container-platform/4.17/rest_api/provisioning_apis/baremetalhost-metal3-io-v1alpha1.html#spec-bmc) CRD. -If available via the SaaS offering (not confirmed), ZTP may be evaluated as a future offering. This will need further evaluation to ensure passwords are appropriately handled. +BMC information can be used to change the power state of a baremetal machine, so it's critically important that we ensure that pacemaker is the **only entity** responsible for these operations to prevent conflicting requests to change the machine state. +This means that we need to ensure that there are protections in the Baremetal Operator (BMO) to prevent control plane nodes from having power management enabled in a two-node topology. Additionally, optional operators like Node Health Check, Self Node Remediation, and Fence Agents Remediation must have the same considerations but these should not be present during installation. -Everything else about cluster creation will be an opaque implementation detail not exposed to the user. +For a two-node cluster to be successful, we need to ensure the following: +1. The new operator must be able to watch for changes to the new CRD and synchronize the BMC secrets for RHEL-HA once during initialization and again whenever the CRD is updated. +3. 
The new operator should run the `pcs` configuration command on the nodes to enable the fencing agent and report an error to the user if the fencing configuration isn't successful. In the case of installation, this means that the operator becomes degraded to ensure that the installation fails. +4. The new operator should periodically check that the fencing agent is healthy to ensure that the user can be notified in case the credential is rotated and needs to be updated. +5. In the case that the CRD cannot be updated because etcd is unavailable, it must be possible to log in to the node and run the `pcs` configuration commands manually to recover the node. Upon recovery, the operator should detect that its credential is out of date and prompt the user for an update. + +A final note about fencing is that pacemaker will cache the credentials directly on the nodes. This is important because the cluster may not be able to respond to requests to access the BMC credentials when it is needed for fencing (e.g. when you lose quorum due to a network failure). #### Day 2 Procedures @@ -155,10 +186,11 @@ Confirmation can be given at any point and optionally make use of SSH to facilit ### API Extensions -There are two related but ultimately orthogonal capabilities that may require API extensions. +There are three known capabilities that require API extensions. -1. Identify the cluster as having a unique topology -2. Tell CEO when it is safe for it to disable certain membership-related functionalities +1. Identifying two-node control-plane clusters as a unique topology +2. Telling CEO when it is safe for it to disable certain membership-related functionalities +3. Creating the CRD used for fencing enablement and resource agent remediation #### Unique Topology @@ -168,7 +200,7 @@ The enum is used for the `controlPlaneTopology` and `infrastructureTopology` fie We will additionally define a new feature gate `DualReplicaTopology` that can be enabled in `install-config.yaml` to ensure the feature can be set as `TechPreviewNoUpgrade`. -#### CEO Trigger +#### CEO Externally Managed Etcd Initially, the creation of an etcd cluster will be driven in the same way as other platforms. Once the cluster has two members, the etcd daemon will be removed from the static pod definition and recreated as a resource controlled by RHEL-HA. @@ -178,29 +210,37 @@ This will be achieved by having the same entity that drives the configuration of To enable this flow, we propose the addition of a `managedEtcdKind` field which defaults to `Cluster` but will be set to `External` during installation, and will only be respected if the `Infrastructure` CR's `TopologyMode` is `DualReplicaTopologyMode`. This will allow the use of a credential scoped to `ConfigMap`s in the `openshift-etcd-operator` namespace, to make the change. +The plan is for this to be changed by one of the nodes during pacemaker initialization. Pacemaker initialization should be initiated by CEO when it detects that the cluster controlPlane topology is set to `DualReplica`. + +#### 2NO Resource Agent Remediation Operator + +From a high level, the proposed in-cluster operator's job is to ensure that the RHEL-HA resource agents that need to be running to have a safe 2-node cluster are healthy. +The most involved aspect of this is collecting BMC credentials, and ensuring that the RHEL-HA components can sync them to disk during initialization or whenever these secrets are updated. 
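+
+As a sketch of the kind of RHEL-HA configuration this entity (or an administrator recovering manually over SSH) would apply, the following illustrates RedFish fencing setup with `pcs`. The node names, BMC addresses, and credentials are placeholders, and the exact fence agent and option names may differ in the final implementation:
+```
+# Register a RedFish fencing device per control-plane node (run once on either node).
+pcs stonith create fence-master-0 fence_redfish \
+    ip=192.168.111.20 systems_uri=/redfish/v1/Systems/1 \
+    username=admin password=REDACTED ssl_insecure=1 \
+    pcmk_host_list=master-0
+pcs stonith create fence-master-1 fence_redfish \
+    ip=192.168.111.21 systems_uri=/redfish/v1/Systems/1 \
+    username=admin password=REDACTED ssl_insecure=1 \
+    pcmk_host_list=master-1
+
+# Stagger fencing so the two nodes cannot power each other off in parallel.
+pcs stonith update fence-master-1 pcmk_delay_base=20s
+
+# Require fencing before recovery actions, then verify the agents are healthy.
+pcs property set stonith-enabled=true
+pcs stonith status
+```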
+ ### Topology Considerations 2NO represents a new topology and is not appropriate for use with HyperShift, SNO, or MicroShift #### Standalone Clusters -Two-node OpenShift is first and foremost a topology of OpenShift, so it should be able to run without any assumptions of a cluster manager. +Two-node OpenShift is first and foremost a topology of OpenShift, so it should be able to run without any assumptions of a cluster manager. To achieve this, we will need to enable the installation +of two-node clusters via the Agent-Based Installer to ensure that we are still meeting the installation requirement of using only 2 nodes. ### Implementation Details/Notes/Constraints -While the target installation requires exactly 2 nodes, this will be achieved by proving out the "bootstrap plus 2 nodes" flow in the core installer and then using the Assisted Installer's ability to bootstrap from one of the target machines to remove the requirement for a bootstrap node. +While the target installation requires exactly 2 nodes, this will be achieved by proving out the "bootstrap plus 2 nodes" flow in the core installer and then using assisted-service-based installers to bootstrap from one of the target machines to remove the requirement for a bootstrap node. So far, we've discovered topology-sensitive logic in ingress, authentication, CEO, and the cluster-control-plane-machineset-operator. We expect to find others once we introduce the new infrastructure topology. The delivery of RHEL-HA components will be opaque to the user and be delivered as an [MCO Extension](../rhcos/extensions.md) in the 4.18 and 4.19 timeframes. A switch to [MCO Layering](../ocp-coreos-layering/ocp-coreos-layering.md ) will be investigated once it is GA in a shipping version of OpenShift. -Once installed, the configuration of the RHEL-HA components will be done via an in-cluster entity. This entity could be a dedicated in-cluster operator or a function of CEO triggering a script on one of the control-plane nodes. This initialization will require that RedFish details have been collected by the installer. +Once installed, the configuration of the RHEL-HA components will be done via an in-cluster entity. This entity could be a dedicated in-cluster operator or a function of CEO triggering a script on one of the control-plane nodes. +Regardless, this initialization will require that RedFish details have been collected by the installer and synced to the nodes. + Sensible defaults will be chosen where possible, and user customization only where necessary. -This RHEL-HA initialization script will also configure a fencing priority. -This is usually done based on the sort order of a piece of shared info (such as IP or node name). -The priority takes the form of a delay, usually in the order of 10s of seconds, and is used to prevent parallel fencing operations during a primary-network outage where each side powers off the other - resulting in a total cluster outage. +This RHEL-HA initialization script will also configure a fencing priority for the nodes - alphabetically by name. The priority takes the form of a delay, where the second node will wait 20 seconds to prevent parallel fencing operations during a primary-network outage where each side powers off the other - resulting in a total cluster outage. RHEL-HA has no real understanding of the resources (IP addresses, file systems, databases, even virtual machines) it manages. 
It relies on resource agents to understand how to check the state of a resource, as well as start and stop them to achieve the desired target state. @@ -212,9 +252,38 @@ More information on creating OCF agents can be found in the upstream [developer Tools for extracting support information (must-gather tarballs) will be updated to gather relevant logs for triaging issues. -As part of the fencing setup, the cri-o and kubelet services will still be owned by systemd when running under pacemaker. The main difference is that the resource agent will be responsible for signaling systemd to change their active states. The etcd pods are different in this respect since they will be restarted using Podman. It may be possible to start these with the same user account as the original pods. +As part of the fencing setup, the cri-o and kubelet services will still be owned by systemd when running under pacemaker. The main difference is that the resource agent will be responsible for signaling systemd to change their active states. +The etcd pods are different in this respect since they will be restarted using Podman, but this will be running as root, as it was under CEO. + +#### Platform None vs. Baremetal +One of the major design questions of two-node OpenShift is whether to target support for `platform: none` or `platform: baremetal`. The advantage of selecting `platform: baremetal` is +that we can leverage the benefits of deploying an ingress-VIP out of the box using keepalived and haproxy. After some discussion with the metal networking team, it is +expected that this might work without modifications as long as pacemaker fencing doesn't remove nodes from the node list so that both keepalived instances are always peers. Furthermore, +it was noted that this might be solved more simply without keepalived at all by using the ipaddr2 resource agent for pacemaker to run the `ip addr add` and `ip addr remove` commands for the VIP. + +Outside of potentially reusing the networking bits of `platform: baremetal`, we discussed potentially reusing its API for collecting BMC credentials for fencing. In this approach, we'd use +the `platform: baremetal` BMC entries would be loaded into BareMetalHost CRDs and we'd extend BMO to initialize pacemaker instead of an new operator. After a discussion with the Baremetal Platform team, +we were advised against using the Baremetal Operator as an inventory. Its purpose/scope is provisioning nodes. This means that the Baremetal Operator is not desirable for a 2-node cluster, because +we don't intend on supporting compute nodes. If you want to add nodes to your cluster, you'll need to migrate your workloads to a cluster running a new topology. Even if this requirement were +relaxed in the future, the transition path should probably be from 2-node to 3-node compact, not 2-node plus a compute node. + +Given the likelihood of customers being sensitive to the footprint of the platform operators running on the cluster, the safest path forward is to target 2NO clusters on `platform: None` clusters. +By default, this will require customers to provide an ingress load balancer; however, if it isn't a business requirement, we can work with the Metal Networking team to prioritize this +as a feature for `platform: none` clusters in the future. + +#### Graceful vs. Unplanned Reboots +Events that have to be handled uniquely by a two-node cluster can largely be categorized into one of two buckets. In the first bucket, we have things that trigger graceful reboots. 
+This includes events like upgrades, MCO-triggered reboots, and users sending a shutdown command to one of the nodes. In each of these cases - assuming a functioning two-node cluster - +the node that is shutting down must wait for pacemaker to signal to etcd to remove the node from the etcd quorum to maintain e-quorum. When the node reboots, it must rejoin the etcd cluster +and sync its database to the active node. + +Unplanned reboots include any event where one of the nodes cannot signal to etcd that it needs to leave the cluster. This includes situations such as a network disconnection between the nodes, power outages, or turning +off a machine using a command like `poweroff -f`. The point is that a machine needs to be fenced so that the other node can perform a special recovery operation. This recovery involves +pacemaker restarting the etcd on the surviving node with a new cluster ID as a cluster-of-one. This way, when the other node rejoins, it must reconcile its data directory and resync to the new +cluster before it can rejoin as an active peer. #### Failure Scenario Timelines: +This section provides specific steps for how two-node clusters would handle interesting events. 1. Cold Boot 1. One node (Node1) boots @@ -300,26 +369,29 @@ This proposal is an alternative architecture to Single-node and MicroShift, so i 1. Risk: If etcd were to be made active on both peers during a network split, divergent datasets would be created 1. Mitigation: RHEL-HA requires fencing of a presumed dead peer before restarting etcd as a cluster-of-one - 1. Mitigation: Peers remain inert (unable to fence peers, or start cri-o, kubelet, or etcd) after rebooting until they can contact their peer + 2. Mitigation: Peers remain inert (unable to fence peers, or start cri-o, kubelet, or etcd) after rebooting until they can contact their peer -1. Risk: Multiple entities (RHEL-HA, CEO) attempting to manage etcd membership would cause an internal split-brain +2. Risk: Multiple entities (RHEL-HA, CEO) attempting to manage etcd membership would cause an internal split-brain 1. Mitigation: The CEO will run in a mode that does manage not etcd membership -1. Risk: Rebooting the surviving peer would require human intervention before the cluster starts, increasing downtime and creating an admin burden at remote sites +3. Risk: Other operators that perform power-management functions could conflict with pacemaker. + 1. Mitigation: Update the Baremetal and Node Health Check operators to ensure control plane nodes can not perform power operations in the 2-node topology. + +4. Risk: Rebooting the surviving peer would require human intervention before the cluster starts, increasing downtime and creating an admin burden at remote sites 1. Mitigation: Lifecycle events, such as upgrades and applying new `MachineConfig`s, are not permitted in a single-node degraded state - 1. Mitigation: Usage of the MCO Admin Defined Node Disruption [feature](https://github.com/openshift/enhancements/pull/1525) will further reduce the need for reboots. - 1. Mitigation: The node will be reachable via SSH and the confirmation can be scripted - 1. Mitigation: It may be possible to identify scenarios where, for a known hardware topology, it is safe to allow the node to proceed automatically. + 2. Mitigation: Usage of the MCO Admin Defined Node Disruption [feature](https://github.com/openshift/enhancements/pull/1525) will further reduce the need for reboots. + 3. Mitigation: The node will be reachable via SSH and the confirmation can be scripted + 4. 
Mitigation: It may be possible to identify scenarios where, for a known hardware topology, it is safe to allow the node to proceed automatically. -1. Risk: “Something changed, let's reboot” is somewhat baked into OCP’s DNA and has the potential to be problematic when nodes are actively watching for their peer to disappear, and have an obligation to promptly act on that disappearance by power cycling them. +5. Risk: “Something changed, let's reboot” is somewhat baked into OCP’s DNA and has the potential to be problematic when nodes are actively watching for their peer to disappear, and have an obligation to promptly act on that disappearance by power cycling them. 1. Mitigation: Identify causes of reboots, and either avoid them or ensure they are not treated as failures. This may require an additional enhancement. -1. Risk: We may not succeed in identifying all the reasons a node will reboot +6. Risk: We may not succeed in identifying all the reasons a node will reboot 1. Mitigation: ... testing? ... -1. Risk: This new platform will have a unique installation flow - 1. Mitigation: ... CI ... +7. Risk: This new platform will have a unique installation flow + 1. Mitigation: A new CI lane will be created for this topology ### Drawbacks @@ -330,21 +402,9 @@ The existence of 1, 2, and 3+ node control-plane sizes will likely generate cust Satisfying this demand would come with significant technical and support overhead which is out of scope for this enhancement. ## Open Questions [optional] - 1. Are there any normal lifecycle events that would be interpreted by a peer as a failure, and where the resulting "recovery" would create unnecessary downtime? How can these be avoided? -2. Are there consequences of changing the parentage of processes running cri-o, kubelet, and etcd? (E.g. user process limits) -3. In the test plan, which subset of layered products needs to be evaluated for the initial release (if any)? -4. How are the BMC credentials getting from the install-config and onto the nodes? -5. Are there incompatibilities between the existing design and the function of the load balancer deployed through the BareMetalPlatform spec? -6. Which platform specs will be available for this topology? - As discussed, we are currently targeting the BareMetalPlatform spec, but the load-balancing component needs to be evaluated for compatibility. -7. What in-cluster entity will be responsible for initializing pacemaker? -We've narrowed this down to either CEO or a 2NO-specific operator. The advantage of accomplishing this in CEO is that it could be tested and maintained by the control-plane team, and will always need to be tested alongside etcd. The advantage of introducing a new operator is that it gives us greater flexibility over the design. -8. What in-cluster entity will be responsible for preparing fencing credentials for pacemaker to consume? -Similar to the question above, this can probably be done by CEO, BMO, or a new operator. -9. What happens if a cluster's fencing credentials are rotated after installation? - +2. In the test plan, which subset of layered products needs to be evaluated for the initial release (if any)? ## Test Plan @@ -388,25 +448,19 @@ Additionally, it would be good to have workload-specific testing once those are **Note:** *Section not required until targeted at a release.* -See template for guidelines/instructions. 
- ### Dev Preview -> Tech Preview -- Ability to utilize the enhancement end to end +- Ability to install a 2-node cluster using assisted installer (via ACM) and agent-based installer - End user documentation, relative API stability -- Sufficient test coverage -- Gather feedback from users rather than just developers -- Enumerate service level indicators (SLIs), expose SLIs as metrics -- Write symptoms-based alerts for the component(s) +- Sufficient test coverage (see test plan above) ### Tech Preview -> GA -- More testing (upgrade, downgrade, scale) -- Sufficient time for feedback +- Working upgrades +- Upgrade tests - Available by default - Backhaul SLI telemetry -- Document SLOs for the component -- Conduct load testing +- Performance testing - User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/) **For non-optional features moving to GA, the graduation criteria must include @@ -419,18 +473,19 @@ end to end tests.** ## Upgrade / Downgrade Strategy -In-place upgrades and downgrades will not be supported for this first iteration and will be addressed as a separate feature in another enhancement. Upgrades will initially only be achieved by redeploying the machine and its workload. +This topology has the same expectations for upgrades as the other variants of OpenShift. +For tech preview, upgrades will only be achieved by redeploying the machine and its workload. However, +fully automated upgrades are a requirement for graduating to GA. + +Downgrades are not supported outside of redeployment. ## Version Skew Strategy -Most components of this enhancement are external to the cluster itself. The main challenge with upgrading -is ensuring the cluster stays functional and consistent through the reboots of the upgrade. We may -need to revisit this if we decide to introduce our own operator. +Most components introduced in this enhancement are external to the cluster itself. The main challenge with upgrading +is ensuring the cluster stays functional and consistent through the reboots of the upgrade. This ## Operational Aspects of API Extensions -See template for guidelines/instructions. - - For conversion/admission webhooks and aggregated API servers: what are the SLIs (Service Level Indicators) an administrator or support can use to determine the health of the API extensions @@ -439,38 +494,51 @@ See template for guidelines/instructions. - What impact do these API extensions have on existing SLIs (e.g. scalability, API throughput, API availability) - [TODO: Expand] Toggling CEO control values with result in etcd being briefly offline. + Toggling CEO control values with result in etcd being briefly offline. The transition is almost immediate, though, since the resource agent is watching for the + etcd pod to disappear so it can start its replacement. + + The other potential impact is around reboots. There may be a small performance impact when the nodes reboot since they have to leave the etcd cluster and resync etcd to join. - How is the impact on existing SLIs to be measured and when (e.g. every release by QE, or automatically in CI) and by whom (e.g. perf team; name the responsible person and let them review this enhancement) + The impact of the etcd transition as well as the reboot interactions with etcd are likely compatible with existing SLIs. + - Describe the possible failure modes of the API extensions. + There shouldn't be any failures introduced by adding a new topology. 
+ Etcd transitioning from CEO-managed to externally managed should be a one-way switch. If it fails, it should result in a failed installation. + Similarly, introducing an operator to remediate the fencing agent exists so that we can degrade the operator if the fencing credentials are invalid or if fencing cannot be initialized, + leading to a failed installation. The special case is if the BMC credentials are rotated. Here, the operator's role is to notify the cluster admin that the pacemaker fencing agent is unhealthy. + + If the administrator fails to remediate this before a fencing operation is required, then manual recovery will be required by SSH-ing to the node. + - Describe how a failure or behaviour of the extension will impact the overall cluster health (e.g. which kube-controller-manager functionality will stop working), especially regarding stability, availability, performance, and security. + + As mentioned above, a BMC credential rotation could result in a customer "breaking" pacemaker's ability to fence nodes. On its own, pacemaker has no way of communicating this kind + of failure to the cluster admin. The proposed 2NO resource agent remediation operator could detect this and warn the cluster administrator that action is required. Alternatively, + RHEL-HA could detect this and immediately respond by turning off etcd for one of the nodes - forcing the cluster into read-only mode until a user manually logs in and restores the fencing agent. + To minimize disruption to user workloads, it would be best to avoid leaving the cluster in a read-only state unless we enter a state where fencing would be required for the cluster + to recover. + - Describe which OCP teams are likely to be called upon in case of escalation with one of the failure modes and add them as reviewers to this enhancement. -## Support Procedures + In case of escalation, the most likely team affected outside of Edge Enablement (or whoever owns the proposed topology) is the etcd team because forcing the cluster into a read-only + state is the primary mechanism two-node can use to protect against data corruption/loss. -See template for guidelines/instructions. +## Support Procedures -Describe how to -- detect the failure modes in a support situation, describe possible symptoms (events, metrics, - alerts, which log output in which component) -- disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`) - - What consequences does it have on the cluster health? - - What consequences does it have on existing, running workloads? - - What consequences does it have for newly created workloads? -- Does functionality fail gracefully and will work resume when re-enabled without risking - consistency? +- Failure logs for pacemaker will be available in the system journal. The installer should report these in the case that a cluster cannot successfully initialize pacemaker. +- A BMC connection failure detected after the cluster is installed can be remediated as long as the cluster is healthy and can be done by updating the CRD for the proposed 2NO resource agent remediation operator. +- In the case of a failed two-node cluster, there is no supported way of migrating to a different topology. The most practical option would be to deploy a fresh environment. 
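+
+As a hedged illustration of where that information would surface, the following commands (run on a control-plane node) are the sort of thing a support procedure might reference; the unit and command names reflect standard RHEL-HA tooling rather than a finalized 2NO procedure:
+```
+# Pacemaker and Corosync log their failures to the system journal.
+journalctl -u pacemaker -u corosync --since "-1h"
+
+# Overall resource state, including the etcd, kubelet, and cri-o agents.
+pcs status --full
+
+# Confirm the fencing agents can still reach their BMCs.
+pcs stonith status
+```
+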
## Alternatives -* MicroShift was considered as an alternative but it was ruled out because it does not support multi-node and has a very different experience than OpenShift which does not match the 2NO initiative which is on getting the OpenShift experience on two nodes - +* MicroShift was considered as an alternative but it was ruled out because it does not support multi-node and has a very different experience than OpenShift which does not match the 2NO initiative which is on getting the OpenShift experience on two nodes * 2 SNO + KCP [KCP](https://github.com/kcp-dev/kcp/) allows you to manage multiple clusters from a single control-plane, reducing the complexity of managing each cluster independently. @@ -481,8 +549,6 @@ Disadvantages: * KCP itself could become a single point of failure (need to configure pacemaker to manage KCP) * KCP adds additional complexity to the architecture - ## Infrastructure Needed [optional] -Use this section if you need things from the project. Examples include a new -subproject, repos requested, GitHub details, and/or testing infrastructure. +A new repository in the OpenShift GitHub organization will be created for the 2NO resource agent remediation operator. From e594d947c389043e749eb1e88a0feefa88fa8bd4 Mon Sep 17 00:00:00 2001 From: Jeremy Poulin Date: Mon, 9 Dec 2024 21:39:52 -0500 Subject: [PATCH 38/49] Enhancement updates around remaining open questions and selected designs. - Added explanations around which installers may be evaluated for further business cases - Added examples for platform extensions for providing fencing credentials. - Removed all references to pacemaker remediation operator. - Added a section explaining why we moved away from a operator-based design. - Noted open question around how to inform user of cluster health risk related to pacemaker. - Noted open question on how to reconcile updates to etcd pod - Made sure etcd is always lower case - Update `none` vs `baremetal` to explain why we should target both --- enhancements/two-nodes-openshift/2no.md | 276 +++++++++++++++++------- 1 file changed, 193 insertions(+), 83 deletions(-) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-nodes-openshift/2no.md index c03d193cbd..d043769702 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-nodes-openshift/2no.md @@ -54,7 +54,7 @@ ABI - Agent-Based Installer. BMO - Baremetal Operator -CEO - Cluster Etcd Operator +CEO - cluster-etcd-operator BMC - Baseboard Management Console. Used to manage baremetal machines. Can modify firmware settings and machine power state. @@ -101,11 +101,11 @@ This requires our solution to provide a management experience consistent with "n ## Proposal -We will use the RHEL-HA stack (Corosync, and Pacemaker), which has been used to deliver supported 2-node cluster experiences for multiple decades, to manage cri-o, kubelet, and the etcd daemon. -Etcd will run as a a voting member on both nodes. +We will use the RHEL-HA stack (Corosync, and Pacemaker), which has been used to deliver supported two-node cluster experiences for multiple decades, to manage cri-o, kubelet, and the etcd daemon. +We will run etcd as a voting member on both nodes. We will take advantage of RHEL-HA's native support for systemd and re-use the standard cri-o and kubelet units, as well as create a new Open Cluster Framework (OCF) script for etcd. The existing startup order of cri-o, then kubelet, then etcd will be preserved. 
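+
+A minimal sketch of how RHEL-HA could model this, assuming a hypothetical OCF provider/agent name for the Podman-managed etcd (the real agent name and its options are still to be defined):
+```
+# Reuse the existing systemd units for cri-o and kubelet, cloned to run on both nodes.
+pcs resource create crio systemd:crio clone
+pcs resource create kubelet systemd:kubelet clone
+
+# etcd is driven by a new OCF resource agent that launches it with Podman.
+pcs resource create etcd ocf:openshift:podman-etcd clone
+
+# Preserve the existing startup order: cri-o, then kubelet, then etcd.
+pcs constraint order start crio-clone then start kubelet-clone
+pcs constraint order start kubelet-clone then start etcd-clone
+```
+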
-The `etcdctl`, `etcd-metrics`, and `etcd-readyz` containers will remain part of the static pod definitions, the contents of which remain under the exclusive control of the Cluster Etcd Operator (CEO). +The `etcdctl`, `etcd-metrics`, and `etcd-readyz` containers will remain part of the static pod definitions, the contents of which remain under the exclusive control of the cluster-etcd-operator (CEO). In the case of an unreachable peer, we will use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fence) the unreachable node and ensure that it cannot harm while the remaining node continues. @@ -130,11 +130,16 @@ When starting etcd, the OCF script will use etcd's cluster ID and version counte User creation of a two-node control-plane is possible via the Assisted Installer and the Agent-Based Installer (ABI). The initial implementation will focus on providing support for the Assisted Installer in managed cluster environments (i.e. ACM), followed by stand-alone cluster support via the Agent-Based Installer. The requirement that the cluster can be deployed using only 2 nodes is key because requiring a third baremetal server for installation can be expensive when deploying baremetal at scale. To accomplish this, deployments will use one of the target machines as the bootstrap node before it is rebooted into a control-plane node. -A critical transition during bootstrapping is when the bootstrap reboots into the control plane node. Before this reboot, it needs to be removed from the etcd cluster so that quorum can be maintained as the machine reboots into a second control-plane. +A critical transition during bootstrapping is when the bootstrap reboots into the control-plane node. Before this reboot, it needs to be removed from the etcd cluster so that quorum can be maintained as the machine reboots into a second control-plane. Otherwise, the procedure follows the standard flow except for the configuration of 2 nodes instead of 3. -Because BMC passwords are being collected to initialize fencing, the SaaS offering will not be available (to avoid storing customer BMC credentials in a Red Hat database). ZTP may be considered in the future, but this will need further evaluation to ensure passwords are appropriately handled. +To constrain the scope of support, we've targeted Assisted Installer (in ACM) and Agent-Based Installer (ABI) as our supported installation paths. Support for other installation paths +may be reevaluated as business requirements change. For example, it is technically possible to install a cluster with two control-plane nodes via the openshift-installer using an +auxiliary bootstrap node but we don't intend to support this for customers unless this becomes a business requirement. Similarly, ZTP may be evaluated as a future offering for clusters +deployed by ACM environments via Multi-Cluster Engine (MCE), Assisted Installer, and Baremetal Operator. + +Because BMC passwords are being collected to initialize fencing, the Assisted Installer SaaS offering will not be available (to avoid storing customer BMC credentials in a Red Hat database). Everything else about cluster creation will be an opaque implementation detail not exposed to the user. @@ -144,39 +149,38 @@ Three aspects of cluster creation need to happen for a vanilla two-node cluster 2. Transitioning control of etcd to RHEL-HA 3. Enabling fencing in RHEL-HA -We propose the inclusion of a new in-cluster operator responsible for the remediation of 2NO resource agents. 
- -###### Transitioning Etcd Management to RHEL-HA +###### Transitioning etcd Management to RHEL-HA An important facility of the installation flow is the transition from a CEO deployed etcd to one controlled by RHEL-HA. The basic transition works as follows: -1. MCO Extensions are used to ensure that pacemaker, corosync, and resource agents are pre-configured on CoreOS using installation manifests. -2. Upon detection that the cluster infrastructure is using the DualReplica controlPlaneTopology in the infrastructure config, an in-cluster entity (see open questions regarding whether this should be handled by CEO or the proposed 2NO resource agent operator) will run a command on one of the cluster nodes to initialize pacemaker. The outcome of this is that the resource agent will be started on both nodes. +1. [MCO extensions](https://docs.openshift.com/container-platform/4.17/machine_configuration/machine-configs-configure.html#rhcos-add-extensions_machine-configs-configure) are used to ensure that the pacemaker and corosync RPMs are installed. The installer also creates MachineConfig manifests to pre-configure resource agents. +2. Upon detection that the cluster infrastructure is using the DualReplica controlPlaneTopology in the infrastructure config, an in-cluster entity (see open questions regarding whether this should be handled by CEO or a new 2NO setup operator) will run a command on one of the cluster nodes to initialize pacemaker. The outcome of this is that the resource agent will be started on both nodes. 3. The aforementioned in-cluster entity will signal CEO to relinquish control of etcd by setting CEO's `managedEtcdKind` to `External`. When this happens, CEO immediately removes the etcd pod from the static pod configs. The resource agents for etcd are running from step 2, and they are configured to wait for etcd pods to be gone so they can restart them using Podman. 4. The installation proceeds as normal once the pods start. If for some reason, the etcd pods cannot be started, then the installation will fail. The installer will pull logs from the control-plane nodes to provide context for this failure. -###### Configuring Fencing Via 2NO Resource Agent Remediation Operator -Fencing setup is the last important aspect of the cluster installation. For the cluster installation to be successful, fencing should be configured and active before we declare the installation successful. In order to do this, baseboard management console (BMC) credentials need to be made available to the control-plane nodes as part of pacemaker initialization. -To ensure rapid fencing using pacemaker, we will collect RedFish details (address, username, and **password**) for each node via the install-config. +There is an open question regarding how to handle updates to the etcd pod definition if it needs to change or if certificates are rotated. + +###### Configuring Fencing Via MCO +Fencing setup is the last important aspect of the cluster installation. For the cluster installation to be successful, fencing should be configured and active before we declare the installation successful. To do this, baseboard management console (BMC) credentials need to be made available to the control-plane nodes as part of pacemaker initialization. +To ensure rapid fencing using pacemaker, we will collect RedFish details (address, username, and **password**) for each node via the install-config (see proposed install-config changes). 
This will take a format similar to that of the [Baremetal Operator](https://docs.openshift.com/container-platform/4.17/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#bmc-addressing_ipi-install-installation-workflow). -We will create a new CRD that stores the BMC address as well as a credentialsName to be managed by the new operator. This will resemble the BMC specification used by the [BareMetalHost](https://docs.openshift.com/container-platform/4.17/rest_api/provisioning_apis/baremetalhost-metal3-io-v1alpha1.html#spec-bmc) CRD. +We will create a new MachineConfig that writes BMC credentials to the control-plane disks. This will resemble the BMC specification used by the [BareMetalHost](https://docs.openshift.com/container-platform/4.17/rest_api/provisioning_apis/baremetalhost-metal3-io-v1alpha1.html#spec-bmc) CRD. -BMC information can be used to change the power state of a baremetal machine, so it's critically important that we ensure that pacemaker is the **only entity** responsible for these operations to prevent conflicting requests to change the machine state. -This means that we need to ensure that there are protections in the Baremetal Operator (BMO) to prevent control plane nodes from having power management enabled in a two-node topology. Additionally, optional operators like Node Health Check, Self Node Remediation, and Fence Agents Remediation must have the same considerations but these should not be present during installation. +BMC information can be used to change the power state of a baremetal machine, so it's critically important that we ensure that pacemaker is the **only entity** responsible for these operations to prevent conflicting requests to change the machine state. This means that we need to ensure that there are installer validations and validations in the Baremetal Operator (BMO) to prevent control-plane nodes from having power management enabled in a two-node topology. Additionally, optional operators like Node Health Check, Self Node Remediation, and Fence Agents Remediation must have the same considerations but these are not present during installation. -For a two-node cluster to be successful, we need to ensure the following: -1. The new operator must be able to watch for changes to the new CRD and synchronize the BMC secrets for RHEL-HA once during initialization and again whenever the CRD is updated. -3. The new operator should run the `pcs` configuration command on the nodes to enable the fencing agent and report an error to the user if the fencing configuration isn't successful. In the case of installation, this means that the operator becomes degraded to ensure that the installation fails. -4. The new operator should periodically check that the fencing agent is healthy to ensure that the user can be notified in case the credential is rotated and needs to be updated. -5. In the case that the CRD cannot be updated because etcd is unavailable, it must be possible to log in to the node and run the `pcs` configuration commands manually to recover the node. Upon recovery, the operator should detect that its credential is out of date and prompt the user for an update. +See the API Extensions section below for sample install-configs. -A final note about fencing is that pacemaker will cache the credentials directly on the nodes. This is important because the cluster may not be able to respond to requests to access the BMC credentials when it is needed for fencing (e.g. when you lose quorum due to a network failure). 
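+
+A rough sketch of what the rendered MachineConfig could look like, assuming a placeholder file path and payload format (neither is a settled schema; the installer would generate the equivalent from the fencing section of the install-config):
+```
+cat <<'EOF' > manifests/99-master-fencing-credentials.yaml
+apiVersion: machineconfiguration.openshift.io/v1
+kind: MachineConfig
+metadata:
+  name: 99-master-fencing-credentials
+  labels:
+    machineconfiguration.openshift.io/role: master
+spec:
+  config:
+    ignition:
+      version: 3.2.0
+    storage:
+      files:
+        - path: /etc/pacemaker/fencing-credentials.json
+          mode: 0600
+          contents:
+            source: data:text/plain;charset=utf-8;base64,<base64-encoded BMC address, username, and password>
+EOF
+```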
+For a two-node cluster to be successful, we need to ensure the following: +1. The BMC secrets for RHEL-HA are created on disk during bootstrapping by the OpenShift installer via a MachineConfig. +2. When pacemaker is initialized by the in-cluster entity responsible for starting pacemaker, pacemaker will try to set up fencing with this secret. If this is not successful, it throws an error and the installation fails. +3. Pacemaker periodically checks that the fencing agent is healthy (i.e. can connect to the BMC) and throws a warning if it cannot access the BMC. There is an open question on what the user experience should be to raise this error to the user. +4. The cluster will continue to run normally in the state where the BMC cannot be accessed, but ignoring this warning will mean that pacemaker can only provide a best-effort recovery - so operations that require fencing will need manual recovery. #### Day 2 Procedures As per a standard 3-node control-plane, OpenShift upgrades and `MachineConfig` changes can not be applied when the cluster is in a degraded state. Such operations will only proceed when both peers are online and healthy. -The experience of managing a 2-node control-plane should be largely indistinguishable from that of a 3-node one. +The experience of managing a two-node control-plane should be largely indistinguishable from that of a 3-node one. The primary exception is (re)booting one of the peers while the other is offline and expected to remain so. As in a 3-node control-plane cluster, starting only one node is not expected to result in a functioning cluster. @@ -186,25 +190,25 @@ Confirmation can be given at any point and optionally make use of SSH to facilit ### API Extensions -There are three known capabilities that require API extensions. +Three known capabilities require API extensions. 1. Identifying two-node control-plane clusters as a unique topology 2. Telling CEO when it is safe for it to disable certain membership-related functionalities -3. Creating the CRD used for fencing enablement and resource agent remediation +3. Collecting fencing credentials for pacemaker initialization in the install-config #### Unique Topology -A mechanism is needed for components of the cluster to understand that this is a 2-node control-plane topology which may require different handling. +A mechanism is needed for components of the cluster to understand that this is a two-node control-plane topology that may require different handling. We will define a new value for the `TopologyMode` enum: `DualReplica`. The enum is used for the `controlPlaneTopology` and `infrastructureTopology` fields, and the currently supported values are `HighlyAvailable`, `SingleReplica`, and `External`. We will additionally define a new feature gate `DualReplicaTopology` that can be enabled in `install-config.yaml` to ensure the feature can be set as `TechPreviewNoUpgrade`. -#### CEO Externally Managed Etcd +#### CEO Externally Managed etcd Initially, the creation of an etcd cluster will be driven in the same way as other platforms. Once the cluster has two members, the etcd daemon will be removed from the static pod definition and recreated as a resource controlled by RHEL-HA. -At this point, the Cluster Etcd Operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled. 
+At this point, the cluster-etcd-operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled. This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the CEO's `ConfigMap` - which can only succeed if the control-plane is healthy. To enable this flow, we propose the addition of a `managedEtcdKind` field which defaults to `Cluster` but will be set to `External` during installation, and will only be respected if the `Infrastructure` CR's `TopologyMode` is `DualReplicaTopologyMode`. @@ -212,10 +216,99 @@ This will allow the use of a credential scoped to `ConfigMap`s in the `openshift The plan is for this to be changed by one of the nodes during pacemaker initialization. Pacemaker initialization should be initiated by CEO when it detects that the cluster controlPlane topology is set to `DualReplica`. -#### 2NO Resource Agent Remediation Operator - -From a high level, the proposed in-cluster operator's job is to ensure that the RHEL-HA resource agents that need to be running to have a safe 2-node cluster are healthy. -The most involved aspect of this is collecting BMC credentials, and ensuring that the RHEL-HA components can sync them to disk during initialization or whenever these secrets are updated. +#### Install Config with Fencing Credentials + +A sample install-config.yaml for `platform: none` type clusters would look like this: +``` +apiVersion: v1 +baseDomain: example.com +compute: +- name: worker + replicas: 0 +controlPlane: + name: master + replicas: 2 +metadata: + name: +platform: + none: + fencingCredentials: + bmc: + address: ipmi:// + username: + password: +pullSecret: '' +sshKey: '' +``` + +For platform baremetal, a valid configuration is quite similar. +``` +apiVersion: v1 +baseDomain: example.com +compute: +- name: worker + replicas: 0 +controlPlane: + name: master + replicas: 2 +metadata: + name: +platform: + baremetal: + fencingCredentials: + bmc: + address: ipmi:// + username: + password: + apiVIPs: + - + ingressVIPs: + - +pullSecret: '' +sshKey: '' +``` + +Unfortunately, Baremetal Operator already has a place to specify bmc credentials. However, providing credentials like this will result in conflicts as both the +Baremetal Operator and the pacemaker fencing agent will have control over the machine state. In short, this example shows an invalid configuration that we must check for +in the installer. +``` +apiVersion: v1 +baseDomain: example.com +compute: +- name: worker + replicas: 0 +controlPlane: + name: master + replicas: 2 +metadata: + name: +platform: + baremetal: + fencingCredentials: + bmc: + address: ipmi:// + username: + password: + apiVIPs: + - + ingressVIPs: + - + hosts: + - name: openshift-master-0 + role: master + bmc: + address: ipmi:// + username: + password: + - name: + role: master + bmc: + address: ipmi:// + username: + password: +pullSecret: '' +sshKey: '' +``` ### Topology Considerations @@ -235,7 +328,8 @@ So far, we've discovered topology-sensitive logic in ingress, authentication, CE The delivery of RHEL-HA components will be opaque to the user and be delivered as an [MCO Extension](../rhcos/extensions.md) in the 4.18 and 4.19 timeframes. A switch to [MCO Layering](../ocp-coreos-layering/ocp-coreos-layering.md ) will be investigated once it is GA in a shipping version of OpenShift. -Once installed, the configuration of the RHEL-HA components will be done via an in-cluster entity. 
This entity could be a dedicated in-cluster operator or a function of CEO triggering a script on one of the control-plane nodes. +Once installed, the configuration of the RHEL-HA components will be done via an in-cluster entity. This entity could be a dedicated in-cluster 2NO setup operator or a function of CEO triggering a script on one of the control-plane nodes. +This script needs to be run with root permissions, so this is another factor to consider when evaluating if a new in-cluster operator is needed. Regardless, this initialization will require that RedFish details have been collected by the installer and synced to the nodes. Sensible defaults will be chosen where possible, and user customization only where necessary. @@ -255,32 +349,28 @@ Tools for extracting support information (must-gather tarballs) will be updated As part of the fencing setup, the cri-o and kubelet services will still be owned by systemd when running under pacemaker. The main difference is that the resource agent will be responsible for signaling systemd to change their active states. The etcd pods are different in this respect since they will be restarted using Podman, but this will be running as root, as it was under CEO. +#### 2NO Setup Operator + +From a high level, the proposed 2 setup operator's job is to ensure that the RHEL-HA components can be initialized with agents (resource, fencing, etc.). +The most involved aspect of this is triggering the pacemaker initialization script. It is an open question as to whether this should be a mechanism leveraged to notify +the user if one or more of these agents is unhealthy. + #### Platform None vs. Baremetal -One of the major design questions of two-node OpenShift is whether to target support for `platform: none` or `platform: baremetal`. The advantage of selecting `platform: baremetal` is -that we can leverage the benefits of deploying an ingress-VIP out of the box using keepalived and haproxy. After some discussion with the metal networking team, it is -expected that this might work without modifications as long as pacemaker fencing doesn't remove nodes from the node list so that both keepalived instances are always peers. Furthermore, -it was noted that this might be solved more simply without keepalived at all by using the ipaddr2 resource agent for pacemaker to run the `ip addr add` and `ip addr remove` commands for the VIP. +One of the major design questions of two-node OpenShift is whether to target support for `platform: none` or `platform: baremetal`. The advantage of selecting `platform: baremetal` is that we can leverage the benefits of deploying an ingress-VIP out of the box using keepalived and haproxy. After some discussion with the metal networking team, it is expected that this might work without modifications as long as pacemaker fencing doesn't remove nodes from the node list so that both keepalived instances are always peers. Furthermore, it was noted that this might be solved more simply without keepalived at all by using the ipaddr2 resource agent for pacemaker to run the `ip addr add` and `ip addr remove` commands for the VIP. +The bottom line is that it will take some engineering effort to modify the out-of-the-box in-cluster networking feature for two-node OpenShift. + +Outside of potentially reusing the networking bits of `platform: baremetal`, we discussed potentially reusing its API for collecting BMC credentials for fencing. 
In this approach, the `platform: baremetal` BMC entries would be loaded into BareMetalHost CRDs and we'd extend BMO to initialize pacemaker instead of a new operator. After a discussion with the Baremetal Platform team, we were advised against using the Baremetal Operator as an inventory. Its purpose/scope is provisioning nodes.
+
+This means that the Baremetal Operator is not initially in scope for a two-node cluster because we don't intend to support compute nodes. However, if this requirement were to change for future business opportunities, it may still be useful to provide the user with an install-time option for deploying the Baremetal Operator.
 
-Given the likelihood of customers being sensitive to the footprint of the platform operators running on the cluster, the safest path forward is to target 2NO clusters on `platform: None` clusters.
-By default, this will require customers to provide an ingress load balancer; however, if it isn't a business requirement, we can work with the Metal Networking team to prioritize this
-as a feature for `platform: none` clusters in the future.
+Given the likelihood of customers wanting flexibility over the footprint and capabilities of the platform operators running on the cluster, the safest path forward is to target 2NO clusters on both `platform: none` and `platform: baremetal` clusters.
+
+For `platform: none` clusters, this will require customers to provide an ingress load balancer. That said, if in-cluster networking becomes a feature customers request for `platform: none`, we can work with the Metal Networking team to prioritize this as a feature for this platform in the future.
 
 #### Graceful vs. Unplanned Reboots
-Events that have to be handled uniquely by a two-node cluster can largely be categorized into one of two buckets. In the first bucket, we have things that trigger graceful reboots. This includes events like upgrades, MCO-triggered reboots, and users sending a shutdown command to one of the nodes. 
In each of these cases - assuming a functioning two-node cluster - the node that is shutting down must wait for pacemaker to signal to etcd to remove the node from the etcd quorum to maintain e-quorum. When the node reboots, it must rejoin the etcd cluster and sync its database to the active node. -Unplanned reboots include any event where one of the nodes cannot signal to etcd that it needs to leave the cluster. This includes situations such as a network disconnection between the nodes, power outages, or turning -off a machine using a command like `poweroff -f`. The point is that a machine needs to be fenced so that the other node can perform a special recovery operation. This recovery involves -pacemaker restarting the etcd on the surviving node with a new cluster ID as a cluster-of-one. This way, when the other node rejoins, it must reconcile its data directory and resync to the new -cluster before it can rejoin as an active peer. +Unplanned reboots include any event where one of the nodes cannot signal to etcd that it needs to leave the cluster. This includes situations such as a network disconnection between the nodes, power outages, or turning off a machine using a command like `poweroff -f`. The point is that a machine needs to be fenced so that the other node can perform a special recovery operation. This recovery involves pacemaker restarting the etcd on the surviving node with a new cluster ID as a cluster-of-one. This way, when the other node rejoins, it must reconcile its data directory and resync to the new cluster before it can rejoin as an active peer. #### Failure Scenario Timelines: This section provides specific steps for how two-node clusters would handle interesting events. @@ -301,7 +391,7 @@ This section provides specific steps for how two-node clusters would handle inte 11. Fully functional cluster 2. Network Failure 1. Corosync on both nodes detects separation - 2. Etcd loses internal quorum (E-quorum) and goes read-only + 2. Internal quorum for etcd (E-quorum) and goes read-only 3. Both sides retain C-quorum and initiate fencing of the other side. RHEL-HA's fencing priority avoids parallel fencing operations and thus the total shutdown of the system. 4. One side wins, pre-configured as Node1 @@ -319,7 +409,7 @@ This section provides specific steps for how two-node clusters would handle inte 14. Cluster continues with 1+1 redundancy 3. Node Failure 1. Corosync on the survivor (Node1) - 2. Etcd loses internal quorum (E-quorum) and goes read-only + 2. Internal quorum for etcd (E-quorum) and goes read-only 3. Node1 retains “corosync quorum” (C-quorum) and initiates fencing of Node2 4. Pacemaker on Node1 restarts etcd forcing a new cluster with old state to recover E-quorum. Node2 is added to etcd members list as learning member. 5. Cluster continues with no redundancy @@ -336,9 +426,9 @@ This section provides specific steps for how two-node clusters would handle inte 4. Two Failures 1. Node2 failure (1st failure) 2. Corosync on the survivor (Node1) - 3. Etcd loses internal quorum (E-quorum) and goes read-only + 3. Internal quorum for etcd (E-quorum) and goes read-only 4. Node1 retains “corosync quorum” (C-quorum) and initiates fencing of Node2 - 5. Pacemaker on Node1 restarts Etcd forcing a new cluster with old state to recover E-quorum. Node2 is added to etcd members list as learning member. + 5. Pacemaker on Node1 restarts etcd forcing a new cluster with old state to recover E-quorum. Node2 is added to etcd members list as learning member. 6. 
Cluster continues with no redundancy 7. Node1 experience a power failure (2nd Failure) 8. … time passes … @@ -351,7 +441,7 @@ This section provides specific steps for how two-node clusters would handle inte 2. Pacemaker restarts kubelet 3. Stop failure is optionally escalated to a node failure (fencing) 4. Start failure defaults to leaving the service offline -6. Etcd Failure +6. Failure in etcd 1. Pacemaker’s monitoring detects the failure 2. Pacemaker removes etcd from the members list and restart it, so it can resync 3. Stop failure is optionally escalated to a node failure (fencing) @@ -371,11 +461,11 @@ This proposal is an alternative architecture to Single-node and MicroShift, so i 1. Mitigation: RHEL-HA requires fencing of a presumed dead peer before restarting etcd as a cluster-of-one 2. Mitigation: Peers remain inert (unable to fence peers, or start cri-o, kubelet, or etcd) after rebooting until they can contact their peer -2. Risk: Multiple entities (RHEL-HA, CEO) attempting to manage etcd membership would cause an internal split-brain +2. Risk: Multiple entities (RHEL-HA, CEO) attempting to manage etcd membership would create multiple containers competing to control the same ports and database files 1. Mitigation: The CEO will run in a mode that does manage not etcd membership 3. Risk: Other operators that perform power-management functions could conflict with pacemaker. - 1. Mitigation: Update the Baremetal and Node Health Check operators to ensure control plane nodes can not perform power operations in the 2-node topology. + 1. Mitigation: Update the Baremetal and Node Health Check operators to ensure control-plane nodes can not perform power operations for the control-plane nodes in the two-node topology. 4. Risk: Rebooting the surviving peer would require human intervention before the cluster starts, increasing downtime and creating an admin burden at remote sites 1. Mitigation: Lifecycle events, such as upgrades and applying new `MachineConfig`s, are not permitted in a single-node degraded state @@ -385,7 +475,7 @@ This proposal is an alternative architecture to Single-node and MicroShift, so i 5. Risk: “Something changed, let's reboot” is somewhat baked into OCP’s DNA and has the potential to be problematic when nodes are actively watching for their peer to disappear, and have an obligation to promptly act on that disappearance by power cycling them. 1. Mitigation: Identify causes of reboots, and either avoid them or ensure they are not treated as failures. - This may require an additional enhancement. + Most OpenShift-trigger events, such as upgrades and MCO-triggered restarts, should follow the logic described above for graceful reboots, which should result in minimal disruption. 6. Risk: We may not succeed in identifying all the reasons a node will reboot 1. Mitigation: ... testing? ... @@ -404,8 +494,35 @@ Satisfying this demand would come with significant technical and support overhea ## Open Questions [optional] 1. Are there any normal lifecycle events that would be interpreted by a peer as a failure, and where the resulting "recovery" would create unnecessary downtime? How can these be avoided? + 2. In the test plan, which subset of layered products needs to be evaluated for the initial release (if any)? +3. Can we do pacemaker initialization without the introduction of a new operator? + + We've talked over the pros and cons of a new operator to handle aspects of the 2NO setup. 
The primary job of a 2NO setup operator would be to initialize pacemaker and + to ensure that it reaches a healthy state. This becomes a simple way of kicking off the transition from CEO controlled etcd to RHEL-HA controlled etcd. As an operator, + it can also degrade during installation to ensure that installation fails if fencing credentials are invalid or the etcd containers cannot be started. The last benefit is + that the operator could later be used to communicate information about pacemaker to a cluster admin in case the resource and/or fencing agents become unhealthy. + + After some discussion, we're prioritizing an exploration of a solution to this initialization without introducing a new operator. The operator that is closest in scope + to pacemaker initialization is the cluster-etcd-operator. Ideally, we could have it be responsible for kicking off the initialization of pacemaker, since the core of a + successful 2NO setup is to ensure etcd ownership is transitioned to a healthy RHEL-HA deployment. While it is a little unorthodox for a core operator to initialize an external + component, that component is tightly coupled with the health of etcd to begin with and they benefit from being deployed and tested together. Additionally, most cases that + would result in pacemaker failing to initialize would result in CEO being degraded as well. One concern raised for this approach is that we may introduce a greater security + risk since CEO permissions need to be elevated so that a container can run as root to initialize pacemaker. The other challenge to solve with this approach is how we + communicate problems discovered by pacemaker to the user. + +4. How do we notify the user of problems found by pacemaker? + + Pacemaker will be running as a system daemon and reporting errors about its various agents to the system journal. The question is, what is the best way to expose these to + a cluster admin? A simple example of this would be an issue where pacemaker discovers that its fencing agent can no longer talk to the BMC. What is the best way to raise this + error to the cluster admin, such that they can see that their cluster may be at risk of failure if no action is taken to resolve the problem? If we introduce a 2NO setup operator, this could be one of the ongoing functions of this operator. In our current design, we'd likely need to explore what kinds of errors we can bubble up through existing cluster health APIs to see if something suitable can be reused. + +5. How do we handle updates to the etcd pod? + + Things like certificate rotations and image updates will necessitate updates to the pacemaker-controlled etcd pod. We will need to introduce some kind of mechanism + where CEO can describe the changes that need to happen and trigger an image update. We might be able to leverage [podman play kube](https://docs.podman.io/en/v4.2/markdown/podman-play-kube.1.html) to map the static pod definition to a container, but we will need to find a way to get CEO to render what would usually be the contents of the static pod config to somewhere pacemaker can see updates and respond to them. + ## Test Plan **Note:** *Section not required until targeted at a release.* @@ -422,7 +539,7 @@ The initial release of 2NO should aim to build a regression baseline. | Test | Node failure [^2] | A new 2NO test to detect if the cluster recovers if a node crashes. | | Test | Network failure [^2] | A new 2NO test to detect if the cluster recovers if the network is disrupted such that a node is unavailable. 
| | Test | Kubelet failure [^2] | A new 2NO test to detect if the cluster recovers if kubelet fails. | -| Test | Etcd failure [^2] | A new 2NO test to detect if the cluster recovers if etcd fails. | +| Test | Failure in etcd [^2] | A new 2NO test to detect if the cluster recovers if etcd fails. | [^1]: This will be added after the initial release when more than one minor version of OpenShift is compatible with the topology. @@ -450,7 +567,7 @@ Additionally, it would be good to have workload-specific testing once those are ### Dev Preview -> Tech Preview -- Ability to install a 2-node cluster using assisted installer (via ACM) and agent-based installer +- Ability to install a two-node cluster using assisted installer (via ACM) and agent-based installer - End user documentation, relative API stability - Sufficient test coverage (see test plan above) @@ -459,7 +576,8 @@ Additionally, it would be good to have workload-specific testing once those are - Working upgrades - Upgrade tests - Available by default -- Backhaul SLI telemetry +- Documentation for replacing a failed control-plane node +- Documentation for post-installation fencing validation - Performance testing - User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/) @@ -499,48 +617,40 @@ is ensuring the cluster stays functional and consistent through the reboots of t The other potential impact is around reboots. There may be a small performance impact when the nodes reboot since they have to leave the etcd cluster and resync etcd to join. -- How is the impact on existing SLIs to be measured and when (e.g. every release by QE, or - automatically in CI) and by whom (e.g. perf team; name the responsible person and let them review - this enhancement) +- How is the impact on existing SLIs to be measured and when (e.g. every release by QE, or automatically in CI) and by whom (e.g. perf team; name the responsible person and let them review this enhancement) The impact of the etcd transition as well as the reboot interactions with etcd are likely compatible with existing SLIs. - Describe the possible failure modes of the API extensions. There shouldn't be any failures introduced by adding a new topology. - Etcd transitioning from CEO-managed to externally managed should be a one-way switch. If it fails, it should result in a failed installation. - Similarly, introducing an operator to remediate the fencing agent exists so that we can degrade the operator if the fencing credentials are invalid or if fencing cannot be initialized, - leading to a failed installation. The special case is if the BMC credentials are rotated. Here, the operator's role is to notify the cluster admin that the pacemaker fencing agent is unhealthy. - - If the administrator fails to remediate this before a fencing operation is required, then manual recovery will be required by SSH-ing to the node. + Transitioning etcd from CEO-managed to externally managed should be a one-way switch, verified by a ValidatingAdmissionPolicy. If it fails, it should result in a failed installation. + This is also true of the initial fencing setup. If pacemaker cannot be initialized, the cluster installation should ideally fail. + If the BMC access starts failing later in the cluster lifecycle and the administrator fails to remediate this before a fencing operation is required, then manual recovery will be required by SSH-ing to the node. - Describe how a failure or behaviour of the extension will impact the overall cluster health (e.g. 
which kube-controller-manager functionality will stop working), especially regarding stability, availability, performance, and security. - As mentioned above, a BMC credential rotation could result in a customer "breaking" pacemaker's ability to fence nodes. On its own, pacemaker has no way of communicating this kind - of failure to the cluster admin. The proposed 2NO resource agent remediation operator could detect this and warn the cluster administrator that action is required. Alternatively, - RHEL-HA could detect this and immediately respond by turning off etcd for one of the nodes - forcing the cluster into read-only mode until a user manually logs in and restores the fencing agent. - To minimize disruption to user workloads, it would be best to avoid leaving the cluster in a read-only state unless we enter a state where fencing would be required for the cluster - to recover. + As mentioned above, a network outage or a BMC credential rotation could result in a customer "breaking" pacemaker's ability to fence nodes. On its own, pacemaker has no way of communicating this kind of failure to the cluster admin. Whether this can be raised to OpenShift via monitoring or something similar remains an open question. - Describe which OCP teams are likely to be called upon in case of escalation with one of the failure modes and add them as reviewers to this enhancement. - In case of escalation, the most likely team affected outside of Edge Enablement (or whoever owns the proposed topology) is the etcd team because forcing the cluster into a read-only - state is the primary mechanism two-node can use to protect against data corruption/loss. + In case of escalation, the most likely team affected outside of Edge Enablement (or whoever owns the proposed topology) is the Control Plane team because etcd is the primary component that two-node OpenShift needs to manage properly to protect against data corruption/loss. ## Support Procedures - Failure logs for pacemaker will be available in the system journal. The installer should report these in the case that a cluster cannot successfully initialize pacemaker. -- A BMC connection failure detected after the cluster is installed can be remediated as long as the cluster is healthy and can be done by updating the CRD for the proposed 2NO resource agent remediation operator. +- A BMC connection failure detected after the cluster is installed can be remediated as long as the cluster is healthy. A new MachineConfig can be applied to update the secrets file. If the cluster + is down, this file would need to be updated manually. - In the case of a failed two-node cluster, there is no supported way of migrating to a different topology. The most practical option would be to deploy a fresh environment. ## Alternatives * MicroShift was considered as an alternative but it was ruled out because it does not support multi-node and has a very different experience than OpenShift which does not match the 2NO initiative which is on getting the OpenShift experience on two nodes -* 2 SNO + KCP +* 2 SNO + KCP: [KCP](https://github.com/kcp-dev/kcp/) allows you to manage multiple clusters from a single control-plane, reducing the complexity of managing each cluster independently. With kcp, you can manage the two single-node clusters, each single-node OpenShift cluster can continue to operate independently even if the central kcp management plane becomes unavailable. 
The main advantage of this approach is that it doesn’t require inventing a new Openshift flavor and we don’t need to create a new installation flow to accommodate it. @@ -551,4 +661,4 @@ Disadvantages: ## Infrastructure Needed [optional] -A new repository in the OpenShift GitHub organization will be created for the 2NO resource agent remediation operator. +A new repository in the OpenShift GitHub organization will be created for the 2NO setup operator if we decide to proceed with this design. From 32c5dc5464e3be15af7406f29c8fcd3de3b5112d Mon Sep 17 00:00:00 2001 From: Jeremy Poulin Date: Fri, 17 Jan 2025 10:11:47 -0500 Subject: [PATCH 39/49] Updated 2NO acronym to TNF --- .../2no.md => two-node-fencing/tnf.md} | 38 +++++++++---------- 1 file changed, 19 insertions(+), 19 deletions(-) rename enhancements/{two-nodes-openshift/2no.md => two-node-fencing/tnf.md} (97%) diff --git a/enhancements/two-nodes-openshift/2no.md b/enhancements/two-node-fencing/tnf.md similarity index 97% rename from enhancements/two-nodes-openshift/2no.md rename to enhancements/two-node-fencing/tnf.md index d043769702..aa000e3f49 100644 --- a/enhancements/two-nodes-openshift/2no.md +++ b/enhancements/two-node-fencing/tnf.md @@ -1,5 +1,5 @@ --- -title: 2no +title: tnf authors: - "@mshitrit" - "@jaypoulz" @@ -30,7 +30,7 @@ tracking-link: - https://issues.redhat.com/browse/OCPSTRAT-1514 --- -# Two Nodes Openshift (2NO) - Control Plane Availability +# Two Node Fencing (TNF) ## Terms @@ -152,7 +152,7 @@ Three aspects of cluster creation need to happen for a vanilla two-node cluster ###### Transitioning etcd Management to RHEL-HA An important facility of the installation flow is the transition from a CEO deployed etcd to one controlled by RHEL-HA. The basic transition works as follows: 1. [MCO extensions](https://docs.openshift.com/container-platform/4.17/machine_configuration/machine-configs-configure.html#rhcos-add-extensions_machine-configs-configure) are used to ensure that the pacemaker and corosync RPMs are installed. The installer also creates MachineConfig manifests to pre-configure resource agents. -2. Upon detection that the cluster infrastructure is using the DualReplica controlPlaneTopology in the infrastructure config, an in-cluster entity (see open questions regarding whether this should be handled by CEO or a new 2NO setup operator) will run a command on one of the cluster nodes to initialize pacemaker. The outcome of this is that the resource agent will be started on both nodes. +2. Upon detection that the cluster infrastructure is using the DualReplica controlPlaneTopology in the infrastructure config, an in-cluster entity (see open questions regarding whether this should be handled by CEO or a new TNF setup operator) will run a command on one of the cluster nodes to initialize pacemaker. The outcome of this is that the resource agent will be started on both nodes. 3. The aforementioned in-cluster entity will signal CEO to relinquish control of etcd by setting CEO's `managedEtcdKind` to `External`. When this happens, CEO immediately removes the etcd pod from the static pod configs. The resource agents for etcd are running from step 2, and they are configured to wait for etcd pods to be gone so they can restart them using Podman. 4. The installation proceeds as normal once the pods start. If for some reason, the etcd pods cannot be started, then the installation will fail. The installer will pull logs from the control-plane nodes to provide context for this failure. 
@@ -312,7 +312,7 @@ sshKey: '' ### Topology Considerations -2NO represents a new topology and is not appropriate for use with HyperShift, SNO, or MicroShift +TNF represents a new topology and is not appropriate for use with HyperShift, SNO, or MicroShift #### Standalone Clusters @@ -328,7 +328,7 @@ So far, we've discovered topology-sensitive logic in ingress, authentication, CE The delivery of RHEL-HA components will be opaque to the user and be delivered as an [MCO Extension](../rhcos/extensions.md) in the 4.18 and 4.19 timeframes. A switch to [MCO Layering](../ocp-coreos-layering/ocp-coreos-layering.md ) will be investigated once it is GA in a shipping version of OpenShift. -Once installed, the configuration of the RHEL-HA components will be done via an in-cluster entity. This entity could be a dedicated in-cluster 2NO setup operator or a function of CEO triggering a script on one of the control-plane nodes. +Once installed, the configuration of the RHEL-HA components will be done via an in-cluster entity. This entity could be a dedicated in-cluster TNF setup operator or a function of CEO triggering a script on one of the control-plane nodes. This script needs to be run with root permissions, so this is another factor to consider when evaluating if a new in-cluster operator is needed. Regardless, this initialization will require that RedFish details have been collected by the installer and synced to the nodes. @@ -349,7 +349,7 @@ Tools for extracting support information (must-gather tarballs) will be updated As part of the fencing setup, the cri-o and kubelet services will still be owned by systemd when running under pacemaker. The main difference is that the resource agent will be responsible for signaling systemd to change their active states. The etcd pods are different in this respect since they will be restarted using Podman, but this will be running as root, as it was under CEO. -#### 2NO Setup Operator +#### TNF Setup Operator From a high level, the proposed 2 setup operator's job is to ensure that the RHEL-HA components can be initialized with agents (resource, fencing, etc.). The most involved aspect of this is triggering the pacemaker initialization script. It is an open question as to whether this should be a mechanism leveraged to notify @@ -363,7 +363,7 @@ Outside of potentially reusing the networking bits of `platform: baremetal`, we This means that the Baremetal Operator is not initially in scope for a two-node cluster because we don't intend to support compute nodes. However, if this requirement were to change for future business opportunities, it may still be useful to provide the user with an install-time option for deploying the Baremetal Operator. -Given the likelihood of customers wanting flexibility over the footprint and capabilities of the platform operators running on the cluster, the safest path forward is to target 2NO clusters on both `platform: none` and platform `platform: baremetal` clusters. +Given the likelihood of customers wanting flexibility over the footprint and capabilities of the platform operators running on the cluster, the safest path forward is to target TNF clusters on both `platform: none` and platform `platform: baremetal` clusters. For `platform: none` clusters, this will require customers to provide an ingress load balancer. That said, if in-cluster networking becomes a feature customers request for `platform: none` we can work with the Metal Networking team to prioritize this as a feature for this platform in the future. 
@@ -499,14 +499,14 @@ Satisfying this demand would come with significant technical and support overhea 3. Can we do pacemaker initialization without the introduction of a new operator? - We've talked over the pros and cons of a new operator to handle aspects of the 2NO setup. The primary job of a 2NO setup operator would be to initialize pacemaker and + We've talked over the pros and cons of a new operator to handle aspects of the TNF setup. The primary job of a TNF setup operator would be to initialize pacemaker and to ensure that it reaches a healthy state. This becomes a simple way of kicking off the transition from CEO controlled etcd to RHEL-HA controlled etcd. As an operator, it can also degrade during installation to ensure that installation fails if fencing credentials are invalid or the etcd containers cannot be started. The last benefit is that the operator could later be used to communicate information about pacemaker to a cluster admin in case the resource and/or fencing agents become unhealthy. After some discussion, we're prioritizing an exploration of a solution to this initialization without introducing a new operator. The operator that is closest in scope to pacemaker initialization is the cluster-etcd-operator. Ideally, we could have it be responsible for kicking off the initialization of pacemaker, since the core of a - successful 2NO setup is to ensure etcd ownership is transitioned to a healthy RHEL-HA deployment. While it is a little unorthodox for a core operator to initialize an external + successful TNF setup is to ensure etcd ownership is transitioned to a healthy RHEL-HA deployment. While it is a little unorthodox for a core operator to initialize an external component, that component is tightly coupled with the health of etcd to begin with and they benefit from being deployed and tested together. Additionally, most cases that would result in pacemaker failing to initialize would result in CEO being degraded as well. One concern raised for this approach is that we may introduce a greater security risk since CEO permissions need to be elevated so that a container can run as root to initialize pacemaker. The other challenge to solve with this approach is how we @@ -516,7 +516,7 @@ Satisfying this demand would come with significant technical and support overhea Pacemaker will be running as a system daemon and reporting errors about its various agents to the system journal. The question is, what is the best way to expose these to a cluster admin? A simple example of this would be an issue where pacemaker discovers that its fencing agent can no longer talk to the BMC. What is the best way to raise this - error to the cluster admin, such that they can see that their cluster may be at risk of failure if no action is taken to resolve the problem? If we introduce a 2NO setup operator, this could be one of the ongoing functions of this operator. In our current design, we'd likely need to explore what kinds of errors we can bubble up through existing cluster health APIs to see if something suitable can be reused. + error to the cluster admin, such that they can see that their cluster may be at risk of failure if no action is taken to resolve the problem? If we introduce a TNF setup operator, this could be one of the ongoing functions of this operator. In our current design, we'd likely need to explore what kinds of errors we can bubble up through existing cluster health APIs to see if something suitable can be reused. 5. How do we handle updates to the etcd pod? 
@@ -528,18 +528,18 @@ Satisfying this demand would come with significant technical and support overhea **Note:** *Section not required until targeted at a release.* ### CI -The initial release of 2NO should aim to build a regression baseline. +The initial release of TNF should aim to build a regression baseline. | Type | Name | Description | | ----- | ----------------------------- | --------------------------------------------------------------------------- | | Job | End-to-End tests (e2e) | The standard test suite (openshift/conformance/parallel) for establishing a regression baseline between payloads. | | Job | Upgrade between z-streams | The standard test suite for evaluating upgrade behavior between payloads. | | Job | Upgrade between y-streams [^1] | The standard test suite for evaluating upgrade behavior between payloads. | -| Suite | 2NO Recovery | This is a new suite consisting of the tests listed below. | -| Test | Node failure [^2] | A new 2NO test to detect if the cluster recovers if a node crashes. | -| Test | Network failure [^2] | A new 2NO test to detect if the cluster recovers if the network is disrupted such that a node is unavailable. | -| Test | Kubelet failure [^2] | A new 2NO test to detect if the cluster recovers if kubelet fails. | -| Test | Failure in etcd [^2] | A new 2NO test to detect if the cluster recovers if etcd fails. | +| Suite | TNF Recovery | This is a new suite consisting of the tests listed below. | +| Test | Node failure [^2] | A new TNF test to detect if the cluster recovers if a node crashes. | +| Test | Network failure [^2] | A new TNF test to detect if the cluster recovers if the network is disrupted such that a node is unavailable. | +| Test | Kubelet failure [^2] | A new TNF test to detect if the cluster recovers if kubelet fails. | +| Test | Failure in etcd [^2] | A new TNF test to detect if the cluster recovers if etcd fails. | [^1]: This will be added after the initial release when more than one minor version of OpenShift is compatible with the topology. @@ -547,7 +547,7 @@ topology. being restarted mid-test. ### QE -This section outlines test scenarios for 2NO. +This section outlines test scenarios for TNF. | Scenario | Description | | ----------------------------- | ----------------------------------------------------------------------------------- | @@ -648,7 +648,7 @@ is ensuring the cluster stays functional and consistent through the reboots of t ## Alternatives -* MicroShift was considered as an alternative but it was ruled out because it does not support multi-node and has a very different experience than OpenShift which does not match the 2NO initiative which is on getting the OpenShift experience on two nodes +* MicroShift was considered as an alternative but it was ruled out because it does not support multi-node and has a very different experience than OpenShift which does not match the TNF initiative which is on getting the OpenShift experience on two nodes * 2 SNO + KCP: [KCP](https://github.com/kcp-dev/kcp/) allows you to manage multiple clusters from a single control-plane, reducing the complexity of managing each cluster independently. @@ -661,4 +661,4 @@ Disadvantages: ## Infrastructure Needed [optional] -A new repository in the OpenShift GitHub organization will be created for the 2NO setup operator if we decide to proceed with this design. +A new repository in the OpenShift GitHub organization will be created for the TNF setup operator if we decide to proceed with this design. 
From 24be9f2c74f279f6e81ebaf2ca485e9a8e57a6a3 Mon Sep 17 00:00:00 2001 From: Jeremy Poulin Date: Mon, 13 Jan 2025 16:03:57 -0500 Subject: [PATCH 40/49] OCPBUGS-1460: Update 2NO EP with relevant structure and content from OLA EP. In this commit, I ran through all of the content in the OLA EP (https://github.com/openshift/enhancements/pull/1674/files) and brought over any details or structure that I found relevant or helpful for the 2-node proposal. I also tweaked some sections with formatting updates and updated some of the overall content to reflect the current plan and understanding. --- enhancements/two-node-fencing/tnf.md | 279 ++++++++++++++++++--------- 1 file changed, 191 insertions(+), 88 deletions(-) diff --git a/enhancements/two-node-fencing/tnf.md b/enhancements/two-node-fencing/tnf.md index aa000e3f49..37abadda46 100644 --- a/enhancements/two-node-fencing/tnf.md +++ b/enhancements/two-node-fencing/tnf.md @@ -34,29 +34,29 @@ tracking-link: ## Terms -RHEL-HA - a general-purpose clustering stack shipped by Red Hat (and others) primarily consisting of Corosync and Pacemaker. Known to be in use by airports, financial exchanges, and defense organizations, as well as used on trains, satellites, and expeditions to Mars. +**RHEL-HA** - a general-purpose clustering stack shipped by Red Hat (and others) primarily consisting of Corosync and Pacemaker. Known to be in use by airports, financial exchanges, and defense organizations, as well as used on trains, satellites, and expeditions to Mars. -Corosync - a Red Hat led [open-source project](https://corosync.github.io/corosync/) that provides a consistent view of cluster membership, reliable ordered messaging, and flexible quorum capabilities. +**Corosync** - a Red Hat led [open-source project](https://corosync.github.io/corosync/) that provides a consistent view of cluster membership, reliable ordered messaging, and flexible quorum capabilities. -Pacemaker - a Red Hat led [open-source project](https://clusterlabs.org/pacemaker/doc/) that works in conjunction with Corosync to provide general-purpose fault tolerance and automatic failover for critical services and applications. +**Pacemaker** - a Red Hat led [open-source project](https://clusterlabs.org/pacemaker/doc/) that works in conjunction with Corosync to provide general-purpose fault tolerance and automatic failover for critical services and applications. -Fencing - the process of “somehow” isolating or powering off malfunctioning or unresponsive nodes to prevent them from causing further harm, such as data corruption or the creation of divergent datasets. +**Fencing** - the process of “somehow” isolating or powering off malfunctioning or unresponsive nodes to prevent them from causing further harm, such as data corruption or the creation of divergent datasets. -Quorum - having the minimum number of members required for decision-making. The most common threshold is 1 plus half the total number of members, though more complicated algorithms predicated on fencing are also possible. +**Quorum** - having the minimum number of members required for decision-making. The most common threshold is 1 plus half the total number of members, though more complicated algorithms predicated on fencing are also possible. 
* C-quorum: quorum as determined by Corosync members and algorithms * E-quorum: quorum as determined by etcd members and algorithms -Split-brain - a scenario where a set of peers are separated into groups smaller than the quorum threshold AND peers decide to host services already running in other groups. Typically, it results in data loss or corruption. +**Split-brain** - a scenario where a set of peers are separated into groups smaller than the quorum threshold AND peers decide to host services already running in other groups. Typically, it results in data loss or corruption. -MCO - Machine Config Operator. This operator manages updates to the node's systemd, cri-o/kubelet, kernel, NetworkManager, etc., and can write custom files to it, configurable by MachineConfig custom resources. +**MCO** - Machine Config Operator. This operator manages updates to the node's systemd, cri-o/kubelet, kernel, NetworkManager, etc., and can write custom files to it, configurable by MachineConfig custom resources. -ABI - Agent-Based Installer. +**ABI** - Agent-Based Installer. A installation path through the core-installer that leverages the assisted service to facilate baremetal installations. Can be used in disconnected environments. -BMO - Baremetal Operator +**BMO** - Baremetal Operator. An optional operator whose primary function is to provide the ability to scale clusters in baremetal deployment environments. -CEO - cluster-etcd-operator +**CEO** - cluster-etcd-operator. The OpenShift operator responsible for deploying and maintaining healthy etcd instances for the cluster. -BMC - Baseboard Management Console. Used to manage baremetal machines. Can modify firmware settings and machine power state. +**BMC** - Baseboard Management Console. Used to manage baremetal machines. Can modify firmware settings and machine power state. ## Summary @@ -73,31 +73,35 @@ This requires our solution to provide a management experience consistent with "n ### User Stories -* As a large enterprise with multiple remote sites, I want a cost-effective OpenShift cluster solution so that I can manage containers without the overhead of a third node. -* As a support engineer, I want a safe and automated method for handling the failure of a single node so that the downtime of the control-plane is minimized. -* As an enterprise running workloads on a minimal OpenShift footprint, I want to minimize time-to-recovery and data loss for my workloads when a node fails. +* As a solutions architect for a large enterprise with multiple remote sites, I want a cost-effective OpenShift cluster solution so that I can manage containers without the overhead of a third node. +* As a solutions architect for a large enterprise running workloads on a minimal OpenShift footprint, I want to leverage the full capacity of both control-plane nodes to run my workloads. +* As a solutions architect for a large enterprise running workloads on a minimal OpenShift footprint, I want to minimize time-to-recovery and data loss for my workloads when a node fails. +* As an OpenShift cluster administrator, I want a safe and automated method for handling the failure of a single node so that the downtime of the control-plane is minimized and the cluster fully recovers. 
### Goals * Provide a two-node control-plane for physical hardware that is resilient to a node-level failure for either node * Provide a transparent installation experience that starts with exactly 2 blank physical nodes, and ends with a fault-tolerant two-node cluster * Prevent both data corruption and divergent datasets in etcd -* Minimize recovery-caused unavailability. Eg. by avoiding fencing loops, wherein each node powers cycles its peer after booting, reducing the cluster's availability. +* Minimize recovery-caused unavailability (e.g. by avoiding fencing loops, wherein each node powers cycles its peer after booting, reducing the cluster's availability) * Recover the API server in less than 120s, as measured by the surviving node's detection of a failure * Minimize any differences to existing OpenShift topologies -* Avoid any decisions that would prevent future implementation and support for upgrade/downgrade paths between two-node and traditional architectures -* Provide an OpenShift cluster experience that is similar to that of a 3-node hyperconverged cluster but with 2 nodes +* Provide an OpenShift cluster experience as similar to that of a 3-node hyperconverged cluster as can be achieved accepting the resiliency compromises of having 2 nodes +* Minimize scenarios that require manual intervention to initiate cluster recovery +* Provide a mechanism for cluster components (e.g. CVO operators, OLM operators) to detect this topology and behave contextually for its limitations ### Non-Goals +* Achieving the same level of resilience or guarantees provided by a cluster with 3 control-plane nodes or 2 nodes with an arbiter * Workload resilience - see related [Pre-DRAFT enhancement](https://docs.google.com/document/d/1TDU_4I4LP6Z9_HugeC-kaQ297YvqVJQhBs06lRIC9m8/edit) * Resilient storage - see future enhancement -* Support for platforms other than bare metal including automated CI testing +* Support for platforms other than `platform: "None"` and `platform: "Baremetal"` * Support for other topologies (eg. hypershift) * Support disconnected cluster installation * Adding worker nodes -* Creation of RHEL-HA events and metrics for consumption by the OpenShift monitoring stack (Deferred to post-MVP) -* Supporting upgrade/downgrade paths between two-node and other topologies (e.g. 3-node compact) (for initial release) +* Supporting upgrade/downgrade paths between 2-node and other topologies (e.g. single-node, 3-node, 2-node with arbiter) (highly requested future extension) +* Creation of RHEL-HA events and metrics for consumption by the OpenShift monitoring stack (future extension) +* Support for IBI (image-based install) and IBU (image-based upgrade) ## Proposal @@ -123,6 +127,26 @@ This functionality exists within RHEL-HA, but a wrapper will be provided to take When starting etcd, the OCF script will use etcd's cluster ID and version counter to determine whether the existing data directory can be reused, or must be erased before joining an active peer. 
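To make the hand-off from CEO to the resource agent more concrete, the following is a minimal sketch of the kind of static-pod-style manifest the OCF script could pass to `podman kube play` when it starts etcd outside of CEO's control. The image reference, paths, and flags are illustrative assumptions; in practice the definition would be rendered by CEO and only the lifecycle would be owned by the agent:

```yaml
# Illustrative sketch only - not the committed interface between CEO and the resource agent.
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: openshift-etcd
spec:
  hostNetwork: true
  containers:
  - name: etcd
    image: <etcd-image-rendered-by-ceo>     # placeholder; the real digest comes from the release payload
    command:
    - etcd
    - --name=<node-name>                    # substituted by the resource agent
    - --data-dir=/var/lib/etcd
    - --initial-cluster-state=existing      # the agent may instead force a new cluster during recovery
    volumeMounts:
    - name: data
      mountPath: /var/lib/etcd
    - name: certs
      mountPath: /etc/kubernetes/etcd-certs
  volumes:
  - name: data
    hostPath:
      path: /var/lib/etcd                   # assumed to reuse the existing etcd data directory
  - name: certs
    hostPath:
      path: /etc/kubernetes/static-pod-resources/etcd-certs   # assumed certificate location
```

Keeping the definition in pod form would let CEO continue to render certificate and image changes in one place while pacemaker owns only starting and stopping the container, which lines up with the open question about etcd updates discussed later in this proposal.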
+ +### Summary of Changes + +At a glance, here are the components we are proposing to change: +| Component | Change | +| ----------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | +| [Feature Gates](#feature-gate-changes) | Add a new `DualReplicaTopology` feature which can be enabled via the `CustomNoUpgrade` feature set | +| [Infrastructure API](#infrastructure-api-changes) | Add `DualReplica` as a new value for `ControlPlaneTopology` | +| [ETCD Operator](#etcd-operator-changes) | Add mode for disabling management of the etcd container, scaling strategy for 2 nodes, and a controller for initializing pacemaker | +| [Install Config](#install-config-changes) | Update install config API to accept fencing credentials for `platform: None` and `platform: Baremetal` | +| [Installer](#installer-changes) | Populate the nodes with initial pacemaker configuration when deploying with 2 control-plane nodes and no arbiter | +| [MCO](#mco-changes) | Add an MCO extension for installating pacemaker and corosync in RHCOS; MachineConfigPool maxUnavailable set to 1 | +| [Authentication Operator](#authentication-operator-changes) | Update operator to accept minimum 1 kube api servers when `ControlPlaneTopology` is `DualReplica` | +| [Hosted Control Plane](#hosted-control-plane-changes) | Disallow HyperShift from installing on the `DualReplica` topology | +| [OLM Filtering](#olm-filtering-changes) | Leverage support for OLM to filter operators based off of control plane topology | +| [Assisted Installer Family](#assisted-installer-family-changes) | Add support for deploying baremetal clusters with 2 control-plane nodes using Assisted Installer and Agent-Based Installer | +| [Baremetal Operator](#baremetal-operator-changes) | Prevent power-management of control-plane nodes when the infrastructureTopology is set to `DualReplica` | +| [Node Health Check Operator](#node-health-check-operator-changes) | Prevent fencing of control-plane nodes when the infrastructureTopology is set to `DualReplica` | + + ### Workflow Description #### Cluster Creation @@ -135,7 +159,7 @@ A critical transition during bootstrapping is when the bootstrap reboots into th Otherwise, the procedure follows the standard flow except for the configuration of 2 nodes instead of 3. To constrain the scope of support, we've targeted Assisted Installer (in ACM) and Agent-Based Installer (ABI) as our supported installation paths. Support for other installation paths -may be reevaluated as business requirements change. For example, it is technically possible to install a cluster with two control-plane nodes via the openshift-installer using an +may be reevaluated as business requirements change. For example, it is technically possible to install a cluster with two control-plane nodes via `openshift-install` using an auxiliary bootstrap node but we don't intend to support this for customers unless this becomes a business requirement. Similarly, ZTP may be evaluated as a future offering for clusters deployed by ACM environments via Multi-Cluster Engine (MCE), Assisted Installer, and Baremetal Operator. @@ -152,12 +176,10 @@ Three aspects of cluster creation need to happen for a vanilla two-node cluster ###### Transitioning etcd Management to RHEL-HA An important facility of the installation flow is the transition from a CEO deployed etcd to one controlled by RHEL-HA. The basic transition works as follows: 1. 
[MCO extensions](https://docs.openshift.com/container-platform/4.17/machine_configuration/machine-configs-configure.html#rhcos-add-extensions_machine-configs-configure) are used to ensure that the pacemaker and corosync RPMs are installed. The installer also creates MachineConfig manifests to pre-configure resource agents. -2. Upon detection that the cluster infrastructure is using the DualReplica controlPlaneTopology in the infrastructure config, an in-cluster entity (see open questions regarding whether this should be handled by CEO or a new TNF setup operator) will run a command on one of the cluster nodes to initialize pacemaker. The outcome of this is that the resource agent will be started on both nodes. -3. The aforementioned in-cluster entity will signal CEO to relinquish control of etcd by setting CEO's `managedEtcdKind` to `External`. When this happens, CEO immediately removes the etcd pod from the static pod configs. The resource agents for etcd are running from step 2, and they are configured to wait for etcd pods to be gone so they can restart them using Podman. -4. The installation proceeds as normal once the pods start. -If for some reason, the etcd pods cannot be started, then the installation will fail. The installer will pull logs from the control-plane nodes to provide context for this failure. - -There is an open question regarding how to handle updates to the etcd pod definition if it needs to change or if certificates are rotated. +2. Upon detection that the cluster infrastructure is using the DualReplica controlPlaneTopology in the infrastructure config, an in-cluster entity (likely a new controller running in CEO) will run a command on one of the cluster nodes to initialize pacemaker. The outcome of this is that the resource agent will be started on both nodes. +3. The aforementioned in-cluster entity will signal CEO to relinquish control of etcd by setting CEO's `managedEtcdKind` to `External`. When this happens, CEO immediately removes the etcd container from the static pod configs. The resource agents for etcd are running from step 2, and they are configured to wait for etcd containers to be gone so they can restart them using Podman. +4. The installation proceeds as normal once the containers start. +If for some reason, the etcd containers cannot be started, then the installation will fail. The installer will pull logs from the control-plane nodes to provide context for this failure. ###### Configuring Fencing Via MCO Fencing setup is the last important aspect of the cluster installation. For the cluster installation to be successful, fencing should be configured and active before we declare the installation successful. To do this, baseboard management console (BMC) credentials need to be made available to the control-plane nodes as part of pacemaker initialization. @@ -190,21 +212,22 @@ Confirmation can be given at any point and optionally make use of SSH to facilit ### API Extensions -Three known capabilities require API extensions. +The full list of changes proposed is [summarized above](#summary-of-changes). Each change is broken down in greater detail below. + +#### Feature Gate Changes -1. Identifying two-node control-plane clusters as a unique topology -2. Telling CEO when it is safe for it to disable certain membership-related functionalities -3. 
Collecting fencing credentials for pacemaker initialization in the install-config +We will define a new `DualReplicaTopology` feature that can be enabled in `install-config.yaml` to ensure the clusters running this feature cannot be upgraded. -#### Unique Topology +#### Infrastructure API Changes A mechanism is needed for components of the cluster to understand that this is a two-node control-plane topology that may require different handling. -We will define a new value for the `TopologyMode` enum: `DualReplica`. -The enum is used for the `controlPlaneTopology` and `infrastructureTopology` fields, and the currently supported values are `HighlyAvailable`, `SingleReplica`, and `External`. +We will define a new value for the `infrastructureTopology` field of the Infrastructure config's `TopologyMode` enum. +Specifically, the value of `DualReplica` will be added to the currently supported list, which includes `HighlyAvailable`, `SingleReplica`, and `External`. -We will additionally define a new feature gate `DualReplicaTopology` that can be enabled in `install-config.yaml` to ensure the feature can be set as `TechPreviewNoUpgrade`. +InfrastructureTopology is assumed to be immutable by components of the cluster. While there is strong interest in changing this to allow for topology +transitions to and from TNF, this is beyond the scope of this enhancement proposal and should be detailed in its own enhancement. -#### CEO Externally Managed etcd +#### etcd Operator Changes Initially, the creation of an etcd cluster will be driven in the same way as other platforms. Once the cluster has two members, the etcd daemon will be removed from the static pod definition and recreated as a resource controlled by RHEL-HA. @@ -216,7 +239,12 @@ This will allow the use of a credential scoped to `ConfigMap`s in the `openshift The plan is for this to be changed by one of the nodes during pacemaker initialization. Pacemaker initialization should be initiated by CEO when it detects that the cluster controlPlane topology is set to `DualReplica`. -#### Install Config with Fencing Credentials +While set to `External`, CEO will still need to render the configuration for the etcd container in a place where it can be consumed by pacemaker. This ensures that the etcd instance managed by pacemaker can be updated +accordingly in the case of a upgrade event or whenever certificates are rotated. + +#### Install Config Changes + +In order to initialize pacemaker with valid fencing credentials, they will be consumed by the installer via the installation config and created on the cluster as a cluster secret. A sample install-config.yaml for `platform: none` type clusters would look like this: ``` @@ -310,14 +338,66 @@ pullSecret: '' sshKey: '' ``` +#### Installer Changes +Aside from the changes mentioned above detailing the new install config, the installer will be also be responsible for detecting a valid two-node footprint +(one that specifies two control-plane nodes and zero arbiters) and populating the nodes with the configuration files and resource scripts needed by pacemaker. +t will also need update the cluster's FeatureGate CR in order to enable the `CustomNoUpgrade` feature set with the `DualReplicaTopology` feature. + +A cluster is assumed to be a 2-node cluster if the following statements are true: +1. The number of control-plane replicas is set to 2 +2. The number of arbiter replicas is set to 0 + +The number of compute nodes is also expected to be zero, but this will not be enforced. 
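As a rough illustration of the gating described above, the installer-rendered manifests could look like the sketch below. This assumes the existing `CustomNoUpgrade` mechanics and the `DualReplicaTopology` feature name from this proposal; the exact rendering is up to the installer:

```yaml
# Illustrative sketch: rendered by the installer when it detects
# two control-plane replicas and zero arbiter replicas.
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
    - DualReplicaTopology
---
# The corresponding topology would then surface in the infrastructure config.
# Whether controlPlaneTopology, infrastructureTopology, or both carry the
# DualReplica value is defined by this enhancement; both are shown here.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  controlPlaneTopology: DualReplica
  infrastructureTopology: DualReplica
```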
+
+Additionally, we will enforce that 2-node clusters are only allowed on platform `None` or platform `Baremetal`. This is not a technical restriction, but
+rather one that stems from a desire to limit the number of supportable configurations. If use cases emerge, cloud support for this topology may be considered
+in the future.
+
+#### MCO Changes
+The delivery of RHEL-HA components will be opaque to the user and be delivered as an [MCO Extension](../rhcos/extensions.md) in the 4.19 timeframe.
+A switch to [MCO Layering](../ocp-coreos-layering/ocp-coreos-layering.md) will be investigated once it is GA in a shipping version of OpenShift.
+
+Additionally, in order to ensure the cluster can upgrade safely, `maxUnavailable` for the control-plane MachineConfigPool will be set to 1. This should prevent
+upgrades from trying to proceed if a node is unavailable.
+
+#### Authentication Operator Changes
+The authentication operator is sensitive to the number of kube-api replicas running in the cluster for
+[test stability reasons](https://github.com/openshift/cluster-authentication-operator/commit/a08be2324f36ce89908f695a6ff3367ad86c6b78#diff-b6f4bf160fd7e801eadfbfac60b84bd00fcad21a0979c2d730549f10db015645R158-R163).
+To get around this, it runs a [readiness check](https://github.com/openshift/cluster-authentication-operator/blob/2d71f164af3f3e9c84eb40669d330df2871956f5/pkg/controllers/readiness/unsupported_override.go#L58)
+that needs to be updated to allow for a minimum of two replicas available when using the `DualReplica` infrastructure topology to ensure test stability.
+
+#### Hosted Control Plane Changes
+Two-node clusters are not compatible with hosted control planes. A check will be needed to disallow hosted control planes when the infrastructureTopology is set
+to `DualReplica`.
+
+#### OLM Filtering Changes
+Layering on top of the enhancement proposal for [Two Node with Arbiter (TNA)](../arbiter-clusters.md#olm-filter-addition), it would be ideal to include
+the `DualReplica` infrastructureTopology as an option that operators can leverage to communicate cluster-compatibility.
+
+#### Assisted Installer Family Changes
+In order to achieve the requirement for deploying two-node OpenShift with only two nodes, we will add support for installing using 2 nodes in the Assisted and
+Agent-Based installers. The core of this change is an update to the assisted-service validation. Care must be taken to ensure that the SaaS offering is not
+included in this, since we do not wish to store customer fencing credentials in a Red Hat database. In other words, installing using the Assisted Installer will
+only be supported using MCE.
+
+#### BareMetal Operator Changes
+Because pacemaker is the only entity allowed to take fencing actions on the control-plane nodes, the baremetal operator will need to be updated to ensure that `BareMetalHost` entries cannot be added for the control-plane nodes.
+This will prevent power management operations from being triggered outside of the purview of pacemaker.
+
+#### Node Health Check Operator Changes
+The Node Health Check operator allows a user to create fencing requests, which are in turn remediated by other optional operators. To ensure that the control-plane nodes
+cannot end up in a reboot deadlock between fencing agents, the Node Health Check operator should be rendered inert on this topology.
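Before turning to the alternative mentioned next (scoping remediation away from the control-plane rather than disabling it entirely), here is a hedged sketch of what such a selector-scoped NodeHealthCheck might look like. It assumes the medik8s NodeHealthCheck API and uses illustrative names; it is not a committed design:

```yaml
# Illustrative only: remediation is limited to non-control-plane nodes,
# leaving pacemaker as the sole fencing authority for the control plane.
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: worker-nodes-only                # illustrative name
spec:
  selector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: DoesNotExist
  minHealthy: "51%"
  unhealthyConditions:
  - type: Ready
    status: "False"
    duration: 300s
  remediationTemplate:                   # illustrative template reference
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    namespace: openshift-workload-availability   # wherever the remediation operator is installed
    name: self-node-remediation-resource-deletion-template
```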
Another approach would +be to enforce this limitation for just the control-plane nodes, but this proposal assumes that no dedicated compute nodes will be run in this topology. + ### Topology Considerations TNF represents a new topology and is not appropriate for use with HyperShift, SNO, or MicroShift #### Standalone Clusters -Two-node OpenShift is first and foremost a topology of OpenShift, so it should be able to run without any assumptions of a cluster manager. To achieve this, we will need to enable the installation -of two-node clusters via the Agent-Based Installer to ensure that we are still meeting the installation requirement of using only 2 nodes. +Two-node OpenShift is first and foremost a topology of OpenShift, so it should be able to run without any assumptions of a cluster manager. To achieve this, we +will need to enable the installation of two-node clusters via the Agent-Based Installer to ensure that we are still meeting the installation requirement of using +only 2 nodes. ### Implementation Details/Notes/Constraints @@ -325,9 +405,6 @@ While the target installation requires exactly 2 nodes, this will be achieved by So far, we've discovered topology-sensitive logic in ingress, authentication, CEO, and the cluster-control-plane-machineset-operator. We expect to find others once we introduce the new infrastructure topology. -The delivery of RHEL-HA components will be opaque to the user and be delivered as an [MCO Extension](../rhcos/extensions.md) in the 4.18 and 4.19 timeframes. -A switch to [MCO Layering](../ocp-coreos-layering/ocp-coreos-layering.md ) will be investigated once it is GA in a shipping version of OpenShift. - Once installed, the configuration of the RHEL-HA components will be done via an in-cluster entity. This entity could be a dedicated in-cluster TNF setup operator or a function of CEO triggering a script on one of the control-plane nodes. This script needs to be run with root permissions, so this is another factor to consider when evaluating if a new in-cluster operator is needed. Regardless, this initialization will require that RedFish details have been collected by the installer and synced to the nodes. @@ -347,13 +424,7 @@ More information on creating OCF agents can be found in the upstream [developer Tools for extracting support information (must-gather tarballs) will be updated to gather relevant logs for triaging issues. As part of the fencing setup, the cri-o and kubelet services will still be owned by systemd when running under pacemaker. The main difference is that the resource agent will be responsible for signaling systemd to change their active states. -The etcd pods are different in this respect since they will be restarted using Podman, but this will be running as root, as it was under CEO. - -#### TNF Setup Operator - -From a high level, the proposed 2 setup operator's job is to ensure that the RHEL-HA components can be initialized with agents (resource, fencing, etc.). -The most involved aspect of this is triggering the pacemaker initialization script. It is an open question as to whether this should be a mechanism leveraged to notify -the user if one or more of these agents is unhealthy. +The etcd containers are different in this respect since they will be restarted using Podman, but this will be running as root, as it was under CEO. #### Platform None vs. Baremetal One of the major design questions of two-node OpenShift is whether to target support for `platform: none` or `platform: baremetal`. 
The advantage of selecting `platform: baremetal` is that we can leverage the benefits of deploying an ingress-VIP out of the box using keepalived and haproxy. After some discussion with the metal networking team, it is expected that this might work without modifications as long as pacemaker fencing doesn't remove nodes from the node list so that both keepalived instances are always peers. Furthermore, it was noted that this might be solved more simply without keepalived at all by using the ipaddr2 resource agent for pacemaker to run the `ip addr add` and `ip addr remove` commands for the VIP. @@ -478,15 +549,25 @@ This proposal is an alternative architecture to Single-node and MicroShift, so i Most OpenShift-trigger events, such as upgrades and MCO-triggered restarts, should follow the logic described above for graceful reboots, which should result in minimal disruption. 6. Risk: We may not succeed in identifying all the reasons a node will reboot - 1. Mitigation: ... testing? ... + 1. Mitigation: So far, we've classified all failure events on this topology as belonging to the expected (i.e. graceful) reboot flow, or the unexpected (i.e. ungraceful) + reboot flow. Should a third classification emerge during testing, we will expand out test to include this and enumerate the events that fall into that flow. 7. Risk: This new platform will have a unique installation flow 1. Mitigation: A new CI lane will be created for this topology +8. Risk: PodDisruptionBudgets behave incorrectly on this topology. + 1. Mitigation: We plan to review each of the PDBs set in the cluster and set topology appropriate configuration for each. + ### Drawbacks -The two-node architecture represents yet another distinct install type for users to choose from. +The two-node architecture represents yet another distinct install type for users to choose from, and therefore another addition to the test matrix +for baremetal installation variants. Because this topology has so many unique failure recovery paths, it also requires an in-depth new test +suite which can be used to exercise all of these failure recovery scenarios. + +More critically, this is the only variant of OpenShift that would recommend a regular maintenance check to ensure that failures that require +fencing result in automatic recovery. Conveying the importance of a regular disaster recovery readiness checks will be an interesting challenge +for the user experience. The existence of 1, 2, and 3+ node control-plane sizes will likely generate customer demand to move between them as their needs change. Satisfying this demand would come with significant technical and support overhead which is out of scope for this enhancement. @@ -495,6 +576,8 @@ Satisfying this demand would come with significant technical and support overhea 1. Are there any normal lifecycle events that would be interpreted by a peer as a failure, and where the resulting "recovery" would create unnecessary downtime? How can these be avoided? + So far, we haven't found any of these. Normal lifecycle events can be handled cleanly through a the graceful recovery flow. + 2. In the test plan, which subset of layered products needs to be evaluated for the initial release (if any)? 3. Can we do pacemaker initialization without the introduction of a new operator? @@ -516,12 +599,16 @@ Satisfying this demand would come with significant technical and support overhea Pacemaker will be running as a system daemon and reporting errors about its various agents to the system journal. 
The question is, what is the best way to expose these to a cluster admin? A simple example of this would be an issue where pacemaker discovers that its fencing agent can no longer talk to the BMC. What is the best way to raise this - error to the cluster admin, such that they can see that their cluster may be at risk of failure if no action is taken to resolve the problem? If we introduce a TNF setup operator, this could be one of the ongoing functions of this operator. In our current design, we'd likely need to explore what kinds of errors we can bubble up through existing cluster health APIs to see if something suitable can be reused. + error to the cluster admin, such that they can see that their cluster may be at risk of failure if no action is taken to resolve the problem? If we introduce a TNF setup + operator, this could be one of the ongoing functions of this operator. In our current design, we'd likely need to explore what kinds of errors we can bubble up through + existing cluster health APIs to see if something suitable can be reused. -5. How do we handle updates to the etcd pod? +5. How do we handle updates to the etcd container? - Things like certificate rotations and image updates will necessitate updates to the pacemaker-controlled etcd pod. We will need to introduce some kind of mechanism - where CEO can describe the changes that need to happen and trigger an image update. We might be able to leverage [podman play kube](https://docs.podman.io/en/v4.2/markdown/podman-play-kube.1.html) to map the static pod definition to a container, but we will need to find a way to get CEO to render what would usually be the contents of the static pod config to somewhere pacemaker can see updates and respond to them. + Things like certificate rotations and image updates will necessitate updates to the pacemaker-controlled etcd container. We will need to introduce some kind of mechanism + where CEO can describe the changes that need to happen and trigger an image update. We might be able to leverage + [podman play kube](https://docs.podman.io/en/v4.2/markdown/podman-play-kube.1.html) to map the static pod definition to a container, but we will need to find + a way to get CEO to render what would usually be the contents of the static pod config to somewhere pacemaker can see updates and respond to them. ## Test Plan @@ -530,33 +617,39 @@ Satisfying this demand would come with significant technical and support overhea ### CI The initial release of TNF should aim to build a regression baseline. -| Type | Name | Description | -| ----- | ----------------------------- | --------------------------------------------------------------------------- | -| Job | End-to-End tests (e2e) | The standard test suite (openshift/conformance/parallel) for establishing a regression baseline between payloads. | -| Job | Upgrade between z-streams | The standard test suite for evaluating upgrade behavior between payloads. | -| Job | Upgrade between y-streams [^1] | The standard test suite for evaluating upgrade behavior between payloads. | -| Suite | TNF Recovery | This is a new suite consisting of the tests listed below. | -| Test | Node failure [^2] | A new TNF test to detect if the cluster recovers if a node crashes. | -| Test | Network failure [^2] | A new TNF test to detect if the cluster recovers if the network is disrupted such that a node is unavailable. | -| Test | Kubelet failure [^2] | A new TNF test to detect if the cluster recovers if kubelet fails. 
| -| Test | Failure in etcd [^2] | A new TNF test to detect if the cluster recovers if etcd fails. | - -[^1]: This will be added after the initial release when more than one minor version of OpenShift is compatible with the -topology. -[^2]: These tests will be designed to make a component on the *other* node fail. This should prevent the test pod from -being restarted mid-test. +| Type | Name | Description | +| ----- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------- | +| Job | End-to-End tests (e2e) | The standard test suite (openshift/conformance/parallel) for establishing a regression baseline between payloads. | +| Job | Upgrade between z-streams | The standard test suite for evaluating upgrade behavior between payloads. | +| Job | Upgrade between y-streams [^1] | The standard test suite for evaluating upgrade behavior between payloads. | +| Job | Serial tests | The standard test suite (openshift/conformance/serial) for establishing a regression baseline between payloads. | +| Job | TechPreview | The standard test suite (openshift/conformance/parallel) run with TechPreview features enabled. | +| Suite | TNF Recovery | This is a new suite consisting of the tests listed below. | +| Test | Double Node failure | A new TNF test to detect if the cluster recovers if both nodes crash and are manually reset | +| Test | Cold boot | A new TNF test to detect if the cluster recovers if both nodes are stopped gracefully and then restarted together | +| Test | Node restart [^2] | A new TNF test to detect if the cluster recovers if a node is gracefully restarted. | +| Test | Node failure [^2] | A new TNF test to detect if the cluster recovers if a node crashes. | +| Test | Network failure [^2] | A new TNF test to detect if the cluster recovers if the network is disrupted such that a node is unavailable. | +| Test | Kubelet failure [^2] | A new TNF test to detect if the cluster recovers if kubelet fails. | +| Test | Failure in etcd [^2] | A new TNF test to detect if the cluster recovers if etcd fails. | +| Test | Valid PDBs | A new TNF test to verify that PDBs are set to the correct configuration | +| Test | Conformant recovery | A new TNF test to verify recovery times for failure events are within the creteria defined in the requirements | +| Test | Fencing health check | A new TNF test to verify fencing health check process successful | +| Test | Replacing a control-plane node | A new TNF test to verify that you can replace a control-plane node in a 2-node cluster | + +[^1]: This will be added after the initial release when more than one minor version of OpenShift is compatible with the topology. +[^2]: These tests will be designed to make a component a randomly selected node fail. ### QE This section outlines test scenarios for TNF. -| Scenario | Description | -| ----------------------------- | ----------------------------------------------------------------------------------- | +| Scenario | Description | +| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------- | | Payload install | A basic evaluation that the cluster installs on supported hardware. Should be run for each supported installation method. | -| Payload upgrade | A basic evaluation that the cluster can upgrade between releases. 
| -| Performance | Performance metrics are gathered and compared to SNO and Compact HA | -| Scalability | Scalability metrics are gathered and compared to SNO and Compact HA | -| Cold Boot | Verify that clusters can survive a cold boot event. | -| Both nodes crash | Verify that clusters can survive an event where both nodes become unavailable. | +| Payload upgrade | A basic evaluation that the cluster can upgrade between releases. | +| Performance | Performance metrics are gathered and compared to SNO and Compact HA | +| Scalability | Scalability metrics are gathered and compared to SNO and Compact HA | +| TNF Recovery Suite | Ensures all of the 2-node specific behaviors are working as expected | As noted above, there is an open question about how layered products should be treated in the test plan. Additionally, it would be good to have workload-specific testing once those are defined by the workload proposal. @@ -567,23 +660,26 @@ Additionally, it would be good to have workload-specific testing once those are ### Dev Preview -> Tech Preview -- Ability to install a two-node cluster using assisted installer (via ACM) and agent-based installer +- Ability to install a two-node cluster using the core installer via a bootstrap node +- Ability to install a two-node cluster using assisted installer (via ACM) - End user documentation, relative API stability -- Sufficient test coverage (see test plan above) +- CI that shows verifies that nightlies can be installed and pass most conformance tests +- QE verification that the 2-node clusters can handle graceful and ungraceful failures ### Tech Preview -> GA +- Telco input is gathered, documented, and addressed based on feedback +- OLM allows for operator owners to opt-in or out of 2-node topology support +- Ability to install a 2-node cluster using the agent-based installer - Working upgrades -- Upgrade tests -- Available by default +- Available by default (no longer behind feature gate) - Documentation for replacing a failed control-plane node - Documentation for post-installation fencing validation -- Performance testing +- Full test coverage (see CI and QE plans above) +- Backhaul SLI telemetry to track TNF installs +- Document SLOs for the topology - User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/) -**For non-optional features moving to GA, the graduation criteria must include -end to end tests.** - ### Removing a deprecated feature - Announce deprecation and support policy of the existing feature @@ -595,12 +691,19 @@ This topology has the same expectations for upgrades as the other variants of Op For tech preview, upgrades will only be achieved by redeploying the machine and its workload. However, fully automated upgrades are a requirement for graduating to GA. +One key detail about upgrades is that they will only be allowed to proceed when both nodes are healthy. +The main challenge with upgrading a 2-node cluster is ensuring the cluster stays functional and consistent +through the reboots of the upgrade. This can be achieved by setting the `maxUnavailable` machines in the +control-plane MachineConfigPool to 1. + Downgrades are not supported outside of redeployment. ## Version Skew Strategy -Most components introduced in this enhancement are external to the cluster itself. The main challenge with upgrading -is ensuring the cluster stays functional and consistent through the reboots of the upgrade. 
This +The biggest concern with version skew would be incompatibilies between a new version of pacemaker and the currently running resource agents. +Upgrades will not atomically replace both the RPM and the resource agent configuration, not are there any guarantees that both nodes will be running +the same versions. It's difficult to image a case where such an incompatibily wouldn't be caught during upgrade testing. This will be something to keep +a close eye on when evaluating upgrade jobs for potential race conditions. ## Operational Aspects of API Extensions @@ -613,7 +716,7 @@ is ensuring the cluster stays functional and consistent through the reboots of t API availability) Toggling CEO control values with result in etcd being briefly offline. The transition is almost immediate, though, since the resource agent is watching for the - etcd pod to disappear so it can start its replacement. + etcd container to disappear so it can start its replacement. The other potential impact is around reboots. There may be a small performance impact when the nodes reboot since they have to leave the etcd cluster and resync etcd to join. From 668f125b1f9269d88a1871751074f7bbd99cda3d Mon Sep 17 00:00:00 2001 From: Michael Shitrit Date: Tue, 4 Feb 2025 13:41:50 +0200 Subject: [PATCH 41/49] Managing Fencing credentials Signed-off-by: Michael Shitrit --- enhancements/two-node-fencing/tnf.md | 53 +++++++++++----------------- 1 file changed, 20 insertions(+), 33 deletions(-) diff --git a/enhancements/two-node-fencing/tnf.md b/enhancements/two-node-fencing/tnf.md index 37abadda46..fbd111c664 100644 --- a/enhancements/two-node-fencing/tnf.md +++ b/enhancements/two-node-fencing/tnf.md @@ -193,9 +193,19 @@ See the API Extensions section below for sample install-configs. For a two-node cluster to be successful, we need to ensure the following: 1. The BMC secrets for RHEL-HA are created on disk during bootstrapping by the OpenShift installer via a MachineConfig. -2. When pacemaker is initialized by the in-cluster entity responsible for starting pacemaker, pacemaker will try to set up fencing with this secret. If this is not successful, it throws an error and the installation fails. -3. Pacemaker periodically checks that the fencing agent is healthy (i.e. can connect to the BMC) and throws a warning if it cannot access the BMC. There is an open question on what the user experience should be to raise this error to the user. -4. The cluster will continue to run normally in the state where the BMC cannot be accessed, but ignoring this warning will mean that pacemaker can only provide a best-effort recovery - so operations that require fencing will need manual recovery. +2. When pacemaker is initialized by the in-cluster entity responsible for starting pacemaker, pacemaker will try to set up fencing with this secret. If this is not successful, it throws an error which will cause degradation of the in cluster operator and would fail the installation process. +3. Pacemaker periodically checks that the fencing agent is healthy (i.e. can connect to the BMC) and will create an alert if it cannot access the BMC. + * In this case, in order to allow a simple manual recovery by the user a script will be deployed on the node which will reset Pacemaker with the new fencing credentials. +4. The cluster will continue to run normally in the state where the BMC cannot be accessed, but ignoring this alert will mean that pacemaker can only provide a best-effort recovery - so operations that require fencing will need manual recovery. 
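To make the first step above more concrete, the sketch below shows one possible shape for the MachineConfig that lays the BMC secrets down on the control-plane disks. It is illustrative only: the resource name, file path, and payload encoding are assumptions of this sketch rather than commitments of the enhancement.
```yaml
# Illustrative sketch - the actual name, path, and payload format used by the
# installer for the RHEL-HA BMC secrets are not fixed by this enhancement.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-fencing-credentials        # assumed name
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/fencing/credentials.json   # assumed location on the control-plane disks
          mode: 384                              # 0600, readable only by root
          contents:
            # base64-encoded Redfish address, username, and password for each node
            source: data:text/plain;charset=utf-8;base64,<base64-encoded-credentials>
```
Pacemaker initialization (step 2) would read this file when configuring its fence agents, and the periodic health check described in step 3 maps naturally onto the recurring monitor operation pacemaker already performs against configured fence devices.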
+ +Future Enhancements +1. Allowing usage of external credentials storage services such as Vault or Conjur. In order to support this: + * Expose the remote access credentials to the in-cluster operator + * We will need an indication for using that particular mode + * Introduce another operator (such as Secrets Store CSI driver) to consume the remote credentials + * Make sure that the relevant operator for managing remote credentials is part of the setup (potentially by using the Multi-Operator-Manager operator) + 2. Allowing refresh of the fencing credentials during runtime. One way to do so would be for the in-cluster operator to watch for a credential change made by the user, and update the credentials stored in Pacemaker upon such a change. + #### Day 2 Procedures @@ -242,6 +252,12 @@ The plan is for this to be changed by one of the nodes during pacemaker initiali While set to `External`, CEO will still need to render the configuration for the etcd container in a place where it can be consumed by pacemaker. This ensures that the etcd instance managed by pacemaker can be updated accordingly in the case of a upgrade event or whenever certificates are rotated. +In case CEO will be used as the in-cluster operator responsible for setting up Pacemaker fencing it'll require root permissions which are currently mandatory to run the required pcs commands. +Some mitigation or alternatives might be: +- Use a different (new) in-cluster operator to set up Pacemaker fencing +- Worth noting that the HA team suggests that there is a plan to adjust the pcs to a full client server architecture which will allow the pcs commands to run without root privileges. A partial faster solution may be provided by the HA team for a specific set of commands used by the pcs client. + + #### Install Config Changes In order to initialize pacemaker with valid fencing credentials, they will be consumed by the installer via the installation config and created on the cluster as a cluster secret. @@ -269,36 +285,7 @@ pullSecret: '' sshKey: '' ``` -For platform baremetal, a valid configuration is quite similar. -``` -apiVersion: v1 -baseDomain: example.com -compute: -- name: worker - replicas: 0 -controlPlane: - name: master - replicas: 2 -metadata: - name: -platform: - baremetal: - fencingCredentials: - bmc: - address: ipmi:// - username: - password: - apiVIPs: - - - ingressVIPs: - - -pullSecret: '' -sshKey: '' -``` - -Unfortunately, Baremetal Operator already has a place to specify bmc credentials. However, providing credentials like this will result in conflicts as both the -Baremetal Operator and the pacemaker fencing agent will have control over the machine state. In short, this example shows an invalid configuration that we must check for -in the installer. +Since Baremetal Operator already has a place to specify bmc credentials there is no need to add them for this setup. 
``` apiVersion: v1 baseDomain: example.com From eb0bcb9b68798e2716ad46c953573a3d165a57df Mon Sep 17 00:00:00 2001 From: Michael Shitrit <76515081+mshitrit@users.noreply.github.com> Date: Wed, 5 Feb 2025 10:28:58 +0200 Subject: [PATCH 42/49] Update enhancements/two-node-fencing/tnf.md Co-authored-by: Douglas Hensel --- enhancements/two-node-fencing/tnf.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/two-node-fencing/tnf.md b/enhancements/two-node-fencing/tnf.md index 37abadda46..76161076fe 100644 --- a/enhancements/two-node-fencing/tnf.md +++ b/enhancements/two-node-fencing/tnf.md @@ -702,7 +702,7 @@ Downgrades are not supported outside of redeployment. The biggest concern with version skew would be incompatibilies between a new version of pacemaker and the currently running resource agents. Upgrades will not atomically replace both the RPM and the resource agent configuration, not are there any guarantees that both nodes will be running -the same versions. It's difficult to image a case where such an incompatibily wouldn't be caught during upgrade testing. This will be something to keep +the same versions. It's difficult to imagine a case where such an incompatibility wouldn't be caught during upgrade testing. This will be something to keep a close eye on when evaluating upgrade jobs for potential race conditions. ## Operational Aspects of API Extensions From cf863b03ad6de4d123dd163c77369afedf25d7cf Mon Sep 17 00:00:00 2001 From: Michael Shitrit <76515081+mshitrit@users.noreply.github.com> Date: Wed, 5 Feb 2025 10:33:02 +0200 Subject: [PATCH 43/49] Update enhancements/two-node-fencing/tnf.md Co-authored-by: Douglas Hensel --- enhancements/two-node-fencing/tnf.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/two-node-fencing/tnf.md b/enhancements/two-node-fencing/tnf.md index 76161076fe..cdf7a85cdf 100644 --- a/enhancements/two-node-fencing/tnf.md +++ b/enhancements/two-node-fencing/tnf.md @@ -700,7 +700,7 @@ Downgrades are not supported outside of redeployment. ## Version Skew Strategy -The biggest concern with version skew would be incompatibilies between a new version of pacemaker and the currently running resource agents. +The biggest concern with version skew would be incompatibilities between a new version of pacemaker and the currently running resource agents. Upgrades will not atomically replace both the RPM and the resource agent configuration, not are there any guarantees that both nodes will be running the same versions. It's difficult to imagine a case where such an incompatibility wouldn't be caught during upgrade testing. This will be something to keep a close eye on when evaluating upgrade jobs for potential race conditions. From 7729c9a50ca43711aa79821a379fb252d36f981d Mon Sep 17 00:00:00 2001 From: Michael Shitrit Date: Tue, 11 Feb 2025 11:41:51 +0200 Subject: [PATCH 44/49] Implementing discussions and review feedback. Signed-off-by: Michael Shitrit --- enhancements/two-node-fencing/tnf.md | 37 +++++++++++++++++++++++++--- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/enhancements/two-node-fencing/tnf.md b/enhancements/two-node-fencing/tnf.md index fbd111c664..e050b43570 100644 --- a/enhancements/two-node-fencing/tnf.md +++ b/enhancements/two-node-fencing/tnf.md @@ -195,8 +195,9 @@ For a two-node cluster to be successful, we need to ensure the following: 1. The BMC secrets for RHEL-HA are created on disk during bootstrapping by the OpenShift installer via a MachineConfig. 2. 
When pacemaker is initialized by the in-cluster entity responsible for starting pacemaker, pacemaker will try to set up fencing with this secret. If this is not successful, it throws an error which will cause degradation of the in cluster operator and would fail the installation process. 3. Pacemaker periodically checks that the fencing agent is healthy (i.e. can connect to the BMC) and will create an alert if it cannot access the BMC. - * In this case, in order to allow a simple manual recovery by the user a script will be deployed on the node which will reset Pacemaker with the new fencing credentials. + * In this case, in order to allow a simple manual recovery by the user a script will be available on the node which will reset Pacemaker with the new fencing credentials. 4. The cluster will continue to run normally in the state where the BMC cannot be accessed, but ignoring this alert will mean that pacemaker can only provide a best-effort recovery - so operations that require fencing will need manual recovery. +5. When manual recovery is triggered by running the designated script on the node, it'll also update the Secret in order to make sure the Secret is aligned with the credentials kept in Pacemaker's cib file. Future Enhancements 1. Allowing usage of external credentials storage services such as Vault or Conjur. In order to support this: @@ -254,7 +255,8 @@ accordingly in the case of a upgrade event or whenever certificates are rotated. In case CEO will be used as the in-cluster operator responsible for setting up Pacemaker fencing it'll require root permissions which are currently mandatory to run the required pcs commands. Some mitigation or alternatives might be: -- Use a different (new) in-cluster operator to set up Pacemaker fencing +- Use a different (new) in-cluster operator to set up Pacemaker fencing + - However, this approach contradicts the goal of reducing the OCP release payload, as introducing a new core operator would increase its size instead of streamlining it. Additionally, adding the operator would require more effort (release payload changes, CI setup, etc.) compared to integrating an operand into CEO. It would also still require root access, potentially raising similar concerns as using CEO, just with a different audience. - Worth noting that the HA team suggests that there is a plan to adjust the pcs to a full client server architecture which will allow the pcs commands to run without root privileges. A partial faster solution may be provided by the HA team for a specific set of commands used by the pcs client. @@ -285,7 +287,36 @@ pullSecret: '' sshKey: '' ``` -Since Baremetal Operator already has a place to specify bmc credentials there is no need to add them for this setup. +For platform baremetal, a valid configuration is quite similar. +``` +apiVersion: v1 +baseDomain: example.com +compute: +- name: worker + replicas: 0 +controlPlane: + name: master + replicas: 2 +metadata: + name: +platform: + baremetal: + fencingCredentials: + bmc: + address: ipmi:// + username: + password: + apiVIPs: + - + ingressVIPs: + - +pullSecret: '' +sshKey: '' +``` + +Unfortunately, Baremetal Operator already has a place to specify bmc credentials. However, providing credentials like this will result in conflicts as both the +Baremetal Operator and the pacemaker fencing agent will have control over the machine state. In short, this example shows an invalid configuration that we must check for +in the installer. 
``` apiVersion: v1 baseDomain: example.com From 56a8d1f8f7b73b8e683be50373c427a88e0e5770 Mon Sep 17 00:00:00 2001 From: Jeremy Poulin Date: Wed, 22 Jan 2025 17:56:31 -0500 Subject: [PATCH 45/49] OCPEDGE-1458: [TNF] Addressing post-architecture review feedback Updates include: - Clarifications about the fencing network needing to be separate - Calling out unsuitable work loads (e.g. safety-critical) - Explaining why upgrades only work when both nodes are healthy - Comparing TNF to active-passive SNO - Added test for verifying certification rotation with an unhealthy node - Updated baremetal usage to match https://github.com/metal3-io/metal3-docs/blob/master/design/bare-metal-style-guide.md --- enhancements/two-node-fencing/tnf.md | 153 ++++++++++++++++++--------- 1 file changed, 104 insertions(+), 49 deletions(-) diff --git a/enhancements/two-node-fencing/tnf.md b/enhancements/two-node-fencing/tnf.md index 58ff02a502..805abe898e 100644 --- a/enhancements/two-node-fencing/tnf.md +++ b/enhancements/two-node-fencing/tnf.md @@ -30,7 +30,7 @@ tracking-link: - https://issues.redhat.com/browse/OCPSTRAT-1514 --- -# Two Node Fencing (TNF) +# Two Node OpenShift with Fencing (TNF) ## Terms @@ -50,30 +50,38 @@ tracking-link: **MCO** - Machine Config Operator. This operator manages updates to the node's systemd, cri-o/kubelet, kernel, NetworkManager, etc., and can write custom files to it, configurable by MachineConfig custom resources. -**ABI** - Agent-Based Installer. A installation path through the core-installer that leverages the assisted service to facilate baremetal installations. Can be used in disconnected environments. +**ABI** - Agent-Based Installer. A installation path through the core-installer that leverages the assisted service to facilate bare-metal installations. Can be used in disconnected environments. -**BMO** - Baremetal Operator. An optional operator whose primary function is to provide the ability to scale clusters in baremetal deployment environments. +**BMO** - Baremetal Operator. An optional operator whose primary function is to provide the ability to scale clusters in bare-metal deployment environments. **CEO** - cluster-etcd-operator. The OpenShift operator responsible for deploying and maintaining healthy etcd instances for the cluster. -**BMC** - Baseboard Management Console. Used to manage baremetal machines. Can modify firmware settings and machine power state. +**BMC** - Baseboard Management Console. Used to manage bare-metal machines. Can modify firmware settings and machine power state. ## Summary -Leverage traditional high-availability concepts and technologies to provide a container management solution that has a minimal footprint but remains resilient to single node-level failures suitable for customers with numerous geographically dispersed locations. +Leverage traditional high-availability concepts and technologies to provide a container management solution that has a minimal footprint but remains resilient to single +node-level failures suitable for customers with numerous geographically dispersed locations. ## Motivation -Customers with hundreds, or even tens of thousands, of geographically dispersed locations are asking for a container management solution that retains some level of resilience to node-level failures but does not come with a traditional three-node footprint and/or price tag. 
+Customers with hundreds, or even tens of thousands, of geographically dispersed locations are asking for a container management solution that retains some level of resilience +to node-level failures but does not come with a traditional three-node footprint and/or price tag. -The need for some level of fault tolerance prevents the applicability of Single Node OpenShift (SNO), and a converged 3-node cluster is cost prohibitive at the scale of retail and telcos - even when the third node is a "cheap" one that doesn't run workloads. +The need for some level of fault tolerance prevents the applicability of Single Node OpenShift (SNO), and a converged 3-node cluster is cost prohibitive at the scale of +retail and telcos - even when the third node is a "cheap" one that doesn't run workloads. + +While the degree of resiliency achievable with two nodes is not suitable for safety-critical workloads like emergency services, this proposal aims to deliver a solution +for workloads that can trade off some amount of determinism and reliability in exchange for cost-effective deployments at scale that fully utilize the capacity of both nodes +while minimizing the time to recovery for node-level failures of either node. The benefits of the cloud-native approach to developing and deploying applications are increasingly being adopted in edge computing. -This requires our solution to provide a management experience consistent with "normal" OpenShift deployments and be compatible with the full ecosystem of Red Hat and partner workloads designed for OpenShift. +This requires our solution to provide a management experience consistent with "normal" OpenShift deployments and be compatible with the full ecosystem of Red Hat and partner +workloads designed for OpenShift. ### User Stories -* As a solutions architect for a large enterprise with multiple remote sites, I want a cost-effective OpenShift cluster solution so that I can manage containers without the overhead of a third node. +* As a solutions architect for a large enterprise with multiple remote sites, I want a cost-effective OpenShift cluster solution so that I can manage applications without incurring the cost of a third node at scale. * As a solutions architect for a large enterprise running workloads on a minimal OpenShift footprint, I want to leverage the full capacity of both control-plane nodes to run my workloads. * As a solutions architect for a large enterprise running workloads on a minimal OpenShift footprint, I want to minimize time-to-recovery and data loss for my workloads when a node fails. * As an OpenShift cluster administrator, I want a safe and automated method for handling the failure of a single node so that the downtime of the control-plane is minimized and the cluster fully recovers. 
@@ -93,6 +101,7 @@ This requires our solution to provide a management experience consistent with "n ### Non-Goals * Achieving the same level of resilience or guarantees provided by a cluster with 3 control-plane nodes or 2 nodes with an arbiter +* Achieving the same level of deterministic failure modes as can be provided by setting up two Single-Node OpenShift instances as an active-passive pair * Workload resilience - see related [Pre-DRAFT enhancement](https://docs.google.com/document/d/1TDU_4I4LP6Z9_HugeC-kaQ297YvqVJQhBs06lRIC9m8/edit) * Resilient storage - see future enhancement * Support for platforms other than `platform: "None"` and `platform: "Baremetal"` @@ -100,6 +109,7 @@ This requires our solution to provide a management experience consistent with "n * Support disconnected cluster installation * Adding worker nodes * Supporting upgrade/downgrade paths between 2-node and other topologies (e.g. single-node, 3-node, 2-node with arbiter) (highly requested future extension) +* Cluster upgrades when only a single node is available * Creation of RHEL-HA events and metrics for consumption by the OpenShift monitoring stack (future extension) * Support for IBI (image-based install) and IBU (image-based upgrade) @@ -142,8 +152,8 @@ At a glance, here are the components we are proposing to change: | [Authentication Operator](#authentication-operator-changes) | Update operator to accept minimum 1 kube api servers when `ControlPlaneTopology` is `DualReplica` | | [Hosted Control Plane](#hosted-control-plane-changes) | Disallow HyperShift from installing on the `DualReplica` topology | | [OLM Filtering](#olm-filtering-changes) | Leverage support for OLM to filter operators based off of control plane topology | -| [Assisted Installer Family](#assisted-installer-family-changes) | Add support for deploying baremetal clusters with 2 control-plane nodes using Assisted Installer and Agent-Based Installer | -| [Baremetal Operator](#baremetal-operator-changes) | Prevent power-management of control-plane nodes when the infrastructureTopology is set to `DualReplica` | +| [Assisted Installer Family](#assisted-installer-family-changes) | Add support for deploying bare-metal clusters with 2 control-plane nodes using Assisted Installer and Agent-Based Installer | +| [Bare Metal Operator](#bare-metal-operator-changes) | Prevent power-management of control-plane nodes when the infrastructureTopology is set to `DualReplica` | | [Node Health Check Operator](#node-health-check-operator-changes) | Prevent fencing of control-plane nodes when the infrastructureTopology is set to `DualReplica` | @@ -152,7 +162,7 @@ At a glance, here are the components we are proposing to change: #### Cluster Creation User creation of a two-node control-plane is possible via the Assisted Installer and the Agent-Based Installer (ABI). The initial implementation will focus on providing support for the Assisted Installer in managed cluster environments (i.e. ACM), followed by stand-alone cluster support via the Agent-Based Installer. -The requirement that the cluster can be deployed using only 2 nodes is key because requiring a third baremetal server for installation can be expensive when deploying baremetal at scale. To accomplish this, deployments will use one of the target machines as the bootstrap node before it is rebooted into a control-plane node. 
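As a rough sketch of how this looks in the Agent-Based Installer flow, the rendezvous host named in `agent-config.yaml` is the machine that temporarily hosts the bootstrap services and is later rebooted to join as the second control-plane node. All hostnames, MAC addresses, and the IP below are placeholders, and the fields a production deployment would also need (networking details, root device hints, etc.) are omitted.
```yaml
# Illustrative agent-config.yaml for a two-node control plane; values are placeholders.
apiVersion: v1beta1
kind: AgentConfig
metadata:
  name: two-node-cluster
# The rendezvous host doubles as the temporary bootstrap machine before it
# reboots and rejoins the cluster as the second control-plane node.
rendezvousIP: 192.0.2.10
hosts:
  - hostname: control-plane-0
    role: master
    interfaces:
      - name: eno1
        macAddress: 00:00:5e:00:53:10
  - hostname: control-plane-1
    role: master
    interfaces:
      - name: eno1
        macAddress: 00:00:5e:00:53:11
```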
+The requirement that the cluster can be deployed using only 2 nodes is key because requiring a third bare-metal server for installation can be expensive when deploying bare metal at scale. To accomplish this, deployments will use one of the target machines as the bootstrap node before it is rebooted into a control-plane node. A critical transition during bootstrapping is when the bootstrap reboots into the control-plane node. Before this reboot, it needs to be removed from the etcd cluster so that quorum can be maintained as the machine reboots into a second control-plane. @@ -161,7 +171,7 @@ Otherwise, the procedure follows the standard flow except for the configuration To constrain the scope of support, we've targeted Assisted Installer (in ACM) and Agent-Based Installer (ABI) as our supported installation paths. Support for other installation paths may be reevaluated as business requirements change. For example, it is technically possible to install a cluster with two control-plane nodes via `openshift-install` using an auxiliary bootstrap node but we don't intend to support this for customers unless this becomes a business requirement. Similarly, ZTP may be evaluated as a future offering for clusters -deployed by ACM environments via Multi-Cluster Engine (MCE), Assisted Installer, and Baremetal Operator. +deployed by ACM environments via Multi-Cluster Engine (MCE), Assisted Installer, and Bare Metal Operator. Because BMC passwords are being collected to initialize fencing, the Assisted Installer SaaS offering will not be available (to avoid storing customer BMC credentials in a Red Hat database). @@ -184,10 +194,10 @@ If for some reason, the etcd containers cannot be started, then the installation ###### Configuring Fencing Via MCO Fencing setup is the last important aspect of the cluster installation. For the cluster installation to be successful, fencing should be configured and active before we declare the installation successful. To do this, baseboard management console (BMC) credentials need to be made available to the control-plane nodes as part of pacemaker initialization. To ensure rapid fencing using pacemaker, we will collect RedFish details (address, username, and **password**) for each node via the install-config (see proposed install-config changes). -This will take a format similar to that of the [Baremetal Operator](https://docs.openshift.com/container-platform/4.17/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#bmc-addressing_ipi-install-installation-workflow). +This will take a format similar to that of the [Bare Metal Operator](https://docs.openshift.com/container-platform/4.17/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#bmc-addressing_ipi-install-installation-workflow). We will create a new MachineConfig that writes BMC credentials to the control-plane disks. This will resemble the BMC specification used by the [BareMetalHost](https://docs.openshift.com/container-platform/4.17/rest_api/provisioning_apis/baremetalhost-metal3-io-v1alpha1.html#spec-bmc) CRD. -BMC information can be used to change the power state of a baremetal machine, so it's critically important that we ensure that pacemaker is the **only entity** responsible for these operations to prevent conflicting requests to change the machine state. 
This means that we need to ensure that there are installer validations and validations in the Baremetal Operator (BMO) to prevent control-plane nodes from having power management enabled in a two-node topology. Additionally, optional operators like Node Health Check, Self Node Remediation, and Fence Agents Remediation must have the same considerations but these are not present during installation. +BMC information can be used to change the power state of a bare-metal machine, so it's critically important that we ensure that pacemaker is the **only entity** responsible for these operations to prevent conflicting requests to change the machine state. This means that we need to ensure that there are installer validations and validations in the Bare Metal Operator (BMO) to prevent control-plane nodes from having power management enabled in a two-node topology. Additionally, optional operators like Node Health Check, Self Node Remediation, and Fence Agents Remediation must have the same considerations but these are not present during installation. See the API Extensions section below for sample install-configs. @@ -210,8 +220,9 @@ Future Enhancements #### Day 2 Procedures -As per a standard 3-node control-plane, OpenShift upgrades and `MachineConfig` changes can not be applied when the cluster is in a degraded state. -Such operations will only proceed when both peers are online and healthy. +As per a standard 3-node control-plane, OpenShift upgrades and `MachineConfig` changes that would trigger a node reboot cannot proceed if the aforementioned reboot +would go over the `maxUnavailable` allowance specified in the machine config pool, which defaults to 1. For a two-node control plane, `maxUnavailable` should only ever +be set to 1 to help ensure that these events only proceed when both peers are online and healthy, or 0 in the case where the administrator wishes to temporarily disable these events. The experience of managing a two-node control-plane should be largely indistinguishable from that of a 3-node one. The primary exception is (re)booting one of the peers while the other is offline and expected to remain so. @@ -227,7 +238,8 @@ The full list of changes proposed is [summarized above](#summary-of-changes). Ea #### Feature Gate Changes -We will define a new `DualReplicaTopology` feature that can be enabled in `install-config.yaml` to ensure the clusters running this feature cannot be upgraded. +We will define a new `DualReplicaTopology` feature that can be enabled in `install-config.yaml` to ensure the clusters running this feature cannot be upgraded until +the feature is ready for general availability. #### Infrastructure API Changes @@ -287,7 +299,7 @@ pullSecret: '' sshKey: '' ``` -For platform baremetal, a valid configuration is quite similar. +For `platform: baremetal`, a valid configuration is quite similar. ``` apiVersion: v1 baseDomain: example.com @@ -314,8 +326,8 @@ pullSecret: '' sshKey: '' ``` -Unfortunately, Baremetal Operator already has a place to specify bmc credentials. However, providing credentials like this will result in conflicts as both the -Baremetal Operator and the pacemaker fencing agent will have control over the machine state. In short, this example shows an invalid configuration that we must check for +Unfortunately, Bare Metal Operator already has a place to specify bmc credentials. However, providing credentials like this will result in conflicts as both the +Bare Metal Operator and the pacemaker fencing agent will have control over the machine state. 
In short, this example shows an invalid configuration that we must check for in the installer.
```
apiVersion: v1
@@ -367,7 +379,7 @@ A cluster is assumed to be a 2-node cluster if the following statements are true
The number of compute nodes is also expected to be zero, but this will not be enforced.
-Additionally, we will enforce that 2-node clusters are only allowed on platform `None` or platform `Baremetal`. This is not a technical restriction, but
+Additionally, we will enforce that 2-node clusters are only allowed on platform `none` or platform `baremetal`. This is not a technical restriction, but
rather one that stems from a desire to limit the number of supportable configurations. If use cases emerge, cloud support for this topology may be considered in the future.
@@ -398,8 +410,8 @@ Agent-Based installers. The core of this change is an update to the assisted-ser
included in this, since we do not wish to store customer fencing credentials in a Red Hat database. In other words, installing using the Assisted Installer will only be supported using MCE.
-#### BareMetal Operator Changes
-Because paceamaker is the only entity allowed to take fencing actions on the control-plane nodes, the baremetal operator will need to be updated to ensure that `BareMetalHost` entries cannot be added for the control-plane nodes.
+#### Bare Metal Operator Changes
+Because pacemaker is the only entity allowed to take fencing actions on the control-plane nodes, the Bare Metal Operator will need to be updated to ensure that `BareMetalHost` entries cannot be added for the control-plane nodes.
This will prevent power management operations from being triggered outside of the purview of pacemaker.
#### Node Health Check Operator Changes
@@ -444,11 +456,32 @@ Tools for extracting support information (must-gather tarballs) will be updated
As part of the fencing setup, the cri-o and kubelet services will still be owned by systemd when running under pacemaker. The main difference is that the resource agent will be responsible for signaling systemd to change their active states. The etcd containers are different in this respect since they will be restarted using Podman, but this will be running as root, as it was under CEO.
-#### Platform None vs. Baremetal
+#### The Fencing Network
+The [goals](#goals) section of this proposal highlights the intent of minimizing the scenarios that require manual administrator intervention and maximizing the cluster's
+similarity to the experience of running a 3-node hyperconverged cluster - especially in regard to auto-recovering from the failure of either node. The key to delivering on these goals is fencing.
+
+In order to be able to leverage fencing, two conditions must be met:
+1. The fencing network must be available
+2. Fencing must be properly configured
+
+Because of the criticality of the fencing network being available, the fencing network should be isolated from the main network used by the cluster. This ensures that a
+recovery operation can still proceed over the fencing network even when the main network between the nodes is disrupted.
+
+Addressing the matter of ensuring that fencing is configured properly is more challenging. While the installer should protect from some kinds of invalid configuration
+(e.g. an incorrect BMC password), this is not a guarantee that the configuration is actually valid. For example, a user could accidentally enter the BMC credentials for
+a different server altogether, and pacemaker would be none the wiser.
The only way to ensure that fencing is configured properly is to test it. + +##### Fencing Health Check +To test that fencing is operating as intended in an installed cluster, a special fencing-verification Day-2 operation will be made available to users. At first, this would +exist as a Fencing Health Check script that is run to crash each of the nodes in turn, waiting for the cluster to recover between crash events. As a longer term objective, it +might make sense to explore if this could be integrated into the OpenShift client. It is recommended that cluster administrators run the Fencing Health Check operation +regularly for TNF clusters. A future improvement could add reminders for cluster administrators who haven't run the Fencing Health Check in the last 3 or 6 months. + +#### Platform `none` vs. `baremetal` One of the major design questions of two-node OpenShift is whether to target support for `platform: none` or `platform: baremetal`. The advantage of selecting `platform: baremetal` is that we can leverage the benefits of deploying an ingress-VIP out of the box using keepalived and haproxy. After some discussion with the metal networking team, it is expected that this might work without modifications as long as pacemaker fencing doesn't remove nodes from the node list so that both keepalived instances are always peers. Furthermore, it was noted that this might be solved more simply without keepalived at all by using the ipaddr2 resource agent for pacemaker to run the `ip addr add` and `ip addr remove` commands for the VIP. The bottom line is that it will take some engineering effort to modify the out-of-the-box in-cluster networking feature for two-node OpenShift. -Outside of potentially reusing the networking bits of `platform: baremetal`, we discussed potentially reusing its API for collecting BMC credentials for fencing. In this approach, we'd use the `platform: baremetal` BMC entries would be loaded into BareMetalHost CRDs and we'd extend BMO to initialize pacemaker instead of a new operator. After a discussion with the Baremetal Platform team, we were advised against using the Baremetal Operator as an inventory. Its purpose/scope is provisioning nodes. +Outside of potentially reusing the networking bits of `platform: baremetal`, we discussed potentially reusing its API for collecting BMC credentials for fencing. In this approach, we'd use the `platform: baremetal` BMC entries would be loaded into `BareMetalHost` CRDs and we'd extend BMO to initialize pacemaker instead of a new operator. After a discussion with the Bare Metal Platform team, we were advised against using the Bare Metal Operator as an inventory. Its purpose/scope is provisioning nodes. This means that the Baremetal Operator is not initially in scope for a two-node cluster because we don't intend to support compute nodes. However, if this requirement were to change for future business opportunities, it may still be useful to provide the user with an install-time option for deploying the Baremetal Operator. @@ -554,7 +587,7 @@ This proposal is an alternative architecture to Single-node and MicroShift, so i 1. Mitigation: The CEO will run in a mode that does manage not etcd membership 3. Risk: Other operators that perform power-management functions could conflict with pacemaker. - 1. Mitigation: Update the Baremetal and Node Health Check operators to ensure control-plane nodes can not perform power operations for the control-plane nodes in the two-node topology. + 1. 
Mitigation: Update the Bare Metal and Node Health Check operators to ensure control-plane nodes can not perform power operations for the control-plane nodes in the two-node topology. 4. Risk: Rebooting the surviving peer would require human intervention before the cluster starts, increasing downtime and creating an admin burden at remote sites 1. Mitigation: Lifecycle events, such as upgrades and applying new `MachineConfig`s, are not permitted in a single-node degraded state @@ -580,7 +613,7 @@ This proposal is an alternative architecture to Single-node and MicroShift, so i ### Drawbacks The two-node architecture represents yet another distinct install type for users to choose from, and therefore another addition to the test matrix -for baremetal installation variants. Because this topology has so many unique failure recovery paths, it also requires an in-depth new test +for bare-metal installation variants. Because this topology has so many unique failure recovery paths, it also requires an in-depth new test suite which can be used to exercise all of these failure recovery scenarios. More critically, this is the only variant of OpenShift that would recommend a regular maintenance check to ensure that failures that require @@ -635,25 +668,26 @@ Satisfying this demand would come with significant technical and support overhea ### CI The initial release of TNF should aim to build a regression baseline. -| Type | Name | Description | -| ----- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------- | -| Job | End-to-End tests (e2e) | The standard test suite (openshift/conformance/parallel) for establishing a regression baseline between payloads. | -| Job | Upgrade between z-streams | The standard test suite for evaluating upgrade behavior between payloads. | -| Job | Upgrade between y-streams [^1] | The standard test suite for evaluating upgrade behavior between payloads. | -| Job | Serial tests | The standard test suite (openshift/conformance/serial) for establishing a regression baseline between payloads. | -| Job | TechPreview | The standard test suite (openshift/conformance/parallel) run with TechPreview features enabled. | -| Suite | TNF Recovery | This is a new suite consisting of the tests listed below. | -| Test | Double Node failure | A new TNF test to detect if the cluster recovers if both nodes crash and are manually reset | -| Test | Cold boot | A new TNF test to detect if the cluster recovers if both nodes are stopped gracefully and then restarted together | -| Test | Node restart [^2] | A new TNF test to detect if the cluster recovers if a node is gracefully restarted. | -| Test | Node failure [^2] | A new TNF test to detect if the cluster recovers if a node crashes. | -| Test | Network failure [^2] | A new TNF test to detect if the cluster recovers if the network is disrupted such that a node is unavailable. | -| Test | Kubelet failure [^2] | A new TNF test to detect if the cluster recovers if kubelet fails. | -| Test | Failure in etcd [^2] | A new TNF test to detect if the cluster recovers if etcd fails. 
| -| Test | Valid PDBs | A new TNF test to verify that PDBs are set to the correct configuration | -| Test | Conformant recovery | A new TNF test to verify recovery times for failure events are within the creteria defined in the requirements | -| Test | Fencing health check | A new TNF test to verify fencing health check process successful | -| Test | Replacing a control-plane node | A new TNF test to verify that you can replace a control-plane node in a 2-node cluster | +| Type | Name | Description | +| ----- | ------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | +| Job | End-to-End tests (e2e) | The standard test suite (openshift/conformance/parallel) for establishing a regression baseline between payloads. | +| Job | Upgrade between z-streams | The standard test suite for evaluating upgrade behavior between payloads. | +| Job | Upgrade between y-streams [^1] | The standard test suite for evaluating upgrade behavior between payloads. | +| Job | Serial tests | The standard test suite (openshift/conformance/serial) for establishing a regression baseline between payloads. | +| Job | TechPreview | The standard test suite (openshift/conformance/parallel) run with TechPreview features enabled. | +| Suite | TNF Recovery | This is a new suite consisting of the tests listed below. | +| Test | Double Node failure | A new TNF test to detect if the cluster recovers if both nodes crash and are manually reset | +| Test | Cold boot | A new TNF test to detect if the cluster recovers if both nodes are stopped gracefully and then restarted together | +| Test | Node restart [^2] | A new TNF test to detect if the cluster recovers if a node is gracefully restarted. | +| Test | Node failure [^2] | A new TNF test to detect if the cluster recovers if a node crashes. | +| Test | Network failure [^2] | A new TNF test to detect if the cluster recovers if the network is disrupted such that a node is unavailable. | +| Test | Kubelet failure [^2] | A new TNF test to detect if the cluster recovers if kubelet fails. | +| Test | Failure in etcd [^2] | A new TNF test to detect if the cluster recovers if etcd fails. | +| Test | Valid PDBs | A new TNF test to verify that PDBs are set to the correct configuration | +| Test | Conformant recovery | A new TNF test to verify recovery times for failure events are within the creteria defined in the requirements | +| Test | Fencing health check | A new TNF test to verify that the [Fencing Health Check](#fencing-health-check) process is successful | +| Test | Replacing a control-plane node | A new TNF test to verify that you can replace a control-plane node in a 2-node cluster | +| Test | Certificate rotation with an unhealthy node | A new TNF test to verify certificate rotation on a cluster with an unhealthy node that rejoins after the rotation | [^1]: This will be added after the initial release when more than one minor version of OpenShift is compatible with the topology. [^2]: These tests will be designed to make a component a randomly selected node fail. @@ -769,9 +803,30 @@ a close eye on when evaluating upgrade jobs for potential race conditions. 
## Alternatives
-* MicroShift was considered as an alternative but it was ruled out because it does not support multi-node and has a very different experience than OpenShift which does not match the TNF initiative which is on getting the OpenShift experience on two nodes
+#### Active-Passive Single-Node OpenShift
+One of the first options that comes to mind when thinking about how to run OpenShift with two control plane nodes at scale without the risk of split-brain is deploying 2 Single-Node OpenShift
+instances and setting them up as an active node with a passive backup.
+
+This option requires the user to run external software like keepalived in front of the instances to ensure that only one instance is the active instance and receiving traffic at a time.
+
+The advantage of this kind of setup is that the failure behavior of each instance is reliably deterministic. A failure of the active instance passes traffic over to the passive instance, and a failure of
+the passive instance doesn't interrupt the active instance. Since both instances cannot be the active instance at the same time, there is no risk of a split-brain situation. Additionally, each Single
+Node OpenShift instance is a cluster by itself - meaning it does not need the other instance to be operational for it to proceed with upgrade or configuration operations that require a node reboot.
+
+That said, there are major tradeoffs to structuring the nodes in this fashion. Firstly, all aspects of the customer's applications must be available on both nodes. This means that the cluster
+administrator needs to actively synchronize both nodes so that either of them could become the active node at any given time. How this data is synchronized between the SNO instances is left as a
+challenge for the solutions architect. For some applications, this results in more overhead on each node since both need to be able to run fully independently. For example, a deployment that specifies
+two replicas would end up with two replicas on each SNO instance. Furthermore, stateful workloads require special configuration to synchronize data between the active and passive clusters. Finally,
+the end user cannot utilize the full capacity of both nodes at the same time.
+
+The bottom line is that this solution may be viable for simple setups with stateless workloads, but it becomes logistically much harder to deploy and maintain for workloads that need to maintain
+synchronized state between the OpenShift instances. Solving the state synchronization challenges means having to solve the same core issues TNF faces around maintaining data integrity - without
+actively benefiting from the compute power of the second server.
+
+#### MicroShift
+MicroShift was considered as an alternative but it was ruled out because it does not support multi-node and has a very different experience than OpenShift, which does not match the TNF initiative's focus on getting the OpenShift experience on two nodes.
-* 2 SNO + KCP:
+#### 2 SNO + KCP:
[KCP](https://github.com/kcp-dev/kcp/) allows you to manage multiple clusters from a single control-plane, reducing the complexity of managing each cluster independently. With kcp, you can manage the two single-node clusters, and each single-node OpenShift cluster can continue to operate independently even if the central kcp management plane becomes unavailable. The main advantage of this approach is that it doesn’t require inventing a new OpenShift flavor and we don’t need to create a new installation flow to accommodate it.
From 5d3e84abeb866ca3f34e9a1572ba02c7590e9a85 Mon Sep 17 00:00:00 2001 From: Jeremy Poulin Date: Tue, 11 Feb 2025 15:59:22 -0500 Subject: [PATCH 46/49] OCPBUGS-1468: [TNF] Added a section to address running with a failed node - Topology transitions may be discussed later. - Running TNF with a failed node is not the same as Single Node OpenShift --- enhancements/two-node-fencing/tnf.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/enhancements/two-node-fencing/tnf.md b/enhancements/two-node-fencing/tnf.md index 805abe898e..6c1fd26636 100644 --- a/enhancements/two-node-fencing/tnf.md +++ b/enhancements/two-node-fencing/tnf.md @@ -569,6 +569,20 @@ This section provides specific steps for how two-node clusters would handle inte 3. Stop failure is optionally escalated to a node failure (fencing) 4. Start failure defaults to leaving the service offline +#### Running Two Node OpenShift with Fencing with a Failed Node + +An interesting aspect of TNF is that should a node fail and remain in a failed state, the cluster recovery operation will allow the survivor to restart etcd as a cluster-of-one and resume normal +operations. In this state, we have an operation running cluster with a single control plane node. So, what is the difference between this state and running Single Node OpenShift? There are three key aspects: + +1. Operators that deploy to multiple nodes will become degraded. +2. Operations that would violate pod-disruption budgets will not work. +3. Lifecycle operations that would violate the `MaxUnavailable` setting of the control-plane [MachineConfigPool](https://docs.openshift.com/container-platform/4.17/updating/understanding_updates/understanding-openshift-update-duration.html#factors-affecting-update-duration_openshift-update-duration) cannot proceed. This includes MCO node reboots and cluster upgrades. + +In short - it is not recommended that users allow their clusters to remain in this semi-operational state longterm. It is intended help ensure that api-server and workloads are available as much as +possible, but it is not sufficient for the operation of a healthy cluster longterm. + +Because of the flexibility it can offer users to adjust their cluster capacity according to their needs, transitioning between control-plane topologies may be introduced for evaluation in a future enhancement. + #### Hypershift / Hosted Control Planes This topology is anti-synergistic with HyperShift. As the management cluster, a cost-sensitive control-plane runs counter to the the proposition of highly-scaleable hosted control-planes since your compute resources are limited. As the hosted cluster, the benefit of hypershift is that your control-planes are running as pods in the management cluster. Reducing the number of instances of control-plane nodes would trade the minimal cost of a third set of control-plane pods at the cost of having to implement fencing between your control-plane pods. From 51d12ea4b15b564878456f2642118bd7e6e382a7 Mon Sep 17 00:00:00 2001 From: Jeremy Poulin Date: Thu, 13 Feb 2025 11:23:34 -0500 Subject: [PATCH 47/49] [TNF] Added a note explaining stance on adding compute nodes --- enhancements/two-node-fencing/tnf.md | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/enhancements/two-node-fencing/tnf.md b/enhancements/two-node-fencing/tnf.md index 6c1fd26636..c23a90e891 100644 --- a/enhancements/two-node-fencing/tnf.md +++ b/enhancements/two-node-fencing/tnf.md @@ -581,7 +581,20 @@ operations. 
In this state, we have an operational cluster running with a single control plane node. So, what is the difference between this state and running Single Node OpenShift? There are three key aspects:
+
+1. Operators that deploy to multiple nodes will become degraded.
+2. Operations that would violate pod-disruption budgets will not work.
+3. Lifecycle operations that would violate the `MaxUnavailable` setting of the control-plane [MachineConfigPool](https://docs.openshift.com/container-platform/4.17/updating/understanding_updates/understanding-openshift-update-duration.html#factors-affecting-update-duration_openshift-update-duration) cannot proceed. This includes MCO node reboots and cluster upgrades.
+
+In short - it is not recommended that users allow their clusters to remain in this semi-operational state long term. It is intended to help ensure that the api-server and workloads are available as much as
+possible, but it is not sufficient for the operation of a healthy cluster long term.
+
+Because of the flexibility it can offer users to adjust their cluster capacity according to their needs, transitioning between control-plane topologies may be introduced for evaluation in a future enhancement.
+
 #### Hypershift / Hosted Control Planes

 This topology is anti-synergistic with HyperShift. As the management cluster, a cost-sensitive control-plane runs counter to the proposition of highly scalable hosted control-planes since your compute resources are limited. As the hosted cluster, the benefit of HyperShift is that your control-planes are running as pods in the management cluster. Reducing the number of instances of control-plane nodes would trade the minimal cost of a third set of control-plane pods at the cost of having to implement fencing between your control-plane pods.

From 51d12ea4b15b564878456f2642118bd7e6e382a7 Mon Sep 17 00:00:00 2001 From: Jeremy Poulin Date: Thu, 13 Feb 2025 11:23:34 -0500 Subject: [PATCH 47/49] [TNF] Added a note explaining stance on adding compute nodes --- enhancements/two-node-fencing/tnf.md | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/enhancements/two-node-fencing/tnf.md b/enhancements/two-node-fencing/tnf.md index 6c1fd26636..c23a90e891 100644 --- a/enhancements/two-node-fencing/tnf.md +++ b/enhancements/two-node-fencing/tnf.md @@ -581,7 +581,20 @@ operations.

In short - it is not recommended that users allow their clusters to remain in this semi-operational state long term. It is intended to help ensure that the api-server and workloads are available as much as possible, but it is not sufficient for the operation of a healthy cluster long term.

-Because of the flexibility it can offer users to adjust their cluster capacity according to their needs, transitioning between control-plane topologies may be introduced for evaluation in a future enhancement.
+Because of the flexibility it can offer users to adjust their cluster capacity according to their needs, transitioning between control-plane topologies may be introduced for evaluation in a future
+enhancement.
+
+#### Notes About Installing with or Adding Compute Nodes
+
+As currently envisioned, it doesn't make sense for a customer to install a new TNF cluster with compute nodes. If the cluster has compute capacity for 3 nodes, it should always be installed as a
+highly available (3 control-plane) cluster. This becomes more complicated for customers that discover after installing a fleet of TNF clusters that they need to add compute capacity to meet a rise in
+demand. The preferred solution for this would be to support some kind of topology transition to a highly available compact (3-node) cluster. This is not currently supported, and may never be in
+OpenShift.
+
+The alternative that we expect to be asked for is to support compute nodes in TNF. For this reason, we'd like to avoid investing heavily in preventing compute nodes from joining the cluster - at
+install or runtime. It's worth noting that no variant of OpenShift currently actively prevents compute nodes from being added to the cluster.
+
+The only boundary we'd set is to declare compute nodes as unsupported in documentation. This would limit the amount of technical debt that we will likely need to undo in a future release.

 #### Hypershift / Hosted Control Planes

From e8f1e5b8569e28f7b5bfac8becb6b6be0541ca17 Mon Sep 17 00:00:00 2001 From: Jeremy Poulin Date: Tue, 11 Feb 2025 20:47:20 -0500 Subject: [PATCH 48/49] [TNF] Unified document around a 200 column wrap to eliminate manual wrapping. --- enhancements/two-node-fencing/tnf.md | 518 ++++++++++++++------------- 1 file changed, 277 insertions(+), 241 deletions(-) diff --git a/enhancements/two-node-fencing/tnf.md b/enhancements/two-node-fencing/tnf.md index c23a90e891..b152d415a7 100644 --- a/enhancements/two-node-fencing/tnf.md +++ b/enhancements/two-node-fencing/tnf.md @@ -34,21 +34,28 @@ tracking-link:

 ## Terms

-**RHEL-HA** - a general-purpose clustering stack shipped by Red Hat (and others) primarily consisting of Corosync and Pacemaker. Known to be in use by airports, financial exchanges, and defense organizations, as well as used on trains, satellites, and expeditions to Mars.
+**RHEL-HA** - a general-purpose clustering stack shipped by Red Hat (and others) primarily consisting of Corosync and Pacemaker. Known to be in use by airports, financial exchanges, and defense
+organizations, as well as used on trains, satellites, and expeditions to Mars.

-**Corosync** - a Red Hat led [open-source project](https://corosync.github.io/corosync/) that provides a consistent view of cluster membership, reliable ordered messaging, and flexible quorum capabilities.
+**Corosync** - a Red Hat led [open-source project](https://corosync.github.io/corosync/) that provides a consistent view of cluster membership, reliable ordered messaging, and flexible quorum +capabilities. -**Pacemaker** - a Red Hat led [open-source project](https://clusterlabs.org/pacemaker/doc/) that works in conjunction with Corosync to provide general-purpose fault tolerance and automatic failover for critical services and applications. +**Pacemaker** - a Red Hat led [open-source project](https://clusterlabs.org/pacemaker/doc/) that works in conjunction with Corosync to provide general-purpose fault tolerance and automatic failover +for critical services and applications. -**Fencing** - the process of “somehow” isolating or powering off malfunctioning or unresponsive nodes to prevent them from causing further harm, such as data corruption or the creation of divergent datasets. +**Fencing** - the process of “somehow” isolating or powering off malfunctioning or unresponsive nodes to prevent them from causing further harm, such as data corruption or the creation of divergent +datasets. -**Quorum** - having the minimum number of members required for decision-making. The most common threshold is 1 plus half the total number of members, though more complicated algorithms predicated on fencing are also possible. +**Quorum** - having the minimum number of members required for decision-making. The most common threshold is 1 plus half the total number of members, though more complicated algorithms predicated on +fencing are also possible. * C-quorum: quorum as determined by Corosync members and algorithms * E-quorum: quorum as determined by etcd members and algorithms -**Split-brain** - a scenario where a set of peers are separated into groups smaller than the quorum threshold AND peers decide to host services already running in other groups. Typically, it results in data loss or corruption. +**Split-brain** - a scenario where a set of peers are separated into groups smaller than the quorum threshold AND peers decide to host services already running in other groups. Typically, it results +in data loss or corruption. -**MCO** - Machine Config Operator. This operator manages updates to the node's systemd, cri-o/kubelet, kernel, NetworkManager, etc., and can write custom files to it, configurable by MachineConfig custom resources. +**MCO** - Machine Config Operator. This operator manages updates to the node's systemd, cri-o/kubelet, kernel, NetworkManager, etc., and can write custom files to it, configurable by MachineConfig +custom resources. **ABI** - Agent-Based Installer. A installation path through the core-installer that leverages the assisted service to facilate bare-metal installations. Can be used in disconnected environments. @@ -60,31 +67,32 @@ tracking-link: ## Summary -Leverage traditional high-availability concepts and technologies to provide a container management solution that has a minimal footprint but remains resilient to single -node-level failures suitable for customers with numerous geographically dispersed locations. +Leverage traditional high-availability concepts and technologies to provide a container management solution that has a minimal footprint but remains resilient to single node-level failures suitable +for customers with numerous geographically dispersed locations. 
## Motivation -Customers with hundreds, or even tens of thousands, of geographically dispersed locations are asking for a container management solution that retains some level of resilience -to node-level failures but does not come with a traditional three-node footprint and/or price tag. +Customers with hundreds, or even tens of thousands, of geographically dispersed locations are asking for a container management solution that retains some level of resilience to node-level failures +but does not come with a traditional three-node footprint and/or price tag. -The need for some level of fault tolerance prevents the applicability of Single Node OpenShift (SNO), and a converged 3-node cluster is cost prohibitive at the scale of -retail and telcos - even when the third node is a "cheap" one that doesn't run workloads. +The need for some level of fault tolerance prevents the applicability of Single Node OpenShift (SNO), and a converged 3-node cluster is cost prohibitive at the scale of retail and telcos - even when +the third node is a "cheap" one that doesn't run workloads. -While the degree of resiliency achievable with two nodes is not suitable for safety-critical workloads like emergency services, this proposal aims to deliver a solution -for workloads that can trade off some amount of determinism and reliability in exchange for cost-effective deployments at scale that fully utilize the capacity of both nodes -while minimizing the time to recovery for node-level failures of either node. +While the degree of resiliency achievable with two nodes is not suitable for safety-critical workloads like emergency services, this proposal aims to deliver a solution for workloads that can trade +off some amount of determinism and reliability in exchange for cost-effective deployments at scale that fully utilize the capacity of both nodes while minimizing the time to recovery for node-level +failures of either node. -The benefits of the cloud-native approach to developing and deploying applications are increasingly being adopted in edge computing. -This requires our solution to provide a management experience consistent with "normal" OpenShift deployments and be compatible with the full ecosystem of Red Hat and partner -workloads designed for OpenShift. +The benefits of the cloud-native approach to developing and deploying applications are increasingly being adopted in edge computing. This requires our solution to provide a management experience +consistent with "normal" OpenShift deployments and be compatible with the full ecosystem of Red Hat and partner workloads designed for OpenShift. ### User Stories -* As a solutions architect for a large enterprise with multiple remote sites, I want a cost-effective OpenShift cluster solution so that I can manage applications without incurring the cost of a third node at scale. +* As a solutions architect for a large enterprise with multiple remote sites, I want a cost-effective OpenShift cluster solution so that I can manage applications without incurring the cost of a third + node at scale. * As a solutions architect for a large enterprise running workloads on a minimal OpenShift footprint, I want to leverage the full capacity of both control-plane nodes to run my workloads. * As a solutions architect for a large enterprise running workloads on a minimal OpenShift footprint, I want to minimize time-to-recovery and data loss for my workloads when a node fails. 
-* As an OpenShift cluster administrator, I want a safe and automated method for handling the failure of a single node so that the downtime of the control-plane is minimized and the cluster fully recovers. +* As an OpenShift cluster administrator, I want a safe and automated method for handling the failure of a single node so that the downtime of the control-plane is minimized and the cluster fully + recovers. ### Goals @@ -115,13 +123,13 @@ workloads designed for OpenShift. ## Proposal -We will use the RHEL-HA stack (Corosync, and Pacemaker), which has been used to deliver supported two-node cluster experiences for multiple decades, to manage cri-o, kubelet, and the etcd daemon. -We will run etcd as a voting member on both nodes. -We will take advantage of RHEL-HA's native support for systemd and re-use the standard cri-o and kubelet units, as well as create a new Open Cluster Framework (OCF) script for etcd. -The existing startup order of cri-o, then kubelet, then etcd will be preserved. -The `etcdctl`, `etcd-metrics`, and `etcd-readyz` containers will remain part of the static pod definitions, the contents of which remain under the exclusive control of the cluster-etcd-operator (CEO). +We will use the RHEL-HA stack (Corosync, and Pacemaker), which has been used to deliver supported two-node cluster experiences for multiple decades, to manage cri-o, kubelet, and the etcd daemon. We +will run etcd as a voting member on both nodes. We will take advantage of RHEL-HA's native support for systemd and re-use the standard cri-o and kubelet units, as well as create a new Open Cluster +Framework (OCF) script for etcd. The existing startup order of cri-o, then kubelet, then etcd will be preserved. The `etcdctl`, `etcd-metrics`, and `etcd-readyz` containers will remain part of the +static pod definitions, the contents of which remain under the exclusive control of the cluster-etcd-operator (CEO). -In the case of an unreachable peer, we will use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fence) the unreachable node and ensure that it cannot harm while the remaining node continues. +In the case of an unreachable peer, we will use RedFish compatible Baseboard Management Controllers (BMCs) as our primary mechanism to power off (fence) the unreachable node and ensure that it cannot +harm while the remaining node continues. Upon a peer failure, the RHEL-HA components on the survivor will fence the peer and use the OCF script to restart etcd as a new cluster-of-one. @@ -131,9 +139,8 @@ Upon an etcd failure, the OCF script will detect the issue and try to restart et In both cases, the control-plane's dependence on etcd will cause it to respond with errors until etcd has been restarted. -Upon rebooting, the RHEL-HA components ensure that a node remains inert (not running cri-o, kubelet, or etcd) until it sees its peer. -If the failed peer is likely to remain offline for an extended period, admin confirmation is required on the remaining node to allow it to start OpenShift. -This functionality exists within RHEL-HA, but a wrapper will be provided to take care of the details. +Upon rebooting, the RHEL-HA components ensure that a node remains inert (not running cri-o, kubelet, or etcd) until it sees its peer. If the failed peer is likely to remain offline for an extended +period, admin confirmation is required on the remaining node to allow it to start OpenShift. 
This functionality exists within RHEL-HA, but a wrapper will be provided to take care of the details. When starting etcd, the OCF script will use etcd's cluster ID and version counter to determine whether the existing data directory can be reused, or must be erased before joining an active peer. @@ -141,37 +148,41 @@ When starting etcd, the OCF script will use etcd's cluster ID and version counte ### Summary of Changes At a glance, here are the components we are proposing to change: -| Component | Change | -| ----------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | -| [Feature Gates](#feature-gate-changes) | Add a new `DualReplicaTopology` feature which can be enabled via the `CustomNoUpgrade` feature set | -| [Infrastructure API](#infrastructure-api-changes) | Add `DualReplica` as a new value for `ControlPlaneTopology` | -| [ETCD Operator](#etcd-operator-changes) | Add mode for disabling management of the etcd container, scaling strategy for 2 nodes, and a controller for initializing pacemaker | -| [Install Config](#install-config-changes) | Update install config API to accept fencing credentials for `platform: None` and `platform: Baremetal` | -| [Installer](#installer-changes) | Populate the nodes with initial pacemaker configuration when deploying with 2 control-plane nodes and no arbiter | -| [MCO](#mco-changes) | Add an MCO extension for installating pacemaker and corosync in RHCOS; MachineConfigPool maxUnavailable set to 1 | -| [Authentication Operator](#authentication-operator-changes) | Update operator to accept minimum 1 kube api servers when `ControlPlaneTopology` is `DualReplica` | -| [Hosted Control Plane](#hosted-control-plane-changes) | Disallow HyperShift from installing on the `DualReplica` topology | -| [OLM Filtering](#olm-filtering-changes) | Leverage support for OLM to filter operators based off of control plane topology | -| [Assisted Installer Family](#assisted-installer-family-changes) | Add support for deploying bare-metal clusters with 2 control-plane nodes using Assisted Installer and Agent-Based Installer | -| [Bare Metal Operator](#bare-metal-operator-changes) | Prevent power-management of control-plane nodes when the infrastructureTopology is set to `DualReplica` | -| [Node Health Check Operator](#node-health-check-operator-changes) | Prevent fencing of control-plane nodes when the infrastructureTopology is set to `DualReplica` | + +| Component | Change | +| ----------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- | +| [Feature Gates](#feature-gate-changes) | Add a new `DualReplicaTopology` feature which can be enabled via the `CustomNoUpgrade` feature set | +| [Infrastructure API](#infrastructure-api-changes) | Add `DualReplica` as a new value for `ControlPlaneTopology` | +| [ETCD Operator](#etcd-operator-changes) | Add a mode for disabling management of the etcd container, a new scaling strategy, and a controller for initializing pacemaker | +| [Install Config](#install-config-changes) | Update install config API to accept fencing credentials for `platform: None` and `platform: Baremetal` | +| [Installer](#installer-changes) | Populate the nodes with initial pacemaker configuration when deploying with 2 control-plane nodes and no arbiter | +| [MCO](#mco-changes) | Add an MCO 
extension for installating pacemaker and corosync in RHCOS; MachineConfigPool maxUnavailable set to 1 | +| [Authentication Operator](#authentication-operator-changes) | Update operator to accept minimum 1 kube api servers when `ControlPlaneTopology` is `DualReplica` | +| [Hosted Control Plane](#hosted-control-plane-changes) | Disallow HyperShift from installing on the `DualReplica` topology | +| [OLM Filtering](#olm-filtering-changes) | Leverage support for OLM to filter operators based off of control plane topology | +| [Assisted Installer Family](#assisted-installer-family-changes) | Add support for deploying bare-metal clusters with 2 control-plane nodes using Assisted Installer and Agent-Based Installer | +| [Bare Metal Operator](#bare-metal-operator-changes) | Prevent power-management of control-plane nodes when the infrastructureTopology is set to `DualReplica` | +| [Node Health Check Operator](#node-health-check-operator-changes) | Prevent fencing of control-plane nodes when the infrastructureTopology is set to `DualReplica` | ### Workflow Description #### Cluster Creation -User creation of a two-node control-plane is possible via the Assisted Installer and the Agent-Based Installer (ABI). The initial implementation will focus on providing support for the Assisted Installer in managed cluster environments (i.e. ACM), followed by stand-alone cluster support via the Agent-Based Installer. -The requirement that the cluster can be deployed using only 2 nodes is key because requiring a third bare-metal server for installation can be expensive when deploying bare metal at scale. To accomplish this, deployments will use one of the target machines as the bootstrap node before it is rebooted into a control-plane node. +User creation of a two-node control-plane is possible via the Assisted Installer and the Agent-Based Installer (ABI). The initial implementation will focus on providing support for the Assisted +Installer in managed cluster environments (i.e. ACM), followed by stand-alone cluster support via the Agent-Based Installer. The requirement that the cluster can be deployed using only 2 nodes is key +because requiring a third bare-metal server for installation can be expensive when deploying bare metal at scale. To accomplish this, deployments will use one of the target machines as the bootstrap +node before it is rebooted into a control-plane node. -A critical transition during bootstrapping is when the bootstrap reboots into the control-plane node. Before this reboot, it needs to be removed from the etcd cluster so that quorum can be maintained as the machine reboots into a second control-plane. +A critical transition during bootstrapping is when the bootstrap reboots into the control-plane node. Before this reboot, it needs to be removed from the etcd cluster so that quorum can be maintained +as the machine reboots into a second control-plane. Otherwise, the procedure follows the standard flow except for the configuration of 2 nodes instead of 3. -To constrain the scope of support, we've targeted Assisted Installer (in ACM) and Agent-Based Installer (ABI) as our supported installation paths. Support for other installation paths -may be reevaluated as business requirements change. For example, it is technically possible to install a cluster with two control-plane nodes via `openshift-install` using an -auxiliary bootstrap node but we don't intend to support this for customers unless this becomes a business requirement. 
Similarly, ZTP may be evaluated as a future offering for clusters -deployed by ACM environments via Multi-Cluster Engine (MCE), Assisted Installer, and Bare Metal Operator. +To constrain the scope of support, we've targeted Assisted Installer (in ACM) and Agent-Based Installer (ABI) as our supported installation paths. Support for other installation paths may be +reevaluated as business requirements change. For example, it is technically possible to install a cluster with two control-plane nodes via `openshift-install` using an auxiliary bootstrap node but we +don't intend to support this for customers unless this becomes a business requirement. Similarly, ZTP may be evaluated as a future offering for clusters deployed by ACM environments via Multi-Cluster +Engine (MCE), Assisted Installer, and Bare Metal Operator. Because BMC passwords are being collected to initialize fencing, the Assisted Installer SaaS offering will not be available (to avoid storing customer BMC credentials in a Red Hat database). @@ -185,52 +196,64 @@ Three aspects of cluster creation need to happen for a vanilla two-node cluster ###### Transitioning etcd Management to RHEL-HA An important facility of the installation flow is the transition from a CEO deployed etcd to one controlled by RHEL-HA. The basic transition works as follows: -1. [MCO extensions](https://docs.openshift.com/container-platform/4.17/machine_configuration/machine-configs-configure.html#rhcos-add-extensions_machine-configs-configure) are used to ensure that the pacemaker and corosync RPMs are installed. The installer also creates MachineConfig manifests to pre-configure resource agents. -2. Upon detection that the cluster infrastructure is using the DualReplica controlPlaneTopology in the infrastructure config, an in-cluster entity (likely a new controller running in CEO) will run a command on one of the cluster nodes to initialize pacemaker. The outcome of this is that the resource agent will be started on both nodes. -3. The aforementioned in-cluster entity will signal CEO to relinquish control of etcd by setting CEO's `managedEtcdKind` to `External`. When this happens, CEO immediately removes the etcd container from the static pod configs. The resource agents for etcd are running from step 2, and they are configured to wait for etcd containers to be gone so they can restart them using Podman. -4. The installation proceeds as normal once the containers start. -If for some reason, the etcd containers cannot be started, then the installation will fail. The installer will pull logs from the control-plane nodes to provide context for this failure. +1. [MCO extensions](https://docs.openshift.com/container-platform/4.17/machine_configuration/machine-configs-configure.html#rhcos-add-extensions_machine-configs-configure) are used to ensure that the + pacemaker and corosync RPMs are installed. The installer also creates MachineConfig manifests to pre-configure resource agents. +2. Upon detection that the cluster infrastructure is using the DualReplica controlPlaneTopology in the infrastructure config, an in-cluster entity (likely a new controller running in CEO) will run a + command on one of the cluster nodes to initialize pacemaker. The outcome of this is that the resource agent will be started on both nodes. +3. The aforementioned in-cluster entity will signal CEO to relinquish control of etcd by setting CEO's `managedEtcdKind` to `External`. When this happens, CEO immediately removes the etcd container + from the static pod configs. 
The resource agents for etcd are running from step 2, and they are configured to wait for etcd containers to be gone so they can restart them using Podman. +4. The installation proceeds as normal once the containers start. If for some reason, the etcd containers cannot be started, then the installation will fail. The installer will pull logs from the +control-plane nodes to provide context for this failure. ###### Configuring Fencing Via MCO -Fencing setup is the last important aspect of the cluster installation. For the cluster installation to be successful, fencing should be configured and active before we declare the installation successful. To do this, baseboard management console (BMC) credentials need to be made available to the control-plane nodes as part of pacemaker initialization. -To ensure rapid fencing using pacemaker, we will collect RedFish details (address, username, and **password**) for each node via the install-config (see proposed install-config changes). -This will take a format similar to that of the [Bare Metal Operator](https://docs.openshift.com/container-platform/4.17/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#bmc-addressing_ipi-install-installation-workflow). -We will create a new MachineConfig that writes BMC credentials to the control-plane disks. This will resemble the BMC specification used by the [BareMetalHost](https://docs.openshift.com/container-platform/4.17/rest_api/provisioning_apis/baremetalhost-metal3-io-v1alpha1.html#spec-bmc) CRD. - -BMC information can be used to change the power state of a bare-metal machine, so it's critically important that we ensure that pacemaker is the **only entity** responsible for these operations to prevent conflicting requests to change the machine state. This means that we need to ensure that there are installer validations and validations in the Bare Metal Operator (BMO) to prevent control-plane nodes from having power management enabled in a two-node topology. Additionally, optional operators like Node Health Check, Self Node Remediation, and Fence Agents Remediation must have the same considerations but these are not present during installation. +Fencing setup is the last important aspect of the cluster installation. For the cluster installation to be successful, fencing should be configured and active before we declare the installation +successful. To do this, baseboard management console (BMC) credentials need to be made available to the control-plane nodes as part of pacemaker initialization. To ensure rapid fencing using +pacemaker, we will collect RedFish details (address, username, and **password**) for each node via the install-config (see proposed install-config changes). This will take a format similar to that of +the [Bare Metal +Operator](https://docs.openshift.com/container-platform/4.17/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#bmc-addressing_ipi-install-installation-workflow). We will +create a new MachineConfig that writes BMC credentials to the control-plane disks. This will resemble the BMC specification used by the +[BareMetalHost](https://docs.openshift.com/container-platform/4.17/rest_api/provisioning_apis/baremetalhost-metal3-io-v1alpha1.html#spec-bmc) CRD. + +BMC information can be used to change the power state of a bare-metal machine, so it's critically important that we ensure that pacemaker is the **only entity** responsible for these operations to +prevent conflicting requests to change the machine state. 
This means that we need to ensure that there are installer validations and validations in the Bare Metal Operator (BMO) to prevent +control-plane nodes from having power management enabled in a two-node topology. Additionally, optional operators like Node Health Check, Self Node Remediation, and Fence Agents Remediation must have +the same considerations but these are not present during installation. See the API Extensions section below for sample install-configs. For a two-node cluster to be successful, we need to ensure the following: 1. The BMC secrets for RHEL-HA are created on disk during bootstrapping by the OpenShift installer via a MachineConfig. -2. When pacemaker is initialized by the in-cluster entity responsible for starting pacemaker, pacemaker will try to set up fencing with this secret. If this is not successful, it throws an error which will cause degradation of the in cluster operator and would fail the installation process. +2. When pacemaker is initialized by the in-cluster entity responsible for starting pacemaker, pacemaker will try to set up fencing with this secret. If this is not successful, it throws an error which + will cause degradation of the in cluster operator and would fail the installation process. 3. Pacemaker periodically checks that the fencing agent is healthy (i.e. can connect to the BMC) and will create an alert if it cannot access the BMC. * In this case, in order to allow a simple manual recovery by the user a script will be available on the node which will reset Pacemaker with the new fencing credentials. -4. The cluster will continue to run normally in the state where the BMC cannot be accessed, but ignoring this alert will mean that pacemaker can only provide a best-effort recovery - so operations that require fencing will need manual recovery. -5. When manual recovery is triggered by running the designated script on the node, it'll also update the Secret in order to make sure the Secret is aligned with the credentials kept in Pacemaker's cib file. +4. The cluster will continue to run normally in the state where the BMC cannot be accessed, but ignoring this alert will mean that pacemaker can only provide a best-effort recovery - so operations + that require fencing will need manual recovery. +5. When manual recovery is triggered by running the designated script on the node, it'll also update the Secret in order to make sure the Secret is aligned with the credentials kept in Pacemaker's cib + file. Future Enhancements 1. Allowing usage of external credentials storage services such as Vault or Conjur. In order to support this: * Expose the remote access credentials to the in-cluster operator - * We will need an indication for using that particular mode + * We will need an indication for using that particular mode * Introduce another operator (such as Secrets Store CSI driver) to consume the remote credentials * Make sure that the relevant operator for managing remote credentials is part of the setup (potentially by using the Multi-Operator-Manager operator) - 2. Allowing refresh of the fencing credentials during runtime. One way to do so would be for the in-cluster operator to watch for a credential change made by the user, and update the credentials stored in Pacemaker upon such a change. + 2. Allowing refresh of the fencing credentials during runtime. One way to do so would be for the in-cluster operator to watch for a credential change made by the user, and update the credentials + stored in Pacemaker upon such a change. 
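
As a rough illustration of the mechanism described above, a MachineConfig rendered by the installer to place the collected RedFish details on the control-plane nodes could look something like the sketch below. The manifest name, file path, and payload layout are illustrative assumptions only - the real format is an implementation detail of the installer and of pacemaker initialization, not something fixed by this enhancement.

```yaml
# Illustrative sketch only - the manifest name, file path, and payload layout are assumptions.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-fencing-credentials        # hypothetical manifest name
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/pacemaker/fencing-credentials.json   # hypothetical location consumed during pacemaker initialization
          mode: 0600
          overwrite: true
          contents:
            # Encoded payload carrying the per-node RedFish address, username,
            # and password collected via the install-config.
            source: data:text/plain;charset=utf-8;base64,<encoded-credentials>
```

Whether the credentials land in a flat file like this or in a Secret that the pacemaker initialization step consumes is left to the implementation; the important property is that the on-node payload stays aligned with what the install-config collected, mirroring the Secret-alignment behavior described in step 5 above.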
#### Day 2 Procedures -As per a standard 3-node control-plane, OpenShift upgrades and `MachineConfig` changes that would trigger a node reboot cannot proceed if the aforementioned reboot -would go over the `maxUnavailable` allowance specified in the machine config pool, which defaults to 1. For a two-node control plane, `maxUnavailable` should only ever -be set to 1 to help ensure that these events only proceed when both peers are online and healthy, or 0 in the case where the administrator wishes to temporarily disable these events. +As per a standard 3-node control-plane, OpenShift upgrades and `MachineConfig` changes that would trigger a node reboot cannot proceed if the aforementioned reboot would go over the `maxUnavailable` +allowance specified in the machine config pool, which defaults to 1. For a two-node control plane, `maxUnavailable` should only ever be set to 1 to help ensure that these events only proceed when both +peers are online and healthy, or 0 in the case where the administrator wishes to temporarily disable these events. -The experience of managing a two-node control-plane should be largely indistinguishable from that of a 3-node one. -The primary exception is (re)booting one of the peers while the other is offline and expected to remain so. +The experience of managing a two-node control-plane should be largely indistinguishable from that of a 3-node one. The primary exception is (re)booting one of the peers while the other is offline and +expected to remain so. -As in a 3-node control-plane cluster, starting only one node is not expected to result in a functioning cluster. -Should the admin wish for the control-plane to start, the admin will need to execute a supplied confirmation command on the active cluster node. -This command will grant quorum to the RHEL-HA components, authorizing it to fence its peer and start etcd as a cluster-of-one in read/write mode. -Confirmation can be given at any point and optionally make use of SSH to facilitate initiation by an external script. +As in a 3-node control-plane cluster, starting only one node is not expected to result in a functioning cluster. Should the admin wish for the control-plane to start, the admin will need to execute a +supplied confirmation command on the active cluster node. This command will grant quorum to the RHEL-HA components, authorizing it to fence its peer and start etcd as a cluster-of-one in read/write +mode. Confirmation can be given at any point and optionally make use of SSH to facilitate initiation by an external script. ### API Extensions @@ -238,38 +261,42 @@ The full list of changes proposed is [summarized above](#summary-of-changes). Ea #### Feature Gate Changes -We will define a new `DualReplicaTopology` feature that can be enabled in `install-config.yaml` to ensure the clusters running this feature cannot be upgraded until -the feature is ready for general availability. +We will define a new `DualReplicaTopology` feature that can be enabled in `install-config.yaml` to ensure the clusters running this feature cannot be upgraded until the feature is ready for general +availability. #### Infrastructure API Changes -A mechanism is needed for components of the cluster to understand that this is a two-node control-plane topology that may require different handling. -We will define a new value for the `infrastructureTopology` field of the Infrastructure config's `TopologyMode` enum. 
-Specifically, the value of `DualReplica` will be added to the currently supported list, which includes `HighlyAvailable`, `SingleReplica`, and `External`. +A mechanism is needed for components of the cluster to understand that this is a two-node control-plane topology that may require different handling. We will define a new value for the +`infrastructureTopology` field of the Infrastructure config's `TopologyMode` enum. Specifically, the value of `DualReplica` will be added to the currently supported list, which includes +`HighlyAvailable`, `SingleReplica`, and `External`. -InfrastructureTopology is assumed to be immutable by components of the cluster. While there is strong interest in changing this to allow for topology -transitions to and from TNF, this is beyond the scope of this enhancement proposal and should be detailed in its own enhancement. +InfrastructureTopology is assumed to be immutable by components of the cluster. While there is strong interest in changing this to allow for topology transitions to and from TNF, this is beyond the +scope of this enhancement proposal and should be detailed in its own enhancement. #### etcd Operator Changes -Initially, the creation of an etcd cluster will be driven in the same way as other platforms. -Once the cluster has two members, the etcd daemon will be removed from the static pod definition and recreated as a resource controlled by RHEL-HA. -At this point, the cluster-etcd-operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by RHEL-HA can be disabled. -This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the CEO's `ConfigMap` - which can only succeed if the control-plane is healthy. +Initially, the creation of an etcd cluster will be driven in the same way as other platforms. Once the cluster has two members, the etcd daemon will be removed from the static pod definition and +recreated as a resource controlled by RHEL-HA. At this point, the cluster-etcd-operator (CEO) will be made aware of this change so that some membership management functionality that is now handled by +RHEL-HA can be disabled. This will be achieved by having the same entity that drives the configuration of RHEL-HA use the OpenShift API to update a field in the CEO's `ConfigMap` - which can only +succeed if the control-plane is healthy. -To enable this flow, we propose the addition of a `managedEtcdKind` field which defaults to `Cluster` but will be set to `External` during installation, and will only be respected if the `Infrastructure` CR's `TopologyMode` is `DualReplicaTopologyMode`. -This will allow the use of a credential scoped to `ConfigMap`s in the `openshift-etcd-operator` namespace, to make the change. +To enable this flow, we propose the addition of a `managedEtcdKind` field which defaults to `Cluster` but will be set to `External` during installation, and will only be respected if the +`Infrastructure` CR's `TopologyMode` is `DualReplicaTopologyMode`. This will allow the use of a credential scoped to `ConfigMap`s in the `openshift-etcd-operator` namespace, to make the change. -The plan is for this to be changed by one of the nodes during pacemaker initialization. Pacemaker initialization should be initiated by CEO when it detects that the cluster controlPlane topology is set to `DualReplica`. +The plan is for this to be changed by one of the nodes during pacemaker initialization. 
Pacemaker initialization should be initiated by CEO when it detects that the cluster controlPlane topology is +set to `DualReplica`. -While set to `External`, CEO will still need to render the configuration for the etcd container in a place where it can be consumed by pacemaker. This ensures that the etcd instance managed by pacemaker can be updated -accordingly in the case of a upgrade event or whenever certificates are rotated. +While set to `External`, CEO will still need to render the configuration for the etcd container in a place where it can be consumed by pacemaker. This ensures that the etcd instance managed by +pacemaker can be updated accordingly in the case of a upgrade event or whenever certificates are rotated. -In case CEO will be used as the in-cluster operator responsible for setting up Pacemaker fencing it'll require root permissions which are currently mandatory to run the required pcs commands. -Some mitigation or alternatives might be: -- Use a different (new) in-cluster operator to set up Pacemaker fencing - - However, this approach contradicts the goal of reducing the OCP release payload, as introducing a new core operator would increase its size instead of streamlining it. Additionally, adding the operator would require more effort (release payload changes, CI setup, etc.) compared to integrating an operand into CEO. It would also still require root access, potentially raising similar concerns as using CEO, just with a different audience. -- Worth noting that the HA team suggests that there is a plan to adjust the pcs to a full client server architecture which will allow the pcs commands to run without root privileges. A partial faster solution may be provided by the HA team for a specific set of commands used by the pcs client. +In case CEO will be used as the in-cluster operator responsible for setting up Pacemaker fencing it'll require root permissions which are currently mandatory to run the required pcs commands. Some +mitigation or alternatives might be: +- Use a different (new) in-cluster operator to set up Pacemaker fencing + - However, this approach contradicts the goal of reducing the OCP release payload, as introducing a new core operator would increase its size instead of streamlining it. Additionally, adding the + operator would require more effort (release payload changes, CI setup, etc.) compared to integrating an operand into CEO. It would also still require root access, potentially raising similar + concerns as using CEO, just with a different audience. +- Worth noting that the HA team suggests that there is a plan to adjust the pcs to a full client server architecture which will allow the pcs commands to run without root privileges. A partial faster + solution may be provided by the HA team for a specific set of commands used by the pcs client. #### Install Config Changes @@ -326,9 +353,8 @@ pullSecret: '' sshKey: '' ``` -Unfortunately, Bare Metal Operator already has a place to specify bmc credentials. However, providing credentials like this will result in conflicts as both the -Bare Metal Operator and the pacemaker fencing agent will have control over the machine state. In short, this example shows an invalid configuration that we must check for -in the installer. +Unfortunately, Bare Metal Operator already has a place to specify bmc credentials. However, providing credentials like this will result in conflicts as both the Bare Metal Operator and the pacemaker +fencing agent will have control over the machine state. 
In short, this example shows an invalid configuration that we must check for in the installer. ``` apiVersion: v1 baseDomain: example.com @@ -369,9 +395,9 @@ sshKey: '' ``` #### Installer Changes -Aside from the changes mentioned above detailing the new install config, the installer will be also be responsible for detecting a valid two-node footprint -(one that specifies two control-plane nodes and zero arbiters) and populating the nodes with the configuration files and resource scripts needed by pacemaker. -t will also need update the cluster's FeatureGate CR in order to enable the `CustomNoUpgrade` feature set with the `DualReplicaTopology` feature. +Aside from the changes mentioned above detailing the new install config, the installer will be also be responsible for detecting a valid two-node footprint (one that specifies two control-plane nodes +and zero arbiters) and populating the nodes with the configuration files and resource scripts needed by pacemaker. t will also need update the cluster's FeatureGate CR in order to enable the +`CustomNoUpgrade` feature set with the `DualReplicaTopology` feature. A cluster is assumed to be a 2-node cluster if the following statements are true: 1. The number of control-plane replicas is set to 2 @@ -379,45 +405,43 @@ A cluster is assumed to be a 2-node cluster if the following statements are true The number of compute nodes is also expected to be zero, but this will not be enforced. -Additionally, we will enforce that 2-node clusters are only allowed on platform `none` or platform `baremetal`. This is not a technical restriction, but -rather one that stems from a desire to limit the number of supportable configurations. If use cases emerge, cloud support for this topology may be considered -in the future. +Additionally, we will enforce that 2-node clusters are only allowed on platform `none` or platform `baremetal`. This is not a technical restriction, but rather one that stems from a desire to limit +the number of supportable configurations. If use cases emerge, cloud support for this topology may be considered in the future. #### MCO Changes -The delivery of RHEL-HA components will be opaque to the user and be delivered as an [MCO Extension](../rhcos/extensions.md) in the 4.19 timeframe. -A switch to [MCO Layering](../ocp-coreos-layering/ocp-coreos-layering.md ) will be investigated once it is GA in a shipping version of OpenShift. +The delivery of RHEL-HA components will be opaque to the user and be delivered as an [MCO Extension](../rhcos/extensions.md) in the 4.19 timeframe. A switch to [MCO +Layering](../ocp-coreos-layering/ocp-coreos-layering.md ) will be investigated once it is GA in a shipping version of OpenShift. -Additionally, in order to ensure the cluster can upgrade safely, the MachineConfigPool `maxUnavailable` control-plane nodes will be set to 1. This should prevent -upgrades from trying to proceed if a node is unavailable. +Additionally, in order to ensure the cluster can upgrade safely, the MachineConfigPool `maxUnavailable` control-plane nodes will be set to 1. This should prevent upgrades from trying to proceed if a +node is unavailable. #### Authentication Operator Changes -The authentication operator is sensitive to the number of kube-api replicas running in the cluster for -[test stability reasons](https://github.com/openshift/cluster-authentication-operator/commit/a08be2324f36ce89908f695a6ff3367ad86c6b78#diff-b6f4bf160fd7e801eadfbfac60b84bd00fcad21a0979c2d730549f10db015645R158-R163). 
-To get around this, it runs a [readiness check](https://github.com/openshift/cluster-authentication-operator/blob/2d71f164af3f3e9c84eb40669d330df2871956f5/pkg/controllers/readiness/unsupported_override.go#L58) -that needs to be updated to allow for a minimum of two replicas available when using the `DualReplica` infrastructure topology to ensure test stability. +The authentication operator is sensitive to the number of kube-api replicas running in the cluster for [test stability +reasons](https://github.com/openshift/cluster-authentication-operator/commit/a08be2324f36ce89908f695a6ff3367ad86c6b78#diff-b6f4bf160fd7e801eadfbfac60b84bd00fcad21a0979c2d730549f10db015645R158-R163). +To get around this, it runs a [readiness +check](https://github.com/openshift/cluster-authentication-operator/blob/2d71f164af3f3e9c84eb40669d330df2871956f5/pkg/controllers/readiness/unsupported_override.go#L58) that needs to be updated to +allow for a minimum of two replicas available when using the `DualReplica` infrastructure topology to ensure test stability. #### Hosted Control Plane Changes -Two-node clusters are no compatible with hosted control planes. A check will be needed to disallow hosted control planes when the infrastructureTopology is set -to `DualReplica`. +Two-node clusters are no compatible with hosted control planes. A check will be needed to disallow hosted control planes when the infrastructureTopology is set to `DualReplica`. #### OLM Filtering Changes -Layering on top of the enhancement proposal for [Two Node with Arbiter (TNA)](../arbiter-clusters.md#olm-filter-addition), it would be ideal to include -the `DualReplica` infrastructureTopology as an option that operators can levarage to communicate cluster-compatibility. +Layering on top of the enhancement proposal for [Two Node with Arbiter (TNA)](../arbiter-clusters.md#olm-filter-addition), it would be ideal to include the `DualReplica` infrastructureTopology as an +option that operators can levarage to communicate cluster-compatibility. #### Assisted Installer Family Changes -In order to achieve the requirement for deploying two-node openshift with only two-nodes, we will add support for installing using 2 nodes in the Assisted and -Agent-Based installers. The core of this change is an update to the assisted-service validation. Care must be taken to ensure that the SaaS offering is not -included in this, since we do not wish to store customer fencing credentials in a Red Hat database. In other words, installing using the Assisted Installer will -only be supported using MCE. +In order to achieve the requirement for deploying two-node openshift with only two-nodes, we will add support for installing using 2 nodes in the Assisted and Agent-Based installers. The core of this +change is an update to the assisted-service validation. Care must be taken to ensure that the SaaS offering is not included in this, since we do not wish to store customer fencing credentials in a Red +Hat database. In other words, installing using the Assisted Installer will only be supported using MCE. #### Bare Metal Operator Changes -Because paceamaker is the only entity allowed to take fencing actions on the control-plane nodes, the Bare Metal Operator will need to be updated to ensure that `BareMetalHost` entries cannot be added for the control-plane nodes. -This will prevent power management operations from being triggered outside of the purview of pacemaker. 
+Because paceamaker is the only entity allowed to take fencing actions on the control-plane nodes, the Bare Metal Operator will need to be updated to ensure that `BareMetalHost` entries cannot be added +for the control-plane nodes. This will prevent power management operations from being triggered outside of the purview of pacemaker. #### Node Health Check Operator Changes -The Node Health Check operator allows a user to create fencing requests, which are in turn remediated by other optional operators. To ensure that the control-plane nodes -cannot end up in a reboot deadlock between fencing agents, the Node Health Check operator should be rendered inert on this topology. Another approach would -be to enforce this limitation for just the control-plane nodes, but this proposal assumes that no dedicated compute nodes will be run in this topology. +The Node Health Check operator allows a user to create fencing requests, which are in turn remediated by other optional operators. To ensure that the control-plane nodes cannot end up in a reboot +deadlock between fencing agents, the Node Health Check operator should be rendered inert on this topology. Another approach would be to enforce this limitation for just the control-plane nodes, but +this proposal assumes that no dedicated compute nodes will be run in this topology. ### Topology Considerations @@ -425,74 +449,88 @@ TNF represents a new topology and is not appropriate for use with HyperShift, SN #### Standalone Clusters -Two-node OpenShift is first and foremost a topology of OpenShift, so it should be able to run without any assumptions of a cluster manager. To achieve this, we -will need to enable the installation of two-node clusters via the Agent-Based Installer to ensure that we are still meeting the installation requirement of using -only 2 nodes. +Two-node OpenShift is first and foremost a topology of OpenShift, so it should be able to run without any assumptions of a cluster manager. To achieve this, we will need to enable the installation of +two-node clusters via the Agent-Based Installer to ensure that we are still meeting the installation requirement of using only 2 nodes. ### Implementation Details/Notes/Constraints -While the target installation requires exactly 2 nodes, this will be achieved by proving out the "bootstrap plus 2 nodes" flow in the core installer and then using assisted-service-based installers to bootstrap from one of the target machines to remove the requirement for a bootstrap node. +While the target installation requires exactly 2 nodes, this will be achieved by proving out the "bootstrap plus 2 nodes" flow in the core installer and then using assisted-service-based installers to +bootstrap from one of the target machines to remove the requirement for a bootstrap node. -So far, we've discovered topology-sensitive logic in ingress, authentication, CEO, and the cluster-control-plane-machineset-operator. We expect to find others once we introduce the new infrastructure topology. +So far, we've discovered topology-sensitive logic in ingress, authentication, CEO, and the cluster-control-plane-machineset-operator. We expect to find others once we introduce the new infrastructure +topology. -Once installed, the configuration of the RHEL-HA components will be done via an in-cluster entity. This entity could be a dedicated in-cluster TNF setup operator or a function of CEO triggering a script on one of the control-plane nodes. 
-This script needs to be run with root permissions, so this is another factor to consider when evaluating if a new in-cluster operator is needed. -Regardless, this initialization will require that RedFish details have been collected by the installer and synced to the nodes. +Once installed, the configuration of the RHEL-HA components will be done via an in-cluster entity. This entity could be a dedicated in-cluster TNF setup operator or a function of CEO triggering a +script on one of the control-plane nodes. This script needs to be run with root permissions, so this is another factor to consider when evaluating if a new in-cluster operator is needed. Regardless, +this initialization will require that RedFish details have been collected by the installer and synced to the nodes. Sensible defaults will be chosen where possible, and user customization only where necessary. -This RHEL-HA initialization script will also configure a fencing priority for the nodes - alphabetically by name. The priority takes the form of a delay, where the second node will wait 20 seconds to prevent parallel fencing operations during a primary-network outage where each side powers off the other - resulting in a total cluster outage. +This RHEL-HA initialization script will also configure a fencing priority for the nodes - alphabetically by name. The priority takes the form of a delay, where the second node will wait 20 seconds to +prevent parallel fencing operations during a primary-network outage where each side powers off the other - resulting in a total cluster outage. -RHEL-HA has no real understanding of the resources (IP addresses, file systems, databases, even virtual machines) it manages. -It relies on resource agents to understand how to check the state of a resource, as well as start and stop them to achieve the desired target state. -How a given agent uses these actions, and associated states, to model the resource is opaque to the cluster and depends on the needs of the underlying resource. +RHEL-HA has no real understanding of the resources (IP addresses, file systems, databases, even virtual machines) it manages. It relies on resource agents to understand how to check the state of a +resource, as well as start and stop them to achieve the desired target state. How a given agent uses these actions, and associated states, to model the resource is opaque to the cluster and depends on +the needs of the underlying resource. -Resource agents must conform to one of a variety of standards, including systemd, SYS-V, and OCF. -The latter is the most powerful, adding the concept of promotion, and demotion. -More information on creating OCF agents can be found in the upstream [developer guide](https://github.com/ClusterLabs/resource-agents/blob/main/doc/dev-guides/ra-dev-guide.asc). +Resource agents must conform to one of a variety of standards, including systemd, SYS-V, and OCF. The latter is the most powerful, adding the concept of promotion, and demotion. More information on +creating OCF agents can be found in the upstream [developer guide](https://github.com/ClusterLabs/resource-agents/blob/main/doc/dev-guides/ra-dev-guide.asc). Tools for extracting support information (must-gather tarballs) will be updated to gather relevant logs for triaging issues. -As part of the fencing setup, the cri-o and kubelet services will still be owned by systemd when running under pacemaker. The main difference is that the resource agent will be responsible for signaling systemd to change their active states. 
-The etcd containers are different in this respect since they will be restarted using Podman, but this will be running as root, as it was under CEO. +As part of the fencing setup, the cri-o and kubelet services will still be owned by systemd when running under pacemaker. The main difference is that the resource agent will be responsible for +signaling systemd to change their active states. The etcd containers are different in this respect since they will be restarted using Podman, but this will be running as root, as it was under CEO. #### The Fencing Network -The [goals](#goals) section of this proposal highlights the intent of minimizing the scenarios that require manual administrator intervention and maximizing the cluster's -similarity to the experience of running a 3-node hyperconverged cluster - especially in regard to auto-recovering from the failure of either node. The key to delivering on these goals is fencing. +The [goals](#goals) section of this proposal highlights the intent of minimizing the scenarios that require manual administrator intervention and maximizing the cluster's similarity to the experience +of running a 3-node hyperconverged cluster - especially in regard to auto-recovering from the failure of either node. The key to delivering on these goals is fencing. In order be able to leverage fencing, two conditions must be met: 1. The fencing network must be available 2. Fencing must be properly configured -Because of the criticality of the fencing network being available, the fencing network should be isolated from main network used by the cluster. This ensures that a network -disruption experienced between the nodes on the main network can still proceed with a recovery operation with an available fencing network. +Because of the criticality of the fencing network being available, the fencing network should be isolated from main network used by the cluster. This ensures that a network disruption experienced +between the nodes on the main network can still proceed with a recovery operation with an available fencing network. -Addressing the matter of ensuring that fencing is configured properly is more challenging. While the installer should protect from some kinds of invalid configuration -(e.g. an incorrect BMC password), this is not a guarantee that the configuration is actually valid. For example, a user could accidentally enter the BMC credentials for -a different server altogether, and pacemaker would be none-the-wiser. The only way to ensure that fencing is configured properly is to test it. +Addressing the matter of ensuring that fencing is configured properly is more challenging. While the installer should protect from some kinds of invalid configuration (e.g. an incorrect BMC password), +this is not a guarantee that the configuration is actually valid. For example, a user could accidentally enter the BMC credentials for a different server altogether, and pacemaker would be +none-the-wiser. The only way to ensure that fencing is configured properly is to test it. ##### Fencing Health Check -To test that fencing is operating as intended in an installed cluster, a special fencing-verification Day-2 operation will be made available to users. At first, this would -exist as a Fencing Health Check script that is run to crash each of the nodes in turn, waiting for the cluster to recover between crash events. As a longer term objective, it -might make sense to explore if this could be integrated into the OpenShift client. 
It is recommended that cluster administrators run the Fencing Health Check operation -regularly for TNF clusters. A future improvement could add reminders for cluster administrators who haven't run the Fencing Health Check in the last 3 or 6 months. +To test that fencing is operating as intended in an installed cluster, a special fencing-verification Day-2 operation will be made available to users. At first, this would exist as a Fencing Health +Check script that is run to crash each of the nodes in turn, waiting for the cluster to recover between crash events. As a longer term objective, it might make sense to explore if this could be +integrated into the OpenShift client. It is recommended that cluster administrators run the Fencing Health Check operation regularly for TNF clusters. A future improvement could add reminders for +cluster administrators who haven't run the Fencing Health Check in the last 3 or 6 months. #### Platform `none` vs. `baremetal` -One of the major design questions of two-node OpenShift is whether to target support for `platform: none` or `platform: baremetal`. The advantage of selecting `platform: baremetal` is that we can leverage the benefits of deploying an ingress-VIP out of the box using keepalived and haproxy. After some discussion with the metal networking team, it is expected that this might work without modifications as long as pacemaker fencing doesn't remove nodes from the node list so that both keepalived instances are always peers. Furthermore, it was noted that this might be solved more simply without keepalived at all by using the ipaddr2 resource agent for pacemaker to run the `ip addr add` and `ip addr remove` commands for the VIP. -The bottom line is that it will take some engineering effort to modify the out-of-the-box in-cluster networking feature for two-node OpenShift. +One of the major design questions of two-node OpenShift is whether to target support for `platform: none` or `platform: baremetal`. The advantage of selecting `platform: baremetal` is that we can +leverage the benefits of deploying an ingress-VIP out of the box using keepalived and haproxy. After some discussion with the metal networking team, it is expected that this might work without +modifications as long as pacemaker fencing doesn't remove nodes from the node list so that both keepalived instances are always peers. Furthermore, it was noted that this might be solved more simply +without keepalived at all by using the ipaddr2 resource agent for pacemaker to run the `ip addr add` and `ip addr remove` commands for the VIP. The bottom line is that it will take some engineering +effort to modify the out-of-the-box in-cluster networking feature for two-node OpenShift. -Outside of potentially reusing the networking bits of `platform: baremetal`, we discussed potentially reusing its API for collecting BMC credentials for fencing. In this approach, we'd use the `platform: baremetal` BMC entries would be loaded into `BareMetalHost` CRDs and we'd extend BMO to initialize pacemaker instead of a new operator. After a discussion with the Bare Metal Platform team, we were advised against using the Bare Metal Operator as an inventory. Its purpose/scope is provisioning nodes. +Outside of potentially reusing the networking bits of `platform: baremetal`, we discussed potentially reusing its API for collecting BMC credentials for fencing. 
In this approach, the +`platform: baremetal` BMC entries would be loaded into `BareMetalHost` CRDs and we'd extend BMO to initialize pacemaker instead of a new operator. After a discussion with the Bare Metal Platform team, +we were advised against using the Bare Metal Operator as an inventory. Its purpose/scope is provisioning nodes. -This means that the Baremetal Operator is not initially in scope for a two-node cluster because we don't intend to support compute nodes. However, if this requirement were to change for future business opportunities, it may still be useful to provide the user with an install-time option for deploying the Baremetal Operator. +This means that the Baremetal Operator is not initially in scope for a two-node cluster because we don't intend to support compute nodes. However, if this requirement were to change for future +business opportunities, it may still be useful to provide the user with an install-time option for deploying the Baremetal Operator. -Given the likelihood of customers wanting flexibility over the footprint and capabilities of the platform operators running on the cluster, the safest path forward is to target TNF clusters on both `platform: none` and platform `platform: baremetal` clusters. +Given the likelihood of customers wanting flexibility over the footprint and capabilities of the platform operators running on the cluster, the safest path forward is to target TNF clusters on both +`platform: none` and `platform: baremetal` clusters. -For `platform: none` clusters, this will require customers to provide an ingress load balancer. That said, if in-cluster networking becomes a feature customers request for `platform: none` we can work with the Metal Networking team to prioritize this as a feature for this platform in the future. +For `platform: none` clusters, this will require customers to provide an ingress load balancer. That said, if in-cluster networking becomes a feature customers request for `platform: none`, we can work +with the Metal Networking team to prioritize this as a feature for this platform in the future. #### Graceful vs. Unplanned Reboots -Events that have to be handled uniquely by a two-node cluster can largely be categorized into one of two buckets. In the first bucket, we have things that trigger graceful reboots. This includes events like upgrades, MCO-triggered reboots, and users sending a shutdown command to one of the nodes. In each of these cases - assuming a functioning two-node cluster - the node that is shutting down must wait for pacemaker to signal to etcd to remove the node from the etcd quorum to maintain e-quorum. When the node reboots, it must rejoin the etcd cluster and sync its database to the active node. +Events that have to be handled uniquely by a two-node cluster can largely be categorized into one of two buckets. In the first bucket, we have things that trigger graceful reboots. This includes +events like upgrades, MCO-triggered reboots, and users sending a shutdown command to one of the nodes. In each of these cases - assuming a functioning two-node cluster - the node that is shutting down +must wait for pacemaker to signal to etcd to remove the node from the etcd quorum to maintain e-quorum. When the node reboots, it must rejoin the etcd cluster and sync its database to the active node. -Unplanned reboots include any event where one of the nodes cannot signal to etcd that it needs to leave the cluster. 
This includes situations such as a network disconnection between the nodes, power outages, or turning off a machine using a command like `poweroff -f`. The point is that a machine needs to be fenced so that the other node can perform a special recovery operation. This recovery involves pacemaker restarting the etcd on the surviving node with a new cluster ID as a cluster-of-one. This way, when the other node rejoins, it must reconcile its data directory and resync to the new cluster before it can rejoin as an active peer. +Unplanned reboots include any event where one of the nodes cannot signal to etcd that it needs to leave the cluster. This includes situations such as a network disconnection between the nodes, power +outages, or turning off a machine using a command like `poweroff -f`. The point is that a machine needs to be fenced so that the other node can perform a special recovery operation. This recovery +involves pacemaker restarting the etcd on the surviving node with a new cluster ID as a cluster-of-one. This way, when the other node rejoins, it must reconcile its data directory and resync to the +new cluster before it can rejoin as an active peer. #### Failure Scenario Timelines: This section provides specific steps for how two-node clusters would handle interesting events. @@ -514,8 +552,7 @@ This section provides specific steps for how two-node clusters would handle inte 2. Network Failure 1. Corosync on both nodes detects separation 2. Internal quorum for etcd (E-quorum) is lost and etcd goes read-only - 3. Both sides retain C-quorum and initiate fencing of the other side. - RHEL-HA's fencing priority avoids parallel fencing operations and thus the total shutdown of the system. + 3. Both sides retain C-quorum and initiate fencing of the other side. RHEL-HA's fencing priority avoids parallel fencing operations and thus the total shutdown of the system. 4. One side wins, pre-configured as Node1 5. Pacemaker on Node1 restarts etcd forcing a new cluster with old state to recover E-quorum. Node2 is added to the etcd member list as a learner member. 6. Cluster continues with no redundancy @@ -556,7 +593,8 @@ This section provides specific steps for how two-node clusters would handle inte 8. … time passes … 9. Node1 Power restored 10. Node1 boots but cannot gain quorum before Node2 joins the cluster due to the risk of a fencing loop - * Mitigation (Phase 1): manual intervention (possibly a script) in case the admin can guarantee Node2 is down, which will grant Node1 quorum and restore cluster limited (none HA) functionality. + * Mitigation (Phase 1): manual intervention (possibly a script) in case the admin can guarantee Node2 is down, which will grant Node1 quorum and restore limited (non-HA) cluster + functionality. * Mitigation (Phase 2): limited automatic intervention for some use cases: for example, Node1 will gain quorum only if Node2 can be verified to be down by successfully querying its BMC status. 5. Kubelet Failure 1. Pacemaker’s monitoring detects the failure @@ -572,11 +610,14 @@ This section provides specific steps for how two-node clusters would handle inte #### Running Two Node OpenShift with Fencing with a Failed Node An interesting aspect of TNF is that should a node fail and remain in a failed state, the cluster recovery operation will allow the survivor to restart etcd as a cluster-of-one and resume normal -operations. In this state, we have an operational cluster with a single control plane node. 
So, what is the difference between this state and running Single Node OpenShift? There are three key +aspects: 1. Operators that deploy to multiple nodes will become degraded. 2. Operations that would violate pod-disruption budgets will not work. -3. Lifecycle operations that would violate the `MaxUnavailable` setting of the control-plane [MachineConfigPool](https://docs.openshift.com/container-platform/4.17/updating/understanding_updates/understanding-openshift-update-duration.html#factors-affecting-update-duration_openshift-update-duration) cannot proceed. This includes MCO node reboots and cluster upgrades. +3. Lifecycle operations that would violate the `MaxUnavailable` setting of the control-plane + [MachineConfigPool](https://docs.openshift.com/container-platform/4.17/updating/understanding_updates/understanding-openshift-update-duration.html#factors-affecting-update-duration_openshift-update-duration) + cannot proceed. This includes MCO node reboots and cluster upgrades. In short - it is not recommended that users allow their clusters to remain in this semi-operational state long-term. It is intended to help ensure that the api-server and workloads are available as much as possible, but it is not sufficient for the operation of a healthy cluster long-term. @@ -598,7 +639,9 @@ The only boundary we'd set to is declare compute nodes as unsupported in documen #### Hypershift / Hosted Control Planes -This topology is anti-synergistic with HyperShift. As the management cluster, a cost-sensitive control-plane runs counter to the the proposition of highly-scaleable hosted control-planes since your compute resources are limited. As the hosted cluster, the benefit of hypershift is that your control-planes are running as pods in the management cluster. Reducing the number of instances of control-plane nodes would trade the minimal cost of a third set of control-plane pods at the cost of having to implement fencing between your control-plane pods. +This topology is anti-synergistic with HyperShift. As the management cluster, a cost-sensitive control-plane runs counter to the proposition of highly-scalable hosted control-planes since your +compute resources are limited. As the hosted cluster, the benefit of HyperShift is that your control-planes are running as pods in the management cluster. Reducing the number of instances of +control-plane nodes would save only the minimal cost of a third set of control-plane pods at the cost of having to implement fencing between your control-plane pods. #### Single-node Deployments or MicroShift @@ -622,13 +665,14 @@ This proposal is an alternative architecture to Single-node and MicroShift, so i 3. Mitigation: The node will be reachable via SSH and the confirmation can be scripted 4. Mitigation: It may be possible to identify scenarios where, for a known hardware topology, it is safe to allow the node to proceed automatically. -5. Risk: “Something changed, let's reboot” is somewhat baked into OCP’s DNA and has the potential to be problematic when nodes are actively watching for their peer to disappear, and have an obligation to promptly act on that disappearance by power cycling them. - 1. Mitigation: Identify causes of reboots, and either avoid them or ensure they are not treated as failures. 
- Most OpenShift-trigger events, such as upgrades and MCO-triggered restarts, should follow the logic described above for graceful reboots, which should result in minimal disruption. +5. Risk: “Something changed, let's reboot” is somewhat baked into OCP’s DNA and has the potential to be problematic when nodes are actively watching for their peer to disappear, and have an obligation + to promptly act on that disappearance by power cycling them. + 1. Mitigation: Identify causes of reboots, and either avoid them or ensure they are not treated as failures. Most OpenShift-trigger events, such as upgrades and MCO-triggered restarts, should + follow the logic described above for graceful reboots, which should result in minimal disruption. 6. Risk: We may not succeed in identifying all the reasons a node will reboot - 1. Mitigation: So far, we've classified all failure events on this topology as belonging to the expected (i.e. graceful) reboot flow, or the unexpected (i.e. ungraceful) - reboot flow. Should a third classification emerge during testing, we will expand out test to include this and enumerate the events that fall into that flow. + 1. Mitigation: So far, we've classified all failure events on this topology as belonging to the expected (i.e. graceful) reboot flow, or the unexpected (i.e. ungraceful) reboot flow. Should a third + classification emerge during testing, we will expand out test to include this and enumerate the events that fall into that flow. 7. Risk: This new platform will have a unique installation flow 1. Mitigation: A new CI lane will be created for this topology @@ -639,20 +683,17 @@ This proposal is an alternative architecture to Single-node and MicroShift, so i ### Drawbacks -The two-node architecture represents yet another distinct install type for users to choose from, and therefore another addition to the test matrix -for bare-metal installation variants. Because this topology has so many unique failure recovery paths, it also requires an in-depth new test -suite which can be used to exercise all of these failure recovery scenarios. +The two-node architecture represents yet another distinct install type for users to choose from, and therefore another addition to the test matrix for bare-metal installation variants. Because this +topology has so many unique failure recovery paths, it also requires an in-depth new test suite which can be used to exercise all of these failure recovery scenarios. -More critically, this is the only variant of OpenShift that would recommend a regular maintenance check to ensure that failures that require -fencing result in automatic recovery. Conveying the importance of a regular disaster recovery readiness checks will be an interesting challenge -for the user experience. +More critically, this is the only variant of OpenShift that would recommend a regular maintenance check to ensure that failures that require fencing result in automatic recovery. Conveying the +importance of a regular disaster recovery readiness checks will be an interesting challenge for the user experience. -The existence of 1, 2, and 3+ node control-plane sizes will likely generate customer demand to move between them as their needs change. -Satisfying this demand would come with significant technical and support overhead which is out of scope for this enhancement. +The existence of 1, 2, and 3+ node control-plane sizes will likely generate customer demand to move between them as their needs change. 
Satisfying this demand would come with significant technical and +support overhead which is out of scope for this enhancement. ## Open Questions [optional] -1. Are there any normal lifecycle events that would be interpreted by a peer as a failure, and where the resulting "recovery" would create unnecessary downtime? - How can these be avoided? +1. Are there any normal lifecycle events that would be interpreted by a peer as a failure, and where the resulting "recovery" would create unnecessary downtime? How can these be avoided? So far, we haven't found any of these. Normal lifecycle events can be handled cleanly through the graceful recovery flow. @@ -660,33 +701,31 @@ Satisfying this demand would come with significant technical and support overhea 3. Can we do pacemaker initialization without the introduction of a new operator? - We've talked over the pros and cons of a new operator to handle aspects of the TNF setup. The primary job of a TNF setup operator would be to initialize pacemaker and - to ensure that it reaches a healthy state. This becomes a simple way of kicking off the transition from CEO controlled etcd to RHEL-HA controlled etcd. As an operator, - it can also degrade during installation to ensure that installation fails if fencing credentials are invalid or the etcd containers cannot be started. The last benefit is - that the operator could later be used to communicate information about pacemaker to a cluster admin in case the resource and/or fencing agents become unhealthy. + We've talked over the pros and cons of a new operator to handle aspects of the TNF setup. The primary job of a TNF setup operator would be to initialize pacemaker and to ensure that it reaches a + healthy state. This becomes a simple way of kicking off the transition from CEO-controlled etcd to RHEL-HA-controlled etcd. As an operator, it can also degrade during installation to ensure that + installation fails if fencing credentials are invalid or the etcd containers cannot be started. The last benefit is that the operator could later be used to communicate information about pacemaker + to a cluster admin in case the resource and/or fencing agents become unhealthy. - After some discussion, we're prioritizing an exploration of a solution to this initialization without introducing a new operator. The operator that is closest in scope - to pacemaker initialization is the cluster-etcd-operator. Ideally, we could have it be responsible for kicking off the initialization of pacemaker, since the core of a - successful TNF setup is to ensure etcd ownership is transitioned to a healthy RHEL-HA deployment. While it is a little unorthodox for a core operator to initialize an external - component, that component is tightly coupled with the health of etcd to begin with and they benefit from being deployed and tested together. Additionally, most cases that - would result in pacemaker failing to initialize would result in CEO being degraded as well. One concern raised for this approach is that we may introduce a greater security - risk since CEO permissions need to be elevated so that a container can run as root to initialize pacemaker. The other challenge to solve with this approach is how we - communicate problems discovered by pacemaker to the user. + After some discussion, we're prioritizing an exploration of a solution to this initialization without introducing a new operator. The operator that is closest in scope to pacemaker initialization + is the cluster-etcd-operator. 
Ideally, we could have it be responsible for kicking off the initialization of pacemaker, since the core of a successful TNF setup is to ensure etcd ownership is + transitioned to a healthy RHEL-HA deployment. While it is a little unorthodox for a core operator to initialize an external component, that component is tightly coupled with the health of etcd to + begin with and they benefit from being deployed and tested together. Additionally, most cases that would result in pacemaker failing to initialize would result in CEO being degraded as well. One + concern raised for this approach is that we may introduce a greater security risk since CEO permissions need to be elevated so that a container can run as root to initialize pacemaker. The other + challenge to solve with this approach is how we communicate problems discovered by pacemaker to the user. 4. How do we notify the user of problems found by pacemaker? - Pacemaker will be running as a system daemon and reporting errors about its various agents to the system journal. The question is, what is the best way to expose these to - a cluster admin? A simple example of this would be an issue where pacemaker discovers that its fencing agent can no longer talk to the BMC. What is the best way to raise this - error to the cluster admin, such that they can see that their cluster may be at risk of failure if no action is taken to resolve the problem? If we introduce a TNF setup - operator, this could be one of the ongoing functions of this operator. In our current design, we'd likely need to explore what kinds of errors we can bubble up through - existing cluster health APIs to see if something suitable can be reused. + Pacemaker will be running as a system daemon and reporting errors about its various agents to the system journal. The question is, what is the best way to expose these to a cluster admin? A simple + example of this would be an issue where pacemaker discovers that its fencing agent can no longer talk to the BMC. What is the best way to raise this error to the cluster admin, such that they can + see that their cluster may be at risk of failure if no action is taken to resolve the problem? If we introduce a TNF setup operator, this could be one of the ongoing functions of this operator. In + our current design, we'd likely need to explore what kinds of errors we can bubble up through existing cluster health APIs to see if something suitable can be reused. 5. How do we handle updates to the etcd container? - Things like certificate rotations and image updates will necessitate updates to the pacemaker-controlled etcd container. We will need to introduce some kind of mechanism - where CEO can describe the changes that need to happen and trigger an image update. We might be able to leverage - [podman play kube](https://docs.podman.io/en/v4.2/markdown/podman-play-kube.1.html) to map the static pod definition to a container, but we will need to find - a way to get CEO to render what would usually be the contents of the static pod config to somewhere pacemaker can see updates and respond to them. + Things like certificate rotations and image updates will necessitate updates to the pacemaker-controlled etcd container. We will need to introduce some kind of mechanism where CEO can describe the + changes that need to happen and trigger an image update. 
We might be able to leverage [podman play kube](https://docs.podman.io/en/v4.2/markdown/podman-play-kube.1.html) to map the static pod + definition to a container, but we will need to find a way to get CEO to render what would usually be the contents of the static pod config to somewhere pacemaker can see updates and respond to + them. ## Test Plan @@ -730,8 +769,8 @@ This section outlines test scenarios for TNF. | Scalability | Scalability metrics are gathered and compared to SNO and Compact HA | | TNF Recovery Suite | Ensures all of the 2-node specific behaviors are working as expected | -As noted above, there is an open question about how layered products should be treated in the test plan. -Additionally, it would be good to have workload-specific testing once those are defined by the workload proposal. +As noted above, there is an open question about how layered products should be treated in the test plan. Additionally, it would be good to have workload-specific testing once those are defined by the +workload proposal. ## Graduation Criteria @@ -766,66 +805,61 @@ Additionally, it would be good to have workload-specific testing once those are ## Upgrade / Downgrade Strategy -This topology has the same expectations for upgrades as the other variants of OpenShift. -For tech preview, upgrades will only be achieved by redeploying the machine and its workload. However, -fully automated upgrades are a requirement for graduating to GA. +This topology has the same expectations for upgrades as the other variants of OpenShift. For tech preview, upgrades will only be achieved by redeploying the machine and its workload. However, fully +automated upgrades are a requirement for graduating to GA. -One key detail about upgrades is that they will only be allowed to proceed when both nodes are healthy. -The main challenge with upgrading a 2-node cluster is ensuring the cluster stays functional and consistent -through the reboots of the upgrade. This can be achieved by setting the `maxUnavailable` machines in the -control-plane MachineConfigPool to 1. +One key detail about upgrades is that they will only be allowed to proceed when both nodes are healthy. The main challenge with upgrading a 2-node cluster is ensuring the cluster stays functional and +consistent through the reboots of the upgrade. This can be achieved by setting `maxUnavailable` in the +control-plane MachineConfigPool to 1. Downgrades are not supported outside of redeployment. ## Version Skew Strategy -The biggest concern with version skew would be incompatibilities between a new version of pacemaker and the currently running resource agents. -Upgrades will not atomically replace both the RPM and the resource agent configuration, not are there any guarantees that both nodes will be running -the same versions. It's difficult to imagine a case where such an incompatibility wouldn't be caught during upgrade testing. This will be something to keep -a close eye on when evaluating upgrade jobs for potential race conditions. +The biggest concern with version skew would be incompatibilities between a new version of pacemaker and the currently running resource agents. Upgrades will not atomically replace both the RPM and the +resource agent configuration, nor are there any guarantees that both nodes will be running the same versions. It's difficult to imagine a case where such an incompatibility wouldn't be caught during +upgrade testing. 
This will be something to keep a close eye on when evaluating upgrade jobs for potential race conditions. ## Operational Aspects of API Extensions -- For conversion/admission webhooks and aggregated API servers: what are the SLIs (Service Level - Indicators) an administrator or support can use to determine the health of the API extensions +- For conversion/admission webhooks and aggregated API servers: what are the SLIs (Service Level Indicators) an administrator or support can use to determine the health of the API extensions N/A -- What impact do these API extensions have on existing SLIs (e.g. scalability, API throughput, - API availability) +- What impact do these API extensions have on existing SLIs (e.g. scalability, API throughput, API availability) - Toggling CEO control values with result in etcd being briefly offline. The transition is almost immediate, though, since the resource agent is watching for the - etcd container to disappear so it can start its replacement. + Toggling CEO control values will result in etcd being briefly offline. The transition is almost immediate, though, since the resource agent is watching for the etcd container to disappear so it can + start its replacement. The other potential impact is around reboots. There may be a small performance impact when the nodes reboot since they have to leave the etcd cluster and resync etcd to join. -- How is the impact on existing SLIs to be measured and when (e.g. every release by QE, or automatically in CI) and by whom (e.g. perf team; name the responsible person and let them review this enhancement) +- How is the impact on existing SLIs to be measured and when (e.g. every release by QE, or automatically in CI) and by whom (e.g. perf team; name the responsible person and let them review this + enhancement) The impact of the etcd transition as well as the reboot interactions with etcd are likely compatible with existing SLIs. - Describe the possible failure modes of the API extensions. - There shouldn't be any failures introduced by adding a new topology. - Transitioning etcd from CEO-managed to externally managed should be a one-way switch, verified by a ValidatingAdmissionPolicy. If it fails, it should result in a failed installation. - This is also true of the initial fencing setup. If pacemaker cannot be initialized, the cluster installation should ideally fail. - If the BMC access starts failing later in the cluster lifecycle and the administrator fails to remediate this before a fencing operation is required, then manual recovery will be required by SSH-ing to the node. + There shouldn't be any failures introduced by adding a new topology. Transitioning etcd from CEO-managed to externally managed should be a one-way switch, verified by a ValidatingAdmissionPolicy. If + it fails, it should result in a failed installation. This is also true of the initial fencing setup. If pacemaker cannot be initialized, the cluster installation should ideally fail. If the BMC + access starts failing later in the cluster lifecycle and the administrator fails to remediate this before a fencing operation is required, then manual recovery will be required by SSH-ing to the + node. -- Describe how a failure or behaviour of the extension will impact the overall cluster health - (e.g. which kube-controller-manager functionality will stop working), especially regarding - stability, availability, performance, and security. +- Describe how a failure or behaviour of the extension will impact the overall cluster health (e.g. 
which kube-controller-manager functionality will stop working), especially regarding stability, + availability, performance, and security. - As mentioned above, a network outage or a BMC credential rotation could result in a customer "breaking" pacemaker's ability to fence nodes. On its own, pacemaker has no way of communicating this kind of failure to the cluster admin. Whether this can be raised to OpenShift via monitoring or something similar remains an open question. + As mentioned above, a network outage or a BMC credential rotation could result in a customer "breaking" pacemaker's ability to fence nodes. On its own, pacemaker has no way of communicating this + kind of failure to the cluster admin. Whether this can be raised to OpenShift via monitoring or something similar remains an open question. -- Describe which OCP teams are likely to be called upon in case of escalation with one of the failure modes - and add them as reviewers to this enhancement. +- Describe which OCP teams are likely to be called upon in case of escalation with one of the failure modes and add them as reviewers to this enhancement. - In case of escalation, the most likely team affected outside of Edge Enablement (or whoever owns the proposed topology) is the Control Plane team because etcd is the primary component that two-node OpenShift needs to manage properly to protect against data corruption/loss. + In case of escalation, the most likely team affected outside of Edge Enablement (or whoever owns the proposed topology) is the Control Plane team because etcd is the primary component that two-node + OpenShift needs to manage properly to protect against data corruption/loss. ## Support Procedures - Failure logs for pacemaker will be available in the system journal. The installer should report these in the case that a cluster cannot successfully initialize pacemaker. -- A BMC connection failure detected after the cluster is installed can be remediated as long as the cluster is healthy. A new MachineConfig can be applied to update the secrets file. If the cluster - is down, this file would need to be updated manually. +- A BMC connection failure detected after the cluster is installed can be remediated as long as the cluster is healthy. A new MachineConfig can be applied to update the secrets file. If the cluster is + down, this file would need to be updated manually. - In the case of a failed two-node cluster, there is no supported way of migrating to a different topology. The most practical option would be to deploy a fresh environment. ## Alternatives @@ -851,12 +885,14 @@ synchronised state between the OpenShift instances. Solving the state synchroniz actively benefiting from the computation power of the second server. #### MicroShift -MicroShift was considered as an alternative but it was ruled out because it does not support multi-node and has a very different experience than OpenShift which does not match the TNF initiative which is on getting the OpenShift experience on two nodes +MicroShift was considered as an alternative, but it was ruled out because it does not support multi-node deployments and offers a very different experience than OpenShift, which does not match the TNF initiative, whose focus +is on getting the OpenShift experience on two nodes. #### 2 SNO + KCP: -[KCP](https://github.com/kcp-dev/kcp/) allows you to manage multiple clusters from a single control-plane, reducing the complexity of managing each cluster independently. 
-With kcp, you can manage the two single-node clusters, each single-node OpenShift cluster can continue to operate independently even if the central kcp management plane becomes unavailable. -The main advantage of this approach is that it doesn’t require inventing a new Openshift flavor and we don’t need to create a new installation flow to accommodate it. +[KCP](https://github.com/kcp-dev/kcp/) allows you to manage multiple clusters from a single control-plane, reducing the complexity of managing each cluster independently. With kcp, you can manage the +two single-node clusters, and each single-node OpenShift cluster can continue to operate independently even if the central kcp management plane becomes unavailable. The main advantage of this approach is +that it doesn’t require inventing a new OpenShift flavor and we don’t need to create a new installation flow to accommodate it. + Disadvantages: * Production readiness * KCP itself could become a single point of failure (need to configure pacemaker to manage KCP) From 0b1de869d4e5d51ec19a551c9fd3399c9f5e6ac6 Mon Sep 17 00:00:00 2001 From: Michael Shitrit Date: Thu, 13 Feb 2025 14:24:14 +0200 Subject: [PATCH 49/49] Note about how we'd manage pacemaker configuration Signed-off-by: Michael Shitrit --- enhancements/two-node-fencing/tnf.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/enhancements/two-node-fencing/tnf.md b/enhancements/two-node-fencing/tnf.md index 58ff02a502..587ccc0a36 100644 --- a/enhancements/two-node-fencing/tnf.md +++ b/enhancements/two-node-fencing/tnf.md @@ -138,7 +138,7 @@ At a glance, here are the components we are proposing to change: | [ETCD Operator](#etcd-operator-changes) | Add mode for disabling management of the etcd container, scaling strategy for 2 nodes, and a controller for initializing pacemaker | | [Install Config](#install-config-changes) | Update install config API to accept fencing credentials for `platform: None` and `platform: Baremetal` | | [Installer](#installer-changes) | Populate the nodes with initial pacemaker configuration when deploying with 2 control-plane nodes and no arbiter | -| [MCO](#mco-changes) | Add an MCO extension for installating pacemaker and corosync in RHCOS; MachineConfigPool maxUnavailable set to 1 | +| [MCO](#mco-changes) | Add an MCO extension for installing pacemaker and corosync in RHCOS; MachineConfigPool maxUnavailable set to 1 | | [Authentication Operator](#authentication-operator-changes) | Update operator to accept minimum 1 kube api servers when `ControlPlaneTopology` is `DualReplica` | | [Hosted Control Plane](#hosted-control-plane-changes) | Disallow HyperShift from installing on the `DualReplica` topology | | [OLM Filtering](#olm-filtering-changes) | Leverage support for OLM to filter operators based off of control plane topology | @@ -181,6 +181,13 @@ An important facility of the installation flow is the transition from a CEO depl 4. The installation proceeds as normal once the containers start. If for some reason, the etcd containers cannot be started, then the installation will fail. The installer will pull logs from the control-plane nodes to provide context for this failure. +###### Managing Pacemaker and Resource/Fence agent Configuration +Pacemaker configurations (as well as its resource and fence agent configurations) do not need to be stored as files, as they are dynamically created using pcs commands. Instead, the in-cluster entity will handle triggering these commands as needed. 
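To make the intent concrete, the following is a minimal sketch of the kind of pcs invocations the in-cluster entity might trigger. The node names, the placeholder etcd image, and the choice of the stock `systemd` and `ocf:heartbeat:podman` agents are illustrative assumptions only; the final design may ship purpose-built resource agents and different resource names.

```bash
# Form the two-node RHEL-HA cluster (node names master-0/master-1 are assumptions).
pcs host auth master-0 master-1 -u hacluster -p "${HACLUSTER_PASSWORD}"
pcs cluster setup tnf-cluster master-0 master-1
pcs cluster start --all

# kubelet and cri-o remain owned by systemd; pacemaker only signals systemd
# through the systemd resource class, as described earlier in this proposal.
pcs resource create kubelet systemd:kubelet op monitor interval=30s clone
pcs resource create crio systemd:crio op monitor interval=30s clone

# etcd runs as a root podman container on both nodes (image is a placeholder).
pcs resource create etcd ocf:heartbeat:podman \
    image=quay.io/example/etcd:latest name=etcd \
    run_opts="--network=host -v /var/lib/etcd:/var/lib/etcd:z" \
    op monitor interval=30s clone
```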
+ +For the initial Technical Preview phase, we will use default values for these configurations, except for fencing configurations (which are covered in the next section). + +Looking ahead to General Availability (GA), we are considering a mapping solution that will allow users to specify certain values, which will then be used to generate the configurations dynamically. + ###### Configuring Fencing Via MCO Fencing setup is the last important aspect of the cluster installation. For the cluster installation to be successful, fencing should be configured and active before we declare the installation successful. To do this, baseboard management console (BMC) credentials need to be made available to the control-plane nodes as part of pacemaker initialization. To ensure rapid fencing using pacemaker, we will collect RedFish details (address, username, and **password**) for each node via the install-config (see proposed install-config changes).
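As a purely illustrative sketch (addresses, credentials, and resource names are placeholders, and the exact agent parameters are assumptions rather than the final install-config or MCO design), the collected RedFish details could be wired into pacemaker stonith devices along these lines. The static delay on the device that fences the first node means the second node effectively waits 20 seconds before it can power off its peer, mirroring the fencing priority described earlier and making Node1 the preferred survivor.

```bash
# One fencing device per node, fed by the RedFish details gathered at install time.
# Delaying the device that targets master-0 gives master-0 a head start when both
# sides try to fence each other during a primary-network outage.
pcs stonith create fence-master-0 fence_redfish \
    ip=192.168.111.1 username=admin password="${BMC_PASSWORD_0}" \
    systems_uri=/redfish/v1/Systems/1 ssl_insecure=1 \
    pcmk_host_list=master-0 pcmk_delay_base=20s

pcs stonith create fence-master-1 fence_redfish \
    ip=192.168.111.2 username=admin password="${BMC_PASSWORD_1}" \
    systems_uri=/redfish/v1/Systems/1 ssl_insecure=1 \
    pcmk_host_list=master-1
```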