
Commit 7c6c4c2

Auto failover features for Morpheus (#3854)
Full draft of the docs for:

* Allow auto-failover for ephemeral buckets w/o replica feature doc (DOC-12191)
* Support auto-failover for exceptionally slow/hanging disks (DOC-12073)

Manually ported over changes from the prior branch, because attempts to merge resulted in huge numbers of conflicts, potentially with Supritha's changes to the underlying docs.
1 parent 6fa1ea5 commit 7c6c4c2

File tree

7 files changed: +373 additions, -136 deletions

modules/introduction/partials/new-features-80.adoc

Lines changed: 25 additions & 0 deletions
@@ -89,6 +89,30 @@ curl --get -u <username:password> \
-d clusterLabels=none|uuidOnly|uuidAndName
----

+https://jira.issues.couchbase.com/browse/MB-33315[MB-33315] Allow auto-failover for ephemeral buckets without a replica::
+Previously, Couchbase Server always prevented auto-failover on nodes containing an ephemeral bucket that does not have replicas.
+You can now configure Couchbase Server Enterprise Edition to allow a node to auto-failover even if it has an ephemeral bucket without a replica.
+You can enable this setting through the Couchbase Server Web Console or through the REST API, via the `allowFailoverEphemeralNoReplicas` auto-failover setting.
+This option defaults to off.
+When you enable it, Couchbase Server creates empty vBuckets on other nodes to replace the lost ephemeral vBuckets on the failed-over node.
+If the failed-over node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to the rejoining node.
+This option is useful if your application uses ephemeral buckets for replaceable data, such as caches.
+This setting is not available in Couchbase Server Community Edition.
+
++
+See xref:learn:clusters-and-availability/automatic-failover.adoc#auto-failover-and-ephemeral-buckets[Auto-Failover and Ephemeral Buckets] and xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability] for more information.
+
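For illustration, a minimal sketch of enabling the option over REST, assuming `allowFailoverEphemeralNoReplicas` is accepted as a form parameter on the `/settings/autoFailover` endpoint alongside the usual `enabled` and `timeout` fields (host and credentials are placeholders):

----
# Sketch: allow auto-failover for ephemeral buckets without replicas.
# Assumes the setting is a form parameter on /settings/autoFailover.
curl -X POST -u <username:password> \
  http://<host>:8091/settings/autoFailover \
  -d enabled=true \
  -d timeout=120 \
  -d allowFailoverEphemeralNoReplicas=true
----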
+https://jira.issues.couchbase.com/browse/MB-34155[MB-34155] Support auto-failover for exceptionally slow/hanging disks::
+You can now configure Couchbase Server to trigger an auto-failover on a node if its data disk is slow to respond or is hanging.
+Before version 8.0, you could only configure Couchbase Server to auto-failover a node if the data disk returned errors for a set period of time.
+The new `failoverOnDataDiskNonResponsiveness` setting, and the corresponding settings on the Couchbase Web Console *Settings* page, set the number of seconds allowed for read or write operations to complete.
+If this period elapses before the operation completes, Couchbase Server triggers an auto-failover for the node.
+This setting is off by default.
+
++
+See xref:learn:clusters-and-availability/automatic-failover.adoc#failover-on-data-disk-non-responsiveness[Failover on Data Disk Non-Responsiveness] to learn more about this feature.
+See xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability] and xref:rest-api:rest-cluster-autofailover-enable.adoc[] to learn how to enable it.
+
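As a sketch only: the exact REST parameter keys for this feature are not shown on this page, so the keys below are assumptions modeled on the other disk-related auto-failover settings; see rest-cluster-autofailover-enable.adoc for the confirmed interface:

----
# Sketch: fail over when disk reads/writes hang longer than 120 seconds.
# Parameter keys are assumptions; verify against the REST reference.
curl -X POST -u <username:password> \
  http://<host>:8091/settings/autoFailover \
  -d enabled=true \
  -d timeout=120 \
  -d 'failoverOnDataDiskNonResponsiveness[enabled]=true' \
  -d 'failoverOnDataDiskNonResponsiveness[timePeriod]=120'
----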
https://jira.issues.couchbase.com/browse/MB-65779[MB-65779]::
Couchbase supports the REST API `DELETE pools/default/settings/memcached/global/setting/[setting_name]` for some of the settings that are not always passed from the Cluster Manager to memcached.
+
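For instance, clearing one such setting might look like the following sketch, where `[setting_name]` is the placeholder from the route above (host and credentials are also placeholders):

----
# Sketch: delete a global memcached setting override via the REST API.
# Only some settings support this DELETE; [setting_name] is a placeholder.
curl -X DELETE -u <username:password> \
  http://<host>:8091/pools/default/settings/memcached/global/setting/[setting_name]
----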
@@ -111,6 +135,7 @@ These are the services that can be modified:

You can modify these services using the Couchbase xref:manage:manage-nodes/modify-services-on-nodes-and-rebalance.adoc#modify-mds-services-from-ui[UI], xref:rest-api:rest-set-up-services-existing-nodes.adoc[REST API], or xref:manage:manage-nodes/modify-services-on-nodes-and-rebalance.adoc#modify-mds-services-using-cli[CLI].

+
[#section-new-feature-data-service]
=== Data Service

modules/learn/pages/buckets-memory-and-storage/buckets.adoc

Lines changed: 7 additions & 1 deletion
@@ -203,9 +203,15 @@ You can add or remove buckets and nodes dynamically.
| By default, auto-failover starts when a node is inaccessible for 120 seconds.
Auto-failover can occur up to a specified maximum number of times before you must reset it manually.
When a failed node becomes accessible again, delta-node recovery uses data on disk and resynchronizes it.
-| Auto-reprovision starts as soon as a node is inaccessible.
+| For ephemeral buckets with replicas, auto-reprovision starts as soon as a node becomes inaccessible.
Auto-reprovision can occur multiple times for multiple nodes.
When a failed node becomes accessible again, the system does not require delta-node recovery because no data resides on disk.
+
+If you enable auto-failover for ephemeral buckets without replicas, a failed node can auto-failover.
+In this case, when a failover occurs, Couchbase Server creates empty vBuckets on the remaining nodes to replace the missing vBuckets on the failed node.
+When the failed node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to it.
+
+NOTE: The auto-failover for ephemeral buckets feature is only available in Couchbase Server Enterprise Edition.
|===

== Bucket Security

modules/learn/pages/clusters-and-availability/automatic-failover.adoc

Lines changed: 75 additions & 7 deletions
@@ -33,7 +33,7 @@ For information on managing auto-failover, see the information provided for Couc
== Failover Events

Auto-failover occurs in response to failed/failing events.
-There are three types of event that can trigger auto-failover:
+The following events can trigger auto-failover:

* _Node failure_.
A server-node within the cluster is unresponsive (due to a network failure, a very high CPU utilization problem, an out-of-memory problem, or another node-specific issue).
This means that the cluster manager of the node has not sent heartbeats in the configured timeout period, and therefore, the health of the services running on the node is unknown.
@@ -42,7 +42,16 @@ A server-node within the cluster is unresponsive (due to a network failure, very
Concurrent correlated failure of multiple nodes, such as a physical rack of machines or multiple virtual machines sharing a host.

* _Data Service disk read/write issues_.
-Data Service disk read/write errors. Attempts by the Data Service to read from or write to disk on a particular node have resulted in a significant rate of failure (errors returned), for longer than a specified time-period.
+Data Service disk read/write errors.
+Attempts by the Data Service to read from or write to disk on a particular node have resulted in a significant rate of failure (errors returned), for longer than a specified time-period.
+
+[#failover-on-data-disk-non-responsiveness]
+* _Data Disk non-responsiveness_.
+You can configure a timeout period for the Data Service's disk read/write threads to complete an operation.
+When you enable this setting, if the period elapses and the thread has not completed the operation, Couchbase Server can auto-fail over the node.
+This setting differs from the disk error timeout because the data disk does not have to return errors.
+Instead, if the disk is so slow that it cannot complete the operation or is hanging, Couchbase Server can take action even when it does not receive an error.
+

* _Index or Data Service running on the node is non-responsive or unhealthy_.
** Index Service non-responsiveness.
@@ -61,7 +70,12 @@ If a monitored or configured auto-failover event occurs, an auto-failover will n

The xref:install:deployment-considerations-lt-3nodes.adoc#quorum-arbitration[quorum constraint] is a critical part of auto-failover, since the cluster must be able to form a quorum to initiate a failover following the failure of some of the nodes. For Server Groups, this means that if you have two server groups with an equal number of nodes, then for auto-failover of all nodes in one server group to be possible, you could deploy an xref:learn:clusters-and-availability:nodes.adoc#adding-arbiter-nodes[arbiter node] (or another node) in a third physical server group, which allows the remaining nodes to form a quorum.

-Another critical auto-failover constraint for Server Groups is the maximum number of nodes to be automatically failed over (`maxCount` in `/settings/autoFailover`) before administrator-intervention is required. If you want one entire server group of nodes to be able to be all automatically failed over, then the `maxCount` value should be at least the number of nodes in the server group. You can check the value of `maxCount` in `GET /settings/autoFailover` to see what the `maxCount` setting is. The value of `count` in the same `GET /settings/autoFailover` output tells you how many node auto-failovers have occurred since the parameter was last reset. Running a rebalance will reset the count value back to 0. Running a rebalance will reset the count value back to 0. The count should not be reset manually unless guided by Support, since resetting manually will cause you to lose track of the number of auto-failovers that have already occurred without the cluster being rebalanced.
+Another critical auto-failover constraint for Server Groups is the maximum number of nodes to be automatically failed over (`maxCount` in `/settings/autoFailover`) before administrator intervention is required.
+If you want an entire server group of nodes to be automatically failed over, then the `maxCount` value should be at least the number of nodes in the server group.
+You can check the current value of `maxCount` with `GET /settings/autoFailover`.
+The value of `count` in the same `GET /settings/autoFailover` output tells you how many node auto-failovers have occurred since the parameter was last reset.
+Running a rebalance will reset the count value back to 0.
+The count should not be reset manually unless guided by Support, since resetting it manually will cause you to lose track of the number of auto-failovers that have already occurred without the cluster being rebalanced.

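To make that check concrete, here is a minimal sketch of querying these values (host and credentials are placeholders; the response shown is abridged and illustrative, not a verbatim server reply):

----
# Sketch: read the current auto-failover settings and counters.
curl -s -u <username:password> http://<host>:8091/settings/autoFailover
# Illustrative, abridged response:
# {"enabled":true,"timeout":120,"count":0,"maxCount":2}
----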
The list below describes other conditions that must be met for an auto-failover to be executed even after a monitored or configured auto-failover event has occurred.

@@ -72,8 +86,10 @@ For example, given a cluster of 18 nodes, _10_ nodes are required for the quorum
After this maximum number of auto-failovers has been reached, no further auto-failover occurs, until the count is manually reset by the administrator, or until a rebalance is successfully performed.
Note, however, that the count can be manually reset, or a rebalance performed, prior to the maximum number being reached.

-* In no circumstances where data-loss might result: for example, when a bucket has no replicas.
-Therefore, even a single event may not trigger a response; and an administrator-specified maximum number of failed nodes may not be reached.
+* By default, Couchbase Server does not allow an auto-failover if it may result in data loss.
+For example, with default settings, Couchbase Server does not allow the auto-failover of a node that contains a bucket with no replicas.
+This restriction includes ephemeral buckets as well as Couchbase buckets.
+See <<auto-failover-and-ephemeral-buckets>> for more information on auto-failover and ephemeral buckets.

* Only in accordance with the xref:learn:clusters-and-availability/automatic-failover.adoc#failover-policy[Service-Specific Auto-Failover Policy] for the service or services on the unresponsive node.

@@ -89,7 +105,58 @@ Auto-failover is for intra-cluster use only: it does not work with xref:learn:cl
See xref:manage:manage-settings/configure-alerts.adoc[Alerts], for
details on configuring email alerts related to failover.

-See xref:learn:clusters-and-availability/groups.adoc[Server Group Awareness], for information on server groups.
+See xref:learn:clusters-and-availability/groups.adoc[Server Group Awareness], for information about server groups.
+
+[#auto-failover-and-ephemeral-buckets]
+== Auto-Failover and Ephemeral Buckets
+Couchbase Server supports ephemeral buckets, which are buckets that it stores only in memory.
+Their data is never persisted to disk.
+This lack of persistence poses several challenges when it comes to node failure.
+
+If an ephemeral bucket lacks replicas, it loses the data in vBuckets on any node that fails and restarts.
+To prevent this data loss, by default Couchbase Server does not allow auto-failover of a node that contains vBuckets for an unreplicated ephemeral bucket.
+In this case, you must manually fail over the node if it's unresponsive.
+However, all of the ephemeral bucket's data on the node is then lost.
+
+Couchbase Server provides two settings that affect how node failures work with ephemeral buckets:
+
+Auto-reprovisioning for Ephemeral Buckets::
+This setting helps avoid data loss in cases where a node fails and restarts before Couchbase Server can begin an auto-failover for it.
+This setting defaults to enabled.
+When it's enabled, Couchbase Server automatically activates the replicas of any ephemeral vBuckets that are active on the restarting node.
+If you turn off this setting, there's a risk that the restarting node could cause data loss.
+It could roll back asynchronous writes that the replica vBuckets contain but that its own vBuckets are missing.
+
+[#ephemeral-buckets-with-no-replicas]
+Auto-failover for Ephemeral Buckets with No Replicas [.edition]#{enterprise}#::
+When enabled, this setting allows Couchbase Server to auto-failover a node that contains vBuckets for an ephemeral bucket with no replicas.
+When Couchbase Server fails over a node with an unreplicated ephemeral bucket, the data in the vBuckets on the node is lost.
+Couchbase Server creates empty vBuckets on the remaining nodes to replace the missing vBuckets on the failed node.
+When the failed node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to it.
++
+This setting is off by default.
+When it's off, Couchbase Server does not auto-failover a node that contains an unreplicated ephemeral bucket's vBuckets.
+If one of these nodes becomes unresponsive, you must manually fail over the node.
++
+Enable this setting when preserving the data in the ephemeral bucket is not critical for your application.
+For example, suppose you use the unreplicated ephemeral bucket for caching data.
+In this case, consider enabling this setting to allow Couchbase Server to auto-failover nodes containing its vBuckets.
+Losing the data in the cache is not critical, because your application can repopulate the cache with minimal performance cost.
++
+NOTE: If the data in the ephemeral bucket is critical for your application, enable one or more replicas for it.
+See xref:manage:manage-buckets/create-bucket.adoc#ephemeral-bucket-settings[Ephemeral Bucket Settings] for more information about adding replicas for an ephemeral bucket.
++
+If the unreplicated ephemeral bucket is indexed, Couchbase Server rebuilds the index after it auto-fails over the node, even if the index is not on the failed node.
+After this type of failover, the index must be rebuilt because it indexes data lost in the failed node's vBuckets.
+For more information, see xref:learn:services-and-indexes/indexes/index-replication.adoc#index-rollback-after-failover[Index Rollback After Failover].
++
+See xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability] to learn how to change these settings via the Couchbase Server Web Console.
+See xref:rest-api:rest-cluster-autofailover-settings.adoc[] for information about changing these settings via the REST API.
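For the auto-reprovisioning side, a minimal sketch, assuming a `/settings/autoReprovision` endpoint with `enabled` and `maxNodes` form parameters (not confirmed by this page; see the REST xref above for the authoritative interface):

----
# Sketch: keep auto-reprovisioning enabled for up to one node.
# Endpoint and parameter names are assumptions; verify in the REST reference.
curl -X POST -u <username:password> \
  http://<host>:8091/settings/autoReprovision \
  -d enabled=true \
  -d maxNodes=1
----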

[#failover-policy]
== Service-Specific Auto-Failover Policy
@@ -231,7 +298,8 @@ This parameter is available in Enterprise Edition only: in Community Edition, th
* _Count_.
The number of nodes that have already failed over.
The default value is 0.
-The value is incremented by 1 for every node that has an automatic-failover that occurs, up to the defined maximum count: beyond this point, no further automatic failover can be triggered until the count is reset to 0. Running a rebalance will reset the count value back to 0.
+The value is incremented by 1 for every node that is automatically failed over, up to the defined maximum count: beyond this point, no further automatic failover can be triggered until the count is reset to 0.
+Run a rebalance to reset the count value back to 0.

* _Enablement of disk-related automatic failover, with corresponding time-period_.
Whether automatic failover is enabled to handle continuous read-write failures.
(File diff not rendered: 55.3 KB.)
