Full draft of the docs for:
* Allow auto-failover for ephemeral buckets without a replica feature doc (DOC-12191)
* Support auto-failover for exceptionally slow/hanging disks (DOC-12073)
Manually ported the changes over from the prior branch because attempts to merge resulted in a huge number of conflicts, potentially with Supritha's changes to the underlying docs.
https://jira.issues.couchbase.com/browse/MB-33315[MB-33315] Allow auto-failover for ephemeral buckets without a replica::
Previously, Couchbase Server always prevented auto-failover on nodes containing an ephemeral bucket that does not have replicas.
You can now configure Couchbase Server Enterprise Edition to allow a node to auto-failover even if it has an ephemeral bucket without a replica.
You can enable this setting using the Couchbase Server Web Console, or through the REST API using the `allowFailoverEphemeralNoReplicas` auto-failover setting (see the sketch at the end of this note).
This option defaults to off.
When you enable it, Couchbase Server creates empty vBuckets on other nodes to replace the lost ephemeral vBuckets on the failed over node.
If the failed over node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to the rejoining node.
This option is useful if your application uses ephemeral buckets for data that can be regenerated, such as caches.
This setting is not available in Couchbase Server Community Edition.
+
See xref:learn:clusters-and-availability/automatic-failover.adoc#auto-failover-and-ephemeral-buckets[Auto-Failover and Ephemeral Buckets] and xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability] for more information.
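+
A minimal sketch of enabling the option over the REST API follows; it assumes the `/settings/autoFailover` endpoint, and the host, credentials, and timeout value are placeholders, so verify the exact parameters against xref:rest-api:rest-cluster-autofailover-enable.adoc[].
+
[source,shell]
----
# Hypothetical example: enable auto-failover and allow it for ephemeral
# buckets without replicas. Host, credentials, and timeout are placeholders.
curl -s -u Administrator:password -X POST \
  http://127.0.0.1:8091/settings/autoFailover \
  -d 'enabled=true' \
  -d 'timeout=120' \
  -d 'allowFailoverEphemeralNoReplicas=true'
----
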
https://jira.issues.couchbase.com/browse/MB-34155[MB-34155] Support Auto-failover for exceptionally slow/hanging disks::
You can now configure Couchbase Server to trigger an auto-failover on a node if its data disk is slow to respond or is hanging.
Before version 8.0, you could only configure Couchbase Server to auto-failover a node if the data disk returned errors for a set period of time.
The new `failoverOnDataDiskNonResponsiveness` setting, and the corresponding setting on the Couchbase Web Console *Settings* page, set the number of seconds allowed for read or write operations to complete (see the sketch at the end of this note).
If this period elapses before the operation completes, Couchbase Server triggers an auto-failover for the node.
This setting is off by default.
+
See xref:learn:clusters-and-availability/automatic-failover.adoc#failover-on-data-disk-non-responsiveness[Failover on Data Disk Non-Responsiveness] to learn more about this feature.
See xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability] and xref:rest-api:rest-cluster-autofailover-enable.adoc[] to learn how to enable it.
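+
A minimal sketch of enabling this check over the REST API follows; it assumes the `/settings/autoFailover` endpoint, and the host, credentials, and values are placeholders, so verify the exact parameter names against xref:rest-api:rest-cluster-autofailover-enable.adoc[].
+
[source,shell]
----
# Hypothetical example: enable auto-failover and the data-disk
# non-responsiveness check. Host, credentials, and values are placeholders.
curl -s -u Administrator:password -X POST \
  http://127.0.0.1:8091/settings/autoFailover \
  -d 'enabled=true' \
  -d 'timeout=120' \
  -d 'failoverOnDataDiskNonResponsiveness=true'
----
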
Couchbase supports the REST API `DELETE pools/default/settings/memcached/global/setting/[setting_name]` for some of the settings that are not always passed from the Cluster Manager to memcached.
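The following sketch shows the general shape of such a call; `<setting_name>` is a placeholder rather than a specific supported setting, and the host and credentials are assumptions.

[source,shell]
----
# Hypothetical example: remove a global memcached setting override so that the
# Cluster Manager's value applies again. <setting_name> is a placeholder.
curl -s -u Administrator:password -X DELETE \
  http://127.0.0.1:8091/pools/default/settings/memcached/global/setting/<setting_name>
----
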
You can modify these services using the Couchbase xref:manage:manage-nodes/modify-services-on-nodes-and-rebalance.adoc#modify-mds-services-from-ui[UI], xref:rest-api:rest-set-up-services-existing-nodes.adoc[REST API], or xref:manage:manage-nodes/modify-services-on-nodes-and-rebalance.adoc#modify-mds-services-using-cli[CLI].
modules/learn/pages/buckets-memory-and-storage/buckets.adoc
| By default, auto-failover starts when a node is inaccessible for 120 seconds.
Auto-failover can occur up to a specified maximum number of times before you must reset it manually.
When a failed node becomes accessible again, delta-node recovery uses data on disk and resynchronizes it.
| Auto-reprovision starts for ephemeral buckets with replicas on a failed node as soon as a node is inaccessible.
Auto-reprovision can occur multiple times for multiple nodes.
When a failed node becomes accessible again, the system does not require delta-node recovery because no data resides on disk.

If you enable auto-failover for ephemeral buckets without replicas, Couchbase Server can also automatically fail over a failed node.
In this case, when a failover occurs, Couchbase Server creates empty vBuckets on the remaining nodes to replace the missing vBuckets on the failed node.
When the failed node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to it.

NOTE: The auto-failover for ephemeral buckets feature is available only in Couchbase Server Enterprise Edition.
modules/learn/pages/clusters-and-availability/automatic-failover.adoc
== Failover Events

Auto-failover occurs in response to failed/failing events.
The following events can trigger auto-failover:

* _Node failure_.
A server-node within the cluster is unresponsive (due to a network failure, very high CPU utilization problem, out-of-memory problem, or other node-specific issue). This means that the cluster manager of the node has not sent heartbeats in the configured timeout period, and therefore, the health of the services running on the node is unknown.

Concurrent correlated failure of multiple nodes, such as a physical rack of machines or multiple virtual machines sharing a host.
* _Data Service disk read/write issues_.
Data Service disk read/write errors.
Attempts by the Data Service to read from or write to disk on a particular node have resulted in a significant rate of failure (errors returned), for longer than a specified time-period.

[#failover-on-data-disk-non-responsiveness]
* _Data Disk non-responsiveness_.
You can configure a timeout period for the Data Service's disk read/write threads to complete an operation.
When you enable this setting, if the period elapses and the thread has not completed the operation, Couchbase Server can auto-fail over the node.
This setting differs from the disk error timeout because the data disk does not have to return errors.
Instead, if the disk is so slow that it cannot complete the operation or is hanging, Couchbase Server can take action even when it does not receive an error.

* _Index or Data Service running on the node is non-responsive or unhealthy_.
** Index Service non-responsiveness.
The xref:install:deployment-considerations-lt-3nodes.adoc#quorum-arbitration[quorum constraint] is a critical part of auto-failover, since the cluster must be able to form a quorum to initiate a failover following the failure of some of the nodes. For Server Groups, this means that if you have two server groups with an equal number of nodes, then for auto-failover of all the nodes in one server group to be possible, you could deploy an xref:learn:clusters-and-availability/nodes.adoc#adding-arbiter-nodes[arbiter node] (or another node) in a third physical server group, which allows the remaining nodes to form a quorum.
Another critical auto-failover constraint for Server Groups is the maximum number of nodes to be automatically failed over (`maxCount` in `/settings/autoFailover`) before administrator-intervention is required.
If you want an entire server group of nodes to be automatically failed over, then the `maxCount` value should be at least the number of nodes in the server group.
You can check the current `maxCount` value with `GET /settings/autoFailover`.
The value of `count` in the same `GET /settings/autoFailover` output tells you how many node auto-failovers have occurred since the parameter was last reset.
Running a rebalance will reset the count value back to 0.
The count should not be reset manually unless guided by Support, since resetting manually will cause you to lose track of the number of auto-failovers that have already occurred without the cluster being rebalanced.
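
For example, you can inspect the current values with a simple GET request; the host and credentials below are placeholders:

[source,shell]
----
# Returns the auto-failover settings, including maxCount and count.
curl -s -u Administrator:password http://127.0.0.1:8091/settings/autoFailover
----
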

The list below describes other conditions that must be met for an auto-failover to be executed even after a monitored or configured auto-failover event has occurred.

After this maximum number of auto-failovers has been reached, no further auto-failover occurs, until the count is manually reset by the administrator, or until a rebalance is successfully performed.
Note, however, that the count can be manually reset, or a rebalance performed, prior to the maximum number being reached.

* By default, Couchbase Server does not allow an auto-failover if it may result in data loss.
For example, with default settings Couchbase Server does not allow the auto-failover of a node that contains a bucket with no replicas.
This restriction includes ephemeral buckets as well as Couchbase buckets.
See <<#auto-failover-and-ephemeral-buckets>> for more information on auto-failover and ephemeral buckets.

* Only in accordance with the xref:learn:clusters-and-availability/automatic-failover.adoc#failover-policy[Service-Specific Auto-Failover Policy] for the service or services on the unresponsive node.

See xref:manage:manage-settings/configure-alerts.adoc[Alerts], for details on configuring email alerts related to failover.
See xref:learn:clusters-and-availability/groups.adoc[Server Group Awareness], for information about server groups.

[#auto-failover-and-ephemeral-buckets]
== Auto-Failover and Ephemeral Buckets

Couchbase Server supports ephemeral buckets, which are buckets that it stores only in memory.
Their data is never persisted to disk.
This lack of persistence poses several challenges when it comes to node failure.

If an ephemeral bucket lacks replicas, it loses the data in vBuckets on any node that fails and restarts.
To prevent this data loss, by default Couchbase Server does not allow auto-failover of a node that contains vBuckets for an unreplicated ephemeral bucket.
In this case, you must manually fail over the node if it's unresponsive.
However, all of the ephemeral bucket's data on the node is lost.

Couchbase Server provides two settings that affect how node failures work with ephemeral buckets:

Auto-reprovisioning for Ephemeral Buckets::
This setting helps avoid data loss in cases where a node fails and restarts before Couchbase Server can begin an auto-failover for it.
This setting defaults to enabled.
When it's enabled, Couchbase Server automatically activates the replicas of any ephemeral vBuckets that are active on the restarting node.
If you turn off this setting, there's a risk that the restarting node could cause data loss.
The restarting node could roll back asynchronous writes that the replica vBuckets contain but that its own vBuckets are missing.
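+
If you need to change this setting programmatically, a minimal sketch follows; it assumes the `/settings/autoReprovision` endpoint with `enabled` and `maxNodes` parameters, and the host, credentials, and values are placeholders, so verify them against the REST API reference.
+
[source,shell]
----
# Hypothetical example: keep auto-reprovisioning enabled and let it act on at
# most one node. Host, credentials, and values are placeholders.
curl -s -u Administrator:password -X POST \
  http://127.0.0.1:8091/settings/autoReprovision \
  -d 'enabled=true' \
  -d 'maxNodes=1'
----
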
[#ephemeral-buckets-with-no-replicas]
Auto-failover for Ephemeral Buckets with No Replicas [.edition]#{enterprise}#::
When enabled, this setting allows Couchbase Server to auto-failover a node that contains vBuckets for an ephemeral bucket with no replicas.
When Couchbase Server fails over a node with an unreplicated ephemeral bucket, the data in the vBuckets on the node is lost.
Couchbase Server creates empty vBuckets on the remaining nodes to replace the missing vBuckets on the failed node.
When the failed node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to it.
+
This setting is off by default.
When it's off, Couchbase Server does not auto-failover a node that contains an unreplicated ephemeral bucket's vBuckets.
If one of these nodes becomes unresponsive, you must manually fail over the node.
+
Enable this setting when preserving the data in the ephemeral bucket is not critical for your application.
For example, suppose you use the unreplicated ephemeral bucket for caching data.
In this case, consider enabling this setting to allow Couchbase Server to auto-failover nodes containing its vBuckets.
Losing the data in the cache is not critical, because your application can repopulate the cache with minimal performance cost.
+
NOTE: If the data in the ephemeral bucket is critical for your application, enable one or more replicas for it.
See xref:manage:manage-buckets/create-bucket.adoc#ephemeral-bucket-settings[Ephemeral Bucket Settings] for more information about adding replicas for an ephemeral bucket.
+
If the unreplicated ephemeral bucket is indexed, Couchbase Server rebuilds the index after it auto-fails over the node even if the index is not on the failed node.
After this type of failover, the index must be rebuilt because it indexes data lost in the failed node's vBuckets.
For more information, see xref:learn:services-and-indexes/indexes/index-replication.adoc#index-rollback-after-failover[Index Rollback After Failover].
+
See xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability] to learn how to change these settings via the Couchbase Server Web Console.
See xref:rest-api:rest-cluster-autofailover-settings.adoc[] for information about changing these settings via the REST API.
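+
If you instead add a replica to the ephemeral bucket, as recommended in the note above, a minimal sketch of editing the bucket over the REST API follows; it assumes the standard `/pools/default/buckets/<bucket-name>` edit endpoint with its `replicaNumber` parameter, the bucket name, host, and credentials are placeholders, and a subsequent rebalance is needed to create the new replica vBuckets.
+
[source,shell]
----
# Hypothetical example: give the ephemeral bucket "cache" one replica.
# Bucket name, host, and credentials are placeholders.
curl -s -u Administrator:password -X POST \
  http://127.0.0.1:8091/pools/default/buckets/cache \
  -d 'replicaNumber=1'
----
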
[#failover-policy]
== Service-Specific Auto-Failover Policy
* _Count_.
The number of nodes that have already failed over.
The default value is 0.
The value is incremented by 1 for every node that is automatically failed over, up to the defined maximum count: beyond this point, no further automatic failover can be triggered until the count is reset to 0.
Run a rebalance to reset the count value back to 0.

* _Enablement of disk-related automatic failover; with corresponding time-period_.
Whether automatic failover is enabled to handle continuous read-write failures.