Prevent blocked groups in stream SAC with fine-grained status (backport #13672) #14091

mergify · 2025-06-17T11:25:56Z

This is an automatic backport of pull request #13672 done by [Mergify](https://mergify.com).

A boolean status in the stream SAC coordinator is not enough to follow the evolution of a consumer. For example, a former active consumer that is stepping down can go down before another consumer in the group is activated. The coordinator then keeps expecting an activation request that will never arrive, which leaves the group without any active consumer.

This commit introduces three statuses: active (formerly "true"), waiting (formerly "false"), and deactivating. The coordinator now knows when a deactivating consumer goes down and triggers a rebalancing to avoid a stuck group.

This commit also introduces a status for the connectivity state of a consumer. The possible values are connected, disconnected, and presumed_down. Consumers are connected by default. They become disconnected if the coordinator receives a down event with a noconnection reason, meaning the consumer's node has been disconnected from the other nodes. They become connected again when their node rejoins the other nodes.

Disconnected consumers are still considered part of a group, as they are expected to come back at some point. For example, there is no rebalancing in a group if the active consumer gets disconnected.

The coordinator sets a timer when a disconnection occurs. When the timer expires, the corresponding disconnected consumers move to the "presumed down" state. At this point they are no longer considered part of their respective groups and are excluded from rebalancing decisions. They are expected to be removed from the group by the appropriate down event of a monitor.

The consumer status is now a tuple, e.g. {connected, active}. Note this is an implementation detail: only the stream SAC coordinator deals with the status of stream SAC consumers.

Two new configuration entries are introduced:

* rabbit.stream_sac_disconnected_timeout: the duration in ms of the disconnected-to-forgotten timer.
* rabbit.stream_cmd_timeout: the timeout in ms to apply RA commands in the coordinator. It used to be a fixed value of 30 seconds. The default value is still the same. The setting has been introduced to make integration tests faster.

Fixes #14070

(cherry picked from commit d1aab61)
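The two-part consumer status described above can be illustrated with a small model. This is a minimal Python sketch, not the coordinator's Erlang code: the `Consumer` class and the function names are hypothetical, the single `on_disconnected_timeout` call stands in for the per-disconnection timer, and treating the loss of an active consumer as a rebalancing trigger is an assumption (the description only states it for deactivating consumers).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Connectivity(Enum):
    CONNECTED = "connected"
    DISCONNECTED = "disconnected"
    PRESUMED_DOWN = "presumed_down"


class Activity(Enum):
    ACTIVE = "active"              # formerly "true"
    WAITING = "waiting"            # formerly "false"
    DEACTIVATING = "deactivating"  # stepping down, not yet replaced


@dataclass
class Consumer:
    name: str  # hypothetical identifier, for illustration only
    connectivity: Connectivity = Connectivity.CONNECTED
    activity: Activity = Activity.WAITING

    def status(self):
        # Mirrors the {connected, active} tuple from the description.
        return (self.connectivity.value, self.activity.value)


def on_down(group: List[Consumer], name: str, reason: str) -> bool:
    """Apply a down event; return True when a rebalancing should follow."""
    consumer = next(c for c in group if c.name == name)
    if reason == "noconnection":
        # The node is unreachable: keep the consumer in the group, no rebalancing.
        consumer.connectivity = Connectivity.DISCONNECTED
        return False
    group.remove(consumer)
    # Assumption: losing an active or deactivating consumer requires a rebalancing.
    return consumer.activity in (Activity.ACTIVE, Activity.DEACTIVATING)


def on_node_up(group: List[Consumer], name: str) -> None:
    """The consumer's node rejoined the other nodes."""
    consumer = next(c for c in group if c.name == name)
    consumer.connectivity = Connectivity.CONNECTED


def on_disconnected_timeout(group: List[Consumer]) -> None:
    """Timer expired: disconnected consumers become presumed down."""
    for consumer in group:
        if consumer.connectivity is Connectivity.DISCONNECTED:
            consumer.connectivity = Connectivity.PRESUMED_DOWN


def rebalancing_candidates(group: List[Consumer]) -> List[Consumer]:
    """Presumed-down consumers are excluded from rebalancing decisions."""
    return [c for c in group if c.connectivity is not Connectivity.PRESUMED_DOWN]
```

In this model a disconnected active consumer keeps its `active` status until the timer expires, which matches the statement that a disconnection alone does not rebalance the group.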

The clean-up of a stream connection state when a stream member goes down can remove subscriptions that are not affected by the member. The subscription state is removed from the connection, but the subscription is not removed from the SAC state (if the subscription is a SAC), because the subscription member PID does not match the down member PID. When the actual member of the subscription goes down, the subscription is no longer part of the connection state, so the clean-up does not find it and does not remove it from the SAC state. This leaves a ghost consumer in the corresponding SAC group.

This commit makes sure only the affected subscriptions are removed from the state when a stream member goes down.

Fixes #13961

(cherry picked from commit a9cf049)
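The fix described above amounts to filtering subscriptions by the PID of the member that went down. The following Python sketch illustrates the idea only; the dictionary layout, the `member_pid` key, and the function name are hypothetical and do not reflect the actual connection state.

```python
from typing import Dict, Tuple


def split_subscriptions(
    subscriptions: Dict[int, dict], down_member_pid: str
) -> Tuple[Dict[int, dict], Dict[int, dict]]:
    """Return (kept, removed) subscriptions for a member-down event.

    Only subscriptions attached to the down member are removed, so the
    caller can also clean them up from the SAC state.
    """
    kept: Dict[int, dict] = {}
    removed: Dict[int, dict] = {}
    for sub_id, sub in subscriptions.items():
        if sub["member_pid"] == down_member_pid:
            removed[sub_id] = sub
        else:
            kept[sub_id] = sub
    return kept, removed
```

Subscriptions attached to other members stay in the connection state, so a later down event for their own member still finds them and can remove them from the SAC state.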

Calls to the stream SAC coordinator can fail for various reasons (e.g. a timeout because of a network partition). The stream reader does not take into account what the SAC coordinator returns and moves on even in case of errors. This can lead to inconsistent state for SAC groups.

This commit changes this behavior by handling unexpected errors from the SAC coordinator and closing the connection. The client is expected to reconnect. This is safer than risking inconsistent state.

Fixes #14040

(cherry picked from commit 58f4e83)
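The change in behavior can be pictured as checking every coordinator result instead of ignoring it. The sketch below is an illustration in Python under stated assumptions: the `("ok", value)` / `("error", reason)` result shape and the `close_connection` callback are hypothetical stand-ins, not the stream reader's actual API.

```python
from typing import Any, Callable, Optional, Tuple


def handle_sac_result(
    result: Tuple[str, Any], close_connection: Callable[[Any], None]
) -> Optional[Any]:
    """Return the value on success; close the connection on any error.

    Closing is preferred over continuing with a possibly inconsistent
    SAC group state; the client is expected to reconnect.
    """
    tag, payload = result
    if tag == "ok":
        return payload
    # Unexpected outcome (e.g. a timeout during a network partition).
    close_connection(payload)
    return None
```

A caller would pass the coordinator's reply together with a function that tears down the connection, for example `handle_sac_result(reply, close)`.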

New CLI command to trigger a rebalancing in a SAC group and activate a consumer. This is a last-resort solution if all consumers in a group accidentally end up in the {connected, waiting} state.

The command re-uses an existing function, which only picks the consumer that should be active. This means it does not try to "fix" the state (e.g. by removing a disconnected consumer because its node is definitely gone from the cluster).

Fixes #14055

(cherry picked from commit 41acc11)

acogoluegnes added 4 commits June 17, 2025 11:25

mergify bot assigned acogoluegnes Jun 17, 2025

mergify bot added the make label Jun 17, 2025

acogoluegnes marked this pull request as draft June 17, 2025 11:30

acogoluegnes marked this pull request as ready for review June 17, 2025 14:28

acogoluegnes merged commit 789e156 into v4.1.x Jun 17, 2025
545 of 547 checks passed

acogoluegnes deleted the mergify/bp/v4.1.x/pr-13672 branch June 17, 2025 14:28
