Prevent blocked groups in stream SAC with fine-grained status #13672

Draft
wants to merge 2 commits into
base: main

acogoluegnes
Contributor

No description provided.

@mergify mergify bot added the make label Apr 3, 2025
@acogoluegnes acogoluegnes force-pushed the stream-sac-coordinator-status-instead-of-active-flag branch 4 times, most recently from 62b7b64 to 255153f Compare April 15, 2025 15:32
@acogoluegnes acogoluegnes force-pushed the stream-sac-coordinator-status-instead-of-active-flag branch 2 times, most recently from 481472d to 6b1a27c Compare April 17, 2025 14:11
@acogoluegnes acogoluegnes force-pushed the stream-sac-coordinator-status-instead-of-active-flag branch 4 times, most recently from fcb00f9 to b80cec2 Compare May 13, 2025 07:13
@acogoluegnes acogoluegnes force-pushed the stream-sac-coordinator-status-instead-of-active-flag branch 3 times, most recently from 0f4d90a to a243e5b Compare May 23, 2025 06:53
@acogoluegnes acogoluegnes force-pushed the stream-sac-coordinator-status-instead-of-active-flag branch 3 times, most recently from ad6b53a to 965ee4a Compare May 28, 2025 12:46
A boolean status in the stream SAC coordinator is not enough to track
the lifecycle of a consumer. For example, a former active consumer that
is stepping down can go down before another consumer in the group is
activated. The coordinator then keeps waiting for an activation request
that will never arrive, and the group is left without any active consumer.

This commit introduces three statuses: active (formerly "true"), waiting
(formerly "false"), and deactivating. The coordinator now knows when
a deactivating consumer goes down and triggers a rebalancing to
avoid a stuck group.
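
The difference from a boolean flag can be illustrated with a minimal sketch. This is not the coordinator's code (which is Erlang): the `Consumer` class, the `handle_down` function, and the "promote the first remaining consumer" policy are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):
    ACTIVE = "active"              # formerly the boolean "true"
    WAITING = "waiting"            # formerly the boolean "false"
    DEACTIVATING = "deactivating"  # stepping down, hand-over not finished


@dataclass
class Consumer:
    name: str
    status: Status = Status.WAITING


def handle_down(group: List[Consumer], name: str) -> List[Consumer]:
    """Drop the consumer that went down and rebalance if needed.

    Hypothetical policy: promote the first remaining consumer. The real
    coordinator may choose the next active consumer differently.
    """
    gone = next((c for c in group if c.name == name), None)
    remaining = [c for c in group if c.name != name]
    if gone is None or not remaining:
        return remaining
    # With a boolean flag, a consumer that is stepping down looks like any
    # non-active consumer, so its death would not trigger a rebalancing.
    lost_leader = gone.status in (Status.ACTIVE, Status.DEACTIVATING)
    has_active = any(c.status is Status.ACTIVE for c in remaining)
    if lost_leader and not has_active:
        remaining[0].status = Status.ACTIVE
    return remaining
```

In this sketch, a group where `c1` is deactivating and `c2` is waiting ends up with `c2` active once `c1` goes down, instead of staying without an active consumer.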

This commit also introduces a status related to the connectivity state
of a consumer. The possible values are: connected, disconnected, and
forgotten. Consumers are connected by default. They become
disconnected if the coordinator receives a down event with a
noconnection reason, meaning the node of the consumer has been
disconnected from the other nodes. They become connected again when
their node rejoins the other nodes.

Disconnected consumers are still considered part of a group, as they are
expected to come back at some point. For example, a group is not
rebalanced when its active consumer gets disconnected.

The coordinator sets a timer when a disconnection occurs. When the timer
expires, the corresponding disconnected consumers move to the forgotten
state. At this point they are no longer considered part of their
respective group and are excluded from rebalancing decisions. They are expected
to be removed from the group by the appropriate down event of a
monitor.
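
The connectivity transitions described above can be sketched as follows. This is a simplified model, not the coordinator's Erlang implementation: the function names, the millisecond timestamps passed in by the caller, and the choice to leave forgotten consumers untouched when their node comes back are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

CONNECTED, DISCONNECTED, FORGOTTEN = "connected", "disconnected", "forgotten"


@dataclass
class Consumer:
    name: str
    node: str
    connectivity: str = CONNECTED
    disconnected_at_ms: Optional[int] = None


def on_down(consumers: List[Consumer], node: str, reason: str, now_ms: int) -> None:
    """A down event with a 'noconnection' reason marks the node's consumers disconnected."""
    if reason != "noconnection":
        return  # other reasons (real termination) are out of scope here
    for c in consumers:
        if c.node == node and c.connectivity == CONNECTED:
            c.connectivity = DISCONNECTED
            c.disconnected_at_ms = now_ms


def on_node_up(consumers: List[Consumer], node: str) -> None:
    """Disconnected consumers become connected again when their node rejoins."""
    for c in consumers:
        if c.node == node and c.connectivity == DISCONNECTED:
            c.connectivity = CONNECTED
            c.disconnected_at_ms = None


def on_timer(consumers: List[Consumer], now_ms: int, timeout_ms: int) -> None:
    """When the disconnected-to-forgotten timer expires, forget stale consumers."""
    for c in consumers:
        if c.connectivity == DISCONNECTED and now_ms - c.disconnected_at_ms >= timeout_ms:
            c.connectivity = FORGOTTEN


def group_members(consumers: List[Consumer]) -> List[Consumer]:
    """Consumers that still count for rebalancing: everything except forgotten ones."""
    return [c for c in consumers if c.connectivity != FORGOTTEN]
```

Note that a disconnected consumer still appears in `group_members`, which matches the rule that a disconnection alone does not trigger a rebalancing.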

So the consumer status is now a tuple, e.g. {connected, active}. Note
this is an implementation detail: only the stream SAC coordinator deals with
the status of stream SAC consumers.

Two new configuration entries are introduced:
 * rabbit.stream_sac_disconnected_timeout: the duration in ms of the
   disconnected-to-forgotten timer.
 * rabbit.stream_cmd_timeout: the timeout in ms to apply Ra commands
   in the coordinator. It used to be a fixed value of 30 seconds, which
   remains the default. The setting has been introduced to
   make integration tests faster.

When a stream member goes down, the clean-up of the stream connection state can
remove subscriptions that are not affected by that member. The subscription state is
removed from the connection, but the subscription is not removed from
the SAC state (if the subscription is a SAC), because the PID of the
subscription's member does not match the PID of the member that went down.

When the actual member of the subscription goes down, the subscription is no
longer part of the connection state, so the clean-up does not find it
and does not remove it from the SAC state. This leaves a ghost consumer in
the corresponding SAC group.

This commit makes sure only the affected subscriptions are removed from
the state when a stream member goes down.
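
The fix amounts to filtering subscriptions by the PID of the member that went down. A minimal sketch of that idea, assuming the connection state can be modeled as a mapping from a subscription ID to the PID of its stream member (the function name and this data shape are hypothetical, not the actual connection state):

```python
from typing import Dict, Set, Tuple


def clean_up_on_member_down(
    subscriptions: Dict[int, str], down_pid: str
) -> Tuple[Dict[int, str], Set[int]]:
    """Split subscriptions into those to keep and those to remove.

    Only subscriptions whose member PID matches the down member are
    affected; every other subscription stays in the connection state so
    it can still be cleaned up when its own member goes down.
    """
    affected = {sub_id for sub_id, pid in subscriptions.items() if pid == down_pid}
    remaining = {
        sub_id: pid for sub_id, pid in subscriptions.items() if sub_id not in affected
    }
    return remaining, affected
```

The returned set of affected IDs is what a caller would use to also unregister SAC subscriptions from the SAC state, so the two states stay consistent.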

Fixes #13961
@acogoluegnes acogoluegnes force-pushed the stream-sac-coordinator-status-instead-of-active-flag branch from 965ee4a to a956572 Compare May 28, 2025 13:51
@acogoluegnes acogoluegnes changed the title from "Introduce stream SAC status instead of active flag" to "Prevent blocked groups in stream SAC with fine-grained status" May 28, 2025