
rabbitmq-queues: a command to display a member with highest (commit, log, snapshot) index #14237


Open · wants to merge 3 commits into main

Conversation

@Ayanda-D (Contributor) commented Jul 15, 2025

Proposed Changes

Hi RMQ team 👋

We're still facing severe problems with Quorum Queues getting into a non-responsive state with no leader (#13101). To recover these queues from a bad state (and to avoid deleting them and losing messages), we'd like the following capabilities (a sketch of the full flow follows the list):

  1. Quickly pick the member/node with the highest index (commit|log|snapshot) to recover to/from. This PR.
  2. Force shrink the damaged quorum queue to the member with the highest chosen index from step 1.
  3. Recover/re-grow the quorum queue to the target quorum cluster size (PR open).
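
For illustration, a rough sketch of that full flow. The shrink and grow invocations below use the long-standing rabbitmq-queues shrink/grow commands as stand-ins; the exact forced-shrink command and flags depend on the RabbitMQ version, and the node names are hypothetical:

# Step 1: pick the member to keep (this PR's command)
rabbitmq-queues member_with_highest_index "Q.2" --index commit

# Step 2: remove the other members so only the chosen one remains
# (plain shrink removes quorum queue members hosted on a given node
# across all queues; the PR text refers to a forced shrink of the one
# damaged queue, whose exact command depends on the RabbitMQ version)
rabbitmq-queues shrink rabbit@rmachine-1
rabbitmq-queues shrink rabbit@rmachine-3

# Step 3: re-grow the queue back to the target cluster size
rabbitmq-queues grow rabbit@rmachine-1 "all"
rabbitmq-queues grow rabbit@rmachine-3 "all"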

The changes proposed here provide the CLI tool to be used for step 1 above: a member_with_highest_index command that returns the member to use for recovery for a given QQ, e.g.:

ayandad@rmachine-1> rabbitmq-queues member_with_highest_index "Q.2" --index commit
Member with highest commit index for queue Q.2 in vhost / on node rabbit@rmachine-1...
┌───────────────────┬────────────┬────────────┬────────────────┬──────────────┬──────────────┬──────────────┬────────────────┬──────┬─────────────────┐
│ Node Name         │ Raft State │ Membership │ Last Log Index │ Last Written │ Last Applied │ Commit Index │ Snapshot Index │ Term │ Machine Version │
├───────────────────┼────────────┼────────────┼────────────────┼──────────────┼──────────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit@rmachine-2 │ leader     │ voter      │ 17             │ 17           │ 17           │ 17           │ -1             │ 13   │ 7               │
└───────────────────┴────────────┴────────────┴────────────────┴──────────────┴──────────────┴──────────────┴────────────────┴──────┴─────────────────┘
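
Judging by the command name in the PR title, the --index option presumably also accepts the other index types, e.g.:

rabbitmq-queues member_with_highest_index "Q.2" --index log
rabbitmq-queues member_with_highest_index "Q.2" --index snapshot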

With this procedure, we can confirm that we are able to restore the broken quorum queues back into a usable state, with all messages retained (depending on the member selected; choosing the one with the highest log/commit index retains everything in the log). Internally, this uses the queue's quorum status to find the member with the highest index (this seemed the least complex approach).

Please take a look - we'd really appreciate having these tools available. Currently, this is the only way we are able to fix/recover/revive broken Quorum Queues, which we are seeing quite regularly, back to a usable state. These procedures are all working well on our end, restoring service as expected.

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.

            _ ->
                Acc
        end
    end, {-100, []}, Status),
@Ayanda-D (Contributor, Author) commented on the diff, Jul 15, 2025

Just FYI, the -100 here is just an arbitrary number to initialise the accumulator, ensuring it's less than the initial index value of -1, e.g. the initial snapshot index is -1: https://github.com/rabbitmq/ra/blob/v2.16.11/src/ra_snapshot.erl#L195
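
For context, a minimal sketch (not the PR's exact code) of the fold this diff fragment belongs to: it scans the quorum status rows and keeps the one with the highest index, with {-100, []} as the initial accumulator. index_of/1 stands in for however the real code reads the index from a status row:

{_BestIndex, _BestRow} =
    lists:foldl(
      fun(Row, {Max, _Kept} = Acc) ->
              case index_of(Row) of
                  %% keep this row if its index beats the best seen so far
                  I when I > Max -> {I, Row};
                  %% otherwise keep the accumulator; -100 guarantees the
                  %% first row always wins, since real indices start at -1
                  _ -> Acc
              end
      end, {-100, []}, Status),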

@michaelklishin changed the title from "QQ: CLI tool to pick member with highest index" to "rabbitmq-queues: a command to display a member with highest (commit, log, snapshot) index" on Jul 15, 2025
@michaelklishin (Collaborator)

@Ayanda-D there's nothing wrong with having this command. Can we rename it to just member_with_highest_index, though?

@Ayanda-D (Contributor, Author)

@michaelklishin ok, I've updated the name to member_with_highest_index

@kjnilsson (Contributor)

We're still facing severe problems with Quorum Queues getting into a non-responsive state with no leader #13101.

Do you still have this issue on a supported OSS version (i.e. 4.1)?

@Ayanda-D (Contributor, Author) commented Jul 17, 2025

Hi @kjnilsson (sorry for the delay), we are seeing this on 3.12.14. However, the issue seems very closely related to rabbitmq/ra#514 and #13131 (which seem not to have been resolved yet?) - we see logs flooded with the same crash when the QQs get into this leaderless state.

Then, I managed to produce a very similar state from the main branch at cdd9ba1 (using the same procedures listed in #13131, plus a full broker restart):

ayandad@host rabbit % sbin/rabbitmq-queues quorum_status qq1 --node rabbit-1
Status of quorum queue qq1 on node rabbit-1@host ...
┌─────────────────────┬────────────┬────────────┬────────────────┬──────────────┬──────────────┬──────────────┬────────────────┬──────┬─────────────────┐
│ Node Name           │ Raft State │ Membership │ Last Log Index │ Last Written │ Last Applied │ Commit Index │ Snapshot Index │ Term │ Machine Version │
├─────────────────────┼────────────┼────────────┼────────────────┼──────────────┼──────────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-1@host       │ noproc     │ unknown    │ 976            │ 976          │ 976          │ 976          │ 974            │ 977  │ 7               │
├─────────────────────┼────────────┼────────────┼────────────────┼──────────────┼──────────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-2@host       │ noproc     │ unknown    │ 977            │ 977          │ 976          │ 976          │ 975            │ 978  │ 7               │
├─────────────────────┼────────────┼────────────┼────────────────┼──────────────┼──────────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-3@host       │ pre_vote   │ voter      │ 10006          │ 10006        │ 10006        │ 10006        │ 10006          │ 978  │ 7               │
└─────────────────────┴────────────┴────────────┴────────────────┴──────────────┴──────────────┴──────────────┴────────────────┴──────┴─────────────────┘

We are still not precisely sure what's causing this, hence these workarounds to at least restore such queues.

@kjnilsson (Contributor)

Then, I managed to produce a very similar state from the main branch at cdd9ba1 (using the same procedures listed in #13131, plus a full broker restart):

Do you delete queues whilst a node is down then re-declare them with the same name?

This is a very specific and a bit odd use case (why delete and re-declare queues with the same name?) that is causing the issues you mention.

I just feel this utility is a bit odd as it takes seconds to work this out using the existing rabbitmq-queues quorum_status command and it feels like it is addressing issues in unsupported RabbitMQ versions.

@Ayanda-D (Contributor, Author) commented Jul 17, 2025

Do you delete queues whilst a node is down then re-declare them with the same name?

Not really. One example is routine sanity checks that run periodically, sometimes while one or more nodes are undergoing restarts (even if queue.delete and queue.declare are carried out in between, for the same queue name). We expect Quorum Queues to remain stable in such cases - but they can remain in this leaderless state indefinitely (until deletion).

We are also not always in full control of what thousands of client applications do. The same SLA guarantees need to hold for Quorum Queues as for Classic Mirrored Queues for general AMQP operations, even while behind-the-scenes operations such as node restarts take place with a majority of nodes always available (we understand there are some unsupported features, but getting into a state of potential message loss / queue deletion is a blocker).

I just feel this utility is a bit odd as it takes seconds to work this out using the existing

Doing the computation on the node is still the fastest option: we just loop ClusterSize times (typically 5 or 7) to get the member with the highest index. The time incurred (a few milliseconds) should be acceptable on the operator end, compared to doing this from external scripts, parsing outputs, etc., which is much slower. (I'm happy to change this for a faster suggested approach - we desperately need this or something similar.)

it feels like it is addressing issues in unsupported RabbitMQ versions

We don't have specific references addressing these problems to justify that upgrading will fix this issue.

We also managed to reproduce the same state from the main branch, so this tool helps recover the queue even in the state we're observing on the latest main branch.

@kjnilsson (Contributor)

We don't have specific references addressing these problems to justify that upgrading will fix this issue.

There is very little evidence linking any of what you have seen to these issues. We have, however, fixed many other issues that could result in broken Ra clusters since 3.12 - too many to list. Upgrading to the latest 4.1.x release will help. It would also help us, as we can then justify spending time investigating any further issues you see.

@Ayanda-D (Contributor, Author)

We understand there is no clear path to fixing this issue - hence we are contributing operator tools to recover from these currently unrecoverable states, i.e. in this case "a new CLI command to help us choose the safest member to shrink to", to be used with the existing "force shrink" and "grow" commands (and the QQ leader health checks in #13433, which help us detect the problem).

We are able to clone rabbitmq-server and reproduce the same "crash" and broken "quorum-status" on the latest main branch. I thought this was enough to justify that there's a bug/problem in the latest release. Is there a specific patch for this crash at least, or a recommended recovery strategy?

We want to upgrade, but 3.12.x gives us two options for HA (QQ and CMQ - if one fails, we have a backup HA plan). The risk of upgrading while we're still seeing this same leaderless quorum-status issue on main is too high (and too scary tbh 😬). But with these CLI tools and procedures, we can justify a clear recovery plan in case things fail.

@kjnilsson (Contributor)

We are able to clone rabbitmq-server and reproduce the same "crash" and broken "quorum-status" on the latest main branch. I thought this was enough to justify that there's a bug/problem in the latest release. Is there a specific patch for this crash at least, or a recommended recovery strategy?

This issue is being worked on here #14241 - taking the highest index as a "best target for recovery" isn't necessarily going to work for this case as the old "revived" member is likely to have a higher index but contain older data.

We want to upgrade, but 3.12.x gives us two options for HA (QQ and CMQ - if one fails, we have a backup HA plan). The risk of upgrading while we're still seeing this same leaderless quorum-status issue on main is too high (and too scary tbh 😬). But with these CLI tools and procedures, we can justify a clear recovery plan in case things fail.

You are using a very old, unsupported version. I cannot stress this fact enough. It is not a good idea to stay on 3.12.x; in fact, given the fixes and improvements available in 4.1.x, I think it is reckless to stay on an old version that you know you have problems with.

@Ayanda-D (Contributor, Author)

This issue is being worked on here #14241

ok great, so the version that makes sense for us to upgrade to from 3.12.x is most likely 4.1.3, assuming this patch is merged and does the job (it needs to be tested). We're keen to upgrade, but our confidence in relying on QQs entirely is not yet high until this problem is clearly fixed (any chance of complete queue unavailability would be very detrimental). We also see the need for these CLI tools to offer a recovery path to try to revive the queue and prevent complete outages if things fail.

highest index as a "best target for recovery" isn't necessarily going to work for this case as the old "revived" member is likely to have a higher index but contain older data

We'd care about what's been committed at this point - unless there's another metric we could use to choose the most reliable online member? We also expect publisher confirms to nack the client the moment the queue becomes unreachable.
