
rabbitmq-queues: a command to display a member with highest (commit, log, snapshot) index #14237


Open · wants to merge 3 commits into main

Conversation

@Ayanda-D (Contributor) commented Jul 15, 2025

Proposed Changes

Hi RMQ team 👋

We're still facing severe problems with Quorum Queues getting into a non-responsive state with no leader (#13101). To recover these queues from a bad state (and to avoid deleting them and losing messages), we'd like the following capabilities (a sketch of the full flow follows the list):

  1. Quickly pick the member/node with the highest index (commit|log|snapshot) to recover to/from. This PR.
  2. Force shrink the damaged quorum queue to the member with the highest chosen index from step 1.
  3. Recover/re-grow the quorum queue to the target quorum cluster size (PR open).
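
For illustration, a rough sketch of that full flow. The shrink and grow invocations below use the long-standing rabbitmq-queues shrink/grow commands as stand-ins; the exact forced-shrink command and flags depend on the RabbitMQ version, and the node names are hypothetical:

# Step 1: pick the member to keep (this PR's command)
rabbitmq-queues member_with_highest_index "Q.2" --index commit

# Step 2: remove the other members so only the chosen one remains
# (plain shrink removes quorum queue members hosted on a given node
# across all queues; the PR text refers to a forced shrink of the one
# damaged queue, whose exact command depends on the RabbitMQ version)
rabbitmq-queues shrink rabbit@rmachine-1
rabbitmq-queues shrink rabbit@rmachine-3

# Step 3: re-grow the queue back to the target cluster size
rabbitmq-queues grow rabbit@rmachine-1 "all"
rabbitmq-queues grow rabbit@rmachine-3 "all"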

The changes proposed here provide the CLI tool to be used for step 1 above: a member_with_highest_index command that returns the member to use for recovery for a given QQ, e.g.:

ayandad@rmachine-1> rabbitmq-queues member_with_highest_index "Q.2" --index commit
Member with highest commit index for queue Q.2 in vhost / on node rabbit@rmachine-1...
┌───────────────────┬────────────┬────────────┬────────────────┬──────────────┬──────────────┬──────────────┬────────────────┬──────┬─────────────────┐
│ Node Name         │ Raft State │ Membership │ Last Log Index │ Last Written │ Last Applied │ Commit Index │ Snapshot Index │ Term │ Machine Version │
├───────────────────┼────────────┼────────────┼────────────────┼──────────────┼──────────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit@rmachine-2 │ leader     │ voter      │ 17             │ 17           │ 17           │ 17           │ -1             │ 13   │ 7               │
└───────────────────┴────────────┴────────────┴────────────────┴──────────────┴──────────────┴──────────────┴────────────────┴──────┴─────────────────┘
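
Judging by the command name in the PR title, the --index option presumably also accepts the other index types, e.g.:

rabbitmq-queues member_with_highest_index "Q.2" --index log
rabbitmq-queues member_with_highest_index "Q.2" --index snapshot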

With this procedure, we can confirm that we are able to restore the broken quorum queues back into a usable state, with all messages retained (depending on the member selected; choosing the one with the highest log/commit index retains everything in the log). Internally, this uses the queue's quorum status to find the member with the highest index (this seemed the least complex approach).

Please take a look - we'd really appreciate having these tools available. Currently, this is the only way we are able to fix/recover/revive broken Quorum Queues, which we are seeing quite regularly, back to a usable state. These procedures are all working well on our end, restoring service as expected.

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.

            _ ->
                Acc
        end
    end, {-100, []}, Status),
@Ayanda-D (Contributor, Author) commented on the diff, Jul 15, 2025

Just FYI, the -100 here is just an arbitrary number to initialise the accumulator, ensuring it's less than the initial index value of -1, e.g. the initial snapshot index is -1: https://github.com/rabbitmq/ra/blob/v2.16.11/src/ra_snapshot.erl#L195
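
For context, a minimal sketch (not the PR's exact code) of the fold this diff fragment belongs to: it scans the quorum status rows and keeps the one with the highest index, with {-100, []} as the initial accumulator. index_of/1 stands in for however the real code reads the index from a status row:

{_BestIndex, _BestRow} =
    lists:foldl(
      fun(Row, {Max, _Kept} = Acc) ->
              case index_of(Row) of
                  %% keep this row if its index beats the best seen so far
                  I when I > Max -> {I, Row};
                  %% otherwise keep the accumulator; -100 guarantees the
                  %% first row always wins, since real indices start at -1
                  _ -> Acc
              end
      end, {-100, []}, Status),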

@michaelklishin changed the title from "QQ: CLI tool to pick member with highest index" to "rabbitmq-queues: a command to display a member with highest (commit, log, snapshot) index" on Jul 15, 2025
@michaelklishin (Collaborator)

@Ayanda-D there's nothing wrong with having this command. Can we rename it to just member_with_highest_index, though?

@Ayanda-D (Contributor, Author)

@michaelklishin ok, I've updated the name to member_with_highest_index

@kjnilsson (Contributor)

We're still facing severe problems with Quorum Queues getting into a non-responsive state with no leader #13101.

Do you still have this issue on a supported OSS version (i.e. 4.1)?

@Ayanda-D (Contributor, Author) commented Jul 17, 2025

Hi @kjnilsson (sorry for the delay), we are seeing this on 3.12.14. However, the issue seems very closely related to rabbitmq/ra#514 and #13131 (which seem not to have been resolved yet?) - we see logs flooded with the same crash when the QQs get into this leaderless state.

Then, I managed to produce a very similar state from the main branch at cdd9ba1 (using the same procedures listed in #13131, plus a full broker restart):

ayandad@host rabbit % sbin/rabbitmq-queues quorum_status qq1 --node rabbit-1
Status of quorum queue qq1 on node rabbit-1@host ...
┌─────────────────────┬────────────┬────────────┬────────────────┬──────────────┬──────────────┬──────────────┬────────────────┬──────┬─────────────────┐
│ Node Name           │ Raft State │ Membership │ Last Log Index │ Last Written │ Last Applied │ Commit Index │ Snapshot Index │ Term │ Machine Version │
├─────────────────────┼────────────┼────────────┼────────────────┼──────────────┼──────────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-1@host       │ noproc     │ unknown    │ 976            │ 976          │ 976          │ 976          │ 974            │ 977  │ 7               │
├─────────────────────┼────────────┼────────────┼────────────────┼──────────────┼──────────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-2@host       │ noproc     │ unknown    │ 977            │ 977          │ 976          │ 976          │ 975            │ 978  │ 7               │
├─────────────────────┼────────────┼────────────┼────────────────┼──────────────┼──────────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-3@host       │ pre_vote   │ voter      │ 10006          │ 10006        │ 10006        │ 10006        │ 10006          │ 978  │ 7               │
└─────────────────────┴────────────┴────────────┴────────────────┴──────────────┴──────────────┴──────────────┴────────────────┴──────┴─────────────────┘

We are still not precisely sure what's causing this, hence these workarounds to at least restore such queues.

@kjnilsson (Contributor)

Then, I managed to produce a very similar state from the main branch at cdd9ba1 (using the same procedures listed in #13131, plus a full broker restart):

Do you delete queues whilst a node is down then re-declare them with the same name?

This is a very specific and a bit odd use case (why delete and re-declare queues with the same name?) that is causing the issues you mention.

I just feel this utility is a bit odd as it takes seconds to work this out using the existing rabbitmq-queues quorum_status command and it feels like it is addressing issues in unsupported RabbitMQ versions.

@Ayanda-D (Contributor, Author) commented Jul 17, 2025

Do you delete queues whilst a node is down then re-declare them with the same name?

Not really. One example is routine sanity checks that run periodically, sometimes while one or more nodes are undergoing restarts (even if queue.delete and queue.declare are carried out in between, for the same queue name). We expect Quorum Queues to remain stable in such cases - but they can remain in this leaderless state indefinitely (until deletion).

We are also not always in full control of what thousands of client applications do. The same SLA guarantees need to hold for Quorum Queues as for Classic Mirrored Queues for general AMQP operations, even while behind-the-scenes operations such as node restarts take place with a majority of nodes always available (we understand there are some unsupported features, but getting into a state of potential message loss / queue deletion is a blocker).

I just feel this utility is a bit odd as it takes seconds to work this out using the existing

Doing the computation on the node is still the fastest option: we just loop ClusterSize times (typically 5 or 7) to get the member with the highest index. The time incurred (a few milliseconds) should be acceptable on the operator end, compared to doing this from external scripts, parsing outputs, etc., which is much slower. (I'm happy to change this for a faster suggested approach - we desperately need this or something similar.)

it feels like it is addressing issues in unsupported RabbitMQ versions

We don't have specific references addressing these problems to justify that upgrading will fix this issue.

We also managed to reproduce the same state from the main branch, so this tool helps recover the queue even in the state we're observing on the latest main branch.

@kjnilsson (Contributor)

We don't have specific references addressing these problems to justify that upgrading will fix this issue.

There is very little evidence linking any of what you have seen to these issues. We have, however, fixed many other issues that could result in broken Ra clusters since 3.12 - too many to list. Upgrading to the latest 4.1.x release will help. It would also help us, as we can then justify spending time investigating any further issues you see.

@Ayanda-D (Contributor, Author)

We understand there is no clear path to fixing this issue - hence we are contributing operator tools to recover from these currently unrecoverable states, i.e. in this case "a new CLI command to help us choose the safest member to shrink to", to be used with the existing "force shrink" and "grow" commands (and the QQ leader health checks in #13433, which help us detect the problem).

We are able to clone rabbitmq-server and reproduce the same "crash" and broken "quorum-status" on the latest main branch. I thought this was enough to justify that there's a bug/problem in the latest release. Is there a specific patch for this crash at least, or a recommended recovery strategy?

We want to upgrade, but 3.12.x gives us two options for HA (QQ and CMQ - if one fails, we have a backup HA plan). The risk of upgrading while we're still seeing this same leaderless quorum-status issue on main is too high (and too scary tbh 😬). But with these CLI tools and procedures, we can justify a clear recovery plan in case things fail.

@kjnilsson (Contributor)

We are able to clone rabbitmq-server and reproduce the same "crash" and broken "quorum-status" on the latest main branch. I thought this was enough to justify that there's a bug/problem in the latest release. Is there a specific patch for this crash at least, or a recommended recovery strategy?

This issue is being worked on here #14241 - taking the highest index as a "best target for recovery" isn't necessarily going to work for this case as the old "revived" member is likely to have a higher index but contain older data.

We want to upgrade, but 3.12.x gives us two options for HA (QQ and CMQ - if one fails, we have a backup HA plan). The risk of upgrading while we're still seeing this same leaderless quorum-status issue on main is too high (and too scary tbh 😬). But with these CLI tools and procedures, we can justify a clear recovery plan in case things fail.

You are using a very old, unsupported version. I cannot stress this fact enough. It is not a good idea to stay on 3.12.x; in fact, given the fixes and improvements available in 4.1.x, I think it is reckless to stay on an old version that you know you have problems with.

@Ayanda-D (Contributor, Author)

This issue is being worked on here #14241

ok great, so the version that makes sense for us to upgrade to from 3.12.x is most likely 4.1.3, assuming this patch is merged and does the job (it needs to be tested). We're keen to upgrade, but our confidence in relying on QQs entirely is not yet high until this problem is clearly fixed (any chance of complete queue unavailability would be very detrimental). We also see the need for these CLI tools to offer a recovery path to try to revive the queue and prevent complete outages if things fail.

highest index as a "best target for recovery" isn't necessarily going to work for this case as the old "revived" member is likely to have a higher index but contain older data

We'd care about what's been committed at this point - unless there's another metric we could use to choose the most reliable online member? We also expect publisher confirms to nack the client the moment the queue becomes unreachable.
