rabbitmq-queues: a command to display a member with highest (commit, log, snapshot) index #14237
base: main
Conversation
_ ->
    Acc
end
end, {-100, []}, Status),
just FYI, the `-100` here is just an arbitrary number to initialise the accumulator, to ensure it's less than the initial index value of `-1` (e.g. the initial snapshot index is `-1`): https://github.com/rabbitmq/ra/blob/v2.16.11/src/ra_snapshot.erl#L195
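For context, here is a minimal, self-contained sketch of the fold this comment refers to. It is not the exact code under review: the `{Member, Index}` row shape is a simplification of the real quorum status rows. The point it illustrates is that any real index is `>= -1`, so initialising the accumulator to `-100` guarantees the first member examined replaces it.

```erlang
-module(highest_index_sketch).
-export([highest/1]).

%% Illustrative sketch only. Status is assumed to be a list of
%% {Member, Index} pairs; the real quorum status rows carry more
%% fields. {-100, []} mirrors the accumulator in the snippet above:
%% -100 is below any real index (>= -1), so the first row always wins.
highest(Status) ->
    lists:foldl(
      fun({Member, Index}, {MaxIdx, _} = Acc) ->
              case Index > MaxIdx of
                  true  -> {Index, Member};
                  false -> Acc
              end;
         (_, Acc) ->
              %% fallback for rows that don't match the assumed shape
              Acc
      end, {-100, []}, Status).
```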
@Ayanda-D there's nothing wrong with having this command. Can we rename it to just
@michaelklishin ok, I've updated the name to
Do you still have this issue on a supported OSS version (i.e. 4.1)?
hi @kjnilsson (sorry for the delay) we are seeing this on 3.12.14. However, the issue seems very closely related to rabbitmq/ra#514 and #13131 (which seem not to have been resolved yet?) - we see logs flooded with the same crash when the QQs get into this leader-less state. Then, I managed to produce a very similar state from
We are still not precisely sure what's causing this, hence these workarounds to at least restore such queues.
Do you delete queues whilst a node is down, then re-declare them with the same name? That is a very specific and somewhat odd use case (why delete and re-declare queues with the same name?), and it is what is causing the issues you mention. I just feel this utility is a bit odd, as it takes seconds to work this out using the existing
Not really - an example is routine sanity checks that run periodically, sometimes when one or more nodes are undergoing restarts (even if
We are also not always in full control of what thousands of client applications do. The same SLA guarantees need to hold for Quorum Queues as for Classic Mirrored Queues for general AMQP operations, despite behind-the-scenes operations such as node restarts taking place with a majority of nodes always available (we understand there are some unsupported features, but getting into a state of potential message loss / queue deletion is a blocker).
Doing the computation on the node is still the fastest; we just loop through
We don't have specific references addressing these problems to justify that upgrading will fix this issue:
And we also managed to produce the same state from
There is very little evidence linking any of what you have seen to these issues. We have, however, fixed many other issues that could result in broken Ra clusters since 3.12 - too many to list. Upgrading to the latest 4.1.x release will help. It would also help us, as we can then justify spending time investigating any further issues you see.
We understand there is no clear path to fixing this issue - hence we are adding/contributing operator tools to recover from these currently unrecoverable states, i.e. in this case "a new CLI command to help us choose the safest member to shrink to", for use with the existing "force shrink" and "grow" commands (and the QQ leader health checks in #13433, which help us detect the problem). We are able to clone rabbitmq-server and reproduce the same "crash" and broken "quorum-status" on the latest main branch. I thought this was enough to justify that there's a bug/problem in the latest release. Is there a specific patch for this crash at least, or a recommended recovery strategy? We want to upgrade, but 3.12.x gives us 2 options for HA (QQ and CMQ; if one fails we have a backup HA plan). Risk of upgrading when we're still seeing this same leaderless quorum-status issue on
This issue is being worked on in #14241 - taking the highest index as the "best target for recovery" isn't necessarily going to work for this case, as the old "revived" member is likely to have a higher index but contain older data.
You are using a very old, unsupported version. I cannot stress this fact enough. It is not a good idea to stay on 3.12.x; in fact, given the fixes and improvements available in 4.1.x, I think it is reckless to stay on an old version that you know you have problems with.
ok great - so the version that'll make sense for us to upgrade to from 3.12.x is most likely 4.1.3, assuming this patch is merged and does the job (needs to be tested). We're keen to upgrade, but our confidence in using QQs exclusively is not yet high atm, until this problem is clearly fixed (any chance of complete queue unavailability would be very detrimental). We also see a need for these CLI tools to offer a recovery path to try to revive queues and prevent complete outages if things fail.
We'd care about what's been committed at this point - unless there's another metric we could use to choose the most reliable online member? We also expect publish confirms to nack the client the moment the queue is unreachable.
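As a hedged illustration of that preference (the map row shape here is hypothetical, not the actual quorum-status format), selecting the online member with the highest commit index might look like this:

```erlang
-module(commit_index_sketch).
-export([safest_member/1]).

%% Hypothetical row shape for illustration only: each row is a map
%% with member, online and commit_index keys. Assumes at least one
%% member is online. Preferring the commit index means choosing by
%% what is known to be durably agreed, not merely what is in the log.
safest_member(Rows) ->
    Online = [R || R = #{online := true} <- Rows],
    [Best | _] = lists:sort(
                   fun(#{commit_index := A}, #{commit_index := B}) ->
                           A >= B    %% descending by commit index
                   end, Online),
    maps:get(member, Best).
```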
Proposed Changes
Hi RMQ team 👋
We're still facing severe problems with Quorum Queues getting into a non-responsive state with no leader (#13101). To recover these queues from a bad state (and to avoid deleting them and losing messages), we'd like to have the following capabilities:
1. Determine the member with the highest index (`commit|log|snapshot`) - to recover to/from. This PR.

The changes proposed here provide the CLI tool to be used for step 1 above: a command to get the `member_with_highest_index` to use for recovery for a given QQ, e.g.:

With this procedure, we can confirm that we are able to restore the broken quorum queues back into a usable state, with all messages retained (depending on the member selected with the highest log/commit index - selecting by log index retains everything in the log). Internally, this uses the queue's quorum status to acquire the member with the highest index (this seemed the least complex approach).
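The example command output originally shown here is not reproduced. As a rough sketch of the kind of selection described (the tuple shape below is a deliberate simplification, not the actual quorum-status row format), picking the member holding the highest value of a requested index type could look like this:

```erlang
-module(member_index_sketch).
-export([pick/2]).

%% Illustrative only: Status is assumed to be a non-empty list of
%% {Member, CommitIdx, LogIdx, SnapshotIdx} tuples. Returns the
%% member holding the highest value of the requested index type.
pick(IndexType, Status) ->
    Key = fun({_, C, _, _}) when IndexType =:= commit   -> C;
             ({_, _, L, _}) when IndexType =:= log      -> L;
             ({_, _, _, S}) when IndexType =:= snapshot -> S
          end,
    %% term order on {Index, Member} pairs makes lists:max pick the
    %% pair with the highest index of the requested type
    {_, Member} = lists:max([{Key(Row), element(1, Row)} || Row <- Status]),
    Member.
```

Under these assumptions, `pick(log, Status)` would select the member whose Raft log extends furthest (retaining the most entries), while `pick(commit, Status)` prefers the member with the most durably agreed entries.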
Please take a look - we'd really appreciate having these tools available. Currently, this is the only way we are able to fix/recover/revive broken Quorum Queues (which we are seeing quite regularly) back to a usable state. These procedures are all working well on our end, restoring service as expected.
Types of Changes
What types of changes does your code introduce to this project?
Put an `x` in the boxes that apply.

Checklist
Put an `x` in the boxes that apply. You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.
I have read the CONTRIBUTING.md document.

Further Comments
If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.