[improve][broker] PIP-327: Support force topic loading for unrecoverable errors #21759
Should we consider the exceptions when initializing the |
...ar-broker/src/main/java/org/apache/pulsar/broker/service/schema/BookkeeperSchemaStorage.java
@rdhabalia This PIP is approved. Are we still working on the PR?
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #21759 +/- ##
============================================
+ Coverage 73.57% 74.57% +0.99%
- Complexity 32624 34561 +1937
============================================
Files 1877 1936 +59
Lines 139502 145378 +5876
Branches 15299 15893 +594
============================================
+ Hits 102638 108410 +5772
+ Misses 28908 28668 -240
- Partials 7956 8300 +344
This PR introduced a new flaky test, #23417.
It fixes #21751
PIP: #21752
Motivation
We introduced a configuration called autoSkipNonRecoverableData before open-sourcing Pulsar, because we had come across various situations in which ledgers belonging to managed-ledgers or managed-cursors could not be recovered and the broker was unable to load the topics. In such situations, the autoSkipNonRecoverableData flag skips non-recoverable ledger-recovery errors such as ledger_not_found and allows the broker to load topics by skipping such ledgers during disaster recovery.
Brokers can recognize such non-recoverable errors using BookKeeper error codes, but in some cases it is tricky or even impossible to conclude that an error is non-recoverable. For example, the broker cannot tell whether all the ensemble bookies of a ledger are temporarily unavailable or have been permanently removed from the cluster without graceful recovery. As a result, the broker does not treat the loss of all bookies as a non-recoverable error, even though the ledgers cannot be recovered once all their bookies are gone, for example after a dev cluster cleanup or a data disaster with multiple bookie losses. In such situations, the system admin has to manually identify the non-recoverable topics, update the managed-ledger and managed-cursor metadata of those topics, and reload the topics. This requires a lot of manual effort and may not be feasible when a large number of topics need the procedure.
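The ambiguity described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the LedgerError enum, the shouldSkipLedger method and the forceRecovery parameter are illustrative names only, not Pulsar or BookKeeper APIs, and the assumption that a force flag skips every ledger failure is a simplification of the behaviour this PR proposes.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch; none of these names are Pulsar or BookKeeper APIs. */
public class LedgerRecoveryPolicy {

    /** Simplified stand-ins for failures a broker may see while opening a ledger. */
    public enum LedgerError {
        LEDGER_NOT_FOUND,     // ledger is gone: clearly non-recoverable
        BOOKIES_UNAVAILABLE   // ensemble unreachable: temporary outage or permanent loss?
    }

    // Errors that can be classified as non-recoverable from the error code alone.
    private static final Set<LedgerError> KNOWN_NON_RECOVERABLE =
            EnumSet.of(LedgerError.LEDGER_NOT_FOUND);

    /**
     * Decides whether the broker skips the failed ledger and keeps loading the topic.
     * Assumption: the force flag overrides classification and skips any failure.
     */
    public static boolean shouldSkipLedger(LedgerError error,
                                           boolean autoSkipNonRecoverableData,
                                           boolean forceRecovery) {
        if (forceRecovery) {
            return true; // operator has declared the data lost
        }
        return autoSkipNonRecoverableData && KNOWN_NON_RECOVERABLE.contains(error);
    }

    public static void main(String[] args) {
        // Known non-recoverable error: auto-skip is enough.
        System.out.println(shouldSkipLedger(LedgerError.LEDGER_NOT_FOUND, true, false));    // true
        // Ambiguous error: auto-skip alone cannot decide, so the topic fails to load.
        System.out.println(shouldSkipLedger(LedgerError.BOOKIES_UNAVAILABLE, true, false)); // false
        // Ambiguous error with the operator override: the ledger is skipped.
        System.out.println(shouldSkipLedger(LedgerError.BOOKIES_UNAVAILABLE, true, true));  // true
    }
}
```

The second case is the gap in question: the error code alone does not tell the broker that the bookies are gone for good, so only an explicit operator decision can unblock the topic.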
Modifications
Therefore, the system admin should have a dynamic configuration called managedLedgerForceRecovery to use in such situations. It allows brokers to forcefully load topics by skipping ledger failures, which avoids topic unavailability and lets the topics be repaired automatically. This allows the admin to handle disaster recovery in a controlled and automated manner and to maintain topic availability by mitigating such failures.
Verifying this change