Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][broker] PIP-327: Support force topic loading for unrecoverable errors #21759

Merged
merged 1 commit into from
Oct 4, 2024

Conversation

rdhabalia
Copy link
Contributor

@rdhabalia rdhabalia commented Dec 19, 2023

It fixes #21751

PIP: #21752

Motivation

We have introduced a configuration called autoSkipNonRecoverableData before open-sourcing Pulsar as we have come across with various situations when it was not possible to recover ledgers belonging to managed-ledger or managed-cursors and the broker was not able to load the topics. In such situations,autoSkipNonRecoverableData flag helps to skip non-recoverable leger-recovery errors such as ledger_not_found and allows the broker to load topics by skipping such ledgers in disaster recovery.

Brokers can recognize such non-recoverable errors using bookkeeper error codes but in some cases, it’s very tricky and not possible to conclude non-recoverable errors. For example, the broker can not differentiate between all the ensemble bookies of the ledgers that are temporarily unavailable or are permanently removed from the cluster without graceful recovery, and because of that broker doesn’t consider all the bookies deleted as a non-recoverable error though we can not recover ledgers in such situations where all the bookies are removed due to various reasons such as Dev cluster clean up or system faced data disaster with multiple bookie loss. In such situations, the system admin has to manually identify such non-recoverable topics and update those topics’ managed-ledger and managed-cursor’s metadata and reload topics again which requires a lot of manual effort and sometimes it might not be feasible to handle such situations with a large number of topics that require this manual procedure to fix those topics.

2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [prop/us-west2/ns1/per
sistent/t1-partition-220] Opened ledger 94267 for consumer pulsar.repl.dev-eastb. rc=0
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve euw1bk--prod-booki
e-0.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, bookie is unknown org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle i
s not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to euw1bk--prod-bookie-0.
euw1bk--prod-bookie.bk-.svc.cluster.local:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$Bookie
IdNotResolvedException: Cannot resolve bookieId euw1bk--prod-bookie-0.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, bookie does not exist or it is not 
running
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L94267
 E46 from bookie: euw1bk--prod-bookie-0.euw1bk--prod-bookie.bk-.svc.cluster.local:3181
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve euw1bk--prod-booki
e-2.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, bookie is unknown org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle i
s not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to euw1bk--prod-bookie-2.
euw1bk--prod-bookie.bk-.svc.cluster.local:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$Bookie
IdNotResolvedException: Cannot resolve bookieId euw1bk--prod-bookie-2.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, bookie does not exist or it is not 
running
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L94267
 E46 from bookie: euw1bk--prod-bookie-2.euw1bk--prod-bookie.bk-.svc.cluster.local:3181
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L94267 E46-E46, Sent to [euw1
bk--prod-bookie-2.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, euw1bk--prod-bookie-0.euw1bk--prod-bookie.bk-.svc.cluster
.local:3181], Heard from [] : bitset = {}, Error = 'Bookie handle is not available'. First unread entry is (-1, rc = null)
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN  org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [prop/us-west2/ns1/per
sistent/t1-partition-220] Error reading from metadata ledger 94267 for consumer pulsar.repl.dev-eastb: Bookie handle is not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [prop/us-west2/ns1/per
sistent/t1-partition-220] Recovery for cursor pulsar.repl.dev-eastb failed
org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie handle is not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [prop/us-west2/ns1/per
sistent/t1-partition-220] Closing managed ledger
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN  org.apache.pulsar.broker.service.BrokerService - Failed to create topic persistent://prop/us-west2/ns1/per
sistent/t1-partition-220
org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie handle is not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN  org.apache.pulsar.broker.service.ServerCnx - [/1.1.1.1:59962][persistent://prop/us-west2/ns1/t1-partition-220][my-local] Failed to create consumer: consumerId=5369828, org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie handle is not available

Modifications

Therefore, the system admin should have a dynamic configuration called managedLedgerForceRecovery to use in such situations to allow brokers to forcefully load topics by skipping ledger failures to avoid topic unavailability and perform auto repairs of the topics. This will allow the admin to handle disaster recovery situations in a controlled and automated manner and maintain the topic availability by mitigating such failures.

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@rdhabalia rdhabalia added area/broker doc-not-needed Your PR changes do not impact docs labels Dec 19, 2023
@rdhabalia rdhabalia self-assigned this Dec 19, 2023
@rdhabalia rdhabalia changed the title [improve][pip] PIP-327: Support force topic loading for unrecoverable errors [improve][broker] PIP-327: Support force topic loading for unrecoverable errors Dec 19, 2023
@rdhabalia rdhabalia force-pushed the force_load branch 2 times, most recently from eaaee76 to ec52b6d Compare December 19, 2023 19:48
@Technoboy- Technoboy- added this to the 3.3.0 milestone Dec 22, 2023
@aloyszhang
Copy link
Contributor

Should we consider the exceptions when initializing the ManagedLedgerImpl?

@coderzc coderzc modified the milestones: 3.3.0, 3.4.0 May 8, 2024
@dao-jun
Copy link
Member

dao-jun commented May 14, 2024

@rdhabalia This PIP are approved, are we still working on the PR?

@codecov-commenter
Copy link

codecov-commenter commented Oct 4, 2024

Codecov Report

Attention: Patch coverage is 76.19048% with 5 lines in your changes missing coverage. Please review.

Project coverage is 74.57%. Comparing base (bbc6224) to head (97ccefa).
Report is 631 commits behind head on master.

Files with missing lines Patch % Lines
...broker/service/schema/BookkeeperSchemaStorage.java 60.00% 4 Missing ⚠️
...che/bookkeeper/mledger/impl/ManagedCursorImpl.java 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #21759      +/-   ##
============================================
+ Coverage     73.57%   74.57%   +0.99%     
- Complexity    32624    34561    +1937     
============================================
  Files          1877     1936      +59     
  Lines        139502   145378    +5876     
  Branches      15299    15893     +594     
============================================
+ Hits         102638   108410    +5772     
+ Misses        28908    28668     -240     
- Partials       7956     8300     +344     
Flag Coverage Δ
inttests 27.92% <57.14%> (+3.34%) ⬆️
systests 24.52% <47.61%> (+0.20%) ⬆️
unittests 73.91% <66.66%> (+1.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...apache/bookkeeper/mledger/ManagedLedgerConfig.java 96.44% <100.00%> (+0.15%) ⬆️
...org/apache/pulsar/broker/ServiceConfiguration.java 98.97% <100.00%> (-0.43%) ⬇️
...rg/apache/pulsar/broker/service/BrokerService.java 83.54% <100.00%> (+2.76%) ⬆️
...che/bookkeeper/mledger/impl/ManagedCursorImpl.java 80.18% <66.66%> (+0.88%) ⬆️
...broker/service/schema/BookkeeperSchemaStorage.java 77.97% <60.00%> (+0.40%) ⬆️

... and 612 files with indirect coverage changes

@rdhabalia rdhabalia merged commit 38322a6 into apache:master Oct 4, 2024
51 checks passed
@rdhabalia rdhabalia deleted the force_load branch October 4, 2024 04:58
@lhotari
Copy link
Member

lhotari commented Oct 8, 2024

This PR introduced a new flaky test, #23417.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker doc-not-needed Your PR changes do not impact docs ready-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Brokers are not able to load topics in non-recoverable situation
9 participants