Conversation

@Roasbeef
Member

In this commit, we add a centralized panic recovery mechanism for gossip
goroutines. This increases the robustness of message processing in the
gossiper, as now we are able to keep on trucking in the face of logic
errors that may lead to panics.

We ensure that any deps are freed and we log the panic trace to help
catch bugs in the future.

IMO this is a defensive pattern we should adopt in other sub-systems that
implement the p2p facing functionality of the daemon. A lil defensive
programming can go a long way.
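
For readers unfamiliar with the pattern, here is a minimal sketch of what a centralized recovery helper can look like. It is not the code from this PR: the `gossiper` struct, the `recoverPanic` and `process` names, and the `recovered`/`released` bookkeeping slices are all hypothetical stand-ins. The key point is that the helper is invoked directly by `defer` in the goroutine it protects, logs the stack, and releases whatever the job was holding.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"sync"
)

// gossiper is a hypothetical stand-in for the real type. The slices only
// exist so the effect of a recovery can be observed in this sketch.
type gossiper struct {
	mu        sync.Mutex
	recovered []string // panic values that were caught
	released  []uint64 // job IDs whose resources were freed
}

// recoverPanic must be called directly via defer in the guarded goroutine,
// otherwise recover() returns nil. It logs the panic with a stack trace and
// frees the job's resources when a job ID is supplied.
func (g *gossiper) recoverPanic(source string, jobID *uint64) {
	r := recover()
	if r == nil {
		return
	}

	fmt.Printf("panic in %s: %v\n%s", source, r, debug.Stack())

	g.mu.Lock()
	defer g.mu.Unlock()
	g.recovered = append(g.recovered, fmt.Sprint(r))
	if jobID != nil {
		g.released = append(g.released, *jobID)
	}
}

// process runs handle in its own goroutine, guarded by recoverPanic.
func (g *gossiper) process(wg *sync.WaitGroup, jobID uint64, handle func()) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer g.recoverPanic("process", &jobID)
		handle()
	}()
}

// demo processes one panicking and one healthy job and reports what the
// recovery helper recorded.
func demo() ([]string, []uint64) {
	g := &gossiper{}
	var wg sync.WaitGroup
	g.process(&wg, 1, func() { panic("bad message") })
	g.process(&wg, 2, func() {})
	wg.Wait()
	return g.recovered, g.released
}

func main() {
	recovered, released := demo()
	fmt.Println("recovered:", recovered, "released:", released)
}
```

Because deferred calls run in LIFO order, the recovery helper runs before `wg.Done()`, so anything waiting on the group only proceeds once cleanup has finished.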

@Roasbeef Roasbeef added discovery Peer and route discovery / whisper protocol related issues/PRs healthcheck labels Dec 30, 2025
@Roasbeef Roasbeef added this to the v0.20.1 milestone Dec 30, 2025
@gemini-code-assist

Summary of Changes

Hello @Roasbeef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the fault tolerance of the gossip message processing within the discovery subsystem. By implementing a robust panic recovery mechanism, the system can now gracefully handle unexpected runtime errors in critical goroutines, preventing service interruptions. The recovery process includes logging detailed stack traces and ensuring proper resource cleanup, which will be invaluable for diagnosing and fixing underlying issues. This change represents a proactive step towards a more stable and resilient network operation.

Highlights

  • Panic Recovery for Gossip Goroutines: Introduced a centralized panic recovery mechanism specifically for gossip goroutines to enhance the robustness of message processing.
  • Increased System Resilience: The new mechanism allows the system to continue operating even when logic errors lead to panics in gossip message processing, preventing crashes.
  • Improved Debugging and Resource Management: Upon panic, the system ensures that any dependencies are freed and logs the full panic trace, aiding in future bug identification and resolution.
  • Defensive Programming Pattern: The author suggests this pattern as a valuable defensive measure for other p2p-facing sub-systems within the daemon.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable panic recovery mechanism for gossip message processing, significantly enhancing the daemon's robustness. The implementation is sound, and the accompanying tests are thorough. My main suggestion is to refactor the logging within the new recoverGossipPanic function to use structured logging, aligning with the repository's style guide. This will improve log observability and consistency. I've provided detailed comments and code suggestions to facilitate this change.

Comment on lines 4927 to 4949
defer ctx.gossiper.recoverGossipPanic(
"testing", nMsg, &jobID,
)

medium

To enable structured logging, the call to recoverGossipPanic needs to be updated to pass a background context.

Suggested change
- defer ctx.gossiper.recoverGossipPanic(
- 	"testing", nMsg, &jobID,
- )
+ defer ctx.gossiper.recoverGossipPanic(
+ 	context.Background(), "testing", nMsg, &jobID,
+ )
References
  1. The style guide (lines 235-253) mandates the use of structured logging (slog), which requires passing a context to logging functions like log.ErrorS. (link)

Comment on lines 5016 to 5083
defer ctx.gossiper.recoverGossipPanic(
"testing", nMsg, &parentJobID,
)

medium

To enable structured logging, the call to recoverGossipPanic needs to be updated to pass a background context.

Suggested change
- defer ctx.gossiper.recoverGossipPanic(
- 	"testing", nMsg, &parentJobID,
- )
+ defer ctx.gossiper.recoverGossipPanic(
+ 	context.Background(), "testing", nMsg, &parentJobID,
+ )
References
  1. The style guide (lines 235-253) mandates the use of structured logging (slog), which requires passing a context to logging functions like log.ErrorS. (link)
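
To illustrate what a context-aware, structured log call inside a recovery helper might look like, here is a rough sketch. It uses the standard library's `log/slog` as a stand-in, which is an assumption: the repository's own logger and its `log.ErrorS` signature are not shown in this thread. The `recoverAndLog` helper and the attribute names are hypothetical.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
)

// recoverAndLog is a hypothetical recovery helper that takes a context so the
// panic can be reported through a structured, context-aware log call. It must
// be invoked directly via defer.
func recoverAndLog(ctx context.Context, logger *slog.Logger, source string) {
	if r := recover(); r != nil {
		logger.ErrorContext(ctx, "recovered from panic",
			slog.String("source", source),
			slog.Any("panic", r),
		)
	}
}

// demo triggers a panic under the helper and returns the emitted log line.
func demo() string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewTextHandler(&buf, &slog.HandlerOptions{
		// Drop the timestamp so the output is stable in this sketch.
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.TimeKey {
				return slog.Attr{}
			}
			return a
		},
	}))

	func() {
		defer recoverAndLog(context.Background(), logger, "testing")
		panic("boom")
	}()

	return buf.String()
}

func main() {
	fmt.Print(demo())
}
```

Passing the panic value and source as key-value attributes rather than formatting them into the message keeps them searchable in log tooling.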

@saubyk saubyk added this to lnd v0.20 Jan 4, 2026
@saubyk saubyk moved this to In progress in lnd v0.20 Jan 4, 2026
@ziggie1984 ziggie1984 added no-changelog backport-v0.20.x-branch This label is used to trigger the creation of a backport PR to the branch `v0.20.x-branch`. labels Jan 5, 2026
@ziggie1984 ziggie1984 self-requested a review January 5, 2026 20:08
@lightninglabs-deploy

@Roasbeef, remember to re-request review from reviewers when ready

@saubyk saubyk requested review from ellemouton and gijswijs and removed request for ellemouton January 6, 2026 17:38
@Roasbeef Roasbeef force-pushed the discovery-panic-recovery branch from db8a839 to b8bd1aa Compare January 7, 2026 02:18
Collaborator

@ziggie1984 ziggie1984 left a comment

Approach looks fine to me, I only had minor comments.

However, we need to be aware that this recovery design has its limits: for example, it will not recover panics that happen in second-level goroutines. Most of our DB insertions happen in the batch function, which runs in a separate goroutine, so we won't catch those panics. For example:

 handleNetworkMessages (Goroutine 1)
    ├─ defer recoverGossipPanic()  ← Can only catch panics in THIS goroutine
    │
    ├─ processNetworkAnnouncement()
    │   └─ handleChanAnnouncement()
    │       └─ d.cfg.Graph.AddEdge(ctx, edge, ops...)
    │           └─ b.addEdge(ctx, edge, op...)
    │               └─ b.cfg.Graph.AddChannelEdge(ctx, edge, op...)
    │                   └─ scheduler.Execute(ctx, request)
    │                       └─ go s.b.trigger(ctx)  ← NEW GOROUTINE! (batch/scheduler.go:85)
    │                           └─ b.run(ctx)       ← NO PANIC RECOVERY!
    │                               └─ req.Do(tx)   ← 💥 If panic here → DAEMON CRASHES!
    │
    └─ recoverGossipPanic CANNOT catch the panic in the batch goroutine!
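
The limitation in the diagram follows from `recover` only seeing panics raised on the goroutine that deferred it. A small sketch of that behavior, under assumed names (`execute` loosely mimics a scheduler that runs a request on a new goroutine; it is not the real `batch` package API): the spawned goroutine needs its own guard, and the caller's deferred recover never fires.

```go
package main

import "fmt"

// execute runs req on a new goroutine, similar in shape to a batch scheduler.
// The deferred recover inside that goroutine converts a panic into an error
// for the caller. Without it, the panic would terminate the process no matter
// what the calling goroutine has deferred.
func execute(req func() error) error {
	errChan := make(chan error, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				errChan <- fmt.Errorf("request panicked: %v", r)
			}
		}()
		errChan <- req()
	}()
	return <-errChan
}

// demo reports whether the caller's own recover fired and which error the
// guarded child goroutine returned.
func demo() (callerRecovered bool, err error) {
	func() {
		// This guard lives on the calling goroutine and never observes
		// the panic raised on the goroutine spawned by execute.
		defer func() {
			if recover() != nil {
				callerRecovered = true
			}
		}()
		err = execute(func() error { panic("boom") })
	}()
	return callerRecovered, err
}

func main() {
	callerRecovered, err := demo()
	fmt.Println("caller recovered:", callerRecovered)
	fmt.Println("error from child goroutine:", err)
}
```

Removing the deferred function inside `execute` would make the same program crash, which is the scenario the diagram describes for the batch goroutine.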

@Roasbeef
Member Author

Roasbeef commented Jan 7, 2026

however we need to be aware that this recovery design has its limits, it will for example not recover panics which happen in second level goroutines

That's a good point. I think to cover deeper chains like that, we could consider wrapping our existing wg wrapper with a recover call. Here I'm just after shallow items within the gossip processing logic; I'll take a look at whether things need to be extended slightly more.
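
One possible shape for a WaitGroup wrapper with a built-in recover is sketched below. The `safeGroup` type, its `onPanic` hook, and the method names are hypothetical and are not the project's existing wrapper. The idea is that every goroutine launched through the wrapper gets its own guard, so nested launches are covered as long as they also go through it.

```go
package main

import (
	"fmt"
	"sync"
)

// safeGroup is a hypothetical WaitGroup wrapper that installs a recover in
// every goroutine it launches.
type safeGroup struct {
	wg      sync.WaitGroup
	onPanic func(r any) // called with the recovered value, may be nil
}

// Go runs f on a new goroutine guarded by a deferred recover.
func (s *safeGroup) Go(f func()) {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		defer func() {
			if r := recover(); r != nil && s.onPanic != nil {
				s.onPanic(r)
			}
		}()
		f()
	}()
}

// Wait blocks until all goroutines started through Go have returned.
func (s *safeGroup) Wait() {
	s.wg.Wait()
}

// demo launches a goroutine that itself launches a panicking goroutine and
// returns the panic values the wrapper caught.
func demo() []string {
	var (
		mu     sync.Mutex
		caught []string
	)
	g := &safeGroup{onPanic: func(r any) {
		mu.Lock()
		defer mu.Unlock()
		caught = append(caught, fmt.Sprint(r))
	}}

	g.Go(func() {
		// The second-level goroutine is also started through the
		// wrapper, so it carries its own guard.
		g.Go(func() { panic("second-level panic") })
	})
	g.Wait()

	return caught
}

func main() {
	fmt.Println("caught:", demo())
}
```

A goroutine started with a bare `go` statement anywhere in the chain would still be unguarded, so this only helps where the wrapper is used consistently.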

In this commit, we extend the panic recovery mechanism to cover the
serial processing path for AnnounceSignatures1 messages. Unlike other
gossip messages which are processed in parallel goroutines, announcement
signatures are processed serially in the main networkHandler loop.

A panic during this serial processing would previously crash the entire
gossiper. This change wraps the processing in an anonymous function with
a deferred panic recovery, ensuring resilience without changing the
serial processing semantics.

Since AnnounceSignatures bypass the validation barrier, we pass nil for
the jobID parameter.
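
The anonymous-function wrapping described in this commit message can be sketched roughly as follows. The `message` type, `handle`, `recoverPanic`, and `runLoop` are hypothetical names for illustration, not the actual gossiper code. Each iteration gets its own deferred recovery, so a panic ends only that message's processing while the loop keeps its serial order.

```go
package main

import "fmt"

// message is a hypothetical stand-in for a serially processed gossip message.
type message struct {
	id  int
	bad bool
}

// handle panics on a bad message to simulate a logic error.
func handle(m message) {
	if m.bad {
		panic(fmt.Sprintf("malformed message %d", m.id))
	}
}

// recoverPanic must be called directly via defer. jobID may be nil for
// messages that never acquired a validation barrier slot.
func recoverPanic(source string, jobID *uint64, recovered *[]string) {
	r := recover()
	if r == nil {
		return
	}
	*recovered = append(*recovered, fmt.Sprintf("%s: %v", source, r))
	if jobID != nil {
		// A real implementation would release the barrier slot here.
		_ = *jobID
	}
}

// runLoop processes messages one at a time. The anonymous function scopes the
// deferred recovery to a single iteration, so the loop survives a panic.
func runLoop(msgs []message) (processed []int, recovered []string) {
	for _, m := range msgs {
		func() {
			defer recoverPanic("serial", nil, &recovered)
			handle(m)
			processed = append(processed, m.id)
		}()
	}
	return processed, recovered
}

func main() {
	processed, recovered := runLoop([]message{{1, false}, {2, true}, {3, false}})
	fmt.Println("processed:", processed)
	fmt.Println("recovered:", recovered)
}
```

Placing the `defer` directly in the loop body would not work the same way: it would only run when the enclosing function returns, so the first panic would still unwind the whole loop.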
@Roasbeef Roasbeef force-pushed the discovery-panic-recovery branch from b8bd1aa to 7f54408 Compare January 7, 2026 20:35
@Roasbeef Roasbeef requested a review from ziggie1984 January 7, 2026 20:38
Collaborator

@ziggie1984 ziggie1984 left a comment

LGTM

@saubyk saubyk moved this from In progress to In review in lnd v0.20 Jan 8, 2026

Labels

backport-v0.20.x-branch This label is used to trigger the creation of a backport PR to the branch `v0.20.x-branch`. discovery Peer and route discovery / whisper protocol related issues/PRs healthcheck no-changelog

Projects

Status: In review

3 participants