-
Notifications
You must be signed in to change notification settings - Fork 2.2k
discovery: add panic recovery for gossip message processing #10470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
discovery: add panic recovery for gossip message processing #10470
Conversation
Summary of ChangesHello @Roasbeef, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the fault tolerance of the gossip message processing within the discovery subsystem. By implementing a robust panic recovery mechanism, the system can now gracefully handle unexpected runtime errors in critical goroutines, preventing service interruptions. The recovery process includes logging detailed stack traces and ensuring proper resource cleanup, which will be invaluable for diagnosing and fixing underlying issues. This change represents a proactive step towards a more stable and resilient network operation. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a valuable panic recovery mechanism for gossip message processing, significantly enhancing the daemon's robustness. The implementation is sound, and the accompanying tests are thorough. My main suggestion is to refactor the logging within the new recoverGossipPanic function to use structured logging, aligning with the repository's style guide. This will improve log observability and consistency. I've provided detailed comments and code suggestions to facilitate this change.
| defer ctx.gossiper.recoverGossipPanic( | ||
| "testing", nMsg, &jobID, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To enable structured logging, the call to recoverGossipPanic needs to be updated to pass a background context.
| defer ctx.gossiper.recoverGossipPanic( | |
| "testing", nMsg, &jobID, | |
| ) | |
| defer ctx.gossiper.recoverGossipPanic( | |
| context.Background(), "testing", nMsg, &jobID, | |
| ) |
References
- The style guide (lines 235-253) mandates the use of structured logging (
slog), which requires passing a context to logging functions likelog.ErrorS. (link)
| defer ctx.gossiper.recoverGossipPanic( | ||
| "testing", nMsg, &parentJobID, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To enable structured logging, the call to recoverGossipPanic needs to be updated to pass a background context.
| defer ctx.gossiper.recoverGossipPanic( | |
| "testing", nMsg, &parentJobID, | |
| ) | |
| defer ctx.gossiper.recoverGossipPanic( | |
| context.Background(), "testing", nMsg, &parentJobID, | |
| ) |
References
- The style guide (lines 235-253) mandates the use of structured logging (
slog), which requires passing a context to logging functions likelog.ErrorS. (link)
|
@Roasbeef, remember to re-request review from reviewers when ready |
db8a839 to
b8bd1aa
Compare
ziggie1984
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach looks fine to me, had minor comments,
however we need to be aware that this recovery design has its limits, it will for example not recover panics which happen in second level goroutines. Most of our DB insertions happen in the batch function which is a separate goroutine so we won't catch these errrors for example:
handleNetworkMessages (Goroutine 1)
├─ defer recoverGossipPanic() ← Can only catch panics in THIS goroutine
│
├─ processNetworkAnnouncement()
│ └─ handleChanAnnouncement()
│ └─ d.cfg.Graph.AddEdge(ctx, edge, ops...)
│ └─ b.addEdge(ctx, edge, op...)
│ └─ b.cfg.Graph.AddChannelEdge(ctx, edge, op...)
│ └─ scheduler.Execute(ctx, request)
│ └─ go s.b.trigger(ctx) ← NEW GOROUTINE! (batch/scheduler.go:85)
│ └─ b.run(ctx) ← NO PANIC RECOVERY!
│ └─ req.Do(tx) ← 💥 If panic here → DAEMON CRASHES!
│
└─ recoverGossipPanic CANNOT catch the panic in the batch goroutine!
That's a good point. I think to cover deeper chains like that we would consider wrapping our existing |
In this commit, we add a centralized panic recovery mechanism for gossip goroutines. This increases the robustness of message processing in the gossiper, as now we are able to keep on trucking in the face of logic errors that may lead to panics. We ensure that any deps are freed and we log the panic trace to help catch bugs in the future.
In this commit, we extend the panic recovery mechanism to cover the serial processing path for AnnounceSignatures1 messages. Unlike other gossip messages which are processed in parallel goroutines, announcement signatures are processed serially in the main networkHandler loop. A panic during this serial processing would previously crash the entire gossiper. This change wraps the processing in an anonymous function with a deferred panic recovery, ensuring resilience without changing the serial processing semantics. Since AnnounceSignatures bypass the validation barrier, we pass nil for the jobID parameter.
b8bd1aa to
7f54408
Compare
ziggie1984
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
In this commit, we add a centralized panic recovery mechanism for gossip
goroutines. This increases the robustness of message processing in the
gossiper, as now we are able to keep on trucking in the face of logic
errors that may lead to panics.
We ensure that any deps are freed and we log the panic trace to help
catch bugs in the future.
IMO this is a defensive pattern we should adopt in other sub-systems that
implement the p2p facing functionality of the daemon. A lil defensive
programming can go a long way.