Skip to content

Conversation

@ziggie1984
Copy link
Collaborator

@ziggie1984 ziggie1984 commented Dec 22, 2025

This commit fixes a critical race condition in MarkChanFullyClosed and
pruneLinkNode where link nodes could be incorrectly deleted despite
having pending channel.

The race occurred because the check for open channels and the link node
deletion happened in separate database transactions:

Thread A: TX1 checks open channels → [] (empty)
Thread A: TX1 commits
Thread B: Opens new channel with same peer
Thread A: TX2 deletes link node (using stale data)
Result: Link node deleted despite pending channel existing

This creates a TOCTOU (time-of-check to time-of-use) vulnerability where
database state changes between reading the channel count and deleting
the node.

Fix for MarkChanFullyClosed:

  • Move link node deletion into the same transaction as the channel
    closing check, making the check-and-delete operation atomic

Fix for pruneLinkNode:

  • Add double-check within the write transaction to verify no channels
    were opened since the caller's initial check
  • Maintains performance by keeping early return for common case
  • Prevents deletion if channels exist at delete time

This ensures the invariant: "link node exists iff channels exist"
is never violated, preventing database corruption and potential
connection issues.

@gemini-code-assist
Copy link

Summary of Changes

Hello @ziggie1984, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical race condition within the channeldb package that could result in the erroneous deletion of link nodes, even when active or pending channels existed. By making channel state checks and link node pruning operations atomic or by introducing robust double-checks within database transactions, the changes ensure data integrity and prevent potential connection disruptions caused by stale database states.

Highlights

  • Race Condition Fix: Addressed a critical race condition in MarkChanFullyClosed and pruneLinkNode that could lead to incorrect deletion of link nodes, causing database corruption and connection issues. This resolves a Time-of-Check to Time-of-Use (TOCTOU) vulnerability.
  • Atomic Deletion in MarkChanFullyClosed: The link node deletion logic within MarkChanFullyClosed has been moved into the same database transaction as the channel closing check, ensuring atomicity and preventing race conditions during channel closure.
  • Double-Check in pruneLinkNode: Implemented a robust double-check within the pruneLinkNode's write transaction. This verifies that no new channels were opened since the initial check, preventing premature link node deletion even if concurrent operations occur.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@ziggie1984 ziggie1984 self-assigned this Dec 22, 2025
@ziggie1984 ziggie1984 added channel closing Related to the closing of channels cooperatively and uncooperatively channels data-race Related to data-races labels Dec 22, 2025
@ziggie1984 ziggie1984 added this to the v0.20.1 milestone Dec 22, 2025
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly identifies and fixes a critical TOCTOU race condition in MarkChanFullyClosed and pruneLinkNode. By moving the link node existence check and deletion into the same atomic database transaction, it prevents a state where a link node could be deleted despite a new channel being opened. The fix in MarkChanFullyClosed is direct and effective. The fix in pruneLinkNode correctly uses a double-check pattern to maintain performance while ensuring correctness. I have one suggestion to improve the implementation of pruneLinkNode for better code clarity and consistency.

@ziggie1984
Copy link
Collaborator Author

Happened out there in the wild, was reported by a user:

2025-12-21 21:09:07.723 [INF] PEER: Peer(XXX): Local close channel request is going to be delivered to the peer
2025-12-21 21:09:07.729 [INF] PEER: Peer(XXX): Delivery addr for channel close: XXX
2025-12-21 21:14:56.725 [INF] FNDG: Initiating fundingRequest(local_amt=XXX BTC (subtract_fees=false), push_amt=0 mSAT, chain_hash=XX, peer=XXX, min_confs=XXX)
2025-12-21 21:15:25.451 [INF] CHDB: Pruning link node XXX with zero open channels from database

So although a Pending OpenChannel flow was going on in the background we deleted the node from the db.

@ziggie1984 ziggie1984 marked this pull request as ready for review December 22, 2025 09:05
@ziggie1984 ziggie1984 force-pushed the fix-prune-node-race branch 2 times, most recently from 50cad71 to 1186e3e Compare December 22, 2025 09:22
@ziggie1984 ziggie1984 requested a review from Roasbeef December 22, 2025 09:24
@wildsats
Copy link

My node was impacted by this bug. We had to ask the peer to force close the channel and then used chantools to sweep the funds.

// Decide whether we want to remove the link node, based upon the number
// of still open channels.
return c.pruneLinkNode(openChannels, pruneLinkNode)
err = deleteLinkNode(tx, remotePub)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, we should have been doing this in the same db transaction the entire time.

Do we need any sort of reconciliation logic on start up to handle instances that might've happened in the wild? So sanity check that for each open channel, we have a link node present.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a fix at startup, not sure if it is worth it but I think it does not cost much either and can be removed when moving to sql native ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if it is worth it

My thinking is that there may be nodes in an anomalous state that have a channel, but not an active LinkNode.

Eg, on start up, we use this to determine who we ned to connect out to:

lnd/server.go

Lines 3489 to 3495 in 801de79

// Iterate through the list of LinkNodes to find addresses we should
// attempt to connect to based on our set of previous connections. Set
// the reconnection port to the default peer port.
linkNodes, err := s.chanStateDB.LinkNodeDB().FetchAllLinkNodes()
if err != nil && !errors.Is(err, channeldb.ErrLinkNodesNotFound) {
return fmt.Errorf("failed to fetch all link nodes: %w", err)
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh you are right did not see that. But this is now fixed with the repair function in the beginning of the server startup.

Copy link
Member

@yyforyongyu yyforyongyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️

"from database",
remotePub.SerializeCompressed())

return c.linkNodeDB.DeleteLinkNode(remotePub)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess this DeleteLinkNode can now be deleted?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is still used for tests so that we can add and remove link nodes, I think we can still keep it.

@lightninglabs-deploy
Copy link

@ziggie1984, remember to re-request review from reviewers when ready

@ziggie1984 ziggie1984 requested a review from Roasbeef January 5, 2026 19:47
@ziggie1984 ziggie1984 added the backport-v0.20.x-branch This label is used to trigger the creation of a backport PR to the branch `v0.20.x-branch`. label Jan 5, 2026
// Decide whether we want to remove the link node, based upon the number
// of still open channels.
return c.pruneLinkNode(openChannels, pruneLinkNode)
err = deleteLinkNode(tx, remotePub)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if it is worth it

My thinking is that there may be nodes in an anomalous state that have a channel, but not an active LinkNode.

Eg, on start up, we use this to determine who we ned to connect out to:

lnd/server.go

Lines 3489 to 3495 in 801de79

// Iterate through the list of LinkNodes to find addresses we should
// attempt to connect to based on our set of previous connections. Set
// the reconnection port to the default peer port.
linkNodes, err := s.chanStateDB.LinkNodeDB().FetchAllLinkNodes()
if err != nil && !errors.Is(err, channeldb.ErrLinkNodesNotFound) {
return fmt.Errorf("failed to fetch all link nodes: %w", err)
}

channeldb/db.go Outdated

// Link node doesn't exist, so we need to create it.
// We create it within a transaction to ensure consistency.
err = kvdb.Update(c.backend, func(tx kvdb.RwTx) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not aggregate the DB transactions? WE can call into linkNodeDB in the same transaction. This'll speed up start up time, if we are calling this on start up each time. So the first loop can make a set of the nodes we're connected to and check if a link node exists or not. Then one db txn makes all the links nodes, exiting earlier if that's not needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point did not pay attention that every linknodedb call was creating its own transaction. Now only 2 transactions are used. Also enhanced the linknodedb abstraction with the appropriate functions.

@saubyk saubyk moved this to In review in lnd v0.20 Jan 6, 2026
This commit fixes a critical race condition in MarkChanFullyClosed and
pruneLinkNode where link nodes could be incorrectly deleted despite
having pending or open channels.

The race occurred because the check for open channels and the link node
deletion happened in separate database transactions:

  Thread A: TX1 checks open channels → [] (empty)
  Thread A: TX1 commits
  Thread B: Opens new channel with same peer
  Thread A: TX2 deletes link node (using stale data)
  Result: Link node deleted despite pending channel existing

This creates a TOCTOU (time-of-check to time-of-use) vulnerability where
database state changes between reading the channel count and deleting
the node.

Fix for MarkChanFullyClosed:
- Move link node deletion into the same transaction as the channel
  closing check, making the check-and-delete operation atomic

Fix for pruneLinkNode:
- Add double-check within the write transaction to verify no channels
  were opened since the caller's initial check
- Maintains performance by keeping early return for common case
- Prevents deletion if channels exist at delete time

This ensures the invariant: "link node exists iff channels exist"
is never violated, preventing database corruption and potential
connection issues.
@ziggie1984 ziggie1984 force-pushed the fix-prune-node-race branch from eb5c070 to 270fc8d Compare January 6, 2026 09:52
@ziggie1984
Copy link
Collaborator Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a critical race condition in link node pruning by ensuring that channel checks and node deletions occur within the same atomic database transaction. The introduction of RepairLinkNodes is a thoughtful addition to correct any database inconsistencies that may have resulted from the previous bug. The changes are well-structured, and the new tests are comprehensive. I have one suggestion to refactor a new function for improved readability.

We make sure that nodes previously suffering from this error will
have a consitent db view when restarting their node.
@ziggie1984 ziggie1984 force-pushed the fix-prune-node-race branch from 270fc8d to d9fb909 Compare January 6, 2026 14:58
@ziggie1984 ziggie1984 requested a review from Roasbeef January 6, 2026 19:49
Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🥨

@ziggie1984 ziggie1984 merged commit d5d151c into lightningnetwork:master Jan 7, 2026
41 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in lnd v0.20 Jan 7, 2026
@ziggie1984 ziggie1984 deleted the fix-prune-node-race branch January 7, 2026 07:54
@github-actions
Copy link

github-actions bot commented Jan 7, 2026

Successfully created backport PR for v0.20.x-branch:

Roasbeef added a commit that referenced this pull request Jan 7, 2026
…20.x-branch

[v0.20.x-branch] Backport #10462: channeldb: fix race condition in link node pruning
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-v0.20.x-branch This label is used to trigger the creation of a backport PR to the branch `v0.20.x-branch`. channel closing Related to the closing of channels cooperatively and uncooperatively channels data-race Related to data-races no-changelog

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants