Background
PR #10719 fixes a deadlock in channelLink.Stop() by reordering the
teardown sequence so that htlcManager exits before hodl subscription
state is cleaned up. That fix closes the specific race window and is
safe to ship immediately.
However, the underlying design has a structural fragility that the
ordering fix only papers over. This issue tracks the deeper redesign
needed to make hodl subscriptions robust by construction.
The Structural Problem
The current design couples the lifetime of a ConcurrentQueue
(owned by channelLink) with the routing of registry callbacks
(owned by InvoiceRegistry). Specifically:
channelLink creates a hodlQueue *queue.ConcurrentQueue and passes
hodlQueue.ChanIn() directly into InvoiceRegistry.NotifyExitHopHtlc.
- The registry stores that raw
chan<- interface{} in
hodlSubscriptions and later writes to it in notifyHodlSubscribers.
- There is no mechanism in the registry to know whether the channel is
still live at the time of the write.
This means the registry's correctness depends on an external invariant
(the link's queue must outlive any subscription registered against it)
that is enforced only by teardown ordering — a fragile guarantee that
can be violated by any future refactor of Stop(), by new callers of
NotifyExitHopHtlc, or by code paths that bypass the ordering fix.
Proposed Redesign Directions
Option A — Attach a quit channel to each subscription
Pass a <-chan struct{} (e.g. cg.Done()) alongside ChanIn() when
subscribing. notifyHodlSubscribers selects on both:
select {
case sub <- resolution:
case <-quit:
// link is gone; drop and clean up
}
This makes every send non-blocking with respect to link lifetime and
lets the registry self-heal without relying on unsubscription ordering.
It is the fix direction suggested in issue #8803.
Option B — Registry-owned delivery, link-owned callback
Instead of exposing a raw channel, channelLink registers a callback
function with the registry. The registry calls the callback directly;
the link's callback can check cg.Done() internally and discard the
resolution if the link is shutting down. This removes the shared channel
entirely and makes the boundary explicit.
Option C — Move subscription lifecycle into channelLink
Give channelLink full ownership: it subscribes on start, it
unsubscribes on stop, and it owns the queue. htlcManager subscribes
once at startup via a dedicated method that returns a typed subscription
handle (rather than a raw channel). Teardown calls handle.Cancel()
which atomically marks the subscription dead in the registry before
draining.
Why This Matters for SQL Migration
The deadlock risk is amplified by KV channeldb (bbolt). The global
write lock on bbolt increases processRemoteRevokeAndAck latency,
which widens the race window between hodlQueue.Stop() and
cg.WgWait() that the ordering fix closes. Once channeldb is fully
migrated to native SQL (tracked separately), the amplification factor
drops — but the structural fragility remains and should be addressed
regardless of the storage backend.
Acceptance Criteria
Related
Background
PR #10719 fixes a deadlock in
channelLink.Stop()by reordering theteardown sequence so that
htlcManagerexits before hodl subscriptionstate is cleaned up. That fix closes the specific race window and is
safe to ship immediately.
However, the underlying design has a structural fragility that the
ordering fix only papers over. This issue tracks the deeper redesign
needed to make hodl subscriptions robust by construction.
The Structural Problem
The current design couples the lifetime of a
ConcurrentQueue(owned by
channelLink) with the routing of registry callbacks(owned by
InvoiceRegistry). Specifically:channelLinkcreates ahodlQueue *queue.ConcurrentQueueand passeshodlQueue.ChanIn()directly intoInvoiceRegistry.NotifyExitHopHtlc.chan<- interface{}inhodlSubscriptionsand later writes to it innotifyHodlSubscribers.still live at the time of the write.
This means the registry's correctness depends on an external invariant
(the link's queue must outlive any subscription registered against it)
that is enforced only by teardown ordering — a fragile guarantee that
can be violated by any future refactor of
Stop(), by new callers ofNotifyExitHopHtlc, or by code paths that bypass the ordering fix.Proposed Redesign Directions
Option A — Attach a quit channel to each subscription
Pass a
<-chan struct{}(e.g.cg.Done()) alongsideChanIn()whensubscribing.
notifyHodlSubscribersselects on both:This makes every send non-blocking with respect to link lifetime and
lets the registry self-heal without relying on unsubscription ordering.
It is the fix direction suggested in issue #8803.
Option B — Registry-owned delivery, link-owned callback
Instead of exposing a raw channel,
channelLinkregisters a callbackfunction with the registry. The registry calls the callback directly;
the link's callback can check
cg.Done()internally and discard theresolution if the link is shutting down. This removes the shared channel
entirely and makes the boundary explicit.
Option C — Move subscription lifecycle into channelLink
Give
channelLinkfull ownership: it subscribes on start, itunsubscribes on stop, and it owns the queue.
htlcManagersubscribesonce at startup via a dedicated method that returns a typed subscription
handle (rather than a raw channel). Teardown calls
handle.Cancel()which atomically marks the subscription dead in the registry before
draining.
Why This Matters for SQL Migration
The deadlock risk is amplified by KV channeldb (bbolt). The global
write lock on bbolt increases
processRemoteRevokeAndAcklatency,which widens the race window between
hodlQueue.Stop()andcg.WgWait()that the ordering fix closes. Once channeldb is fullymigrated to native SQL (tracked separately), the amplification factor
drops — but the structural fragility remains and should be addressed
regardless of the storage backend.
Acceptance Criteria
notifyHodlSubscriberscannot block indefinitely due to a deadsubscriber channel, regardless of teardown ordering.
is encapsulated behind a typed API that makes misuse a compile-time
or early-runtime error rather than a deadlock.
channelLink.Stop()andnotifyHodlSubscribersto prevent re-introduction of the race.Related
crack — dangling reference" and proposed quit-channel approach)