Fix segfault on closing down network connections#579
Open
edwardalee wants to merge 3 commits intomainfrom
Open
Fix segfault on closing down network connections#579edwardalee wants to merge 3 commits intomainfrom
edwardalee wants to merge 3 commits intomainfrom
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR addresses a federated runtime shutdown crash by separating “close the connection” from “free the network abstraction,” preventing listener threads from dereferencing freed heap memory during termination.
Changes:
- Introduces
close_net()(close-only) and refactorsshutdown_net()to beclose_net()+free_net(). - Updates federate shutdown logic to close inbound P2P connections before joining listener threads, and free them only after listeners exit; also fixes the stop-tag discard path to properly free tokens and decrement the tag barrier.
- Documentation-only update to LF code-fence language tags in
initialize_from_file.h.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| util/initialize_from_file.h | Updates documentation code-fence language tags (lf-c → lf). |
| network/impl/src/lf_socket_support.c | Implements close_net() and refactors shutdown_net() to close then free. |
| network/api/net_abstraction.h | Documents and exposes close_net() and free_net() APIs, clarifying threading expectations. |
| core/federated/federate.c | Uses close_net() to unblock inbound listener threads before join; frees after join; fixes stop-tag discard token/barrier handling. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The problem was diagnosed and fixed by Claude Code. Below in its diagnosis:
Root Cause: Use-After-Free from
shutdown_netFreeing Heap MemoryThe segfault is caused by a use-after-free race condition introduced by PR #508's switch from raw socket file descriptors to heap-allocated
socket_priv_tstructs.The Key Change
Before PR #508, connections were plain
intsocket file descriptors:After PR #508, connections are
net_abstraction_t(pointers tomalloc'dsocket_priv_t):The Race
In
lf_terminate_execution, inbound P2P connections are shut down (and freed) before the listener threads are joined:Meanwhile,
listen_to_federatesholds a local copy of the pointer:And
read_from_net_close_on_errordereferences the (now freed) pointer on error:The crash sequence:
termination()→ callslf_terminate_executionshutdown_netcloses the socket (unblocking the listener'sread()) and frees thesocket_priv_tread()returns with an errorread_from_net_close_on_erroraccessespriv->socket_descriptorthrough the freed pointer → use-after-freeshutdown_socketwrites*socket = -1to the freed memory → heap corruption → SIGSEGVBefore PR #508, this was safe because
shutdown_socketonly closed the integer file descriptor — no heap memory was involved, so there was nothing for the concurrent thread to dereference into freed memory.There's Also the Same Race in
handle_tagged_messageThe stop-tag discard path (seen in your trace) also frees the connection while
listen_to_federatesstill holds a local pointer to it:In this case,
handle_tagged_messageis called fromlisten_to_federateson the same thread, and the listener exits immediately afterward — so this path alone is safe. However, iflf_terminate_executionconcurrently reads_fed.net_for_inbound_p2p_connections[fed_id]between theshutdown_net(free) and the= NULLassignment, it would callshutdown_neton a freed (non-NULL) pointer — a double-free.Two Pre-Existing (Non-Segfault) Bugs Exposed in the Same Path
These existed before PR #508 but are worth noting:
1. Missing tag barrier decrement (decentralized mode). The tag barrier is incremented at line 546 but the stop-tag discard path returns at line 650 without calling
_lf_decrement_tag_barrier_locked(env):The normal exit and failed-read paths both decrement it, but the stop-tag path does not.
2. Token with
ref_count = 0passed to_lf_done_using._lf_new_tokencreates a token withref_count = 0. The stop-tag path calls_lf_done_using(message_token), which seesref_count == 0, prints the "Token being freed that has already been freed" warning, and returns without freeing either the token or themessage_contentspayload — a memory leak.Implemented Fix
The core fix is to separate socket shutdown from memory deallocation so that
lf_terminate_executioncan unblock the listener threads without freeing the memory they reference. Specifically:Split
shutdown_netinto two phases: ashutdown_netthat only closes the socket (to unblock reads), and a separatefree_netthat deallocates memory. Callshutdown_netbefore joining threads, andfree_netafter.