
Fix segfault on closing down network connections#579

Open
edwardalee wants to merge 3 commits into main from net-fixes


@edwardalee
Contributor

The problem was diagnosed and fixed by Claude Code. Below is its diagnosis:

Root Cause: Use-After-Free from shutdown_net Freeing Heap Memory

The segfault is caused by a use-after-free race condition introduced by PR #508's switch from raw socket file descriptors to heap-allocated socket_priv_t structs.

The Key Change

Before PR #508, connections were plain int socket file descriptors:

// Old lf_terminate_execution — no memory to free:
shutdown_socket(&_fed.sockets_for_inbound_p2p_connections[i], false);
_fed.sockets_for_inbound_p2p_connections[i] = -1;

After PR #508, connections are net_abstraction_t (pointers to malloc'd socket_priv_t):

int shutdown_net(net_abstraction_t net_abs, bool read_before_closing) {
  if (net_abs == NULL) {
    LF_PRINT_LOG("Socket already closed.");
    return 0;
  }
  socket_priv_t* priv = (socket_priv_t*)net_abs;
  int ret = shutdown_socket(&priv->socket_descriptor, read_before_closing);
  free_net(net_abs);  // <-- FREES the socket_priv_t!
  return ret;
}

The Race

In lf_terminate_execution, inbound P2P connections are shut down (and freed) before the listener threads are joined:

  LF_PRINT_DEBUG("Closing incoming P2P network abstractions.");
  // Close any incoming P2P network abstractions that are still open.
  for (int i = 0; i < NUMBER_OF_FEDERATES; i++) {
    shutdown_net(_fed.net_for_inbound_p2p_connections[i], false);  // frees socket_priv_t!
    // ... sets to NULL ...
  }
  // ... closes outbound connections ...

  LF_PRINT_DEBUG("Waiting for inbound p2p network abstraction listener threads.");
  // Wait for each inbound network abstraction listener thread to close.
  // ... joins threads AFTER the memory is already freed ...

Meanwhile, listen_to_federates holds a local copy of the pointer:

  net_abstraction_t net = _fed.net_for_inbound_p2p_connections[fed_id];
  // ...
  while (!_lf_termination_executed) {
    // ...
    if (read_from_net_close_on_error(net, 1, buffer)) {

And read_from_net_close_on_error dereferences the (now freed) pointer on error:

int read_from_net_close_on_error(net_abstraction_t net_abs, size_t num_bytes, unsigned char* buffer) {
  LF_ASSERT_NON_NULL(net_abs);
  socket_priv_t* priv = (socket_priv_t*)net_abs;
  int read_failed = read_from_net(net_abs, num_bytes, buffer);
  if (read_failed) {
    shutdown_socket(&priv->socket_descriptor, false);  // priv points to freed memory!
    return -1;
  }
  return 0;
}

The crash sequence:

  1. Main thread enters termination() → calls lf_terminate_execution
  2. shutdown_net closes the socket (unblocking the listener's read()) and frees the socket_priv_t
  3. Listener thread's blocked read() returns with an error
  4. read_from_net_close_on_error accesses priv->socket_descriptor through the freed pointer → use-after-free
  5. shutdown_socket writes *socket = -1 to the freed memory → heap corruption → SIGSEGV

Before PR #508, this was safe because shutdown_socket only closed the integer file descriptor — no heap memory was involved, so there was nothing for the concurrent thread to dereference into freed memory.

There's Also the Same Race in handle_tagged_message

The stop-tag discard path (seen in your trace) also frees the connection while listen_to_federates still holds a local pointer to it:

      _lf_done_using(message_token);
      // Close network abstraction, reading any incoming data and discarding it.
      shutdown_net(_fed.net_for_inbound_p2p_connections[fed_id], false);  // frees the memory
      _fed.net_for_inbound_p2p_connections[fed_id] = NULL;
      LF_MUTEX_UNLOCK(&env->mutex);
      return -1;

In this case, handle_tagged_message is called from listen_to_federates on the same thread, and the listener exits immediately afterward — so this path alone is safe. However, if lf_terminate_execution concurrently reads _fed.net_for_inbound_p2p_connections[fed_id] between the shutdown_net (free) and the = NULL assignment, it would call shutdown_net on a freed (non-NULL) pointer — a double-free.
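One common way to close this kind of free-then-NULL window is to take ownership of the pointer and clear the shared slot in a single critical section, so only one caller ever sees a non-NULL value. The following is a minimal sketch of that idiom, not the change made in this PR; `demo_slots`, `demo_take_slot`, `demo_release_slot`, and `demo_fill_slot` are hypothetical names, and a plain `malloc`'d block stands in for the connection.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_NUM_SLOTS 4

// Hypothetical stand-in for _fed.net_for_inbound_p2p_connections.
static void* demo_slots[DEMO_NUM_SLOTS];
static pthread_mutex_t demo_slots_mutex = PTHREAD_MUTEX_INITIALIZER;

// Atomically take ownership of slot i: return its pointer and leave NULL behind.
static void* demo_take_slot(size_t i) {
  pthread_mutex_lock(&demo_slots_mutex);
  void* taken = demo_slots[i];
  demo_slots[i] = NULL;
  pthread_mutex_unlock(&demo_slots_mutex);
  return taken;
}

// Store a fresh allocation in slot i. Returns 0 on success, -1 on failure.
int demo_fill_slot(size_t i) {
  assert(i < DEMO_NUM_SLOTS);
  void* fresh = malloc(16);
  if (fresh == NULL) {
    return -1;
  }
  pthread_mutex_lock(&demo_slots_mutex);
  demo_slots[i] = fresh;
  pthread_mutex_unlock(&demo_slots_mutex);
  return 0;
}

// Free slot i if this caller wins ownership.
// Returns 1 if this call freed the memory, 0 if the slot was already empty.
int demo_release_slot(size_t i) {
  assert(i < DEMO_NUM_SLOTS);
  void* taken = demo_take_slot(i);
  if (taken == NULL) {
    return 0;
  }
  free(taken);
  return 1;
}
```

With this shape, a second `demo_release_slot` call on the same index returns 0 instead of freeing again, regardless of which thread calls first.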


Two Pre-Existing (Non-Segfault) Bugs Exposed in the Same Path

These existed before PR #508 but are worth noting:

1. Missing tag barrier decrement (decentralized mode). The tag barrier is incremented at line 546 but the stop-tag discard path returns at line 650 without calling _lf_decrement_tag_barrier_locked(env):

  _lf_increment_tag_barrier(env, intended_tag);

The normal exit and failed-read paths both decrement it, but the stop-tag path does not.
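A missed decrement on one early return is easier to avoid when every exit funnels through a single cleanup label. The sketch below illustrates that pattern with a plain counter; it is an assumption-laden toy, not the runtime's code, and `demo_barrier_count`, `demo_handle_message`, and the other `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical counter standing in for the tag barrier.
static int demo_barrier_count = 0;

static void demo_increment_barrier(void) { demo_barrier_count++; }

static void demo_decrement_barrier(void) {
  assert(demo_barrier_count > 0);
  demo_barrier_count--;
}

// Read the current barrier count (for inspection only).
int demo_barrier_value(void) { return demo_barrier_count; }

// Hypothetical handler with three exit paths: failed read, stop-tag discard,
// and normal completion. All of them leave through the same label, so the
// barrier is decremented exactly once per increment.
int demo_handle_message(bool read_failed, bool past_stop_tag) {
  int result = 0;
  demo_increment_barrier();

  if (read_failed) {
    result = -1;
    goto done;
  }
  if (past_stop_tag) {
    // Discard path: the message is dropped, but cleanup still runs.
    result = -1;
    goto done;
  }
  // Normal path: the message would be scheduled here.

done:
  demo_decrement_barrier();
  return result;
}
```

After any mix of calls, `demo_barrier_value()` returns to 0, which is the invariant the stop-tag path was breaking.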

2. Token with ref_count = 0 passed to _lf_done_using. _lf_new_token creates a token with ref_count = 0. The stop-tag path calls _lf_done_using(message_token), which sees ref_count == 0, prints the "Token being freed that has already been freed" warning, and returns without freeing either the token or the message_contents payload — a memory leak.
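The leak described in item 2 can be modeled with a toy reference-counted token. This sketch only mimics the behavior stated above (a release call that treats `ref_count == 0` as "already freed"); the struct layout and every `demo_` name are hypothetical, and the direct-free helper is one possible remedy, not necessarily what this PR does.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical token: a payload plus a reference count.
typedef struct {
  int ref_count;
  void* value;
} demo_token_t;

// Tracks outstanding allocations so the leak is observable.
static int demo_live_allocations = 0;

int demo_live(void) { return demo_live_allocations; }

// Create a token with ref_count = 0, as described for _lf_new_token.
demo_token_t* demo_new_token(size_t payload_size) {
  demo_token_t* token = malloc(sizeof(demo_token_t));
  if (token == NULL) {
    return NULL;
  }
  token->value = malloc(payload_size);
  if (token->value == NULL) {
    free(token);
    return NULL;
  }
  token->ref_count = 0;
  demo_live_allocations += 2;
  return token;
}

// Mimics the described release behavior: a zero count is treated as
// "already freed", so nothing is released. Returns the number of frees done.
int demo_done_using(demo_token_t* token) {
  if (token->ref_count == 0) {
    return 0;  // Warn-and-return path: token and payload both leak.
  }
  if (--token->ref_count > 0) {
    return 0;
  }
  free(token->value);
  free(token);
  demo_live_allocations -= 2;
  return 2;
}

// One possible remedy (an assumption): free a never-shared token directly.
int demo_discard_unused(demo_token_t* token) {
  assert(token->ref_count == 0);
  free(token->value);
  free(token);
  demo_live_allocations -= 2;
  return 2;
}
```

Calling `demo_done_using` on a fresh token leaves both allocations live, while `demo_discard_unused` releases them.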


Implemented Fix

The core fix is to separate socket shutdown from memory deallocation so that lf_terminate_execution can unblock the listener threads without freeing the memory they reference. Specifically:

Split the old shutdown_net into two phases: a close_net that only closes the socket (to unblock reads), and a separate free_net that deallocates memory. shutdown_net remains available as close_net followed by free_net. Call close_net before joining the listener threads, and free_net after.
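The ordering of the two phases can be sketched in isolation as close, then join, then free. The example below is a simplified stand-in, assuming a POSIX system with pthreads: it uses a Unix-domain `socketpair` instead of a real federate connection, and `demo_conn_t`, `demo_close`, `demo_free`, `demo_listener`, and `demo_shutdown_sequence` are hypothetical names rather than the runtime's API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical stand-in for socket_priv_t: a heap-allocated connection record.
typedef struct {
  int socket_descriptor;
} demo_conn_t;

// Phase 1: shut the socket down so a blocked read() returns. Frees nothing.
static void demo_close(demo_conn_t* conn) {
  assert(conn != NULL);
  shutdown(conn->socket_descriptor, SHUT_RDWR);
}

// Phase 2: release the descriptor and the memory.
// Only safe once no other thread can still hold the pointer.
static void demo_free(demo_conn_t* conn) {
  assert(conn != NULL);
  close(conn->socket_descriptor);
  free(conn);
}

// Listener: blocks in read() and touches conn again after read() returns,
// much like the error path in read_from_net_close_on_error.
static void* demo_listener(void* arg) {
  demo_conn_t* conn = (demo_conn_t*)arg;
  unsigned char byte;
  while (read(conn->socket_descriptor, &byte, 1) > 0) {
    // Discard incoming data.
  }
  // Still valid here: the memory is not freed until after the join.
  (void)conn->socket_descriptor;
  return NULL;
}

// Runs the close -> join -> free sequence. Returns 0 on success, -1 on error.
int demo_shutdown_sequence(void) {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
    return -1;
  }
  demo_conn_t* conn = malloc(sizeof(demo_conn_t));
  if (conn == NULL) {
    close(fds[0]);
    close(fds[1]);
    return -1;
  }
  conn->socket_descriptor = fds[0];

  pthread_t listener;
  if (pthread_create(&listener, NULL, demo_listener, conn) != 0) {
    demo_free(conn);
    close(fds[1]);
    return -1;
  }

  demo_close(conn);  // Unblocks the listener without freeing.
  int join_result = pthread_join(listener, NULL);
  demo_free(conn);   // No thread references conn anymore.
  close(fds[1]);
  return join_result == 0 ? 0 : -1;
}
```

The sketch uses `shutdown` rather than `close` for the first phase because `shutdown` wakes a reader blocked on the socket while leaving the descriptor number reserved until the later `close`.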

Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a federated runtime shutdown crash by separating “close the connection” from “free the network abstraction,” preventing listener threads from dereferencing freed heap memory during termination.

Changes:

  • Introduces close_net() (close-only) and refactors shutdown_net() to be close_net() + free_net().
  • Updates federate shutdown logic to close inbound P2P connections before joining listener threads, and free them only after listeners exit; also fixes the stop-tag discard path to properly free tokens and decrement the tag barrier.
  • Documentation-only update to LF code-fence language tags in initialize_from_file.h.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File | Description
util/initialize_from_file.h | Updates documentation code-fence language tags (lf-clf).
network/impl/src/lf_socket_support.c | Implements close_net() and refactors shutdown_net() to close then free.
network/api/net_abstraction.h | Documents and exposes close_net() and free_net() APIs, clarifying threading expectations.
core/federated/federate.c | Uses close_net() to unblock inbound listener threads before join; frees after join; fixes stop-tag discard token/barrier handling.



3 participants