Refactor networking #66
base: dev
Conversation
Ports under 1024 are privileged and were failing (at least on Linux) when running as a normal user.
We already have oxen-libquic, oxen-logging, and nlohmann via lokinet, so get them via that nested submodule rather than having a duplicated submodule in libsession-util itself. Also removes the macos workaround call to `oxen_logging_add_source_dir` because that directive no longer does anything.
- LOKINET_EMBEDDED=ON is replaced with LOKINET_FULL=OFF
- LOKINET_BOOTSTRAP was removed
- LOKINET_DAEMON=OFF is not strictly needed (it should be the default) but makes it clear what we're doing.
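Assuming a conventional out-of-source CMake configure step (the invocation below is illustrative, not taken from the project's docs), the flag changes described above would look something like:

```shell
# Hypothetical configure command reflecting the new flags:
#   LOKINET_FULL=OFF replaces the old LOKINET_EMBEDDED=ON,
#   LOKINET_DAEMON=OFF is spelled out even though it should be the default.
cmake -B build \
    -DLOKINET_FULL=OFF \
    -DLOKINET_DAEMON=OFF
```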
- Load libquic before oxenc/oxen-logging so that libquic has a chance to set up its oxen-logging, etc. targets before libsession tries.
- Remove unneeded settings to disable tests/docs (these are [now] the dependency defaults when not doing a top-level project build).
- Update to depend on the proper lokinet::liblokinet target.
• Added missing config options
• Added exponential backoffs for retries (and a retry limit for path building)
• Fixed a couple of issues with the logic to finish refreshing the snode pool
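The exponential backoff with a retry cap mentioned above could be sketched roughly as follows (the function name, base delay, cap, and retry limit are all assumptions for illustration, not the actual implementation):

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical sketch of an exponential backoff schedule: the delay doubles
// with each attempt, capped at a maximum, and path building gives up after a
// fixed number of retries.
std::chrono::milliseconds backoff_delay(
        int attempt,
        std::chrono::milliseconds base = std::chrono::milliseconds{500},
        std::chrono::milliseconds cap = std::chrono::milliseconds{30000}) {
    // Clamp the shift so large attempt counts can't overflow.
    auto delay = base * (1L << std::min(attempt, 16));
    return std::min(delay, cap);
}

constexpr int MAX_PATH_BUILD_RETRIES = 5;  // assumed limit, for illustration
```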
• Fixed a use-after-move issue
• Fixed an issue where the OnionRequestRouter would start trying to make requests before the SnodePool bootstrap was completed
• Added a missing import
• Updated the OnionRequestRouter to wait for the SnodePool to be populated before allowing any requests to be sent
• Updated the SnodePool to make ephemeral connections to refresh its cache (that way we won't always use seed node connections for subsequent requests on new accounts)
• Fixed some use-after-move issues
• Fixed an issue where the SnodePool bootstrap request response wasn't being handled
• Fixed an infinite loop with the OnionRequestRouter refreshing the SnodePool while a refresh was already running
• Fixed an edge-case where the SnodePool wouldn't trigger a refresh when all nodes are marked as failed
• Added parsing and exposing of the general network settings the clients use (network_time_offset, hardfork_version, softfork_version)
• Added error handling from old logic
• Added 421 retry handling
• Fixed an issue where retrying the snode refresh would cause a deadlock
• Added initial LokinetRouter wrapper
• Added changes that were missing from previous commit
• Updated QuicTransport to be able to send requests to RemoteAddress directly
• Added factory functions for the FileServer endpoints the clients use
• Ran the formatter
• Fixed a linker error
• Fixed a bug where we were incorrectly reporting successful responses as failures
• Added a log when succeeding after a 421 retry (old code had it)
• Added logic to mark a node as failed after a QUIC handshake timeout
• Added a connection status hook and logic to track the connection status
• Added a function to retrieve the current active paths (TODO for the LokinetRouter)
• Added logic so the OnionRequestRouter can observe connection failures to its guard nodes and trigger path rebuilds when they happen
• Fixed an issue where paths in 'single_path_mode' wouldn't get rebuilt
• Renamed the `ENABLE_ONIONREQ` flag to `ENABLE_NETWORKING`
• Started working on unit tests
• Updated to the latest lokinet
• Tweaked the lokinet config to listen on a random port (multiple simulators were colliding without this)
• Fixed a bug where we would incorrectly use the timestamp value returned from a server for the network offset time (some servers return seconds instead of milliseconds, which breaks things)
• Started fixing up unit tests
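The seconds-vs-milliseconds timestamp bug described above could be guarded against with a normalisation step along these lines (the function name and threshold are assumptions for illustration, not the actual fix):

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical normalisation: a recent Unix timestamp in milliseconds is on
// the order of 1.7e12, while one in seconds is ~1.7e9, so any value below a
// generous threshold can safely be treated as seconds and scaled up.
std::chrono::milliseconds normalize_server_timestamp(int64_t value) {
    constexpr int64_t ms_threshold = 100'000'000'000;  // ~1973 in ms, year ~5138 in s
    if (value < ms_threshold)
        return std::chrono::milliseconds{value * 1000};  // server sent seconds
    return std::chrono::milliseconds{value};  // already milliseconds
}
```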
• Added the `DirectRouter`
• Added unit tests for the SnodePool `get_unused_nodes` function
• Updated SnodePool to use `weak_ptr` everywhere to avoid invalid memory crashes during tests
• Removed old outdated unit tests
• Fixed a bug where the RequestQueue could incorrectly start checking for request timeouts even though it didn't need to
# Conflicts:
#	external/oxen-libquic
#	src/session_network.cpp
#	tests/test_session_network.cpp
// Use 'call_get' to force this to be synchronous
if (_loop)
    _loop->call_get([this] { _close_connections(); });
log::debug(cat, "[OnionRequestRouter] Destroyed.");
All of these `[OnionRequestRouter]` embedded strings feel like a duplication of what the `log::Cat` is meant to be. Can we change this around to `auto cat = oxen::log::Cat("OnionRequestRouter");` and then drop all of these redundant message prefixes?
// Attempt to verify connectivity to the guard node
_pending_paths[path_id] = path_nodes;
auto guard_node = path_nodes.front();
Can we do a global rename of "guard" to "edge", so that we have matching terminology between onion requests and (lokinet) onion routing?
constexpr auto ENDPOINT_FILE = "file";
}  // namespace
Request upload( |
We need to rethink the API for upload/download, because uploads and downloads will no longer be generic "requests" soon (i.e. with fileserver on lokinet) but instead will be per-stream quic transfers, and so shoehorning it into a "Request" means we will have to break the API when we add streamed upload/downloads. (Lokinet requests also don't have headers, but rather use a dict prepended to the file stream data).
Basically, when we use onion requests, we need the file server host, pubkey, endpoint, and so on, but when you are doing lokinet, the transfer will be entirely different and we need something that can accommodate both approaches so that we don't have to break any binding code when we add the alternative.
That means that we need an abstraction that represents an "upload"/"download"/"get_client_version" where the host and keys are implementation details (i.e. inside the onion request handler, or entirely different lokinet keys/handling/etc. in the lokinet handler) rather than request components that can be determined out here when constructing the Request object, simply because of how different these mechanisms are going to be.
The other thing that we likely want is a different callback mechanism that is going to work seamlessly as we transition to streaming encryption and file transfers. This is the current interface for submitting a request:
virtual void send_request(Request request, network_response_callback_t callback);
where you just have a one-shot, all-done callback. That's fine for most requests, but for uploads/downloads we are going to want something different.
Just to brainstorm ideas, something like this would work fine for streaming, and wouldn't be too burdensome for onion request uploads/downloads:
struct file_metadata {
std::string id;
int size;
std::chrono::sys_seconds uploaded;
std::chrono::sys_seconds expiry;
};
struct UploadRequest {
UploadRequest(
std::function<std::vector<unsigned char>()> next_data,
std::optional<std::string> file_name,
std::chrono::milliseconds stall_timeout,
std::function<void(UploadRequest& req, std::variant<file_metadata, int16_t> info_or_errcode, bool timeout)> on_complete
);
// ...
};
struct DownloadRequest {
DownloadRequest(
std::string file_id,
std::chrono::milliseconds stall_timeout,
std::function<void(DownloadRequest& req, std::variant<file_metadata, int16_t> info_or_errcode, bool timeout)> on_complete,
std::function<void(DownloadRequest& req, const file_metadata& info, std::vector<unsigned char>)>
on_data = nullptr,
std::chrono::milliseconds partial_min_interval = 250ms
);
// ...
};
// And in IRouter:
class IRouter {
public:
// ...
virtual void upload(std::shared_ptr<UploadRequest> up) = 0;
virtual void download(std::shared_ptr<DownloadRequest> down) = 0;
};
where `stall_timeout` is a timeout that fires when nothing has progressed (no successful path build, or no new data could be sent (upload) or has been received (download) in the given time interval). This is a bit like `request_timeout`, except that it resets every time something progresses so that large uploads/downloads on slow connections still work without timing out.
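The reset-on-progress behaviour described above can be sketched as a small timer class (the class and member names are illustrative assumptions, not part of any proposed API):

```cpp
#include <chrono>

// Minimal sketch of a stall timeout: unlike a fixed request timeout, the
// deadline slides forward every time progress is observed, so slow-but-alive
// transfers never fire it.
class StallTimer {
    std::chrono::steady_clock::time_point last_progress_;
    std::chrono::milliseconds timeout_;

  public:
    explicit StallTimer(std::chrono::milliseconds timeout) :
            last_progress_{std::chrono::steady_clock::now()}, timeout_{timeout} {}

    // Call whenever bytes are sent/received or a path build succeeds.
    void on_progress() { last_progress_ = std::chrono::steady_clock::now(); }

    // True once no progress has happened for longer than the timeout.
    bool stalled() const {
        return std::chrono::steady_clock::now() - last_progress_ > timeout_;
    }
};
```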
For an upload, `next_data` will be called repeatedly as more data can be sent to the remote until it returns an empty vector (signaling the end of the upload) or throws an exception (cancelling the upload). It can simply return everything in one go if it already has it in memory, but this also allows avoiding the need to slam everything into RAM, especially once we chain it with encryption streaming.
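As a concrete illustration of that contract (the helper name is hypothetical, not part of the proposed API), a caller that already has its data in memory could wrap it in a source that hands everything over on the first call and then signals end-of-upload:

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Sketch of a next_data source satisfying the contract above: each call
// returns the next chunk, and an empty vector means the upload is complete.
// Here the entire payload is returned in one go, which the contract permits
// for data already held in memory.
std::function<std::vector<unsigned char>()> one_shot_source(
        std::vector<unsigned char> data) {
    auto remaining = std::make_shared<std::vector<unsigned char>>(std::move(data));
    return [remaining] {
        // First call yields all the data; every subsequent call yields {}.
        return std::exchange(*remaining, {});
    };
}
```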
For a download, data would get passed as it arrives (but at most once every `partial_min_interval`, so that the caller can decide how to balance callback overhead with memory usage), and then once all data has been received successfully or an error occurs (which could be partway through the download) it calls `on_complete` with metadata (success) or an error code. For onion requests, this is still going to be a one-shot single call to `on_data` when the entire thing is complete, but for lokinet transfers we'll actually get it streaming in (and so can provide metrics like download speed).
Internally, in the onion request router, the implementations of `OnionRequestRouter::upload` and `::download` would basically just take the `UploadRequest`/`DownloadRequest` object and convert it into a call to `send_request` (i.e. accumulate all the data via `next_data()`, load all the file server endpoint/address/pubkeys/etc., and then provide a callback to `send_request` that translates the response into a (possible) call to `on_data` and a call to `on_complete`).
These changes aren't critical for this PR (i.e. we could merge this PR without them), but I think it might be worth building them in now so that code using it doesn't have to change in the near future as we aim to be more stream oriented.
UploadInfo{std::move(file_name)}};
}
Request download( |
I was a bit misled (until I read further) by the names: `upload` and `download` at first blush made me think this is where the upload or download happens, but these functions only make an upload/download request. Perhaps `make_upload` / `make_download` would better convey that?
(However, given my comments above, if each of those becomes a bespoke object then these would just become constructors, e.g. `UploadRequest upload{...};` instead of `auto upload = file_server::upload(...)`, so this comment may be irrelevant).
} else if constexpr (std::is_same_v<T, ServerDestination>) {
    key = PROXIED_REQUESTS_KEY;
}
Currently if the code runs off the bottom of this if / else if / else if then `key` remains nullopt, and later on we throw a runtime_error. All of that can be detected at compile time, however, by changing this final `else if` into an `else` with a static assert:
Replace:
    } else if constexpr (std::is_same_v<T, ServerDestination>) {
        key = PROXIED_REQUESTS_KEY;
    }
with:
    } else {
        static_assert(std::is_same_v<T, ServerDestination>);
        key = PROXIED_REQUESTS_KEY;
    }
That way, if something changes that adds another type to the variant, this code simply won't compile rather than causing runtime errors.
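A toy, self-contained version of that pattern (the destination types here are stand-ins, not the real libsession-util ones) shows how the `static_assert` makes the visitor exhaustive at compile time:

```cpp
#include <string>
#include <type_traits>
#include <variant>

// Stand-in destination types for illustration only.
struct SnodeDestination {};
struct ServerDestination {};
using Destination = std::variant<SnodeDestination, ServerDestination>;

std::string key_for(const Destination& dest) {
    return std::visit(
            [](auto&& arg) -> std::string {
                using T = std::decay_t<decltype(arg)>;
                if constexpr (std::is_same_v<T, SnodeDestination>) {
                    return "snode-key";
                } else {
                    // If a third alternative is ever added to Destination,
                    // this assertion fails at compile time instead of the
                    // code falling through to a runtime_error.
                    static_assert(std::is_same_v<T, ServerDestination>);
                    return "proxied-key";
                }
            },
            dest);
}
```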
[&key](auto&& arg) {
    using T = std::decay_t<decltype(arg)>;
C++20 modernization allows capturing `T` via a template deduction instead of having to recapture it with the uglier `using T = ...` line:
Replace:
    [&key](auto&& arg) {
        using T = std::decay_t<decltype(arg)>;
with:
    [&key]<typename T>(const T& arg) {
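A standalone example of that C++20 feature (names here are illustrative, unrelated to the PR's code) shows the explicit template parameter list binding `T` directly:

```cpp
#include <string>
#include <type_traits>

// C++20 templated lambda: T is deduced from the argument, with no need for
// the `using T = std::decay_t<decltype(arg)>;` recapture inside the body.
auto type_name = []<typename T>(const T&) {
    if constexpr (std::is_same_v<T, int>)
        return std::string{"int"};
    else
        return std::string{"other"};
};
```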
if constexpr (std::is_same_v<T, oxen::quic::RemoteAddress>) {
    key = oxenc::to_hex(arg.view_remote_key());
} else if constexpr (std::is_same_v<T, service_node>) {
This is personal preference, but I'd be tempted to combine these since both have an identical call:

Replace:
    if constexpr (std::is_same_v<T, oxen::quic::RemoteAddress>) {
        key = oxenc::to_hex(arg.view_remote_key());
    } else if constexpr (std::is_same_v<T, service_node>) {
with:
    if constexpr (std::is_same_v<T, oxen::quic::RemoteAddress> || std::is_same_v<T, service_node>) {
        key = oxenc::to_hex(arg.view_remote_key());
return *key;
}
oxen::quic::RemoteAddress address_for_destination( |
I'm a little bit confused by this function.
In Lokinet mode, all we ever really have is remote pubkey ("abcxyz.snode"), or a private IP + port (something like 127.0.0.1:34567) that you obtained from a liblokinet call.
I don't think we can handle the former here (because it means we're missing a call somewhere to give us a mapped port), and in the latter, we should only ever have a RemoteAddress. But this function seems to allow a `service_node` input, which it then seems to be mapping to the public IP and port rather than a mapped one. Should that latter case (the inner `else if` below) be deleted, or am I missing something here?
I think I see what's going on: the `arg.host()` that goes into the `address` isn't used along the lokinet path, but we need the remote key + port in order to build the tunnel. In that case, can we change `arg.host()` to `"0.0.0.0"` to make it clearer that we don't want or care about the actual remote host?
std::visit(
    [&address, &request_id](auto&& arg) {
        using T = std::decay_t<decltype(arg)>;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(C++20 modernize; see earlier comment)
This PR refactors the `session_network` to be far more configurable and easier to extend, as well as fixing a number of bugs which existed in the original implementation. The main interface has now been genericised with the routing and transport mechanisms abstracted from the client; the updated `Request` structure also makes it easier to pre-construct requests, which should allow for abstracting more of the network requests in the future.

It also includes a number of new configuration options; some particularly useful ones include:
- Routing requests via `Onion Requests`, `Lokinet` or `Direct` to their destination
- Using the `devnet` environment

Note: This contains a breaking change for clients which don't currently use networking - `ENABLE_ONIONREQ` has been renamed to `ENABLE_NETWORKING` to be more accurate (this defaults to on, so any clients which have it disabled will need to update their build flag).
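For clients affected by the rename, the build flag change would look something like this (an illustrative configure command, assuming a standard CMake invocation):

```shell
# Old flag name (no longer recognised after this PR):
#   cmake -B build -DENABLE_ONIONREQ=OFF
# New flag name; networking defaults to ON, so only clients that
# disable it need to update:
cmake -B build -DENABLE_NETWORKING=OFF
```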