simln-lib/refactor: fully deterministic produce events #277

f3r10 · 2025-06-03T20:21:20Z

Description

The goal of this PR is to achieve fully deterministic runs to get reproducible simulations

Changes

nodes: HashMap<PublicKey, Arc<Mutex<dyn LightningNode>>> Replace the HashMap with a BTreeMap. A HashMap's iteration order is unspecified and can differ between runs, which makes the simulation results unpredictable. A BTreeMap iterates in key order, so the nodes are always visited in the same order.
dispatch_producers acts as a master task: it generates all the payments for the nodes, picks the random destination for each, and only then spawns a thread for producing the events (produce_events).
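
To illustrate the first change, here is a minimal sketch (not the PR's code) showing that a BTreeMap iterates in sorted key order regardless of insertion order. The string keys and integer values are placeholders standing in for PublicKey and the node handles; the helper name iteration_order is made up for this example.

```rust
use std::collections::BTreeMap;

/// Returns the keys of the map in iteration order.
/// `&str` keys stand in for `PublicKey` (an assumption for this sketch).
fn iteration_order(nodes: &BTreeMap<&'static str, u32>) -> Vec<&'static str> {
    nodes.keys().copied().collect()
}

fn main() {
    // Insert the same entries in two different orders.
    let mut a = BTreeMap::new();
    a.insert("node_c", 3);
    a.insert("node_a", 1);
    a.insert("node_b", 2);

    let mut b = BTreeMap::new();
    b.insert("node_b", 2);
    b.insert("node_c", 3);
    b.insert("node_a", 1);

    // Both maps iterate in sorted key order, so the orders match.
    assert_eq!(iteration_order(&a), iteration_order(&b));
    println!("{:?}", iteration_order(&a));
}
```

With a HashMap the same comparison is not guaranteed to hold, because the standard library does not specify its iteration order.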

Addresses #243

f3r10 · 2025-06-04T14:20:06Z

Opening as a draft; I still need to fix some issues.
But I think it is ready for a first review pass @carlaKC

carlaKC · 2025-06-09T17:45:50Z

btw don't worry about fixups until this is out of draft - when review hasn't properly started it's okay to just squash em!

carlaKC

Direction looking good here! Main comment is that I think we need to have a step where we replenish our heap by calling generate_payments again?

If payment_count().is_none() we return a single payment from generate_payments
For RandomPaymentActivity, this means we'll do one payment per node and then shut down?

Related to this is that we possibly don't want to queue up tons of events for when payment_count is defined (say, we want a million payments, we'll queue up a million items which is a bit of a memory waste). This probably isn't much of a big deal, because I'd imagine this use case is primarily for smaller numbers but something to keep in mind as we address the above requirement.

Also would be good to rebase this early on to get to a more current base 🙏

f3r10 · 2025-06-10T20:10:14Z

> Direction looking good here! Main comment is that I think we need to have a step where we replenish our heap by calling generate_payments again?

The idea would be to generate all the payments at once, so the master task would dispatch the events.

> If payment_count().is_none() we return a single payment from generate_payments

Yes, in this case, only one payment is generated

> For RandomPaymentActivity, this means we'll do one payment per node and then shut down?

Yes, right now it is working in this mode 🤔

> Related to this is that we possibly don't want to queue up tons of events for when payment_count is defined (say, we want a million payments, we'll queue up a million items which is a bit of a memory waste). This probably isn't much of a big deal, because I'd imagine this use case is primarily for smaller numbers but something to keep in mind as we address the above requirement.

Yes, you are right, maybe it would be better to create some batches of payments. I am going to try to come up with an alternative to reduce the memory waste. 🤔

f3r10 · 2025-06-13T16:06:27Z

Hi @carlaKC , I've developed a new approach for the event generation system. The core idea is to centralize the random number generation to ensure deterministic outcomes for our simulations.

Here's a breakdown of the design:

Central Manager Task: A dedicated thread runs a central manager. This manager is the sole source for generating both random wait times and random destinations. By centralizing this, we ensure that the sequence of random numbers generated for these critical values is entirely reproducible, given a fixed seed.
Executor Event Listeners: For each executor, a separate thread is spawned. These threads act as listeners for payment events, forwarding them to the designated consumers once received.
Payment Event Generators: Concurrently, for each executor, another thread is spawned. These threads are responsible for generating payment events in a continuous loop (e.g., for RandomActivity). Each generator thread communicates with the central manager via a dedicated channel to request a wait time. After awaiting the specified duration, it sends another event to the manager to trigger the calculation of a random destination. Once the destination is determined, the manager dispatches a final event to the respective event listener thread (as described in the previous point).

This design ensures that the wait times and final destinations are entirely deterministic across simulation runs. However, there is a new challenge with the non-deterministic order of thread execution.

The Determinism Challenge

While the values generated (wait times, destinations) are fixed if the random number generator is seeded, the order in which the executor threads request these values is not guaranteed. For example, if we have ex1 and ex2 executors:

Execution 1:
    ex1 gets wait_time 0 → destination node_3
    ex2 gets wait_time 1 → destination node_4

Execution 2 (possible non-deterministic order):
    ex2 gets wait_time 0 → destination node_3
    ex1 gets wait_time 1 → destination node_4

This means that even though the sequence of random numbers from the central manager is the same, which executor consumes which number from that sequence is left to the operating system's scheduler, leading to variations in the overall simulation flow.
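
A small sketch of the problem described above, under stated assumptions: SplitMix64 is a tiny stand-in for whatever seeded RNG the simulation uses, and assign is a made-up helper that hands out the next random value to each executor in the order it asks. The sequence of values is identical across runs, but which executor receives which value depends on the request order.

```rust
/// Tiny deterministic generator used as a stand-in for a seeded RNG.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

/// Hands the next random value to each executor in `request_order`.
/// The order models which thread reached the central manager first.
fn assign(seed: u64, request_order: &[&'static str]) -> Vec<(&'static str, u64)> {
    let mut rng = SplitMix64(seed);
    request_order
        .iter()
        .map(|&ex| (ex, rng.next_u64()))
        .collect()
}

fn main() {
    let run1 = assign(42, &["ex1", "ex2"]);
    let run2 = assign(42, &["ex2", "ex1"]);

    // Same seed, so the sequence of values is the same in both runs.
    assert_eq!(run1[0].1, run2[0].1);
    assert_eq!(run1[1].1, run2[1].1);

    // But ex1 received the first value in run 1 and the second in run 2.
    assert_ne!(run1[0], run2[1]);

    println!("run 1: {:?}", run1);
    println!("run 2: {:?}", run2);
}
```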

Proposed Solution for Execution Order

To achieve full simulation determinism, including the order of execution, I'm considering adding a tiny, randomized initial sleep time before each executor thread begins its main loop. While seemingly counter-intuitive, this jitter can effectively "break ties" in thread scheduling in a controlled, reproducible way when combined with a seeded random number generator. This would allow us to deterministically influence which thread acquires the next available random number from the central manager.

WDYT?

carlaKC · 2025-06-17T15:32:18Z

Deleted previous comment - it had some misunderstandings.

Why can't we keep the current approach of generating a queue of events and then replenish the queue when we run out of events? By generating all of our payment data in one place, we don't need to worry about thread execution order.

I think that this can be as simple as pushing a new event to the queue every time we pop one? We might need to track some state for payment count (because we'll need to remember how many we've had), but for random activity it should be reasonable.

carlaKC · 2025-06-17T15:47:11Z

Rough sketch of what I was picturing:

Queue up initial set of events:

For each executor:
- Get wait time and destination
- Push wait time, destination and ExecutorKit onto heap

Read from heap:

Pop event off heap
Sleep until wait time is reached
Send SimulationEvent::SendPayment into the channel for the executor
Generate a new wait time and destination from the ExecutorKit and push them back onto the heap
Read from heap until shutdown

Instinct about this is:

Always generating payment destinations in one place fixes our determinism issue
Re-queueing on pop + sorting by time means we'll never run out of events for each executor
The nasty thing will be payment counts: we're probably going to have to store those in the heap and know when we don't need to queue anything else
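
The sketch above could look roughly like the following. This is an illustration under several assumptions, not simln-lib code: Executor is a hypothetical stand-in for ExecutorKit, SplitMix64 stands in for the seeded RNG, virtual timestamps replace real sleeping, and a returned log replaces sending SimulationEvent::SendPayment into a channel. The remaining payment count is kept per executor here rather than in the heap entry; either placement would work for the purpose of the example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Tiny deterministic generator used as a stand-in for a seeded RNG.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical stand-in for ExecutorKit. `remaining: None` means unbounded.
struct Executor {
    name: &'static str,
    remaining: Option<u64>,
}

/// Event stored in the heap: (fire time, executor index, destination index).
type Event = (u64, usize, usize);

/// Draws the next event for one executor, or None once its count is used up.
/// `node_count` must be greater than zero.
fn next_event(rng: &mut SplitMix64, now: u64, idx: usize, ex: &mut Executor, node_count: usize) -> Option<Event> {
    if let Some(n) = ex.remaining.as_mut() {
        if *n == 0 {
            return None;
        }
        *n -= 1;
    }
    let wait = 1 + rng.next_u64() % 10;
    // Simplification: the destination may equal the source in this sketch.
    let dest = (rng.next_u64() % node_count as u64) as usize;
    Some((now + wait, idx, dest))
}

/// Single-task loop: all random draws happen here, in heap order.
/// `max_events` stands in for the shutdown signal.
fn run(seed: u64, mut executors: Vec<Executor>, node_count: usize, max_events: usize) -> Vec<(u64, &'static str, usize)> {
    let mut rng = SplitMix64(seed);
    let mut heap = BinaryHeap::new();

    // Queue up the initial event for each executor.
    for (idx, ex) in executors.iter_mut().enumerate() {
        if let Some(ev) = next_event(&mut rng, 0, idx, ex, node_count) {
            heap.push(Reverse(ev));
        }
    }

    let mut log = Vec::new();
    while log.len() < max_events {
        // Reverse turns the max-heap into a min-heap on (time, index, dest).
        let (time, idx, dest) = match heap.pop() {
            Some(Reverse(ev)) => ev,
            None => break, // every executor exhausted its payment count
        };
        // A real implementation would sleep until `time` and send the event.
        log.push((time, executors[idx].name, dest));
        // Re-queue on pop so the executor never runs dry.
        if let Some(ev) = next_event(&mut rng, time, idx, &mut executors[idx], node_count) {
            heap.push(Reverse(ev));
        }
    }
    log
}

fn main() {
    let make = || vec![
        Executor { name: "ex1", remaining: Some(2) },
        Executor { name: "ex2", remaining: None },
    ];
    let first = run(42, make(), 5, 30);
    let second = run(42, make(), 5, 30);
    assert_eq!(first, second); // same seed, same schedule
    println!("{:?}", &first[..5]);
}
```

Because the executor index is part of the heap key, two events with the same fire time are always popped in the same order, so no tie is left to the scheduler.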

carlaKC reviewed Jun 9, 2025

simln-lib/refactor: fully deterministic produce events

1b3a21f

f3r10 force-pushed the refactor_fully_deterministic_produce_events branch from b06a289 to 1b3a21f on June 10, 2025 20:31
