Making the producer stalling configurable #6413

jing-flowdesk · 2025-01-27T14:34:34Z

Proposed change

Making the producer stalling configurable.

Currently, a slow consumer on a subject can cause the server to throttle the producer.
For example:

// If we are a client and we detect that the consumer we are
// sending to is in a stalled state, go ahead and wait here
// with a limit.
if c.kind == CLIENT && client.out.stc != nil {
	client.stalledWait(c)
}

I read in another issue that this was meant to protect gateways (GW) and routes?
The proposal is to make this configurable, so that people can switch it off if they need to.

Use case

Better handling of event spikes by not stalling producers when there is a slow consumer on some topic.

Contribution

The change does not look huge, but we would love some more input before committing to doing it ourselves.

neilalexander · 2025-01-28T13:48:09Z

The stall gate here is to prevent the server from spiking in memory usage in queues when the subscriber is failing to keep up, otherwise the server could potentially OOM and lose it all anyway.

If your usage pattern features very spiky producers but stable or throttled subscribers, you may want to look at funnelling the data through streams instead.

jing-flowdesk · 2025-01-28T16:51:05Z

Thank you @neilalexander,

Are you talking about https://docs.nats.io/nats-concepts/jetstream/streams?
Unfortunately, JetStream does not match our needs, notably because of the increased latency when it is enabled.

We were wondering if we could have a "fire and forget" operating mode where the producer publishes at its own speed and is never throttled. The consumer may or may not receive the message, depending on its health.

roeschter · 2025-02-03T08:58:13Z

Flowdesk is looking for (mostly) real-time data distribution. More generally, this is data that ages fast, where old data is of little use.
In this context, quality of service can no longer be guaranteed if the producer stalls.
The suggestion would be to make the behavior controllable (on/off).

neilalexander · 2025-02-03T09:37:37Z

I think we can look into it.

MauriceVanVeen · 2025-02-06T13:47:05Z

Tried out disabling the stalled wait and doing a bench where the subscriber is too slow. This pretty quickly makes the server the producer is connected to freeze/OOM. (Just because you can publish WAY faster than you can receive messages.)

So it's not as simple as just adding a toggle to either stall or not. But it should be a toggle that either:

- does do stalledWait, and applies back-pressure to the producer when a subscriber is slow
- does not stall, no back-pressure for the producer, but would need to drop messages that can't be delivered to a subscriber in time

Would need to figure out exactly which queues get backed up and ensure they don't grow too much.

derekcollison · 2025-02-06T13:48:52Z

We have tests for this already for fan in and fan out. Could you describe your test setup a bit more?

jing-flowdesk · 2025-02-07T16:34:57Z

Hello,

@MauriceVanVeen
The second behavior you are describing where it "does not stall, no back-pressure for the producer, but would need to drop messages that can't be delivered to a subscriber in time" could match our needs.
The message would be dropped for the specific slow consumer, and other healthy consumers subscribed to the same topic would still receive it, right? I understood that the writeloop flushes to subscribers one by one, synchronously, meaning that if one of them early in the list is slow, it could impact the subscribers after it, right?

@derekcollison
For our test we used nats:2.10.20-alpine
The setup:

- We had two producers reacting to the same event, each publishing timestamped data on its own topic (Producer A on Topic A, Producer B on Topic B).
- We had multiple subscribers for each topic.
- We simulated a slow consumer on Topic A.
- Producer A started to stall while Producer B was OK.

Normally, when a producer detects that one of the consumers of a message is falling behind, it will stall. This means that if a message has 2 consumers and the first is "slow", it will affect the timely delivery to the second consumer.

With the new option `no_fast_producer_stall=true`, the server will simply drop a message destined to a consumer that would have caused the producer to stall. The message is still delivered to consumers that are not falling behind.

The option can be config-reload'ed. If a message is dropped due to fast-producer/slow-consumer, and the message was traced (with deliver option), then the message trace egress event will have an error indicating the reason why the message was not delivered.

Resolves #6413

Signed-off-by: Ivan Kozlovic <[email protected]>

kozlovic · 2025-02-12T00:20:26Z

@derekcollison @neilalexander @MauriceVanVeen I think that there could be value in having a way to completely disable producer stalling in some situations, as described by @jing-flowdesk. But of course, we can't simply ignore the stall and still attempt to deliver. Instead, if we drop the message for a slow consumer, this allows the server to deliver it to the other, non-slow consumers. This is not the default behavior, so it should not impact users who do not want it and prefer the current one.

I have the PR #6500 for consideration.

PS: I had issues with the tests running on Travis, so I had to tweak them several times...

jing-flowdesk added the "proposal" label (Enhancement idea or proposal) · Jan 27, 2025

kozlovic mentioned this issue · Feb 11, 2025: [ADDED] Option that disables fast producer stalling (drops msg instead) #6500 (Merged)

derekcollison closed this as completed in #6500 · Feb 12, 2025

wallyqs mentioned this issue · Mar 20, 2025: Add server config for stall client duration #2208 (Closed)