Making the producer stalling configurable #6413

Closed
jing-flowdesk opened this issue Jan 27, 2025 · 8 comments · Fixed by #6500
Labels
proposal Enhancement idea or proposal

Comments

@jing-flowdesk

jing-flowdesk commented Jan 27, 2025

Proposed change

Making the producer stalling configurable.

Currently, when a subject has a slow consumer, the server can throttle the producer.
For example:

// If we are a client and we detect that the consumer we are
// sending to is in a stalled state, go ahead and wait here
// with a limit.
if c.kind == CLIENT && client.out.stc != nil {
	client.stalledWait(c)
}

I read in another issue that this is meant to protect gateways and routes?
The proposal is to make this behavior configurable, so that users can switch it off when they need to.

Use case

Better handling of event spikes by not stalling producers when there is a slow consumer on some topic.

Contribution

The change does not look large, but we would like more input before committing to doing it ourselves.

jing-flowdesk added the proposal label (Enhancement idea or proposal) Jan 27, 2025
@neilalexander
Member

neilalexander commented Jan 28, 2025

The stall gate here is to prevent the server from spiking in memory usage in queues when the subscriber is failing to keep up, otherwise the server could potentially OOM and lose it all anyway.

If your usage pattern features very spiky producers but stable or throttled subscribers, you may want to look at funnelling the data through streams instead.

@jing-flowdesk
Author

Thank you @neilalexander,

Are you talking about https://docs.nats.io/nats-concepts/jetstream/streams ?
Unfortunately, JetStream does not match our needs, notably because of the increased latency when it is enabled.

We were wondering whether we could have a "Fire and Forget" operating mode where the producer publishes at its own speed and is never throttled. The consumer may or may not receive the message, depending on its health.
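
As a rough illustration of what "Fire and Forget" could mean here, the sketch below models a subscriber as a bounded Go channel and uses a non-blocking send so the producer never waits. The `tryDeliver` helper and the queue size are hypothetical and say nothing about how nats-server queues outbound data.

```go
package main

import "fmt"

// tryDeliver attempts a non-blocking send into a subscriber's bounded queue.
// It returns false, without waiting, when the queue is full.
func tryDeliver(queue chan string, msg string) bool {
	select {
	case queue <- msg:
		return true
	default:
		return false // subscriber is behind: drop instead of stalling
	}
}

func main() {
	// Hypothetical subscriber with room for two messages and no active reader.
	queue := make(chan string, 2)

	delivered, dropped := 0, 0
	for i := 0; i < 5; i++ {
		if tryDeliver(queue, fmt.Sprintf("tick-%d", i)) {
			delivered++
		} else {
			dropped++
		}
	}
	fmt.Println(delivered, dropped) // 2 3
}
```

The producer's loop finishes at its own pace; the price is that the slow subscriber silently misses the last three messages.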

@roeschter

Flowdesk is looking for (mostly) real-time data distribution. More generally, this is data that ages fast, where old data is of little use.
In this context, quality of service can no longer be guaranteed if the producer stalls.
The suggestion would be to make the behavior controllable (on/off).

@neilalexander
Member

I think we can look into it.

@MauriceVanVeen
Member

Tried out disabling the stalled wait and doing a bench where the subscriber is too slow. This pretty quickly makes the server the producer is connected to freeze/OOM. (Just because you can publish WAY faster than you can receive messages.)

So it's not as simple as just adding a toggle to either stall or not. But it should be a toggle that either:

  • does do stalledWait, and applies back-pressure to the producer when a subscriber is slow
  • does not stall, no back-pressure for the producer, but would need to drop messages that can't be delivered to a subscriber in time

Would need to figure out exactly which queues get backed up and ensure they don't grow too much.

@derekcollison
Member

We have tests for this already for fan in and fan out. Could you describe your test setup a bit more?

@jing-flowdesk
Author

Hello,

@MauriceVanVeen
The second behavior you describe, where it "does not stall, no back-pressure for the producer, but would need to drop messages that can't be delivered to a subscriber in time", could match our needs.
The message would be dropped only for the specific slow consumer, and the other, healthy consumers subscribed to the same topic would still receive it, right? As I understand it, the write loop flushes to subscribers one by one, synchronously, so a slow consumer early in the list could delay the subscribers after it. Is that correct?

@derekcollison
For our test we used nats:2.10.20-alpine
The setup:

  • We had two producers reacting to the same event, each publishing timestamped data on its own topic.
    • Producer A on Topic A, Producer B on Topic B
  • We had multiple subscribers for each Topic
  • We simulated a slow consumer on Topic A
  • Producer A started to stall while Producer B was ok

kozlovic added a commit that referenced this issue Feb 11, 2025
Normally, when a producer detects that one of the consumers of a message
is falling behind, it will stall. This means that if a message has
2 consumers and the first is "slow", then it will affect the timely
delivery to the second consumer.

With the new option `no_fast_producer_stall=true`, the server will
simply drop a message destined to a consumer that would have caused
the producer to stall. The message is still delivered to consumers
that are not falling behind.

The option can be config-reload'ed and if a message is dropped
due to fast-producer/slow-consumer, and the message was traced
(with deliver option), then the message trace egress event will
have an error indicating the reason why the message was not
delivered.

Resolves #6413

Signed-off-by: Ivan Kozlovic <[email protected]>
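
For readers who want to try the option, a minimal configuration fragment might look like the following. The option name and value come from the commit message above; placing it at the top level of the server configuration file is an assumption, since the commit message does not say where it belongs.

```
# Assumed placement: top level of the nats-server configuration file.
# Drop messages for consumers that would otherwise stall the producer.
no_fast_producer_stall: true
```

Leaving the option unset keeps the existing stalling behavior, which matches the statement elsewhere in this thread that the new behavior is not the default.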
@kozlovic
Member

@derekcollison @neilalexander @MauriceVanVeen I think there could be value in having a way to completely disable producer stalling in some situations, as described by @jing-flowdesk. Of course, we can't simply ignore the stall and still attempt to deliver. Instead, if we drop the message for a slow consumer, the server can still deliver it to the other, non-slow consumers. This is not the default behavior, so it should not affect users who prefer the current behavior.

I have the PR #6500 for consideration.

PS: I had issues with the tests running on Travis, so I had to tweak them several times...

derekcollison added a commit that referenced this issue Feb 12, 2025
…d) (#6500)
neilalexander pushed a commit that referenced this issue Feb 12, 2025