
Conversation

jorgeantonio21

No description provided.

@msharmavikram

@jorgeantonio21 Thanks for the dynamic rate limiting proposal. We are reviewing it. Please provide details on the current solution you are using so that we can understand the pain points and determine prioritization.

We also welcome public contributions. Let us know if you are interested in contributing to the codebase.

@nnshah1

nnshah1 commented Jul 9, 2025

@jorgeantonio21 - this is a well-thought-out proposal - we'll look into it and assign a steward.

@tedzhouhk

Thanks again @jorgeantonio21 for this nice proposal! Here's a summary of today's discussion:

Rate Limiter Design

For an inference pipeline, if we reject a request in an earlier stage, we have to make the rejection decision with less information, but we waste fewer resources on compute and communication. If we reject the request in a later stage, we have more information and can make a more accurate decision, but we waste more resources. To trade off accuracy against resource utilization, the rate limiter should reject a request as soon as the available information indicates it is almost certain to fail the SLA. Hence, we propose a multi-level rate limiter:

  • Level 1: ingress. At the ingress level, we check the number of in-flight requests against a pre-set threshold to decide whether to reject an incoming request (see the sketch after this list).
  • Level 2: router. At the router level, the request is already tokenized and a prefix-cache hit estimate for each engine has been generated. The router rejects the request based on the prefill and decode engines' load, checking the prefill queue size and the decode engines' kv cache utilization rate.
  • Level 3: engine. At the engine level, the engine's scheduler has more detailed and more accurate information on the engine's status and load. If a request sits in the queue for too long or the engine's load is too high, the scheduler rejects it.
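
A minimal sketch of the level 1 check, assuming a simple atomic in-flight counter and a pre-set threshold (the type and method names below are illustrative, not Dynamo's actual API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical level 1 ingress gate: reject when too many requests are in flight.
pub struct IngressGate {
    in_flight: AtomicUsize,
    max_in_flight: usize, // pre-set threshold
}

impl IngressGate {
    pub fn new(max_in_flight: usize) -> Self {
        Self { in_flight: AtomicUsize::new(0), max_in_flight }
    }

    /// Returns false (reject) if admitting the request would exceed the threshold.
    pub fn try_admit(&self) -> bool {
        let current = self.in_flight.fetch_add(1, Ordering::SeqCst);
        if current >= self.max_in_flight {
            // Roll back the optimistic increment and reject.
            self.in_flight.fetch_sub(1, Ordering::SeqCst);
            false
        } else {
            true
        }
    }

    /// Must be called when an admitted request finishes (success or error).
    pub fn release(&self) {
        self.in_flight.fetch_sub(1, Ordering::SeqCst);
    }
}
```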

Next steps

Here are the next steps, from short- to long-term:

  1. Thanks again to @jorgeantonio21 for taking this first stab! https://github.com/ai-dynamo/dynamo/pull/1949/files Since using an EMA over TTFT/ITL might not provide accurate decisions, @jorgeantonio21 will modify this PR to use the number of in-flight requests, which will complete the level 1 ingress rate limiter.
  2. For the level 2 router rate limiter, discuss the design with @PeaBrane.
  3. Discuss which events we should add to the SSE stream. Added events can provide more information about the status of each request in the frontend/router (e.g., sent to D, queued in D, queued in the prefill queue, etc.).
  4. Fault tolerance when there are multiple frontends. The rough idea is to periodically sync status among the frontends; this needs a more detailed design. @nnshah1
  5. Level 3 engine rate limiter design and implementation.

@tedzhouhk

Hi @jorgeantonio21, @ryanolson proposes that we use kv load for rejection at level 1. For example, if the kv load of current in-flight requests is more than 1.25x the available kv cache of all engines, we start rejecting requests. WDYT?

To achieve this, we need two inputs (a rough sketch of the resulting check follows the list):

  1. kv load of current in-flight requests: @PeaBrane implemented the logic to maintain this in the router, could you please share a pointer?
  2. available kv cache of all engines: the engine client can put this information in etcd; we can help implement this part.
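
For illustration, a rough sketch of the check this would enable, assuming both inputs above are available as block counts (the constant and function names are hypothetical):

```rust
/// Illustrative threshold from the discussion: start rejecting once the kv load of
/// in-flight requests exceeds 1.25x the available kv cache across all engines.
const KV_LOAD_FACTOR: f64 = 1.25;

/// `in_flight_kv_blocks`: kv load of current in-flight requests (maintained in the router).
/// `available_kv_blocks`: available kv cache summed over all engines (e.g., published via etcd).
fn should_reject(in_flight_kv_blocks: u64, available_kv_blocks: u64) -> bool {
    (in_flight_kv_blocks as f64) > KV_LOAD_FACTOR * (available_kv_blocks as f64)
}

fn main() {
    // 1300 blocks in flight vs. 1000 available: 1300 > 1.25 * 1000, so reject.
    println!("reject: {}", should_reject(1300, 1000));
}
```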

@ryanolson

ryanolson commented Jul 17, 2025

Load can be subscribed to.

We had a metrics aggregator that @rmccorm4 had built.

The idea is that many entities would benefit from tracking the kv load.

Instead of having each component scrape or aggregate it independently, the metrics aggregator would, for certain high-demand metrics, continuously update and publish the most recent values.

This way, interested parties would simply subscribe.
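
As a sketch of that subscribe pattern (not the actual aggregator code), a `tokio::sync::watch` channel is one way the aggregator could keep publishing the most recent value while any number of interested parties simply read it; the snapshot type and field names are made up for illustration, and the tokio crate is assumed:

```rust
use std::time::Duration;
use tokio::sync::watch;

/// Hypothetical aggregated kv load snapshot published by the metrics aggregator.
#[derive(Clone, Copy, Debug, Default)]
struct KvLoadSnapshot {
    in_flight_kv_blocks: u64,
    available_kv_blocks: u64,
}

#[tokio::main]
async fn main() {
    // The aggregator owns the sender and keeps publishing the latest value.
    let (tx, rx) = watch::channel(KvLoadSnapshot::default());

    // Any interested party (router, ingress, autoscaler, ...) just clones a receiver.
    let mut subscriber = rx.clone();
    tokio::spawn(async move {
        while subscriber.changed().await.is_ok() {
            let snapshot = *subscriber.borrow();
            // React to the latest aggregated value, e.g. flip a back-pressure flag.
            println!("latest kv load: {:?}", snapshot);
        }
    });

    // Aggregator side: publish a new snapshot whenever fresh metrics arrive.
    tx.send(KvLoadSnapshot { in_flight_kv_blocks: 1200, available_kv_blocks: 1000 }).unwrap();

    // Give the subscriber a moment to observe the update before exiting.
    tokio::time::sleep(Duration::from_millis(50)).await;
}
```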

@tedzhouhk

@rmccorm4 could you please share some pointers?

@rmccorm4

@tedzhouhk this "metrics component" was a proof-of-concept of what @ryanolson was describing: https://github.com/ai-dynamo/dynamo/tree/main/components/metrics. It's mostly untouched and untested since GTC, so may have some bugs, but pairing it with the mock worker should be pretty easy to play around with.

It gathered metrics in two ways:

  1. Periodically scraping each component's namespace/component/load_metrics nats service endpoint (ForwardPassMetrics)

  2. Pub/sub metrics events - only some kv cache related metrics (KVHitRateEvent) were added here for initial testing, but I think the "pub" part of the code originally in the Rust KV Router logic ended up getting removed when we pivoted to implementing kv router cost functions in Python. Since we've come back around to KV routing in Rust with dynamo-run ingress, maybe this approach or something similar could be used again.

Lastly, it aggregated some of the metrics it gathered from all components via (1) the nats stats handler endpoints, and published an event containing the aggregated metrics that could theoretically be consumed by something else - like multiple instances of the kv router sharing a single aggregated view of the states/metrics. I don't think anything consumes this event today; it's just something that could theoretically be used.
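
For illustration only, a rough sketch of that aggregation step, with a made-up per-worker struct standing in for the real ForwardPassMetrics fields:

```rust
use std::collections::HashMap;

/// Hypothetical per-worker load metrics, loosely modeled on what a
/// ForwardPassMetrics-style scrape might return (field names are illustrative).
#[derive(Clone, Copy, Debug)]
struct WorkerLoad {
    request_queue_depth: u64,
    kv_blocks_used: u64,
    kv_blocks_total: u64,
}

/// Fold scraped per-worker metrics into one cluster-wide view that could then be
/// published as a single event for subscribers (routers, ingress, etc.).
fn aggregate(per_worker: &HashMap<String, WorkerLoad>) -> WorkerLoad {
    per_worker.values().fold(
        WorkerLoad { request_queue_depth: 0, kv_blocks_used: 0, kv_blocks_total: 0 },
        |acc, w| WorkerLoad {
            request_queue_depth: acc.request_queue_depth + w.request_queue_depth,
            kv_blocks_used: acc.kv_blocks_used + w.kv_blocks_used,
            kv_blocks_total: acc.kv_blocks_total + w.kv_blocks_total,
        },
    )
}

fn main() {
    let mut per_worker = HashMap::new();
    per_worker.insert("worker-0".to_string(), WorkerLoad { request_queue_depth: 3, kv_blocks_used: 400, kv_blocks_total: 1000 });
    per_worker.insert("worker-1".to_string(), WorkerLoad { request_queue_depth: 1, kv_blocks_used: 250, kv_blocks_total: 1000 });
    println!("cluster-wide view: {:?}", aggregate(&per_worker));
}
```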

With the recent additions of metrics/observability going on in the runtime/llm bits from others, there may be more opportunities for hooking into places or more metrics information that can be gathered from other metrics endpoints. This metrics aggregation example is pretty dated at this point.

Hope that helps.

@tedzhouhk

@rmccorm4 thanks a lot for the detailed info! Is the following correct? ForwardPassMetrics is generated during the forward pass. When a burst of requests arrives, most of them will not be scheduled or even acknowledged by the worker, so the metrics will lag badly in this case. (That's the main reason we moved to estimating kv block usage in the router.)

@nnshah1

nnshah1 commented Jul 18, 2025

> Hi @jorgeantonio21, @ryanolson proposes that we use kv load for rejection at level 1. For example, if the kv load of current in-flight requests is more than 1.25x the available kv cache of all engines, we start rejecting requests. WDYT?
>
> To achieve this, we need two inputs:
>
>   1. kv load of current in-flight requests: @PeaBrane implemented the logic to maintain this in the router, could you please share a pointer?
>   2. available kv cache of all engines: the engine client can put this information in etcd; we can help implement this part.

@tedzhouhk - I thought our concern here was that metrics would be behind and would not allow us to handle bursts well. I suspect we'd want to have a high water mark for the system regardless? Would we run into the same behavior as with ttft / itl ema?

@PeaBrane

PeaBrane commented Jul 18, 2025

@tedzhouhk The KvScheduler can return an AllWorkersBusy error when it determines that no workers can "handle" the new request. This is currently not hooked up, so the Router just routes the request regardless (and it gets queued up at the engine end).

My original plan was to maintain a queue at the Router end if this "busy" error is triggered, but I guess we could pass it a flag to just abort / preempt / cancel the request instead. What are your thoughts?

Since the Router also has some (predictive) info on how loaded each engine is (batch-token-wise or kv-load-wise), it may also be able to make an intelligent guess as to what the optimal "retry" wait time is.

P.S. most of the active load prediction logic is in lib/llm/kv_router/sequence.rs and lib/llm/kv_router/scheduler.rs, if interested.
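
A rough sketch of what acting on that error with a configurable flag could look like; the enum and function names are hypothetical, and only AllWorkersBusy mirrors the error mentioned above:

```rust
/// Stand-in for the scheduler's "all workers busy" error.
struct AllWorkersBusy;

/// What the Router should do when the KvScheduler reports that no worker can take the request.
enum BusyPolicy {
    /// Hold the request in a Router-side queue and retry later.
    Queue,
    /// Reject immediately, optionally hinting how long the client should wait before retrying.
    Reject { retry_after_ms: Option<u64> },
}

enum RouteDecision {
    Send { worker_id: usize },
    Enqueue,
    Reject { retry_after_ms: Option<u64> },
}

/// Hypothetical glue: map the scheduler outcome to a routing decision.
fn on_schedule_result(result: Result<usize, AllWorkersBusy>, policy: &BusyPolicy) -> RouteDecision {
    match result {
        Ok(worker_id) => RouteDecision::Send { worker_id },
        Err(AllWorkersBusy) => match policy {
            BusyPolicy::Queue => RouteDecision::Enqueue,
            BusyPolicy::Reject { retry_after_ms } => RouteDecision::Reject {
                retry_after_ms: *retry_after_ms,
            },
        },
    }
}

fn main() {
    let policy = BusyPolicy::Reject { retry_after_ms: Some(250) };
    match on_schedule_result(Err(AllWorkersBusy), &policy) {
        RouteDecision::Send { worker_id } => println!("send to worker {worker_id}"),
        RouteDecision::Enqueue => println!("queued at the router"),
        RouteDecision::Reject { retry_after_ms } => println!("rejected, retry after {retry_after_ms:?} ms"),
    }
}
```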

@tedzhouhk

> > Hi @jorgeantonio21, @ryanolson proposes that we use kv load for rejection at level 1. For example, if the kv load of current in-flight requests is more than 1.25x the available kv cache of all engines, we start rejecting requests. WDYT?
> >
> > To achieve this, we need two inputs:
> >
> >   1. kv load of current in-flight requests: @PeaBrane implemented the logic to maintain this in the router, could you please share a pointer?
> >   2. available kv cache of all engines: the engine client can put this information in etcd; we can help implement this part.
>
> @tedzhouhk - I thought our concern here was that metrics would be behind and would not allow us to handle bursts well. I suspect we'd want to have a high water mark for the system regardless? Would we run into the same behavior as with ttft / itl ema?

Yes we will; that's why we want to move TTFT/ITL-based rejection to the engine level.

@nnshah1

nnshah1 commented Jul 20, 2025

> @tedzhouhk The KvScheduler can return an AllWorkersBusy error when it determines that no workers can "handle" the new request. This is currently not hooked up, so the Router just routes the request regardless (and it gets queued up at the engine end).
>
> My original plan was to maintain a queue at the Router end if this "busy" error is triggered, but I guess we could pass it a flag to just abort / preempt / cancel the request instead. What are your thoughts?
>
> Since the Router also has some (predictive) info on how loaded each engine is (batch-token-wise or kv-load-wise), it may also be able to make an intelligent guess as to what the optimal "retry" wait time is.
>
> P.S. most of the active load prediction logic is in lib/llm/kv_router/sequence.rs and lib/llm/kv_router/scheduler.rs, if interested.

Tagging @kthui for his thoughts here as well.

I think we should allow the behavior to be modified to cancel instead of queue. Maybe we can tweak this with a max queue size; a rough sketch follows below.
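
A minimal sketch of that tweak, assuming a hypothetical bounded Router-side queue where a max size of 0 degenerates to cancel-instead-of-queue:

```rust
use std::collections::VecDeque;

/// Hypothetical bounded Router-side queue for requests that hit AllWorkersBusy.
struct BusyQueue<R> {
    queue: VecDeque<R>,
    max_len: usize, // max queue size; 0 means "cancel instead of queue"
}

impl<R> BusyQueue<R> {
    fn new(max_len: usize) -> Self {
        Self { queue: VecDeque::new(), max_len }
    }

    /// Try to park a request; hand it back to the caller (to be rejected) if the queue is full.
    fn push(&mut self, request: R) -> Result<(), R> {
        if self.queue.len() >= self.max_len {
            Err(request)
        } else {
            self.queue.push_back(request);
            Ok(())
        }
    }

    /// Pop the oldest parked request once capacity frees up.
    fn pop(&mut self) -> Option<R> {
        self.queue.pop_front()
    }
}

fn main() {
    let mut q: BusyQueue<&str> = BusyQueue::new(1);
    assert!(q.push("req-1").is_ok());
    assert!(q.push("req-2").is_err()); // over the max queue size: reject instead of queueing
    assert_eq!(q.pop(), Some("req-1"));
}
```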

@jorgeantonio21

@nnshah1 @tedzhouhk as discussed, scraping metrics directly from Prometheus clients lags by a few milliseconds (roughly 20ms or so in my benchmarks). In my experience, relying solely on scraping the Prometheus client often led my deployed LLM inference services to miss appropriate rate limiting (either blocking requests that could have been processed, or failing to block requests when the server was already overloaded).

The current EMA rate limiting logic in the PR above is about as 'real time' as it can possibly be, since it collects metrics as soon as they are computed (rather than once they are published to the collector).

I am happy to refactor the PR to include other metrics at the frontend level. But it makes sense to me that if the system detects sustained high load (in the form of amortized TTFT, ITL, or the number of in-flight requests), it should reject. We could do this not by rejecting all requests over the next few seconds, but only a percentage of them (which could also be a configurable parameter); a rough sketch follows below.
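
A rough sketch of that percentage-based idea, assuming the rand crate and an externally supplied overload signal (the struct and field names are illustrative):

```rust
use rand::Rng;

/// Hypothetical partial back-pressure: while overloaded, shed only `reject_fraction`
/// of incoming requests instead of all of them.
struct PartialRejector {
    /// Fraction of requests to reject while overloaded, e.g. 0.3 for 30% (configurable).
    reject_fraction: f64,
}

impl PartialRejector {
    fn should_reject(&self, overloaded: bool) -> bool {
        overloaded && rand::thread_rng().gen_bool(self.reject_fraction)
    }
}

fn main() {
    let rejector = PartialRejector { reject_fraction: 0.3 };
    // `overloaded` would come from the amortized TTFT/ITL or in-flight request signals.
    let rejected = rejector.should_reject(true);
    println!("reject this request: {rejected}");
}
```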

Happy to proceed on the best path decided.

Review comment on the proposal document (on the line "# Dynamo Rate Limit for Load Balancer Proposal"):

@nnshah1

What if we renamed this to

Dynamo Request Back Pressure or Dynamo Request Rejection or Dynamo Dynamic Back Pressure?

Rate limiting, I think, carries the idea of limiting clients to a certain number of requests per second - and this gets confused with higher layers in the stack that would impose such limits.

The proposal here is not so much a rate limit as it is a back pressure / request rejection proposal - IIUC - is that fair?

@jorgeantonio21

@nnshah1 this is totally right. I will rename the necessary items in the document.

@nnshah1

nnshah1 commented Aug 13, 2025

@PeaBrane @jorgeantonio21 @kthui - can we update here with the latest proposal? I believe we wanted to start with a limited scope and then add the additional phases as future-looking work. I would also like to use "Request Rejection" or "Preventing System Overload" rather than rate limiting as the naming scheme here. Let me know and I can help; I want to merge this, but with updates for the current direction.

@jorgeantonio21 changed the title from "feat: rate limit logic enhancement proposal" to "feat: request back pressure logic enhancement proposal" on Aug 15, 2025.