
Conversation

jorgeantonio21

No description provided.

@msharmavikram

@jorgeantonio21 Thanks for the dynamic rate limiting proposal. We are reviewing it. Please provide details on the current solution you are using so that we can understand the pain points and determine prioritization.

We also welcome public contributions. Let us know if you are interested in contributing to the codebase.

@nnshah1

nnshah1 commented Jul 9, 2025

@jorgeantonio21 - this is a well-thought-out proposal - we'll look into it and assign a steward.

@tedzhouhk

Thanks again @jorgeantonio21 for this nice proposal! Here's a summary of today's discussion:

Rate Limiter Design

For an inference pipeline, if we reject a request in an earlier stage, we have to make the rejection decision with less information, but we waste fewer resources on compute and communication. If we reject the request in a later stage, we have more information and can make a more accurate decision, but we waste more resources. To trade off accuracy against resource utilization, the rate limiter should reject a request as soon as the available information indicates it is almost certain to fail the SLA. Hence, we propose a multi-level rate limiter:

  • Level 1: ingress. At the ingress level, we check the number of in-flight requests against a pre-set threshold to decide whether to reject an incoming request (see the sketch after this list).
  • Level 2: router. At the router level, the request is already tokenized and a prefix-cache hit estimate for each engine has been generated. The router rejects the request based on the prefill and decode engines' load, checking the prefill queue size and the decode engines' kv cache utilization rate.
  • Level 3: engine. At the engine level, the engine's scheduler has more detailed and more accurate information on the engine's status and load. If a request sits in the queue for too long or the engine's load is too high, the scheduler rejects it.
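
A minimal sketch of the level 1 check, assuming a simple atomic in-flight counter and a pre-set threshold (the type and method names below are illustrative, not Dynamo's actual API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical level 1 ingress gate: reject when too many requests are in flight.
pub struct IngressGate {
    in_flight: AtomicUsize,
    max_in_flight: usize, // pre-set threshold
}

impl IngressGate {
    pub fn new(max_in_flight: usize) -> Self {
        Self { in_flight: AtomicUsize::new(0), max_in_flight }
    }

    /// Returns false (reject) if admitting the request would exceed the threshold.
    pub fn try_admit(&self) -> bool {
        let current = self.in_flight.fetch_add(1, Ordering::SeqCst);
        if current >= self.max_in_flight {
            // Roll back the optimistic increment and reject.
            self.in_flight.fetch_sub(1, Ordering::SeqCst);
            false
        } else {
            true
        }
    }

    /// Must be called when an admitted request finishes (success or error).
    pub fn release(&self) {
        self.in_flight.fetch_sub(1, Ordering::SeqCst);
    }
}
```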

Next steps

Here are the next steps, from short- to long-term:

  1. Thanks again to @jorgeantonio21 for taking this first stab! https://github.com/ai-dynamo/dynamo/pull/1949/files Since using an EMA over TTFT/ITL might not provide accurate decisions, @jorgeantonio21 will modify this PR to use the number of in-flight requests, which will complete the level 1 ingress rate limiter.
  2. For the level 2 router rate limiter, discuss the design with @PeaBrane.
  3. Discuss which events we should add to the SSE stream. Added events can provide more information about the status of each request in the frontend/router (e.g., sent to D, queued in D, queued in the prefill queue, etc.).
  4. Fault tolerance when there are multiple frontends. The rough idea is to periodically sync status among the frontends; this needs a more detailed design. @nnshah1
  5. Level 3 engine rate limiter design and implementation.

@tedzhouhk

Hi @jorgeantonio21, @ryanolson proposes that we use kv load for rejection at level 1. For example, if the kv load of current in-flight requests is more than 1.25x the available kv cache of all engines, we start rejecting requests. WDYT?

To achieve this, we need two inputs (a rough sketch of the resulting check follows the list):

  1. kv load of current in-flight requests: @PeaBrane implemented the logic to maintain this in the router, could you please share a pointer?
  2. available kv cache of all engines: the engine client can put this information in etcd; we can help implement this part.
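
For illustration, a rough sketch of the check this would enable, assuming both inputs above are available as block counts (the constant and function names are hypothetical):

```rust
/// Illustrative threshold from the discussion: start rejecting once the kv load of
/// in-flight requests exceeds 1.25x the available kv cache across all engines.
const KV_LOAD_FACTOR: f64 = 1.25;

/// `in_flight_kv_blocks`: kv load of current in-flight requests (maintained in the router).
/// `available_kv_blocks`: available kv cache summed over all engines (e.g., published via etcd).
fn should_reject(in_flight_kv_blocks: u64, available_kv_blocks: u64) -> bool {
    (in_flight_kv_blocks as f64) > KV_LOAD_FACTOR * (available_kv_blocks as f64)
}

fn main() {
    // 1300 blocks in flight vs. 1000 available: 1300 > 1.25 * 1000, so reject.
    println!("reject: {}", should_reject(1300, 1000));
}
```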

@ryanolson

ryanolson commented Jul 17, 2025

Load can be subscribed to.

We had a metrics aggregator that @rmccorm4 had built.

The idea is that many entities would benefit from tracking the kv load.

Instead of having each component scrape or aggregate it independently, the metrics aggregator would, for certain high-demand metrics, continuously update and publish the most recent values.

This way, interested parties would simply subscribe.
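
As a sketch of that subscribe pattern (not the actual aggregator code), a `tokio::sync::watch` channel is one way the aggregator could keep publishing the most recent value while any number of interested parties simply read it; the snapshot type and field names are made up for illustration, and the tokio crate is assumed:

```rust
use std::time::Duration;
use tokio::sync::watch;

/// Hypothetical aggregated kv load snapshot published by the metrics aggregator.
#[derive(Clone, Copy, Debug, Default)]
struct KvLoadSnapshot {
    in_flight_kv_blocks: u64,
    available_kv_blocks: u64,
}

#[tokio::main]
async fn main() {
    // The aggregator owns the sender and keeps publishing the latest value.
    let (tx, rx) = watch::channel(KvLoadSnapshot::default());

    // Any interested party (router, ingress, autoscaler, ...) just clones a receiver.
    let mut subscriber = rx.clone();
    tokio::spawn(async move {
        while subscriber.changed().await.is_ok() {
            let snapshot = *subscriber.borrow();
            // React to the latest aggregated value, e.g. flip a back-pressure flag.
            println!("latest kv load: {:?}", snapshot);
        }
    });

    // Aggregator side: publish a new snapshot whenever fresh metrics arrive.
    tx.send(KvLoadSnapshot { in_flight_kv_blocks: 1200, available_kv_blocks: 1000 }).unwrap();

    // Give the subscriber a moment to observe the update before exiting.
    tokio::time::sleep(Duration::from_millis(50)).await;
}
```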

@tedzhouhk

@rmccorm4 could you please share some pointers?

@rmccorm4

@tedzhouhk this "metrics component" was a proof-of-concept of what @ryanolson was describing: https://github.com/ai-dynamo/dynamo/tree/main/components/metrics. It's mostly untouched and untested since GTC, so may have some bugs, but pairing it with the mock worker should be pretty easy to play around with.

It gathered metrics in two ways:

  1. Periodically scraping each component's namespace/component/load_metrics nats service endpoint (ForwardPassMetrics)

  2. Pub/sub metrics events - only some kv cache related metrics (KVHitRateEvent) were added here for initial testing, but I think the "pub" part of the code originally in the Rust KV Router logic ended up getting removed when we pivoted to implementing kv router cost functions in Python. Since we've come back around to KV routing in Rust with dynamo-run ingress, maybe this approach or something similar could be used again.

Lastly, it aggregated some of the metrics it gathered from all components via (1) the nats stats handler endpoints, and published an event containing the aggregated metrics that could theoretically be consumed by something else - like multiple instances of the kv router sharing a single aggregated view of the states/metrics. I don't think anything consumes this event today; it's just something that could theoretically be used.
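
For illustration only, a rough sketch of that aggregation step, with a made-up per-worker struct standing in for the real ForwardPassMetrics fields:

```rust
use std::collections::HashMap;

/// Hypothetical per-worker load metrics, loosely modeled on what a
/// ForwardPassMetrics-style scrape might return (field names are illustrative).
#[derive(Clone, Copy, Debug)]
struct WorkerLoad {
    request_queue_depth: u64,
    kv_blocks_used: u64,
    kv_blocks_total: u64,
}

/// Fold scraped per-worker metrics into one cluster-wide view that could then be
/// published as a single event for subscribers (routers, ingress, etc.).
fn aggregate(per_worker: &HashMap<String, WorkerLoad>) -> WorkerLoad {
    per_worker.values().fold(
        WorkerLoad { request_queue_depth: 0, kv_blocks_used: 0, kv_blocks_total: 0 },
        |acc, w| WorkerLoad {
            request_queue_depth: acc.request_queue_depth + w.request_queue_depth,
            kv_blocks_used: acc.kv_blocks_used + w.kv_blocks_used,
            kv_blocks_total: acc.kv_blocks_total + w.kv_blocks_total,
        },
    )
}

fn main() {
    let mut per_worker = HashMap::new();
    per_worker.insert("worker-0".to_string(), WorkerLoad { request_queue_depth: 3, kv_blocks_used: 400, kv_blocks_total: 1000 });
    per_worker.insert("worker-1".to_string(), WorkerLoad { request_queue_depth: 1, kv_blocks_used: 250, kv_blocks_total: 1000 });
    println!("cluster-wide view: {:?}", aggregate(&per_worker));
}
```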

With the recent additions of metrics/observability going on in the runtime/llm bits from others, there may be more opportunities for hooking into places or more metrics information that can be gathered from other metrics endpoints. This metrics aggregation example is pretty dated at this point.

Hope that helps.

@tedzhouhk

@rmccorm4 thanks a lot for the detailed info! Is the following correct? ForwardPassMetrics is generated during the forward pass. When a burst of requests arrives, most of them will not be scheduled or even acknowledged by the worker, so the metrics will lag badly in this case. (That's the main reason we moved to estimating kv block usage in the router.)

@nnshah1

nnshah1 commented Jul 18, 2025

> Hi @jorgeantonio21, @ryanolson proposes that we use kv load for rejection at level 1. For example, if the kv load of current in-flight requests is more than 1.25x the available kv cache of all engines, we start rejecting requests. WDYT?
>
> To achieve this, we need two inputs:
>
>   1. kv load of current in-flight requests: @PeaBrane implemented the logic to maintain this in the router, could you please share a pointer?
>   2. available kv cache of all engines: the engine client can put this information in etcd; we can help implement this part.

@tedzhouhk - I thought our concern here was that metrics would be behind and would not allow us to handle bursts well. I suspect we'd want to have a high water mark for the system regardless? Would we run into the same behavior as with ttft / itl ema?

@PeaBrane

PeaBrane commented Jul 18, 2025

@tedzhouhk The KvScheduler can return an AllWorkersBusy error when it determines that no workers can "handle" the new request. This is currently not hooked up, so the Router just routes the request regardless (and it gets queued up at the engine end).

My original plan was to maintain a queue at the Router end if this "busy" error is triggered, but I guess we could pass it a flag to just abort / preempt / cancel the request instead. What are your thoughts?

Since the Router also has some (predictive) info on how loaded each engine is (batch-token-wise or kv-load-wise), it may also be able to make an intelligent guess as to what the optimal "retry" wait time is.

P.S. most of the active load prediction logic is in lib/llm/kv_router/sequence.rs and lib/llm/kv_router/scheduler.rs, if interested.
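
A rough sketch of what acting on that error with a configurable flag could look like; the enum and function names are hypothetical, and only AllWorkersBusy mirrors the error mentioned above:

```rust
/// Stand-in for the scheduler's "all workers busy" error.
struct AllWorkersBusy;

/// What the Router should do when the KvScheduler reports that no worker can take the request.
enum BusyPolicy {
    /// Hold the request in a Router-side queue and retry later.
    Queue,
    /// Reject immediately, optionally hinting how long the client should wait before retrying.
    Reject { retry_after_ms: Option<u64> },
}

enum RouteDecision {
    Send { worker_id: usize },
    Enqueue,
    Reject { retry_after_ms: Option<u64> },
}

/// Hypothetical glue: map the scheduler outcome to a routing decision.
fn on_schedule_result(result: Result<usize, AllWorkersBusy>, policy: &BusyPolicy) -> RouteDecision {
    match result {
        Ok(worker_id) => RouteDecision::Send { worker_id },
        Err(AllWorkersBusy) => match policy {
            BusyPolicy::Queue => RouteDecision::Enqueue,
            BusyPolicy::Reject { retry_after_ms } => RouteDecision::Reject {
                retry_after_ms: *retry_after_ms,
            },
        },
    }
}

fn main() {
    let policy = BusyPolicy::Reject { retry_after_ms: Some(250) };
    match on_schedule_result(Err(AllWorkersBusy), &policy) {
        RouteDecision::Send { worker_id } => println!("send to worker {worker_id}"),
        RouteDecision::Enqueue => println!("queued at the router"),
        RouteDecision::Reject { retry_after_ms } => println!("rejected, retry after {retry_after_ms:?} ms"),
    }
}
```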

@tedzhouhk

> > Hi @jorgeantonio21, @ryanolson proposes that we use kv load for rejection at level 1. For example, if the kv load of current in-flight requests is more than 1.25x the available kv cache of all engines, we start rejecting requests. WDYT?
> >
> > To achieve this, we need two inputs:
> >
> >   1. kv load of current in-flight requests: @PeaBrane implemented the logic to maintain this in the router, could you please share a pointer?
> >   2. available kv cache of all engines: the engine client can put this information in etcd; we can help implement this part.
>
> @tedzhouhk - I thought our concern here was that metrics would be behind and would not allow us to handle bursts well. I suspect we'd want to have a high water mark for the system regardless? Would we run into the same behavior as with ttft / itl ema?

Yes we will; that's why we want to move TTFT/ITL-based rejection to the engine level.

@nnshah1

nnshah1 commented Jul 20, 2025

> @tedzhouhk The KvScheduler can return an AllWorkersBusy error when it determines that no workers can "handle" the new request. This is currently not hooked up, so the Router just routes the request regardless (and it gets queued up at the engine end).
>
> My original plan was to maintain a queue at the Router end if this "busy" error is triggered, but I guess we could pass it a flag to just abort / preempt / cancel the request instead. What are your thoughts?
>
> Since the Router also has some (predictive) info on how loaded each engine is (batch-token-wise or kv-load-wise), it may also be able to make an intelligent guess as to what the optimal "retry" wait time is.
>
> P.S. most of the active load prediction logic is in lib/llm/kv_router/sequence.rs and lib/llm/kv_router/scheduler.rs, if interested.

Tagging @kthui for his thoughts here as well.

I think we should allow the behavior to be modified to cancel instead of queue. Maybe we can tweak this with a max queue size; a rough sketch follows below.
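
A minimal sketch of that tweak, assuming a hypothetical bounded Router-side queue where a max size of 0 degenerates to cancel-instead-of-queue:

```rust
use std::collections::VecDeque;

/// Hypothetical bounded Router-side queue for requests that hit AllWorkersBusy.
struct BusyQueue<R> {
    queue: VecDeque<R>,
    max_len: usize, // max queue size; 0 means "cancel instead of queue"
}

impl<R> BusyQueue<R> {
    fn new(max_len: usize) -> Self {
        Self { queue: VecDeque::new(), max_len }
    }

    /// Try to park a request; hand it back to the caller (to be rejected) if the queue is full.
    fn push(&mut self, request: R) -> Result<(), R> {
        if self.queue.len() >= self.max_len {
            Err(request)
        } else {
            self.queue.push_back(request);
            Ok(())
        }
    }

    /// Pop the oldest parked request once capacity frees up.
    fn pop(&mut self) -> Option<R> {
        self.queue.pop_front()
    }
}

fn main() {
    let mut q: BusyQueue<&str> = BusyQueue::new(1);
    assert!(q.push("req-1").is_ok());
    assert!(q.push("req-2").is_err()); // over the max queue size: reject instead of queueing
    assert_eq!(q.pop(), Some("req-1"));
}
```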

@jorgeantonio21

@nnshah1 @tedzhouhk as discussed, scraping metrics directly from Prometheus clients lags by a few milliseconds (roughly 20ms or so in my benchmarks). In my experience, relying solely on scraping the Prometheus client often led my deployed LLM inference services to miss appropriate rate limiting (either blocking requests that could have been processed, or failing to block requests when the server was already overloaded).

The current EMA rate limiting logic in the PR above is about as 'real time' as it can possibly be, since it collects metrics as soon as they are computed (rather than once they are published to the collector).

I am happy to refactor the PR to include other metrics at the frontend level. But it makes sense to me that if the system detects sustained high load (in the form of amortized TTFT, ITL, or the number of in-flight requests), it should reject. We could do this not by rejecting all requests over the next few seconds, but only a percentage of them (which could also be a configurable parameter); a rough sketch follows below.
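
A rough sketch of that percentage-based idea, assuming the rand crate and an externally supplied overload signal (the struct and field names are illustrative):

```rust
use rand::Rng;

/// Hypothetical partial back-pressure: while overloaded, shed only `reject_fraction`
/// of incoming requests instead of all of them.
struct PartialRejector {
    /// Fraction of requests to reject while overloaded, e.g. 0.3 for 30% (configurable).
    reject_fraction: f64,
}

impl PartialRejector {
    fn should_reject(&self, overloaded: bool) -> bool {
        overloaded && rand::thread_rng().gen_bool(self.reject_fraction)
    }
}

fn main() {
    let rejector = PartialRejector { reject_fraction: 0.3 };
    // `overloaded` would come from the amortized TTFT/ITL or in-flight request signals.
    let rejected = rejector.should_reject(true);
    println!("reject this request: {rejected}");
}
```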

Happy to proceed on the best path decided.

Review comment on the proposal document (on the line "# Dynamo Rate Limit for Load Balancer Proposal"):

@nnshah1

What if we renamed this to

Dynamo Request Back Pressure or Dynamo Request Rejection or Dynamo Dynamic Back Pressure?

Rate limiting, I think, carries the idea of limiting clients to a certain number of requests per second - and this gets confused with higher layers in the stack that would impose such limits.

The proposal here is not so much a rate limit as it is a back pressure / request rejection proposal - IIUC - is that fair?

@jorgeantonio21

@nnshah1 this is totally right. I will rename the necessary items in the document.

@nnshah1

nnshah1 commented Aug 13, 2025

@PeaBrane @jorgeantonio21 @kthui - can we update here with the latest proposal? I believe we wanted to start with a limited scope and then add the additional phases as future-looking work. I would also like to use "Request Rejection" or "Preventing System Overload" rather than rate limiting as the naming scheme here. Let me know and I can help; I want to merge this, but with updates for the current direction.

@jorgeantonio21 changed the title from "feat: rate limit logic enhancement proposal" to "feat: request back pressure logic enhancement proposal" on Aug 15, 2025.