feat: request back pressure logic enhancement proposal #23
Conversation
@jorgeantonio21 Thanks for the dynamic rate limiting proposal. We are reviewing it. Please provide details on the current solution you are using so that we can understand the pain points and determine prioritization. We also welcome public contributions - let us know if you are interested in contributing to the codebase.
@jorgeantonio21 - this is a well thought out proposal - we'll look into it and assign a steward.
Thanks @jorgeantonio21 again for this nice proposal! Here's a summary for the discussion today:

Rate Limiter Design

For an inference pipeline, if we reject a request at an earlier stage, we have to make the rejection decision with less information, but we waste fewer resources on compute and communication. If we reject it at a later stage, we have more information and can make a more accurate decision, but we waste more resources. To trade off accuracy against resource utilization, the rate limiter should reject a request as soon as the information available at that stage indicates it will almost certainly miss its SLA. Hence, we propose a multi-level rate limiter:
Next steps

Here are the next steps, from short- to long-term:
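(The concrete levels are not reproduced in this summary. Purely as an illustration of the early-versus-late tradeoff, here is a hypothetical Python sketch; the stage names, signals, and thresholds are invented for this example, drawn loosely from the comments below, and are not the proposal's actual interface.)

```python
# Hypothetical sketch only: a request passes through progressively better-informed
# rejection stages, and is dropped as soon as the signals available at that stage
# say it will almost certainly miss its SLA.
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestInfo:
    prompt_tokens: int                          # known at the frontend
    predicted_kv_blocks: Optional[int] = None   # known once the router has estimated it


class RejectRequest(Exception):
    """Raised by a stage that decides the request will almost surely fail the SLA."""


def level1_frontend(inflight_kv_load: float, total_kv_cache: float) -> None:
    # Cheapest check, least information: aggregate KV pressure across all engines.
    if inflight_kv_load > 1.25 * total_kv_cache:
        raise RejectRequest("cluster kv load too high")


def level2_router(req: RequestInfo, engine_free_kv_blocks: int) -> None:
    # More information: this request's predicted KV footprint vs. the chosen engine.
    if req.predicted_kv_blocks is not None and req.predicted_kv_blocks > engine_free_kv_blocks:
        raise RejectRequest("no engine can fit this request's kv footprint")


def level3_engine(ttft_ema_ms: float, ttft_slo_ms: float) -> None:
    # Most information (live TTFT/ITL statistics), but the most resources already spent.
    if ttft_ema_ms > ttft_slo_ms:
        raise RejectRequest("engine TTFT EMA already exceeds the SLO")
```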
Hi @jorgeantonio21, @ryanolson proposes that we use kv load to do the rejection at level 1. For example, if the kv load of the current in-flight requests is more than 1.25x the available kv cache across all engines, we start to reject requests. WDYT? To achieve this, we need two inputs:
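As a rough illustration of that level-1 check (the function and parameter names below are hypothetical, not the actual API; the two inputs correspond to the aggregated in-flight kv load and the kv cache available across engines):

```python
def should_reject_at_frontend(inflight_kv_blocks: int,
                              available_kv_blocks: int,
                              overload_factor: float = 1.25) -> bool:
    """Level-1 rejection sketch: start shedding requests once the kv blocks held
    by in-flight requests exceed `overload_factor` times the kv cache available
    across all engines. Both inputs would come from a shared metrics feed."""
    return inflight_kv_blocks > overload_factor * available_kv_blocks


# Example: 4000 blocks in flight vs. 3600 available -> 4000 <= 1.25 * 3600, still accepted.
assert should_reject_at_frontend(4000, 3600) is False
```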
Load can be subscribed to. We had a metrics aggregator that @rmccorm4 had built. The idea is that many entities would benefit from tracking the kv load; instead of having each component scrape or aggregate it independently, the metrics aggregator would, for certain high-demand metrics, continuously update and publish the most recent values. This way, interested parties would simply subscribe.
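As a sketch of that subscribe pattern (class and method names are illustrative, not the metrics aggregator's real API): a subscriber just caches the latest published value instead of scraping engines itself.

```python
class KvLoadSubscriber:
    """Illustrative only: the aggregator keeps publishing the latest kv-load value,
    and each interested component simply remembers the most recent value it has
    received instead of scraping or aggregating engine metrics on its own."""

    def __init__(self) -> None:
        self.latest_kv_load = None

    def on_metric_event(self, kv_load: float) -> None:
        # Called whenever the aggregator publishes a new value.
        self.latest_kv_load = kv_load

    def current(self, default: float = 0.0) -> float:
        # Rejection logic reads the cached value; no scrape round-trip on the hot path.
        return self.latest_kv_load if self.latest_kv_load is not None else default
```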
@rmccorm4 could you please share some pointers?
@tedzhouhk this "metrics component" was a proof-of-concept of what @ryanolson was describing: https://github.com/ai-dynamo/dynamo/tree/main/components/metrics. It's mostly untouched and untested since GTC, so it may have some bugs, but pairing it with the mock worker should make it pretty easy to play around with. It gathered metrics in two ways:
Lastly, it aggregated some of the metrics it gathered from all components via (1) the NATS stats handler endpoints, and published a metrics event of the aggregated metrics that could theoretically be consumed by something else - for example, multiple instances of the KV router sharing a single aggregated view of the state/metrics. I don't think anything consumes this event today; it's just something that could theoretically be used. With the recent metrics/observability additions going into the runtime/LLM bits from others, there may be more opportunities for hooking into places, or more metrics information that can be gathered from other metrics endpoints. This metrics aggregation example is pretty dated at this point. Hope that helps.
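A rough sketch of that aggregation step under assumed field names (not the actual metrics event schema): per-worker stats are folded into one cluster-wide view that could then be published as a single event.

```python
from dataclasses import dataclass


@dataclass
class WorkerStats:
    # Hypothetical per-worker stats, e.g. as returned by a stats handler endpoint.
    kv_blocks_used: int
    kv_blocks_total: int
    inflight_requests: int


@dataclass
class AggregatedStats:
    kv_blocks_used: int
    kv_blocks_total: int
    inflight_requests: int

    @property
    def kv_utilization(self) -> float:
        return self.kv_blocks_used / self.kv_blocks_total if self.kv_blocks_total else 0.0


def aggregate(workers: list[WorkerStats]) -> AggregatedStats:
    """Fold per-worker stats into a single cluster-wide view that could be published
    as one metrics event and consumed by e.g. multiple KV router instances."""
    return AggregatedStats(
        kv_blocks_used=sum(w.kv_blocks_used for w in workers),
        kv_blocks_total=sum(w.kv_blocks_total for w in workers),
        inflight_requests=sum(w.inflight_requests for w in workers),
    )
```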
@rmccorm4 thanks a lot for the detailed info! Is the following correct? The
@tedzhouhk - I thought our concern here was that metrics would lag behind and would not allow us to handle bursts well. I suspect we'd want a high water mark for the system regardless? Would we run into the same behavior as with the ttft / itl ema?
@tedzhouhk The

My original plan is to maintain a queue at the Router end if this "busy" error is triggered, but I guess we can pass it a flag to just abort / preempt / cancel the request instead. What are your thoughts? Since the Router also has some (predictive) info on how loaded each engine is (batch-token-wise or kv-load-wise), it may also be able to make an intelligent guess as to what the optimal "retry" wait time is.

P.S. most of the active load prediction logic is in
Yes, we will; that's why we want to move ttft/itl-based rejection to the engine level.
@kthui - for his thoughts here as well. I think we should allow the behavior to be modified to cancel instead of queue - maybe we can tweak this with a max queue size.
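A minimal sketch of that queue-versus-cancel behavior with a max queue size (class, method, and return names are invented for illustration, not the Router's actual interface):

```python
from collections import deque


class BusyPolicy:
    """Illustrative only: when an engine reports "busy", either park the request in a
    bounded retry queue or reject/cancel it immediately, depending on configuration."""

    def __init__(self, max_queue_size: int = 0) -> None:
        # max_queue_size == 0 means "never queue, always reject/cancel".
        self.max_queue_size = max_queue_size
        self.retry_queue: deque = deque()

    def on_busy(self, request: object) -> str:
        if len(self.retry_queue) < self.max_queue_size:
            self.retry_queue.append(request)
            return "queued"     # retried later, e.g. after a router-estimated wait time
        return "rejected"       # surfaced to the client as a busy/overload error
```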
@nnshah1 @tedzhouhk as discussed, scraping metrics directly from Prometheus clients lags behind by a few milliseconds (roughly 20ms in my benchmarks). In my experience, relying solely on scraping the Prometheus client often led my deployed LLM inference services to miss appropriate rate limiting - either blocking requests that could have been processed, or failing to block requests when the server was already overloaded. The current EMA rate-limiting logic in the PR above is as "real time" as it can possibly be, since it collects metrics as soon as they are computed (rather than after they are published to the collector). I am happy to refactor the PR to include other metrics at the frontend level. But it makes sense to me that if the system detects too much load (whether in the form of amortized TTFT, ITL, or the number of in-flight requests), it should reject. We could also do this not by rate limiting all requests over the next few seconds, but only a percentage of them (which could be a configurable parameter). Happy to proceed on whichever path is decided.
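For reference, a minimal sketch of an EMA check of this kind combined with the "reject only a percentage" idea (illustrative only; the PR's actual implementation may differ): the EMA is updated as soon as each measurement is produced, and once it crosses a high water mark only a configurable fraction of new requests is shed.

```python
import random


class EmaShedder:
    """Illustrative only: update a TTFT/ITL EMA as measurements arrive, and once it
    crosses a high water mark, shed a configurable fraction of new requests."""

    def __init__(self, alpha: float = 0.1, high_water_mark_ms: float = 500.0,
                 shed_fraction: float = 0.5) -> None:
        self.alpha = alpha
        self.high_water_mark_ms = high_water_mark_ms
        self.shed_fraction = shed_fraction
        self.ema_ms = None

    def observe(self, latency_ms: float) -> None:
        # Update the EMA as soon as the measurement is computed - no scrape delay.
        if self.ema_ms is None:
            self.ema_ms = latency_ms
        else:
            self.ema_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ema_ms

    def should_reject(self) -> bool:
        # Below the high water mark, accept everything; above it, shed a fraction.
        if self.ema_ms is None or self.ema_ms <= self.high_water_mark_ms:
            return False
        return random.random() < self.shed_fraction
```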
@@ -0,0 +1,800 @@
# Dynamo Rate Limit for Load Balancer Proposal
What if we renamed this to
Dynamo Request Back Pressure, Dynamo Request Rejection, or Dynamo Dynamic Back Pressure?
Rate Limiting, I think, carries the idea of limiting clients to a certain number of requests per second - and this gets confused with higher layers in the stack that would impose such limits.
The proposal here is not so much a rate limit as it is a back pressure / request rejection proposal - IIUC - is that fair?
@nnshah1 this is totally right. I will rename the necessary items in the document.
@PeaBrane @jorgeantonio21 @kthui - can we update this with the latest proposal? I believe we wanted to start with a limited scope and then add the additional phases as future-looking work. I would also like to use "Request Rejection" or "Preventing System Overload" rather than rate limiting as the naming scheme here - let me know and I can help. I want to merge this, but with updates for the current direction.