RFC: Gateway Metric Aggregator #404

kerthcet · 2025-05-08T10:31:53Z

What this PR does / why we need it

Which issue(s) this PR fixes

Part of #376

Special notes for your reviewer

Does this PR introduce a user-facing change?

RFC: Gateway Metric Aggregator

Signed-off-by: kerthcet <[email protected]>

kerthcet · 2025-05-08T10:32:37Z

cc @cr7258 @googs1025, who may be interested.

kerthcet · 2025-05-08T10:33:04Z

I may tune it a bit tonight.

kerthcet · 2025-05-08T10:33:57Z

Forgot to cc @nayihz as well.

Signed-off-by: kerthcet <[email protected]>

googs1025 · 2025-05-09T03:30:05Z

docs/proposals/376-metric-aggregagor/README.md

+
+Let's break down the flow into several steps:
+
+- Step 1: we'll collect the metrics from the inference workloads, we choose `PUSH` mode here just to put less pressure on the gateway side, or the gateway will have iterate all the Pods which obviously will lead to performance issues.


This is just a thought, not an objection to this solution: 🤔 if we use sidecar reporting and the inference service is deployed by the user, will the user be charged for the sidecar's resource usage when billing is involved?

In addition, we must ensure that the sidecar rarely needs updating and is free of errors. If the sidecar needs a bug fix, will that affect all users who have already deployed it?

Changed the design to PULL mode. Copying the alternatives here:

- When collecting metrics from the inference workloads, `PUSH` mode would put less pressure on the gateway side; otherwise the gateway has to iterate over all the Pods, which will lead to performance issues. We didn't pick that approach because it would add load to the inference workloads and introduce more complexity to the system. The current approach forks as many goroutines as there are inference workloads to sync the metrics in parallel; this is feasible because goroutines are lightweight. Once the metrics aggregator becomes the bottleneck, we can consider using `PUSH` mode at the node level.
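For illustration, the parallel pull described in this alternative could look roughly like the sketch below. It is not the code from this PR: `Metrics`, `fetchFunc`, `pullAll`, and `fakeFetch` are hypothetical names, and the real scrape would presumably be a network call to each inference engine rather than the fake used here.

```go
package main

import (
	"fmt"
	"sync"
)

// Metrics is a hypothetical snapshot of one inference workload.
type Metrics struct {
	QueueSize int
}

// fetchFunc stands in for the real scrape call; it is an assumption of this sketch.
type fetchFunc func(pod string) (Metrics, error)

// pullAll scrapes every Pod in parallel, one goroutine per Pod, and returns
// the metrics of the Pods that answered successfully.
func pullAll(pods []string, fetch fetchFunc) map[string]Metrics {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out = make(map[string]Metrics, len(pods))
	)
	for _, pod := range pods {
		wg.Add(1)
		go func(pod string) {
			defer wg.Done()
			m, err := fetch(pod)
			if err != nil {
				return // skip Pods that failed to report
			}
			mu.Lock()
			out[pod] = m
			mu.Unlock()
		}(pod)
	}
	wg.Wait()
	return out
}

// fakeFetch simulates a scrape: "pod-b" is unreachable, every other Pod
// reports a queue size derived from its name.
func fakeFetch(pod string) (Metrics, error) {
	if pod == "pod-b" {
		return Metrics{}, fmt.Errorf("pod %s unreachable", pod)
	}
	return Metrics{QueueSize: len(pod)}, nil
}

func main() {
	got := pullAll([]string{"pod-a", "pod-b", "pod-cc"}, fakeFetch)
	fmt.Println(len(got), got["pod-a"].QueueSize, got["pod-cc"].QueueSize)
}
```

The mutex only guards the shared result map; the scrapes themselves run concurrently, which is where the "goroutines are lightweight" argument applies.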

googs1025 · 2025-05-09T03:31:41Z

docs/proposals/376-metric-aggregagor/README.md

+### Additional components introduced:
+
+- Pod Sidecar: a sidecar container is necessary for each inference workload, which was introduced in Kubernetes 1.28 as alpha feature, and enabled by default in 1.29, see [details](https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/). The sidecar will be responsible for collecting the metrics and pushing them to the AI gateway. Let's set the interval time to 100ms at first.
+- Redis: a Redis instance is necessary for the metrics storage and sharing, we can use the existing Redis instance in the cluster, or deploy a new one if not available.


When designing this part, we can make it more extensible, so that replacing Redis with another cache engine later will be easier. 😄

Yes, this is more of an implementation detail, so I didn't go into it much here.
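One way to keep the cache engine replaceable, as suggested above, is to put the storage behind a small interface. The sketch below is only an assumption about what that could look like: `MetricsStore` and `memoryStore` are hypothetical names, and a Redis-backed type would implement the same three methods.

```go
package main

import (
	"fmt"
	"sync"
)

// MetricsStore is a hypothetical abstraction over the cache engine, so that
// Redis can be swapped for another backend without touching the callers.
type MetricsStore interface {
	Put(pod string, queueSize int)
	Get(pod string) (queueSize int, ok bool)
	Delete(pod string)
}

// memoryStore is an in-memory implementation used here for illustration only.
type memoryStore struct {
	mu   sync.RWMutex
	data map[string]int
}

func newMemoryStore() *memoryStore {
	return &memoryStore{data: make(map[string]int)}
}

func (s *memoryStore) Put(pod string, queueSize int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[pod] = queueSize
}

func (s *memoryStore) Get(pod string) (int, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[pod]
	return v, ok
}

func (s *memoryStore) Delete(pod string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.data, pod)
}

// Compile-time check that memoryStore satisfies the interface.
var _ MetricsStore = (*memoryStore)(nil)

func main() {
	var store MetricsStore = newMemoryStore()
	store.Put("pod-a", 3)
	v, ok := store.Get("pod-a")
	fmt.Println(v, ok)
	store.Delete("pod-a")
	_, ok = store.Get("pod-a")
	fmt.Println(ok)
}
```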

googs1025 · 2025-05-09T03:35:40Z

docs/proposals/376-metric-aggregagor/README.md

+- Redis: a Redis instance is necessary for the metrics storage and sharing, we can use the existing Redis instance in the cluster, or deploy a new one if not available.
+- Gateway Plugin: a new plugin or [DynamicLoadBalancingBackend](https://github.com/envoyproxy/ai-gateway/blob/be2b479b04bc7a219b0c8239143bfbabebdcd615/filterapi/filterconfig.go#L199-L208) specifically in Envoy AI gateway to pick the best-fit Pod endpoints. However, we may block by the upstream issue [here](https://github.com/envoyproxy/ai-gateway/issues/604), we'll work with the Envoy AI Gateway team to resolve it ASAP. Maybe the final design will impact our implementation a bit but not much I think.
+
+### Data Structure


Another point to consider when calculating metrics: some metrics may have expired or may never have been reported. Scoring needs to account for this so that invalid metrics don't affect the final routing results.

Yes, this is the most annoying part; see the updates.
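To make the concern about expired or missing metrics concrete, here is a minimal sketch of staleness-aware scoring. It assumes a simple least-queue-size policy and a freshness threshold; `PodMetrics`, `pickLeastQueued`, `sampleMaxAge`, and the sample data are all hypothetical and not taken from the proposal.

```go
package main

import (
	"fmt"
	"time"
)

// PodMetrics is a hypothetical record kept per Pod by the aggregator.
type PodMetrics struct {
	Pod       string
	QueueSize int
	UpdatedAt time.Time // zero value means the Pod never reported
}

// sampleMaxAge is an assumed freshness threshold, chosen only for this example.
const sampleMaxAge = 200 * time.Millisecond

// sampleNow is a fixed reference time so the example is deterministic.
var sampleNow = time.Date(2025, time.May, 9, 0, 0, 0, 0, time.UTC)

// pickLeastQueued returns the Pod with the smallest queue among those whose
// metrics are fresh. Stale or never-reported metrics are ignored, so they
// cannot win the scoring. ok is false when no Pod has fresh metrics.
func pickLeastQueued(pods []PodMetrics, now time.Time, maxAge time.Duration) (pod string, ok bool) {
	bestQueue := 0
	for _, p := range pods {
		if now.Sub(p.UpdatedAt) > maxAge {
			continue // expired, or never reported (zero UpdatedAt)
		}
		if !ok || p.QueueSize < bestQueue {
			pod, bestQueue, ok = p.Pod, p.QueueSize, true
		}
	}
	return pod, ok
}

// sampleMetrics builds illustrative data relative to now.
func sampleMetrics(now time.Time) []PodMetrics {
	return []PodMetrics{
		{Pod: "pod-a", QueueSize: 1, UpdatedAt: now.Add(-5 * time.Second)},       // stale
		{Pod: "pod-b", QueueSize: 4, UpdatedAt: now.Add(-10 * time.Millisecond)}, // fresh
		{Pod: "pod-c", QueueSize: 7, UpdatedAt: now.Add(-20 * time.Millisecond)}, // fresh
		{Pod: "pod-d", QueueSize: 0},                                             // never reported
	}
}

func main() {
	pod, ok := pickLeastQueued(sampleMetrics(sampleNow), sampleNow, sampleMaxAge)
	fmt.Println(pod, ok)
}
```

Here `pod-a` and `pod-d` have the smallest queue sizes on paper, but neither is trusted; the caller would need a fallback (for example, plain round-robin) when `ok` is false.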

nayihz · 2025-05-09T09:50:48Z

docs/proposals/376-metric-aggregagor/README.md

+
+- Step 1: we'll collect the metrics from the inference workloads, we choose `PUSH` mode here just to put less pressure on the gateway side, or the gateway will have iterate all the Pods which obviously will lead to performance issues.
+- Step 2: the gateway plugin will parse the metrics and store them in the redis, this is for HA consideration and cache sharing. Once the instance is down, we can still retrieve the metrics from redis. And if we have multiple instances, we can share the metrics with each other via redis. Considering Envoy AI gateway already uses Redis for limit rating, we'll reuse the Redis here.
+- Step 3 & 4: Traffic comes, and the Router will retrieve the metrics from Redis and make routing decisions based on different algorithms, like queue size aware scheduling.


IIUC, the Router is the gateway, right? If so, I think "gateway" is unambiguous.

Changed to "router" everywhere.

nayihz · 2025-05-09T09:55:02Z

docs/proposals/376-metric-aggregagor/README.md

+
+Let's break down the flow into several steps:
+
+- Step 1: we'll collect the metrics from the inference workloads, we choose `PUSH` mode here just to put less pressure on the gateway side, or the gateway will have iterate all the Pods which obviously will lead to performance issues.


Are all metrics predefined by us, or can users define their own?

Right now, we define the algorithm, and the metrics come from the inference engines.

kerthcet · 2025-05-09T11:35:10Z

I'm a little busy today; I'll update the PR this weekend.

kerthcet · 2025-05-09T15:45:44Z

/retest

kerthcet · 2025-05-12T10:15:37Z

The main blocker right now is how to detect Pod deletion in the gateway; implementing another controller would be too heavy.

googs1025 · 2025-05-12T12:55:42Z

I don't quite understand the question. Don't we need a new application similar to the deployment of the gateway plugin? 🤔

Signed-off-by: kerthcet <[email protected]>

kerthcet · 2025-05-13T06:51:25Z

> I don't quite understand the question. Don't we need a new application similar to the deployment of the gateway plugin? 🤔

Because the gateway picks the Pod endpoint directly, what if a Pod is down but we haven't refreshed our internal store? Then we'd route traffic to a Pod that doesn't exist. To solve this, I changed the design to use a Pod controller that tracks the Pod status.

Please see the latest design.
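As a rough sketch of the Pod-controller idea (not the implementation in this PR), each ready Pod could get its own sync goroutine that is cancelled when the Pod is deleted, with the store entry removed at the same moment so the router can no longer pick it. `Aggregator`, `OnPodReady`, `OnPodDeleted`, `Lookup`, and `constFetch` are hypothetical names; a real controller would be driven by Kubernetes Pod events rather than direct calls.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// defaultInterval mirrors the 50ms default mentioned in the proposal.
const defaultInterval = 50 * time.Millisecond

type podSync struct {
	cancel context.CancelFunc
	done   chan struct{}
}

// Aggregator is a hypothetical in-memory metrics aggregator.
type Aggregator struct {
	mu       sync.RWMutex
	store    map[string]int // Pod name -> last seen queue size
	syncers  map[string]podSync
	interval time.Duration
	fetch    func(pod string) int // stand-in for the real scrape
}

func NewAggregator(interval time.Duration, fetch func(pod string) int) *Aggregator {
	return &Aggregator{store: map[string]int{}, syncers: map[string]podSync{}, interval: interval, fetch: fetch}
}

// OnPodReady records an initial sample and starts the per-Pod sync loop.
func (a *Aggregator) OnPodReady(pod string) {
	first := a.fetch(pod)
	a.mu.Lock()
	if _, exists := a.syncers[pod]; exists {
		a.mu.Unlock()
		return
	}
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	a.syncers[pod] = podSync{cancel: cancel, done: done}
	a.store[pod] = first
	a.mu.Unlock()

	go func() {
		defer close(done)
		ticker := time.NewTicker(a.interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				v := a.fetch(pod)
				a.mu.Lock()
				if ctx.Err() == nil { // do not resurrect a deleted Pod
					a.store[pod] = v
				}
				a.mu.Unlock()
			}
		}
	}()
}

// OnPodDeleted cancels the sync loop and removes the Pod from the store
// under the same lock, then waits for the goroutine to exit.
func (a *Aggregator) OnPodDeleted(pod string) {
	a.mu.Lock()
	s, ok := a.syncers[pod]
	if ok {
		s.cancel()
		delete(a.syncers, pod)
		delete(a.store, pod)
	}
	a.mu.Unlock()
	if ok {
		<-s.done
	}
}

// Lookup is what a router would call before picking an endpoint.
func (a *Aggregator) Lookup(pod string) (int, bool) {
	a.mu.RLock()
	defer a.mu.RUnlock()
	v, ok := a.store[pod]
	return v, ok
}

// constFetch is a fake scrape that always reports a queue size of 3.
func constFetch(pod string) int { return 3 }

func main() {
	agg := NewAggregator(defaultInterval, constFetch)
	agg.OnPodReady("pod-a")
	_, ok := agg.Lookup("pod-a")
	fmt.Println("routable after ready:", ok)
	agg.OnPodDeleted("pod-a")
	_, ok = agg.Lookup("pod-a")
	fmt.Println("routable after delete:", ok)
}
```

Cancelling the context while holding the lock is the design choice that matters here: a sync tick that races with the deletion sees the cancelled context and skips its write, so a deleted Pod never reappears in the store.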

nayihz · 2025-05-13T07:21:29Z

docs/proposals/376-metric-aggregagor/README.md

- Redis: a Redis instance is necessary for the metrics storage and sharing, we can use the existing Redis instance in the cluster, or deploy a new one if not available.
- Gateway Plugin: a new plugin or [DynamicLoadBalancingBackend](https://github.com/envoyproxy/ai-gateway/blob/be2b479b04bc7a219b0c8239143bfbabebdcd615/filterapi/filterconfig.go#L199-L208) specifically in Envoy AI gateway to pick the best-fit Pod endpoints. However, we may block by the upstream issue [here](https://github.com/envoyproxy/ai-gateway/issues/604), we'll work with the Envoy AI Gateway team to resolve it ASAP. Maybe the final design will impact our implementation a bit but not much I think.
+- Metrics Aggregator (MA): MA is working as the controller plane to sync the metrics, this is also one of the reason why we want to decouple it from the router, which working as a data plane. MA has several components:
+  - A Pod controller to manage the Pod lifecycle, for example, once a Pod is ready, it will add it to the internal store, and each Pod will fork a background goroutine to sync the metrics continuously, 50ms interval by default. Once the Pod is deleted, the goroutine will be stopped and removed from the store.


We need some special handling for the PD-disaggregation scenario.

Yes, PD is special and definitely needs extra work. What I want here is a base framework that we can iterate on later. Does that make sense to you?

Definitely.

nayihz · 2025-05-13T07:22:55Z

docs/proposals/376-metric-aggregagor/README.md

- Step 1: we'll collect the metrics from the inference workloads, we choose `PUSH` mode here just to put less pressure on the gateway side, or the gateway will have iterate all the Pods which obviously will lead to performance issues.
- Step 2: the gateway plugin will parse the metrics and store them in the redis, this is for HA consideration and cache sharing. Once the instance is down, we can still retrieve the metrics from redis. And if we have multiple instances, we can share the metrics with each other via redis. Considering Envoy AI gateway already uses Redis for limit rating, we'll reuse the Redis here.
- Step 3 & 4: Traffic comes, and the Router will retrieve the metrics from Redis and make routing decisions based on different algorithms, like queue size aware scheduling.
+- Step 1: we'll collect the metrics from the inference workloads in metrics aggregator.


Are we no longer using `PUSH` mode to collect metrics? I think push mode is better.

Pull is simpler for small clusters, but we may adopt push in the future; see the Alternatives section.

kerthcet · 2025-05-14T02:36:56Z

/lgtm
/kind documentation
Let's focus on the implementation details then.

Add KEP: Gateway Metric Aggregator

f52123b

Signed-off-by: kerthcet <[email protected]>

kerthcet changed the title ~~Add KEP: Gateway Metric Aggregator~~ Proposal: Gateway Metric Aggregator May 8, 2025

kerthcet changed the title ~~Proposal: Gateway Metric Aggregator~~ RFC: Gateway Metric Aggregator May 8, 2025

Update goals

1b527df

Signed-off-by: kerthcet <[email protected]>

kerthcet changed the title ~~RFC: Gateway Metric Aggregator~~ [WIP] RFC: Gateway Metric Aggregator May 8, 2025

googs1025 reviewed May 9, 2025

nayihz reviewed May 9, 2025

Udpate

b8fe48d

Signed-off-by: kerthcet <[email protected]>

nayihz reviewed May 13, 2025

kerthcet changed the title ~~[WIP] RFC: Gateway Metric Aggregator~~ RFC: Gateway Metric Aggregator May 13, 2025

kerthcet mentioned this pull request May 13, 2025

Proposal for LoRA autoscaler #313

Open

InftyAI-Agent added the lgtm and documentation labels and removed the do-not-merge/needs-kind label May 14, 2025

InftyAI-Agent assigned kerthcet May 14, 2025

InftyAI-Agent merged commit 105f9ef into InftyAI:main May 14, 2025
24 checks passed

kerthcet deleted the kep/metric-aggregator branch May 14, 2025 02:37
