Summary (P1)
An ensemble issues N panel calls plus 1 judge call per request, so it costs and consumes several times the tokens of a single-model request. Today there is no pre-flight cost/token control: budgets act only after the per-sub-call usage events are emitted, so a single ensemble request cannot be capped or degraded before it spends. Add a guardrail that degrades the ensemble to a single model (or fails) when a configured cap is exceeded.
This is the cost-budget guardrail in api7/AISIX-Cloud#804's Phase-2 roadmap ("cost-budget guardrail (max_cost_usd / token cap → degrade to a single model)").
Why
- Ensembles are gated Team+ and marketed as "several low-cost models stand in for a frontier model" — but without a cap, a large panel on long prompts can blow a budget far faster than operators expect.
- The N× multiplier makes ensembles the model kind that most needs a spend ceiling.
Proposed shape
- DP-native (token cap), v1: an optional cap on EnsembleConfig (e.g. max_total_tokens or max_panel_calls); when the projected/accumulated cost exceeds it, the executor degrades: skip the panel and serve a single configured model (e.g. the judge or the first panel member), or fail with a clear error. No pricing data needed.
- CP-backed (max_cost_usd): the managed budget controller already prices usage; wire a per-request max_cost_usd ceiling for ensembles that triggers the same degrade path. Needs CP pricing (the OSS proxy emits cost_usd=0).
- Make the degrade observable (a header / telemetry flag) so operators can see when it fired.
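The DP-native path above could be sketched roughly as follows. This is a minimal illustration, not the aisix-proxy implementation: the on_exceed and fallback_model fields, the Plan type, the BudgetExceeded error, and the plan_ensemble function are all hypothetical, and the projected token count is assumed to be supplied by the caller. Falling back to the judge is only one of the candidate degrade targets.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EnsembleConfig:
    panel: List[str]
    judge: str
    max_total_tokens: Optional[int] = None  # proposed cap; None means no cap
    on_exceed: str = "degrade"              # hypothetical: "degrade" or "fail"
    fallback_model: Optional[str] = None    # hypothetical degrade target


@dataclass
class Plan:
    models: List[str]  # models to call, panel members first, judge last
    degraded: bool     # would be surfaced via a header / telemetry flag


class BudgetExceeded(Exception):
    """Raised when the cap is exceeded and the config asks to fail."""


def plan_ensemble(cfg: EnsembleConfig, projected_tokens: int) -> Plan:
    # Within budget (or no cap configured): run the full ensemble.
    if cfg.max_total_tokens is None or projected_tokens <= cfg.max_total_tokens:
        return Plan(models=cfg.panel + [cfg.judge], degraded=False)
    # Over budget and configured to fail: return a clear error.
    if cfg.on_exceed == "fail":
        raise BudgetExceeded(
            f"projected {projected_tokens} tokens exceeds cap {cfg.max_total_tokens}"
        )
    # Over budget and configured to degrade: skip the panel, serve one model.
    target = cfg.fallback_model or cfg.judge
    return Plan(models=[target], degraded=True)
```

The decision is kept as a pure function of the config and the projection so that the CP-backed max_cost_usd ceiling could reuse the same degrade path with a priced estimate instead of a token count.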
Open questions
- Degrade target: judge-only? first panel member? operator-configured fallback model (could reference a routing model for free failover)?
- Pre-flight estimate vs. mid-flight stop (panels run concurrently, so a mid-flight stop only helps the judge call).
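One way to make the pre-flight option concrete is a worst-case token projection computed before any sub-call is issued. The sketch below is an assumption about how such an estimate might look, not a description of the executor: the function name, its parameters, and the premise that the judge receives the original prompt plus every panel answer are all hypothetical.

```python
def project_total_tokens(prompt_tokens: int, panel_size: int,
                         max_output_tokens: int) -> int:
    """Worst-case tokens (input + output) for one ensemble request."""
    # Each panel member reads the prompt and may emit up to max_output_tokens.
    panel = panel_size * (prompt_tokens + max_output_tokens)
    # Assumed: the judge reads the prompt plus all panel answers,
    # then emits its own answer of up to max_output_tokens.
    judge_input = prompt_tokens + panel_size * max_output_tokens
    judge = judge_input + max_output_tokens
    return panel + judge
```

With a 1,000-token prompt, three panel members, and a 500-token output limit, this projects 7,500 tokens against 1,500 for a single call. Because it is an upper bound, a cap based on it would degrade some requests that would have finished under budget.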
Scope
DP executor (aisix-proxy) for the token-cap + degrade mechanism; CP for the max_cost_usd ceiling + surfacing the control in the dashboard form. Tracking: api7/AISIX-Cloud#804 (Phase 2).