feat: Add R2-Router submission #68
R2-Router is a category-aware LLM router that uses Ridge regression to predict per-query quality scores for 4 LLMs (Qwen3-235B, Qwen3-80B, Gemini Flash, Claude Haiku) across 9 token budgets. Routes via risk = (1-lambda)*quality - lambda*cost with shrinkage toward category means.

- Config: 4-model pool with lambda=0.999
- Predictions: 8400 regular + 2427 optimality entries (10827 total)
- Robustness: 420 entries
- Model registrations: Qwen3-235B and Qwen3-80B added to universal_model_names
- Cost configs: Added pricing for both Qwen3 models

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
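The routing rule in the description can be illustrated with a minimal sketch. This is not the submission's actual code: the function names, the shrinkage weight `alpha`, the linear form of the shrinkage, and the assumption that costs are already normalized to a scale comparable with quality are all hypothetical choices made for illustration.

```python
# Hypothetical sketch of score = (1 - lambda) * quality - lambda * cost,
# with predicted quality shrunk toward a per-category mean.
# Candidates are keyed by (model, token_budget) pairs.

def shrink(pred, category_mean, alpha):
    """Blend a per-query prediction with the category mean (alpha is assumed)."""
    return alpha * pred + (1.0 - alpha) * category_mean


def route(pred_quality, category_mean, cost, lam=0.999, alpha=0.5):
    """Return the (model, budget) candidate with the highest score."""
    best_key, best_score = None, float("-inf")
    for key, pred in pred_quality.items():
        quality = shrink(pred, category_mean[key], alpha)
        score = (1.0 - lam) * quality - lam * cost[key]
        if score > best_score:
            best_key, best_score = key, score
    return best_key, best_score


if __name__ == "__main__":
    # Toy numbers, not taken from the submission.
    preds = {("model-a", 200): 0.9, ("model-b", 200): 0.5}
    means = {("model-a", 200): 0.7, ("model-b", 200): 0.5}
    costs = {("model-a", 200): 0.4, ("model-b", 200): 0.0}
    print(route(preds, means, costs, lam=0.5, alpha=0.5))
```

With `lam` close to 1, as in the stated config, the cost term dominates unless costs are expressed on a much smaller scale than quality, so the choice of cost units matters a great deal in a rule of this shape.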
Router Evaluation Results
Router:
RouterArena Metrics
Optimality Metrics
Evaluation completed by RouterArena automated workflow
- gemini-2.5-flash → gemini-2.0-flash-001 (actual OpenRouter API)
- claude-3-haiku-20240307 → claude-haiku-4.5 (actual OpenRouter API)
- Added claude-haiku-4.5 to universal_model_names and model_cost.json

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
Hi @jqxue1999, thanks for evaluating your router with RouterArena. If the results look good to you, we will go ahead and post them on our website and README.
Hi, thanks again for providing the RouterArena evaluation support. We are currently preparing a new set of results with some additional models integrated, and the performance may improve further. Would it be possible to hold off on posting the current results for now? We will share the updated evaluation with you very soon. Thanks a lot for your patience and support!
OK, sounds good. I've converted this PR to a draft.
- Switch from Ridge regression to Global KNN (K=28, cosine, distance-weighted)
- Train on sub_10 split (809 queries), route all 8400
- Pool: Qwen3-235B (72.8%), Gemini 2.5 Flash (19.8%), Ministral-3B (7.4%)
- Acc=70.64%, Cost=$0.0496/1kq, Arena(β=0.1)=71.21

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
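For readers unfamiliar with the technique named in this commit, here is a minimal pure-Python sketch of distance-weighted KNN regression under cosine distance. It is an illustration only, not the submission's implementation: the function names, the `eps` guard against division by zero, and the reading of "Global" as a single neighbor index shared across all categories are assumptions.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def knn_predict(query, train_X, train_y, k=28, eps=1e-12):
    """Predict a target vector as the inverse-distance-weighted mean
    of the targets of the k nearest training points.

    train_X: list of embedding vectors (hypothetical query embeddings)
    train_y: list of target vectors, e.g. one quality score per candidate
    """
    neighbors = sorted(
        ((cosine_distance(query, x), y) for x, y in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    weights = [1.0 / (dist + eps) for dist, _ in neighbors]
    total = sum(weights)
    n_targets = len(train_y[0])
    return [
        sum(w * y[j] for w, (_, y) in zip(weights, neighbors)) / total
        for j in range(n_targets)
    ]


if __name__ == "__main__":
    # Toy data: two training embeddings, one quality score each.
    X = [[1.0, 0.0], [0.0, 1.0]]
    y = [[1.0], [0.0]]
    print(knn_predict([1.0, 1.0], X, y, k=2))
```

A query equidistant from both toy training points receives equal weights, so the prediction is the plain average of their targets; a query that coincides with a training point is dominated by that point's target.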
Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <[email protected]> Co-Authored-By: Happy <[email protected]>
0d89c02 to d9c5f98
- Models: 235b, 80b, 30b, coder-next, gemini-flash, haiku
- Budgets: concise, budget_200, budget_400, budget_800
- Training: sub_10 only (809 queries), Global KNN (cosine, distance-weighted)
- Results: Acc=71.20%, Cost=$0.037/1kq, Arena(β=0.1)=71.94
- Beats Method 1 and Method 2 across all β values

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
Thanks for your patience. We’ve finished preparing the updated results and they’re ready now. Please feel free to proceed with posting them. Let us know if you need anything else from our side. Thanks again!
Summary
Files
- router_inference/config/r2-router.json
- router_inference/predictions/r2-router.json
- router_inference/predictions/r2-router-robustness.json
- universal_model_names.py
- model_cost/model_cost.json

Estimated Metrics
Validation
All checks pass:
- check_config_prediction_files.py r2-router full --check-generated-result ✓
- check_config_prediction_files.py r2-router robustness --check-generated-result ✓