
🔀 Token-Level Routing for Edge Inference

Efficient collaborative decoding between edge-deployed small models and cloud-based large language models.

🎬 Demo

System Overview

🎥 Watch the demo


🧠 Overview

This project implements Token-Level Routing, a collaborative inference system in which a small on-device model performs most of the decoding and selectively routes critical tokens to a powerful cloud-based LLM.

This approach significantly reduces latency and cost while retaining output quality — ideal for edge scenarios like mobile phones or IoT devices.


🚀 Key Features

  • ⚡ Efficient: >60% accuracy boost while routing only ~7% of tokens to the LLM.
  • 🌐 Edge-Cloud Collaboration: Combines local lightweight models with cloud intelligence.
  • 🧭 Token-Level Routing: Fine-grained, confidence-driven token control.
  • 📱 Deployable: Lightweight ONNX runtime works on laptops and mobile devices.
  • 🖥️ LLM Backend: Compatible with SGLang for LLM serving and KV-cache extension.

🧩 Architecture

+-------------+           +-------------+           +-------------+
|  User Input |--Prompt-->|  SLM (ONNX) |--Tokens-->|   Router    |
+-------------+           +-------------+           +-------------+
                                                           |
                                           low-confidence tokens
                                                           v
                                                +-------------------+
                                                | LLM (server-side) |
                                                +-------------------+
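The routing step above can be sketched in a few lines. This is only an illustration of the idea, not the project's actual API: the threshold value, function names, and the `query_llm` stub are all assumptions, and max-probability is used as the confidence signal.

```python
import math

CONFIDENCE_THRESHOLD = 0.9  # assumed value, not from the project; tune per deployment

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(slm_logits, query_llm):
    """Pick the next token: keep the step local when the on-device SLM
    is confident, otherwise route it to the cloud LLM."""
    probs = softmax(slm_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= CONFIDENCE_THRESHOLD:
        return best, "slm"          # decode locally
    return query_llm(), "llm"       # low confidence: ask the large model

# A peaked SLM distribution stays on-device; a flat one is routed.
print(route_token([0.1, 8.0, 0.2], query_llm=lambda: 42))  # → (1, 'slm')
print(route_token([1.0, 1.1, 0.9], query_llm=lambda: 42))  # → (42, 'llm')
```

In a real decode loop this check runs once per token, which is how only a small fraction of tokens (the hard ones) ever reach the server.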

📘 Usage

See Guideline.md for setup and usage instructions.


💻 Platform Support

  • ✅ macOS (Apple M1/M2/M3) is already supported
  • 🔧 Android under development!

📫 Contact

For questions or collaborations, feel free to open an issue or email us.