SGL-JAX is a high-performance, JAX-based inference engine for Large Language Models (LLMs), specifically optimized for Google TPUs. It is engineered from the ground up to deliver exceptional throughput and low latency for the most demanding LLM serving workloads.
The engine incorporates state-of-the-art techniques to maximize hardware utilization and serving efficiency, making it ideal for deploying large-scale models in production on TPUs.
- High-Throughput Continuous Batching: Implements a sophisticated scheduler that dynamically batches incoming requests, maximizing TPU utilization and overall throughput.
- Optimized KV Cache with Radix Tree: Utilizes a Radix Tree for KV cache management (conceptually similar to PagedAttention), enabling memory-efficient prefix sharing between requests and significantly reducing computation for prompts with common prefixes.
- FlashAttention Integration: Leverages a high-performance FlashAttention kernel for faster and more memory-efficient attention calculations, crucial for long sequences.
- Tensor Parallelism: Natively supports tensor parallelism to distribute large models across multiple TPU devices, enabling inference for models that exceed the memory of a single accelerator (see the JAX sketch after this list).
- OpenAI-Compatible API: Provides a drop-in replacement for the OpenAI API, allowing for seamless integration with a wide range of existing clients, SDKs, and tools (e.g., LangChain, LlamaIndex).
- Native Qwen Support: Includes first-class, optimized support for the Qwen model family, including recent Mixture-of-Experts (MoE) variants.
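To make the tensor-parallelism point concrete, here is a minimal, self-contained JAX sketch of column-wise weight sharding. It illustrates the general technique, not SGL-JAX's internal code; the shapes and the `tp` mesh axis name are assumptions for the example.

```python
# Minimal sketch of column-wise tensor parallelism in JAX. This illustrates
# the concept, not SGL-JAX's implementation; shapes and the "tp" axis name
# are made up for the example.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D device mesh over whatever devices are available
# (TPU cores on a real deployment, or CPU when run locally).
mesh = Mesh(np.array(jax.devices()), axis_names=("tp",))

x = jnp.ones((8, 512))      # activations (batch, hidden), kept replicated
w = jnp.ones((512, 2048))   # weight matrix whose output dim we shard

# Each device receives only its column slice of w ...
w_sharded = jax.device_put(w, NamedSharding(mesh, P(None, "tp")))
# ... while the activations are replicated on every device.
x_repl = jax.device_put(x, NamedSharding(mesh, P(None, None)))

# The matmul runs per-device on local slices; the result stays sharded
# along the output dimension, as in a tensor-parallel MLP layer.
y = x_repl @ w_sharded
print(y.shape, y.sharding)
```

Each device multiplies the replicated activations by its local column slice of the weight, so the layer's output stays sharded along the hidden dimension, which is the layout a tensor-parallel projection wants.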
SGL-JAX operates on a distributed architecture designed for scalability and performance:
- HTTP Server: The entry point for all requests, compatible with the OpenAI API standard.
- Scheduler: The core of the engine. It receives incoming requests, manages their prompts, and intelligently groups them into batches of token-generation work for the model executor.
- TP Worker (Tensor Parallel Worker): A set of workers that host the model weights, sharded via tensor parallelism, and execute the model's forward pass.
- Model Runner: Manages the actual JAX-based model execution, including the forward pass, attention computation, and KV cache operations.
- Radix Cache: A global, memory-efficient KV cache shared across all requests, enabling prefix reuse and reducing the memory footprint (see the sketch following this list).
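As a rough illustration of the prefix reuse behind the Radix Cache, the following trie-based sketch is a hypothetical simplification (a real radix tree compresses runs of tokens into single edges and also handles eviction, which is omitted here):

```python
# Simplified, trie-based sketch of prefix reuse in a radix-style KV cache.
# Hypothetical illustration only; token IDs are arbitrary.

class Node:
    def __init__(self):
        # token id -> Node; in this toy, each node stands for one token
        # whose KV entry is already cached
        self.children = {}

def insert(root, tokens):
    """Record a served sequence so later requests can reuse its prefix."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, Node())

def cached_prefix_len(root, tokens):
    """Number of leading tokens whose KV state can be reused as-is."""
    node, n = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node, n = node.children[t], n + 1
    return n

root = Node()
insert(root, [101, 7, 7, 42, 9])                  # first request fills the cache
print(cached_prefix_len(root, [101, 7, 7, 55]))   # -> 3: only one new token to prefill
```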
For more features and usage details, please read the documents in the `docs/` directory.
SGL-JAX is designed for easy extension to new model architectures. It currently provides first-class, optimized support for:
- Qwen
- Qwen 3
- Qwen 3 MoE
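Once one of these models is served through SGL-JAX's OpenAI-compatible endpoint, it can be queried with the standard OpenAI Python client. Below is a minimal sketch, assuming a server listening on localhost:30000 and an illustrative Qwen model identifier; adjust both to match your deployment.

```python
# Query an SGL-JAX server through its OpenAI-compatible API.
# The base_url, api_key, and model name below are assumptions for the
# example; use the values from your own server launch.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",  # assumed local endpoint
    api_key="EMPTY",                       # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",  # illustrative model id
    messages=[{"role": "user", "content": "Summarize what continuous batching does."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```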
For detailed performance evaluation and to run the benchmarks yourself, please see the scripts located in the `benchmark/` and `python/sgl_jax/` directories (e.g., `bench_serving.py`).
The project includes a comprehensive test suite to ensure correctness and stability. To run the full suite of tests:
```bash
cd test/srt
python run_suite.py
```
Contributions are welcome! If you would like to contribute, please feel free to open an issue to discuss your ideas or submit a pull request.