Drop-in CUDA Graph → HIP Graph translation layer for AMD gfx1030/1031 (RDNA2), featuring DeepSpeed-HIP inference kernels, safe eager fallback, dynamic-shape bucketing, and pure-Rust architectural contracts.
- Tier 1: pure-Python integration with monkey-patched torch.cuda.CUDAGraph
- Tier 2: native bridge for conditional nodes, rapid launch, and nested capture gaps
- Target: AMD Radeon RX 6700 XT / 6800 / 6900 class GPUs on ROCm
- Focus: transparent integration, safe fallback behavior, and practical performance on RDNA2
- Target Hardware
- Quick Start
- Two Operating Tiers
- Usage
- Architecture
- Observability
- Troubleshooting
- Current Capabilities & Performance
- Documentation
- License
| Component | Requirement |
|---|---|
| GPU | AMD Radeon RX 6700 XT / 6800 / 6900 (gfx1030, RDNA2) |
| ROCm | 7.2.0+ |
| PyTorch | 2.9+ (ROCm build) |
| Python | 3.12+ |
If you just want gfxGRAPH working with the fewest moving parts, start with Tier 1.
# Install PyTorch ROCm build
pip install torch --index-url https://download.pytorch.org/whl/rocm7.2
# Install gfxGRAPH from repo root
pip install /path/to/gfxGRAPH
# Verify
python3 -c "import gfxgraph; print(gfxgraph.__version__); print(gfxgraph.health_check())"
Expected result:
native_bridge: False
- This is normal in Tier 1
- All Python-level features still work
pip install /path/to/gfxGRAPH
pip install /path/to/gfxGRAPH/native
python3 -c "import gfxgraph; print(gfxgraph.health_check())"
Expected result:
native_bridge: True
The Rust crates (rs_gfxgraph, rs_gfxgraph_stats) provide zero-cost architectural contracts and fast-paths for graph routing. To build them from source during development:
# Ensure maturin is installed via your environment manager (e.g., uv)
# Build and install into the current environment
maturin develop --release --manifest-path rust/rs_gfxgraph/Cargo.toml
maturin develop --release --manifest-path rust/rs_gfxgraph_stats/Cargo.toml
gfxGRAPH works in two tiers depending on which dependencies you install. Most users only need Tier 1 because it provides the full Python-level integration, including the monkey-patch that makes CUDA graphs work transparently on RDNA2.
| Tier | Install Style | What You Get | Best For |
|---|---|---|---|
| Tier 1 | Pure Python | Monkey-patch, eager fallback, shape bucketing, validation, stats, health checks | Most users getting started |
| Tier 2 | Python + native companion | Native acceleration paths for routing, validation, and conditional helpers | Users who want lower Python overhead where available |
What you get:
- torch.cuda.CUDAGraph → BridgedCUDAGraph monkey-patch (transparent to callers)
- Eager fallback: capture/replay failures never crash, just run slower
- Shape bucketing: fewer graph captures for dynamic batch sizes
- VRAM safety cap: prevents graph capture OOM (GFXGRAPH_VRAM_CAP)
- Validation mode: catches silent HIP Graph correctness bugs (PyTorch #155684)
- Thread-safe stats: gfxgraph.stats() → capture/replay/fallback counts
- Health check: gfxgraph.health_check() → GPU info + smoke test
- Structured logging: HGB_LOG_LEVEL=debug|info|warn|error
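The monkey-patch and eager-fallback ideas in the list above can be illustrated without a GPU. The following is a minimal sketch, not gfxGRAPH's implementation: `fake_cuda`, `NativeGraph`, `BridgedGraph`, and this `enable()` are hypothetical stand-ins, and the simplified `capture(fn)` / `replay()` methods are an assumption rather than the real `torch.cuda.CUDAGraph` interface.

```python
import types

# Stand-in for the torch.cuda namespace so the sketch runs without a GPU.
fake_cuda = types.SimpleNamespace()


class NativeGraph:
    """Hypothetical original graph class; capture always fails in this sketch."""

    def capture(self, fn):
        raise RuntimeError("capture not supported")

    def replay(self):
        raise RuntimeError("nothing captured")


fake_cuda.CUDAGraph = NativeGraph


class BridgedGraph:
    """Hypothetical wrapper: try the native graph, fall back to eager on failure."""

    _original = None  # filled in by enable()

    def __init__(self):
        self._inner = BridgedGraph._original()
        self._fn = None
        self.fell_back = False

    def capture(self, fn):
        self._fn = fn
        try:
            self._inner.capture(fn)
        except RuntimeError:
            self.fell_back = True  # remember to run eagerly from now on

    def replay(self):
        if self.fell_back:
            return self._fn()  # eager path: slower, but does not crash
        return self._inner.replay()


def enable(namespace):
    """Swap the graph class in place; calling it twice is harmless."""
    if namespace.CUDAGraph is BridgedGraph:
        return
    BridgedGraph._original = namespace.CUDAGraph
    namespace.CUDAGraph = BridgedGraph


enable(fake_cuda)
graph = fake_cuda.CUDAGraph()  # callers still write CUDAGraph()
graph.capture(lambda: "eager result")
result = graph.replay()
```

Because the swap happens on the namespace attribute, caller code that constructs `CUDAGraph()` does not need to change, which is the property the list describes as transparent to callers.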
Dependencies:
# That's it — just PyTorch (ROCm build) and Python
pip install torch --index-url https://download.pytorch.org/whl/rocm7.2
Install gfxGRAPH:
# Preferred source install from repo root
pip install /path/to/gfxGRAPH
# Transitional compatibility path
pip install /path/to/gfxGRAPH/python/
Verify:
python3 -c "import gfxgraph; print(gfxgraph.__version__); print(gfxgraph.health_check())"
You'll see native_bridge: False. That is expected and fine: all Python-level features work without the native library.
This is the advanced path and requires the ROCm SDK.
What you get additionally:
- Native helper paths for selected bridge components (rs_gfxgraph, rs_gfxgraph_stats)
- Optional libhipgraph_bridge.so loading when present
- Lower Python overhead on supported paths
System dependencies (Ubuntu/Debian):
# ROCm SDK — the big one. Follow AMD's official guide:
# https://rocm.docs.amd.com/projects/install-on-linux/en/latest/
#
# Key packages needed:
sudo apt-get install -y \
rocm-dev \
hip-dev \
hipcc \
rocm-cmake
# Build tools
sudo apt-get install -y cmake ninja-build
⚠️ ROCm SDK installation is non-trivial. It requires kernel-level drivers, specific package repositories, and careful version matching. Plan for 30-60 min on a fresh system. If you're running PyTorch ROCm builds, you likely already have libamdhip64.so, but you still need hip-dev headers and hipcc for compiling the bridge.
cd /path/to/gfxGRAPH
cmake --preset release
cmake --build build -j$(nproc)
# Run tests
ctest --test-dir build --output-on-failure
pip install /path/to/gfxGRAPH
pip install /path/to/gfxGRAPH/native
pip install .[native] is intentionally not the supported source-install path in this batch. Tier 2 stays a two-step flow so plain pip install /path/to/gfxGRAPH remains a true pure-Python install.
gfxGRAPH checks GFXGRAPH_LIB first, then the canonical packaged resolver
gfxgraph._native.library_path(), then local build/ outputs, and finally
standard loader paths. During this phase the companion package still owns the
actual .so, but runtime code treats gfxgraph._native as the canonical lookup.
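The lookup order described above can be sketched as a small resolver. This is an illustration under stated assumptions, not the project's loader: `resolve_bridge_library`, its parameters, and the choice to return the `GFXGRAPH_LIB` value without checking that it exists are all hypothetical. Only the order (environment override, packaged resolver, local build outputs, standard loader paths) comes from the text.

```python
from pathlib import Path
from typing import Callable, Iterable, Mapping, Optional

LIB_NAME = "libhipgraph_bridge.so"


def resolve_bridge_library(
    env: Mapping[str, str],
    packaged_resolver: Callable[[], Optional[str]],
    build_dirs: Iterable[str],
) -> str:
    """Return the path (or bare name) that a loader should try first."""
    # 1. Explicit override wins (assumption: returned as-is, unchecked).
    override = env.get("GFXGRAPH_LIB")
    if override:
        return override

    # 2. Packaged resolver, standing in for gfxgraph._native.library_path().
    packaged = packaged_resolver()
    if packaged and Path(packaged).is_file():
        return str(packaged)

    # 3. Local build outputs, e.g. a build/ directory in the repo.
    for directory in build_dirs:
        candidate = Path(directory) / LIB_NAME
        if candidate.is_file():
            return str(candidate)

    # 4. Bare name: leave the search to the standard dynamic-loader paths.
    return LIB_NAME


fallback_choice = resolve_bridge_library({}, lambda: None, [])
override_choice = resolve_bridge_library(
    {"GFXGRAPH_LIB": "/opt/custom/libhipgraph_bridge.so"}, lambda: None, []
)
```

Passing the environment and resolver as arguments keeps the sketch testable without the native package installed.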
Verify native bridge loaded:
python3 -c "import gfxgraph; print(gfxgraph.health_check())"
# Should show: native_bridge: True
import torch
import gfxgraph
gfxgraph.enable()  # patches torch.cuda.CUDAGraph globally
# Your existing CUDA graph code works unchanged:
graph = torch.cuda.CUDAGraph()  # actually BridgedCUDAGraph
# ... capture_begin / capture_end / replay all delegate correctly
gfxGRAPH integrates transparently with SGLang's CUDA graph runner. Set these environment variables before launching:
# Required: enable RDNA2 kernel paths (activates gfxGRAPH)
export SGLANG_RDNA2_KERNELS=1
# Required for gfx1031 (RX 6700 XT)
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export PYTORCH_ROCM_ARCH=gfx1030
# Optional: validation mode (catches silent graph correctness bugs)
export GFXGRAPH=validate
# Optional: debug logging (alternative to validate; a later export of GFXGRAPH overrides an earlier one)
export GFXGRAPH=debug
# Optional: VRAM cap for graph capture scratch (default 0.80 = 80% of total)
export GFXGRAPH_VRAM_CAP=0.80
# Optional: replay hot mode (skips replay-path diagnostics for lowest overhead)
export GFXGRAPH_REPLAY_HOT_MODE=1
# Optional: unified replay mode selection (standard|adaptive|hot)
# - standard: trusted replay + sampled diagnostics
# - adaptive: enables adaptive eager/graph selection and signature winner cache
# - hot: leanest replay path (minimum replay diagnostics)
export GFXGRAPH_REPLAY_MODE=adaptive
# Optional: standard-mode trusted replay tuning (safe fallback remains enabled)
export GFXGRAPH_TRUSTED_REPLAY_THRESHOLD=16
export GFXGRAPH_TRUSTED_REPLAY_SAMPLE_INTERVAL=16
# Optional: disable gfxGRAPH while keeping RDNA2 kernels
export SGLANG_DISABLE_GFXGRAPH=1
# Launch SGLang
python3 -m sglang.launch_server --model-path <model> ...
SGLang logs gfxGRAPH status at startup:
INFO: gfxGRAPH v0.3.1 enabled (mode=normal, vram_cap=0.80)
INFO: gfxGRAPH health check passed: AMD Radeon RX 6700 XT (gfx1030), VRAM 10240MB free / 12288MB total
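The `vram_cap=0.80` value in the log corresponds to the GFXGRAPH_VRAM_CAP setting, documented as a fraction of total VRAM. A rough sketch of how such a cap could translate into a capture budget follows; `capture_budget_mb` is a hypothetical helper, and the handling of malformed or out-of-range values (falling back to the default) is an assumption, not documented behavior.

```python
DEFAULT_VRAM_CAP = 0.80  # documented default: 80% of total VRAM


def capture_budget_mb(total_mb: int, env: dict) -> int:
    """Turn the GFXGRAPH_VRAM_CAP fraction into a whole-MB budget."""
    raw = env.get("GFXGRAPH_VRAM_CAP", str(DEFAULT_VRAM_CAP))
    try:
        cap = float(raw)
    except ValueError:
        cap = DEFAULT_VRAM_CAP  # assumption: ignore unparsable values
    if not 0.0 < cap <= 1.0:
        cap = DEFAULT_VRAM_CAP  # assumption: ignore out-of-range values
    return int(total_mb * cap)


# With the 12288 MB total shown in the log line above:
default_budget = capture_budget_mb(12288, {})  # → 9830
tighter_budget = capture_budget_mb(12288, {"GFXGRAPH_VRAM_CAP": "0.5"})  # → 6144
```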
GFXGRAPH=1 python3 my_script.py # standard mode
GFXGRAPH=debug python3 my_script.py # verbose logging
GFXGRAPH=validate python3 my_script.py # correctness checking
GFXGRAPH_REPLAY_MODE=adaptive python3 my_script.py # adaptive eager/graph mode
GFXGRAPH_REPLAY_MODE=hot python3 my_script.py # lower-overhead replay path
┌──────────────────────────────────────────────────────┐
│ User Application │
├──────────────┬───────────────────┬───────────────────┤
│ PyTorch │ Direct HIP C │ Unmodified CUDA │
├──────────────┼───────────────────┼───────────────────┤
│ Layer 2 │ │ Layer 3 │
│ hipgraph_ │ │ libcudagraph_ │
│ bridge/ │ │ compat.so │
│ (Python) │ │ (LD_PRELOAD) │
├──────────────┴───────────────────┴───────────────────┤
│ Layer 1: libhipgraph_bridge.so │
│ Gap bridges · Routing logic · Kernel pool │
├──────────────────────────────────────────────────────┤
│ libamdhip64.so (ROCm · 104 symbols) │
├──────────────────────────────────────────────────────┤
│ gfx1030 · RDNA2 Hardware │
└──────────────────────────────────────────────────────┘
| # | Gap | Bridge Strategy | Availability |
|---|---|---|---|
| 51 | Conditional nodes | Per-branch graph dispatch with eager fallback | Tier 1/2 |
| 52 | Device-side launch | Native launch-path helpers when bridge library is present | Tier 2 |
| 53 | Dynamic input shapes | Shape bucketing with VRAM-aware capture + replay | Tier 1/2 |
| 54 | Nested capture | Native nested-capture support when bridge library is present | Tier 2 |
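To make the shape-bucketing row (gap 53) concrete, here is a minimal sketch of one common bucketing policy: rounding each batch size up to the next power of two. The policy, the `bucket_for` helper, and the `max_bucket` limit are assumptions for illustration; the actual bucketing rules used by ShapeBucketPool are not described in this document.

```python
from typing import Optional


def bucket_for(batch_size: int, max_bucket: int = 256) -> Optional[int]:
    """Round a batch size up to the next power of two.

    Returns None when the batch exceeds max_bucket, which a caller could
    treat as a signal to run eagerly instead of capturing a graph.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    if batch_size > max_bucket:
        return None
    bucket = 1
    while bucket < batch_size:
        bucket *= 2
    return bucket


# Every batch size from 1 to 32 maps onto one of six buckets, so at most
# six graph captures would be needed instead of 32.
captured_buckets = {bucket_for(n) for n in range(1, 33)}
```

Inputs smaller than their bucket would need padding up to the bucket size before replay; that step is omitted here.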
| Tier | Stack | Intent |
|---|---|---|
| 0 | torch.compile only | Baseline compiler path |
| 1 | HIP Graph + gfxGRAPH (Python-only) | Default production path |
| 2 | HIP Graph + gfxGRAPH (+ native companion) | Lower-overhead helper paths where available |
import gfxgraph
# Performance counters
gfxgraph.stats()
# → {'enabled_at': 1712..., 'capture_count': 32, 'replay_count': 1847,
# 'fallback_count': 0, 'validation_failures': 0, 'avg_replay_us': 42.3}
# Health check
gfxgraph.health_check()
# → {'ok': True, 'gpu': 'AMD Radeon RX 6700 XT', 'rocm': 'gfx1030',
# 'native_bridge': False, 'vram_total_mb': 12288, 'vram_free_mb': 10240,
# 'details': 'Graph capture/replay OK, output verified'}
# Status
gfxgraph.is_enabled()  # → True
native_bridge: False is expected in Tier 1. gfxGRAPH runs in pure-Python mode, and all key features work.
Build libhipgraph_bridge.so (see Tier 2 above) only if you need the two native-only gaps (device-side launch and nested capture).
- Verify ROCm is working: rocminfo | grep gfx
- Check HSA override: echo $HSA_OVERRIDE_GFX_VERSION (should be 10.3.0 for gfx1031)
- Test PyTorch: python3 -c "import torch; print(torch.cuda.is_available())"
- Check for PyTorch #155684 (HIP Graph correctness bug): use GFXGRAPH=validate
- Set AMD_SERIALIZE_KERNEL=3 and AMD_SERIALIZE_COPY=3 (SGLang sets these automatically)
- Reduce GFXGRAPH_VRAM_CAP if running near VRAM limits
- Try SGLANG_DISABLE_GFXGRAPH=1 to isolate whether gfxGRAPH is the issue
- Some graph shapes may genuinely fail on HIP; eager fallback is intentional
- Set HGB_LOG_LEVEL=debug for detailed failure reasons
- If all captures fail, the underlying HIP Graph support may be broken
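When judging whether fallbacks are occasional or systemic, the counters returned by gfxgraph.stats() can be summarized as a rate. The sketch below works on a plain dict with the key names shown in the Observability example; `fallback_rate` is a hypothetical helper, and treating replay_count and fallback_count as disjoint counters is an assumption.

```python
def fallback_rate(stats: dict) -> float:
    """Fraction of executions that went through the eager fallback path.

    Assumes 'replay_count' counts successful graph replays and
    'fallback_count' counts eager fallbacks, with no overlap.
    """
    replays = stats.get("replay_count", 0)
    fallbacks = stats.get("fallback_count", 0)
    attempts = replays + fallbacks
    return fallbacks / attempts if attempts else 0.0


# Counter values copied from the stats() example output in this README.
healthy = fallback_rate({"capture_count": 32, "replay_count": 1847, "fallback_count": 0})

# A made-up unhealthy case for contrast: every execution fell back.
all_fallback = fallback_rate({"capture_count": 0, "replay_count": 0, "fallback_count": 10})
```

A rate near 1.0 would match the last bullet above, where every capture fails and the underlying HIP Graph support is suspect.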
- BridgedCUDAGraph capture/replay works on gfx1030 with eager fallback safety.
- Dynamic-shape ShapeBucketPool capture/replay works across bucketed batch sizes.
- ConditionalGraph branch capture/replay works with fallback on per-branch failure.
- Includes explicitly tuned RDNA2 (gfx1030) deepspeed-hip inference kernels (layer norm, rms norm, tiled linear) and Triton kernels.
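The per-branch fallback behavior mentioned for ConditionalGraph can be pictured as a dispatcher that keeps one replay callable per branch and uses the eager function for any branch whose capture failed. This is a GPU-free sketch with hypothetical names (`BranchDispatcher`, `fake_capture`); it does not reflect ConditionalGraph's real interface.

```python
from typing import Callable, Dict, Set


class BranchDispatcher:
    """Hypothetical per-branch dispatch with eager fallback."""

    def __init__(self, eager_branches: Dict[str, Callable[[], str]]):
        self._eager = dict(eager_branches)
        self._replay: Dict[str, Callable[[], str]] = {}
        self.fallback_branches: Set[str] = set()

    def capture_all(self, capture_fn):
        """capture_fn(key, fn) returns a replay callable or raises RuntimeError."""
        for key, fn in self._eager.items():
            try:
                self._replay[key] = capture_fn(key, fn)
            except RuntimeError:
                self.fallback_branches.add(key)  # only this branch stays eager

    def run(self, key: str) -> str:
        # Prefer the captured replay; otherwise run the eager function.
        return self._replay.get(key, self._eager[key])()


def fake_capture(key, fn):
    """Stand-in for graph capture: the 'else' branch fails on purpose."""
    if key == "else":
        raise RuntimeError("capture failed")
    return lambda: "graph:" + fn()


dispatcher = BranchDispatcher({"then": lambda: "A", "else": lambda: "B"})
dispatcher.capture_all(fake_capture)
outputs = (dispatcher.run("then"), dispatcher.run("else"))
```

A failure in one branch leaves the other branch on the graph path, which is the point of handling fallback per branch rather than for the whole conditional.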
Run:
PYTHONPATH=python python benchmarks/bench_readme_public.py \
--run-count 3 \
--output benchmarks/results/readme_benchmark_latest.json
Results from benchmarks/results/readme_benchmark_latest.json (standard mode):
| Workload | Eager (ms/iter) | Graph (ms/iter) | Status |
|---|---|---|---|
| decode_like_layernorm_gelu_chain_bs1_d1024 | 0.1395 | 0.1276 | 1.09x gain |
| mlp_bs32_d1024 | 0.1023 | 0.1028 | 1.00x parity |
| mlp_bs128_d2048 | 0.6128 | 0.6157 | 1.00x parity |
Optional with GFXGRAPH_REPLAY_HOT_MODE=1:
| Workload | Eager (ms/iter) | Graph (ms/iter) | Status |
|---|---|---|---|
| decode_like_layernorm_gelu_chain_bs1_d1024 | 0.1378 | 0.1335 | 1.03x gain |
| mlp_bs32_d1024 | 0.1022 | 0.1032 | 0.99x parity |
| mlp_bs128_d2048 | 0.6130 | 0.6138 | 1.00x parity |
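The Status column in both tables is consistent with dividing eager time by graph time. The short sketch below reproduces the labels from the table values; the 2% band used to separate "parity" from "gain" is an assumption chosen to match the tables, not a documented threshold, and `speedup_label` is a hypothetical helper.

```python
def speedup_label(eager_ms: float, graph_ms: float, parity_band: float = 0.02) -> str:
    """Format eager/graph as a speedup with a coarse verdict."""
    speedup = eager_ms / graph_ms
    if abs(speedup - 1.0) < parity_band:
        verdict = "parity"
    else:
        verdict = "gain" if speedup > 1.0 else "loss"
    return f"{speedup:.2f}x {verdict}"


# (eager ms/iter, graph ms/iter) copied from the two tables above.
standard_mode = {
    "decode_like_layernorm_gelu_chain_bs1_d1024": (0.1395, 0.1276),
    "mlp_bs32_d1024": (0.1023, 0.1028),
    "mlp_bs128_d2048": (0.6128, 0.6157),
}
hot_mode = {
    "decode_like_layernorm_gelu_chain_bs1_d1024": (0.1378, 0.1335),
    "mlp_bs32_d1024": (0.1022, 0.1032),
    "mlp_bs128_d2048": (0.6130, 0.6138),
}

standard_labels = {name: speedup_label(*times) for name, times in standard_mode.items()}
hot_labels = {name: speedup_label(*times) for name, times in hot_mode.items()}
```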
Interpretation:
- Stability and Parity: The primary value is crash-free graph behavior with eager fallback safety.
- Modest Gains: Launch-bound decode workloads show modest gains (1.09x in standard mode, 1.03x in hot mode), while the compute-bound MLP workloads stay at parity (0.99x to 1.00x), as expected on RDNA2.
- Standard mode now uses trusted replay promotion with sampled diagnostics and preserved eager fallback safety.
- Hot replay mode remains available when you want the leanest replay path and can accept reduced replay-path diagnostics.
- All measured runs above completed with fallback: false (successful graph replay path).
- Benchmark JSON now captures provenance (commit_sha), ROCm runtime/driver hints, tracked environment variables, and repeated run samples for reproducibility.
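One way to read "trusted replay promotion with sampled diagnostics" is sketched below: a graph is checked on every replay until it has replayed cleanly a threshold number of times, after which only every N-th replay is checked. This interpretation is an assumption; the class `TrustedReplayGate` is hypothetical, and the defaults mirror the GFXGRAPH_TRUSTED_REPLAY_THRESHOLD=16 and GFXGRAPH_TRUSTED_REPLAY_SAMPLE_INTERVAL=16 values shown earlier without claiming that is how gfxGRAPH interprets them.

```python
class TrustedReplayGate:
    """Hypothetical gate deciding when replay diagnostics should run."""

    def __init__(self, threshold: int = 16, sample_interval: int = 16):
        self.threshold = threshold
        self.sample_interval = sample_interval
        self.clean_replays = 0

    def should_check(self) -> bool:
        if self.clean_replays < self.threshold:
            return True  # not yet trusted: check every replay
        since_promotion = self.clean_replays - self.threshold
        return since_promotion % self.sample_interval == 0  # sampled checks

    def record(self, ok: bool) -> None:
        # Assumption: any failure resets trust so full checking resumes.
        self.clean_replays = self.clean_replays + 1 if ok else 0


gate = TrustedReplayGate()
checks = 0
for _ in range(48):
    if gate.should_check():
        checks += 1
    gate.record(ok=True)
# 16 checks before promotion plus 2 sampled checks afterwards.
```

Under this reading, hot mode would correspond to skipping the checks entirely, trading diagnostics for the leanest replay path.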
MIT
