gfxGRAPH v0.3.4

Drop-in CUDA Graph → HIP Graph translation layer for AMD gfx1030/1031 (RDNA2), featuring DeepSpeed-HIP inference kernels, safe eager fallback, dynamic-shape bucketing, and pure-Rust architectural contracts.

At a Glance

Tier 1: pure-Python integration with monkey-patched torch.cuda.CUDAGraph
Tier 2: native bridge for conditional nodes, rapid launch, and nested capture gaps
Target: AMD Radeon RX 6700 XT / 6800 / 6900 class GPUs on ROCm
Focus: transparent integration, safe fallback behavior, and practical performance on RDNA2

Target Hardware

Component	Requirement
GPU	AMD Radeon RX 6700 XT / 6800 / 6900 (gfx1030/gfx1031, RDNA2)
ROCm	7.2.0+
PyTorch	2.9+ (ROCm build)
Python	3.12+

Quick Start

If you just want gfxGRAPH working with the fewest moving parts, start with Tier 1.

Fastest Path: Tier 1

# Install PyTorch ROCm build
pip install torch --index-url https://download.pytorch.org/whl/rocm7.2

# Install gfxGRAPH from repo root
pip install /path/to/gfxGRAPH

# Verify
python3 -c "import gfxgraph; print(gfxgraph.__version__); print(gfxgraph.health_check())"

Expected result: the health check reports native_bridge: False. This is normal in Tier 1, and all Python-level features still work.

Native Path: Tier 2

pip install /path/to/gfxGRAPH
pip install /path/to/gfxGRAPH/native

python3 -c "import gfxgraph; print(gfxgraph.health_check())"

Expected result:

native_bridge: True

Building the Rust Accelerators

The Rust crates (rs_gfxgraph, rs_gfxgraph_stats) provide zero-cost architectural contracts and fast-paths for graph routing. To build them from source during development:

# Ensure maturin is installed via your environment manager (e.g., uv)
# Build and install into the current environment
maturin develop --release --manifest-path rust/rs_gfxgraph/Cargo.toml
maturin develop --release --manifest-path rust/rs_gfxgraph_stats/Cargo.toml

Two Operating Tiers

gfxGRAPH works in two tiers depending on which dependencies you install. Most users only need Tier 1 because it provides the full Python-level integration, including the monkey-patch that makes CUDA graphs work transparently on RDNA2.

Tier Comparison

Tier	Install Style	What You Get	Best For
Tier 1	Pure Python	Monkey-patch, eager fallback, shape bucketing, validation, stats, health checks	Most users getting started
Tier 2	Python + native companion	Native acceleration paths for routing, validation, and conditional helpers	Users who want lower Python overhead where available

Tier 1: Python-Only Mode

What you get:

torch.cuda.CUDAGraph → BridgedCUDAGraph monkey-patch (transparent to callers)
Eager fallback — capture/replay failures never crash, just run slower
Shape bucketing — reduced graph captures for dynamic batch sizes
VRAM safety cap — prevents graph capture OOM (GFXGRAPH_VRAM_CAP)
Validation mode — catches silent HIP Graph correctness bugs (PyTorch #155684)
Thread-safe stats: gfxgraph.stats() → capture/replay/fallback counts
Health check: gfxgraph.health_check() → GPU info + smoke test
Structured logging: HGB_LOG_LEVEL=debug|info|warn|error
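
To make the shape-bucketing item above concrete, here is a minimal sketch of the general idea. It assumes power-of-two buckets, a common policy that is not confirmed to be the one gfxGRAPH uses. The names bucket_for and BucketCache are hypothetical, and plain callables stand in for captured graphs.

```python
from typing import Callable, Dict


def bucket_for(batch_size: int, max_bucket: int = 256) -> int:
    """Round a batch size up to the next power of two (assumed policy)."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    bucket = 1
    while bucket < batch_size:
        bucket *= 2
    if bucket > max_bucket:
        raise ValueError(f"batch_size {batch_size} exceeds max bucket {max_bucket}")
    return bucket


class BucketCache:
    """Keep one stand-in 'graph' per bucket instead of one per exact shape."""

    def __init__(self, capture: Callable[[int], Callable[[], str]]):
        self._capture = capture
        self._graphs: Dict[int, Callable[[], str]] = {}
        self.capture_count = 0

    def get(self, batch_size: int) -> Callable[[], str]:
        bucket = bucket_for(batch_size)
        if bucket not in self._graphs:
            # Capture happens only the first time a bucket is seen.
            self._graphs[bucket] = self._capture(bucket)
            self.capture_count += 1
        return self._graphs[bucket]


# Stand-in for a real capture: returns a callable tagged with its bucket size.
cache = BucketCache(lambda bucket: (lambda: f"replay(bucket={bucket})"))
results = [cache.get(bs)() for bs in (3, 4, 5, 7, 8, 17)]
print(results)
print("captures:", cache.capture_count)
```

Six distinct batch sizes map onto three buckets here, so only three captures are needed. A real implementation would also have to pad inputs up to the bucket size and slice the outputs back down.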

Dependencies:

# That's it — just PyTorch (ROCm build) and Python
pip install torch --index-url https://download.pytorch.org/whl/rocm7.2

Install gfxGRAPH:

# Preferred source install from repo root
pip install /path/to/gfxGRAPH

# Transitional compatibility path
pip install /path/to/gfxGRAPH/python/

Verify:

python3 -c "import gfxgraph; print(gfxgraph.__version__); print(gfxgraph.health_check())"

You'll see native_bridge: False — that's expected and fine. All Python-level features work without the native library.

Tier 2: Full Native Mode

This is the advanced path and requires the ROCm SDK.

What you get additionally:

Native helper paths for selected bridge components (rs_gfxgraph, rs_gfxgraph_stats)
Optional libhipgraph_bridge.so loading when present
Lower Python overhead on supported paths

System dependencies (Ubuntu/Debian):

# ROCm SDK — the big one. Follow AMD's official guide:
# https://rocm.docs.amd.com/projects/install-on-linux/en/latest/
#
# Key packages needed:
sudo apt-get install -y \
    rocm-dev \
    hip-dev \
    hipcc \
    rocm-cmake

# Build tools
sudo apt-get install -y cmake ninja-build

⚠️ ROCm SDK installation is non-trivial. It requires kernel-level drivers, specific package repositories, and careful version matching. Plan for 30-60 min on a fresh system. If you're running PyTorch ROCm builds, you likely already have libamdhip64.so — but you still need hip-dev headers and hipcc for compiling the bridge.

Option A: Build the Native Bridge Locally

cd /path/to/gfxGRAPH

cmake --preset release
cmake --build build -j$(nproc)

# Run tests
ctest --test-dir build --output-on-failure

Option B: Install the Native Companion Package

pip install /path/to/gfxGRAPH
pip install /path/to/gfxGRAPH/native

pip install .[native] is intentionally not a supported source-install path at this stage. Tier 2 stays a two-step flow so that a plain pip install /path/to/gfxGRAPH remains a true pure-Python install.

gfxGRAPH checks GFXGRAPH_LIB first, then the canonical packaged resolver gfxgraph._native.library_path(), then local build/ outputs, and finally standard loader paths. During this phase the companion package still owns the actual .so, but runtime code treats gfxgraph._native as the canonical lookup.
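
The lookup order described above can be sketched as a small resolver. This is an illustration of the ordering only, not gfxGRAPH's actual code: resolve_bridge_library and its parameters are hypothetical, the packaged resolver is passed in as a callable rather than imported, and skipping a GFXGRAPH_LIB value that points at a missing file is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Mapping, Optional


def resolve_bridge_library(
    env: Mapping[str, str],
    packaged_resolver: Callable[[], Optional[str]],
    build_dirs: Iterable[Path],
    lib_name: str = "libhipgraph_bridge.so",
) -> str:
    """Return the first usable library location, following the documented order."""
    # 1. Explicit override from the environment.
    override = env.get("GFXGRAPH_LIB")
    if override and Path(override).is_file():
        return override

    # 2. Packaged resolver (stand-in for gfxgraph._native.library_path()).
    try:
        packaged = packaged_resolver()
    except Exception:
        packaged = None
    if packaged and Path(packaged).is_file():
        return packaged

    # 3. Local build/ outputs.
    for directory in build_dirs:
        candidate = Path(directory) / lib_name
        if candidate.is_file():
            return str(candidate)

    # 4. Bare name: leave the search to the dynamic loader's standard paths.
    return lib_name


with tempfile.TemporaryDirectory() as tmp:
    fake_lib = Path(tmp) / "libhipgraph_bridge.so"
    fake_lib.write_bytes(b"")  # empty placeholder file, never loaded
    print(resolve_bridge_library({"GFXGRAPH_LIB": str(fake_lib)}, lambda: None, []))
    print(resolve_bridge_library({}, lambda: None, [Path(tmp)]))
print(resolve_bridge_library({}, lambda: None, []))
```

Passing the environment and the resolver in as arguments keeps the function easy to test without touching the real process environment.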

Verify native bridge loaded:

python3 -c "import gfxgraph; print(gfxgraph.health_check())"
# Should show: native_bridge: True

Usage

Standalone (any PyTorch code)

import gfxgraph
gfxgraph.enable()  # patches torch.cuda.CUDAGraph globally

# Your existing CUDA graph code works unchanged:
graph = torch.cuda.CUDAGraph()  # actually BridgedCUDAGraph
# ... capture_begin / capture_end / replay all delegate correctly
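
For readers unfamiliar with global class patching, the following sketch shows the general mechanism on a stand-in namespace, so it runs without PyTorch or a GPU. Everything here is hypothetical: fake_cuda stands in for torch.cuda, BridgedGraph for BridgedCUDAGraph, and patch/unpatch are illustrative helpers, not part of the gfxgraph API. Whether the real bridged class subclasses the original is also an assumption.

```python
from types import SimpleNamespace


class OriginalGraph:
    """Stand-in for torch.cuda.CUDAGraph."""

    def replay(self) -> str:
        return "original replay"


class BridgedGraph(OriginalGraph):
    """Stand-in for a bridged class that wraps the original behaviour."""

    def replay(self) -> str:
        return "bridged -> " + super().replay()


# Stand-in for the torch.cuda namespace.
fake_cuda = SimpleNamespace(CUDAGraph=OriginalGraph)
_saved = {}


def patch(namespace: SimpleNamespace) -> None:
    """Swap the class attribute once, remembering the original."""
    if "CUDAGraph" not in _saved:
        _saved["CUDAGraph"] = namespace.CUDAGraph
        namespace.CUDAGraph = BridgedGraph


def unpatch(namespace: SimpleNamespace) -> None:
    """Restore the original class if a patch is active."""
    if "CUDAGraph" in _saved:
        namespace.CUDAGraph = _saved.pop("CUDAGraph")


patch(fake_cuda)
graph = fake_cuda.CUDAGraph()  # callers keep using the same attribute name
print(type(graph).__name__, "|", graph.replay())
unpatch(fake_cuda)
print(fake_cuda.CUDAGraph.__name__)
```

One general Python caveat follows from this mechanism: code that ran `from torch.cuda import CUDAGraph` before the patch keeps its reference to the original class, so the patch should be applied before such imports.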

With SGLang

gfxGRAPH integrates transparently with SGLang's CUDA graph runner. Set these environment variables before launching:

# Required: enable RDNA2 kernel paths (activates gfxGRAPH)
export SGLANG_RDNA2_KERNELS=1

# Required for gfx1031 (RX 6700 XT)
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export PYTORCH_ROCM_ARCH=gfx1030

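# Optional: GFXGRAPH holds a single value, so choose one of the two modes below.
# If both exports run, the later one replaces the earlier one.

# validate: catches silent graph correctness bugs
export GFXGRAPH=validate

# debug: verbose logging
# export GFXGRAPH=debug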

# Optional: VRAM cap for graph capture scratch (default 0.80 = 80% of total)
export GFXGRAPH_VRAM_CAP=0.80

# Optional: replay hot mode (skips replay-path diagnostics for lowest overhead)
export GFXGRAPH_REPLAY_HOT_MODE=1

# Optional: unified replay mode selection (standard|adaptive|hot)
# - standard: trusted replay + sampled diagnostics
# - adaptive: enables adaptive eager/graph selection and signature winner cache
# - hot: leanest replay path (minimum replay diagnostics)
export GFXGRAPH_REPLAY_MODE=adaptive

# Optional: standard-mode trusted replay tuning (safe fallback remains enabled)
export GFXGRAPH_TRUSTED_REPLAY_THRESHOLD=16
export GFXGRAPH_TRUSTED_REPLAY_SAMPLE_INTERVAL=16

# Optional: disable gfxGRAPH while keeping RDNA2 kernels
export SGLANG_DISABLE_GFXGRAPH=1

# Launch SGLang
python3 -m sglang.launch_server --model-path <model> ...

SGLang logs gfxGRAPH status at startup:

INFO: gfxGRAPH v0.3.4 enabled (mode=normal, vram_cap=0.80)
INFO: gfxGRAPH health check passed: AMD Radeon RX 6700 XT (gfx1030), VRAM 10240MB free / 12288MB total

Via Environment Variable (auto-enables on import)

GFXGRAPH=1 python3 my_script.py        # standard mode
GFXGRAPH=debug python3 my_script.py    # verbose logging
GFXGRAPH=validate python3 my_script.py # correctness checking
GFXGRAPH_REPLAY_MODE=adaptive python3 my_script.py # adaptive eager/graph mode
GFXGRAPH_REPLAY_MODE=hot python3 my_script.py      # lower-overhead replay path
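
The sketch below shows one way the variables above could be read into a settings object. It is an assumption-laden illustration, not gfxGRAPH's parser: Settings and read_settings are hypothetical names, the accepted values are taken from the examples in this README, and rejecting unknown replay modes or out-of-range caps is a design choice made for the sketch.

```python
from dataclasses import dataclass
from typing import Mapping

REPLAY_MODES = {"standard", "adaptive", "hot"}


@dataclass(frozen=True)
class Settings:
    debug: bool        # GFXGRAPH=debug
    validate: bool     # GFXGRAPH=validate
    replay_mode: str   # GFXGRAPH_REPLAY_MODE
    vram_cap: float    # GFXGRAPH_VRAM_CAP, fraction of total VRAM


def read_settings(env: Mapping[str, str]) -> Settings:
    """Parse the documented variables from a mapping such as os.environ."""
    mode = env.get("GFXGRAPH", "").strip().lower()

    replay_mode = env.get("GFXGRAPH_REPLAY_MODE", "standard").strip().lower()
    if replay_mode not in REPLAY_MODES:
        raise ValueError(f"unknown GFXGRAPH_REPLAY_MODE: {replay_mode!r}")

    vram_cap = float(env.get("GFXGRAPH_VRAM_CAP", "0.80"))
    if not 0.0 < vram_cap <= 1.0:
        raise ValueError(f"GFXGRAPH_VRAM_CAP must be in (0, 1], got {vram_cap}")

    # GFXGRAPH is a single value, so debug and validate are mutually exclusive.
    return Settings(
        debug=(mode == "debug"),
        validate=(mode == "validate"),
        replay_mode=replay_mode,
        vram_cap=vram_cap,
    )


print(read_settings({"GFXGRAPH": "validate", "GFXGRAPH_REPLAY_MODE": "adaptive"}))
print(read_settings({"GFXGRAPH": "1"}))
```

In real use the mapping would be os.environ; taking it as a parameter keeps the function testable.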

Architecture

┌──────────────────────────────────────────────────────┐
│                   User Application                    │
├──────────────┬───────────────────┬───────────────────┤
│   PyTorch    │   Direct HIP C   │  Unmodified CUDA  │
├──────────────┼───────────────────┼───────────────────┤
│  Layer 2     │                   │  Layer 3          │
│  hipgraph_   │                   │  libcudagraph_    │
│  bridge/     │                   │  compat.so        │
│  (Python)    │                   │  (LD_PRELOAD)     │
├──────────────┴───────────────────┴───────────────────┤
│            Layer 1: libhipgraph_bridge.so             │
│     Gap bridges · Routing logic · Kernel pool         │
├──────────────────────────────────────────────────────┤
│         libamdhip64.so  (ROCm · 104 symbols)          │
├──────────────────────────────────────────────────────┤
│              gfx1030 · RDNA2 Hardware                 │
└──────────────────────────────────────────────────────┘

Gaps Bridged

#	Gap	Bridge Strategy	Availability
51	Conditional nodes	Per-branch graph dispatch with eager fallback	Tier 1/2
52	Device-side launch	Native launch-path helpers when bridge library is present	Tier 2
53	Dynamic input shapes	Shape bucketing with VRAM-aware capture + replay	Tier 1/2
54	Nested capture	Native nested-capture support when bridge library is present	Tier 2
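
As an illustration of the "per-branch graph dispatch with eager fallback" strategy in the table, here is a minimal sketch that runs without a GPU. BranchDispatcher is a hypothetical name, plain Python callables stand in for captured graphs, and using RuntimeError as the failure signal is an assumption.

```python
from typing import Callable, Dict, Hashable

Fn = Callable[[float], float]


class BranchDispatcher:
    """One stand-in 'graph' per branch, with the eager function as fallback."""

    def __init__(self, eager: Dict[Hashable, Fn]):
        self._eager = eager
        self._graphs: Dict[Hashable, Fn] = {}
        self.fallback_count = 0

    def capture(self, branch: Hashable, capture_fn: Callable[[Fn], Fn]) -> bool:
        """Try to capture a branch; on failure the branch simply stays eager."""
        try:
            self._graphs[branch] = capture_fn(self._eager[branch])
            return True
        except RuntimeError:
            self.fallback_count += 1
            return False

    def run(self, branch: Hashable, x: float) -> float:
        graph = self._graphs.get(branch)
        if graph is not None:
            try:
                return graph(x)
            except RuntimeError:
                # Stop retrying a broken graph and fall through to eager.
                self.fallback_count += 1
                del self._graphs[branch]
        return self._eager[branch](x)


def failing_capture(fn: Fn) -> Fn:
    raise RuntimeError("capture not supported for this branch")


dispatcher = BranchDispatcher({"small": lambda x: x * 2, "large": lambda x: x + 100})
dispatcher.capture("small", lambda fn: fn)   # 'capture' succeeds
dispatcher.capture("large", failing_capture) # 'capture' fails, stays eager
print(dispatcher.run("small", 3.0), dispatcher.run("large", 3.0))
print("fallbacks:", dispatcher.fallback_count)
```

A failure on one branch leaves the other branches on their graph path, which is the point of dispatching per branch rather than capturing the conditional as a whole.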

Routing Strategy

Tier	Stack	Intent
0	`torch.compile` only	Baseline compiler path
1	HIP Graph + gfxGRAPH (Python-only)	Default production path
2	HIP Graph + gfxGRAPH (+ native companion)	Lower-overhead helper paths where available

Observability

import gfxgraph

# Performance counters
gfxgraph.stats()
# → {'enabled_at': 1712..., 'capture_count': 32, 'replay_count': 1847,
#     'fallback_count': 0, 'validation_failures': 0, 'avg_replay_us': 42.3}

# Health check
gfxgraph.health_check()
# → {'ok': True, 'gpu': 'AMD Radeon RX 6700 XT', 'rocm': 'gfx1030',
#     'native_bridge': False, 'vram_total_mb': 12288, 'vram_free_mb': 10240,
#     'details': 'Graph capture/replay OK, output verified'}

# Status
gfxgraph.is_enabled()  # → True
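
The counters above lend themselves to a simple derived metric. The sketch below computes a fallback rate from a stats dictionary shaped like the sample output. fallback_rate is a hypothetical helper, and it assumes replay_count does not already include fallbacks, which this README does not state.

```python
def fallback_rate(stats: dict) -> float:
    """Fraction of replay attempts that fell back to eager execution."""
    replays = stats.get("replay_count", 0)
    fallbacks = stats.get("fallback_count", 0)
    total = replays + fallbacks
    return fallbacks / total if total else 0.0


# Sample values copied from the stats() output shown above.
sample = {
    "capture_count": 32,
    "replay_count": 1847,
    "fallback_count": 0,
    "validation_failures": 0,
    "avg_replay_us": 42.3,
}

print(f"fallback rate: {fallback_rate(sample):.1%}")
```

In a monitoring loop the input would be the dictionary returned by gfxgraph.stats().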

Troubleshooting

"Native bridge not available" message at startup

Expected in Tier 1. gfxGRAPH is running in pure-Python mode and all key features work. Build libhipgraph_bridge.so (see Tier 2 above) only if you need the two native-only gaps: device-side launch and nested capture.

Health check returns `ok: False`

Verify ROCm is working: rocminfo | grep gfx
Check HSA override: echo $HSA_OVERRIDE_GFX_VERSION (should be 10.3.0 for gfx1031)
Test PyTorch: python3 -c "import torch; print(torch.cuda.is_available())"
Check for PyTorch #155684 (HIP Graph correctness bug) — use GFXGRAPH=validate

CUDA graphs fail during SGLang model loading

Set AMD_SERIALIZE_KERNEL=3 and AMD_SERIALIZE_COPY=3 (SGLang sets these automatically)
Reduce GFXGRAPH_VRAM_CAP if running near VRAM limits
Try SGLANG_DISABLE_GFXGRAPH=1 to isolate whether gfxGRAPH is the issue
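
To see what lowering GFXGRAPH_VRAM_CAP changes, the arithmetic can be worked through with the VRAM figures from the health-check example (12288 MB total, 10240 MB free). The helpers below are hypothetical, and the rule that total usage after capture must stay under cap × total is an assumed reading of the cap, not confirmed behavior.

```python
def capture_budget_mb(total_mb: float, cap: float = 0.80) -> float:
    """VRAM ceiling implied by the cap, in MB."""
    if not 0.0 < cap <= 1.0:
        raise ValueError("cap must be in (0, 1]")
    return total_mb * cap


def fits_under_cap(total_mb: float, free_mb: float,
                   requested_mb: float, cap: float = 0.80) -> bool:
    """Would a capture needing requested_mb keep usage under the ceiling?"""
    used_mb = total_mb - free_mb
    return used_mb + requested_mb <= capture_budget_mb(total_mb, cap)


total, free = 12288, 10240  # figures from the health_check() example
print(f"budget at 0.80: {capture_budget_mb(total):.1f} MB")
print("7000 MB capture at 0.80:", fits_under_cap(total, free, 7000))
print("7000 MB capture at 0.70:", fits_under_cap(total, free, 7000, cap=0.70))
```

Under this reading, a lower cap rejects large captures earlier, trading some graph coverage for more headroom against out-of-memory failures.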

Fallback count keeps increasing

Some graph shapes may genuinely fail on HIP — eager fallback is intentional
Check HGB_LOG_LEVEL=debug for detailed failure reasons
If all captures fail, the underlying HIP Graph support may be broken

Current Capabilities & Performance (v0.3.4)

Verified capability snapshot

BridgedCUDAGraph capture/replay works on gfx1030 with eager fallback safety.
Dynamic-shape ShapeBucketPool capture/replay works across bucketed batch sizes.
ConditionalGraph branch capture/replay works with fallback on per-branch failure.
Includes explicitly tuned RDNA2 (gfx1030) DeepSpeed-HIP inference kernels (layer norm, RMS norm, tiled linear) and Triton kernels.

Public benchmark (RX 6700 XT / gfx1030, ROCm 7.2, torch 2.11.0+rocm7.2)

Run:

PYTHONPATH=python python benchmarks/bench_readme_public.py \
  --run-count 3 \
  --output benchmarks/results/readme_benchmark_latest.json

Results from benchmarks/results/readme_benchmark_latest.json (standard mode):

Workload	Eager (ms/iter)	Graph (ms/iter)	Status
decode_like_layernorm_gelu_chain_bs1_d1024	0.1395	0.1276	1.09x gain
mlp_bs32_d1024	0.1023	0.1028	1.00x parity
mlp_bs128_d2048	0.6128	0.6157	1.00x parity

Optional with GFXGRAPH_REPLAY_HOT_MODE=1:

Workload	Eager (ms/iter)	Graph (ms/iter)	Status
decode_like_layernorm_gelu_chain_bs1_d1024	0.1378	0.1335	1.03x gain
mlp_bs32_d1024	0.1022	0.1032	0.99x parity
mlp_bs128_d2048	0.6130	0.6138	1.00x parity
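
The Status column in both tables is the ratio of eager time to graph time. The short script below recomputes it from the published per-iteration timings, so the figures can be checked independently of the benchmark harness.

```python
# (workload, eager ms/iter, graph ms/iter), copied from the two tables above.
RESULTS = {
    "standard": [
        ("decode_like_layernorm_gelu_chain_bs1_d1024", 0.1395, 0.1276),
        ("mlp_bs32_d1024", 0.1023, 0.1028),
        ("mlp_bs128_d2048", 0.6128, 0.6157),
    ],
    "hot": [
        ("decode_like_layernorm_gelu_chain_bs1_d1024", 0.1378, 0.1335),
        ("mlp_bs32_d1024", 0.1022, 0.1032),
        ("mlp_bs128_d2048", 0.6130, 0.6138),
    ],
}


def speedup(eager_ms: float, graph_ms: float) -> float:
    """Values above 1.0 mean graph replay was faster than eager."""
    return eager_ms / graph_ms


for mode, rows in RESULTS.items():
    for name, eager_ms, graph_ms in rows:
        print(f"{mode:8s} {name:44s} {speedup(eager_ms, graph_ms):.2f}x")
```

Printed to two decimals, the ratios match the Status column; unrounded, the two MLP workloads land within about 1% of eager in both modes.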

Interpretation:

Stability and Parity: The primary value is crash-free graph behavior with eager fallback safety.
Modest Gains: Launch-bound decode workloads show modest gains (1.09x in standard mode, 1.03x in hot mode), while compute-bound MLP workloads stay within about 1% of eager, as expected on RDNA2.
Standard mode now uses trusted replay promotion with sampled diagnostics and preserved eager fallback safety.
Hot replay mode remains available when you want the leanest replay path and can accept reduced replay-path diagnostics.
All measured runs above completed with fallback: false (successful graph replay path).
Benchmark JSON now captures provenance (commit_sha), ROCm runtime/driver hints, tracked environment variables, and repeated run samples for reproducibility.
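
One plausible reading of "trusted replay promotion with sampled diagnostics" is sketched below. This is a guess at the semantics, not gfxGRAPH's implementation: it assumes a graph becomes trusted after a threshold of consecutive clean checked replays, that a trusted graph is then checked only on every N-th replay, and that any failed check demotes it. TrustedReplayPolicy is a hypothetical name, and the defaults mirror the example values of GFXGRAPH_TRUSTED_REPLAY_THRESHOLD and GFXGRAPH_TRUSTED_REPLAY_SAMPLE_INTERVAL.

```python
class TrustedReplayPolicy:
    """Decide when a replay should run diagnostics (assumed semantics)."""

    def __init__(self, threshold: int = 16, sample_interval: int = 16):
        self.threshold = threshold
        self.sample_interval = sample_interval
        self.clean_streak = 0
        self.replays_since_check = 0

    @property
    def trusted(self) -> bool:
        return self.clean_streak >= self.threshold

    def should_check(self) -> bool:
        """Untrusted graphs are always checked; trusted ones are sampled."""
        if not self.trusted:
            return True
        self.replays_since_check += 1
        if self.replays_since_check >= self.sample_interval:
            self.replays_since_check = 0
            return True
        return False

    def record(self, ok: bool) -> None:
        """Record the outcome of a checked replay; a failure demotes the graph."""
        if ok:
            self.clean_streak += 1
        else:
            self.clean_streak = 0
            self.replays_since_check = 0


policy = TrustedReplayPolicy()
checks = 0
for _ in range(48):
    if policy.should_check():
        checks += 1
        policy.record(ok=True)
print(f"checked {checks} of 48 replays, trusted={policy.trusted}")
```

Under these assumptions the first 16 replays are all checked, and the remaining 32 are checked twice, which is where the reduced steady-state overhead would come from.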

Documentation

License

MIT
