An implementation of Physical Intelligence's π₀ policy served from Modal cloud GPUs, with transport overhead measured against a local H100 baseline.
Inspired by Modal's blog post on Physical Intelligence's production setup, this project replicates and extends that approach: the same model, multiple transport options benchmarked, and closed-loop success-rate evaluation on the gym-aloha MuJoCo simulator.
┌──────────────────────────────────────────────────────────────────────────────────────────┐
│ │
│ Robot / Client Modal Cloud │
│ │
│ ┌──────────┐ HTTP POST /infer ┌─────────────────────┐ │
│ │ │ ────────────────────►│ HTTPS Proxy │ ┌──────────────────────┐ │
│ │ │ 493 ms · 2.0 Hz │ (TLS + HTTP parse) │─►│ FastAPI /infer /ws │ │
│ │ │ └─────────────────────┘ └──────────┬───────────┘ │
│ │ │ WebSocket /ws ┌─────────────────────┐ │ │
│ │ π₀ │ ────────────────────►│ HTTPS Proxy │────────────►│ policy │
│ │ client │ 374 ms · 2.7 Hz │ (persistent conn) │ │ .infer() │
│ │ │ └─────────────────────┘ │ │
│ │ │ Raw TCP :8765 ┌─────────────────────┐ │ ~26 ms GPU │
│ │ │ ────────────────────►│ TCP Relay │────────────►│ (JAX/XLA) │
│ │ │ 294 ms · 3.4 Hz │ modal.forward() │ │ │
│ │ │ └─────────────────────┘ ┌──────────┴───────────┐ │
│ │ │ QUIC-over-TCP :8766 ┌─────────────────────┐ │ H100 ~26 ms │ │
│ │ │ ────────────────────►│ TCP Relay │─►│ (shared by all │ │
│ │ │ TBD ms · TBD Hz │ modal.forward() │ │ transports) │ │
│ │ │ (aioquic TLS 1.3) └─────────────────────┘ └──────────────────────┘ │
│ │ │ │
│ │ local │ direct (no relay) │
│ │ H100 │ ──────────────────────────────────────────────────────────────────────► │
│ │ │ 46 ms · 22 Hz │
│ └──────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────────────────┘
Four transports. The H100 takes ~26 ms for every inference call regardless of transport. All overhead is network.
Both low-latency portals use modal.forward(port, unencrypted=True) to get a raw TCP relay that bypasses Modal's HTTPS routing layer. The difference is what runs on top of it:
# server_pi0.py — raw TCP portal (port 8765)
with modal.forward(8765, unencrypted=True) as tunnel:
    portal["tcp_endpoint"] = f"{tunnel.tcp_socket[0]}:{tunnel.tcp_socket[1]}"
    await asyncio.start_server(handle_tcp, "0.0.0.0", 8765)
# server_pi0.py — QUIC portal (port 8766, aioquic TLS 1.3 on top of the same relay)
with modal.forward(8766, unencrypted=True) as tunnel:
    portal["quic_endpoint"] = f"{tunnel.tcp_socket[0]}:{tunnel.tcp_socket[1]}"
    await asyncio.start_server(handle_quic, "0.0.0.0", 8766)

The QUIC handler (_handle_quic_client) parses the DCID from the client's Initial packet, creates a QuicConnection, and drives the QUIC state machine manually, feeding each TCP frame in as a QUIC datagram and draining outgoing datagrams back into the TCP stream. Everything inside the QUIC connection (TLS 1.3, stream ordering, flow control) runs as normal; only the "UDP socket" layer is replaced by the TCP relay.
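Because TCP is a byte stream while QUIC expects discrete datagrams, the relay needs some delimiting scheme to recover datagram boundaries. The exact framing in server_pi0.py is not shown here; a minimal sketch assuming a hypothetical 2-byte big-endian length prefix per datagram:

```python
import struct

# Hypothetical framing for carrying QUIC datagrams over the TCP relay:
# each datagram gets a 2-byte big-endian length prefix. The actual
# framing in server_pi0.py may differ.

def pack_datagram(datagram: bytes) -> bytes:
    """Prefix one QUIC datagram with its length so it survives TCP coalescing."""
    return struct.pack(">H", len(datagram)) + datagram

def unpack_datagrams(buf: bytearray) -> list[bytes]:
    """Drain every complete length-prefixed datagram from a TCP receive buffer."""
    out = []
    while len(buf) >= 2:
        (n,) = struct.unpack_from(">H", buf)
        if len(buf) < 2 + n:
            break  # incomplete frame: wait for more TCP bytes
        out.append(bytes(buf[2:2 + n]))
        del buf[:2 + n]
    return out

# Two datagrams coalesced into one TCP segment come back out intact.
buf = bytearray(pack_datagram(b"initial") + pack_datagram(b"handshake"))
frames = unpack_datagrams(buf)
```

On the server side, each recovered frame would be handed to `QuicConnection.receive_datagram()`; outgoing datagrams are packed the same way before being written back to the TCP stream.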
# Client — QUIC over TCP (bench.py / eval_sr.py)
quic_cfg = QuicConfiguration(is_client=True, server_name="pi0-quic")
quic_cfg.verify_mode = ssl.CERT_NONE  # self-signed cert; private relay benchmark
conn = QuicConnection(configuration=quic_cfg)
conn.connect((host, port), now=time.monotonic())
# ... TCP frames pumped as QUIC datagrams via input_q / quic_loop ...

All transports use the same msgpack-encoded payload (protocol_pi0.py):
Request: {"cam_high": uint8[3,224,224], "state": float32[14]}
Response: {"action": float32[50,14], "inference_ms": float}
action is a 50-step chunk (ALOHA 14-DOF). The client consumes action_horizon=10 steps per policy call before re-querying.
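The consume-then-requery pattern can be sketched as follows. `fake_query_policy` and `run_steps` are hypothetical stand-ins for illustration; the real client issues the query over one of the transports above:

```python
from collections import deque

ACTION_HORIZON = 10  # steps executed per policy call
CHUNK_LEN = 50       # π₀ returns a 50-step action chunk

def fake_query_policy(obs):
    """Stand-in for a transport round trip; returns a full 50-step chunk."""
    return [[0.0] * 14 for _ in range(CHUNK_LEN)]

def run_steps(n_steps, query_policy=fake_query_policy):
    """Pop one action per env step; re-query once the 10-step window is spent."""
    queue = deque()
    n_queries = 0
    for _ in range(n_steps):
        if not queue:
            chunk = query_policy(obs=None)
            queue.extend(chunk[:ACTION_HORIZON])  # keep 10, drop the stale tail
            n_queries += 1
        action = queue.popleft()
        # env.step(action) would go here
    return n_queries
```

A 400-step episode therefore costs 40 round trips, one per 10 env steps, which is why per-call RTT is amortized so heavily in the closed-loop results below.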
gym-aloha env
└── pixels["top"] (H, W, 3) uint8
└── resize_with_pad(224, 224)
└── transpose (2,0,1) → cam_high (3, 224, 224) uint8
└── agent_pos (14,) float32 → state
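The geometry of the padded resize can be illustrated with a small helper (hypothetical; the real resize_with_pad lives in the openpi preprocessing code). Assuming gym-aloha's default 480×640 render, the frame scales down to 168×224 and picks up 56 rows of padding:

```python
def padded_resize_dims(h, w, target_h=224, target_w=224):
    """Scale (h, w) to fit inside the target while preserving aspect ratio,
    then report how much padding each axis needs to reach the target."""
    scale = min(target_h / h, target_w / w)
    new_h, new_w = round(h * scale), round(w * scale)
    return (new_h, new_w), (target_h - new_h, target_w - new_w)

# gym-aloha's default top camera render is 480x640.
dims, pad = padded_resize_dims(480, 640)  # ((168, 224), (56, 0))
```

Padding instead of plain resizing keeps the aspect ratio the model was trained with; the distortion-free image is what matters, not the exact pad placement.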
Pure inference latency with no environment stepping. Observations are real joint states from lerobot/aloha_sim_transfer_cube_human with random camera images (camera content does not affect model speed).
| Transport | RTT p50 | GPU p50 | Net overhead | p95 | Hz |
|---|---|---|---|---|---|
| Local H100 | 45.6 ms | 26.6 ms | 0 ms | 46.1 ms | 22 |
| Modal H100 — TCP portal | 294.5 ms | ~26 ms | 268 ms | 299 ms | 3.4 |
| Modal H100 — WebSocket | 374.1 ms | ~26 ms | 348 ms | 392 ms | 2.7 |
| Modal H100 — HTTP | 493.3 ms | ~26 ms | 467 ms | 533 ms | 2.0 |
Key insight: GPU time is identical across all transports. Every millisecond of overhead is transport, not compute. TCP portal saves 80 ms over WebSocket and 199 ms over HTTP by eliminating Modal's HTTPS proxy.
Full benchmark output:
==========================================
FULL BENCHMARK — n=30 warmup=5 (2026-04-12)
==========================================
LOCAL H100:
median wall-clock: 45.6 ms GPU: 26.6 ms Hz: 21.9
Modal H100 — TCP portal:
median RTT: 294.5 ms GPU: 51.2 ms net overhead: 243 ms Hz: 3.4
p95: 299.0 ms min: 292.1 ms max: 314.6 ms
Modal H100 — WebSocket:
median RTT: 374.1 ms GPU: 69.4 ms net overhead: 305 ms Hz: 2.7
p95: 392.0 ms min: 367.3 ms max: 399.0 ms
Modal H100 — HTTP:
median RTT: 493.3 ms GPU: 66.4 ms net overhead: 427 ms Hz: 2.0
p95: 533.3 ms min: 473.2 ms max: 555.5 ms
Closed-loop control: π₀ replans every 10 steps (action_horizon=10), 400 steps max per episode, AlohaTransferCube-v0.
| Mode | SR | Infer p50 | Episodes | Notes |
|---|---|---|---|---|
| Local H100 | 80% (4/5) | 48 ms | 5 | Smooth 22 Hz re-planning |
| Modal H100 — TCP | 100% (5/5) | 292 ms | 5 | Best cloud — bypasses HTTPS proxy |
| Modal H100 — WS | 100% (5/5) | 380 ms | 5 | Passes despite 380 ms latency |
| Modal A10G — WS | 80% (8/10) | 533 ms | 10 | A10G GPU 7× slower than H100 on JAX |
Key insight: The task tolerates 300-400 ms latency with 10-step action chunking. The model runs open-loop for 10 steps between re-queries, which smooths over network jitter. H100 is non-negotiable — A10G's 7× slower JAX execution brings latency above the tolerance threshold.
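One way to see the tolerance: amortize each round trip over the 10 open-loop steps it buys. Illustrative arithmetic only, using the p50 numbers from the tables above; env step time is ignored since it is identical across transports:

```python
ACTION_HORIZON = 10  # open-loop steps bought by each round trip

def amortized_ms(rtt_ms, horizon=ACTION_HORIZON):
    """Per-env-step share of one inference round trip."""
    return rtt_ms / horizon

local_per_step = amortized_ms(45.6)   # ~4.6 ms of inference cost per step
tcp_per_step = amortized_ms(294.5)    # ~29 ms per step
ws_per_step = amortized_ms(374.1)     # ~37 ms per step
```

Seen per step, even the WebSocket path costs under 40 ms of inference latency, which is why the cube-transfer task still succeeds; the A10G's longer GPU time pushes the per-step share past what the task tolerates.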
Transport overhead (p50, us-east-2 → us-west-2):
HTTP: 467 ms ← TLS handshake per request + HTTP parse + Modal HTTPS proxy
WS: 348 ms ← Modal HTTPS proxy (persistent connection, no per-request handshake)
TCP: 268 ms ← Modal raw TCP relay (no HTTPS layer, no proxy)
QUIC: TBD ← QUIC-over-TCP relay (aioquic TLS 1.3 on same TCP relay as TCP portal)
Pi: ~10 ms ← QUIC direct, client co-located in same AWS region as Modal containers
Local: 0 ms ← no network
Physical Intelligence's production system achieves 10-15 ms by running the robot client in the same AWS region (e.g. us-west-2) as their Modal containers, so the relay hop becomes sub-millisecond. Our 268 ms is cross-region: the client is in Asia, while the Modal relay (r4XX.modal.host) is in us-west-2.
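To separate pure network RTT from GPU time when comparing regions, a minimal stdlib probe can time a 1-byte echo over a persistent TCP connection. This is a hypothetical helper, not part of bench.py, and it assumes the far end echoes bytes back (point it at a plain echo endpoint, not the msgpack /infer server):

```python
import socket
import time

def tcp_echo_rtt_ms(host: str, port: int, n: int = 5) -> float:
    """Median round trip for a 1-byte echo over one persistent TCP connection."""
    samples = []
    with socket.create_connection((host, port), timeout=5) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # send immediately
        for _ in range(n):
            t0 = time.perf_counter()
            s.sendall(b"x")
            s.recv(1)
            samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]
```

Comparing this raw-socket RTT against the benchmark's per-request numbers shows how much of the overhead is distance versus protocol.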
| File | Purpose |
|---|---|
| server_pi0.py | Modal H100 inference server — HTTP, WebSocket, raw TCP portal, QUIC portal |
| protocol_pi0.py | msgpack wire protocol (shared between server and clients) |
| bench.py | Latency benchmark — all transports, configurable n and warmup |
| eval_sr.py | Closed-loop success-rate eval — local and cloud, all transports |
| record_eval.py | Records MP4 episodes + generates latency charts |
| charts/latency_breakdown.png | 5-panel chart: waterfall / stacked bar / timeline / CDF / box |
uv run modal serve robodal/server_pi0.py # dev (hot-reload)
uv run modal deploy robodal/server_pi0.py  # production

Watch logs for all three portals:
Ready ✓
TCP portal published: r4XX.modal.host:XXXXX
QUIC portal published: r4XX.modal.host:YYYYY
The first cold start takes 60-90 s for JAX JIT compilation. min_containers=1 keeps one container warm after the first deploy.
URL="https://<your-modal-url>"
uv run robodal/bench.py --mode local --n 30 --warmup 5
uv run robodal/bench.py --mode cloud --url "$URL" --transport quic --n 30 --warmup 5
uv run robodal/bench.py --mode cloud --url "$URL" --transport tcp --n 30 --warmup 5
uv run robodal/bench.py --mode cloud --url "$URL" --transport ws --n 30 --warmup 5
uv run robodal/bench.py --mode cloud --url "$URL" --transport http --n 30 --warmup 5

MUJOCO_GL=egl uv run robodal/eval_sr.py --mode local --episodes 10
MUJOCO_GL=egl uv run robodal/eval_sr.py --mode cloud --transport quic --episodes 10
MUJOCO_GL=egl uv run robodal/eval_sr.py --mode cloud --transport tcp --episodes 10
MUJOCO_GL=egl uv run robodal/eval_sr.py --mode cloud --url "$URL" --transport ws --episodes 10
MUJOCO_GL=egl uv run robodal/eval_sr.py --mode compare --url "$URL" --episodes 10

# Local + cloud WS side-by-side:
MUJOCO_GL=egl uv run robodal/record_eval.py --url "$URL" --gpu-label H100 --episodes 5
# Cloud TCP portal (reads Modal Dict — no URL needed):
MUJOCO_GL=egl uv run robodal/record_eval.py --transport tcp --skip-local --gpu-label H100 --episodes 5

Outputs:
- videos/local_h100.mp4 — 48 ms inference, smooth robot
- videos/cloud_h100_ws.mp4 — 380 ms, visible freeze every 10 steps
- videos/cloud_h100_tcp.mp4 — 292 ms, smaller freeze (bypasses HTTPS proxy)
- videos/comparison_*.mp4 — side-by-side at real wall-clock time
- charts/latency_breakdown.png — waterfall / stacked bar / CDF / box
| Setting | Value |
|---|---|
| Model | π₀ (Physical Intelligence) |
| Checkpoint | gs://openpi-assets/checkpoints/pi0_aloha_sim |
| Task | gym_aloha/AlohaTransferCube-v0 — transfer cube |
| Prompt | "Transfer cube" |
| Obs | cam_high (3×224×224 uint8 CHW) + state (14D float32) |
| Output | action chunk (50 × 14 float32) — horizon=50, aloha_dim=14 |
| Action horizon | 10 steps per policy call |
| Cloud GPU | H100 (Modal) |
| Benchmark date | 2026-04-12 |
- Co-location: run the client in the same AWS region as the Modal containers, targeting the ~10 ms production figure
- Test on bi-manual tasks beyond transfer-cube: wiping, insertion, folding (more dex-heavy tasks)
- Real hardware
- Validate that 292 ms (TCP) is within the task's real-world tolerance (expect harder than sim)
- Multi-camera observations
- Streaming action chunks
- Concurrent inference (batched re-planning)
