An implementation of Physical Intelligence's π₀ policy served from Modal cloud GPUs, with transport overhead measured against a local H100 baseline.
Inspired by Modal's blog post on Physical Intelligence's production setup, this project replicates and extends that approach: the same model, multiple transport options benchmarked, and closed-loop success-rate evaluation on the gym-aloha MuJoCo simulator.
┌──────────────────────────────────────────────────────────────────────────────────────────┐
│ │
│ Robot / Client Modal Cloud │
│ │
│ ┌──────────┐ HTTP POST /infer ┌─────────────────────┐ │
│ │ │ ────────────────────►│ HTTPS Proxy │ ┌──────────────────────┐ │
│ │ │ 493 ms · 2.0 Hz │ (TLS + HTTP parse) │─►│ FastAPI /infer /ws │ │
│ │ │ └─────────────────────┘ └──────────┬───────────┘ │
│ │ │ WebSocket /ws ┌─────────────────────┐ │ │
│ │ π₀ │ ────────────────────►│ HTTPS Proxy │────────────►│ policy │
│ │ client │ 374 ms · 2.7 Hz │ (persistent conn) │ │ .infer() │
│ │ │ └─────────────────────┘ │ │
│ │ │ Raw TCP :8765 ┌─────────────────────┐ │ ~26 ms GPU │
│ │ │ ────────────────────►│ TCP Relay │────────────►│ (JAX/XLA) │
│ │ │ 294 ms · 3.4 Hz │ modal.forward() │ │ │
│ │ │ └─────────────────────┘ ┌──────────┴───────────┐ │
│ │ │ QUIC-over-TCP :8766 ┌─────────────────────┐ │ H100 ~26 ms │ │
│ │ │ ────────────────────►│ TCP Relay │─►│ (shared by all │ │
│ │ │ TBD ms · TBD Hz │ modal.forward() │ │ transports) │ │
│ │ │ (aioquic TLS 1.3) └─────────────────────┘ └──────────────────────┘ │
│ │ │ │
│ │ local │ direct (no relay) │
│ │ H100 │ ──────────────────────────────────────────────────────────────────────► │
│ │ │ 46 ms · 22 Hz │
│ └──────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────────────────┘
Four transports. The H100 takes ~26 ms for every inference call regardless of transport. All overhead is network.
Both low-latency portals use modal.forward(port, unencrypted=True) to get a raw TCP relay that bypasses Modal's HTTPS routing layer. The difference is what runs on top of it:
# server_pi0.py — raw TCP portal (port 8765)
with modal.forward(8765, unencrypted=True) as tunnel:
    portal["tcp_endpoint"] = f"{tunnel.tcp_socket[0]}:{tunnel.tcp_socket[1]}"
    await asyncio.start_server(handle_tcp, "0.0.0.0", 8765)
# server_pi0.py — QUIC portal (port 8766, aioquic TLS 1.3 on top of the same relay)
with modal.forward(8766, unencrypted=True) as tunnel:
    portal["quic_endpoint"] = f"{tunnel.tcp_socket[0]}:{tunnel.tcp_socket[1]}"
    await asyncio.start_server(handle_quic, "0.0.0.0", 8766)

The QUIC handler (_handle_quic_client) parses the DCID from the client's Initial packet, creates a QuicConnection, and drives the QUIC state machine manually, feeding each TCP frame in as a QUIC datagram and draining outgoing datagrams back into the TCP stream. Everything inside the QUIC connection (TLS 1.3, stream ordering, flow control) runs as normal; only the "UDP socket" layer is replaced by the TCP relay.
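Because TCP is a byte stream while QUIC expects discrete datagrams, the relay needs some delimiting scheme to recover datagram boundaries. The exact framing in server_pi0.py is not shown here; a minimal sketch assuming a hypothetical 2-byte big-endian length prefix per datagram:

```python
import struct

# Hypothetical framing for carrying QUIC datagrams over the TCP relay:
# each datagram gets a 2-byte big-endian length prefix. The actual
# framing in server_pi0.py may differ.

def pack_datagram(datagram: bytes) -> bytes:
    """Prefix one QUIC datagram with its length so it survives TCP coalescing."""
    return struct.pack(">H", len(datagram)) + datagram

def unpack_datagrams(buf: bytearray) -> list[bytes]:
    """Drain every complete length-prefixed datagram from a TCP receive buffer."""
    out = []
    while len(buf) >= 2:
        (n,) = struct.unpack_from(">H", buf)
        if len(buf) < 2 + n:
            break  # incomplete frame: wait for more TCP bytes
        out.append(bytes(buf[2:2 + n]))
        del buf[:2 + n]
    return out

# Two datagrams coalesced into one TCP segment come back out intact.
buf = bytearray(pack_datagram(b"initial") + pack_datagram(b"handshake"))
frames = unpack_datagrams(buf)
```

On the server side, each recovered frame would be handed to `QuicConnection.receive_datagram()`; outgoing datagrams are packed the same way before being written back to the TCP stream.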
# Client — QUIC over TCP (bench.py / eval_sr.py)
quic_cfg = QuicConfiguration(is_client=True, server_name="pi0-quic")
quic_cfg.verify_mode = ssl.CERT_NONE  # self-signed cert; private relay benchmark
conn = QuicConnection(configuration=quic_cfg)
conn.connect((host, port), now=time.monotonic())
# ... TCP frames pumped as QUIC datagrams via input_q / quic_loop ...

All transports use the same msgpack-encoded payload (protocol_pi0.py):
Request: {"cam_high": uint8[3,224,224], "state": float32[14]}
Response: {"action": float32[50,14], "inference_ms": float}
action is a 50-step chunk (ALOHA 14-DOF). The client consumes action_horizon=10 steps per policy call before re-querying.
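The consume-then-requery pattern can be sketched as follows. `fake_query_policy` and `run_steps` are hypothetical stand-ins for illustration; the real client issues the query over one of the transports above:

```python
from collections import deque

ACTION_HORIZON = 10  # steps executed per policy call
CHUNK_LEN = 50       # π₀ returns a 50-step action chunk

def fake_query_policy(obs):
    """Stand-in for a transport round trip; returns a full 50-step chunk."""
    return [[0.0] * 14 for _ in range(CHUNK_LEN)]

def run_steps(n_steps, query_policy=fake_query_policy):
    """Pop one action per env step; re-query once the 10-step window is spent."""
    queue = deque()
    n_queries = 0
    for _ in range(n_steps):
        if not queue:
            chunk = query_policy(obs=None)
            queue.extend(chunk[:ACTION_HORIZON])  # keep 10, drop the stale tail
            n_queries += 1
        action = queue.popleft()
        # env.step(action) would go here
    return n_queries
```

A 400-step episode therefore costs 40 round trips, one per 10 env steps, which is why per-call RTT is amortized so heavily in the closed-loop results below.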
gym-aloha env
└── pixels["top"] (H, W, 3) uint8
└── resize_with_pad(224, 224)
└── transpose (2,0,1) → cam_high (3, 224, 224) uint8
└── agent_pos (14,) float32 → state
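The geometry of the padded resize can be illustrated with a small helper (hypothetical; the real resize_with_pad lives in the openpi preprocessing code). Assuming gym-aloha's default 480×640 render, the frame scales down to 168×224 and picks up 56 rows of padding:

```python
def padded_resize_dims(h, w, target_h=224, target_w=224):
    """Scale (h, w) to fit inside the target while preserving aspect ratio,
    then report how much padding each axis needs to reach the target."""
    scale = min(target_h / h, target_w / w)
    new_h, new_w = round(h * scale), round(w * scale)
    return (new_h, new_w), (target_h - new_h, target_w - new_w)

# gym-aloha's default top camera render is 480x640.
dims, pad = padded_resize_dims(480, 640)  # ((168, 224), (56, 0))
```

Padding instead of plain resizing keeps the aspect ratio the model was trained with; the distortion-free image is what matters, not the exact pad placement.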
Pure inference latency with no environment stepping. Observations are real joint states from lerobot/aloha_sim_transfer_cube_human with random camera images (camera content does not affect model speed).
| Transport | RTT p50 | GPU p50 | Net overhead | p95 | Hz |
|---|---|---|---|---|---|
| Local H100 | 45.6 ms | 26.6 ms | 0 ms | 46.1 ms | 22 |
| Modal H100 — TCP portal | 294.5 ms | ~26 ms | 268 ms | 299 ms | 3.4 |
| Modal H100 — WebSocket | 374.1 ms | ~26 ms | 348 ms | 392 ms | 2.7 |
| Modal H100 — HTTP | 493.3 ms | ~26 ms | 467 ms | 533 ms | 2.0 |
Key insight: GPU time is identical across all transports. Every millisecond of overhead is transport, not compute. TCP portal saves 80 ms over WebSocket and 199 ms over HTTP by eliminating Modal's HTTPS proxy.
Full benchmark output:
==========================================
FULL BENCHMARK — n=30 warmup=5 (2026-04-12)
==========================================
LOCAL H100:
median wall-clock: 45.6 ms GPU: 26.6 ms Hz: 21.9
Modal H100 — TCP portal:
median RTT: 294.5 ms GPU: 51.2 ms net overhead: 243 ms Hz: 3.4
p95: 299.0 ms min: 292.1 ms max: 314.6 ms
Modal H100 — WebSocket:
median RTT: 374.1 ms GPU: 69.4 ms net overhead: 305 ms Hz: 2.7
p95: 392.0 ms min: 367.3 ms max: 399.0 ms
Modal H100 — HTTP:
median RTT: 493.3 ms GPU: 66.4 ms net overhead: 427 ms Hz: 2.0
p95: 533.3 ms min: 473.2 ms max: 555.5 ms
Closed-loop control: π₀ replans every 10 steps (action_horizon=10), 400 steps max per episode, AlohaTransferCube-v0.
| Mode | SR | Infer p50 | Episodes | Notes |
|---|---|---|---|---|
| Local H100 | 80% (4/5) | 48 ms | 5 | Smooth 22 Hz re-planning |
| Modal H100 — TCP | 100% (5/5) | 292 ms | 5 | Best cloud — bypasses HTTPS proxy |
| Modal H100 — WS | 100% (5/5) | 380 ms | 5 | Passes despite 380 ms latency |
| Modal A10G — WS | 80% (8/10) | 533 ms | 10 | A10G GPU 7× slower than H100 on JAX |
Key insight: The task tolerates 300-400 ms latency with 10-step action chunking. The model runs open-loop for 10 steps between re-queries, which smooths over network jitter. H100 is non-negotiable — A10G's 7× slower JAX execution brings latency above the tolerance threshold.
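One way to see the tolerance: amortize each round trip over the 10 open-loop steps it buys. Illustrative arithmetic only, using the p50 numbers from the tables above; env step time is ignored since it is identical across transports:

```python
ACTION_HORIZON = 10  # open-loop steps bought by each round trip

def amortized_ms(rtt_ms, horizon=ACTION_HORIZON):
    """Per-env-step share of one inference round trip."""
    return rtt_ms / horizon

local_per_step = amortized_ms(45.6)   # ~4.6 ms of inference cost per step
tcp_per_step = amortized_ms(294.5)    # ~29 ms per step
ws_per_step = amortized_ms(374.1)     # ~37 ms per step
```

Seen per step, even the WebSocket path costs under 40 ms of inference latency, which is why the cube-transfer task still succeeds; the A10G's longer GPU time pushes the per-step share past what the task tolerates.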
Transport overhead (p50, us-east-2 → us-west-2):
HTTP: 467 ms ← TLS handshake per request + HTTP parse + Modal HTTPS proxy
WS: 348 ms ← Modal HTTPS proxy (persistent connection, no per-request handshake)
TCP: 268 ms ← Modal raw TCP relay (no HTTPS layer, no proxy)
QUIC: TBD ← QUIC-over-TCP relay (aioquic TLS 1.3 on same TCP relay as TCP portal)
Pi: ~10 ms ← QUIC direct, client co-located in same AWS region as Modal containers
Local: 0 ms ← no network
Physical Intelligence's production system achieves 10-15 ms by running the robot client in the same AWS region (e.g. us-west-2) as their Modal containers, so the relay hop becomes sub-millisecond. Our 268 ms is cross-region: the client is in Asia, while the Modal relay (r4XX.modal.host) is in us-west-2.
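To separate pure network RTT from GPU time when comparing regions, a minimal stdlib probe can time a 1-byte echo over a persistent TCP connection. This is a hypothetical helper, not part of bench.py, and it assumes the far end echoes bytes back (point it at a plain echo endpoint, not the msgpack /infer server):

```python
import socket
import time

def tcp_echo_rtt_ms(host: str, port: int, n: int = 5) -> float:
    """Median round trip for a 1-byte echo over one persistent TCP connection."""
    samples = []
    with socket.create_connection((host, port), timeout=5) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # send immediately
        for _ in range(n):
            t0 = time.perf_counter()
            s.sendall(b"x")
            s.recv(1)
            samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]
```

Comparing this raw-socket RTT against the benchmark's per-request numbers shows how much of the overhead is distance versus protocol.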
| File | Purpose |
|---|---|
| server_pi0.py | Modal H100 inference server — HTTP, WebSocket, raw TCP portal, QUIC portal |
| protocol_pi0.py | msgpack wire protocol (shared between server and clients) |
| bench.py | Latency benchmark — all transports, configurable n and warmup |
| eval_sr.py | Closed-loop success-rate eval — local and cloud, all transports |
| record_eval.py | Records MP4 episodes + generates latency charts |
| charts/latency_breakdown.png | 5-panel chart: waterfall / stacked bar / timeline / CDF / box |
uv run modal serve robodal/server_pi0.py # dev (hot-reload)
uv run modal deploy robodal/server_pi0.py  # production

Watch logs for all three portals:
Ready ✓
TCP portal published: r4XX.modal.host:XXXXX
QUIC portal published: r4XX.modal.host:YYYYY
The first cold start takes 60-90 s for JAX JIT compilation. min_containers=1 keeps one container warm after the first deploy.
URL="https://<your-modal-url>"
uv run robodal/bench.py --mode local --n 30 --warmup 5
uv run robodal/bench.py --mode cloud --url "$URL" --transport quic --n 30 --warmup 5
uv run robodal/bench.py --mode cloud --url "$URL" --transport tcp --n 30 --warmup 5
uv run robodal/bench.py --mode cloud --url "$URL" --transport ws --n 30 --warmup 5
uv run robodal/bench.py --mode cloud --url "$URL" --transport http --n 30 --warmup 5

MUJOCO_GL=egl uv run robodal/eval_sr.py --mode local --episodes 10
MUJOCO_GL=egl uv run robodal/eval_sr.py --mode cloud --transport quic --episodes 10
MUJOCO_GL=egl uv run robodal/eval_sr.py --mode cloud --transport tcp --episodes 10
MUJOCO_GL=egl uv run robodal/eval_sr.py --mode cloud --url "$URL" --transport ws --episodes 10
MUJOCO_GL=egl uv run robodal/eval_sr.py --mode compare --url "$URL" --episodes 10

# Local + cloud WS side-by-side:
MUJOCO_GL=egl uv run robodal/record_eval.py --url "$URL" --gpu-label H100 --episodes 5
# Cloud TCP portal (reads Modal Dict — no URL needed):
MUJOCO_GL=egl uv run robodal/record_eval.py --transport tcp --skip-local --gpu-label H100 --episodes 5

Outputs:
- videos/local_h100.mp4 — 48 ms inference, smooth robot
- videos/cloud_h100_ws.mp4 — 380 ms, visible freeze every 10 steps
- videos/cloud_h100_tcp.mp4 — 292 ms, smaller freeze (bypasses HTTPS proxy)
- videos/comparison_*.mp4 — side-by-side at real wall-clock time
- charts/latency_breakdown.png — waterfall / stacked bar / CDF / box
| Setting | Value |
|---|---|
| Model | π₀ (Physical Intelligence) |
| Checkpoint | gs://openpi-assets/checkpoints/pi0_aloha_sim |
| Task | gym_aloha/AlohaTransferCube-v0 — transfer cube |
| Prompt | "Transfer cube" |
| Obs | cam_high (3×224×224 uint8 CHW) + state (14D float32) |
| Output | action chunk (50 × 14 float32) — horizon=50, aloha_dim=14 |
| Action horizon | 10 steps per policy call |
| Cloud GPU | H100 (Modal) |
| Benchmark date | 2026-04-12 |
- Co-location: run the client in the same AWS region as the Modal containers, targeting the ~10 ms production figure
- Test on bi-manual tasks beyond transfer-cube: wiping, insertion, folding (more dex-heavy tasks)
- Real hardware
- Validate that 292 ms (TCP) is within the task's real-world tolerance (expect harder than sim)
- Multi-camera observations
- Streaming action chunks
- Concurrent inference (batched re-planning)
