vovw/robodal

robodal — π₀ Modal Inference

An implementation for running Physical Intelligence's π₀ policy on Modal cloud GPUs, measuring transport overhead against a local H100 baseline.

Inspired by Modal's blog post on Physical Intelligence's production setup, this project replicates and extends that approach: same model, multiple transports benchmarked, and closed-loop success-rate evaluation on the gym-aloha MuJoCo simulator.

Demo


Architecture

┌──────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                                          │
│   Robot / Client                              Modal Cloud                                │
│                                                                                          │
│   ┌──────────┐  HTTP POST /infer    ┌─────────────────────┐                             │
│   │          │ ────────────────────►│  HTTPS Proxy        │  ┌──────────────────────┐   │
│   │          │  493 ms  ·  2.0 Hz   │  (TLS + HTTP parse) │─►│  FastAPI  /infer /ws │   │
│   │          │                      └─────────────────────┘  └──────────┬───────────┘   │
│   │          │  WebSocket /ws        ┌─────────────────────┐             │               │
│   │  π₀      │ ────────────────────►│  HTTPS Proxy        │────────────►│  policy       │
│   │  client  │  374 ms  ·  2.7 Hz   │  (persistent conn)  │             │  .infer()     │
│   │          │                      └─────────────────────┘             │               │
│   │          │  Raw TCP  :8765       ┌─────────────────────┐             │  ~26 ms GPU   │
│   │          │ ────────────────────►│  TCP Relay          │────────────►│  (JAX/XLA)    │
│   │          │  294 ms  ·  3.4 Hz   │  modal.forward()    │             │               │
│   │          │                      └─────────────────────┘  ┌──────────┴───────────┐   │
│   │          │  QUIC-over-TCP  :8766 ┌─────────────────────┐  │    H100  ~26 ms      │   │
│   │          │ ────────────────────►│  TCP Relay          │─►│    (shared by all    │   │
│   │          │  TBD ms  ·  TBD Hz   │  modal.forward()    │  │     transports)      │   │
│   │          │  (aioquic TLS 1.3)   └─────────────────────┘  └──────────────────────┘   │
│   │          │                                                                           │
│   │  local   │  direct (no relay)                                                        │
│   │  H100    │ ──────────────────────────────────────────────────────────────────────►   │
│   │          │   46 ms  ·  22 Hz                                                         │
│   └──────────┘                                                                          │
│                                                                                          │
└──────────────────────────────────────────────────────────────────────────────────────────┘

Four transports. The H100 takes ~26 ms for every inference call regardless of transport. All overhead is network.

How the TCP and QUIC portals work

Both low-latency portals use modal.forward(port, unencrypted=True) to get a raw TCP relay that bypasses Modal's HTTPS routing layer. The difference is what runs on top of it:

# server_pi0.py — raw TCP portal (port 8765)
with modal.forward(8765, unencrypted=True) as tunnel:
    portal["tcp_endpoint"] = f"{tunnel.tcp_socket[0]}:{tunnel.tcp_socket[1]}"
    await asyncio.start_server(handle_tcp, "0.0.0.0", 8765)

# server_pi0.py — QUIC portal (port 8766, aioquic TLS 1.3 on top of the same relay)
with modal.forward(8766, unencrypted=True) as tunnel:
    portal["quic_endpoint"] = f"{tunnel.tcp_socket[0]}:{tunnel.tcp_socket[1]}"
    await asyncio.start_server(handle_quic, "0.0.0.0", 8766)

The QUIC handler (_handle_quic_client) parses the DCID from the client's Initial packet, creates a QuicConnection, and drives the QUIC state machine manually, feeding each TCP frame as a QUIC datagram and draining outgoing datagrams back into the TCP stream. Everything inside the QUIC connection (TLS 1.3, stream ordering, flow control) runs as normal; only the "UDP socket" layer is replaced by the TCP relay.
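The DCID extraction the handler starts with needs only a few bytes of header parsing. A minimal sketch of that step (a hypothetical standalone version, not the actual `_handle_quic_client` code; layout per RFC 9000 §17.2):

```python
def parse_dcid(packet: bytes) -> bytes:
    """Extract the Destination Connection ID from a QUIC long-header packet.

    Long-header layout (RFC 9000 §17.2):
      byte 0      flags (high bit set = long header)
      bytes 1-4   version
      byte 5      DCID length
      bytes 6..   DCID
    """
    if not packet or not (packet[0] & 0x80):
        raise ValueError("not a long-header packet")
    dcid_len = packet[5]
    return packet[6 : 6 + dcid_len]

# Synthetic Initial packet: long-header flags, version 1, 8-byte DCID
initial = bytes([0xC3]) + b"\x00\x00\x00\x01" + bytes([8]) + b"ABCDEFGH"
dcid = parse_dcid(initial)  # → b"ABCDEFGH"
```

The DCID is what lets the server route subsequent frames from the same client to the right `QuicConnection` instance.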

# Client — QUIC over TCP (bench.py / eval_sr.py)
quic_cfg = QuicConfiguration(is_client=True, server_name="pi0-quic")
quic_cfg.verify_peer = False   # self-signed cert; private relay benchmark
conn = QuicConnection(configuration=quic_cfg)
conn.connect((host, port), now=time.monotonic())
# ... TCP frames pumped as QUIC datagrams via input_q / quic_loop ...

Wire protocol

All transports use the same msgpack-encoded payload (protocol_pi0.py):

Request:  {"cam_high": uint8[3,224,224], "state": float32[14]}
Response: {"action": float32[50,14], "inference_ms": float}

action is a 50-step chunk (ALOHA 14-DOF). The client consumes action_horizon=10 steps per policy call before re-querying.
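The chunk-consumption pattern can be sketched as follows. `fake_infer` is a stand-in for any of the transports; only the horizon arithmetic is taken from the source:

```python
import numpy as np

ACTION_HORIZON = 10  # steps executed per chunk before re-querying the policy

def fake_infer(obs):
    """Stand-in for a transport call; returns a (50, 14) action chunk."""
    return {"action": np.zeros((50, 14), dtype=np.float32), "inference_ms": 26.0}

def control_loop(env_steps: int) -> int:
    """Run a closed-loop episode; return how many policy calls were made."""
    executed = 0
    queries = 0
    while executed < env_steps:
        chunk = fake_infer(obs=None)["action"]
        queries += 1
        # Execute only the first ACTION_HORIZON steps of the 50-step chunk,
        # then re-query with a fresh observation.
        for action in chunk[:ACTION_HORIZON]:
            # env.step(action) would go here
            executed += 1
            if executed >= env_steps:
                break
    return queries

# A 400-step episode at horizon 10 → 40 policy calls
```

This is why latency tolerance scales with the horizon: the robot runs open-loop on the remaining buffered steps while the next request is in flight.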

Observation pipeline

gym-aloha env
  └── pixels["top"]  (H, W, 3) uint8
        └── resize_with_pad(224, 224)
              └── transpose (2,0,1)  → cam_high (3, 224, 224) uint8
  └── agent_pos (14,) float32        → state
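The pipeline above can be approximated in plain numpy. This is a hypothetical nearest-neighbor reimplementation of `resize_with_pad` for illustration, not the project's actual function (which may interpolate differently):

```python
import numpy as np

def resize_with_pad(img: np.ndarray, h: int, w: int) -> np.ndarray:
    """Resize (H, W, C) to fit inside (h, w) preserving aspect ratio
    (nearest-neighbor), then zero-pad to exactly (h, w)."""
    in_h, in_w = img.shape[:2]
    scale = min(h / in_h, w / in_w)
    new_h, new_w = int(in_h * scale), int(in_w * scale)
    rows = (np.arange(new_h) / scale).astype(int)
    cols = (np.arange(new_w) / scale).astype(int)
    resized = img[rows][:, cols]
    out = np.zeros((h, w, img.shape[2]), dtype=img.dtype)
    top, left = (h - new_h) // 2, (w - new_w) // 2
    out[top : top + new_h, left : left + new_w] = resized
    return out

# gym-aloha's top camera is larger than 224x224; shape here is illustrative
pixels_top = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
cam_high = resize_with_pad(pixels_top, 224, 224).transpose(2, 0, 1)  # (3, 224, 224)
```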

Results

Experiment 1 — Latency benchmark (n=30, warmup=5)

Pure inference latency with no environment stepping. Observations are real joint states from lerobot/aloha_sim_transfer_cube_human with random camera images (camera content does not affect model speed).

| Transport | RTT p50 | GPU p50 | Net overhead | p95 | Hz |
|---|---|---|---|---|---|
| Local H100 | 45.6 ms | 26.6 ms | 0 ms | 46.1 ms | 22 |
| Modal H100 — TCP portal | 294.5 ms | ~26 ms | 268 ms | 299 ms | 3.4 |
| Modal H100 — WebSocket | 374.1 ms | ~26 ms | 348 ms | 392 ms | 2.7 |
| Modal H100 — HTTP | 493.3 ms | ~26 ms | 467 ms | 533 ms | 2.0 |

Key insight: GPU time is identical across all transports. Every millisecond of overhead is transport, not compute. TCP portal saves 80 ms over WebSocket and 199 ms over HTTP by eliminating Modal's HTTPS proxy.
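The overhead and Hz columns follow directly from the RTT and GPU numbers; a quick sanity check of the table arithmetic (cloud rows only, using the ~26 ms GPU figure):

```python
# (RTT p50 ms, GPU p50 ms) from the benchmark table above
rows = {
    "tcp":  (294.5, 26.0),
    "ws":   (374.1, 26.0),
    "http": (493.3, 26.0),
}

# Net overhead: everything the GPU didn't spend is transport
overhead = {k: rtt - gpu for k, (rtt, gpu) in rows.items()}

# Achievable control rate is bounded by round-trip time
hz = {k: round(1000.0 / rtt, 1) for k, (rtt, _) in rows.items()}
```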

Full benchmark output:

==========================================
 FULL BENCHMARK — n=30 warmup=5 (2026-04-12)
==========================================

LOCAL H100:
  median wall-clock: 45.6 ms   GPU: 26.6 ms   Hz: 21.9

Modal H100 — TCP portal:
  median RTT: 294.5 ms   GPU: 51.2 ms   net overhead: 243 ms   Hz: 3.4
  p95: 299.0 ms   min: 292.1 ms   max: 314.6 ms

Modal H100 — WebSocket:
  median RTT: 374.1 ms   GPU: 69.4 ms   net overhead: 305 ms   Hz: 2.7
  p95: 392.0 ms   min: 367.3 ms   max: 399.0 ms

Modal H100 — HTTP:
  median RTT: 493.3 ms   GPU: 66.4 ms   net overhead: 427 ms   Hz: 2.0
  p95: 533.3 ms   min: 473.2 ms   max: 555.5 ms

Experiment 2 — Success rate (gym-aloha MuJoCo sim, transfer-cube)

Closed-loop control: π₀ replans every 10 steps (action_horizon=10), 400 steps max per episode, AlohaTransferCube-v0.

| Mode | SR | Infer p50 | Episodes | Notes |
|---|---|---|---|---|
| Local H100 | 80% (4/5) | 48 ms | 5 | Smooth 22 Hz re-planning |
| Modal H100 — TCP | 100% (5/5) | 292 ms | 5 | Best cloud — bypasses HTTPS proxy |
| Modal H100 — WS | 100% (5/5) | 380 ms | 5 | Passes despite 380 ms latency |
| Modal A10G — WS | 80% (8/10) | 533 ms | 10 | A10G ~7× slower than H100 on JAX |

Key insight: The task tolerates 300-400 ms latency with 10-step action chunking. The model runs open-loop for 10 steps between re-queries, which smooths over network jitter. H100 is non-negotiable — A10G's 7× slower JAX execution brings latency above the tolerance threshold.

Experiment 3 — Transport overhead breakdown

Transport overhead (p50, us-east-2 → us-west-2):

  HTTP:   467 ms  ← TLS handshake per request + HTTP parse + Modal HTTPS proxy
  WS:     348 ms  ← Modal HTTPS proxy (persistent connection, no per-request handshake)
  TCP:    268 ms  ← Modal raw TCP relay (no HTTPS layer, no proxy)
  QUIC:   TBD     ← QUIC-over-TCP relay (aioquic TLS 1.3 on same TCP relay as TCP portal)
  Pi:     ~10 ms  ← QUIC direct, client co-located in same AWS region as Modal containers
  Local:    0 ms  ← no network

Physical Intelligence's production system achieves 10-15 ms by running the robot client in the same AWS region (e.g. us-west-2) as their Modal containers; the relay hop becomes sub-millisecond. Our 268 ms is cross-region: the client is in Asia, while the Modal relay (r4XX.modal.host) is in us-west-2.


Files

| File | Purpose |
|---|---|
| `server_pi0.py` | Modal H100 inference server — HTTP, WebSocket, raw TCP portal, QUIC portal |
| `protocol_pi0.py` | msgpack wire protocol (shared between server and clients) |
| `bench.py` | Latency benchmark — all transports, configurable n and warmup |
| `eval_sr.py` | Closed-loop success-rate eval — local and cloud, all transports |
| `record_eval.py` | Records MP4 episodes + generates latency charts |
| `charts/latency_breakdown.png` | 5-panel chart: waterfall / stacked bar / timeline / CDF / box |

Quick Start

1. Deploy π₀ to Modal (H100)

uv run modal serve robodal/server_pi0.py      # dev (hot-reload)
uv run modal deploy robodal/server_pi0.py     # production

Watch the logs for the published portals:

Ready ✓
TCP portal published: r4XX.modal.host:XXXXX
QUIC portal published: r4XX.modal.host:YYYYY

The first cold start takes 60-90 s for JAX JIT compilation. `min_containers=1` keeps one container warm after the first deploy.

2. Run the full latency benchmark

URL="https://<your-modal-url>"

uv run robodal/bench.py --mode local                                --n 30 --warmup 5
uv run robodal/bench.py --mode cloud --url "$URL" --transport quic  --n 30 --warmup 5
uv run robodal/bench.py --mode cloud --url "$URL" --transport tcp   --n 30 --warmup 5
uv run robodal/bench.py --mode cloud --url "$URL" --transport ws    --n 30 --warmup 5
uv run robodal/bench.py --mode cloud --url "$URL" --transport http  --n 30 --warmup 5

3. Measure success rate

MUJOCO_GL=egl uv run robodal/eval_sr.py --mode local --episodes 10
MUJOCO_GL=egl uv run robodal/eval_sr.py --mode cloud --transport quic --episodes 10
MUJOCO_GL=egl uv run robodal/eval_sr.py --mode cloud --transport tcp  --episodes 10
MUJOCO_GL=egl uv run robodal/eval_sr.py --mode cloud --url "$URL" --transport ws  --episodes 10
MUJOCO_GL=egl uv run robodal/eval_sr.py --mode compare --url "$URL" --episodes 10

4. Record videos and charts

# Local + cloud WS side-by-side:
MUJOCO_GL=egl uv run robodal/record_eval.py --url "$URL" --gpu-label H100 --episodes 5

# Cloud TCP portal (reads Modal Dict — no URL needed):
MUJOCO_GL=egl uv run robodal/record_eval.py --transport tcp --skip-local --gpu-label H100 --episodes 5

Outputs:

  • videos/local_h100.mp4 — 48 ms inference, smooth robot
  • videos/cloud_h100_ws.mp4 — 380 ms, visible freeze every 10 steps
  • videos/cloud_h100_tcp.mp4 — 292 ms, smaller freeze (bypasses HTTPS proxy)
  • videos/comparison_*.mp4 — side-by-side at real wall-clock time
  • charts/latency_breakdown.png — waterfall / stacked bar / CDF / box

Eval Setup

Model π₀ (Physical Intelligence)
Checkpoint gs://openpi-assets/checkpoints/pi0_aloha_sim
Task gym_aloha/AlohaTransferCube-v0 — transfer cube
Prompt "Transfer cube"
Obs cam_high (3×224×224 uint8 CHW) + state (14D float32)
Output action chunk (50 × 14 float32) — horizon=50, aloha_dim=14
Action horizon 10 steps per policy call
Cloud GPU H100 (Modal)
Benchmark date 2026-04-12

Future Directions

  • Co-location
  • Test on bi-manual tasks beyond transfer-cube: wiping, insertion, folding (more dex-heavy tasks)
  • Real hardware
  • Validate that 292 ms (TCP) is within the task's real-world tolerance (expect harder than sim)
  • Multi-camera observations
  • Streaming action chunks
  • Concurrent inference (batched re-planning)

❤️ Modal and Pi
