Add CI and PyPI publish workflows; rewrite README

Fiar Labs · Fiar Labs · commit 9155768fbc66 · 2026-03-22T12:33:55.000-07:00
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,33 @@
+# MCCL only builds on macOS Apple Silicon. Hosted macos-14+ runners are arm64 (M-series).
+name: CI
+
+on:
+  push:
+    branches: [main, master]
+  pull_request:
+    branches: [main, master]
+
+concurrency:
+  group: ci-${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  build-test:
+    runs-on: macos-15
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+          cache: pip
+          cache-dependency-path: pyproject.toml
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Smoke tests (build + local 2-process DDP)
+        run: pytest -v tests/test_build.py tests/test_ddp.py
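
The smoke step above runs `tests/test_build.py`, and the README in this commit asks maintainers to bump the version in `pyproject.toml`, `setup.py`, and that test's assertion together. A minimal sketch of a consistency check that would catch a missed bump is below. It assumes both files declare the version as a plain string literal (`version = "..."`); the helper names `read_version` and `versions_agree` are hypothetical and not part of this repo.

```python
import re
from pathlib import Path

# Matches `version = "0.1.0"` (pyproject.toml style) or `version="0.1.0",` (setup.py style).
# Assumption: the version is a literal string, not computed or read from another file.
_VERSION_RE = re.compile(r"""^\s*version\s*=\s*["']([^"']+)["']""", re.MULTILINE)


def read_version(path: Path) -> str:
    """Return the first literal version string declared in `path`."""
    match = _VERSION_RE.search(path.read_text(encoding="utf-8"))
    if match is None:
        raise ValueError(f"no literal version string found in {path}")
    return match.group(1)


def versions_agree(*paths: Path) -> bool:
    """True when every given file declares the same version string."""
    return len({read_version(p) for p in paths}) == 1
```

A test could then assert `versions_agree(Path("pyproject.toml"), Path("setup.py"))` instead of hard-coding the number a third time; whether that fits the existing assertion in `tests/test_build.py` depends on how that test is written.
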
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -0,0 +1,37 @@
+# Build sdist + arm64 wheel on Apple Silicon, upload to PyPI.
+# Configure either:
+#   - PyPI "Trusted Publishing" for this repo + workflow (OIDC, no token), or
+#   - Repository secret PYPI_API_TOKEN (classic API token), passed through the `with:` block below (the current default).
+name: Publish to PyPI
+
+on:
+  release:
+    types: [published]
+  workflow_dispatch:
+
+permissions:
+  contents: read
+  id-token: write
+
+jobs:
+  publish:
+    runs-on: macos-15
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install build
+        run: pip install --upgrade pip build
+
+      - name: Build sdist and wheel
+        run: python -m build
+
+      # Uses repo secret PYPI_API_TOKEN (classic API token). For OIDC instead, remove `with:` and
+      # configure https://pypi.org/manage/project/settings/publishing/ for this workflow.
+      - name: Publish to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          password: ${{ secrets.PYPI_API_TOKEN }}
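
The workflow above is meant to produce one sdist and one arm64 wheel before uploading. A small pre-upload sanity check could look like the sketch below. It relies only on the standard wheel filename layout, where the last dash-separated field is the platform tag; the function names and the example filenames are hypothetical, and nothing like this exists in the workflow as committed.

```python
def classify_artifacts(filenames):
    """Split build outputs into sdists and wheels, keeping each wheel's platform tag."""
    sdists, wheels = [], {}
    for name in filenames:
        if name.endswith(".tar.gz"):
            sdists.append(name)
        elif name.endswith(".whl"):
            # Wheel names end in ...-{python tag}-{abi tag}-{platform tag}.whl
            wheels[name] = name[: -len(".whl")].split("-")[-1]
    return sdists, wheels


def check_dist(filenames):
    """Return a list of problems; an empty list means the artifacts look as expected."""
    sdists, wheels = classify_artifacts(filenames)
    problems = []
    if len(sdists) != 1:
        problems.append(f"expected exactly one sdist, found {len(sdists)}")
    if not wheels:
        problems.append("no wheel found")
    for name, platform_tag in wheels.items():
        if not platform_tag.endswith("arm64"):
            problems.append(f"{name}: platform tag {platform_tag!r} is not arm64")
    return problems
```

Run between the build and publish steps, this could be called as `check_dist(p.name for p in pathlib.Path("dist").iterdir())`, failing the job if the returned list is non-empty. A `universal2` wheel would be reported as a problem here, which matches the "arm64 wheel" intent stated in the workflow header but may be stricter than needed.
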
diff --git a/README.md b/README.md
@@ -1,35 +1,31 @@
 # MCCL
 
-`torch.distributed` backend for **DDP + collectives on MPS** (Apple Silicon). **TCP** by default; **RDMA** if the OS exposes it (`MCCL_TRANSPORT`).
+[![CI](https://github.com/OWNER/REPO/actions/workflows/ci.yml/badge.svg)](https://github.com/OWNER/REPO/actions/workflows/ci.yml)
 
-## Install
+`torch.distributed` backend for DDP and collectives on **MPS** (Apple Silicon). TCP by default; RDMA only if the machine/OS actually supports it.
+
+macOS **arm64** only. CI runs `pytest tests/test_build.py tests/test_ddp.py` on a `macos-15` runner (two processes on one machine, not two separate Macs).
+
+Replace the **`OWNER/REPO`** placeholder in the badge link and in `pyproject.toml` → `[project.urls]` once the repo exists.
+
+## Requirements
 
-- Apple Silicon Mac, **Python 3.11+**, **PyTorch first** (nightlies / 2.10.0 tested), **Xcode CLT** (`xcode-select --install`). Optional: full Xcode for Metal precompile.
+- Apple Silicon Mac (arm64). No Intel.
+- **Xcode Command Line Tools** — `xcode-select --install` (needed to compile the extension).
+- **Full Xcode** — optional; lets the build precompile the Metal shaders into a `.metallib` instead of compiling them at runtime.
+- **Python 3.11+**
+- **PyTorch 2.5+** installed *before* you build or `pip install` this package.
+
+## Install
 
 ```bash
 pip install torch
-pip install -e .
+pip install mccl
 ```
 
-```python
-import torch.distributed as dist
-import mccl
-dist.init_process_group(backend="mccl", rank=rank, world_size=world_size)
-```
+From a source checkout: `pip install -e ".[dev]"`. If the PyPI name `mccl` is taken, rename the package in `pyproject.toml` and `setup.py`.
 
-Demo: https://github.com/user-attachments/assets/21865149-b077-4b65-93cc-f9e319ff0328  
-Ethernet/Wi‑Fi OK; slower than a wired link.
-
-## Docs
-
-| | |
-|--|--|
-| Two Macs / TB | [docs/THUNDERBOLT_SETUP.md](docs/THUNDERBOLT_SETUP.md) |
-| Firewall, ports, buckets | [docs/MULTINODE.md](docs/MULTINODE.md) |
-| Hacking | [docs/DEVELOPING.md](docs/DEVELOPING.md) |
-| Tests | [TESTING.md](TESTING.md) |
-| Versions | [COMPATIBILITY.md](COMPATIBILITY.md) |
-| Timings (please add yours) | [RESULTS.md](RESULTS.md) |
+Demo: https://github.com/user-attachments/assets/21865149-b077-4b65-93cc-f9e319ff0328
 
 ## Examples
 
@@ -39,115 +35,116 @@ torchrun --nproc_per_node=2 --nnodes=1 --master_addr=127.0.0.1 --master_port=295
   examples/ddp_dummy_train.py
 ```
 
+Defaults in `examples/ddp_dummy_train.py`: DDP uses `BATCH_SIZE=128` per rank (global 256 with 2 ranks); `--baseline` uses a global batch of 256 unless you set `BASELINE_BATCH_SIZE` or `BATCH_SIZE`. Lower the batch size if you run out of memory.
+
+Minimal DDP script (run it with the `torchrun` command below). Multi-node runs also need `MCCL_LISTEN_ADDR`, `MCCL_PORT_BASE`, etc.; see [docs/MULTINODE.md](docs/MULTINODE.md).
+
 ```python
-import mccl
+import os
+import torch
+import torch.nn as nn
+import torch.distributed as dist
 from torch.nn.parallel import DistributedDataParallel as DDP
-# mccl.init(...) optional — see Configuration
-dist.init_process_group(backend="mccl", rank=rank, world_size=world_size)
-model = DDP(MyModel().to("mps"))
+import mccl  # noqa: F401  (imported for its side effect: makes the "mccl" backend available)
+
+def main():
+    rank = int(os.environ["RANK"])
+    world_size = int(os.environ["WORLD_SIZE"])
+    device = torch.device("mps:0")
+
+    dist.init_process_group(backend="mccl", device_id=device)
+
+    torch.manual_seed(42)
+    model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
+    ddp_model = DDP(model)
+    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-3)
+    loss_fn = nn.CrossEntropyLoss()
+
+    for step in range(10):
+        x = torch.randn(8, 512, device=device)
+        y = torch.randint(0, 10, (8,), device=device)
+        optimizer.zero_grad(set_to_none=True)
+        loss_fn(ddp_model(x), y).backward()
+        optimizer.step()
+        if rank == 0:
+            print(step, "ok")
+
+    dist.destroy_process_group()
+
+if __name__ == "__main__":
+    main()
 ```
 
-`ddp_dummy_train.py` defaults: **DDP** `BATCH_SIZE=128` per rank (global **256** with 2 ranks); **`--baseline`** global **256** unless you set `BASELINE_BATCH_SIZE` / `BATCH_SIZE`. Lower if you OOM.
+```bash
+torchrun --nproc_per_node=2 --nnodes=1 --master_addr=127.0.0.1 --master_port=29500 your_train.py
+```
+
+## Docs
 
-## Throughput (example — **your mileage will vary**)
+| | |
+|--|--|
+| Two Macs / TB | [docs/THUNDERBOLT_SETUP.md](docs/THUNDERBOLT_SETUP.md) |
+| Firewall, ports, env tuning | [docs/MULTINODE.md](docs/MULTINODE.md) (`MCCL_*`, buckets, listen addr) |
+| Dev / layout | [docs/DEVELOPING.md](docs/DEVELOPING.md) |
+| Tests | [TESTING.md](TESTING.md) |
+| Versions | [COMPATIBILITY.md](COMPATIBILITY.md) |
+| Timings | [RESULTS.md](RESULTS.md) |
 
-Latest author run (`examples/ddp_dummy_train.py` + `--save-stats` + `examples/benchmark_throughput.py`):
+## Throughput
+
+One saved run on an **M4 Max** and an **M1 Max** MacBook Pro, TCP over Thunderbolt, global batch **256**, ~**96.5M** params. Your numbers will differ.
 
 ```
-single M1 Max (MPS):  78.3 samples/s   (global_batch=256, world=1)
+single M1 Max (MPS):   78.3 samples/s   (global_batch=256, world=1)
 DDP (MCCL):          134.2 samples/s   (global_batch=256, world=2)
-baseline / DDP:      0.58×  → DDP ~172% of baseline samples/s
-params:              96,510,024 (same both sides)
+baseline / DDP:      0.58×  (~172% DDP vs baseline)
 ```
 
-**Global batch 256:** baseline = `BASELINE_BATCH_SIZE=256` (or `BATCH_SIZE=256` on one process). DDP = **`BATCH_SIZE=128` per rank × 2 ranks** (or 256×1 on two nodes — same global).
-
-**Takeaway:** with this **~96M-param** model, **big batch + higher GPU/memory util**, **2-rank MCCL DDP beat one M1 Max** on **samples/s** in our JSON. With **small global batch** (we’ve seen **global 8**), **comm/sync dominated** and DDP looked **much slower**. **Hardware mix still matters** (slowest rank sets the step); **PR [RESULTS.md](RESULTS.md)** with your setup.
+With tiny global batches (a global batch of 8, for example), communication and sync overhead dominate and DDP can look much slower. With mixed hardware, the slowest rank paces every step.
 
 ```bash
-# Match global batch: e.g. DDP BATCH_SIZE=128 × world 2 → baseline BASELINE_BATCH_SIZE=256
 python examples/ddp_dummy_train.py --baseline --save-stats baseline_stats.json
 torchrun --nproc_per_node=2 --nnodes=1 --master_addr=127.0.0.1 --master_port=29500 \
   examples/ddp_dummy_train.py --save-stats ddp_stats.json
 python examples/benchmark_throughput.py --baseline baseline_stats.json --ddp ddp_stats.json -o bench
 ```
 
-`bash scripts/benchmark_matrix.sh` — other checks. Env: [examples/ddp_dummy_train.py](examples/ddp_dummy_train.py).
+`bash scripts/benchmark_matrix.sh` for more sweeps.
 
 ![bench](bench.png)  
 ![bars](bench_bars.png)
 
-## Collectives
+## PyPI (maintainers)
 
-`allreduce`, `broadcast`, `barrier`, `allgather`, `reduce_scatter`, `send`, `recv`
+**CI (tests only):** push to **`main`** or **`master`**, or open a PR targeting those branches. That runs [`.github/workflows/ci.yml`](.github/workflows/ci.yml) — it does **not** upload to PyPI.
 
-## Configuration
-
-Python `MCCL.init` / `MCCLConfig` or env; Python wins if set before `init_process_group`.
-
-| Env | Default | Notes |
-|-----|---------|--------|
-| `MCCL_TRANSPORT` | `auto` | `tcp`, `rdma`, `auto` |
-| `MCCL_LOG_LEVEL` | `WARN` | |
-| `MCCL_LISTEN_ADDR` | auto | |
-| `MCCL_LINK_PROFILE` | — | `thunderbolt` → bigger buffers |
-| `MCCL_PORT_BASE` | `29600` | **≠** `MASTER_PORT` |
-| `MCCL_IFNAME` | auto | |
-| `MCCL_CHUNK_BYTES` | `4194304` | |
-| `MCCL_SMALL_MSG_THRESHOLD` | `65536` | |
-| `MCCL_TRANSPORT_CRC` | `false` | |
-| `MCCL_FAST_MATH` | `true` | |
-| `MCCL_GPU_THRESHOLD` | `4096` | |
-| `MCCL_COMPRESSION` | `none` | `fp16`, `topk` |
-| `MCCL_TOPK_RATIO` | `0.01` | |
-| `MCCL_SHADER_PATH` | auto | |
+**Upload:** [`.github/workflows/publish.yml`](.github/workflows/publish.yml) runs on **GitHub Release (published)** or **Actions → Publish to PyPI → Run workflow** (`workflow_dispatch`).
 
-## Diagnostics
+1. GitHub repo → **Settings → Secrets and variables → Actions** → New repository secret **`PYPI_API_TOKEN`** (PyPI → Account settings → API tokens).
+2. Bump **`version`** in `pyproject.toml`, `setup.py`, and the assertion in `tests/test_build.py`.
+3. Either: **Releases → Draft a new release** → publish (triggers upload), or **Actions** tab → **Publish to PyPI** → **Run workflow** → branch `main`.
 
-```python
-mccl.get_metrics(); mccl.log_metrics(); mccl.reset_metrics()
-```
-
-`MCCL_LOG_LEVEL=INFO` — full config on startup. **Hung multi-node:** [docs/MULTINODE.md](docs/MULTINODE.md).
-
-## Thunderbolt, TCP, and what was tested
-
-- **Throughput charts in this repo:** **TCP** over a **Thunderbolt bridge IP** (TB3-class link here). **Not RDMA.** Plots match the **high global-batch** run above unless you regenerate.
-- **Ethernet / Wi‑Fi:** same code paths; expect worse RTT and throughput than a direct TB cable between hosts.
-- **`MCCL_LINK_PROFILE=thunderbolt`:** optional buffer/chunk tuning when peers are on a TB link — see [docs/MULTINODE.md](docs/MULTINODE.md).
-- **Cabling / `169.254.x.x` / firewall:** [docs/THUNDERBOLT_SETUP.md](docs/THUNDERBOLT_SETUP.md).
+First-time PyPI: create the **`mccl`** project on pypi.org (or change the package `name` everywhere if the name is taken).
 
-## RDMA (Thunderbolt 5)
-
-RDMA is **optional** and **not** what produced the benchmark plots above. It needs **TB5 hardware** (e.g. **M4 Pro / Max / Ultra** with TB5 ports), a **direct TB5 cable** (no hub), and a **macOS build that ships `librdma.dylib`** (Apple’s docs call out newer macOS versions). We have **not** done serious end-to-end training benchmarks on RDMA here — treat latency/BW claims you see elsewhere as **hypotheses until you measure**.
+## Collectives
 
-**Enable once (Recovery OS):** `rdma_ctl enable` → reboot.
+`allreduce`, `broadcast`, `barrier`, `allgather`, `reduce_scatter`, `send`, `recv`
 
-**Sanity check:**
+## Diagnostics
 
-```bash
-python -c "import ctypes; ctypes.cdll.LoadLibrary('librdma.dylib'); print('RDMA lib OK')"
+```python
+mccl.get_metrics(); mccl.log_metrics(); mccl.reset_metrics()
 ```
 
-**Use in MCCL:** `MCCL_TRANSPORT=rdma` or `mccl.init(transport="rdma")`.
+Verbose startup: `MCCL_LOG_LEVEL=INFO`. Stuck multi-node: [docs/MULTINODE.md](docs/MULTINODE.md).
 
-| Mode | Behavior |
-|------|----------|
-| `auto` (default) | Try RDMA, fall back to TCP |
-| `rdma` | RDMA preferred; TCP if init fails |
-| `tcp` | Skip RDMA; TCP only |
+## Transport
 
-**Gotchas:** if RDMA init fails, logs at `MCCL_LOG_LEVEL=INFO` say why. Apple limits **~100 memory regions** per device — very large jobs can hit MR registration errors (reduce ranks or buffer churn). **PR [RESULTS.md](RESULTS.md)** if you have real TB5+RDMA training numbers.
+The benchmark plots used TCP over a Thunderbolt bridge, not RDMA. Wi‑Fi and Ethernet work but are slower. For Thunderbolt cabling, see [docs/THUNDERBOLT_SETUP.md](docs/THUNDERBOLT_SETUP.md). An RDMA path exists for Thunderbolt 5 hardware on macOS builds that ship `librdma.dylib`; enable it once with `rdma_ctl enable` from Recovery OS, then reboot. The graphs above did not use RDMA.
 
 ## Internals
 
-**f32 reductions on CPU-visible MPS memory:** element-wise adds, scales, min/max, product go through **Apple Accelerate / vDSP** (`vDSP_vadd`, `vDSP_vsmul`, …) in `csrc/metal/AccelerateOps.mm`, with a **parallel chunk loop** so big tensors use all performance cores. On Apple Silicon, those **vector library paths are AMX-backed where Accelerate routes them** — we’re not hand-writing AMX kernels; we lean on **vDSP + UMA** (CPU pointer is valid for shared MPS buffers, so reductions can run without extra copies in the common f32 case).
-
-**fp16 / bf16:** Metal shaders when above `MCCL_GPU_THRESHOLD`; small tensors can widen → **f32 vDSP** → narrow.
-
-**Network:** TCP progress thread, overlap, 2-rank vs ring allreduce. **CRC:** ARM `crc32` hw when enabled.
-
-More layout / build: [docs/DEVELOPING.md](docs/DEVELOPING.md).
+f32 reductions run through Accelerate/vDSP on CPU-visible MPS memory (`csrc/metal/AccelerateOps.mm`) in parallel chunks; Accelerate may route this work through AMX. fp16/bf16 reductions use Metal shaders above `MCCL_GPU_THRESHOLD`; smaller tensors are widened to f32, reduced with vDSP, and narrowed back. Networking uses a TCP progress thread and picks a direct 2-rank or ring allreduce depending on world size. Details: [docs/DEVELOPING.md](docs/DEVELOPING.md).
 
 ## License
 
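
The Throughput section in the README hunk above quotes a global batch of 256, a baseline/DDP ratio of 0.58×, and DDP at roughly 172% of baseline. The arithmetic behind those figures can be sketched as follows, using only the rounded samples/s values printed in the README; the function names are hypothetical and this is not the logic of `examples/benchmark_throughput.py`.

```python
def global_batch(per_rank_batch: int, world_size: int) -> int:
    """Global batch size when every rank uses the same per-rank batch."""
    return per_rank_batch * world_size


def compare(baseline_sps: float, ddp_sps: float) -> dict:
    """Relate single-process and DDP throughput, both in samples per second."""
    return {
        "baseline_over_ddp": baseline_sps / ddp_sps,
        "ddp_percent_of_baseline": 100.0 * ddp_sps / baseline_sps,
    }


# Figures copied from the README: 78.3 samples/s baseline, 134.2 samples/s DDP.
stats = compare(78.3, 134.2)
print(f"global batch: {global_batch(128, 2)}")
print(f"baseline / DDP: {stats['baseline_over_ddp']:.2f}x")
print(f"DDP vs baseline: {stats['ddp_percent_of_baseline']:.0f}%")
```

From the rounded inputs this prints a global batch of 256, a ratio of 0.58x, and 171%. The README's ~172% is one point higher, which is consistent with it having been computed from unrounded stats rather than the one-decimal values shown.
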
diff --git a/pyproject.toml b/pyproject.toml
@@ -16,6 +16,20 @@ requires-python = ">=3.11"
 dependencies = [
     "torch>=2.5.0",
 ]
+classifiers = [
+    "Development Status :: 4 - Beta",
+    "Intended Audience :: Developers",
+    "License :: OSI Approved :: MIT License",
+    "Operating System :: MacOS :: MacOS X",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+]
+# Replace with your GitHub path after publishing (badge + PyPI sidebar).
+[project.urls]
+Repository = "https://github.com/OWNER/REPO"
+
 
 [project.optional-dependencies]
 dev = [