Commit 9155768

Fiar Labs authored and committed
add ci
1 parent e8df95d commit 9155768

4 files changed

Lines changed: 174 additions & 93 deletions

.github/workflows/ci.yml

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+# MCCL only builds on macOS Apple Silicon. Hosted macos-14+ runners are arm64 (M-series).
+name: CI
+
+on:
+  push:
+    branches: [main, master]
+  pull_request:
+    branches: [main, master]
+
+concurrency:
+  group: ci-${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  build-test:
+    runs-on: macos-15
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+          cache: pip
+          cache-dependency-path: pyproject.toml
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Smoke tests (build + local 2-process DDP)
+        run: pytest -v tests/test_build.py tests/test_ddp.py
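
The workflow comment says MCCL builds only on macOS Apple Silicon. As a rough illustration (not part of this commit), a local pre-flight check before `pip install -e ".[dev]"` might look like the sketch below. `is_apple_silicon` and `describe_host` are hypothetical helper names, and the check assumes a native arm64 Python; an x86_64 Python running under Rosetta would typically report `x86_64` and be rejected.

```python
import platform


def is_apple_silicon(system: str, machine: str) -> bool:
    """Return True only for macOS ("Darwin") running on an arm64 CPU."""
    return system == "Darwin" and machine == "arm64"


def describe_host() -> str:
    """Summarize whether the current interpreter's host looks buildable."""
    system, machine = platform.system(), platform.machine()
    verdict = "supported" if is_apple_silicon(system, machine) else "unsupported"
    return f"{system}/{machine}: {verdict}"


if __name__ == "__main__":
    print(describe_host())
```

Taking `system` and `machine` as arguments keeps the predicate testable on any host, including a Linux machine where the extension itself cannot build.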

.github/workflows/publish.yml

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+# Build sdist + arm64 wheel on Apple Silicon, upload to PyPI.
+# Configure either:
+# - PyPI "Trusted Publishing" for this repo + workflow (OIDC, no token), or
+# - Repository secret PYPI_API_TOKEN (classic API token) and uncomment the `with:` block below.
+name: Publish to PyPI
+
+on:
+  release:
+    types: [published]
+  workflow_dispatch:
+
+permissions:
+  contents: read
+  id-token: write
+
+jobs:
+  publish:
+    runs-on: macos-15
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install build
+        run: pip install --upgrade pip build
+
+      - name: Build sdist and wheel
+        run: python -m build
+
+      # Uses repo secret PYPI_API_TOKEN (classic API token). For OIDC instead, remove `with:` and
+      # configure https://pypi.org/manage/project/settings/publishing/ for this workflow.
+      - name: Publish to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          password: ${{ secrets.PYPI_API_TOKEN }}

README.md

Lines changed: 90 additions & 93 deletions
@@ -1,35 +1,31 @@
 # MCCL
 
-`torch.distributed` backend for **DDP + collectives on MPS** (Apple Silicon). **TCP** by default; **RDMA** if the OS exposes it (`MCCL_TRANSPORT`).
+[![CI](https://github.com/OWNER/REPO/actions/workflows/ci.yml/badge.svg)](https://github.com/OWNER/REPO/actions/workflows/ci.yml)
 
-## Install
+`torch.distributed` backend for DDP and collectives on **MPS** (Apple Silicon). TCP by default; RDMA only if the machine/OS actually supports it.
+
+macOS **arm64** only. CI is `macos-15` + `pytest tests/test_build.py tests/test_ddp.py` (two processes on one runner, not two laptops).
+
+Fix **`OWNER/REPO`** in the badge link and in `pyproject.toml` `[project.urls]` when the repo exists.
+
+## Requirements
 
-- Apple Silicon Mac, **Python 3.11+**, **PyTorch first** (nightlies / 2.10.0 tested), **Xcode CLT** (`xcode-select --install`). Optional: full Xcode for Metal precompile.
+- Apple Silicon Mac (arm64). No Intel.
+- **Xcode Command Line Tools**: `xcode-select --install` (needed to compile the extension).
+- **Full Xcode** — optional; speeds up Metal by emitting a `.metallib` at build time instead of JIT at runtime.
+- **Python 3.11+**
+- **PyTorch 2.5+** installed *before* you build or `pip install` this package.
+
+## Install
 
 ```bash
 pip install torch
-pip install -e .
+pip install mccl
 ```
 
-```python
-import torch.distributed as dist
-import mccl
-dist.init_process_group(backend="mccl", rank=rank, world_size=world_size)
-```
+Source tree: `pip install -e ".[dev]"`. If the PyPI name `mccl` is taken, rename in `pyproject.toml` and `setup.py`.
 
-Demo: https://github.com/user-attachments/assets/21865149-b077-4b65-93cc-f9e319ff0328
-Ethernet/Wi‑Fi OK; slower than a wired link.
-
-## Docs
-
-| | |
-|--|--|
-| Two Macs / TB | [docs/THUNDERBOLT_SETUP.md](docs/THUNDERBOLT_SETUP.md) |
-| Firewall, ports, buckets | [docs/MULTINODE.md](docs/MULTINODE.md) |
-| Hacking | [docs/DEVELOPING.md](docs/DEVELOPING.md) |
-| Tests | [TESTING.md](TESTING.md) |
-| Versions | [COMPATIBILITY.md](COMPATIBILITY.md) |
-| Timings (please add yours) | [RESULTS.md](RESULTS.md) |
+Demo: https://github.com/user-attachments/assets/21865149-b077-4b65-93cc-f9e319ff0328
 
 ## Examples
 
@@ -39,115 +35,116 @@ torchrun --nproc_per_node=2 --nnodes=1 --master_addr=127.0.0.1 --master_port=295
 examples/ddp_dummy_train.py
 ```
 
+Defaults there: DDP `BATCH_SIZE=128` per rank → global 256 with 2 ranks; baseline path uses global 256 unless you override. Shrink batch if you OOM.
+
+Minimal DDP script (run with `torchrun` below). Multi-node needs `MCCL_LISTEN_ADDR`, `MCCL_PORT_BASE`, etc. — [docs/MULTINODE.md](docs/MULTINODE.md).
+
 ```python
-import mccl
+import os
+import torch
+import torch.nn as nn
+import torch.distributed as dist
 from torch.nn.parallel import DistributedDataParallel as DDP
-# mccl.init(...) optional — see Configuration
-dist.init_process_group(backend="mccl", rank=rank, world_size=world_size)
-model = DDP(MyModel().to("mps"))
+import mccl
+
+def main():
+    rank = int(os.environ["RANK"])
+    world_size = int(os.environ["WORLD_SIZE"])
+    device = torch.device("mps:0")
+
+    dist.init_process_group(backend="mccl", device_id=device)
+
+    torch.manual_seed(42)
+    model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
+    ddp_model = DDP(model)
+    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-3)
+    loss_fn = nn.CrossEntropyLoss()
+
+    for step in range(10):
+        x = torch.randn(8, 512, device=device)
+        y = torch.randint(0, 10, (8,), device=device)
+        optimizer.zero_grad(set_to_none=True)
+        loss_fn(ddp_model(x), y).backward()
+        optimizer.step()
+        if rank == 0:
+            print(step, "ok")
+
+    dist.destroy_process_group()
+
+if __name__ == "__main__":
+    main()
 ```
 
-`ddp_dummy_train.py` defaults: **DDP** `BATCH_SIZE=128` per rank (global **256** with 2 ranks); **`--baseline`** global **256** unless you set `BASELINE_BATCH_SIZE` / `BATCH_SIZE`. Lower if you OOM.
+```bash
+torchrun --nproc_per_node=2 --nnodes=1 --master_addr=127.0.0.1 --master_port=29500 your_train.py
+```
+
+## Docs
 
-## Throughput (example — **your mileage will vary**)
+| | |
+|--|--|
+| Two Macs / TB | [docs/THUNDERBOLT_SETUP.md](docs/THUNDERBOLT_SETUP.md) |
+| Firewall, ports, env tuning | [docs/MULTINODE.md](docs/MULTINODE.md) (`MCCL_*`, buckets, listen addr) |
+| Dev / layout | [docs/DEVELOPING.md](docs/DEVELOPING.md) |
+| Tests | [TESTING.md](TESTING.md) |
+| Versions | [COMPATIBILITY.md](COMPATIBILITY.md) |
+| Timings | [RESULTS.md](RESULTS.md) |
 
-Latest author run (`examples/ddp_dummy_train.py` + `--save-stats` + `examples/benchmark_throughput.py`):
+## Throughput
+
+One saved run, **M4 Max** + **M1 Max** MBPs, TCP over TB, global batch **256**, ~**96.5M** params. Your numbers will differ.
 
 ```
-single M1 Max (MPS): 78.3 samples/s (global_batch=256, world=1)
+single M1 Max (MPS): 78.3 samples/s (global_batch=256, world=1)
 DDP (MCCL): 134.2 samples/s (global_batch=256, world=2)
-baseline / DDP: 0.58× → DDP ~172% of baseline samples/s
-params: 96,510,024 (same both sides)
+baseline / DDP: 0.58× (~172% DDP vs baseline)
 ```
 
-**Global batch 256:** baseline = `BASELINE_BATCH_SIZE=256` (or `BATCH_SIZE=256` on one process). DDP = **`BATCH_SIZE=128` per rank × 2 ranks** (or 256×1 on two nodes — same global).
-
-**Takeaway:** with this **~96M-param** model, **big batch + higher GPU/memory util**, **2-rank MCCL DDP beat one M1 Max** on **samples/s** in our JSON. With **small global batch** (we’ve seen **global 8**), **comm/sync dominated** and DDP looked **much slower**. **Hardware mix still matters** (slowest rank sets the step); **PR [RESULTS.md](RESULTS.md)** with your setup.
+Tiny batches = comm noise dominates. Different chips on each rank = slowest one paces the step.
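
The throughput figures quoted in the README can be reproduced with a few lines of arithmetic. The sketch below only uses the two sample rates and the batch settings that appear in the diff; `speedup` is a hypothetical helper name. Dividing the two rounded rates directly gives roughly 171%, while inverting the rounded 0.58 gives roughly 172%, so the README figure is consistent to within rounding.

```python
from typing import Tuple


def speedup(baseline_sps: float, ddp_sps: float) -> Tuple[float, float]:
    """Return (baseline/DDP ratio, DDP throughput as a percent of baseline)."""
    ratio = baseline_sps / ddp_sps
    percent = 100.0 * ddp_sps / baseline_sps
    return ratio, percent


# DDP side: BATCH_SIZE=128 per rank on 2 ranks, as stated in the README.
per_rank_batch, world_size = 128, 2
global_batch = per_rank_batch * world_size

# Sample rates copied from the README's saved run.
ratio, percent = speedup(baseline_sps=78.3, ddp_sps=134.2)

print(f"global batch: {global_batch}")
print(f"baseline / DDP: {ratio:.2f}x")
print(f"DDP vs baseline: {percent:.0f}%")
```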
 
 ```bash
-# Match global batch: e.g. DDP BATCH_SIZE=128 × world 2 → baseline BASELINE_BATCH_SIZE=256
 python examples/ddp_dummy_train.py --baseline --save-stats baseline_stats.json
 torchrun --nproc_per_node=2 --nnodes=1 --master_addr=127.0.0.1 --master_port=29500 \
 examples/ddp_dummy_train.py --save-stats ddp_stats.json
 python examples/benchmark_throughput.py --baseline baseline_stats.json --ddp ddp_stats.json -o bench
 ```
 
-`bash scripts/benchmark_matrix.sh` — other checks. Env: [examples/ddp_dummy_train.py](examples/ddp_dummy_train.py).
+`bash scripts/benchmark_matrix.sh` for more sweeps.
 
 ![bench](bench.png)
 ![bars](bench_bars.png)
 
-## Collectives
+## PyPI (maintainers)
 
-`allreduce`, `broadcast`, `barrier`, `allgather`, `reduce_scatter`, `send`, `recv`
+**CI (tests only):** push to **`main`** or **`master`**, or open a PR targeting those branches. That runs [`.github/workflows/ci.yml`](.github/workflows/ci.yml) — it does **not** upload to PyPI.
 
-## Configuration
-
-Python `MCCL.init` / `MCCLConfig` or env; Python wins if set before `init_process_group`.
-
-| Env | Default | Notes |
-|-----|---------|--------|
-| `MCCL_TRANSPORT` | `auto` | `tcp`, `rdma`, `auto` |
-| `MCCL_LOG_LEVEL` | `WARN` | |
-| `MCCL_LISTEN_ADDR` | auto | |
-| `MCCL_LINK_PROFILE` | | `thunderbolt` → bigger buffers |
-| `MCCL_PORT_BASE` | `29600` | **≠** `MASTER_PORT` |
-| `MCCL_IFNAME` | auto | |
-| `MCCL_CHUNK_BYTES` | `4194304` | |
-| `MCCL_SMALL_MSG_THRESHOLD` | `65536` | |
-| `MCCL_TRANSPORT_CRC` | `false` | |
-| `MCCL_FAST_MATH` | `true` | |
-| `MCCL_GPU_THRESHOLD` | `4096` | |
-| `MCCL_COMPRESSION` | `none` | `fp16`, `topk` |
-| `MCCL_TOPK_RATIO` | `0.01` | |
-| `MCCL_SHADER_PATH` | auto | |
+**Upload:** [`.github/workflows/publish.yml`](.github/workflows/publish.yml) runs on **GitHub Release (published)** or **Actions → Publish to PyPI → Run workflow** (`workflow_dispatch`).
 
-## Diagnostics
+1. GitHub repo → **Settings → Secrets and variables → Actions** → New repository secret **`PYPI_API_TOKEN`** (PyPI → Account settings → API tokens).
+2. Bump **`version`** in `pyproject.toml`, `setup.py`, and the assertion in `tests/test_build.py`.
+3. Either: **Releases → Draft a new release** → publish (triggers upload), or **Actions** tab → **Publish to PyPI** → **Run workflow** → branch `main`.
 
-```python
-mccl.get_metrics(); mccl.log_metrics(); mccl.reset_metrics()
-```
-
-`MCCL_LOG_LEVEL=INFO` — full config on startup. **Hung multi-node:** [docs/MULTINODE.md](docs/MULTINODE.md).
-
-## Thunderbolt, TCP, and what was tested
-
-- **Throughput charts in this repo:** **TCP** over a **Thunderbolt bridge IP** (TB3-class link here). **Not RDMA.** Plots match the **high global-batch** run above unless you regenerate.
-- **Ethernet / Wi‑Fi:** same code paths; expect worse RTT and throughput than a direct TB cable between hosts.
-- **`MCCL_LINK_PROFILE=thunderbolt`:** optional buffer/chunk tuning when peers are on a TB link — see [docs/MULTINODE.md](docs/MULTINODE.md).
-- **Cabling / `169.254.x.x` / firewall:** [docs/THUNDERBOLT_SETUP.md](docs/THUNDERBOLT_SETUP.md).
+First-time PyPI: create the **`mccl`** project on pypi.org (or change the package `name` everywhere if the name is taken).
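
Step 2 of the maintainer checklist asks for the same version string in several files, which is easy to get wrong by hand. A minimal sketch of a consistency check follows, assuming each file carries a `version = "..."` style assignment. The regex, the helper names `find_version` and `versions_agree`, and the sample file contents are illustrative assumptions, not taken from the repository.

```python
import re
from typing import Dict, Optional

# Matches lines such as `version = "1.2.3"` or `    version="1.2.3",`.
VERSION_RE = re.compile(r"""^\s*version\s*=\s*["']([^"']+)["']""", re.MULTILINE)


def find_version(text: str) -> Optional[str]:
    """Return the first version string found in the text, or None."""
    match = VERSION_RE.search(text)
    return match.group(1) if match else None


def versions_agree(sources: Dict[str, str]) -> bool:
    """True when every source declares a version and all of them are equal."""
    found = [find_version(text) for text in sources.values()]
    return None not in found and len(set(found)) == 1


if __name__ == "__main__":
    # Hypothetical file contents; a real check would read the files from disk.
    sample = {
        "pyproject.toml": '[project]\nname = "mccl"\nversion = "0.0.1"\n',
        "setup.py": 'setup(\n    name="mccl",\n    version="0.0.1",\n)\n',
    }
    print("versions agree:", versions_agree(sample))
```

Anchoring the pattern at the start of a line keeps keys such as `python-version` or `requires-python` from matching by accident.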
 
-## RDMA (Thunderbolt 5)
-
-RDMA is **optional** and **not** what produced the benchmark plots above. It needs **TB5 hardware** (e.g. **M4 Pro / Max / Ultra** with TB5 ports), a **direct TB5 cable** (no hub), and a **macOS build that ships `librdma.dylib`** (Apple’s docs call out newer macOS versions). We have **not** done serious end-to-end training benchmarks on RDMA here — treat latency/BW claims you see elsewhere as **hypotheses until you measure**.
+## Collectives
 
-**Enable once (Recovery OS):** `rdma_ctl enable` → reboot.
+`allreduce`, `broadcast`, `barrier`, `allgather`, `reduce_scatter`, `send`, `recv`
 
-**Sanity check:**
+## Diagnostics
 
-```bash
-python -c "import ctypes; ctypes.cdll.LoadLibrary('librdma.dylib'); print('RDMA lib OK')"
+```python
+mccl.get_metrics(); mccl.log_metrics(); mccl.reset_metrics()
 ```
 
-**Use in MCCL:** `MCCL_TRANSPORT=rdma` or `mccl.init(transport="rdma")`.
+Verbose startup: `MCCL_LOG_LEVEL=INFO`. Stuck multi-node: [docs/MULTINODE.md](docs/MULTINODE.md).
 
-| Mode | Behavior |
-|------|----------|
-| `auto` (default) | Try RDMA, fall back to TCP |
-| `rdma` | RDMA preferred; TCP if init fails |
-| `tcp` | Skip RDMA; TCP only |
+## Transport
 
-**Gotchas:** if RDMA init fails, logs at `MCCL_LOG_LEVEL=INFO` say why. Apple limits **~100 memory regions** per device — very large jobs can hit MR registration errors (reduce ranks or buffer churn). **PR [RESULTS.md](RESULTS.md)** if you have real TB5+RDMA training numbers.
+Bench plots were TCP over a Thunderbolt-style link, not RDMA. Wi‑Fi/Ethernet work, just slower. TB wiring: [docs/THUNDERBOLT_SETUP.md](docs/THUNDERBOLT_SETUP.md). RDMA path exists on TB5-capable hardware + `librdma.dylib`; `rdma_ctl enable` from Recovery once; we didn’t use that for the graphs above.
 
 ## Internals
 
-**f32 reductions on CPU-visible MPS memory:** element-wise adds, scales, min/max, product go through **Apple Accelerate / vDSP** (`vDSP_vadd`, `vDSP_vsmul`, …) in `csrc/metal/AccelerateOps.mm`, with a **parallel chunk loop** so big tensors use all performance cores. On Apple Silicon, those **vector library paths are AMX-backed where Accelerate routes them** — we’re not hand-writing AMX kernels; we lean on **vDSP + UMA** (CPU pointer is valid for shared MPS buffers, so reductions can run without extra copies in the common f32 case).
-
-**fp16 / bf16:** Metal shaders when above `MCCL_GPU_THRESHOLD`; small tensors can widen → **f32 vDSP** → narrow.
-
-**Network:** TCP progress thread, overlap, 2-rank vs ring allreduce. **CRC:** ARM `crc32` hw when enabled.
-
-More layout / build: [docs/DEVELOPING.md](docs/DEVELOPING.md).
+f32 reductions: Accelerate/vDSP on CPU-visible MPS memory (`csrc/metal/AccelerateOps.mm`), chunked; Apple’s math can route through AMX. fp16/bf16: Metal when big enough, else widen to f32 and back. Network: TCP progress thread, ring/direct allreduce depending on world size. Details: [docs/DEVELOPING.md](docs/DEVELOPING.md)
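
The Internals note mentions ring allreduce without showing the data flow. The sketch below is a single-process, pure-Python simulation of the textbook ring algorithm (a reduce-scatter pass followed by an allgather pass), with each "rank" modeled as a list. It is an illustration under those assumptions, not MCCL's implementation; `ring_allreduce` is a hypothetical name, and nothing here touches sockets, MPS, or vDSP.

```python
from typing import List


def ring_allreduce(rank_data: List[List[int]]) -> List[List[int]]:
    """Simulate a sum allreduce over a ring; returns one result per rank."""
    n = len(rank_data)
    length = len(rank_data[0])
    # Split each buffer into n contiguous chunks (sizes may differ by one).
    bounds = [(i * length) // n for i in range(n + 1)]
    bufs = [list(values) for values in rank_data]

    def chunk(index: int) -> range:
        index %= n
        return range(bounds[index], bounds[index + 1])

    def exchange(step: int, offset: int, accumulate: bool) -> None:
        # Snapshot what every rank sends first, so all sends look simultaneous.
        outgoing = []
        for rank in range(n):
            c = (rank + offset - step) % n
            outgoing.append((c, [bufs[rank][j] for j in chunk(c)]))
        # Each rank delivers its chunk to the next rank on the ring.
        for rank, (c, payload) in enumerate(outgoing):
            dst = (rank + 1) % n
            for j, value in zip(chunk(c), payload):
                bufs[dst][j] = bufs[dst][j] + value if accumulate else value

    # Reduce-scatter: after n-1 steps, rank r owns the full sum of chunk r+1.
    for step in range(n - 1):
        exchange(step, offset=0, accumulate=True)
    # Allgather: circulate the finished chunks until every rank has all of them.
    for step in range(n - 1):
        exchange(step, offset=1, accumulate=False)
    return bufs


if __name__ == "__main__":
    data = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
    for rank, result in enumerate(ring_allreduce(data)):
        print(rank, result)
```

Each rank sends and receives only one chunk per step, which is why the ring scales in bandwidth; with two ranks a direct exchange does the same work in fewer steps, which may be why the README distinguishes the two cases.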
 
 ## License
 

pyproject.toml

Lines changed: 14 additions & 0 deletions
@@ -16,6 +16,20 @@ requires-python = ">=3.11"
 dependencies = [
     "torch>=2.5.0",
 ]
+classifiers = [
+    "Development Status :: 4 - Beta",
+    "Intended Audience :: Developers",
+    "License :: OSI Approved :: MIT License",
+    "Operating System :: MacOS :: MacOS X",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+]
+# Replace with your GitHub path after publishing (badge + PyPI sidebar).
+[project.urls]
+Repository = "https://github.com/OWNER/REPO"
+
 
 [project.optional-dependencies]
 dev = [
