`ddp_dummy_train.py` defaults: **DDP** `BATCH_SIZE=128` per rank (global **256** with 2 ranks); **`--baseline`** global **256** unless you set `BASELINE_BATCH_SIZE` / `BATCH_SIZE`. Lower if you OOM.
baseline / DDP: 0.58× → DDP ~172% of baseline samples/s
60
-
params: 96,510,024 (same both sides)
100
+
baseline / DDP: 0.58× (~172% DDP vs baseline)
61
101
```
62
102
63
-
**Global batch 256:** baseline = `BASELINE_BATCH_SIZE=256` (or `BATCH_SIZE=256` on one process). DDP = **`BATCH_SIZE=128` per rank × 2 ranks** (or 256×1 on two nodes — same global).
64
-
65
-
**Takeaway:** with this **~96M-param** model, a **big batch**, and **higher GPU/memory utilization**, **2-rank MCCL DDP beat one M1 Max** on **samples/s** in our JSON. With a **small global batch** (we’ve seen **global 8**), **comm/sync dominated** and DDP looked **much slower**. **Hardware mix still matters** (the slowest rank sets the step time); **open a PR against [RESULTS.md](RESULTS.md)** with your setup.
103
+
Tiny batches = comm noise dominates. Different chips on each rank = slowest one paces the step.
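The batch and throughput figures quoted here can be checked with a few lines of arithmetic. This is only a sketch: the helper names are hypothetical (they are not part of `ddp_dummy_train.py`), and it assumes the `0.58×` figure is baseline samples/s divided by DDP samples/s.

```python
def global_batch(per_rank_batch: int, world_size: int) -> int:
    """Global batch under DDP: each rank processes its own per-rank batch every step."""
    return per_rank_batch * world_size


def ddp_percent_of_baseline(baseline_over_ddp: float) -> float:
    """Turn a baseline/DDP throughput ratio into DDP throughput as a percent of baseline."""
    return 100.0 / baseline_over_ddp


if __name__ == "__main__":
    # BATCH_SIZE=128 per rank on 2 ranks matches BASELINE_BATCH_SIZE=256.
    print(global_batch(128, 2))                    # 256
    # baseline / DDP = 0.58x means DDP runs at roughly 172% of baseline samples/s.
    print(round(ddp_percent_of_baseline(0.58)))    # 172
```

When comparing your own runs, keep the global batch equal on both sides before reading anything into the ratio.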
66
104
67
105
```bash
68
-
# Match global batch: e.g. DDP BATCH_SIZE=128 × world 2 → baseline BASELINE_BATCH_SIZE=256
**CI (tests only):** push to **`main`** or **`master`**, or open a PR targeting those branches. That runs [`.github/workflows/ci.yml`](.github/workflows/ci.yml) — it does **not** upload to PyPI.
83
120
84
-
## Configuration
85
-
86
-
Python `MCCL.init` / `MCCLConfig` or env; Python wins if set before `init_process_group`.
**Upload:** [`.github/workflows/publish.yml`](.github/workflows/publish.yml) runs on **GitHub Release (published)** or **Actions → Publish to PyPI → Run workflow** (`workflow_dispatch`).
104
122
105
-
## Diagnostics
123
+
1. GitHub repo → **Settings → Secrets and variables → Actions** → New repository secret **`PYPI_API_TOKEN`** (PyPI → Account settings → API tokens).
124
+
2. Bump **`version`** in `pyproject.toml`, `setup.py`, and the assertion in `tests/test_build.py`.
125
+
3. Either: **Releases → Draft a new release** → publish (triggers upload), or **Actions** tab → **Publish to PyPI** → **Run workflow** → branch `main`.
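Step 2 is easy to get wrong because the version lives in more than one file. A minimal pre-release check might look like the sketch below. The helper names and the regex are assumptions for illustration, not something the repo ships, and the sketch only covers `version = "..."` style declarations (it does not parse the assertion in `tests/test_build.py`, whose exact form is not shown here).

```python
import re

# Matches `version = "x.y.z"` or `version="x.y.z"`, but not keys such as
# `target-version` or `minversion` (the lookbehind rejects a preceding word char or hyphen).
VERSION_RE = re.compile(r"""(?<![\w-])version\s*=\s*["']([^"']+)["']""")


def find_version(text):
    """Return the first declared version string in `text`, or None if there is none."""
    match = VERSION_RE.search(text)
    return match.group(1) if match else None


def versions_agree(*texts):
    """True only if every text declares a version and all declared versions are equal."""
    versions = {find_version(t) for t in texts}
    return len(versions) == 1 and None not in versions


if __name__ == "__main__":
    # Illustrative file contents; in practice, read pyproject.toml and setup.py from disk.
    pyproject = '[project]\nname = "mccl"\nversion = "0.3.0"\n'
    setup_py = 'setup(name="mccl", version="0.3.0")\n'
    print(versions_agree(pyproject, setup_py))
```

Running a check like this before publishing a release catches a mismatched bump before the upload workflow starts.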
`MCCL_LOG_LEVEL=INFO` logs the full config on startup. **Hung multi-node:** see [docs/MULTINODE.md](docs/MULTINODE.md).
112
-
113
-
## Thunderbolt, TCP, and what was tested
114
-
115
-
- **Throughput charts in this repo:** **TCP** over a **Thunderbolt bridge IP** (TB3-class link here). **Not RDMA.** Plots match the **high global-batch** run above unless you regenerate.
116
-
- **Ethernet / Wi‑Fi:** same code paths; expect worse RTT and throughput than a direct TB cable between hosts.
117
-
- **`MCCL_LINK_PROFILE=thunderbolt`:** optional buffer/chunk tuning when peers are on a TB link; see [docs/MULTINODE.md](docs/MULTINODE.md).
First-time PyPI: create the **`mccl`** project on pypi.org (or change the package `name` everywhere if the name is taken).
119
128
120
-
## RDMA (Thunderbolt 5)
121
-
122
-
RDMA is **optional** and **not** what produced the benchmark plots above. It needs **TB5 hardware** (e.g. **M4 Pro / Max / Ultra** with TB5 ports), a **direct TB5 cable** (no hub), and a **macOS build that ships `librdma.dylib`** (Apple’s docs call out newer macOS versions). We have **not** done serious end-to-end training benchmarks on RDMA here — treat latency/BW claims you see elsewhere as **hypotheses until you measure**.
129
+
## Collectives
123
130
124
-
**Enable once (Recovery OS):** `rdma_ctl enable` → reboot.
**Gotchas:** if RDMA init fails, logs at `MCCL_LOG_LEVEL=INFO` say why. Apple limits **~100 memory regions** per device, so very large jobs can hit MR registration errors (reduce ranks or buffer churn). **Open a PR against [RESULTS.md](RESULTS.md)** if you have real TB5+RDMA training numbers.
143
+
Bench plots were TCP over a Thunderbolt-style link, not RDMA. Wi‑Fi/Ethernet work, just slower. TB wiring: [docs/THUNDERBOLT_SETUP.md](docs/THUNDERBOLT_SETUP.md). RDMA path exists on TB5-capable hardware + `librdma.dylib`; `rdma_ctl enable` from Recovery once; we didn’t use that for the graphs above.
141
144
142
145
## Internals
143
146
144
-
**f32 reductions on CPU-visible MPS memory:** element-wise adds, scales, min/max, product go through **Apple Accelerate / vDSP** (`vDSP_vadd`, `vDSP_vsmul`, …) in `csrc/metal/AccelerateOps.mm`, with a **parallel chunk loop** so big tensors use all performance cores. On Apple Silicon, those **vector library paths are AMX-backed where Accelerate routes them** — we’re not hand-writing AMX kernels; we lean on **vDSP + UMA** (CPU pointer is valid for shared MPS buffers, so reductions can run without extra copies in the common f32 case).
145
-
146
-
**fp16 / bf16:** Metal shaders when above `MCCL_GPU_THRESHOLD`; small tensors can widen → **f32 vDSP** → narrow.
147
-
148
-
**Network:** TCP progress thread, overlap, 2-rank vs ring allreduce. **CRC:** ARM `crc32` hw when enabled.
149
-
150
-
More layout / build: [docs/DEVELOPING.md](docs/DEVELOPING.md).
147
+
f32 reductions: Accelerate/vDSP on CPU-visible MPS memory (`csrc/metal/AccelerateOps.mm`), chunked; Apple’s math can route through AMX. fp16/bf16: Metal when big enough, else widen to f32 and back. Network: TCP progress thread, ring/direct allreduce depending on world size. Details: [docs/DEVELOPING.md](docs/DEVELOPING.md)
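For readers unfamiliar with the ring variant mentioned above, here is a single-process simulation of a textbook ring allreduce (reduce-scatter followed by all-gather). It is a sketch of the general algorithm, not MCCL's implementation: the function name is hypothetical, plain Python lists stand in for tensors, and list copies stand in for TCP sends.

```python
def ring_allreduce(rank_data):
    """Simulate a sum-allreduce over n ranks.

    rank_data[r] is rank r's input list; its length must be divisible by n.
    Returns one output list per rank; all outputs equal the element-wise sum.
    """
    n = len(rank_data)
    length = len(rank_data[0])
    assert length % n == 0, "sketch assumes the buffer splits evenly into n chunks"
    chunk = length // n
    bufs = [list(d) for d in rank_data]  # each rank's private working buffer

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: in step s, rank r sends chunk (r - s) % n to rank r + 1,
    # which adds it into its own copy. All sends in a step are snapshotted first,
    # mimicking simultaneous transfers.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, bufs[r][span((r - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            bufs[dst][span(c)] = [a + b for a, b in zip(bufs[dst][span(c)], payload)]

    # Now rank r holds the fully reduced chunk (r + 1) % n.
    # All-gather: in step s, rank r forwards chunk (r + 1 - s) % n to rank r + 1,
    # which overwrites its stale copy.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, bufs[r][span((r + 1 - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            bufs[(r + 1) % n][span(c)] = payload

    return bufs


if __name__ == "__main__":
    data = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60], [100, 200, 300, 400, 500, 600]]
    for out in ring_allreduce(data):
        print(out)
```

Each rank sends and receives `2 * (n - 1)` chunks of size `length / n`, which is why the ring pattern scales with world size while a direct exchange is simpler for two ranks.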