[fully_async, ckpt, rollout, trainer, tool, cfg] fix: ROCm async training compatibility for AMD MI300X #6002
Conversation
Code Review
This pull request optimizes weight transfer by implementing buffer reuse in the BucketedWeightSender and persisting the weight sender instance. It also replaces CuPy with Torch in the NCCL checkpoint engine and updates ZMQ handle naming. Review feedback identifies a critical bug where ZMQ handles are inconsistently named using global ranks on the sender side and local ranks on the receiver side, which will cause failures in multi-node environments. Additionally, a suggestion was made to explicitly nullify buffers when unlinking shared memory to prevent potential use-after-free issues.
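The review's shared-memory point can be illustrated with a small stdlib sketch (not code from this PR — plain `multiprocessing.shared_memory` stands in for the weight-transfer buffers): drop every reference into the segment before closing and unlinking it, so no stale view can be read after the backing memory is gone.

```python
from multiprocessing import shared_memory

# Sketch of the review suggestion: nullify buffer references BEFORE
# close()/unlink(), preventing use-after-free through stale views.
seg = shared_memory.SharedMemory(create=True, size=16)
buf = seg.buf            # memoryview into the shared segment
buf[:4] = b"abcd"        # use the buffer while it is alive

buf = None               # nullify the reference first
seg.close()              # then release the local mapping
seg.unlink()             # finally remove the segment itself
print("cleaned up")
```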
Fix multiple issues that prevent fully async FSDP2 training from working on AMD ROCm platforms (tested on MI308X with ROCm 7.2, also verified compatible on H20 with CUDA):

1. Unify NCCL checkpoint engine buffers to torch tensors, replacing cupy, which causes HIP stream synchronization issues on ROCm
2. Add `HSA_NO_SCRATCH_RECLAIM` env var required by AMD RCCL
3. Fix `numpy.bool_` JSON serialization with numpy 2.x
4. Materialize the generator before `send_weights` to prevent FSDP `all_gather` deadlock across ranks
5. Use deterministic rank-based ZMQ IPC handles instead of the GPU UUID, which differs between the checkpoint engine and vLLM workers on ROCm
6. Clean up stale ZMQ IPC socket files to prevent bind failures on restart
7. Fix the Hydra searchpath to use `pkg://` instead of `file://` for editable installs
8. Add `get_if_exists` to the sandbox Ray actor to prevent duplicate creation
9. Persist weight sync buffers (NCCL + IPC) to prevent HIP memory fragmentation OOM from repeated alloc/free cycles

Made-with: Cursor
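Fix 3 can be demonstrated with a small stdlib sketch. The stand-in class below models the relevant property of numpy 2.x's `np.bool_` (truthy but not a subclass of Python `bool`, so `json` refuses to encode it); numpy itself is deliberately not imported here.

```python
import json

class NpBoolStandIn:
    """Stand-in for numpy 2.x np.bool_: truthy, but NOT a bool subclass,
    so json.dumps cannot serialize it directly."""
    def __init__(self, value):
        self.value = bool(value)
    def __str__(self):
        return str(self.value)

metrics = {"converged": NpBoolStandIn(True), "step": 3}

try:
    json.dumps(metrics)          # raises TypeError, like np.bool_ would
    raise AssertionError("expected TypeError")
except TypeError:
    pass

# The workaround described above: stringify anything json cannot encode.
encoded = json.dumps(metrics, default=str)
print(encoded)  # {"converged": "True", "step": 3}
```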
```diff
 if self.config.rollout.checkpoint_engine.backend != "naive":
     per_tensor_param, _ = self.actor.engine.get_per_tensor_param()
+    per_tensor_param = list(per_tensor_param)
     await self.checkpoint_engine.send_weights(per_tensor_param)
```
This will materialize the weight generator and gather all sharded weights onto each GPU, causing CUDA OOM for large models.
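The reviewer's concern can be sketched with plain bytearrays standing in for GPU tensors (sizes and names are hypothetical): `list()` keeps every gathered tensor alive at once, while consuming the generator streams them one at a time.

```python
MB = 1024 * 1024

def per_tensor_param_stub(n=4):
    # Stand-in for get_per_tensor_param(): each yielded item models one
    # fully gathered (unsharded) weight tensor.
    for i in range(n):
        yield f"layer{i}.weight", bytearray(MB)

# Streaming: peak live memory is a single tensor at a time.
peak_streaming = max(len(t) for _, t in per_tensor_param_stub())

# Materializing with list(): all n tensors are alive simultaneously.
materialized = list(per_tensor_param_stub())
peak_materialized = sum(len(t) for _, t in materialized)

assert peak_materialized == 4 * peak_streaming
print(peak_streaming, peak_materialized)
```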
```diff
-self.device_uuid = get_device_uuid(get_device_id())
-self.zmq_handle = f"ipc:///tmp/rl-colocate-zmq-{self.device_uuid}.sock"
+self.zmq_handle = f"ipc:///tmp/rl-colocate-zmq-rank-{rank % local_world_size}.sock"
```
This will conflict for multiple vLLM replicas on the same node, e.g. 2 replicas with TP=4 located on the same node.
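A toy reproduction of that collision (the formula mirrors the diff above; the two-replica layout is a hypothetical example): on one 8-GPU node, two vLLM replicas with TP=4 each number their workers 0..3 inside their own process group, so both derive identical socket paths.

```python
LOCAL_WORLD_SIZE = 8  # 8 GPUs on the node

def zmq_handle(rank: int) -> str:
    # Same scheme as in the diff above.
    return f"ipc:///tmp/rl-colocate-zmq-rank-{rank % LOCAL_WORLD_SIZE}.sock"

replica_a = [zmq_handle(r) for r in range(4)]  # replica A: ranks 0..3
replica_b = [zmq_handle(r) for r in range(4)]  # replica B: also ranks 0..3

# Both replicas compute identical socket paths, so the second bind() fails.
assert replica_a == replica_b
print(replica_a[0])
```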
```python
def prepare(self) -> MasterMetadata:
    # For master process, use cupy instead of torch to avoid memory register error
    # when `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`.
```
Please respect this comment.
What does this PR do?
Fix multiple issues that prevent fully async FSDP2 training from working on AMD ROCm platforms (MI300X series).
Environment:
A pre-existing `AttributeError: 'list' object has no attribute 'dim'` (in `agent_loop.py:696`) is hit — this bug exists with or without this patch and is unrelated to these changes.

Training curves (MI308X vs H20) and training script
dapo_7b_fully_async.sh

Checklist Before Starting
- `[{modules}] {type}: {description}` (This will be checked by the CI)

Test
Validated by full async FSDP2 DAPO/GRPO RL + ReTool training on AMD MI308X:
Applied this patch on latest verl main (commit 9b54564) + vLLM 0.18.2 on NVIDIA H20. Training ran normally up to step 86 / global_step 344, where a pre-existing bug in agent_loop.py:698 is hit:
```
  File "verl/experimental/agent_loop/agent_loop.py", line 698, in _agent_loop_postprocess
    if response_mask_output["input_ids"].dim() == 1:
AttributeError: 'list' object has no attribute 'dim'
```
This is caused by tokenizer.pad() returning a Python list instead of a torch.Tensor for response_mask in certain edge cases, even with return_tensors="pt". This bug exists on the current main branch with or without this patch — it is not introduced by any changes in this PR. The file agent_loop.py is not modified in this PR.
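A defensive guard for that failure mode could look like the sketch below (this is an illustration, not code from this PR or from `agent_loop.py`; the helper name is hypothetical): convert the value to a tensor before calling `.dim()`.

```python
import torch

def ensure_pt(value):
    # tokenizer.pad() can hand back a plain Python list for response_mask
    # even when return_tensors="pt" was passed, and a list has no .dim();
    # normalize to a tensor before any tensor-only method is called.
    return value if isinstance(value, torch.Tensor) else torch.as_tensor(value)

mask = ensure_pt([1, 1, 0])   # the buggy case: a plain list in
assert isinstance(mask, torch.Tensor) and mask.dim() == 1
print(tuple(mask.shape))
```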
MI308X (ROCm 7.2): 250+ global steps, 60+ weight syncs completed without OOM or deadlock. The same agent_loop.py bug was also encountered on MI308X at a later step, confirming it is platform-independent.
API and Usage Example
No API changes. All fixes are internal implementation details.
Design & Code Changes
NCCL checkpoint engine: unify buffers to torch tensors (`nccl_checkpoint_engine.py`)
- Replace `cupy` buffers with `torch.zeros` to fix HIP stream synchronization issues on ROCm
- Drop the `cp.asarray()` conversion in `send_weights`

Add `HSA_NO_SCRATCH_RECLAIM` env var (`constants_ppo.py`)
- Required by AMD RCCL; avoids `ncclSystemError`

Fix `numpy.bool_` JSON serialization (`ray_trainer.py`)
- Add a `default=str` fallback for `json.dumps`, since with numpy 2.x `bool_` is no longer a Python `bool` subclass

Materialize generator before `send_weights` (`engine_workers.py`)
- `get_per_tensor_param()` returns a generator containing `full_tensor()` calls that trigger FSDP `all_gather`
- Use `list()` to materialize + `torch.cuda.synchronize()` before sending

ZMQ IPC handle: use rank instead of GPU UUID (`vllm_rollout.py`, `utils.py`)
- `CheckpointEngineWorker` and the vLLM worker see different GPU UUIDs due to different `CUDA_VISIBLE_DEVICES`/`HIP_VISIBLE_DEVICES` settings

Clean up stale ZMQ IPC socket files (`bucketed_weight_transfer.py`)
- Remove `/tmp/rl-colocate-zmq-rank-*.sock` files before `bind()` and after `close()` to prevent `Address already in use` on restart

Fix Hydra searchpath (`fully_async_ppo_trainer.yaml`)
- Use `pkg://verl.trainer.config` instead of `file://verl/trainer/config` for editable installs

Sandbox Ray actor reuse (`sandbox_fusion_tools.py`)
- Pass `name` and `get_if_exists=True` to prevent duplicate `ExecutionWorker` actor creation

Persist weight sync buffers to prevent OOM (`nccl_checkpoint_engine.py`, `vllm_rollout.py`, `bucketed_weight_transfer.py`)
- On ROCm, `torch.cuda.empty_cache()` does not effectively return physical memory to the system, so repeated alloc/free cycles fragment HIP memory

Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR touches the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.