
Conversation

@MaoZiming
Member

Description

Please include a summary of the changes and the related issue.

Fixes # (issue)

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • My code follows the style guidelines, e.g. format.sh.
  • I have run build_and_install.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

Base automatically changed from zm-amd-port to port-normal-amd November 5, 2025 11:54
@zhenhuang12 force-pushed the port-normal-amd branch 4 times, most recently from fa90b5d to 0fc9265 on November 5, 2025 12:02
Base automatically changed from port-normal-amd to main November 5, 2025 17:53
@MaoZiming
Member Author

MaoZiming commented Nov 5, 2025

[screenshot]

I have been debugging normal mode. cc @zhenhuang12
To reproduce:

torchrun --nnodes=2 --nproc_per_node=8 --node_rank=[0/1]   --master_addr=45.76.20.37 --master_port=12355   bench/test_internode.py  --num-tokens=512   --hidden=7168 --num-topk=1 --num-experts=288

@zhenhuang12
Collaborator

Hi @MaoZiming, I have committed a fix for internode_dispatch, but the result of dispatch is still incorrect; it still has some errors. I am working on it.

[screenshot]

@MaoZiming
Member Author

Thanks! I ran it just now and it seems to get stuck somewhere. I will also take a look today or tomorrow to see if I can find anything.

@zhenhuang12
Collaborator

Hi @MaoZiming, I fixed a porting error, but it occasionally throws "RDMA check failed"; sometimes it can also get through the check.
You can reproduce with:

torchrun --nnodes=2 --nproc_per_node=8 --node_rank=[0/1]   --master_addr=xxx  --master_port=12355   bench/test_internode.py  --num-tokens=64

In other words, there might be a synchronization error somewhere in the dispatch kernel. Could you please help me confirm whether nvshmemi_ibgda_amo_nonfetch_add is correct?
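For reference, here is the ordering I would expect a correct non-fetching add to preserve, sketched with plain system-scope atomics in local memory rather than the actual NVSHMEM/IBGDA call (the kernel name and fence placement below are my assumptions, not the real code):

// Local-memory analogy of the dispatch signal (illustration only): the payload write
// must become visible before the counter increment that the receiver's RDMA check polls on.
__global__ void send_signal(int* payload, int* counter, int value) {
  if (threadIdx.x == 0) {
    *payload = value;              // 1. write the data the receiver will consume
    __threadfence_system();        // 2. release: make it visible system-wide first
    atomicAdd_system(counter, 1);  // 3. then bump the counter (non-fetching add)
  }
}

If the increment can land before the payload is visible, or the counter update itself is delayed, the receiver can pass or fail the check nondeterministically, which would match the flaky behavior above.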

[screenshot]

@MaoZiming
Member Author

@zhenhuang12 Thank you! Will take a look

@YangZhou1997
Member

YangZhou1997 commented Nov 15, 2025

Hi @zhenhuang12, I just successfully ran the normal kernel of this branch (with superficial modifications) on two Nebius H100 + CX7 instances. So I think the current CPU proxy RDMA operations should be correct, and that the issue is in the AMD GPU kernel part.

My guess at the possible root cause is that the intranode token forwarding might read/write some stale values. Also, I find that the combine kernel will sometimes trigger this assert:

EP_HOST_ASSERT((num_forwarder_warps + 1) * WARP_SIZE <= MAX_NTHREADS);

Perhaps it is related to the AMD warp size of 64 and the 1024-thread maximum? CC @MaoZiming
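A minimal sketch of the arithmetic I think is behind that assert, assuming MAX_NTHREADS is the usual 1024-thread block limit (that value and the helper below are my assumptions, not the actual code):

// Largest num_forwarder_warps satisfying (num_forwarder_warps + 1) * WARP_SIZE <= MAX_NTHREADS.
constexpr int kMaxThreads = 1024;  // assumed value of MAX_NTHREADS
constexpr int max_forwarders(int warp_size) { return kMaxThreads / warp_size - 1; }
static_assert(max_forwarders(32) == 31, "NVIDIA (warp size 32): up to 31 forwarder warps fit");
static_assert(max_forwarders(64) == 15, "AMD (warp size 64): only 15 forwarder warps fit");

So a num_forwarder_warps value that is fine on NVIDIA (anything up to 31) can overflow the 1024-thread block on AMD once it exceeds 15, which would trip the assert.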

My modifications are superficial: basically restoring test_internode.py to the main-branch version and setting the number of proxy threads and FIFOs to 1; see zm-amd-port-debug...yang-amd-normal.

@MaoZiming
Member Author

@zhenhuang12

If you see timeouts related to

DeepEP dispatch forwarder timeout (RDMA check)

This timeout might mean that somehow the GPU is not reading the atomic value written by the CPU. I think @YangZhou1997 has verified that the correct number of atomics is written by the CPUs, so I suspect this issue might be due to memory consistency.
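To make the memory-consistency concern concrete, here is a minimal sketch of the polling pattern I have in mind (a hypothetical helper, not the actual forwarder code): the GPU must re-read the CPU-updated counter from memory rather than a cached copy, and give up after a timeout the way the RDMA check does.

// Hypothetical polling helper (illustration only): spin on a counter the CPU proxy
// updates, using a volatile load so each iteration re-reads memory instead of a
// stale register/L1 value; the timeout mirrors the forwarder-timeout path.
__device__ bool wait_counter_geq(int* counter, int expected, long long timeout_cycles) {
  long long start = clock64();
  while (*(volatile int*)counter < expected) {
    if (clock64() - start > timeout_cycles) return false;  // "dispatch forwarder timeout"
  }
  __threadfence_system();  // order later payload reads after observing the counter
  return true;
}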

If you see timeouts related to

DeepEP dispatch NVL receiver timeout

This is likely at https://github.com/uccl-project/uccl/blob/main/ep/src/internode.cu#L960-L1158, where an incorrect number of tokens is routed to the GPUs. Two similar bugs that I encountered and solved in the past trigger this timeout: (1) the GPUs not seeing the up-to-date head and tail buffer counters, and (2) the data payload somehow being overwritten by a different write.

I think @YangZhou1997 saw the second issue.
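For intuition on (1) and (2), here is the ring-buffer discipline I have in mind, as a sketch under my own assumptions (names and structure are illustrative, not the actual internode.cu code): the producer must respect the consumer's head before reusing a slot, and must fence the payload before publishing the new tail.

// Hypothetical single-producer push into a head/tail ring buffer (illustration only).
// Bug (1) corresponds to the consumer observing a stale tail; bug (2) corresponds to
// the producer reusing a slot whose previous payload has not been consumed yet.
__device__ void push_token(int* slots, int* head, int* tail, int num_slots, int value) {
  int my_tail = *(volatile int*)tail;
  while (my_tail - *(volatile int*)head >= num_slots) {
    // spin: the oldest slot is still unconsumed; writing now would overwrite it (bug 2)
  }
  slots[my_tail % num_slots] = value;  // write the payload first
  __threadfence_system();              // make the payload visible before publishing it
  atomicAdd_system(tail, 1);           // then advance tail; if this update is not
                                       // visible to the consumer, it sees stale state (bug 1)
}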
