[EP] Debugging normal mode AMD port #512
base: main
Conversation
Force-pushed from fa90b5d to 0fc9265
Have been debugging normal mode. cc @zhenhuang12
Hi @MaoZiming, I have committed a fix for internode_dispatch, but the dispatch result is still incorrect. I am working on it.
Thanks! I ran it just now and it seems to get stuck somewhere. I will also take a look today or tomorrow to see if I can find anything.
Force-pushed from a4684dd to 49c0bab
Hi @MaoZiming, I fixed a porting error, but it occasionally throws `RDMA check failed`; sometimes the check passes. Repro command: `torchrun --nnodes=2 --nproc_per_node=8 --node_rank=[0/1] --master_addr=xxx --master_port=12355 bench/test_internode.py --num-tokens=64`. In other words, there might be a synchronization error somewhere in the dispatch kernel. Could you please help me confirm whether
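The suspected synchronization error can be illustrated with a minimal host-side C++ analogue (hypothetical, not the actual UCCL kernel code) of the publish/poll pattern a dispatch path relies on: the token payload must be written before the counter is published with release semantics, and the reader must load the counter with acquire semantics. Dropping either side of that pairing is the kind of bug that yields occasionally-incorrect dispatch results:

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch: producer publishes a payload via a counter.
static int payload = 0;
static std::atomic<int> published{0};

int run_once() {
  payload = 0;
  published.store(0, std::memory_order_relaxed);
  std::thread producer([] {
    payload = 42;                                   // write token data first
    published.store(1, std::memory_order_release);  // then publish it
  });
  // Consumer: spin until the flag is visible, then read the payload.
  while (published.load(std::memory_order_acquire) == 0) {
  }
  int seen = payload;  // acquire pairs with release: 42 is guaranteed visible
  producer.join();
  return seen;
}
```

Without the release/acquire pair (e.g. with relaxed stores and loads), the consumer could observe the flag but still read a stale payload on weakly ordered hardware.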
@zhenhuang12 Thank you! Will take a look
Hi @zhenhuang12, I just successfully ran the normal kernel of this branch (with superficial modifications) on two Nebius H100 + CX7 instances. So I think the current CPU-proxy RDMA operations are correct, and the issue is on the AMD GPU kernel side. My guess at the root cause is that the intranode token forwarding might read/write some stale values. I also find that the combine kernel sometimes triggers the assert at line 2665 of internode.cu (commit dabb7e0).
My modifications are superficial: basically restoring test_internode.py to the main-branch version and setting the number of proxy threads and FIFOs to 1; see zm-amd-port-debug...yang-amd-normal.
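For context, a timeout assert of this kind is typically a bounded spin-wait on a counter written by the peer. A hypothetical host-side C++ sketch (the actual device-side implementation in internode.cu differs) of the pattern:

```cpp
#include <atomic>
#include <chrono>

// Hypothetical sketch of a bounded spin-wait behind a timeout assert:
// poll a counter written by the peer and fail if it does not reach the
// expected value before the deadline.
bool wait_for(std::atomic<long>& counter, long expected,
              std::chrono::milliseconds deadline) {
  auto start = std::chrono::steady_clock::now();
  while (counter.load(std::memory_order_acquire) < expected) {
    if (std::chrono::steady_clock::now() - start > deadline)
      return false;  // this is the path that would trip the assert
  }
  return true;
}
```

If the peer's write is lost, stale, or never made visible, the counter never reaches `expected` and the wait times out, which matches the failure mode described above.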
This timeout might mean the GPU is not reading the atomic value written by the CPU. I think @YangZhou1997 has verified that the correct number of atomics is written by the CPUs, so I suspect this issue is due to memory consistency. If you see such timeouts, the problem is likely at https://github.com/uccl-project/uccl/blob/main/ep/src/internode.cu#L960-L1158, where an incorrect number of tokens is routed to the GPUs. Two similar bugs I encountered and solved in the past that trigger this timeout: (1) the GPUs do not see the up-to-date head and tail buffer counters; (2) the data payload is somehow overwritten by a different write. I think @YangZhou1997 saw the second issue.
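Bug class (2) above — a payload overwritten by a later write — usually comes from a producer reusing a ring-buffer slot before the consumer has drained it. A minimal host-side C++ sketch (hypothetical names, not the UCCL FIFO implementation) of the head/tail discipline that prevents it:

```cpp
#include <array>
#include <atomic>

// Hypothetical ring of N slots: the producer must check the consumer's head
// before reusing a slot, otherwise a later push overwrites an unread payload.
constexpr int N = 4;
static std::array<int, N> ring{};
static std::atomic<long> head{0};  // consumer progress
static std::atomic<long> tail{0};  // producer progress

bool try_push(int token) {
  long t = tail.load(std::memory_order_relaxed);
  if (t - head.load(std::memory_order_acquire) >= N)
    return false;  // ring full: pushing now would overwrite unread data
  ring[t % N] = token;                            // write payload first
  tail.store(t + 1, std::memory_order_release);   // then publish the tail
  return true;
}

int pop() {
  // Caller must ensure the ring is non-empty (tail > head).
  long h = head.load(std::memory_order_relaxed);
  int v = ring[h % N];
  head.store(h + 1, std::memory_order_release);
  return v;
}
```

If either side reads a stale head or tail counter (bug class (1)), the fullness check is wrong and the producer can clobber a slot the GPU has not yet read, which then surfaces as corrupted payloads or the wrong number of routed tokens.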



Description
Please include a summary of the changes and the related issue.
Fixes # (issue)
Type of Change
How Has This Been Tested?
Include any tests here.
Checklist
- Run format.sh.
- Run build_and_install.sh to verify compilation.