[EP] Debugging normal mode AMD port #512
base: main
Conversation
Force-pushed from fa90b5d to 0fc9265
Have been debugging normal mode. cc @zhenhuang12
Hi @MaoZiming, I have committed a fix for internode_dispatch, but the dispatch result is still incorrect. I am working on it.
Thanks! I ran it just now and it seems to get stuck somewhere. I will also take a look today or tomorrow to see if I can find anything.
Force-pushed from a4684dd to 49c0bab
Hi @MaoZiming, I fixed a porting error, but it occasionally throws `RDMA check failed`; sometimes the check passes. Repro command: `torchrun --nnodes=2 --nproc_per_node=8 --node_rank=[0/1] --master_addr=xxx --master_port=12355 bench/test_internode.py --num-tokens=64`. In other words, there might be a synchronization error somewhere in the dispatch kernel. Could you please help me confirm whether
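The suspected synchronization error can be illustrated with a minimal host-side C++ analogue (hypothetical, not the actual UCCL kernel code) of the publish/poll pattern a dispatch path relies on: the token payload must be written before the counter is published with release semantics, and the reader must load the counter with acquire semantics. Dropping either side of that pairing is the kind of bug that yields occasionally-incorrect dispatch results:

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch: producer publishes a payload via a counter.
static int payload = 0;
static std::atomic<int> published{0};

int run_once() {
  payload = 0;
  published.store(0, std::memory_order_relaxed);
  std::thread producer([] {
    payload = 42;                                   // write token data first
    published.store(1, std::memory_order_release);  // then publish it
  });
  // Consumer: spin until the flag is visible, then read the payload.
  while (published.load(std::memory_order_acquire) == 0) {
  }
  int seen = payload;  // acquire pairs with release: 42 is guaranteed visible
  producer.join();
  return seen;
}
```

Without the release/acquire pair (e.g. with relaxed stores and loads), the consumer could observe the flag but still read a stale payload on weakly ordered hardware.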
@zhenhuang12 Thank you! Will take a look
Hi @zhenhuang12, I just successfully ran the normal kernel of this branch (with superficial modifications) on two Nebius H100 + CX7 instances. So I think the current CPU-proxy RDMA operations are correct, and the issue is on the AMD GPU kernel side. My guess at the root cause is that the intranode token forwarding might read/write some stale values. I also find that the combine kernel sometimes triggers the assert at line 2665 of internode.cu (commit dabb7e0).
My modifications are superficial: basically restoring test_internode.py to the main-branch version and setting the number of proxy threads and FIFOs to 1; see zm-amd-port-debug...yang-amd-normal.
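For context, a timeout assert of this kind is typically a bounded spin-wait on a counter written by the peer. A hypothetical host-side C++ sketch (the actual device-side implementation in internode.cu differs) of the pattern:

```cpp
#include <atomic>
#include <chrono>

// Hypothetical sketch of a bounded spin-wait behind a timeout assert:
// poll a counter written by the peer and fail if it does not reach the
// expected value before the deadline.
bool wait_for(std::atomic<long>& counter, long expected,
              std::chrono::milliseconds deadline) {
  auto start = std::chrono::steady_clock::now();
  while (counter.load(std::memory_order_acquire) < expected) {
    if (std::chrono::steady_clock::now() - start > deadline)
      return false;  // this is the path that would trip the assert
  }
  return true;
}
```

If the peer's write is lost, stale, or never made visible, the counter never reaches `expected` and the wait times out, which matches the failure mode described above.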
This timeout might mean the GPU is not reading the atomic value written by the CPU. I think @YangZhou1997 has verified that the correct number of atomics is written by the CPUs, so I suspect this issue is due to memory consistency. If you see such timeouts, the problem is likely at https://github.com/uccl-project/uccl/blob/main/ep/src/internode.cu#L960-L1158, where an incorrect number of tokens is routed to the GPUs. Two similar bugs I encountered and solved in the past that trigger this timeout: (1) the GPUs do not see the up-to-date head and tail buffer counters; (2) the data payload is somehow overwritten by a different write. I think @YangZhou1997 saw the second issue.
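Bug class (2) above — a payload overwritten by a later write — usually comes from a producer reusing a ring-buffer slot before the consumer has drained it. A minimal host-side C++ sketch (hypothetical names, not the UCCL FIFO implementation) of the head/tail discipline that prevents it:

```cpp
#include <array>
#include <atomic>

// Hypothetical ring of N slots: the producer must check the consumer's head
// before reusing a slot, otherwise a later push overwrites an unread payload.
constexpr int N = 4;
static std::array<int, N> ring{};
static std::atomic<long> head{0};  // consumer progress
static std::atomic<long> tail{0};  // producer progress

bool try_push(int token) {
  long t = tail.load(std::memory_order_relaxed);
  if (t - head.load(std::memory_order_acquire) >= N)
    return false;  // ring full: pushing now would overwrite unread data
  ring[t % N] = token;                            // write payload first
  tail.store(t + 1, std::memory_order_release);   // then publish the tail
  return true;
}

int pop() {
  // Caller must ensure the ring is non-empty (tail > head).
  long h = head.load(std::memory_order_relaxed);
  int v = ring[h % N];
  head.store(h + 1, std::memory_order_release);
  return v;
}
```

If either side reads a stale head or tail counter (bug class (1)), the fullness check is wrong and the producer can clobber a slot the GPU has not yet read, which then surfaces as corrupted payloads or the wrong number of routed tokens.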



Description
Please include a summary of the changes and the related issue.
Fixes # (issue)
Type of Change
How Has This Been Tested?
Include any tests here.
Checklist
- Run format.sh.
- Run build_and_install.sh to verify compilation.