Skip to content

Fix RoCE GID cache after port flap during P2P recovery#2183

Open
daxayrawal wants to merge 2 commits into
NVIDIA:masterfrom
daxayrawal:RoCE_failback_bug
Open

Fix RoCE GID cache after port flap during P2P recovery#2183
daxayrawal wants to merge 2 commits into
NVIDIA:masterfrom
daxayrawal:RoCE_failback_bug

Conversation

@daxayrawal

Copy link
Copy Markdown

Problem:
After a RoCE port flap, the kernel may re-register GIDs at a different index, leaving NCCL's cached localGidIndex invalid for recovery QPs. Solution:
Add ncclIbRefreshGidInfo() and rebuild recovery QPs with a refreshed GID on Ethernet links during P2P resiliency recovery.

Description

Problem

On RoCE (Ethernet link layer), NCCL caches localGid / localGidIndex at connect time and reuses them when moving QPs through reset → RTR → RTS during P2P port recovery. After a port flap, the kernel may re-register GIDs at a different GID table index. The cached sgid_index can then point at an empty or stale slot. Recovery QPs and data QPs modified back to RTR with the old GID fail to come up correctly, so the alive-message handshake and subsequent failback do not complete.

Native IB is unaffected: GID layout is stable across port events, so no refresh is performed for IBV_LINK_LAYER_INFINIBAND.

Solution

ncclIbRefreshGidInfo() (connect.cc, declared in common.h): For Ethernet ports only, re-query ibv_query_port, re-run ncclIbGetGidIndex(), and update gidInfo and ibDev->portAttr. On query failure, log at NCCL_NET and keep the existing cache (no hard failure).

ncclIbPortRecoveryQpsRebuild() (p2p_resiliency_recovery.cc): Before the alive-message phase, refresh GID info for the failed device, reset/re-init the port-recovery QP, transition to RTS with the updated localGid / localGidIndex, and re-post receive WRs flushed by reset (batch on receiver, single ACK WR on sender).

Recovery progress gating: In ncclIbPortRecoveryStateAliveMessages, retry rebuild throttled by IbResiliencyPortRecoveryAliveMsgBatchInterval until rebuild succeeds; reset alive-message state after a successful rebuild.

Data-path QP restore: In ncclIbPortRecoveryQpsRestore(), apply refreshed devBase->gidInfo to each data QP’s rtrAttr (and flush QP local/remote GID) before ncclIbQpRtr().

Testing

RoCE: induce port down/up (or cable pull) on a backend NIC during an active NCCL job with P2P resiliency enabled; verify recovery completes and collective traffic resumes.

Confirm
NET/IB: ... refreshed GID .../ Rebuilt recovery QP ... log lines when GID index changes after flap.

IB (non-Ethernet) path: no behavioral change expected (ncclIbRefreshGidInfo
returns immediately).

make -j src.build
clean on target platform.

Related Issues

#2157

Changes & Impact

src/transport/net_ib/common.h : Declare ncclIbRefreshGidInfo(). Impact: Shared entry point for GID refresh from recovery code
src/transport/net_ib/connect.cc : Add ncclIbRefreshGidInfo(). Impact: Re-queries port + GID table on RoCE after flap; updates gidInfo and ibDev->portAttr; no-op on native IB
src/transport/net_ib/p2p_resiliency_recovery.cc : Add ncclIbPortRecoveryQpsRebuild(); extend ncclIbPortRecoveryContext with qpsRebuilt / lastRebuildAttemptNs; gate alive-message progress on successful rebuild; refresh GID on data + flush QPs in ncclIbPortRecoveryQpsRestore(), Impact:Port-recovery handshake uses valid sgid_index after flap; data-path QPs restored with current GID instead of connect-time cache

Performance Impact

Steady state (no port failure): None. ncclIbRefreshGidInfo() is only called from the port-recovery rebuild path, not from the normal send/recv hot path.

During port recovery: Small, bounded overhead on the failed device while the link is down or GIDs are settling:

Periodic ibv_query_port / GID table walk (same logic as initial connect) at most once per IbResiliencyPortRecoveryAliveMsgBatchInterval until rebuild succeeds.
One-time recovery QP reset → init → RTR → RTS and re-post of port-recovery receive WRs per successful rebuild.
Data QPs on restore already go through reset/RTR/RTS; this change only updates GID fields before ncclIbQpRtr() (no extra QPs).
Expected benefit: Restores correct failback behavior after flap; without the fix, recovery can stall indefinitely with stale GIDs—effectively zero useful bandwidth on the failed path until job restart.

No change to collective algorithms, channel count, or GDR paths.

Problem:
After a RoCE port flap, the kernel may re-register GIDs at a different
index, leaving NCCL's cached localGidIndex invalid for recovery QPs.
Solution:
Add ncclIbRefreshGidInfo() and rebuild recovery QPs with a refreshed GID
on Ethernet links during P2P resiliency recovery.

Signed-off-by: daxayrawal <daxayrawal@hotmail.com>
Signed-off-by: daxayrawal <11860103+daxayrawal@users.noreply.github.com>
@thomasgillis thomasgillis self-requested a review May 20, 2026 21:48
@xiaofanl-nvidia

Copy link
Copy Markdown
Collaborator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants