Fix RoCE GID cache after port flap during P2P recovery#2183
Open
daxayrawal wants to merge 2 commits into
Open
Conversation
Problem: After a RoCE port flap, the kernel may re-register GIDs at a different index, leaving NCCL's cached localGidIndex invalid for recovery QPs. Solution: Add ncclIbRefreshGidInfo() and rebuild recovery QPs with a refreshed GID on Ethernet links during P2P resiliency recovery. Signed-off-by: daxayrawal <daxayrawal@hotmail.com> Signed-off-by: daxayrawal <11860103+daxayrawal@users.noreply.github.com>
Collaborator
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem:
After a RoCE port flap, the kernel may re-register GIDs at a different index, leaving NCCL's cached localGidIndex invalid for recovery QPs. Solution:
Add ncclIbRefreshGidInfo() and rebuild recovery QPs with a refreshed GID on Ethernet links during P2P resiliency recovery.
Description
Problem
On RoCE (Ethernet link layer), NCCL caches localGid / localGidIndex at connect time and reuses them when moving QPs through reset → RTR → RTS during P2P port recovery. After a port flap, the kernel may re-register GIDs at a different GID table index. The cached sgid_index can then point at an empty or stale slot. Recovery QPs and data QPs modified back to RTR with the old GID fail to come up correctly, so the alive-message handshake and subsequent failback do not complete.
Native IB is unaffected: GID layout is stable across port events, so no refresh is performed for IBV_LINK_LAYER_INFINIBAND.
Solution
ncclIbRefreshGidInfo() (connect.cc, declared in common.h): For Ethernet ports only, re-query ibv_query_port, re-run ncclIbGetGidIndex(), and update gidInfo and ibDev->portAttr. On query failure, log at NCCL_NET and keep the existing cache (no hard failure).
ncclIbPortRecoveryQpsRebuild() (p2p_resiliency_recovery.cc): Before the alive-message phase, refresh GID info for the failed device, reset/re-init the port-recovery QP, transition to RTS with the updated localGid / localGidIndex, and re-post receive WRs flushed by reset (batch on receiver, single ACK WR on sender).
Recovery progress gating: In ncclIbPortRecoveryStateAliveMessages, retry rebuild throttled by IbResiliencyPortRecoveryAliveMsgBatchInterval until rebuild succeeds; reset alive-message state after a successful rebuild.
Data-path QP restore: In ncclIbPortRecoveryQpsRestore(), apply refreshed devBase->gidInfo to each data QP’s rtrAttr (and flush QP local/remote GID) before ncclIbQpRtr().
Testing
RoCE: induce port down/up (or cable pull) on a backend NIC during an active NCCL job with P2P resiliency enabled; verify recovery completes and collective traffic resumes.
Confirm
NET/IB: ... refreshed GID .../ Rebuilt recovery QP ... log lines when GID index changes after flap.
IB (non-Ethernet) path: no behavioral change expected (ncclIbRefreshGidInfo
returns immediately).
make -j src.build
clean on target platform.
Related Issues
#2157
Changes & Impact
src/transport/net_ib/common.h : Declare ncclIbRefreshGidInfo(). Impact: Shared entry point for GID refresh from recovery code
src/transport/net_ib/connect.cc : Add ncclIbRefreshGidInfo(). Impact: Re-queries port + GID table on RoCE after flap; updates gidInfo and ibDev->portAttr; no-op on native IB
src/transport/net_ib/p2p_resiliency_recovery.cc : Add ncclIbPortRecoveryQpsRebuild(); extend ncclIbPortRecoveryContext with qpsRebuilt / lastRebuildAttemptNs; gate alive-message progress on successful rebuild; refresh GID on data + flush QPs in ncclIbPortRecoveryQpsRestore(), Impact:Port-recovery handshake uses valid sgid_index after flap; data-path QPs restored with current GID instead of connect-time cache
Performance Impact
Steady state (no port failure): None. ncclIbRefreshGidInfo() is only called from the port-recovery rebuild path, not from the normal send/recv hot path.
During port recovery: Small, bounded overhead on the failed device while the link is down or GIDs are settling:
Periodic ibv_query_port / GID table walk (same logic as initial connect) at most once per IbResiliencyPortRecoveryAliveMsgBatchInterval until rebuild succeeds.
One-time recovery QP reset → init → RTR → RTS and re-post of port-recovery receive WRs per successful rebuild.
Data QPs on restore already go through reset/RTR/RTS; this change only updates GID fields before ncclIbQpRtr() (no extra QPs).
Expected benefit: Restores correct failback behavior after flap; without the fix, recovery can stall indefinitely with stale GIDs—effectively zero useful bandwidth on the failed path until job restart.
No change to collective algorithms, channel count, or GDR paths.