Skip to content

Refresh async checkpoint IPC cache on pointer change#314

Open
sbak5 wants to merge 1 commit into
NVIDIA:mainfrom
sbak5:sbak/ipc-cache-data-ptr
Open

Refresh async checkpoint IPC cache on pointer change#314
sbak5 wants to merge 1 commit into
NVIDIA:mainfrom
sbak5:sbak/ipc-cache-data-ptr

Conversation

@sbak5

@sbak5 sbak5 commented Apr 28, 2026

Copy link
Copy Markdown
Contributor
  • add a CUDA IPC tensor fingerprint that includes data/storage pointers, tensor layout, dtype, and chunk identity
  • refresh the worker cached tensor data when the logical checkpoint structure matches but source tensor pointers changed
  • add a regression test that reallocates CUDA tensors under the same cached checkpoint structure

@sbak5 sbak5 force-pushed the sbak/ipc-cache-data-ptr branch from ce79cf5 to d7b6b93 Compare April 28, 2026 23:24
@sbak5 sbak5 force-pushed the sbak/ipc-cache-data-ptr branch from d7b6b93 to 9cdb8e7 Compare April 28, 2026 23:33
@sbak5 sbak5 marked this pull request as ready for review April 28, 2026 23:34
@greptile-apps

greptile-apps Bot commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a stale-cache bug in the async CUDA IPC checkpoint path where GPU tensors are reallocated between checkpoints under an unchanged logical checkpoint structure. It introduces a per-key tensor fingerprint (_cached_tensor_data_ptrs) that captures data/storage pointers, layout, and dtype; when the fingerprint differs from the cached one, the worker receives fresh IPC handles and updates its own cache, rather than blindly reusing stale ones.

Confidence Score: 4/5

Safe to merge; the fix is logically correct and well-tested with only minor style suggestions remaining.

No P0 or P1 issues found. The fingerprint comparison, cache update, worker-side refresh, failure-path cleanup, and test teardowns are all correct. Two P2 suggestions keep the score at 4.

No files require special attention; both changed files are clean.

Important Files Changed

Filename Overview
src/nvidia_resiliency_ext/checkpointing/async_ckpt/filesystem_async.py Adds _compute_tensor_data_ptrs fingerprint and _cached_tensor_data_ptrs class-level cache; refreshes worker GPU-IPC data when pointers change while the checkpoint structure key is unchanged. Cache is cleared on worker failure and on clear_cache. Logic is sound; minor defensive assertion missing in zip over parallel lists.
tests/checkpointing/unit/test_async_writer.py Adds regression test that reallocates CUDA tensors under a fixed checkpoint structure, verifies pointer uniqueness, then confirms the final loaded checkpoint reflects the last-written values. Also adds _cached_tensor_data_ptrs.clear() to existing test teardowns to prevent cross-test contamination.

Reviews (1): Last reviewed commit: "Refresh async checkpoint IPC cache on po..." | Re-trigger Greptile

Comment on lines +119 to +121
def _compute_tensor_data_ptrs(items: List[WriteItem], tensors: List[Any]) -> Tuple[Tuple, ...]:
"""Compute a storage identity fingerprint for tensors cached via CUDA IPC."""
ptrs = []

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Silent truncation in zip over items/tensors

zip(items, tensors) silently stops at the shorter of the two sequences if their lengths ever differ. Because gpu_items and gpu_data are always produced together by separate_cacheable, they should always be the same length in practice — but an assertion would make this contract explicit and catch future regressions early.

Suggested change
def _compute_tensor_data_ptrs(items: List[WriteItem], tensors: List[Any]) -> Tuple[Tuple, ...]:
"""Compute a storage identity fingerprint for tensors cached via CUDA IPC."""
ptrs = []
assert len(items) == len(tensors), (
f"items and tensors must be the same length, got {len(items)} vs {len(tensors)}"
)
ptrs = []
for item, tensor in zip(items, tensors):


state_dicts = []
data_ptrs = []
last_ckpt_dir = None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Test correctness relies on implicit memory-reuse prevention

state_dicts.append(state_dict) keeps all three state dicts (and their tensors) alive throughout the loop, which prevents the CUDA allocator from recycling a freed address and producing a coincidental pointer match. This invariant is load-bearing for the len(set(data_ptrs)) == 3 assertion: if a state dict were dropped before the end of the loop, the allocator could reuse the address and collapse two entries to the same pointer, making the assertion meaningless. A short comment here would document the intent and guard against accidental simplification later.

@hexinw-nvidia hexinw-nvidia added the ci-approved Approved to run CI label May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-approved Approved to run CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants