Refresh async checkpoint IPC cache on pointer change by sbak5 · Pull Request #314 · NVIDIA/nvidia-resiliency-ext

sbak5 · 2026-04-28T23:07:05Z

add a CUDA IPC tensor fingerprint that includes data/storage pointers, tensor layout, dtype, and chunk identity
refresh the worker cached tensor data when the logical checkpoint structure matches but source tensor pointers changed
add a regression test that reallocates CUDA tensors under the same cached checkpoint structure

greptile-apps · 2026-04-28T23:41:48Z

Greptile Summary

This PR fixes a stale-cache bug in the async CUDA IPC checkpoint path where GPU tensors are reallocated between checkpoints under an unchanged logical checkpoint structure. It introduces a per-key tensor fingerprint (_cached_tensor_data_ptrs) that captures data/storage pointers, layout, and dtype; when the fingerprint differs from the cached one, the worker receives fresh IPC handles and updates its own cache, rather than blindly reusing stale ones.

Confidence Score: 4/5

Safe to merge; the fix is logically correct and well-tested with only minor style suggestions remaining.

No P0 or P1 issues found. The fingerprint comparison, cache update, worker-side refresh, failure-path cleanup, and test teardowns are all correct. Two P2 suggestions keep the score at 4.

No files require special attention; both changed files are clean.

Important Files Changed

Filename	Overview
src/nvidia_resiliency_ext/checkpointing/async_ckpt/filesystem_async.py	Adds `_compute_tensor_data_ptrs` fingerprint and `_cached_tensor_data_ptrs` class-level cache; refreshes worker GPU-IPC data when pointers change while the checkpoint structure key is unchanged. Cache is cleared on worker failure and on `clear_cache`. Logic is sound; minor defensive assertion missing in `zip` over parallel lists.
tests/checkpointing/unit/test_async_writer.py	Adds regression test that reallocates CUDA tensors under a fixed checkpoint structure, verifies pointer uniqueness, then confirms the final loaded checkpoint reflects the last-written values. Also adds `_cached_tensor_data_ptrs.clear()` to existing test teardowns to prevent cross-test contamination.

_{Reviews (1): Last reviewed commit: "Refresh async checkpoint IPC cache on po..." | Re-trigger Greptile}

greptile-apps · 2026-04-28T23:41:51Z

+def _compute_tensor_data_ptrs(items: List[WriteItem], tensors: List[Any]) -> Tuple[Tuple, ...]:
+    """Compute a storage identity fingerprint for tensors cached via CUDA IPC."""
+    ptrs = []


Silent truncation in zip over items/tensors

zip(items, tensors) silently stops at the shorter of the two sequences if their lengths ever differ. Because gpu_items and gpu_data are always produced together by separate_cacheable, they should always be the same length in practice — but an assertion would make this contract explicit and catch future regressions early.

Suggested change

def _compute_tensor_data_ptrs(items: List[WriteItem], tensors: List[Any]) -> Tuple[Tuple, ...]:

"""Compute a storage identity fingerprint for tensors cached via CUDA IPC."""

ptrs = []

assert len(items) == len(tensors), (

f"items and tensors must be the same length, got {len(items)} vs {len(tensors)}"

)

ptrs = []

for item, tensor in zip(items, tensors):

greptile-apps · 2026-04-28T23:41:52Z

+
+        state_dicts = []
+        data_ptrs = []
+        last_ckpt_dir = None


Test correctness relies on implicit memory-reuse prevention

state_dicts.append(state_dict) keeps all three state dicts (and their tensors) alive throughout the loop, which prevents the CUDA allocator from recycling a freed address and producing a coincidental pointer match. This invariant is load-bearing for the len(set(data_ptrs)) == 3 assertion: if a state dict were dropped before the end of the loop, the allocator could reuse the address and collapse two entries to the same pointer, making the assertion meaningless. A short comment here would document the intent and guard against accidental simplification later.

sbak5 force-pushed the sbak/ipc-cache-data-ptr branch from ce79cf5 to d7b6b93 Compare April 28, 2026 23:24

Refresh async checkpoint IPC cache on pointer change

9cdb8e7

sbak5 force-pushed the sbak/ipc-cache-data-ptr branch from d7b6b93 to 9cdb8e7 Compare April 28, 2026 23:33

sbak5 marked this pull request as ready for review April 28, 2026 23:34

sbak5 requested review from ankurv-nvidia and hexinw-nvidia April 28, 2026 23:34

greptile-apps Bot reviewed Apr 28, 2026

View reviewed changes

hexinw-nvidia added the ci-approved Approved to run CI label May 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refresh async checkpoint IPC cache on pointer change#314

Refresh async checkpoint IPC cache on pointer change#314
sbak5 wants to merge 1 commit into
NVIDIA:mainfrom
sbak5:sbak/ipc-cache-data-ptr

sbak5 commented Apr 28, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Apr 28, 2026

Uh oh!

greptile-apps Bot Apr 28, 2026

Uh oh!

greptile-apps Bot Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sbak5 commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

greptile-apps Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sbak5 commented Apr 28, 2026 •

edited

Loading