Refresh async checkpoint IPC cache on pointer change#314
Greptile Summary

This PR fixes a stale-cache bug in the async CUDA IPC checkpoint path where GPU tensors are reallocated between checkpoints under an unchanged logical checkpoint structure. It introduces a per-key tensor fingerprint (

Confidence Score: 4/5

Safe to merge; the fix is logically correct and well-tested with only minor style suggestions remaining.

No P0 or P1 issues found. The fingerprint comparison, cache update, worker-side refresh, failure-path cleanup, and test teardowns are all correct. Two P2 suggestions keep the score at 4.

No files require special attention; both changed files are clean.
Reviews (1): Last reviewed commit: "Refresh async checkpoint IPC cache on po..."
    def _compute_tensor_data_ptrs(items: List[WriteItem], tensors: List[Any]) -> Tuple[Tuple, ...]:
        """Compute a storage identity fingerprint for tensors cached via CUDA IPC."""
        ptrs = []
Silent truncation in zip over items/tensors
zip(items, tensors) silently stops at the shorter of the two sequences if their lengths ever differ. Because gpu_items and gpu_data are always produced together by separate_cacheable, they should always be the same length in practice — but an assertion would make this contract explicit and catch future regressions early.
    def _compute_tensor_data_ptrs(items: List[WriteItem], tensors: List[Any]) -> Tuple[Tuple, ...]:
        """Compute a storage identity fingerprint for tensors cached via CUDA IPC."""
        assert len(items) == len(tensors), (
            f"items and tensors must be the same length, got {len(items)} vs {len(tensors)}"
        )
        ptrs = []
        for item, tensor in zip(items, tensors):
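The truncation behaviour this comment describes can be reproduced without any checkpoint code. The sketch below is only an illustration: `pair_checked` is a hypothetical helper, not a function from this PR, and the string and integer values stand in for the real `WriteItem` and tensor objects.

```python
from typing import Any, List, Sequence, Tuple


def pair_checked(items: Sequence[Any], tensors: Sequence[Any]) -> List[Tuple[Any, Any]]:
    """Pair items with tensors, failing loudly on a length mismatch.

    Hypothetical helper for illustration; it mirrors the suggested assertion.
    """
    assert len(items) == len(tensors), (
        f"items and tensors must be the same length, got {len(items)} vs {len(tensors)}"
    )
    return list(zip(items, tensors))


# Plain zip drops the unmatched trailing element without raising anything.
silently_truncated = list(zip(["a", "b", "c"], [1, 2]))

# The guarded helper surfaces the same mismatch as an AssertionError.
try:
    pair_checked(["a", "b", "c"], [1, 2])
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
```

Note that `assert` statements are stripped when Python runs with `-O`. On Python 3.10 or newer, `zip(items, tensors, strict=True)` may be a reasonable alternative, since it raises `ValueError` on a length mismatch regardless of optimization flags.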
    state_dicts = []
    data_ptrs = []
    last_ckpt_dir = None
Test correctness relies on implicit memory-reuse prevention
state_dicts.append(state_dict) keeps all three state dicts (and their tensors) alive throughout the loop, which prevents the CUDA allocator from recycling a freed address and producing a coincidental pointer match. This invariant is load-bearing for the len(set(data_ptrs)) == 3 assertion: if a state dict were dropped before the end of the loop, the allocator could reuse the address and collapse two entries to the same pointer, making the assertion meaningless. A short comment here would document the intent and guard against accidental simplification later.
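A CPU-only analogy may help document why the keep-alive list matters. The sketch below uses CPython's `id()` on `bytearray` buffers as a stand-in for `data_ptr()` on CUDA tensors; it does not exercise the CUDA caching allocator, and `collect_ids` is a hypothetical name introduced here, not part of the test under review.

```python
from typing import List


def collect_ids(keep_alive: bool, rounds: int = 3) -> List[int]:
    """Allocate one buffer per round and record its identity.

    Hypothetical illustration: id() plays the role of tensor.data_ptr().
    """
    held = []  # plays the role of `state_dicts` in the test
    ids = []
    for _ in range(rounds):
        buf = bytearray(1024)
        ids.append(id(buf))
        if keep_alive:
            # Holding a reference keeps every earlier buffer alive, so no
            # later allocation can be handed the same identity.
            held.append(buf)
        # Without the append, rebinding `buf` next round releases the old
        # buffer, and the allocator is free to reuse its address.
    return ids


ids_when_held = collect_ids(keep_alive=True)
ids_when_dropped = collect_ids(keep_alive=False)
```

With `keep_alive=True` the three identities are guaranteed to be distinct, because all three buffers are alive at once. With `keep_alive=False` they may or may not collide, which is exactly the nondeterminism that would make a `len(set(...)) == 3` check unreliable.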