Do you mind sharing why consuming 2x memory is an issue for you? Adding context is likely to help others as well.
In general for GPU workloads, CPU RAM is more than equipped to deal with this. I'm asking because this extra copy can usually be made "free" by overlapping CPU and GPU work.
There's a historical reason for doing this copy.
PyBuffer wasn't stabilized (in the ABI) until Python 3.11, which came after this lib was created. (And I'm pretty sure pyo3 support for it didn't exist or had issues at the time.)
I made a version that does zero-copy on Python > 3.11: #567
As you will see, this method contains unsafe code, precisely because there is no guarantee that the data won't change under us (including being freed). There might be ways to get at the underlying buffer without unsafe code. `bytes`, on the other hand, is immutable and therefore safe. If any reader has good ideas on how to remove this unsafe block, I'm taking them.
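A small pure-Python illustration of that hazard (this is not the Rust binding, just the same property demonstrated with numpy): a zero-copy view can change under the consumer, while `bytes` is a stable snapshot.

```python
import numpy as np

a = np.zeros(4, dtype=np.uint8)

view = memoryview(a)   # zero-copy view of the array's buffer
a[0] = 255             # the owner mutates the data...
assert view[0] == 255  # ...and the view silently changes under the consumer

b = bytes(a)           # one copy, but the result is immutable
a[0] = 7
assert b[0] == 255     # the snapshot stays valid no matter what happens later
```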
For Python < 3.11, though, I'm not sure we want to enable that. Those are legacy versions, and maintaining non-abi3-compliant wheels is quite annoying.
System Info
when operating on huge tensors, profiling shows a ton of time spent in `__memmove_avx512_unaligned_erms`.

this is because the interface for saving safetensors requires a `PyBytes`, here: safetensors/bindings/python/src/lib.rs, line 31 (at ea1a2d0).

which means that even though a `torch.Tensor` is zero-copied into an `np.array` here: safetensors/bindings/python/py_src/safetensors/torch.py, lines 435 to 439 (at ea1a2d0), it is afterwards copied into a Python `bytes` object here: safetensors/bindings/python/py_src/safetensors/torch.py, line 460 (at ea1a2d0).
changing this to hand the tensor's existing buffer to the writer, instead of a fresh `bytes` copy, makes the `memmove` hotspot and the doubled memory usage disappear.
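For illustration only, a sketch of the kind of change meant, assuming the copy lives in a helper along the lines of `_tobytes` in torch.py; the view-returning variant is hypothetical, not the actual patch:

```python
import numpy as np
import torch

def _tobytes_copy(t: torch.Tensor) -> bytes:
    # current behavior: .tobytes() materializes a second copy of the data
    return t.numpy().tobytes()

def _tobytes_view(t: torch.Tensor) -> memoryview:
    # hypothetical zero-copy variant: hand back a read-only view of the
    # tensor's existing buffer; no copy is made on the Python side
    arr = np.ascontiguousarray(t.numpy())
    return arr.data.cast("B").toreadonly()
```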
can a zero-copy path for saving tensors be added?
Reproduction
1. `save_file` with some tensors
2a. observe `perf top` or other profiler
2b. observe RSS memory usage
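A minimal script along those lines (file name and tensor size are arbitrary; this one needs roughly 8 GB of free RAM to show the 2x peak):

```python
import torch
from safetensors.torch import save_file

# one large tensor (2**30 float32 elements = 4 GiB); while this saves,
# RSS peaks near 2x the tensor size and a profiler shows time spent in
# __memmove_avx512_unaligned_erms (or plain memcpy/memmove)
tensors = {"weight": torch.zeros((1024, 1024, 1024), dtype=torch.float32)}
save_file(tensors, "big.safetensors")
```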
Expected behavior
- memory usage should equal the size of the tensor data, not 2x
- the CPU should not spend any time copying memory (memcpy, memmove)