Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Saving tensors should be zero-copy #564

Open
1 of 2 tasks
tmm1 opened this issue Jan 29, 2025 · 1 comment
Open
1 of 2 tasks

Saving tensors should be zero-copy #564

tmm1 opened this issue Jan 29, 2025 · 1 comment

Comments

@tmm1
Copy link

tmm1 commented Jan 29, 2025

System Info

when operating on huge tensors, profiling shows a ton of time spent in __memmove_avx512_unaligned_erms

this is because the interface for saving safetensors requires a PyBytes here:

data: PyBound<'a, PyBytes>,

which means even though torch.Tensor is zero-copied into np.array:

ptr = tensor.data_ptr()
if ptr == 0:
return b""
newptr = ctypes.cast(ptr, ctypes.POINTER(ctypes.c_ubyte))
data = np.ctypeslib.as_array(newptr, (total_bytes,)) # no internal copy

after, it is copied into a python string here:

changing this to:

 def _tobytes(tensor: torch.Tensor, name: str) -> bytes:
     ...
-    return data.tobytes()
+    return data.data

results in:

TypeError: 'memoryview' object cannot be converted to 'PyBytes'

can a zero-copy path for saving tensors be added?

Information

  • The official example scripts
  • My own modified scripts

Reproduction

  1. call save_file with some tensors
    2a. observe perf top or other profiler
    2b. observe RSS memory usage

Expected behavior

  • memory usage should equal size of tensor data, not 2x

  • cpu should not spend any time making copy of memory (memcpy, memmove)

@Narsil
Copy link
Collaborator

Narsil commented Feb 4, 2025

Do you mind sharing why consuming 2x memory is an issue for you ? Adding context is likely to help others as well.
In general for GPU, the CPU RAM is more than well equipped with dealing with this. I'm asking because usually this extra copy can be made "free" because CPU and GPU work overlap.

There's a historical reason for doing this copy.

PyBuffer wasn't stabilized (in ABI) until 3.11 which came after this lib was created. (And also I'm pretty sure pyo3 support for it didn't exist or had issues)
I made a version that would work with zero-copy on python>3.11 #567

As you will see this method contains unsafe code, exactly because there is no guarantee that the data won't change under us (including being freed). There might be ways to get the underlying buffer without unsafe code. bytes on the other hand is immutable and therefore safe. If any reader has good ideas on how to remove this unsafe block, I'm taking it.

On the other hand for Python < 3.11, I'm not sure we want to enable that. Those are legacy versions, and maintaining non abi3 compliant wheels is quite annoying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants