Saving tensors should be zero-copy #564

tmm1 · 2025-01-29T05:48:15Z

System Info

when operating on huge tensors, profiling shows a ton of time spent in __memmove_avx512_unaligned_erms

this is because the interface for saving safetensors requires a PyBytes here:

safetensors/bindings/python/src/lib.rs

Line 31 in ea1a2d0

data: PyBound<'a, PyBytes>,

which means even though torch.Tensor is zero-copied into np.array:

safetensors/bindings/python/py_src/safetensors/torch.py

Lines 435 to 439 in ea1a2d0

    
    ptr = tensor.data_ptr()
    if ptr == 0:
        return b""
    newptr = ctypes.cast(ptr, ctypes.POINTER(ctypes.c_ubyte))
    data = np.ctypeslib.as_array(newptr, (total_bytes,))  # no internal copy

afterwards, it is copied into a Python bytes object here:

safetensors/bindings/python/py_src/safetensors/torch.py

Line 460 in ea1a2d0

return data.tobytes()

changing this to:

 def _tobytes(tensor: torch.Tensor, name: str) -> bytes:
     ...
-    return data.tobytes()
+    return data.data

results in:

TypeError: 'memoryview' object cannot be converted to 'PyBytes'

can a zero-copy path for saving tensors be added?
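The difference between the two return values above can be sketched with the standard library alone. This is only an illustration, not the safetensors code: the bytearray named storage is a hypothetical stand-in for the tensor's memory, and memoryview.tobytes() plays the role of ndarray.tobytes().

```python
# Hypothetical stand-in for the tensor's underlying memory.
storage = bytearray(16)

view = memoryview(storage)   # shares memory with storage, no copy
copied = view.tobytes()      # allocates a new bytes object and copies into it

# Mutate the original memory after both objects were created.
storage[0] = 255

view_sees_change = view[0] == 255      # the view reflects the write
copy_sees_change = copied[0] == 255    # the copy still holds the old byte
view_is_bytes = isinstance(view, bytes)
```

The last line mirrors the TypeError above: a memoryview is not a bytes instance, so an interface typed as PyBytes rejects it even though both expose the same data.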

Information

The official example scripts
My own modified scripts

Reproduction

1. call save_file with some tensors
2a. observe perf top or another profiler
2b. observe RSS memory usage

Expected behavior

memory usage should equal size of tensor data, not 2x
cpu should not spend any time making copy of memory (memcpy, memmove)
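The 2x expectation can be checked roughly without torch. The sketch below is an assumption-laden stand-in for the real reproduction: it uses tracemalloc and a plain bytearray (named storage, hypothetical) instead of save_file and a tensor, and the helper peak_allocation is made up for this example. It only measures Python-level allocations made while building the payload.

```python
import tracemalloc

def peak_allocation(make_payload, storage):
    """Return (peak traced bytes, payload length) while building a payload."""
    tracemalloc.start()
    payload = make_payload(storage)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak, len(payload)

SIZE = 8 * 1024 * 1024
storage = bytearray(SIZE)  # hypothetical stand-in for tensor data

# bytes(storage) copies every byte; memoryview(storage) only wraps the buffer.
copy_peak, copy_len = peak_allocation(bytes, storage)
view_peak, view_len = peak_allocation(memoryview, storage)
```

With the copying path the peak is at least the size of the data again, while the view path allocates only a small wrapper object.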


Narsil · 2025-02-04T11:59:21Z

Do you mind sharing why consuming 2x memory is an issue for you? Adding context is likely to help others as well.
In general, for GPU workloads the CPU RAM is more than sufficient to deal with this. I'm asking because this extra copy can usually be made "free", since CPU and GPU work overlap.

There's a historical reason for doing this copy.

PyBuffer wasn't stabilized (in the ABI) until 3.11, which came after this lib was created. (I'm also pretty sure pyo3 support for it didn't exist or had issues.)
I made a version that works with zero-copy on Python >= 3.11: #567

As you will see, this method contains unsafe code, precisely because there is no guarantee that the data won't change under us (including being freed). There might be ways to get the underlying buffer without unsafe code. bytes, on the other hand, is immutable and therefore safe. If any reader has good ideas on how to remove this unsafe block, I'll take them.
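The hazard is visible from pure Python too. The following is a minimal sketch, not the code from #567: it uses a bytearray (named buf, hypothetical) as the buffer owner and shows that contents can change while a view is held, even though CPython refuses to resize the owner while the view is exported.

```python
buf = bytearray(b"abcd")
view = memoryview(buf)

# Contents can still be rewritten while the view exists.
buf[0] = ord("z")
view_changed = view[0] == ord("z")

# Resizing (which could move or free the memory) is blocked while exported.
try:
    buf.extend(b"more")
    resize_blocked = False
except BufferError:
    resize_blocked = True

# Once the view is released, the owner may be resized again.
view.release()
buf.extend(b"more")
final_length = len(buf)
```

So the buffer protocol protects against the memory being resized or freed while a view is exported, but not against its contents changing, which is the guarantee bytes provides by being immutable.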

On the other hand, for Python < 3.11, I'm not sure we want to enable that. Those are legacy versions, and maintaining non-abi3-compliant wheels is quite annoying.
