
Add POSIX shared memory transport for multi-worker dataloaders #1456

Open

dfalbel wants to merge 21 commits into main from victorious-carbon

Conversation

@dfalbel
Member

@dfalbel dfalbel commented Apr 23, 2026

Summary

  • Adds a new IPC transport option for multi-worker dataloaders using POSIX shared memory (shm_open/mmap), enabled via options(torch.dataloader_use_shm = TRUE).
  • The SHM path performs 1 memcpy on the producer (tensor → SHM) and 0 copies on the consumer (from_blob aliases the mapped memory), compared to serialize + pipe transfer + deserialize in the default path.
  • Adds r_dataptr_ro() in the C++ layer to avoid triggering ALTREP copy-on-write when reading buffers via torch::from_blob.
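The two halves of that transport can be sketched in plain POSIX C. This is an illustrative sketch, not the package's actual code: `shm_write` and `shm_map_ro` are hypothetical helper names standing in for the producer and consumer sides.

```c
#include <assert.h>
#include <fcntl.h>     /* shm_open, O_* flags */
#include <string.h>    /* memcpy */
#include <sys/mman.h>  /* mmap, munmap, shm_unlink */
#include <unistd.h>    /* ftruncate, close */

/* Producer side: copy n bytes of tensor data into a fresh SHM
 * segment named `name` -- the single memcpy on this path. */
int shm_write(const char *name, const void *data, size_t n) {
  int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
  if (fd < 0) return -1;
  if (ftruncate(fd, (off_t)n) < 0) { close(fd); shm_unlink(name); return -1; }
  void *p = mmap(NULL, n, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);
  if (p == MAP_FAILED) { shm_unlink(name); return -1; }
  memcpy(p, data, n);            /* tensor -> SHM */
  munmap(p, n);
  return 0;
}

/* Consumer side: map the segment read-only. A from_blob-style
 * constructor can alias this pointer with zero copies. */
void *shm_map_ro(const char *name, size_t n) {
  int fd = shm_open(name, O_RDONLY, 0);
  if (fd < 0) return NULL;
  void *p = mmap(NULL, n, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);
  return p == MAP_FAILED ? NULL : p;
}
```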

Benchmark results (excluding worker startup)

| Config           | Default | SHM    | Speedup | MB/batch |
|------------------|---------|--------|---------|----------|
| 500x1K, bs=32    | 0.234s  | 0.175s | 1.34x   | 0.1      |
| 200x50K, bs=64   | 0.113s  | 0.063s | 1.79x   | 12.2     |
| 200x100K, bs=64  | 0.129s  | 0.090s | 1.43x   | 24.4     |
| 100x500K, bs=64  | 0.050s  | 0.022s | 2.27x   | 122.1    |
| 100x1M, bs=32    | 0.736s  | 0.425s | 1.73x   | 122.1    |

Test plan

  • Verified correctness: values match between single-process and SHM multi-worker dataloaders
  • Tested full iteration (all batches including exhaustion signal)
  • Tested with mixed dtypes (Float, Long) and nested list batches
  • Default (non-SHM) dataloader behavior is unchanged
  • CI benchmark runs and posts results to step summary

dfalbel added 3 commits April 23, 2026 14:26
When `options(torch.dataloader_use_mori = TRUE)`, the dataloader uses
POSIX shared memory (shm_open/mmap) instead of callr's pipe or socket
connections for transferring tensor data between worker processes and
the main process.

The SHM path does a single memcpy (tensor → SHM) on the producer
and zero copies on the consumer (from_blob aliases the mapped memory),
compared to serialize + pipe transfer + deserialize in the default path.

Changes:
- src/tensor.cpp: Add cpp_tensor_to_shm() and cpp_map_shm() for direct
  POSIX SHM operations. Add r_dataptr_ro() to avoid ALTREP COW when
  reading buffers via from_blob.
- R/utils-data-dataloader.R: Add SHM transport as a third option
  alongside callr pipe and socket connections, controlled by the
  torch.dataloader_use_mori option.
- bench/bench-shm.R: Benchmark comparing default vs SHM transport.
- .github/workflows/main.yaml: Add SHM benchmark to CI.
The implementation uses POSIX shared memory directly, not mori.
Rename option to torch.dataloader_use_shm and internal variables
accordingly.
@dfalbel dfalbel force-pushed the victorious-carbon branch from db1944c to 996e198 on April 23, 2026 at 17:44
dfalbel added 18 commits April 23, 2026 14:46
The Rcpp-generated exports need function definitions on all platforms.
Move the _WIN32 guard inside each function body so the symbols are
always defined, but error at runtime on Windows. Skip SHM tests on
Windows.
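The guard pattern described above can be sketched as follows. The function name is illustrative (a stand-in for the actual Rcpp exports): the symbol exists on every platform, but the Windows build only ever raises a runtime error.

```c
#include <stdio.h>

/* Keep the symbol defined on all platforms so the generated exports
 * always link; fail at runtime on Windows instead of at link time.
 * Name is illustrative, not the package's actual export. */
int cpp_tensor_to_shm_stub(const char *name) {
#ifdef _WIN32
  (void)name;
  fprintf(stderr, "SHM transport is not supported on Windows\n");
  return -1;          /* runtime error instead of a missing symbol */
#else
  (void)name;
  return 0;           /* real shm_open/mmap work would go here */
#endif
}
```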
SHM is now the default for multi-worker dataloaders on non-Windows
systems. Users can revert with options(torch.dataloader_use_shm = FALSE).

The test file runs the full dataloader test suite twice: once with SHM
disabled (legacy path) and once with SHM enabled.
- Map SHM with PROT_READ|PROT_WRITE + MAP_SHARED (MAP_PRIVATE is
  unsupported on macOS with POSIX SHM). Since we shm_unlink immediately
  after mapping, we are the sole owner and writes are safe.
- SHM is now the default on non-Windows systems.
- Add regression test for in-place tensor ops on SHM-backed batches.
- Test both SHM-enabled and SHM-disabled paths.
- Add NEWS bullet documenting the change and opt-out.
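The MAP_SHARED-plus-immediate-unlink trick can be sketched in C. This is a minimal illustration under the commit's stated assumptions (sole ownership after unlink); `make_segment` and `map_and_orphan` are hypothetical helper names.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Demo helper: create a segment of n bytes. */
int make_segment(const char *name, size_t n) {
  int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
  if (fd < 0) return -1;
  int rc = ftruncate(fd, (off_t)n);
  close(fd);
  return rc;
}

/* Map MAP_SHARED, then unlink immediately. The mapping stays valid
 * until munmap; the name vanishes from /dev/shm, so no other process
 * can open it and writes are safe for the sole owner. (MAP_PRIVATE
 * with POSIX SHM is rejected on macOS.) */
void *map_and_orphan(const char *name, size_t n) {
  int fd = shm_open(name, O_RDWR, 0);
  if (fd < 0) return NULL;
  void *p = mmap(NULL, n, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);
  shm_unlink(name);   /* orphan: kernel frees pages on last munmap */
  return p == MAP_FAILED ? NULL : p;
}
```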
When iteration stops early (break, error, or dropping the iterator),
prefetched tasks may have SHM segments that were never mapped by the
consumer. The finalizer now drains outstanding tasks and calls
shm_unlink on their segments to prevent /dev/shm leaks.

Also fixes MAP_PRIVATE unsupported on macOS — uses MAP_SHARED with
immediate shm_unlink to orphan the region, making writes safe since
we are the sole owner.
Adds cpp_shm_exists() helper for tests. The test creates SHM segments
and verifies shm_unlink_recursive cleans them up — fails without the
finalize drain logic, passes with it.
mmap with size 0 is invalid on POSIX. Skip SHM creation for empty
tensors (numel == 0) and reconstruct them directly on the consumer
side. Adds regression test that exercises a dataset returning
torch_tensor(numeric(0)) through the SHM dataloader.
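The zero-length guard can be illustrated directly: POSIX specifies that `mmap` with length 0 fails with EINVAL, so the byte count decides whether a segment is created at all. Function names here are illustrative, not the package's.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* POSIX: mmap with len == 0 fails with EINVAL. */
int mmap_zero_fails(void) {
  void *p = mmap(NULL, 0, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p != MAP_FAILED) { munmap(p, 0); return 0; }
  return 1;
}

/* 0 bytes -> skip SHM entirely; the consumer rebuilds the empty
 * tensor from its (shape, dtype) descriptor alone. */
size_t shm_bytes_needed(size_t numel, size_t elem_size) {
  return numel * elem_size;
}
```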
When a custom collate_fn returns non-tensor, non-list objects (e.g.
character vectors, scalars), the SHM path passes them through
unchanged. from_exportable_tensor now only attempts rawConnection
deserialization on raw vectors/connections, and passes everything
else through as-is.
Use attributes(out) <- attributes(x) instead of a marker class to
preserve names, custom classes, and other attributes through the SHM
tensor replacement. from_exportable_tensor now always tries
tensors_from_shared (no-op for non-shared objects) instead of checking
for torch_shared_batch.

Move SHM-specific regression tests (custom collate with non-tensor
output, list class preservation) to the SHM test file since the old
serialization path doesn't support these scenarios.
Replace cpp_map_shm + attribute-based lifetime with cpp_tensor_from_shm
which does the full consumer path in C++: open → mmap → from_blob →
clone → munmap. The clone gives the tensor its own storage, so derived
tensors ($view, $transpose, etc.) share the cloned storage rather than
aliasing a fragile SHM mapping. No external pointer or attribute needed.

Total cost is still 2 memcpys (1 producer + 1 consumer clone), which
is faster than the default serialize + pipe + deserialize path.

Adds regression test: derive a tensor via $view(), GC the original
batch, verify the derived tensor is still valid.
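The consumer path above (open → mmap → clone → munmap) can be sketched in C, with `malloc` + `memcpy` standing in for the tensor clone. `put_shm` and `clone_from_shm` are illustrative names, not the actual `cpp_tensor_from_shm` implementation.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Producer helper for the demo: write n bytes into segment `name`. */
int put_shm(const char *name, const void *data, size_t n) {
  int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
  if (fd < 0) return -1;
  if (ftruncate(fd, (off_t)n) < 0) { close(fd); shm_unlink(name); return -1; }
  void *p = mmap(NULL, n, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);
  if (p == MAP_FAILED) { shm_unlink(name); return -1; }
  memcpy(p, data, n);
  munmap(p, n);
  return 0;
}

/* Consumer sketch: open -> mmap -> copy into owned heap storage (the
 * "clone") -> munmap -> unlink. Anything derived from the returned
 * buffer points at heap memory, never at a fragile SHM mapping.
 * Caller frees. */
void *clone_from_shm(const char *name, size_t n) {
  int fd = shm_open(name, O_RDONLY, 0);
  if (fd < 0) return NULL;
  void *p = mmap(NULL, n, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);
  if (p == MAP_FAILED) { shm_unlink(name); return NULL; }
  void *owned = malloc(n);
  if (owned) memcpy(owned, p, n);   /* the consumer-side copy */
  munmap(p, n);
  shm_unlink(name);
  return owned;
}
```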
The SHM descriptor now records requires_grad alongside shape and dtype.
tensors_from_shared calls requires_grad_(TRUE) on reconstruction when
the flag was set. Adds regression test with a dataset that returns both
grad-enabled and grad-disabled tensors.
Memoize tensors_to_shared by xptr_address and tensors_from_shared by
SHM name so that duplicate tensor references in a batch (e.g. from a
custom collate returning list(a = t, b = t)) reconstruct to the same
tensor, preserving storage sharing and in-place mutation visibility.
In containerized environments where /dev/shm is capped (e.g. Docker's
default 64MB), shm_open/ftruncate can fail for large batches.
to_exportable_tensor now catches SHM errors and falls back to the
previous pipe-based serialization path transparently.
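The fallback contract can be sketched as a function that cleans up any half-made segment and signals the caller to take the serialization path instead. A hedged illustration; the function name and return convention are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Try the SHM path; on any failure (e.g. ftruncate hits ENOSPC when
 * /dev/shm is capped at Docker's default 64MB), unlink anything we
 * created and return 1 so the caller can fall back to pipe
 * serialization. Returns 0 when the segment was written. */
int shm_put_or_signal_fallback(const char *name, const void *data, size_t n) {
  int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
  if (fd < 0) return 1;
  if (ftruncate(fd, (off_t)n) < 0) { close(fd); shm_unlink(name); return 1; }
  void *p = mmap(NULL, n, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);
  if (p == MAP_FAILED) { shm_unlink(name); return 1; }
  memcpy(p, data, n);
  munmap(p, n);
  return 0;   /* success: only the name + descriptor cross the pipe */
}
```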
Check use_socket_con() before use_shm() so that users who set
options(torch.dataloader_use_socket_con = TRUE) get socket transport
as expected, without needing to also disable SHM.
Compute the memoization key from xptr_address(x) before calling
contiguous(), which allocates a fresh tensor and changes the address.
This preserves aliasing when a custom collate returns the same
non-contiguous view in multiple slots.
If tensors_to_shared() fails after converting some tensors (e.g.
/dev/shm fills up mid-batch), the error handler now walks the memo
env and unlinks all segments created so far before re-raising. The
outer tryCatch in to_exportable_tensor then catches the error and
falls back to serialization without leaking SHM.
as.character() on some torch dtypes returns C++ names (Char, Byte,
ComplexFloat, ComplexDouble) that torch_tensor_from_buffer doesn't
accept. Add dtype_to_shm_string() mapping to canonical names (int8,
uint8, cfloat, cdouble). Adds regression test covering all standard
dtypes through the multiworker dataloader.
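The mapping can be sketched as a lookup over just the four remapped names given above; everything else passes through unchanged. A C illustration of the idea, not the package's R implementation.

```c
#include <assert.h>
#include <string.h>

/* Map C++-style dtype names to the canonical strings that
 * torch_tensor_from_buffer accepts; other names are assumed
 * already canonical and pass through. */
const char *dtype_to_shm_string(const char *name) {
  if (strcmp(name, "Char") == 0)          return "int8";
  if (strcmp(name, "Byte") == 0)          return "uint8";
  if (strcmp(name, "ComplexFloat") == 0)  return "cfloat";
  if (strcmp(name, "ComplexDouble") == 0) return "cdouble";
  return name;
}
```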
When shape is empty (scalar tensor), &shape[0] is undefined behavior.
Pass nullptr instead, matching the pattern used by
create_tensor_from_atomic. Adds regression test with a custom collate
that reduces a batch to a scalar.
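The scalar edge case can be sketched as follows: a 0-dim tensor has an empty shape but one element, and the sizes pointer must be NULL rather than `&shape[0]`. Helper names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* &shape[0] on an empty container is undefined behavior; pass NULL
 * for a 0-dim (scalar) tensor descriptor instead. */
const long long *sizes_or_null(const long long *shape, size_t ndim) {
  return ndim ? shape : NULL;
}

size_t numel_from_shape(const long long *shape, size_t ndim) {
  size_t n = 1;                 /* ndim == 0 -> 1: a scalar has one element */
  for (size_t i = 0; i < ndim; ++i) n *= (size_t)shape[i];
  return n;
}
```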
Change poll_process timeout from 5s to -1 (indefinite) in the
finalizer's task drain loop. A timeout would skip read() and
shm_unlink_recursive(), leaking SHM segments for slow datasets.
The worker will either finish or die when the session is closed.
