[fix] mooncake: unpack dicts containing tensors to avoid bytes-pool f… by xupinjie · Pull Request #106 · Ascend/TransferQueue

xupinjie · 2026-05-20T15:41:49Z

Bug:
Storing dict values that contain tensors (e.g. Qwen3-VL's multi_modal_inputs) routes them through Mooncake's bytes pool, which silently returns b"" under MB-scale concurrent GET pressure and crashes training.

Fix:
The client now splits such dicts so each sub-tensor rides the working RDMA tensor path under a synthetic sub-key (any non-tensor entries are pickled into one uint8 blob that also rides RDMA), so the buggy bytes pool is never touched.
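A minimal sketch of the split described above, not the PR's actual code. The helper name `split_dict` is hypothetical, and `bytearray` stands in for a tensor so the example runs without torch; the `::` separator and the `__tq_extras__` name come from the constants quoted in the review.

```python
import pickle
from typing import Any, Callable

SEP = "::"                       # joins the original key to a sub-key
EXTRAS_SUBKEY = "__tq_extras__"  # reserved name for the pickled non-tensor blob


def split_dict(
    key: str,
    value: dict[str, Any],
    is_tensor: Callable[[Any], bool],
) -> dict[str, Any]:
    """Fan one dict out into per-entry sub-keys (hypothetical helper)."""
    out: dict[str, Any] = {}
    extras: dict[str, Any] = {}
    for name, item in value.items():
        if is_tensor(item):
            # Each tensor-like entry gets its own synthetic sub-key.
            out[f"{key}{SEP}{name}"] = item
        else:
            extras[name] = item
    if extras:
        # All non-tensor entries are pickled together into one byte blob.
        out[f"{key}{SEP}{EXTRAS_SUBKEY}"] = bytearray(pickle.dumps(extras))
    return out


# bytearray stands in for a tensor; real code would test for torch.Tensor.
# The field names are illustrative only.
parts = split_dict(
    "5@mmi",
    {"pixel_values": bytearray(b"\x00\x01"), "num_frames": 3},
    is_tensor=lambda x: isinstance(x, bytearray),
)
```

Here `parts` holds one entry per tensor-like value plus a single extras entry, so no value has to be stored as an opaque bytes object.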

Res:

…ailure

ascend-robot · 2026-05-20T15:42:00Z

CLA Signature Guide

@xupinjie , thanks for your pull request.

The following commit(s) are not associated with a signed Contributor License Agreement (CLA).

Commit	Reason
f18bc70 [fix] mooncake: unpack dicts co...	The email used in the commit is not linked to a signed CLA. Please verify that it matches the email you used when signing the CLA.

To sign CLA, click here.

To check if your email is configured correctly, refer to the FAQs.

Once you've signed the CLA or updated your email, please comment /check-cla to revalidate the CLA status.

0oshowero0 · 2026-05-21T01:27:19Z

 except ImportError:
    MOONCAKE_STORE_IMPORTED = False

+from tensordict import NonTensorData as _NonTensorData


Why do we need this rename?

Copilot

Pull request overview

This PR fixes a MooncakeStore failure mode where storing dict values containing tensors can route data through Mooncake’s bytes pool, which may return b"" under high concurrent GET pressure and crash training. The client now “unpacks” dicts-with-tensors into multiple synthetic sub-keys so tensor payloads always use the tensor RDMA path, and adds tests to validate round-trip behavior and metadata handling.

Changes:

Add dict-with-tensor fan-out in MooncakeStoreClient.put() and corresponding re-folding logic in get() using per-key custom_backend_meta.
Add expanded-key deletion logic in clear() to remove dict sub-keys (including the bundled extras blob).
Add a comprehensive new test suite covering helper logic, end-to-end round-trip via a fake Mooncake store, and metadata serialization behavior.
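The expanded deletion in `clear()` can be pictured with a small sketch. It assumes a hypothetical helper `keys_to_delete` and assumes that `tensor_keys` in the per-key meta holds bare sub-key names rather than full keys; the PR's real layout may differ.

```python
from typing import Optional

SEP = "::"
UNPACK_MARKER = "__tq_dict_unpack__"
EXTRAS_SUBKEY = "__tq_extras__"


def keys_to_delete(key: str, meta: Optional[dict]) -> list[str]:
    """Return the backend keys one logical key occupies (hypothetical helper)."""
    if not meta or not meta.get(UNPACK_MARKER):
        # Ordinary value: it lives under its own key.
        return [key]
    # Unpacked dict: one backend key per tensor entry.
    expanded = [f"{key}{SEP}{sub}" for sub in meta["tensor_keys"]]
    if meta.get("extras_size", 0) > 0:
        # The bundled non-tensor blob has a key of its own as well.
        expanded.append(f"{key}{SEP}{EXTRAS_SUBKEY}")
    return expanded
```

Without this expansion, deleting only the original key would leave the sub-keys behind in the store.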

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`transfer_queue/storage/clients/mooncake_client.py`	Implements dict unpacking into tensor-path sub-keys, reconstruction in `get()`, and expanded deletion in `clear()`.
`tests/test_mooncake_dict_unpack.py`	Adds unit + end-to-end tests validating dict unpack/repack, metadata shape, and clear/get behaviors.

Comments suppressed due to low confidence (3)

transfer_queue/storage/clients/mooncake_client.py:52

The synthetic sub-key scheme reserves __tq_extras__ for the bundled non-tensor blob. If an input dict contains a tensor entry whose key equals this reserved name, it will collide with the extras-blob key ({key}::__tq_extras__) and overwrite/corrupt data. Add validation/escaping for dict keys (at least reject reserved names like __tq_extras__ and possibly also guard against _TQ_DICT_UNPACK_KEY).

# Separator joining an original key to a dict sub-key (e.g. "5@mmi::pixel_values").
_DICT_SUBKEY_SEP: str = "::"

# Sentinel marker key identifying a per-key dict-unpack meta entry.
_TQ_DICT_UNPACK_KEY: str = "__tq_dict_unpack__"

# Reserved sub-key name for the bundled non-tensor blob (a 1D uint8 tensor that
# carries pickle bytes of all non-tensor entries of the original dict).
_TQ_EXTRAS_SUBKEY: str = "__tq_extras__"
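One way to act on the collision concern is to reject reserved names before unpacking. The sketch below is illustrative only: `validate_sub_keys` is a hypothetical helper, and rejecting (rather than escaping) the names is an assumption about the preferred behavior.

```python
RESERVED_SUBKEYS = frozenset({"__tq_extras__", "__tq_dict_unpack__"})


def validate_sub_keys(value: dict) -> None:
    """Raise if a user dict uses a name reserved by the unpack scheme."""
    for name in value:
        if name in RESERVED_SUBKEYS:
            # A user entry with this name would share a backend key with
            # the extras blob or be mistaken for the unpack marker.
            raise ValueError(f"dict key {name!r} is reserved")
```

Calling it at the start of the unpack path turns a silent overwrite into an immediate error.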

transfer_queue/storage/clients/mooncake_client.py:241

custom_backend_meta currently stores torch.dtype objects (see tensor_dtypes). transfer_queue.utils.serial_utils.encode() explicitly falls back to pickle when payloads contain torch.dtype, which can negate the intended msgpack round-trip and add overhead. Consider storing dtypes as simple msgpack-native values (e.g., dtype name strings) and converting back to torch.dtype in get.

                custom_meta[i] = {
                    _TQ_DICT_UNPACK_KEY: True,
                    "key_order": key_order,
                    "tensor_keys": ts_sub_keys,
                    "tensor_dtypes": [t.dtype for t in ts_sub_tensors],
                    "tensor_shapes": [list(t.shape) for t in ts_sub_tensors],
                    "extras_size": extras_size,
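The suggestion to store dtype names instead of `torch.dtype` objects could look roughly like this. Both helper names are hypothetical, and the module is passed in as `namespace` only so the sketch stays importable without torch.

```python
from typing import Any


def dtype_to_name(dtype: Any) -> str:
    """Turn a dtype into a plain string, e.g. torch.float32 -> "float32"."""
    # str(torch.float32) is "torch.float32"; drop the module prefix.
    return str(dtype).split(".")[-1]


def name_to_dtype(name: str, namespace: Any) -> Any:
    """Look the dtype back up by name on the given module (normally torch)."""
    dtype = getattr(namespace, name, None)
    if dtype is None:
        raise ValueError(f"unknown dtype name: {name!r}")
    return dtype
```

With torch installed, `name_to_dtype(dtype_to_name(torch.bfloat16), torch)` returns `torch.bfloat16`, and the meta then contains only strings, lists, ints and bools.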

tests/test_mooncake_dict_unpack.py:605

This test’s rationale claims the dict-unpack meta survives a msgspec/msgpack controller round-trip because it’s a plain dict. However, the meta includes torch.dtype values, and serial_utils.encode() documents that it falls back to pickle when torch.dtype is present. Either adjust the explanation/expectations, or store msgpack-native dtype representations (e.g., strings/ints) so the round-trip is truly msgpack-based.

    def test_meta_survives_tq_msgpack_pipeline(self):
        """REGRESSION: an earlier implementation made the dict-unpack meta a
        ``@dataclass``, which msgspec auto-flattened into a typeless dict on
        the controller round-trip; ``isinstance`` checks then failed at GET
        and the bytes-pool fallback re-triggered the original bug. Using a
        plain ``dict`` with a sentinel key sidesteps the issue — dicts are a
        native msgpack map type, so the structure (including the
        ``_TQ_DICT_UNPACK_KEY`` marker and all fields) round-trips
        losslessly.
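The regression described in this docstring can be reproduced in miniature. In the sketch below, `json` stands in for the msgpack hop and `asdict` is called explicitly, whereas the docstring says msgspec flattened the dataclass automatically; `UnpackMeta` is a hypothetical dataclass form of the meta.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class UnpackMeta:  # hypothetical dataclass form of the meta
    tensor_keys: list[str] = field(default_factory=list)


def roundtrip(obj):
    # json stands in for msgpack: both carry only maps, lists and scalars.
    return json.loads(json.dumps(obj))


as_dataclass = roundtrip(asdict(UnpackMeta(["pixel_values"])))
as_marked_dict = roundtrip(
    {"__tq_dict_unpack__": True, "tensor_keys": ["pixel_values"]}
)

# The dataclass comes back as a plain dict, so a type check no longer matches.
lost_type = not isinstance(as_dataclass, UnpackMeta)
# The sentinel key survives, so the receiver can still recognise the meta.
still_marked = as_marked_dict.get("__tq_dict_unpack__") is True
```

The receiver therefore checks for the marker key instead of relying on `isinstance`.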


+    def test_non_tensor_data_wrapped_dict_is_true(self):
+        """The KV storage manager hands the client NonTensorData-wrapped dicts;
+        the dict-unpack path must unwrap them before classification."""
+        try:
+            from tensordict import NonTensorData
+        except ImportError:
+            pytest.skip("tensordict not installed in this env")
+        v = NonTensorData({"a": torch.zeros(3), "b": torch.ones(2, 4)})
+        assert _dict_has_tensor(v)


0oshowero0 · 2026-05-21T01:41:19Z

+    )
+
+
+def _expand_dict_slots_fn(


_flatten_dict_slots or _unpack_dict_slots?

0oshowero0 · 2026-05-21T01:42:08Z

+            extras_idx = -1
+            extras_size = meta.get("extras_size", 0)
+            if extras_size > 0:
+                flat_keys.append(f"{key}{_DICT_SUBKEY_SEP}{_TQ_EXTRAS_SUBKEY}")
+                flat_shapes.append([extras_size])
+                flat_dtypes.append(torch.uint8)
+                extras_idx = len(flat_keys) - 1


This needs some comments to explain it.

0oshowero0 · 2026-05-21T01:42:23Z

+            flat_shapes.append(shapes[i])
+            flat_dtypes.append(dtypes[i])
+            reconstruct.append(("scalar", len(flat_keys) - 1))
+    return flat_keys, flat_shapes, flat_dtypes, reconstruct


reconstruct -> rebuild_plan?
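For readers without the full diff, the role of this list can be sketched as follows. This is an assumption-laden illustration, not the PR's code: the function names and the `("dict", names, start)` entry shape are hypothetical, and the extras blob is left out for brevity.

```python
from typing import Any, Optional

SEP = "::"
UNPACK_MARKER = "__tq_dict_unpack__"


def expand_keys(keys: list[str], metas: list[Optional[dict]]):
    """Flatten logical keys into backend keys plus a plan for regrouping."""
    flat_keys: list[str] = []
    rebuild_plan: list[tuple] = []
    for key, meta in zip(keys, metas):
        if meta and meta.get(UNPACK_MARKER):
            start = len(flat_keys)
            flat_keys.extend(f"{key}{SEP}{sub}" for sub in meta["tensor_keys"])
            # Remember which flat slots belong to this dict, and their names.
            rebuild_plan.append(("dict", meta["tensor_keys"], start))
        else:
            flat_keys.append(key)
            rebuild_plan.append(("scalar", len(flat_keys) - 1))
    return flat_keys, rebuild_plan


def rebuild(flat_values: list[Any], rebuild_plan: list[tuple]) -> list[Any]:
    """Regroup values fetched for flat_keys back into input order."""
    out: list[Any] = []
    for entry in rebuild_plan:
        if entry[0] == "scalar":
            out.append(flat_values[entry[1]])
        else:
            _, names, start = entry
            out.append({n: flat_values[start + i] for i, n in enumerate(names)})
    return out
```

The plan is what lets a single batched fetch over `flat_keys` be folded back into one value per original key.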

0oshowero0 · 2026-05-21T01:44:16Z

        if len(keys) != len(values):
            raise ValueError("Number of keys must match number of values")

+        custom_meta: list[Any] = [None] * len(keys)


custom_meta -> custom_backend_meta, because these two types of meta have different structures.

The first one is per-sample, while the second one is per-sample-per-field.

0oshowero0 · 2026-05-21T01:46:25Z

+                # Dict-with-tensor fan-out: avoid the Mooncake bytes pool which
+                # silently returns b"" under MB-scale GET pressure (see
+                # real_client.cpp:2209 "Failed to allocate buffer"). Each


The background info is not needed.

0oshowero0 · 2026-05-21T02:01:01Z

+        ``shapes`` and ``dtypes`` describe the expected tensor layout per key
+        (use ``None`` for non-tensor slots). ``custom_backend_meta`` carries
+        per-key metadata returned by ``put``. Returns values in input order.


I suggest using the previous format for the docstring.

0oshowero0 · 2026-05-21T02:24:10Z

In general, I think the current implementation will be difficult to maintain. I plan to move the serialization logic in the Yuanrong client (https://github.com/Ascend/TransferQueue/blob/main/transfer_queue/storage/clients/yuanrong_client.py#L288) to a higher level. This will allow it to be shared by all storage backends, utilizing the common serial_utils.py for serialization.

[fix] mooncake: unpack dicts containing tensors to avoid bytes-pool f…

f18bc70

…ailure

ascend-robot added the ascend-cla/no label May 20, 2026

0oshowero0 requested a review from Copilot May 21, 2026 01:24

Copilot started reviewing on behalf of 0oshowero0 May 21, 2026 01:25

0oshowero0 reviewed May 21, 2026

Copilot AI reviewed May 21, 2026

0oshowero0 reviewed May 21, 2026


xupinjie wants to merge 1 commit into Ascend:main from xupinjie:pinjie/fix_multi_modal_inputs
