[doc] Adjusted yuanrong backend doc by dpj135 · Pull Request #104 · Ascend/TransferQueue

dpj135 · 2026-05-18T12:34:13Z

Description

I've updated the description in the Yuanrong backend documentation, adding more usage guidance.

Main changes

- Add more detailed descriptions of installation and usage.
- Adjust the demos to use `transfer_queue.init()` to start TransferQueue and Yuanrong.
- Add instructions for manually launching Yuanrong when `auto_init=False`.
- Add an FAQ recording common issues encountered when using Yuanrong.

ascend-robot · 2026-05-18T12:34:24Z

CLA Signature Pass

dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍

tianyi-ge · 2026-05-20T08:26:56Z

+## Quick Start

 ### Prerequisites
 - **Python Version**: $ \geq 3.10~and \leq 3.11 $


This line is not rendered correctly by Markdown. Just use `>= 3.10, <= 3.11`.
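For readers who want to confirm their interpreter against the range suggested above, here is a minimal sketch. It assumes the constraint applies to the major.minor version only (so any 3.10.x or 3.11.x release qualifies); the helper name `is_supported_python` is hypothetical and not part of TransferQueue.

```python
import sys

# Bounds taken from the review comment: >= 3.10, <= 3.11.
# Assumption: only major.minor is compared, so every 3.11.x patch release passes.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 11)


def is_supported_python(version_info=None):
    """Return True if major.minor lies inside the stated range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION
```

Calling `is_supported_python()` with no argument checks the running interpreter.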

tianyi-ge · 2026-05-20T08:31:39Z

  - `--shared_memory_size_mb`: Shared memory size in MB for datasystem worker.
-  - `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`).
-  - `--enable_huge_tlb`: Enable huge page memory, required for >21GB shared memory on Ascend 910B.
+  - `--enable_huge_tlb`: Configure huge page memory to reduce TLB misses and improve memory access efficiency. Note: may cause system memory shortage, kernel OOM, or system instability. Required for >21GB shared memory on Ascend 910B.


I think it's better to remind users to allocate huge pages before starting datasystem. You could link to the datasystem huge page doc: https://pages.openeuler.openatom.cn/openyuanrong-datasystem/docs/zh-cn/latest/appendix/hugepage_guide.html
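One way a user could verify that huge pages have been allocated before starting datasystem is to read the standard Linux counters in `/proc/meminfo`. The sketch below is only an illustration: the helper names are hypothetical, and how much huge page memory the datasystem worker actually needs is covered by the linked guide, not by this thread.

```python
def parse_hugepages(meminfo_text):
    """Extract huge page counters from the text of /proc/meminfo."""
    wanted = {"HugePages_Total", "HugePages_Free", "Hugepagesize"}
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and key.strip() in wanted and parts:
            # Counters are plain integers; Hugepagesize carries a "kB" suffix.
            values[key.strip()] = int(parts[0])
    return values


def free_hugepage_mb(meminfo_text):
    """Return free huge page memory in MB (free pages * page size in kB / 1024)."""
    info = parse_hugepages(meminfo_text)
    return info.get("HugePages_Free", 0) * info.get("Hugepagesize", 0) // 1024
```

On a Linux host this could be used as `free_hugepage_mb(open("/proc/meminfo").read())` and compared against the intended `--shared_memory_size_mb` value.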

tianyi-ge · 2026-05-20T08:33:51Z


-Next, we will provide deployment and code examples for single-node scenarios.
-For multi-node scenarios, please refer to [Appendix B](#B-deploy-multi-node-datasystem-for-multi-node-training-and-inference-scenarios).
+When `auto_init: True` is set in the configuration, TransferQueue automatically initializes the Yuanrong backend during `tq.init()`. The deployment process:


It's better to tell readers that Yuanrong is a per-host deployment: it manages all clients on the same node. Otherwise some users may mistakenly think the yr backend is per-client.

tianyi-ge · 2026-05-20T08:36:42Z

+**NPU Transfer Options:**
+- `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors.
+- `worker_args` (recommended when `enable_yr_npu_transport: true`):
+  - `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node NPU data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`).


yr manages all the specified devices. If you want to set/get tensors on NPU x, you need to include device ID x in this argument.
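The rule in this comment can be expressed as a simple membership check on the comma-separated value. The following is a sketch with hypothetical helper names; it only parses the string format shown in the quoted doc (for example `0,1,2,3`) and does not call any Yuanrong API.

```python
def parse_device_ids(arg_value):
    """Parse a comma-separated device ID string such as "0,1,2,3"."""
    return [int(part) for part in arg_value.split(",") if part.strip()]


def covers_device(arg_value, device_id):
    """Return True if device_id is listed in the --remote_h2d_device_ids value."""
    return device_id in parse_device_ids(arg_value)
```

For example, with a value of `0,1`, a client that wants to set or get tensors on NPU 3 would fail this check and the argument would need to be extended.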

tianyi-ge · 2026-05-20T08:38:02Z

+
+```bash
+# On head node
+ray start --head --resources='{"node:192.168.0.1": 1}'


I remember haichuan saying that the node IP resources are not necessary. If that's true, this start command can be simplified.

This is for controlling the placement of Ray actors.

tianyi-ge · 2026-05-20T08:39:25Z

+TransferQueue will detect all Ray nodes and deploy datasystem workers automatically.

-Once the configuration is set, you can run your TransferQueue + Datasystem application directly.
+#### Multi-Node Demo


Add a short line reminding users which lines need to be modified (the node IPs) before giving them a big chunk of code.

tianyi-ge · 2026-05-20T08:40:34Z

+If `worker_port` or `metastore_port` is already in use, initialization will fail:
+
+```
+RuntimeError: Failed to start datasystem worker...


Is a port conflict the only possible reason for failing to start the datasystem worker?
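To help tell a port conflict apart from other startup failures, a user could probe the configured ports before initialization. This is a minimal sketch using only the standard `socket` module; the helper name is hypothetical, and it does not answer which other causes can produce the same `RuntimeError`.

```python
import socket


def is_port_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
    return True
```

Checking the values of `worker_port` and `metastore_port` this way rules the conflict in or out. The result is only a snapshot: another process can still take the port between the check and the actual start.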

tianyi-ge · 2026-05-20T08:41:28Z

+# Clean up
+dscli stop --worker_address <IP>:31501
+# Or force cleanup
+pkill -f dscli


Should this kill `dscli` or `datasystem_worker`?

tianyi-ge · 2026-05-20T08:42:49Z

+pkill -f dscli
+```
+
+### Multi-Process Initialization


why is this an FAQ?

KaisennHu

Overall looks good. A few minor comments.

KaisennHu · 2026-05-20T08:36:39Z

+
+**NPU Transfer Options:**
+- `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors.
+- `worker_args` (recommended when `enable_yr_npu_transport: true`):


When `enable_yr_npu_transport` is set to `true`, `remote_h2d_device_ids` is mandatory, not just recommended.
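The dependency described here could be caught early with a small configuration check. The sketch below makes two assumptions that are not confirmed by the thread: that the backend options are available as a flat dictionary, and that `worker_args` is a single string of command-line flags. The function name is hypothetical.

```python
def check_npu_transport_config(config):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if config.get("enable_yr_npu_transport"):
        # Assumption: worker_args is one string of flags passed to the worker.
        worker_args = config.get("worker_args") or ""
        if "--remote_h2d_device_ids" not in worker_args:
            problems.append(
                "enable_yr_npu_transport is true but worker_args "
                "does not set --remote_h2d_device_ids"
            )
    return problems
```

Returning a list rather than raising keeps the sketch easy to combine with other checks before initialization.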

KaisennHu · 2026-05-20T08:47:34Z

+1. **Detects Ray cluster nodes** - identifies all alive nodes in the Ray cluster
+2. **Creates placement group** - uses `STRICT_SPREAD` strategy to ensure workers are distributed across nodes
+3. **Launches YuanrongWorkerActor** - creates one actor per node to manage the datasystem worker
+4. **Sets up metastore service** - the head node (driver node) starts the metastore service, other nodes connect as workers


The ‘-’ symbols look a bit strange.
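As an illustration of step 1 in the quoted list (detecting alive nodes), the records returned by Ray's `ray.nodes()` include an `Alive` flag and a `NodeManagerAddress` field. The sketch below filters such records; it is not TransferQueue's actual implementation, which this thread does not show, and the function name is hypothetical.

```python
def alive_node_ips(nodes):
    """Collect the IP addresses of alive nodes from ray.nodes()-style records."""
    return sorted(
        {node["NodeManagerAddress"] for node in nodes if node.get("Alive")}
    )
```

In a running cluster this would be called as `alive_node_ips(ray.nodes())` after connecting with `ray.init(address="auto")`.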

KaisennHu · 2026-05-20T08:51:06Z

+# On head node
+ray start --head --resources='{"node:192.168.0.1": 1}'
+
+# On worker node (assume ray port of head_node is 6379)
+ray start --address="192.168.0.1:6379" --resources='{"node:192.168.0.2": 1}'


When starting Ray in an NPU environment, users should be reminded to add `--resources='{"NPU": 4}'` or to configure `ASCEND_RT_VISIBLE_DEVICES`.
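For scripted deployments, the `ray start` line with an NPU resource can be assembled programmatically so the JSON quoting stays correct. This is a sketch with a hypothetical helper name; the NPU count of 4 comes from the comment above, and whether the per-node IP resource is also needed is left open in this review.

```python
import json
import shlex


def ray_start_command(npu_count, head=True, address=None):
    """Build a shell-safe `ray start` command that declares NPU resources."""
    args = ["ray", "start"]
    if head:
        args.append("--head")
    else:
        # Worker nodes join an existing cluster via the head node's address.
        args.append(f"--address={address}")
    args.append("--resources=" + json.dumps({"NPU": npu_count}))
    return shlex.join(args)
```

`shlex.join` quotes the resources argument as a whole, which is equivalent in the shell to quoting only the JSON part as in the quoted doc.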

ascend-robot added the ascend-cla/yes label May 18, 2026

dpj135 force-pushed the fix_yr_init_and_doc branch from 5e20c9d to a80b636 Compare May 19, 2026 11:01

Adjusted yuanrong backend doc

e353874

Signed-off-by: dpj135 <[email protected]>

dpj135 force-pushed the fix_yr_init_and_doc branch from a80b636 to e353874 Compare May 20, 2026 04:42

Used kv interface(Higher API)

0c31647

Signed-off-by: dpj135 <[email protected]>

dpj135 force-pushed the fix_yr_init_and_doc branch from 41f4a62 to 0c31647 Compare May 20, 2026 07:56

dpj135 marked this pull request as ready for review May 20, 2026 07:56

tianyi-ge reviewed May 20, 2026

KaisennHu reviewed May 20, 2026
