[doc] Adjusted yuanrong backend doc#104

Open
dpj135 wants to merge 3 commits into Ascend:main from dpj135:fix_yr_init_and_doc

Contributor

@dpj135 dpj135 commented May 18, 2026

Description

I've updated the description in the Yuanrong backend documentation, adding more usage guidance.

Main changes

  • Add more detailed installation and usage instructions.
  • Adjust the demos to use `transfer_queue.init()` to start TransferQueue and Yuanrong.
  • Add instructions for manually launching Yuanrong when `auto_init=False`.
  • Add an FAQ covering common issues when using Yuanrong.

@ascend-robot

CLA Signature Pass

dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍

@dpj135 dpj135 force-pushed the fix_yr_init_and_doc branch from 5e20c9d to a80b636 Compare May 19, 2026 11:01
Signed-off-by: dpj135 <958208521@qq.com>
@dpj135 dpj135 force-pushed the fix_yr_init_and_doc branch from a80b636 to e353874 Compare May 20, 2026 04:42
Signed-off-by: dpj135 <958208521@qq.com>
@dpj135 dpj135 force-pushed the fix_yr_init_and_doc branch from 41f4a62 to 0c31647 Compare May 20, 2026 07:56
@dpj135 dpj135 marked this pull request as ready for review May 20, 2026 07:56
## Quick Start

### Prerequisites
- **Python Version**: $ \geq 3.10~and \leq 3.11 $
Contributor

This line is not rendered correctly by Markdown. Just use `>= 3.10, <= 3.11`.

Contributor Author

done
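
For reference, the supported range discussed in this thread (3.10 through 3.11, inclusive) can also be checked at runtime. This is a minimal sketch; the helper name `is_supported_python` is hypothetical and is not part of TransferQueue.

```python
import sys

# Bounds taken from the prerequisite line under review.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 11)


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True when major.minor lies in the inclusive range 3.10..3.11."""
    major_minor = (version_info[0], version_info[1])
    return MIN_VERSION <= major_minor <= MAX_VERSION
```

Calling `is_supported_python()` with no arguments checks the running interpreter; passing a tuple such as `(3, 12, 0)` checks an arbitrary version.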

- `--shared_memory_size_mb`: Shared memory size in MB for datasystem worker.
- `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`).
- `--enable_huge_tlb`: Enable huge page memory, required for >21GB shared memory on Ascend 910B.
- `--enable_huge_tlb`: Configure huge page memory to reduce TLB misses and improve memory access efficiency. Note: may cause system memory shortage, kernel OOM, or system instability. Required for >21GB shared memory on Ascend 910B.
Contributor

I think it's better to remind users to allocate huge pages before starting datasystem. You may link to the datasystem huge page doc: https://pages.openeuler.openatom.cn/openyuanrong-datasystem/docs/zh-cn/latest/appendix/hugepage_guide.html

Contributor Author

done
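
As a rough illustration of the "allocate huge pages first" reminder, the sketch below reads the standard Linux `/proc/meminfo` fields `HugePages_Free` and `Hugepagesize` and reports how much huge page memory is currently free. The function names are hypothetical, and how this relates to the `--shared_memory_size_mb` value you pass is left to the linked huge page guide.

```python
def parse_hugepages(meminfo_text: str) -> dict:
    """Extract HugePages_Total, HugePages_Free and Hugepagesize (kB) from meminfo text."""
    wanted = {"HugePages_Total", "HugePages_Free", "Hugepagesize"}
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        key = key.strip()
        if key in wanted:
            # The first token after the colon is the numeric value.
            values[key] = int(rest.split()[0])
    return values


def free_hugepage_mb(meminfo_text: str) -> int:
    """Free huge page memory in MB (free page count times page size in kB)."""
    info = parse_hugepages(meminfo_text)
    return info.get("HugePages_Free", 0) * info.get("Hugepagesize", 0) // 1024


def free_hugepage_mb_on_this_host(path: str = "/proc/meminfo") -> int:
    """Read the local meminfo file (Linux only) and report free huge page memory."""
    with open(path) as f:
        return free_hugepage_mb(f.read())
```

A result of 0 means no huge pages are available, so enabling `--enable_huge_tlb` at that point is unlikely to help.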


Next, we will provide deployment and code examples for single-node scenarios.
For multi-node scenarios, please refer to [Appendix B](#B-deploy-multi-node-datasystem-for-multi-node-training-and-inference-scenarios).
When `auto_init: True` is set in the configuration, TransferQueue automatically initializes the Yuanrong backend during `tq.init()`. The deployment process:
Contributor

It's better to tell readers that Yuanrong is a per-host deployment: it manages all clients on the same node. Otherwise some users may mistakenly think the yr backend is per-client.

Contributor Author

done

**NPU Transfer Options:**
- `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors.
- `worker_args` (recommended when `enable_yr_npu_transport: true`):
- `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node NPU data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`).
Contributor

yr manages all the specified devices. If you want to set/get tensors on npu x, you need to include the device id x in this argument.

Contributor Author

done
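
To illustrate the point above (a device must be listed in `--remote_h2d_device_ids` before tensors can be set or fetched on it), here is a small sketch that parses the comma-separated value and checks membership. Both helper names are hypothetical and only mirror the `0,1,2,3` format shown in the doc.

```python
def parse_device_ids(value: str) -> list[int]:
    """Parse a comma-separated device ID string such as "0,1,2,3"."""
    return [int(part) for part in value.split(",") if part.strip()]


def device_is_managed(device_id: int, remote_h2d_device_ids: str) -> bool:
    """Return True when device_id appears in the configured device ID list."""
    return device_id in parse_device_ids(remote_h2d_device_ids)
```

With `remote_h2d_device_ids="0,1,2,3"`, NPU 2 is covered while NPU 4 is not, so tensors on NPU 4 would fall outside what yr manages.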


```bash
# On head node
ray start --head --resources='{"node:192.168.0.1": 1}'
Contributor

I remember that haichuan said resources for node ip is not necessary. if it's true, this start cmd can be simplified

Contributor Author

This is for controlling the placement of Ray actors.

TransferQueue will detect all Ray nodes and deploy datasystem workers automatically.

Once the configuration is set, you can run your TransferQueue + Datasystem application directly.
#### Multi-Node Demo
Contributor

add a short line to remind the users which lines are required to be modified (node ips) before giving them a big chunk of code

Contributor Author

done

If `worker_port` or `metastore_port` is already in use, initialization will fail:

```
RuntimeError: Failed to start datasystem worker...
Contributor

Is a port conflict the only possible reason the datasystem worker fails to start?

Contributor Author

Add more situations
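
For the port-conflict case specifically, a quick pre-check can tell it apart from other startup failures. The sketch below simply tries to bind a TCP socket; `port_is_free` is a hypothetical helper, and the port `31501` mentioned in the usage note is taken from the `dscli stop` example in this PR (whether it is the default `worker_port` is an assumption).

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True
```

For example, `port_is_free(31501)` returning `False` before `tq.init()` points to a leftover worker or another service on that port. A `True` result only rules out this one cause; it does not guarantee the worker will start.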

# Clean up
dscli stop --worker_address <IP>:31501
# Or force cleanup
pkill -f dscli
Contributor

kill dscli or kill datasystem_worker?

Contributor Author

done

pkill -f dscli
```

### Multi-Process Initialization
Contributor

why is this an FAQ?

Contributor Author

Users may be confused about how to init yuanrong-worker with multiple processes. This is for explaining the process of tq.init()

Collaborator

@KaisennHu KaisennHu left a comment

Overall looks good. Some minors.


**NPU Transfer Options:**
- `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors.
- `worker_args` (recommended when `enable_yr_npu_transport: true`):
Collaborator

When enable_yr_npu_transport is set to true, remote_h2d_device_ids is mandatory instead of recommended.

Contributor Author

done
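
The "mandatory, not recommended" rule from this thread could be expressed as a small consistency check. This is only a sketch: the flat dict layout and the assumption that `worker_args` is a single string of CLI flags are guesses for illustration, and `validate_npu_options` is a hypothetical helper, not TransferQueue code.

```python
def validate_npu_options(backend_cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the options are consistent."""
    problems = []
    if backend_cfg.get("enable_yr_npu_transport"):
        # Assumption: worker_args is a string of flags passed to the datasystem worker.
        worker_args = backend_cfg.get("worker_args") or ""
        if "--remote_h2d_device_ids" not in worker_args:
            problems.append(
                "enable_yr_npu_transport is true but worker_args "
                "does not set --remote_h2d_device_ids"
            )
    return problems
```

Running such a check before initialization would surface the missing flag as a configuration error rather than as a later transfer failure.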

Comment on lines +125 to +128
1. **Detects Ray cluster nodes** - identifies all alive nodes in the Ray cluster
2. **Creates placement group** - uses `STRICT_SPREAD` strategy to ensure workers are distributed across nodes
3. **Launches YuanrongWorkerActor** - creates one actor per node to manage the datasystem worker
4. **Sets up metastore service** - the head node (driver node) starts the metastore service, other nodes connect as workers
Collaborator

The symbols ‘-’ are a bit strange

Contributor Author

I think it looks not bad. (^w^)
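
For readers unfamiliar with the first two steps in the quoted list (detecting alive nodes and creating a `STRICT_SPREAD` placement group), the sketch below shows the general Ray pattern. It uses the public `ray.nodes()` and `placement_group` APIs; it is not TransferQueue's actual implementation, and the helper names and the one-CPU bundle shape are assumptions.

```python
def alive_node_ips(nodes: list[dict]) -> list[str]:
    """Pick the IPs of alive nodes from ray.nodes()-style records."""
    return [n["NodeManagerAddress"] for n in nodes if n.get("Alive")]


def one_bundle_per_node(node_ips: list[str]) -> list[dict]:
    """Build one bundle per node; STRICT_SPREAD places each bundle on a distinct node."""
    return [{"CPU": 1} for _ in node_ips]


def create_spread_group():
    """Connect to a running Ray cluster and reserve one bundle on every alive node."""
    # Imported lazily so the helpers above can be used without Ray installed.
    import ray
    from ray.util.placement_group import placement_group

    ray.init(address="auto")
    ips = alive_node_ips(ray.nodes())
    pg = placement_group(one_bundle_per_node(ips), strategy="STRICT_SPREAD")
    ray.get(pg.ready())  # blocks until every bundle has been placed
    return pg, ips
```

Because `STRICT_SPREAD` refuses to co-locate bundles, scheduling one actor per bundle yields one actor per node, which matches the per-host deployment model described in this PR.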

Comment on lines +165 to +169
# On head node
ray start --head --resources='{"node:192.168.0.1": 1}'

# On worker node (assume ray port of head_node is 6379)
ray start --address="192.168.0.1:6379" --resources='{"node:192.168.0.2": 1}'
Collaborator

To start Ray in an NPU environment, users need to be reminded to add --resources='{"NPU": 4}' or configure ASCEND_RT_VISIBLE_DEVICES.

Contributor Author

done

Contributor

Copilot AI left a comment

Pull request overview

Updates the Yuanrong storage-backend documentation to provide clearer installation, configuration, deployment, and troubleshooting guidance, and links the guide from the main README.

Changes:

  • Added a README link to the Yuanrong usage guide.
  • Restructured and expanded the Yuanrong backend guide with demos (single-node + multi-node), config explanations, and manual startup instructions.
  • Added an FAQ section covering common deployment/runtime issues.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
README.md Adds a direct link to the Yuanrong backend usage guide from the supported backends list.
docs/storage_backends/openyuanrong_datasystem.md Expands and reorganizes Yuanrong backend documentation (install, demos, deployment, manual mode, FAQ).


Comment on lines 40 to 41
# Install Torch (recommended version: 2.8.0 or higher)
pip install torch==2.8.0
Contributor Author

done

Contributor Author

Add extra annotation

Comment on lines 70 to 75
# For root users
ll /usr/local/Ascend/ascend-toolkit/latest

# For non-root users
ll ${HOME}/Ascend/ascend-toolkit/latest
```
Contributor Author

done


#### Option 1: Docker Image (Recommended)

First, select the appropriate [CANN image](https://hub.docker.com/r/ascendai/cann) aligned with your **CANN version**, **Ascend hardware**, **OS**, and **Python version**. For example:
Contributor Author

done

Collaborator

TQ has already set openyuanrong-datasystem as an optional dependency. We can use `pip install TransferQueue[yuanrong]` to install the corresponding openyuanrong-datasystem directly.

Contributor Author

done

Signed-off-by: dpj135 <958208521@qq.com>
@dpj135 dpj135 force-pushed the fix_yr_init_and_doc branch from bcd05d4 to 88b8591 Compare May 22, 2026 09:25