[doc] Adjusted yuanrong backend doc#104

Open
dpj135 wants to merge 2 commits into Ascend:main from dpj135:fix_yr_init_and_doc

Conversation

Contributor

@dpj135 dpj135 commented May 18, 2026

Description

I've updated the description in the Yuanrong backend documentation, adding more usage guidance.

Main changes

  • Add more detailed descriptions regarding installation and usage.
  • Adjust the demos to use transfer_queue.init() to start TransferQueue and Yuanrong.
  • Add instructions for manually launching Yuanrong when auto_init=False.
  • Add an FAQ covering common issues encountered when using Yuanrong.

@ascend-robot

CLA Signature Pass

dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍

@dpj135 dpj135 force-pushed the fix_yr_init_and_doc branch from 5e20c9d to a80b636 Compare May 19, 2026 11:01
@dpj135 dpj135 force-pushed the fix_yr_init_and_doc branch from a80b636 to e353874 Compare May 20, 2026 04:42
@dpj135 dpj135 force-pushed the fix_yr_init_and_doc branch from 41f4a62 to 0c31647 Compare May 20, 2026 07:56
@dpj135 dpj135 marked this pull request as ready for review May 20, 2026 07:56
## Quick Start

### Prerequisites
- **Python Version**: $ \geq 3.10~and \leq 3.11 $
Contributor

This line is not rendered correctly by Markdown. Just use `>= 3.10, <= 3.11`.
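
The range in this prerequisite can also be checked at runtime. A minimal sketch, assuming the supported range is exactly 3.10 through 3.11 as the line under review states; `python_version_supported` is a hypothetical helper and not part of TransferQueue:

```python
import sys

# Supported range taken from the prerequisite line under review.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 11)


def python_version_supported(version=None):
    """Return True if (major, minor) lies within the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    print("supported:", python_version_supported())
```

Comparing `(major, minor)` tuples avoids string comparisons, which would order "3.9" after "3.10".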

- `--shared_memory_size_mb`: Shared memory size in MB for datasystem worker.
- `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`).
- `--enable_huge_tlb`: Enable huge page memory, required for >21GB shared memory on Ascend 910B.
- `--enable_huge_tlb`: Configure huge page memory to reduce TLB misses and improve memory access efficiency. Note: may cause system memory shortage, kernel OOM, or system instability. Required for >21GB shared memory on Ascend 910B.
Contributor

I think it's better to remind users to allocate huge pages before starting datasystem. You could link to the datasystem huge page doc: https://pages.openeuler.openatom.cn/openyuanrong-datasystem/docs/zh-cn/latest/appendix/hugepage_guide.html
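
A pre-flight check along these lines could accompany the reminder. This is only a sketch: it assumes a Linux host where `/proc/meminfo` reports `HugePages_Free` and `Hugepagesize`, and it assumes the free huge pages need to cover the whole `--shared_memory_size_mb` value, which the linked guide should confirm. The helper names are hypothetical.

```python
import re


def parse_hugepages(meminfo_text):
    """Extract huge page counters from /proc/meminfo-style text."""
    fields = {}
    for key in ("HugePages_Total", "HugePages_Free", "Hugepagesize"):
        match = re.search(rf"^{key}:\s+(\d+)", meminfo_text, re.MULTILINE)
        fields[key] = int(match.group(1)) if match else 0
    return fields


def free_hugepage_mb(meminfo_text):
    """Free huge page memory in MB (Hugepagesize is reported in kB)."""
    fields = parse_hugepages(meminfo_text)
    return fields["HugePages_Free"] * fields["Hugepagesize"] // 1024


def enough_hugepages(meminfo_text, shared_memory_size_mb):
    """True if free huge pages can back the requested shared memory size."""
    return free_hugepage_mb(meminfo_text) >= shared_memory_size_mb


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(free_hugepage_mb(f.read()), "MB of huge pages free")
```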


Next, we will provide deployment and code examples for single-node scenarios.
For multi-node scenarios, please refer to [Appendix B](#B-deploy-multi-node-datasystem-for-multi-node-training-and-inference-scenarios).
When `auto_init: True` is set in the configuration, TransferQueue automatically initializes the Yuanrong backend during `tq.init()`. The deployment process:
Contributor

It would be better to tell readers that Yuanrong is deployed per host: one deployment manages all clients on the same node. Otherwise some users may mistakenly think the yr backend is per client.

**NPU Transfer Options:**
- `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors.
- `worker_args` (recommended when `enable_yr_npu_transport: true`):
- `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node NPU data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`).
Contributor

yr manages all the specified devices. If you want to set/get tensors on NPU x, you need to include device ID x in this argument.
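
The rule in this comment can be illustrated with a small check. A sketch only, assuming the argument value is the comma-separated string shown in the doc (e.g. `0,1,2,3`); both helper names are hypothetical:

```python
def parse_device_ids(value):
    """Parse a comma-separated ID string such as "0,1,2,3" into a set of ints."""
    return {int(part) for part in value.split(",") if part.strip()}


def device_is_managed(remote_h2d_device_ids, device_id):
    """True if device_id appears in the --remote_h2d_device_ids value."""
    return device_id in parse_device_ids(remote_h2d_device_ids)


if __name__ == "__main__":
    # NPU 5 is not listed, so tensors on it would not be covered.
    print(device_is_managed("0,1,2,3", 5))
```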


```bash
# On head node
ray start --head --resources='{"node:192.168.0.1": 1}'
```
Contributor

I remember haichuan saying that the node IP resource is not necessary. If that's true, this start command can be simplified.

Contributor Author

This is for controlling the placement of Ray actors.

TransferQueue will detect all Ray nodes and deploy datasystem workers automatically.

Once the configuration is set, you can run your TransferQueue + Datasystem application directly.
#### Multi-Node Demo
Contributor

Add a short line reminding users which lines must be modified (the node IPs) before giving them a big chunk of code.

If `worker_port` or `metastore_port` is already in use, initialization will fail:

```
RuntimeError: Failed to start datasystem worker...
```
Contributor

Is a port conflict the only possible reason the datasystem worker fails to start?
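
To separate a port conflict from other startup failures, the ports could be probed before initialization. A minimal sketch using only the standard library; `port_is_free` is a hypothetical helper, and the port to probe would be whatever `worker_port` or `metastore_port` is configured to:

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE when another process holds the port.
            return False
    return True


if __name__ == "__main__":
    # 31501 is the worker port used in the cleanup example of the doc.
    print("port 31501 free:", port_is_free(31501))
```

A free port at probe time does not guarantee a successful start, since another process can still take the port in between, so this only narrows down the cause.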

# Clean up
dscli stop --worker_address <IP>:31501
# Or force cleanup
pkill -f dscli
Contributor

Should this kill `dscli` or `datasystem_worker`?

```bash
pkill -f dscli
```

### Multi-Process Initialization
Contributor

Why is this an FAQ?

Collaborator

@KaisennHu KaisennHu left a comment

Overall looks good. Some minor comments.


**NPU Transfer Options:**
- `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors.
- `worker_args` (recommended when `enable_yr_npu_transport: true`):
Collaborator

When `enable_yr_npu_transport` is set to true, `remote_h2d_device_ids` is mandatory, not just recommended.
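
If the option is indeed mandatory, a validation step could enforce it. The sketch below assumes the backend config is available as a dict and that `worker_args` is a single string of command-line flags; both are assumptions about the config shape, and `validate_npu_transport` is a hypothetical helper:

```python
import shlex


def validate_npu_transport(config):
    """Raise ValueError if NPU transport is enabled without RH2D device IDs."""
    if not config.get("enable_yr_npu_transport", False):
        return
    tokens = shlex.split(config.get("worker_args") or "")
    for index, token in enumerate(tokens):
        # Accept both "--flag value" and "--flag=value" spellings.
        if token.startswith("--remote_h2d_device_ids="):
            if token.split("=", 1)[1]:
                return
        elif token == "--remote_h2d_device_ids":
            if index + 1 < len(tokens) and not tokens[index + 1].startswith("--"):
                return
    raise ValueError(
        "enable_yr_npu_transport is true but --remote_h2d_device_ids is missing"
    )


if __name__ == "__main__":
    validate_npu_transport(
        {"enable_yr_npu_transport": True, "worker_args": "--remote_h2d_device_ids 0,1"}
    )
    print("config accepted")
```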

Comment on lines +125 to +128
1. **Detects Ray cluster nodes** - identifies all alive nodes in the Ray cluster
2. **Creates placement group** - uses `STRICT_SPREAD` strategy to ensure workers are distributed across nodes
3. **Launches YuanrongWorkerActor** - creates one actor per node to manage the datasystem worker
4. **Sets up metastore service** - the head node (driver node) starts the metastore service, other nodes connect as workers
Collaborator

The '-' symbols look a bit odd here.

Comment on lines +165 to +169
# On head node
ray start --head --resources='{"node:192.168.0.1": 1}'

# On worker node (assume ray port of head_node is 6379)
ray start --address="192.168.0.1:6379" --resources='{"node:192.168.0.2": 1}'
Collaborator

To start Ray in an NPU environment, users need to be reminded to add `--resources='{"NPU": 4}'` or to configure `ASCEND_RT_VISIBLE_DEVICES`.
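
One way to combine the node resource from the quoted commands with the NPU resource suggested here is to generate the `ray start` command. A sketch only: merging both entries into a single `--resources` JSON object is an assumption about how the final doc would present it, and `build_ray_start_command` is a hypothetical helper.

```python
import json
import shlex


def build_ray_start_command(node_ip, npu_count, head_address=None):
    """Build a `ray start` command with node and NPU custom resources."""
    resources = {f"node:{node_ip}": 1, "NPU": npu_count}
    args = ["ray", "start"]
    if head_address is None:
        # No address given: this node starts the cluster as the head.
        args.append("--head")
    else:
        # Otherwise join the existing cluster at head_address (ip:port).
        args.append(f"--address={head_address}")
    args.append(f"--resources={json.dumps(resources)}")
    # shlex.join quotes the JSON so the command is safe to paste into a shell.
    return shlex.join(args)


if __name__ == "__main__":
    print(build_ray_start_command("192.168.0.1", 4))
    print(build_ray_start_command("192.168.0.2", 4, head_address="192.168.0.1:6379"))
```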

4 participants