[doc] Adjusted yuanrong backend doc #104
CLA Signature Pass: dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍
5e20c9d to
a80b636
Compare
Signed-off-by: dpj135 <[email protected]>
a80b636 to
e353874
Compare
Signed-off-by: dpj135 <[email protected]>
41f4a62 to
0c31647
Compare
| ## Quick Start | ||
|
|
||
| ### Prerequisites | ||
| - **Python Version**: $ \geq 3.10~and \leq 3.11 $ |
This line is not rendered correctly by Markdown. Just use `>= 3.10, <= 3.11`.
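If the doc also wants to show readers how to verify this prerequisite, a minimal sketch could look like the following. The helper name `is_supported_python` is hypothetical, and the bounds are simply the 3.10 to 3.11 range quoted in the line under review.

```python
import sys

# Supported range taken from the prerequisite under review: >= 3.10, <= 3.11.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 11)


def is_supported_python(version=None):
    """Return True if the (major, minor) pair falls inside the supported range.

    `version` is any sequence starting with (major, minor); it defaults to
    the running interpreter's sys.version_info.
    """
    if version is None:
        version = sys.version_info
    major_minor = (version[0], version[1])
    return MIN_VERSION <= major_minor <= MAX_VERSION
```

Calling `is_supported_python()` with no argument checks the interpreter that is currently running.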
| - `--shared_memory_size_mb`: Shared memory size in MB for datasystem worker. | ||
| - `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`). | ||
| - `--enable_huge_tlb`: Enable huge page memory, required for >21GB shared memory on Ascend 910B. | ||
| - `--enable_huge_tlb`: Configure huge page memory to reduce TLB misses and improve memory access efficiency. Note: may cause system memory shortage, kernel OOM, or system instability. Required for >21GB shared memory on Ascend 910B. |
I think it's better to remind users to allocate huge pages before starting datasystem. You could link to the datasystem huge page doc: https://pages.openeuler.openatom.cn/openyuanrong-datasystem/docs/zh-cn/latest/appendix/hugepage_guide.html
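As a possible companion to that reminder, here is a rough sketch of a pre-flight check that reads the standard Linux huge page counters from `/proc/meminfo`-style text. The function names are hypothetical, and whether datasystem draws from the default huge page pool reported there is an assumption; the linked guide remains the authority.

```python
import re


def parse_hugepages(meminfo_text):
    """Extract huge page counters from /proc/meminfo-style text.

    Missing fields are reported as 0.
    """
    fields = {}
    for key in ("HugePages_Total", "HugePages_Free", "Hugepagesize"):
        match = re.search(rf"^{key}:\s+(\d+)", meminfo_text, re.MULTILINE)
        fields[key] = int(match.group(1)) if match else 0
    return fields


def free_hugepage_mb(meminfo_text):
    """Free huge page memory in MB (Hugepagesize is reported in kB)."""
    fields = parse_hugepages(meminfo_text)
    return fields["HugePages_Free"] * fields["Hugepagesize"] // 1024
```

On a Linux host, `free_hugepage_mb(open("/proc/meminfo").read())` can be compared against the value passed to `--shared_memory_size_mb` before enabling `--enable_huge_tlb`.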
|
|
||
| Next, we will provide deployment and code examples for single-node scenarios. | ||
| For multi-node scenarios, please refer to [Appendix B](#B-deploy-multi-node-datasystem-for-multi-node-training-and-inference-scenarios). | ||
| When `auto_init: True` is set in the configuration, TransferQueue automatically initializes the Yuanrong backend during `tq.init()`. The deployment process: |
It's better to tell readers that Yuanrong is deployed per host: it manages all clients on the same node. Otherwise some users may mistakenly think the Yuanrong backend is per-client.
| **NPU Transfer Options:** | ||
| - `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors. | ||
| - `worker_args` (recommended when `enable_yr_npu_transport: true`): | ||
| - `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node NPU data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`). |
Yuanrong manages all the specified devices. If you want to set/get tensors on NPU x, you need to include device ID x in this argument.
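To illustrate the rule in this comment, a small sketch that checks whether every NPU an application intends to use is listed in the `--remote_h2d_device_ids` value. Both helper names are hypothetical, and the only format assumed is the comma-separated form shown in the doc (e.g., `0,1,2,3`).

```python
def parse_device_ids(value):
    """Parse a comma-separated device ID string such as "0,1,2,3"."""
    return [int(part) for part in value.split(",") if part.strip()]


def missing_devices(remote_h2d_device_ids, used_device_ids):
    """Return the device IDs in use that are absent from the flag value."""
    managed = set(parse_device_ids(remote_h2d_device_ids))
    return sorted(set(used_device_ids) - managed)
```

For example, `missing_devices("0,1", [1, 3])` reports device 3, which would need to be added to the argument before tensors on NPU 3 can be set or retrieved.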
|
|
||
| ```bash | ||
| # On head node | ||
| ray start --head --resources='{"node:192.168.0.1": 1}' |
I remember that haichuan said the node IP resource is not necessary. If that's true, this start command can be simplified.
This is for controlling the placement of Ray actors.
| TransferQueue will detect all Ray nodes and deploy datasystem workers automatically. | ||
|
|
||
| Once the configuration is set, you can run your TransferQueue + Datasystem application directly. | ||
| #### Multi-Node Demo |
Add a short line reminding users which lines must be modified (the node IPs) before giving them a big chunk of code.
| If `worker_port` or `metastore_port` is already in use, initialization will fail: | ||
|
|
||
| ``` | ||
| RuntimeError: Failed to start datasystem worker... |
Is a port conflict the only possible reason for failing to start the datasystem worker?
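One way the troubleshooting section could help readers tell a port conflict apart from other failures is a quick bind test before calling `tq.init()`. The sketch below uses only the standard `socket` module; the helper name `is_port_free` is hypothetical, and it checks TCP only.

```python
import socket


def is_port_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE when another process already holds the port.
            return False
    return True
```

If the check passes for both `worker_port` and `metastore_port` and the worker still fails to start, the cause is presumably something other than a port conflict.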
| # Clean up | ||
| dscli stop --worker_address <IP>:31501 | ||
| # Or force cleanup | ||
| pkill -f dscli |
Should this kill `dscli` or `datasystem_worker`?
| pkill -f dscli | ||
| ``` | ||
|
|
||
| ### Multi-Process Initialization |
KaisennHu left a comment
Overall looks good. A few minor comments.
|
|
||
| **NPU Transfer Options:** | ||
| - `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors. | ||
| - `worker_args` (recommended when `enable_yr_npu_transport: true`): |
When `enable_yr_npu_transport` is set to `true`, `remote_h2d_device_ids` is mandatory, not recommended.
| 1. **Detects Ray cluster nodes** - identifies all alive nodes in the Ray cluster | ||
| 2. **Creates placement group** - uses `STRICT_SPREAD` strategy to ensure workers are distributed across nodes | ||
| 3. **Launches YuanrongWorkerActor** - creates one actor per node to manage the datasystem worker | ||
| 4. **Sets up metastore service** - the head node (driver node) starts the metastore service, other nodes connect as workers |
The '-' separators look a bit strange.
| # On head node | ||
| ray start --head --resources='{"node:192.168.0.1": 1}' | ||
|
|
||
| # On worker node (assume ray port of head_node is 6379) | ||
| ray start --address="192.168.0.1:6379" --resources='{"node:192.168.0.2": 1}' |
To start Ray in an NPU environment, users need to be reminded to add `--resources='{"NPU": 4}'` or configure `ASCEND_RT_VISIBLE_DEVICES`.
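Since `ray start` takes a single `--resources` JSON object, the node resource from the doc and the NPU resource suggested here would have to be merged into one value. A sketch of a helper that builds such a command line follows; the function name and the NPU count are illustrative assumptions, while `--head`, `--address`, and `--resources` are existing `ray start` options.

```python
import json
import shlex


def ray_start_command(node_ip, npu_count, head_address=None):
    """Build a `ray start` command line declaring node and NPU resources.

    With head_address=None the command starts a head node; otherwise it
    joins the cluster at head_address (e.g. "192.168.0.1:6379").
    """
    resources = {f"node:{node_ip}": 1, "NPU": npu_count}
    args = ["ray", "start"]
    if head_address is None:
        args.append("--head")
    else:
        args.append(f"--address={head_address}")
    args.append(f"--resources={json.dumps(resources)}")
    # shlex.join quotes the JSON so it survives shell parsing.
    return shlex.join(args)
```

For instance, `ray_start_command("192.168.0.2", 4, head_address="192.168.0.1:6379")` yields a worker-node command that carries both resources in one flag.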
Description
I've updated the description in the Yuanrong backend documentation, adding more usage guidance.
Main changes
`transfer_queue.init()` to start `TransferQueue` & `Yuanrong`. `auto_init=False`.