
Conversation

@Potabk Potabk commented Sep 4, 2025

What this PR does / why we need it?

This PR adds multi-node CI integrated with Kubernetes; for now it runs on 2 × 16-card A3 nodes, testing DeepSeek multi-DP. The main implementation ideas:

  1. Use a GitHub Actions workflow as the entry point (a self-hosted CPU runner that controls the cluster).
  2. The CPU runner is authorized to create and delete cluster resources within a limited namespace, so launching the server is as simple as kubectl apply -f lws.yaml.
  3. The CPU runner can reach the cluster over the network; once the service is ready and routed via the Cluster-IP (the leader node's port 8080 is exposed as a service to the cluster), we can run benchmarks or other tests against it.
  4. How do the pods communicate? The env var LWS_LEADER_ADDRESS is injected before the pod starts, so worker pods can resolve the leader node's IP.
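The ideas above can be sketched as a minimal LeaderWorkerSet manifest. This is an illustrative fragment, not the manifest from this PR: the resource name, container names, and sizes are made up; the `LWS_LEADER_ADDRESS` injection is performed by the LWS controller, matching point 4.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-multi-node          # hypothetical name
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                      # one leader pod + one worker pod
    leaderTemplate:
      spec:
        containers:
          - name: vllm-leader
            ports:
              - containerPort: 8080   # exposed to the cluster via a Cluster-IP service
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            # The LWS controller injects LWS_LEADER_ADDRESS into the pod's
            # environment before start, so the worker can reach the leader
            # without extra wiring.
```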

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: wangli <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a multi-node continuous integration (CI) setup using Kubernetes and LeaderWorkerSet for testing vLLM on Ascend NPUs. The changes include new benchmark dependencies, an installer script, launch scripts for leader/worker nodes, and a Kubernetes manifest. While this is a valuable addition for ensuring multi-node stability, my review has identified several critical and high-severity issues. There are critical flaws in the CI logic that prevent it from testing the actual code changes in a pull request. Additionally, a key script for setting up a Ray cluster contains a bug that will cause it to hang. Other issues include hardcoded values that reduce script robustness and a missing readiness probe in the Kubernetes configuration that could lead to flaky tests. These issues should be addressed to make the new CI pipeline reliable and effective.

Comment on lines +20 to +33
checkout_src() {
    echo "====> Checkout source code"
    mkdir -p "$SRC_DIR"

    # vllm-ascend
    if [ ! -d "$SRC_DIR/vllm-ascend" ]; then
        git clone --depth 1 https://github.com/vllm-project/vllm-ascend.git "$SRC_DIR/vllm-ascend"
    fi

    # vllm
    if [ ! -d "$SRC_DIR/vllm" ]; then
        git clone -b v0.10.1.1 https://github.com/vllm-project/vllm.git "$SRC_DIR/vllm"
    fi
}

critical

The checkout_src function clones vllm-ascend from its main repository. When running in a CI environment for a pull request, this will ignore the changes from the PR and test against the main branch instead. The CI system should be responsible for checking out the correct version of the code, and these scripts should use that version. This defeats the purpose of running CI on pull requests.
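One way to address this is to prefer a CI-provided checkout over a fresh clone of main. The sketch below is a hedged illustration, not the PR's code: `WORKSPACE` is an assumed variable pointing at the directory where the CI runner checked out the pull request's sources.

```shell
#!/bin/bash
# Illustrative sketch: reuse the PR checkout (the code under test) when the
# CI provides one; fall back to cloning only for local/manual runs.
# WORKSPACE and the fallback paths are assumptions, not the PR's actual layout.
SRC_DIR="${SRC_DIR:-/tmp/src}"
WORKSPACE="${WORKSPACE:-}"

checkout_src() {
    echo "====> Checkout source code"
    mkdir -p "$SRC_DIR"

    # vllm-ascend: this is the code under test, so take it from the CI
    # workspace when available instead of cloning main.
    if [ -n "$WORKSPACE" ] && [ -d "$WORKSPACE/vllm-ascend" ]; then
        cp -r "$WORKSPACE/vllm-ascend" "$SRC_DIR/vllm-ascend"
    elif [ ! -d "$SRC_DIR/vllm-ascend" ]; then
        git clone --depth 1 https://github.com/vllm-project/vllm-ascend.git "$SRC_DIR/vllm-ascend"
    fi

    # vllm: a pinned release is fine here, it is not the code under test.
    if [ ! -d "$SRC_DIR/vllm" ]; then
        git clone -b v0.10.1.1 https://github.com/vllm-project/vllm.git "$SRC_DIR/vllm"
    fi
}
```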

Comment on lines +24 to +27
command:
- sh
- -c
- "bash /root/.cache/tests/multi_node_mp/launch_server_leader.sh"

critical

The container's command executes a script from /root/.cache, which appears to be a path on a persistent volume. This is a critical flaw in a CI setup, as it means the CI is running a cached version of the script, not the one from the pull request. Changes to the launch scripts in a PR will not be tested. The same issue exists for the worker template (lines 72-75). The command should execute the script from the CI workspace where the PR's source code is checked out.

            command:
              - sh
              - -c
              - "bash $WORKSPACE/tests/e2e/multi_nodes/multi_node_mp/launch_server_leader.sh"


# Retry until the worker node connects to the head node or the timeout expires.
for (( i=0; i < $ray_init_timeout; i+=5 )); do
    ray start --address=$ray_address:$ray_port --block "${start_params[@]}"

critical

The worker subcommand uses ray start --block, which will cause the script to hang. The --block flag prevents the command from returning until the process is manually terminated. This means the retry loop will never continue, and the script will not function as intended.

Suggested change
ray start --address=$ray_address:$ray_port --block "${start_params[@]}"
ray start --address=$ray_address:$ray_port "${start_params[@]}"
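A retry loop that can actually retry would start the worker without `--block` and then poll until it has joined the cluster or the timeout expires. This is a hedged sketch, not the PR's script: `wait_until` is a hypothetical helper, and the Ray variables in the usage comment are assumed from the surrounding context.

```shell
#!/bin/bash
# Hypothetical helper: run a command repeatedly until it succeeds or the
# timeout (in seconds) expires. Returns non-zero on timeout.
wait_until() {
    local timeout=$1; shift
    local elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 5
        elapsed=$((elapsed + 5))
    done
}

# Assumed usage for a Ray worker (variables from the script under review):
#   ray start --address=$ray_address:$ray_port "${start_params[@]}"
#   wait_until "$ray_init_timeout" ray status
```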

# local_ip is obtained via hostname -I (equivalently visible in ifconfig)
# nic_name is the network interface name corresponding to local_ip
local_ip=$(hostname -I | awk '{print $1}')
nic_name=eth0

high

The network interface name nic_name is hardcoded to eth0. This is not guaranteed to be correct in all environments, including Kubernetes pods, which can lead to script failures if the interface has a different name. It's better to determine the interface name dynamically.

Suggested change
nic_name=eth0
nic_name=$(ip -o addr show | awk -v ip="$local_ip" '/inet / && $4 ~ ip {print $2}')
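The suggested lookup can be sanity-checked against a canned line in the format `ip -o addr show` emits; the sample address and interface name below are made up for illustration.

```shell
#!/bin/bash
# Feed the suggested awk expression one sample line of `ip -o addr show`
# output and confirm it picks out the interface that owns local_ip.
local_ip="10.0.0.5"
sample="2: eth0    inet 10.0.0.5/24 brd 10.0.0.255 scope global eth0"
nic_name=$(printf '%s\n' "$sample" | awk -v ip="$local_ip" '/inet / && $4 ~ ip {print $2}')
echo "$nic_name"   # expected: eth0
```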

#!/bin/bash

local_ip=$(hostname -I | awk '{print $1}')
nic_name=eth0

high

The network interface name nic_name is hardcoded to eth0. This is brittle and will fail in environments where the interface has a different name. The interface name should be determined dynamically based on the local IP address.

Suggested change
nic_name=eth0
nic_name=$(ip -o addr show | awk -v ip="$local_ip" '/inet / && $4 ~ ip {print $2}')

Comment on lines +39 to +43
# readinessProbe:
# tcpSocket:
# port: 8080
# initialDelaySeconds: 15
# periodSeconds: 10

high

The readinessProbe for the leader container is commented out. Without it, Kubernetes cannot determine when the server is ready to accept requests. This can lead to flaky CI jobs if subsequent steps try to connect to the server before it's fully initialized. A readiness probe is essential for a reliable service.

            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 60
              periodSeconds: 10
              failureThreshold: 30

Signed-off-by: wangli <[email protected]>

github-actions bot commented Sep 4, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
