
Conversation

@Potabk Potabk commented Sep 4, 2025

What this PR does / why we need it?

This PR adds multi-node CI integrated with Kubernetes; for now it runs on 2 × 16-card A3 nodes, testing DeepSeek multi-DP. The main implementation ideas:

  1. Use a GitHub Actions workflow as the entry point (a self-hosted CPU runner that controls the cluster).
  2. The CPU runner is authorized to create and delete cluster resources within a limited namespace, so launching the server is as simple as kubectl apply -f lws.yaml.
  3. The CPU runner can reach the cluster over the network; once the service is ready and routed via the Cluster-IP (the leader node's port 8080 is exposed as a service to the cluster), we can run benchmarks or other tests against it.
  4. How do the pods communicate? The env var LWS_LEADER_ADDRESS is injected before the pod starts, so worker pods can resolve the leader node's IP.
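The ideas above can be sketched as a minimal LeaderWorkerSet manifest. This is an illustrative fragment, not the manifest from this PR: the resource name, container names, and sizes are made up; the `LWS_LEADER_ADDRESS` injection is performed by the LWS controller, matching point 4.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-multi-node          # hypothetical name
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                      # one leader pod + one worker pod
    leaderTemplate:
      spec:
        containers:
          - name: vllm-leader
            ports:
              - containerPort: 8080   # exposed to the cluster via a Cluster-IP service
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            # The LWS controller injects LWS_LEADER_ADDRESS into the pod's
            # environment before start, so the worker can reach the leader
            # without extra wiring.
```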

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: wangli <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a multi-node continuous integration (CI) setup using Kubernetes and LeaderWorkerSet for testing vLLM on Ascend NPUs. The changes include new benchmark dependencies, an installer script, launch scripts for leader/worker nodes, and a Kubernetes manifest. While this is a valuable addition for ensuring multi-node stability, my review has identified several critical and high-severity issues. There are critical flaws in the CI logic that prevent it from testing the actual code changes in a pull request. Additionally, a key script for setting up a Ray cluster contains a bug that will cause it to hang. Other issues include hardcoded values that reduce script robustness and a missing readiness probe in the Kubernetes configuration that could lead to flaky tests. These issues should be addressed to make the new CI pipeline reliable and effective.

Comment on lines +20 to +33
checkout_src() {
    echo "====> Checkout source code"
    mkdir -p "$SRC_DIR"

    # vllm-ascend
    if [ ! -d "$SRC_DIR/vllm-ascend" ]; then
        git clone --depth 1 https://github.com/vllm-project/vllm-ascend.git "$SRC_DIR/vllm-ascend"
    fi

    # vllm
    if [ ! -d "$SRC_DIR/vllm" ]; then
        git clone -b v0.10.1.1 https://github.com/vllm-project/vllm.git "$SRC_DIR/vllm"
    fi
}

critical

The checkout_src function clones vllm-ascend from its main repository. When running in a CI environment for a pull request, this will ignore the changes from the PR and test against the main branch instead. The CI system should be responsible for checking out the correct version of the code, and these scripts should use that version. This defeats the purpose of running CI on pull requests.
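One way to address this is to prefer a CI-provided checkout over a fresh clone of main. The sketch below is a hedged illustration, not the PR's code: `WORKSPACE` is an assumed variable pointing at the directory where the CI runner checked out the pull request's sources.

```shell
#!/bin/bash
# Illustrative sketch: reuse the PR checkout (the code under test) when the
# CI provides one; fall back to cloning only for local/manual runs.
# WORKSPACE and the fallback paths are assumptions, not the PR's actual layout.
SRC_DIR="${SRC_DIR:-/tmp/src}"
WORKSPACE="${WORKSPACE:-}"

checkout_src() {
    echo "====> Checkout source code"
    mkdir -p "$SRC_DIR"

    # vllm-ascend: this is the code under test, so take it from the CI
    # workspace when available instead of cloning main.
    if [ -n "$WORKSPACE" ] && [ -d "$WORKSPACE/vllm-ascend" ]; then
        cp -r "$WORKSPACE/vllm-ascend" "$SRC_DIR/vllm-ascend"
    elif [ ! -d "$SRC_DIR/vllm-ascend" ]; then
        git clone --depth 1 https://github.com/vllm-project/vllm-ascend.git "$SRC_DIR/vllm-ascend"
    fi

    # vllm: a pinned release is fine here, it is not the code under test.
    if [ ! -d "$SRC_DIR/vllm" ]; then
        git clone -b v0.10.1.1 https://github.com/vllm-project/vllm.git "$SRC_DIR/vllm"
    fi
}
```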

Comment on lines +24 to +27
command:
- sh
- -c
- "bash /root/.cache/tests/multi_node_mp/launch_server_leader.sh"

critical

The container's command executes a script from /root/.cache, which appears to be a path on a persistent volume. This is a critical flaw in a CI setup, as it means the CI is running a cached version of the script, not the one from the pull request. Changes to the launch scripts in a PR will not be tested. The same issue exists for the worker template (lines 72-75). The command should execute the script from the CI workspace where the PR's source code is checked out.

            command:
              - sh
              - -c
              - "bash $WORKSPACE/tests/e2e/multi_nodes/multi_node_mp/launch_server_leader.sh"


# Retry until the worker node connects to the head node or the timeout expires.
for (( i=0; i < $ray_init_timeout; i+=5 )); do
    ray start --address=$ray_address:$ray_port --block "${start_params[@]}"

critical

The worker subcommand uses ray start --block, which will cause the script to hang. The --block flag prevents the command from returning until the process is manually terminated. This means the retry loop will never continue, and the script will not function as intended.

Suggested change
ray start --address=$ray_address:$ray_port --block "${start_params[@]}"
ray start --address=$ray_address:$ray_port "${start_params[@]}"
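A retry loop that can actually retry would start the worker without `--block` and then poll until it has joined the cluster or the timeout expires. This is a hedged sketch, not the PR's script: `wait_until` is a hypothetical helper, and the Ray variables in the usage comment are assumed from the surrounding context.

```shell
#!/bin/bash
# Hypothetical helper: run a command repeatedly until it succeeds or the
# timeout (in seconds) expires. Returns non-zero on timeout.
wait_until() {
    local timeout=$1; shift
    local elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 5
        elapsed=$((elapsed + 5))
    done
}

# Assumed usage for a Ray worker (variables from the script under review):
#   ray start --address=$ray_address:$ray_port "${start_params[@]}"
#   wait_until "$ray_init_timeout" ray status
```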

# local_ip is obtained via hostname -I (equivalently visible in ifconfig)
# nic_name is the network interface name corresponding to local_ip
local_ip=$(hostname -I | awk '{print $1}')
nic_name=eth0

high

The network interface name nic_name is hardcoded to eth0. This is not guaranteed to be correct in all environments, including Kubernetes pods, which can lead to script failures if the interface has a different name. It's better to determine the interface name dynamically.

Suggested change
nic_name=eth0
nic_name=$(ip -o addr show | awk -v ip="$local_ip" '/inet / && $4 ~ ip {print $2}')
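The suggested lookup can be sanity-checked against a canned line in the format `ip -o addr show` emits; the sample address and interface name below are made up for illustration.

```shell
#!/bin/bash
# Feed the suggested awk expression one sample line of `ip -o addr show`
# output and confirm it picks out the interface that owns local_ip.
local_ip="10.0.0.5"
sample="2: eth0    inet 10.0.0.5/24 brd 10.0.0.255 scope global eth0"
nic_name=$(printf '%s\n' "$sample" | awk -v ip="$local_ip" '/inet / && $4 ~ ip {print $2}')
echo "$nic_name"   # expected: eth0
```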

#!/bin/bash

local_ip=$(hostname -I | awk '{print $1}')
nic_name=eth0

high

The network interface name nic_name is hardcoded to eth0. This is brittle and will fail in environments where the interface has a different name. The interface name should be determined dynamically based on the local IP address.

Suggested change
nic_name=eth0
nic_name=$(ip -o addr show | awk -v ip="$local_ip" '/inet / && $4 ~ ip {print $2}')

Comment on lines +39 to +43
# readinessProbe:
# tcpSocket:
# port: 8080
# initialDelaySeconds: 15
# periodSeconds: 10

high

The readinessProbe for the leader container is commented out. Without it, Kubernetes cannot determine when the server is ready to accept requests. This can lead to flaky CI jobs if subsequent steps try to connect to the server before it's fully initialized. A readiness probe is essential for a reliable service.

            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 60
              periodSeconds: 10
              failureThreshold: 30

Signed-off-by: wangli <[email protected]>

github-actions bot commented Sep 4, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
